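A previous post analyzed parallel acceleration of the 2D median filter.

Because the 2D median filter is control-intensive (it is dominated by sorting), SSE brought little speedup.

This time the subject is the bilateral filter, which is compute-intensive.

This post optimizes and analyzes the running time of a bilateral filter with a 5*5 kernel.

The terms main region, full image, main loop, and complete (initialization + main loop) are used below.

1.     Because the filter radius is 2 (2 pixels on each side of the center pixel), the border region around the image can no longer be ignored.

So the running times of main-region filtering and full-image filtering are compared below.

2.     The fast algorithm also uses a lookup-table (LUT) optimization, so the filter function is a stateful filter and the algorithm needs an initialization step.

The main-loop time and the time including initialization are compared below.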
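For reference, the weighted average that the filter code later in this post evaluates can be written as follows. The symbols are introduced here for illustration only: $I_0$ is the center pixel, $I_k$ the neighbor at offset $(i_k, j_k)$ inside the filter window, $\sigma_s$ corresponds to `sigmaSp` and $\sigma_v$ to `sigmaVal`. The two exponential factors are the quantities that the lookup tables precompute.

```latex
w_k = \exp\!\left(-\frac{i_k^2 + j_k^2}{2\sigma_s^2}\right)
      \exp\!\left(-\frac{(I_k - I_0)^2}{2\sigma_v^2}\right),
\qquad
\hat{I}_0 = \frac{I_0 + \sum_k w_k I_k}{1 + \sum_k w_k}
```

The center pixel enters with weight 1, which is why the code adds `val0` to the numerator and `1.f` to the denominator instead of looping over the center tap.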

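Summary first:

With otherwise identical versions, floating point turned out to be faster than integer. SSE parallelization of the floating-point version did not deliver a 4x speedup.

The current analysis is that the compiler's automatic optimization already sped up the floating-point arithmetic by 2-3x, to the point that it is slightly faster than the integer version.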

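Running time

Version                                  Time
Old integer, main region, main loop      8.766 ms
New integer, full image, main loop       6.324 ms
OpenCV float, full image, main loop      5.713 ms
New float, full image, main loop         5.301 ms
SSE float, full image, main loop         4.778 ms
omp float, full image, main loop         1.527 ms
SSE+omp float, full image, main loop     1.355 ms

Note that the detailed figures further down list 5.301 ms as the complete (initialization + main loop) full-image time of the new float version, whose main loop alone takes 5.198 ms, and 5.713 ms as the complete OpenCV time.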

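Parallel optimization analysis

1.     Integer bilateral filter

Running time

Version                                  Time
Old integer, main region, main loop      8.766 ms
New integer, full image, main loop       6.324 ms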

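2.     Integer and floating-point bilateral filter

Running time

Version                                  Time
New integer, full image, main loop       6.324 ms
New float, full image, main loop         5.301 ms
OpenCV float, full image, main loop      5.713 ms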

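3.     Floating-point and floating-point SSE bilateral filter: main region vs. full image

Version      Region         Main loop    Complete
Float        main region    5.002 ms     5.078 ms
Float        full image     5.198 ms     5.301 ms
Float SSE    main region    4.658 ms     4.764 ms
Float SSE    full image     4.778 ms     5.076 ms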

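4.     Floating-point SSE and omp bilateral filter

Version          Region         Main loop
Float            main region    5.002 ms
Float            full image     5.198 ms
Float SSE        main region    4.658 ms
Float SSE        full image     4.778 ms
Float omp        main region    1.434 ms
Float omp        full image     1.527 ms
Float SSE+omp    main region    1.285 ms
Float SSE+omp    full image     1.355 ms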

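I. Refactored the integer-optimized bilateral filter

The original version filtered over a rectangular window; it now filters over a circular one, which cuts the computation by nearly half.

Old integer bilateral filter, main region, main loop: 8.766 ms

New integer bilateral filter, full image, main loop: 6.324 ms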
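The saving from the circular window can be checked by counting taps. The sketch below uses hypothetical helper names and the condition i*i + j*j <= radius*radius, which for integer offsets is equivalent to the `sqrtf(...) <= radius` test in the code listing.

```cpp
// Number of neighbor taps (center excluded) inside a disk of the given radius.
int diskTaps(int radius)
{
    int count = 0;
    for (int i = -radius; i <= radius; ++i)
        for (int j = -radius; j <= radius; ++j)
            if ((i != 0 || j != 0) && i * i + j * j <= radius * radius)
                ++count;
    return count;
}

// Number of neighbor taps (center excluded) in the full square window.
int squareTaps(int radius)
{
    const int side = 2 * radius + 1;
    return side * side - 1;
}
```

For radius 2 (the 5*5 kernel) this gives 12 taps instead of 24; for radius 5 it gives 80 instead of 120. Because the disk is symmetric under 90-degree rotation, the tap count is always a multiple of 4, so an SSE loop that consumes 4 taps per iteration leaves no remainder.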

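II. Designed a floating-point optimized bilateral filter

Float bilateral filter, main region, main loop: 5.002 ms

Float bilateral filter, main region, complete: 5.078 ms

Float bilateral filter, full image, main loop: 5.198 ms

Float bilateral filter, full image, complete: 5.301 ms

OpenCV float bilateral filter, full image, complete: 5.713 ms

This shows that filtering the full image takes roughly 0.2 ms longer than filtering the main region only.

Algorithm initialization takes about 0.1 ms.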
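The initialization cost is mostly the construction of the lookup tables, which is why the class-based interface builds them once and reuses them for every frame. The following is a minimal sketch of that idea, assuming a hypothetical `RangeLut` type that is not part of the original code; it tabulates only the pixel-difference weight.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not from the original code): a range-weight lookup
// table that is built once and then reused for every frame.
struct RangeLut
{
    std::vector<float> weight;   // weight[d] for absolute difference d

    RangeLut(int length, float sigmaVal) : weight(length)
    {
        const float coeff = -0.5f / (sigmaVal * sigmaVal);
        for (int d = 0; d < length - 1; ++d)
            weight[d] = std::exp(static_cast<float>(d) * d * coeff);
        weight[length - 1] = 0.f;   // clamped differences contribute nothing
    }

    // Weight for a pair of pixel values; the index is clamped to the table.
    float operator()(int a, int b) const
    {
        const int d = a > b ? a - b : b - a;
        const int last = static_cast<int>(weight.size()) - 1;
        return weight[d < last ? d : last];
    }
};
```

A caller would construct `RangeLut lut(1024, sigmaVal);` once and then evaluate `lut(val, val0)` inside the per-pixel loop, so no `expf` call remains in the main loop.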

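III. Added an SSE-accelerated bilateral filter

Float SSE bilateral filter, main region, main loop: 4.658 ms

Float SSE bilateral filter, main region, complete: 4.764 ms

Float SSE bilateral filter, full image, main loop: 4.778 ms

Float SSE bilateral filter, full image, complete: 5.076 ms

This shows that filtering the full image takes roughly 0.1 ms longer than filtering the main region only.

Algorithm initialization takes about 0.2 ms (the two measurements give 0.1 ms for the main region and 0.3 ms for the full image).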
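One step of the SSE kernel in the code listing computes the clamped absolute difference |val - val0| for four neighbors at once and converts it to integer LUT indices. The sketch below isolates that step. It is an illustration, not the author's function: the name `clampedAbsDiff4` is hypothetical, it uses only SSE2 intrinsics (the listing additionally uses `_mm_hadd_ps`, which needs SSE3), and it falls back to scalar code on targets without SSE2.

```cpp
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Computes idx[n] = min(|val[n] - center|, maxIdx) for four pixels at once.
// maxIdx plays the role of BF_BUF_LEN - 1 in the listing.
void clampedAbsDiff4(const uint16_t val[4], uint16_t center, int maxIdx, int32_t idx[4])
{
#if HAVE_SSE2
    const __m128 v    = _mm_set_ps(val[3], val[2], val[1], val[0]);
    const __m128 c    = _mm_set1_ps(center);
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff)); // clears the sign bit
    const __m128 lim  = _mm_set1_ps(static_cast<float>(maxIdx));
    const __m128 d    = _mm_min_ps(_mm_and_ps(mask, _mm_sub_ps(v, c)), lim);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(idx), _mm_cvtps_epi32(d));
#else
    // Scalar fallback for targets without SSE2.
    for (int n = 0; n < 4; ++n) {
        const int d = val[n] > center ? val[n] - center : center - val[n];
        idx[n] = d < maxIdx ? d : maxIdx;
    }
#endif
}
```

Note that the four input pixels are gathered from scattered addresses and the four LUT reads that follow are ordinary scalar loads. That only the arithmetic in between is vectorized may be one reason the measured SSE gain stays far below 4x; this is an assumption, not something the measurements above isolate.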

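IV. Added an omp-accelerated bilateral filter

Float omp bilateral filter, main region, main loop: 1.434 ms

Float omp bilateral filter, full image, main loop: 1.527 ms

This shows that filtering the full image takes roughly 0.1 ms longer than filtering the main region only.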
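The omp version in the code listing differs from the serial one in two ways: the row pointers are computed from the row index instead of being advanced from the previous row, and all temporaries are declared inside the loop so that each thread gets its own copy. A minimal sketch of that pattern on a toy operation (the function `scaleRows` is hypothetical):

```cpp
#include <vector>

// Doubles every element of a row-major image, one row per loop iteration.
// When compiled without OpenMP support the pragma is ignored and the loop
// simply runs serially.
void scaleRows(const std::vector<int>& src, std::vector<int>& dst, int width, int height)
{
    #pragma omp parallel for
    for (int i = 0; i < height; ++i) {
        // Row pointers are derived from i, so iterations are independent.
        const int* psrc = src.data() + i * width;
        int*       pdst = dst.data() + i * width;
        for (int j = 0; j < width; ++j)
            pdst[j] = 2 * psrc[j];
    }
}
```

With GCC or Clang the pragma takes effect when the code is built with `-fopenmp`; with MSVC the corresponding switch is `/openmp`.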

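V. Added an SSE+omp-accelerated bilateral filter

Float SSE+omp bilateral filter, main region, main loop: 1.285 ms

Float SSE+omp bilateral filter, full image, main loop: 1.355 ms

This shows that filtering the full image takes roughly 0.1 ms longer than filtering the main region only.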

The detailed timings of each variant are listed below.

-  opencv full 5.713ms

-  mainBody mainLoop 5.002ms

-  mainBody 5.078ms

-  Full mainLoop 5.198ms

-  Full 5.301ms

-  mainBody sse mainLoop 4.658ms

-  mainBody sse 4.764ms

-  Full sse mainLoop 4.778ms

-  Full sse 5.076ms

-  mainBody omp mainLoop 1.434ms

-  Full omp mainLoop 1.527ms

-  mainBody sse_omp mainLoop 1.285ms

-  Full sse_omp mainLoop 1.355ms

-  Int Full mainLoop 6.324ms

-  INT old version mainBody mainLoop 8.766ms

The code is as follows.

Macro definitions

Because the integer bilateral filter risks overflow in its running sums, the filter diameter is limited to 11 (radius 5).

#define MALLOC               malloc
#define FREE(p)              if (p != NULL) { free(p); p = NULL;}
#define ALIGN16              __declspec(align(16))
#define ALIGN_MALLOC16(n)    _aligned_malloc(n, 16)
#define ALIGN_MALLOC32(n)    _aligned_malloc(n, 32)
#define ALIGN_MALLOC64(n)    _aligned_malloc(n, 64)
#define ALIGN_MALLOC128(n)   _aligned_malloc(n, 128)
#define ALIGN_FREE(p)        if (p != NULL) { _aligned_free(p); p = NULL;}
#define BF_INT_BITS      (10)
#define BF_INT_SCALE     (1 << BF_INT_BITS)
#define BF_INT_SHIFT     ((S32)(((BF_INT_BITS + 1) << 1) - 31 + 16) + 7) //more 7 bits[>11*11]
#define BF_INT_BITS2     ((S32)((BF_INT_BITS << 1) - BF_INT_SHIFT))
#define BF_INT_SCALE2    (1 << BF_INT_BITS2)
#define BF_BUF_LEN     (1024)
#define EDGEPRES_R_MAX   (5)
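The bit budget behind these macros can be verified at compile time. The sketch below mirrors the macro arithmetic with `constexpr` values (the `k`-prefixed names and `worstCaseSum` are illustrative, and `S32` is assumed to be a 32-bit signed integer). It checks that the accumulator cannot overflow for an 11*11 window with 16-bit pixels.

```cpp
#include <cstdint>

// Constants mirrored from the macros above.
constexpr int kBits   = 10;                                   // BF_INT_BITS
constexpr int kShift  = (((kBits + 1) << 1) - 31 + 16) + 7;   // BF_INT_SHIFT
constexpr int kBits2  = (kBits << 1) - kShift;                // BF_INT_BITS2
constexpr int kScale2 = 1 << kBits2;                          // BF_INT_SCALE2

// Largest value the integer accumulator `sum` can reach for a window with
// `taps` pixels (center included), 16-bit pixels and maximal weights.
constexpr int64_t worstCaseSum(int taps)
{
    return static_cast<int64_t>(taps) * 65535 * kScale2;
}

static_assert(kShift == 14, "shift applied to the product of two weights");
static_assert(kBits2 == 6, "fractional bits left after the shift");
static_assert(worstCaseSum(11 * 11) < INT32_MAX, "an 11*11 window fits in S32");
```

The 7 extra bits named in the macro comment leave room for up to 128 taps, which covers an 11*11 window (121 taps) but not a 13*13 one (169 taps); that matches the radius limit of 5.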

Class definitions

typedef class edgePresFiltMain
{
public:
    edgePresFiltMain(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp);
    ~edgePresFiltMain();
    void edgePreserve_mainBody(U16 *src, U16 *dst);
    void edgePreserve_mainBody_omp(U16 *src, U16 *dst);
    void edgePreserve_mainBody_sse(U16 *src, U16 *dst);
    void edgePreserve_mainBody_sse_omp(U16 *src, U16 *dst);
    void *hdl;
}edgePresFiltMain_;

typedef class edgePresFilt
{
public:
    edgePresFilt(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp);
    ~edgePresFilt();
    void edgePreserve(U16 *src, U16 *dst);
    void edgePreserve_omp(U16 *src, U16 *dst);
    void edgePreserve_sse(U16 *src, U16 *dst);
    void edgePreserve_sse_omp(U16 *src, U16 *dst);
    void *hdl;
}edgePresFilt_;

typedef class edgePresFiltInt
{
public:
    edgePresFiltInt(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp);
    ~edgePresFiltInt();
    void edgePreserve(U16 *src, U16 *dst);
    void edgePreserve_omp(U16 *src, U16 *dst);
    void *hdl;
}edgePresFiltInt_;

Plain C function declarations

// no smooth on the border
void edgePreserve_mainBody(U16 *src, U16 *dst, S32 width, S32 height,
                           S32 radius, F32 sigmaVal, F32 sigmaSp);
// smooth on the border
void edgePreserve(U16 *src, U16 *dst, S32 width, S32 height,
                  S32 radius, F32 sigmaVal, F32 sigmaSp);
// no smooth on the border
void edgePreserve_mainBody_sse(U16 *src, U16 *dst, S32 width, S32 height,
                               S32 radius, F32 sigmaVal, F32 sigmaSp);
// smooth on the border
void edgePreserve_sse(U16 *src, U16 *dst, S32 width, S32 height,
                      S32 radius, F32 sigmaVal, F32 sigmaSp);
// smooth on the border
void edgePreserveInt(U16 *src, U16 *dst, S32 width, S32 height,
                     S32 radius, F32 sigmaVal, F32 sigmaSp);

Border handling

// Copy src into the center of dst (which is 2 * radius larger in both
// dimensions) and fill the border by mirroring.
static void borderReflect(U16 *src, S32 width, S32 height, U16 *dst, S32 radius)
{
    S32 i = 0;
    S32 j = 0;
    S32 itmp1 = radius - 1;
    S32 itmp2 = -1 - radius;
    S32 width2 = width + (radius << 1);
    U16 *psrc = src;
    U16 *pdst = dst + width2 * radius;

    // left border, row copy, right border
    for (i = 0; i < height; i++)
    {
        for (j = 0; j < radius; j++)
        {
            pdst[j] = psrc[itmp1 - j];
        }
        memcpy(pdst + radius, psrc, sizeof(U16) * width);
        psrc += width;
        pdst += width2;
        for (j = -1; j >= -radius; j--)
        {
            pdst[j] = psrc[itmp2 - j];
        }
    }

    // bottom border
    psrc = pdst - width2;
    for (i = 0; i < radius; i++)
    {
        memcpy(pdst, psrc, sizeof(U16) * width2);
        psrc -= width2;
        pdst += width2;
    }

    // top border
    psrc = dst + width2 * radius;
    pdst = dst + width2 * (radius - 1);
    for (i = 0; i < radius; i++)
    {
        memcpy(pdst, psrc, sizeof(U16) * width2);
        psrc += width2;
        pdst -= width2;
    }
}

// Copy only the unfiltered border (radius pixels on every side) from src to dst.
static void borderCopy(U16 *src, U16 *dst, S32 width, S32 height, S32 radius)
{
    S32 i = 0;
    S32 j = 0;
    S32 xend = height - (radius << 1);
    U16 *psrc = src;
    U16 *pdst = dst;

    memcpy(pdst, psrc, sizeof(U16) * (width * radius - radius));
    psrc += radius * width;
    pdst += radius * width;
    for (i = 0; i < xend; i++)
    {
        memcpy(pdst - radius, psrc - radius, sizeof(U16) * (radius << 1));
        psrc += width;
        pdst += width;
    }
    memcpy(pdst - radius, psrc - radius, sizeof(U16) * (width * radius + radius));
}
// Build the spatial offset/weight tables for the circular window and the
// value-difference weight table.
static void edgePreserve_LUT(S32 radius, S32 width, F32 sigmaVal, F32 sigmaSp,
                             S32 *spOfs, F32 *spWt, F32 *valWt)
{
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    F32 sigmaValCoeff = -0.5f / (sigmaVal * sigmaVal);
    F32 sigmaSpCoeff = -0.5f / (sigmaSp * sigmaSp);

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                if ((i == 0) && (j == 0))
                {
                    continue;
                }
                spOfs[k] = i * width + j;
                spWt[k] = expf((i * i + j * j) * sigmaSpCoeff);
                k++;
            }
        }
    }

    for (i = 0; i < BF_BUF_LEN - 1; i++)
    {
        valWt[i] = expf(i * i * sigmaValCoeff);
    }
    valWt[i] = 0.f;
}
// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
                                          U16 *dst, S32 dstStep, S32 radius,
                                          S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    S32 n = 0;
    U16 val0 = 0;
    U16 val = 0;
    S32 tmp = 0;
    F32 w = 0.f;
    F32 sum = 0.f;
    F32 wsum = 0.f;
    U16 *psrc = src + radius * srcStep;
    U16 *pdst = dst;

    for (i = 0; i < dstHeight; i++)
    {
        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j];
            sum = 0.f;
            wsum = 0.f;
            for (k = 0; k < maxk; k++)
            {
                val = psrc[j + spOfs[k]];
                tmp = val - val0;
                tmp = ABS(tmp);
                tmp = MIN(tmp, BF_BUF_LEN - 1);
                w = spWt[k] * valWt[tmp];
                sum += val * w;
                wsum += w;
            }
            // the center pixel contributes with weight 1
            pdst[j - radius] = (U16)((sum + val0) / (wsum + 1.f));
        }
        psrc += srcStep;
        pdst += dstStep;
    }
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_omp_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
                                              U16 *dst, S32 dstStep, S32 radius,
                                              S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    S32 i = 0;

#pragma omp parallel for
    for (i = 0; i < dstHeight; i++)
    {
        // per-thread temporaries; row pointers are derived from i
        S32 j = 0;
        S32 k = 0;
        U16 val0 = 0;
        U16 val = 0;
        S32 tmp = 0;
        F32 w = 0.f;
        F32 sum = 0.f;
        F32 wsum = 0.f;
        U16 *psrc = src + (radius + i) * srcStep;
        U16 *pdst = dst + i * dstStep;

        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j];
            sum = 0.f;
            wsum = 0.f;
            for (k = 0; k < maxk; k++)
            {
                val = psrc[j + spOfs[k]];
                tmp = val - val0;
                tmp = ABS(tmp);
                tmp = MIN(tmp, BF_BUF_LEN - 1);
                w = spWt[k] * valWt[tmp];
                sum += val * w;
                wsum += w;
            }
            pdst[j - radius] = (U16)((sum + val0) / (wsum + 1.f));
        }
    }
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_sse_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
                                              U16 *dst, S32 dstStep, S32 radius,
                                              S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    const U32 ALIGN16 bufSignMask[] = { 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff };
    const F32 ALIGN16 bufLutLen[] = { BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1 };
    S32 ALIGN16 buf[4];
    F32 ALIGN16 bufSum[4];
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    F32 val0 = 0.f;
    F32 sum = 0.f;
    F32 wsum = 0.f;
    U16 *psrc = src + radius * srcStep;
    U16 *pdst = dst;
    __m128 _val;
    __m128 _val0;
    __m128 _idx;
    __m128 _psum;
    __m128 _sw;
    __m128 _cw;
    __m128 _w;
    const __m128 _signMask = _mm_load_ps((const float*)bufSignMask);
    const __m128 _lutLen = _mm_load_ps((const float*)bufLutLen);

    for (i = 0; i < dstHeight; i++)
    {
        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j] * 1.0f;
            _psum = _mm_setzero_ps();
            _val0 = _mm_set1_ps(val0);
            for (k = 0; k <= maxk - 4; k += 4)
            {
                _val = _mm_set_ps(psrc[j + spOfs[k + 3]], psrc[j + spOfs[k + 2]],
                                  psrc[j + spOfs[k + 1]], psrc[j + spOfs[k]]);
                // _sw = _mm_loadu_ps(spWt + k);
                _sw = _mm_load_ps(spWt + k);
                // |val - val0|, clamped to the LUT length, as integer indices
                _idx = _mm_and_ps(_signMask, _mm_sub_ps(_val, _val0));
                _mm_store_si128((__m128i*)buf, _mm_cvtps_epi32(_mm_min_ps(_idx, _lutLen)));
                _cw = _mm_set_ps(valWt[buf[3]], valWt[buf[2]], valWt[buf[1]], valWt[buf[0]]);
                _w = _mm_mul_ps(_cw, _sw);
                _val = _mm_mul_ps(_w, _val);
                // two horizontal adds leave sum(w) in lane 0 and sum(w * val) in lane 1
                _sw = _mm_hadd_ps(_w, _val);
                _sw = _mm_hadd_ps(_sw, _sw);
                _psum = _mm_add_ps(_sw, _psum);
            }
            _mm_storel_pi((__m64*)bufSum, _psum);
            sum = bufSum[1] + val0;
            wsum = bufSum[0] + 1.f;
            pdst[j - radius] = (U16)(sum / wsum);
        }
        psrc += srcStep;
        pdst += dstStep;
    }
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserve_mainBody_sse_omp_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
                                                  U16 *dst, S32 dstStep, S32 radius,
                                                  S32 *spOfs, F32 *spWt, F32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    const U32 ALIGN16 bufSignMask[] = { 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff };
    const F32 ALIGN16 bufLutLen[] = { BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1, BF_BUF_LEN - 1 };
    const __m128 _signMask = _mm_load_ps((const float*)bufSignMask);
    const __m128 _lutLen = _mm_load_ps((const float*)bufLutLen);
    S32 i = 0;

#pragma omp parallel for
    for (i = 0; i < dstHeight; i++)
    {
        // per-thread scratch buffers and temporaries
        S32 ALIGN16 buf[4];
        F32 ALIGN16 bufSum[4];
        S32 j = 0;
        S32 k = 0;
        F32 val0 = 0.f;
        F32 sum = 0.f;
        F32 wsum = 0.f;
        U16 *psrc = src + (radius + i) * srcStep;
        U16 *pdst = dst + i * dstStep;
        __m128 _val;
        __m128 _val0;
        __m128 _idx;
        __m128 _psum;
        __m128 _sw;
        __m128 _cw;
        __m128 _w;

        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j] * 1.0f;
            _psum = _mm_setzero_ps();
            _val0 = _mm_set1_ps(val0);
            for (k = 0; k <= maxk - 4; k += 4)
            {
                _val = _mm_set_ps(psrc[j + spOfs[k + 3]], psrc[j + spOfs[k + 2]],
                                  psrc[j + spOfs[k + 1]], psrc[j + spOfs[k]]);
                // _sw = _mm_loadu_ps(spWt + k);
                _sw = _mm_load_ps(spWt + k);
                _idx = _mm_and_ps(_signMask, _mm_sub_ps(_val, _val0));
                _mm_store_si128((__m128i*)buf, _mm_cvtps_epi32(_mm_min_ps(_idx, _lutLen)));
                _cw = _mm_set_ps(valWt[buf[3]], valWt[buf[2]], valWt[buf[1]], valWt[buf[0]]);
                _w = _mm_mul_ps(_cw, _sw);
                _val = _mm_mul_ps(_w, _val);
                _sw = _mm_hadd_ps(_w, _val);
                _sw = _mm_hadd_ps(_sw, _sw);
                _psum = _mm_add_ps(_sw, _psum);
            }
            _mm_storel_pi((__m64*)bufSum, _psum);
            sum = bufSum[1] + val0;
            wsum = bufSum[0] + 1.f;
            pdst[j - radius] = (U16)(sum / wsum);
        }
    }
}

// no smooth on the border
void edgePreserve_mainBody(U16 *src, U16 *dst, S32 width, S32 height,
                           S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;

    // count the taps inside the circular window
    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width, sigmaVal, sigmaSp, spOfs, spWt, valWt);

    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_process(src, width, height, width,
                                  dst + width * radius + radius, width,
                                  radius, spOfs, spWt, valWt, maxk);

    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
}

// smooth on the border
void edgePreserve(U16 *src, U16 *dst, S32 width, S32 height,
                  S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;
    U16 *buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);

    borderReflect(src, width, height, buf, radius);

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width2, sigmaVal, sigmaSp, spOfs, spWt, valWt);

    edgePreserve_mainBody_process(buf, width2, height2, width2,
                                  dst, width, radius, spOfs, spWt, valWt, maxk);

    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
    FREE(buf);
}

// no smooth on the border
void edgePreserve_mainBody_sse(U16 *src, U16 *dst, S32 width, S32 height,
                               S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width, sigmaVal, sigmaSp, spOfs, spWt, valWt);

    borderCopy(src, dst, width, height, radius);
    // edgePreserve_noBorder_sse_mainloop(src, dst, width, height, radius, spOfs, spWt, valWt, maxk);
    edgePreserve_mainBody_sse_process(src, width, height, width,
                                      dst + width * radius + radius, width,
                                      radius, spOfs, spWt, valWt, maxk);

    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
}

// smooth on the border
void edgePreserve_sse(U16 *src, U16 *dst, S32 width, S32 height,
                      S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    S32 *spOfs = NULL;
    F32 *spWt = NULL;
    F32 *valWt = NULL;
    U16 *buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);

    borderReflect(src, width, height, buf, radius);

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width2, sigmaVal, sigmaSp, spOfs, spWt, valWt);

    // edgePreserve_sse_mainloop(buf, dst, width, height, radius, spOfs, spWt, valWt, maxk);
    edgePreserve_mainBody_sse_process(buf, width2, height2, width2,
                                      dst, width, radius, spOfs, spWt, valWt, maxk);

    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
    FREE(buf);
}

typedef struct EDGE_PRES_FILT_HDL
{
    S32 maxk;
    S32 width;
    S32 height;
    S32 radius;
    S32 *spOfs;
    F32 *spWt;
    F32 *valWt;
    U16 *buf;
}EDGE_PRES_FILT_HDL_;

edgePresFiltMain::edgePresFiltMain(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    EDGE_PRES_FILT_HDL *pHdl = NULL;

    pHdl = new EDGE_PRES_FILT_HDL;
    hdl = pHdl;

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    pHdl->maxk = maxk;
    pHdl->width = width;
    pHdl->height = height;
    pHdl->radius = radius;
    pHdl->spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    pHdl->spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    pHdl->valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    edgePreserve_LUT(radius, width, sigmaVal, sigmaSp, pHdl->spOfs, pHdl->spWt, pHdl->valWt);
}

edgePresFiltMain::~edgePresFiltMain()
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;

    ALIGN_FREE(pHdl->spWt);
    FREE(pHdl->spOfs);
    FREE(pHdl->valWt);
    // delete through the typed pointer; deleting a void* is undefined behavior
    delete pHdl;
}

void edgePresFiltMain::edgePreserve_mainBody(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;

    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_process(src, width, height, width,
                                  dst + width * radius + radius, width,
                                  radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFiltMain::edgePreserve_mainBody_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;

    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_omp_process(src, width, height, width,
                                      dst + width * radius + radius, width,
                                      radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFiltMain::edgePreserve_mainBody_sse(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;

    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_sse_process(src, width, height, width,
                                      dst + width * radius + radius, width,
                                      radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFiltMain::edgePreserve_mainBody_sse_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;

    borderCopy(src, dst, width, height, radius);
    edgePreserve_mainBody_sse_omp_process(src, width, height, width,
                                          dst + width * radius + radius, width,
                                          radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

edgePresFilt::edgePresFilt(S32 width, S32 height, S32 radius, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    EDGE_PRES_FILT_HDL *pHdl = NULL;

    pHdl = new EDGE_PRES_FILT_HDL;
    hdl = pHdl;

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    pHdl->maxk = maxk;
    pHdl->width = width;
    pHdl->height = height;
    pHdl->radius = radius;
    pHdl->spWt = (F32 *)ALIGN_MALLOC16(sizeof(F32) * maxk);
    pHdl->spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    pHdl->valWt = (F32 *)MALLOC(sizeof(F32) * BF_BUF_LEN);
    pHdl->buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);
    edgePreserve_LUT(radius, width2, sigmaVal, sigmaSp, pHdl->spOfs, pHdl->spWt, pHdl->valWt);
}

edgePresFilt::~edgePresFilt()
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;

    ALIGN_FREE(pHdl->spWt);
    FREE(pHdl->spOfs);
    FREE(pHdl->valWt);
    FREE(pHdl->buf);
    // delete through the typed pointer; deleting a void* is undefined behavior
    delete pHdl;
}

void edgePresFilt::edgePreserve(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);

    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_process(pHdl->buf, width2, height2, width2,
                                  dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFilt::edgePreserve_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);

    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_omp_process(pHdl->buf, width2, height2, width2,
                                      dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFilt::edgePreserve_sse(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);

    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_sse_process(pHdl->buf, width2, height2, width2,
                                      dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

void edgePresFilt::edgePreserve_sse_omp(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_HDL *pHdl = (EDGE_PRES_FILT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);

    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserve_mainBody_sse_omp_process(pHdl->buf, width2, height2, width2,
                                          dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}

//
static void edgePreserveInt_LUT(S32 radius, S32 width, F32 sigmaVal, F32 sigmaSp,
                                S32 *spOfs, S32 *spWt, S32 *valWt)
{
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    F32 sigmaValCoeff = -0.5f / (sigmaVal * sigmaVal);
    F32 sigmaSpCoeff = -0.5f / (sigmaSp * sigmaSp);

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                if ((i == 0) && (j == 0))
                {
                    continue;
                }
                spOfs[k] = i * width + j;
                // weights are stored as fixed point with BF_INT_BITS fractional bits
                spWt[k] = (S32)(expf((i * i + j * j) * sigmaSpCoeff) * BF_INT_SCALE);
                k++;
            }
        }
    }

    for (i = 0; i < BF_BUF_LEN - 1; i++)
    {
        valWt[i] = (S32)(expf(i * i * sigmaValCoeff) * BF_INT_SCALE);
    }
    valWt[i] = 0;
}

// Smooth main body of src to the dst img
// src img is bigger than dst img
static void edgePreserveInt_mainBody_process(U16 *src, S32 srcWidth, S32 srcHeight, S32 srcStep,
                                             U16 *dst, S32 dstStep, S32 radius,
                                             S32 *spOfs, S32 *spWt, S32 *valWt, S32 maxk)
{
    const S32 xEnd = srcWidth - radius;
    const S32 dstHeight = srcHeight - (radius << 1);
    S32 i = 0;
    S32 j = 0;
    S32 k = 0;
    S32 n = 0;
    U16 val0 = 0;
    U16 val = 0;
    S32 tmp = 0;
    S32 w = 0;
    S32 sum = 0;
    S32 wsum = 0;
    U16 *psrc = src + radius * srcStep;
    U16 *pdst = dst;

    for (i = 0; i < dstHeight; i++)
    {
        for (j = radius; j < xEnd; j++)
        {
            val0 = psrc[j];
            sum = 0;
            wsum = 0;
            for (k = 0; k < maxk; k++)
            {
                val = psrc[j + spOfs[k]];
                tmp = val - val0;
                tmp = ABS(tmp);
                tmp = MIN(tmp, BF_BUF_LEN - 1);
                w = spWt[k] * valWt[tmp];
                w >>= BF_INT_SHIFT;    // keep BF_INT_BITS2 fractional bits
                wsum += w;
                sum += (val * w);
            }
            pdst[j - radius] = (U16)((sum + (val0 << BF_INT_BITS2)) / (wsum + BF_INT_SCALE2));
        }
        psrc += srcStep;
        pdst += dstStep;
    }
}

// smooth on the border
void edgePreserveInt(U16 *src, U16 *dst, S32 width, S32 height,
                     S32 radius_, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    // GP_MIN and GP_EDGEPRES_R_MAX are not defined in this post; presumably
    // they are the project's equivalents of MIN and EDGEPRES_R_MAX
    S32 radius = GP_MIN(radius_, GP_EDGEPRES_R_MAX);
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    S32 *spOfs = NULL;
    S32 *spWt = NULL;
    S32 *valWt = NULL;
    U16 *buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);

    borderReflect(src, width, height, buf, radius);

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    spWt = (S32 *)ALIGN_MALLOC16(sizeof(S32) * maxk);
    spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    valWt = (S32 *)MALLOC(sizeof(S32) * BF_BUF_LEN);
    edgePreserveInt_LUT(radius, width2, sigmaVal, sigmaSp, spOfs, spWt, valWt);

    edgePreserveInt_mainBody_process(buf, width2, height2, width2,
                                     dst, width, radius, spOfs, spWt, valWt, maxk);

    ALIGN_FREE(spWt);
    FREE(spOfs);
    FREE(valWt);
    FREE(buf);
}

typedef struct EDGE_PRES_FILT_INT_HDL
{
    S32 maxk;
    S32 width;
    S32 height;
    S32 radius;
    S32 *spOfs;
    S32 *spWt;
    S32 *valWt;
    U16 *buf;
}EDGE_PRES_FILT_INT_HDL_;

edgePresFiltInt::edgePresFiltInt(S32 width, S32 height, S32 radius_, F32 sigmaVal, F32 sigmaSp)
{
    S32 i = 0;
    S32 j = 0;
    S32 maxk = -1;    //exclude the center
    // GP_MIN and GP_EDGEPRES_R_MAX are not defined in this post; presumably
    // they are the project's equivalents of MIN and EDGEPRES_R_MAX
    S32 radius = GP_MIN(radius_, GP_EDGEPRES_R_MAX);
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);
    EDGE_PRES_FILT_INT_HDL *pHdl = NULL;

    pHdl = new EDGE_PRES_FILT_INT_HDL;
    hdl = pHdl;

    for (i = -radius; i <= radius; i++)
    {
        for (j = -radius; j <= radius; j++)
        {
            if (sqrtf((i * i + j * j) * 1.f) <= radius)
            {
                maxk++;
            }
        }
    }

    pHdl->maxk = maxk;
    pHdl->width = width;
    pHdl->height = height;
    pHdl->radius = radius;
    pHdl->spWt = (S32 *)ALIGN_MALLOC16(sizeof(S32) * maxk);
    pHdl->spOfs = (S32 *)MALLOC(sizeof(S32) * maxk);
    pHdl->valWt = (S32 *)MALLOC(sizeof(S32) * BF_BUF_LEN);
    pHdl->buf = (U16 *)MALLOC(sizeof(U16) * width2 * height2);
    edgePreserveInt_LUT(radius, width2, sigmaVal, sigmaSp, pHdl->spOfs, pHdl->spWt, pHdl->valWt);
}

edgePresFiltInt::~edgePresFiltInt()
{
    EDGE_PRES_FILT_INT_HDL *pHdl = (EDGE_PRES_FILT_INT_HDL *)hdl;

    ALIGN_FREE(pHdl->spWt);
    FREE(pHdl->spOfs);
    FREE(pHdl->valWt);
    FREE(pHdl->buf);
    // delete through the typed pointer; deleting a void* is undefined behavior
    delete pHdl;
}

void edgePresFiltInt::edgePreserve(U16 *src, U16 *dst)
{
    EDGE_PRES_FILT_INT_HDL *pHdl = (EDGE_PRES_FILT_INT_HDL *)hdl;
    S32 width = pHdl->width;
    S32 height = pHdl->height;
    S32 radius = pHdl->radius;
    S32 width2 = width + (radius << 1);
    S32 height2 = height + (radius << 1);

    borderReflect(src, width, height, pHdl->buf, radius);
    edgePreserveInt_mainBody_process(pHdl->buf, width2, height2, width2,
                                     dst, width, radius, pHdl->spOfs, pHdl->spWt, pHdl->valWt, pHdl->maxk);
}
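
The mirror padding that `borderReflect` applies before full-image filtering is easier to see on a single row. The sketch below is a simplified one-dimensional illustration with a hypothetical function name, using `std::vector` instead of the raw pointers of the listing; it repeats the edge sample, matching the index arithmetic of the listing.

```cpp
#include <cstdint>
#include <vector>

// Pads one row with `radius` mirrored samples on each side, repeating the
// edge sample (e.g. "2 1 | 1 2 3 4 | 4 3" for radius 2).
// Assumes 0 <= radius <= row.size().
std::vector<uint16_t> reflectPadRow(const std::vector<uint16_t>& row, int radius)
{
    const int width = static_cast<int>(row.size());
    std::vector<uint16_t> out(width + 2 * radius);
    for (int j = 0; j < radius; ++j) {
        out[j] = row[radius - 1 - j];                  // left border
        out[width + radius + j] = row[width - 1 - j];  // right border
    }
    for (int j = 0; j < width; ++j)
        out[radius + j] = row[j];                      // interior copy
    return out;
}
```

With this padding every pixel of the original image has a full window, so the full-image variants can reuse the main-region kernel unchanged on the enlarged buffer.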
