[FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
徐鋆
xujunzz at sjtu.edu.cn
Thu Dec 5 07:49:41 EET 2019
Hi, chen
----- Original Message -----
> From: "chen" <chenm003 at 163.com>
> To: "FFmpeg development discussions and patches" <ffmpeg-devel at ffmpeg.org>
> Sent: Tuesday, December 3, 2019, 4:59:06 PM
> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
> comments inline in code
>
>
> At 2019-12-03 15:52:07, xujunzz at sjtu.edu.cn wrote:
>>From: Xu Jun <xujunzz at sjtu.edu.cn>
>>
>>+; void filter_column(uint8_t *dst, int height,
>>+; float rdiv, float bias, const int *const matrix,
>>+; const uint8_t *c[], int length, int radius,
>>+; int dstride, int stride);
>>+
>>+%if ARCH_X86_64
>>+INIT_XMM sse4
>>+%if UNIX64
>>+cglobal filter_column, 8, 15, 7, dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
>>+%else
>>+cglobal filter_column, 8, 15, 7, dst, height, rdiv, bias, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
>
>>+%endif
> These are difficult to read and understand.
I will rename some variables to make this more readable. Should I also add comments here?
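
As a reading aid for the prototype and argument list above, here is a scalar C sketch of what the assembly in this patch appears to compute. It is reconstructed from the quoted assembly only and is not the FFmpeg C implementation; the names filter_column_ref and clip_u8 are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: clamp to [0, 255], which is the combined effect of
 * packssdw followed by packuswb on one lane. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model of the routine, same argument order as the
 * prototype in the patch. `length` is what the assembly calls `width`. */
static void filter_column_ref(uint8_t *dst, int height,
                              float rdiv, float bias, const int *matrix,
                              const uint8_t *c[], int length, int radius,
                              int dstride, int stride)
{
    const int taps = 2 * radius + 1;          /* number of rows in c[] */

    for (int y = 0; y < height; y++) {        /* .loopy */
        for (int x = 0; x < length; x++) {    /* .loop16 and .loopr */
            int sum = 0;
            for (int i = 0; i < taps; i++)    /* .loopi */
                sum += c[i][y * stride + x] * matrix[i];
            /* scale, add bias, add 0.5, then truncate (cvttps2dq) */
            sum = (int)(sum * rdiv + bias + 0.5f);
            dst[y * dstride + x] = clip_u8(sum);
        }
    }
}
```

Each output pixel is a weighted sum over the 2*radius+1 row pointers in c[], scaled by rdiv, offset by bias, rounded by adding 0.5 and truncating, and finally clamped to 8 bits.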
>
>
>
>
>>+
>>+%if WIN64
>>+ SWAP m0, m2
>>+ SWAP m1, m3
>>+ mov r2q, matrixmp
>>+ mov r3q, ptrmp
>>+ mov r4q, widthmp
>>+ mov r5q, radmp
>>+ mov r6q, dstridemp
>>+ mov r7q, stridemp
>>+ DEFINE_ARGS dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
>>+%endif
>>+
>>+movsxdifnidn widthq, widthd
>>+movsxdifnidn radq, radd
>>+movsxdifnidn dstrideq, dstrided
>>+movsxdifnidn strideq, strided
>>+sal radq, 1
>
>>+add radq, 1 ;2*radius+1
> How does this compare to "LEA x, [y*2+1]"?
> And... I don't want to get into a SAL versus SHL discussion.
>
I think lea is better; I will change it in the next version.
>
>>+movsxdifnidn heightq, heightd
>>+VBROADCASTSS m0, m0
>>+VBROADCASTSS m1, m1
>>+pxor m6, m6
>>+movss m5, [half]
>>+VBROADCASTSS m5, m5
>>+
>>+xor dst_offq, dst_offq
>>+xor c_offq, c_offq
>>+
>>+.loopy:
>>+ xor off16q, off16q
>>+ cmp widthq, mmsize/4
>>+ jl .loopr
>>+
>>+ mov rq, widthq
>>+ and rq, mmsize/4-1
>>+ sub widthq, rq
>>+
>
>>+ .loop16: ;parallel process 16 elements in a row
> This processes only 4 columns per iteration. Is the intent to leave most
> registers unused?
> We target x86-64, so 16 XMM registers are available.
Will use more XMM registers and process 16 columns at a time.
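
To make the block and remainder bookkeeping concrete, here is a small C sketch of how a row can be split into a part handled block-at-a-time and a scalar tail, mirroring the `and rq, mmsize/4-1` / `sub widthq, rq` pair in the patch. The RowSplit type and split_row function are hypothetical names, and the block size is assumed to be a power of two.

```c
#include <stddef.h>

/* Hypothetical result type for the split. */
typedef struct RowSplit {
    size_t vec_len;   /* pixels handled `block` at a time */
    size_t tail_len;  /* leftover pixels handled one at a time */
} RowSplit;

/* Split `width` pixels into a multiple of `block` plus a remainder.
 * Assumes `block` is a power of two, so the mask trick is valid. */
static RowSplit split_row(size_t width, size_t block)
{
    RowSplit s;
    s.tail_len = width & (block - 1);  /* and rq, block-1 */
    s.vec_len  = width - s.tail_len;   /* sub widthq, rq  */
    return s;
}
```

With a block of 16, a 37-pixel row would be split into 32 pixels for the vector loop and 5 for the remainder loop; with the current block of 4, a 3-pixel row goes entirely to the remainder loop.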
>
>
>>+ pxor m4, m4
>>+ xor iq, iq
>>+ .loopi:
>
>>+ movss m2, [matrixq + 4*iq]
> I am not sure why you use the float data path here. We are lucky that Intel
> CPUs do not seem to incur a penalty for it.
Will change to the integer data path using movd.
movd also seems to have a lower CPI than movss.
>
>
>>+ VBROADCASTSS m2, m2
>>+ mov ciq, [ptrq + iq * gprsize]
>>+ movss m3, [ciq + c_offq] ;c[i][y*stride + off16]
>>+ punpcklbw m3, m6
>
>>+ punpcklwd m3, m6
> Since you target SSE4, the PMOVZXBD instruction is available. Moreover, an
> XMM register can hold 16 uint8 values, but you load only 4 of them.
I thought that since I multiply 4 ints at a time, loading 4 uint8s per iteration was OK.
Now I understand that reading 16 uint8s and shuffling them is faster.
Will change in the next version.
>
>>+ pmulld m2, m3
>>+ paddd m4, m2
>>+
>>+ add iq, 1
>
>>+ cmp iq, radq
> If you initialize iq to radq and decrement it on each iteration, you can drop
> one instruction.
> I know iq also serves as the index in the loop, but there are tricks for that.
Will change in the next version.
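
A C sketch of the count-down idea, for illustration only. Both functions are hypothetical and compute the same integer dot product; the second walks the index from n-1 down to 0. In hand-written assembly the count-down form lets the flags set by the decrement drive the branch, so the separate cmp can be dropped. What a C compiler emits for either form is not guaranteed.

```c
/* Counting up: in assembly this is add, cmp, jl per iteration. */
static int dot_count_up(const int *a, const int *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Counting down: the decrement itself can set the flags for the branch,
 * so no separate compare against the limit is needed. */
static int dot_count_down(const int *a, const int *b, int n)
{
    int sum = 0;
    for (int i = n; i-- > 0; )
        sum += a[i] * b[i];
    return sum;
}
```

Because integer addition is order-independent, visiting the taps in reverse does not change the sum.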
>>+ jl .loopi
>>+
>>+ cvtdq2ps m4, m4
>>+ mulps m4, m0 ; sum *= rdiv
>>+ addps m4, m1 ; sum += bias
>
>>+ addps m4, m5 ; sum += 0.5
> I am not sure whether pre-computing (bias+0.5) would cause a precision mismatch.
This part may stay unchanged, depending on the discussion.
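
The concern can be shown with a small C sketch. Float addition is not associative, so (x + bias) + 0.5 and x + (bias + 0.5) can round differently. The two functions below are hypothetical; `scaled` stands for sum * rdiv, and the volatile temporaries are only there to force each intermediate to be rounded to single precision.

```c
/* Order used in the patch: add bias, then add 0.5, then truncate. */
static int round_separate(float scaled, float bias)
{
    volatile float t = scaled + bias;  /* first rounding */
    t = t + 0.5f;                      /* second rounding */
    return (int)t;
}

/* Reviewer's variant: fold 0.5 into the bias once, outside the loop. */
static int round_precomputed(float scaled, float bias)
{
    volatile float k = bias + 0.5f;    /* pre-computed constant */
    volatile float t = scaled + k;     /* single rounding */
    return (int)t;
}
```

Assuming IEEE 754 single precision with round-to-nearest-even, scaled = 16777216.0f (2^24) and bias = 1.0f give 16777216 for the first function and 16777218 for the second. These values are far outside the 0..255 output range and were chosen only to make the effect visible; whether any in-range input produces a different clipped pixel is not settled by this sketch.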
>
>
>>+ cvttps2dq m4, m4
>>+ packssdw m4, m4
>>+ packuswb m4, m4
>>+ movss [dstq + dst_offq], m4
>>+ add c_offq, mmsize/4
>>+ add dst_offq, mmsize/4
>>+
>>+ add off16q, mmsize/4
>>+ cmp off16q, widthq
>>+ jl .loop16
>>+
>>+ add widthq, rq
>>+ cmp off16q, widthq
>>+ jge .paraend
>>+
>
>>+ .loopr:
> I do not see the need for this loop. If we are allowed to read beyond the end,
> we can reuse the SIMD code above.
This part may be left unchanged as well.
Xu Jun
>
>
>>+ xor sumd, sumd
>>+ xor iq, iq
>>+ .loopr_i:
>>+ mov ciq, [ptrq + iq * gprsize]
>>+ movzx rd, byte [ciq + c_offq]
>>+ imul rd, [matrixq + 4*iq]
>>+ add sumd, rd
>>+
>>+ add iq, 1
>>+ cmp iq, radq
>>+ jl .loopr_i
>>+
>>+ pxor m4, m4
>>+ cvtsi2ss m4, sumd
>>+ mulss m4, m0 ; sum *= rdiv
>>+ addss m4, m1 ; sum += bias
>>+ addss m4, m5 ; sum += 0.5
>>+ cvttps2dq m4, m4
>>+ packssdw m4, m4
>>+ packuswb m4, m4
>>+ movd sumd, m4
>>+ mov [dstq + dst_offq], sumb
>>+ add c_offq, 1
>>+ add dst_offq, 1
>>+ add off16q, 1
>>+ cmp off16q, widthq
>>+ jl .loopr
>>+
>>+ .paraend:
>>+ sub c_offq, widthq
>>+ sub dst_offq, widthq
>>+ add c_offq, strideq
>>+ add dst_offq, dstrideq
>>+
>>+ sub heightq, 1
>>+ cmp heightq, 0
>>+ jg .loopy
>>+
>>+.end:
>>+ RET
>
--
Yours sincerely,
Xu Jun
School of Electronic, Information and Electrical Engineering
Shanghai Jiao Tong University
Email: xujunzz at sjtu.edu.cn
No. 800, Dongchuan Road, Minhang District, Shanghai 200240, China