[FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
Song, Ruiling
ruiling.song at intel.com
Wed Dec 4 04:28:29 EET 2019
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> chen
> Sent: Wednesday, December 4, 2019 9:36 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
> SIMD for filter_column()
>
>
>
> At 2019-12-04 08:59:08, "Song, Ruiling" <ruiling.song at intel.com> wrote:
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> >> chen
> >> Sent: Tuesday, December 3, 2019 4:59 PM
> >> To: FFmpeg development discussions and patches <ffmpeg-
> >> devel at ffmpeg.org>
> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
> >> SIMD for filter_column()
> >>
> >> comments inline in code
> >>
> >>
> >> At 2019-12-03 15:52:07, xujunzz at sjtu.edu.cn wrote:
> >> >From: Xu Jun <xujunzz at sjtu.edu.cn>
> >[...]
> >> >+
> >> >+ cvtdq2ps m4, m4
> >> >+ mulps m4, m0 ; sum *= rdiv
> >> >+ addps m4, m1 ; sum += bias
> >>
> >> >+ addps m4, m5 ; sum += 0.5
> >> I'm not sure whether pre-computing (bias+0.5) would cause a precision mismatch.
>
> >I think it is hard to prove that the pre-computation is safe.
> Agreed. I am also worried about precision, since the result of
> floating-point operations depends on evaluation order.
> How about ROUNDPS?
ROUNDPS does not seem to be an exact match.
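
To make the two concerns above concrete, here is a minimal C sketch. It is not part of the patch: the function names are hypothetical, and the input values in the usage note were picked by hand to hit a rounding edge case. It compares the order used in the assembly, (sum * rdiv + bias) + 0.5, with a variant that folds 0.5 into the bias first, and it also shows round-to-nearest-even, which is my assumption about why a nearest-mode ROUNDPS would not match add-0.5-then-truncate.

```c
/* volatile forces every intermediate to be rounded to single precision,
 * which is what mulps/addps do per lane. */

/* Order used in the quoted assembly: (sum * rdiv + bias) + 0.5, then truncate. */
static int round_separate(int sum, float rdiv, float bias)
{
    volatile float t = (float)sum * rdiv;
    t = t + bias;
    t = t + 0.5f;
    return (int)t;              /* truncation, like cvttps2dq */
}

/* Hypothetical variant: 0.5 is folded into the bias ahead of time. */
static int round_precomputed(int sum, float rdiv, float bias)
{
    volatile float c = bias + 0.5f;
    volatile float t = (float)sum * rdiv;
    t = t + c;
    return (int)t;
}

/* Round-to-nearest with ties to even, emulated without libm by adding and
 * subtracting 2^23. Only valid for 0 <= v < 2^22 under the default rounding
 * mode; lrintf() from <math.h> would be the usual library call. */
static int round_nearest_even(float v)
{
    volatile float t = v + 8388608.0f;
    t = t - 8388608.0f;
    return (int)t;
}
```

With sum = 3, rdiv = 0.125f and bias = 0.125f - 5.0f / 134217728.0f (that is, 0.125 - 5 * 2^-27), round_separate() returns 1 and round_precomputed() returns 0, so the two orders can disagree even inside the 8-bit output range. For the tie case 2.5, add-0.5-then-truncate gives 3, while round_nearest_even(2.5f) gives 2.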
>
>
> >
> >>
> >>
> >> >+ cvttps2dq m4, m4
> >> >+ packssdw m4, m4
> >> >+ packuswb m4, m4
> >> >+ movss [dstq + dst_offq], m4
> >> >+ add c_offq, mmsize/4
> >> >+ add dst_offq, mmsize/4
> >> >+
> >> >+ add off16q, mmsize/4
> >> >+ cmp off16q, widthq
> >> >+ jl .loop16
> >> >+
> >> >+ add widthq, rq
> >> >+ cmp off16q, widthq
> >> >+ jge .paraend
> >> >+
> >>
> >> >+ .loopr:
> >> I'm not sure about this loop. If we can read beyond the end, we can reuse
> >> the SIMD code above.
> >Reusing the SIMD code above may write to memory that does not belong to
> >this slice-thread.
>
> >IMO, the code to handle remainder columns is still necessary.
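
As a rough scalar model of what the remainder loop is for (a sketch only; store_row(), scale_pixel() and VEC_PIXELS are hypothetical names, and the inner loop merely stands in for one SIMD iteration), the row is split into a vector part that advances by mmsize/4 = 4 pixels, assuming 16-byte XMM registers, and a scalar tail that never writes past width:

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_PIXELS 4   /* assumed: mmsize/4 with 16-byte XMM registers */

/* sum * rdiv + bias + 0.5, truncated and clamped to 0..255. The clamp plays
 * the role of the packssdw + packuswb saturation in the assembly. */
static uint8_t scale_pixel(int sum, float rdiv, float bias)
{
    float v = (float)sum * rdiv + bias + 0.5f;
    if (v < 0.0f)
        return 0;
    if (v > 255.0f)
        return 255;
    return (uint8_t)v;
}

static void store_row(uint8_t *dst, const int *sums, ptrdiff_t width,
                      float rdiv, float bias)
{
    ptrdiff_t x = 0;
    ptrdiff_t vec_end = width - width % VEC_PIXELS;

    /* Vector part: whole groups of VEC_PIXELS only. */
    for (; x < vec_end; x += VEC_PIXELS)
        for (int i = 0; i < VEC_PIXELS; i++)  /* stands in for one SIMD iteration */
            dst[x + i] = scale_pixel(sums[x + i], rdiv, bias);

    /* Remainder columns: one pixel at a time, so nothing past width is written. */
    for (; x < width; x++)
        dst[x] = scale_pixel(sums[x], rdiv, bias);
}
```

For width = 7 the vector part covers columns 0-3 and the tail covers columns 4-6, so a byte at dst[7] that belongs to another slice is left untouched.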
>
>
> It depends on the algorithm and the size.
> For example, with width=23:
> Process #0 [0:15]
> Process #1 [7:22]
> Both ranges have a width that is a multiple of 16.
Sounds interesting, but FFmpeg does not do it this way now.
One question: would there be a penalty for writing to the same memory addresses (both write to 7-15) from different threads?
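
For reference, this is how I read the width=23 example, as a single-threaded C sketch. The names are hypothetical, process_block() stands in for one 16-pixel SIMD iteration with a placeholder operation, and it assumes dst and src do not alias, since the overlapped columns are computed twice from the same source. The last block is moved back so that it ends exactly at width.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 16   /* pixels per block, as in the width=23 example */

/* Stands in for one 16-pixel SIMD iteration; the operation is a placeholder. */
static void process_block(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] = (uint8_t)(255 - src[i]);
}

/* Returns 0 on success, -1 if the row is too narrow for even one block. */
static int process_row_overlapped(uint8_t *dst, const uint8_t *src,
                                  ptrdiff_t width)
{
    ptrdiff_t x = 0;

    if (width < BLOCK)
        return -1;

    /* Full blocks: [0:15], [16:31], ... */
    for (; x + BLOCK <= width; x += BLOCK)
        process_block(dst + x, src + x);

    /* Leftover columns: redo one block that ends exactly at width. It overlaps
     * the previous block and rewrites those columns with identical values. */
    if (x < width)
        process_block(dst + width - BLOCK, src + width - BLOCK);

    return 0;
}
```

With width = 23 this processes [0:15] and then [7:22]; columns 7-15 are written twice with the same result and nothing at or beyond column 23 is touched. It does not answer the question about two different threads writing the same addresses.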
>