[FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

Wed Dec 4 04:28:29 EET 2019

> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> chen
> Sent: Wednesday, December 4, 2019 9:36 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
> SIMD for filter_column()
> 
> 
> 
> At 2019-12-04 08:59:08, "Song, Ruiling" <ruiling.song at intel.com> wrote:
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> >> chen
> >> Sent: Tuesday, December 3, 2019 4:59 PM
> >> To: FFmpeg development discussions and patches <ffmpeg-
> >> devel at ffmpeg.org>
> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
> >> SIMD for filter_column()
> >>
> >> comments inline in code
> >>
> >>
> >> At 2019-12-03 15:52:07, xujunzz at sjtu.edu.cn wrote:
> >> >From: Xu Jun <xujunzz at sjtu.edu.cn>
> >[...]
> >> >+
> >> >+        cvtdq2ps m4, m4
> >> >+        mulps m4, m0     ; sum *= rdiv
> >> >+        addps m4, m1     ; sum += bias
> >>
> >> >+        addps m4, m5     ; sum += 0.5
> >> I am not sure whether pre-computing (bias+0.5) would cause a precision mismatch.
> 
> >I think it is hard to prove that pre-computing is safe.
> Agreed, I am also worried about precision, since floating-point operations
> depend on evaluation order.
> How about ROUNDPS?
It does not seem to be an exact match.
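
To make the two concerns above concrete, here is a minimal scalar sketch in C. It is not code from the patch: the function names are hypothetical, the input values are contrived to hit a rounding tie, and round_half_even() is only a scalar model of round-to-nearest-even (the ROUNDPS nearest mode) for non-negative inputs.

```c
#include <stdio.h>

/* Order used in the quoted asm: ((sum * rdiv) + bias) + 0.5, then truncate.
 * volatile forces every intermediate to be rounded to single precision. */
static int round_separate(int sum, float rdiv, float bias)
{
    volatile float t = (float)sum * rdiv;
    t = t + bias;
    t = t + 0.5f;
    return (int)t;
}

/* Proposed shortcut: fold 0.5 into the bias once, outside the loop. */
static int round_precomputed(int sum, float rdiv, float bias)
{
    volatile float bias_half = bias + 0.5f; /* may already round here */
    volatile float t = (float)sum * rdiv;
    t = t + bias_half;
    return (int)t;
}

/* Rounding done by the asm: add 0.5, then truncate (cvttps2dq). */
static int round_half_up(float x)
{
    return (int)(x + 0.5f);
}

/* Scalar model of round-to-nearest-even, valid for x >= 0 only. */
static int round_half_even(float x)
{
    int i = (int)x;
    float frac = x - (float)i;
    if (frac > 0.5f || (frac == 0.5f && (i & 1)))
        i++;
    return i;
}
```

With the contrived inputs sum = -1, rdiv = 0x1p-24f and bias = 0.5f + 0x1p-24f, round_separate() returns 1 and round_precomputed() returns 0, because bias + 0.5 is a tie that rounds to 1.0 before sum * rdiv is added. For the rounding mode, round_half_up(2.5f) is 3 while round_half_even(2.5f) is 2, so the two disagree on exact .5 values.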
> 
> 
> >
> >>
> >>
> >> >+        cvttps2dq m4, m4
> >> >+        packssdw m4, m4
> >> >+        packuswb m4, m4
> >> >+        movss [dstq + dst_offq], m4
> >> >+        add c_offq, mmsize/4
> >> >+        add dst_offq, mmsize/4
> >> >+
> >> >+        add off16q, mmsize/4
> >> >+        cmp off16q, widthq
> >> >+        jl .loop16
> >> >+
> >> >+    add widthq, rq
> >> >+    cmp off16q, widthq
> >> >+    jge .paraend
> >> >+
> >>
> >> >+    .loopr:
> >> I am not sure about this loop. If we can read beyond the end, we can
> >> reuse the SIMD code above.
> >Reusing the SIMD code above may write to memory that does not belong to
> >this slice thread.
> 
> >IMO, the code to handle remainder columns is still necessary.
> 
> 
> It depends on the algorithm and size.
> For example, with width=23:
> Process #0 [0:15]
> Process #1 [7:22]
> Both of them cover a multiple of 16 columns.
Sounds interesting, but FFmpeg does not do it this way now.
One question: will this incur a penalty for writing to the same memory addresses (both write to 7-15) from different threads?
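
For reference, the overlapping-tail idea for width=23 can be sketched in scalar C as below. This is an illustration only, not FFmpeg code: BLOCK, process_block() and process_row_overlapped() are hypothetical names, the per-pixel operation is a placeholder, and the sketch assumes src and dst do not alias (an in-place filter would compute the overlapped columns from already-modified input).

```c
#include <stdint.h>

#define BLOCK 16 /* stand-in for the number of pixels one SIMD iteration handles */

/* Placeholder for one full-width SIMD iteration: always touches BLOCK pixels. */
static void process_block(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] = (uint8_t)(255 - src[i]);
}

/* Process a row with full blocks only. The last block is shifted left so
 * that it ends exactly at width, recomputing some columns instead of
 * running a scalar remainder loop. Returns the number of blocks processed,
 * or -1 if the row is too narrow and a scalar path would be needed. */
static int process_row_overlapped(uint8_t *dst, const uint8_t *src, int width)
{
    int blocks = 0;
    int x = 0;

    if (width < BLOCK)
        return -1;

    for (; x + BLOCK <= width; x += BLOCK) {
        process_block(dst + x, src + x);
        blocks++;
    }
    if (x < width) {
        /* e.g. width=23: this block covers [7:22] and overlaps [7:15] */
        process_block(dst + width - BLOCK, src + width - BLOCK);
        blocks++;
    }
    return blocks;
}
```

For width=23 this runs two blocks, [0:15] and [7:22], and never writes past column 22. Both blocks are issued sequentially by the same caller here, so the sketch says nothing about the cost of two different threads writing the same addresses.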

> 