[FFmpeg-devel] [PATCH] x86/vf_blend: Add SSE4.1 optimization for divide
Timothy Gu
timothygu99 at gmail.com
Sun Feb 14 04:20:49 CET 2016
I've already answered these on IRC, but for the sake of completeness I'll include
the answers here as well.
On Sat, Feb 13, 2016 at 10:26:58PM -0300, James Almer wrote:
> On 2/13/2016 9:27 PM, Timothy Gu wrote:
> > ---
> >
> > The reason why this function uses SSE4.1 is the roundps instruction. Would
> > love to find a way to truncate a float to integer in SSE2.
CVTTPS2DQ (Convert with Truncation Packed Single-Precision FP Values to Packed
Dword Integers) is available in SSE2 and truncates as part of the conversion.
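As a rough illustration of the truncating conversion (not code from the patch),
here is a small C sketch using the SSE2 intrinsic _mm_cvttps_epi32, which maps to
cvttps2dq. The helper name trunc4_sse2 is made up for this example, and the
non-x86 fallback is only there so the snippet builds elsewhere.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: truncate four floats toward zero using only SSE2. */
static void trunc4_sse2(const float in[4], int32_t out[4])
{
#if defined(__SSE2__)
    __m128  v = _mm_loadu_ps(in);
    __m128i t = _mm_cvttps_epi32(v);      /* cvttps2dq: float -> int32, truncating */
    _mm_storeu_si128((__m128i *)out, t);
#else
    /* Fallback for non-SSE2 builds: a C cast also truncates toward zero
     * for values that fit in int32_t. */
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)in[i];
#endif
}
```

For values that fit in a 32-bit integer, this gives the same result as
roundps with mode 3 followed by a float-to-int conversion, without needing SSE4.1.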
> > + punpcklwd m0, m2 ; 000x000x
> > + punpcklwd m1, m2
> > +
> > + cvtdq2ps m0, m0
> > + cvtdq2ps m1, m1
> > + divps m0, m1 ; a / b
> > + mulps m0, m3 ; a / b * 255
> > + roundps m0, m0, 3 ; truncate
> > + minps m0, m3
>
> Are these two really needed? After a quick glance GCC seems to simply generate more
> or less the same code you're using here sans these two. (convert to float, div, mul,
> convert to int, saturate to uint8_t).
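roundps becomes unnecessary once cvttps2dq is used, since the conversion itself
truncates. minps is still needed for the divide-by-0 cases: a / 0 gives +inf (and
0 / 0 gives NaN), which has to be clamped to 255 before the conversion to integer.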
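To make the divide-by-0 point concrete, here is a sketch in C intrinsics of the
same sequence (div, mul by 255, min, truncating convert) for four values. It is
an illustration only, not the patch and not FFmpeg's C reference; the name
divide4_sketch is hypothetical, and the scalar branch is just a portable
stand-in that special-cases b == 0.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = min(255, a[i] / b[i] * 255), with x / 0 -> 255. */
static void divide4_sketch(const uint8_t a[4], const uint8_t b[4], uint8_t out[4])
{
#if defined(__SSE2__)
    __m128 fa = _mm_setr_ps(a[0], a[1], a[2], a[3]);
    __m128 fb = _mm_setr_ps(b[0], b[1], b[2], b[3]);
    __m128 k  = _mm_set1_ps(255.0f);
    __m128 q  = _mm_mul_ps(_mm_div_ps(fa, fb), k); /* x / 0 = +inf, 0 / 0 = NaN */
    /* minps returns its second operand when either input is NaN,
     * so both +inf and NaN end up as 255 here. */
    q = _mm_min_ps(q, k);
    __m128i r = _mm_cvttps_epi32(q);               /* cvttps2dq: truncate */
    int32_t tmp[4];
    _mm_storeu_si128((__m128i *)tmp, r);
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)tmp[i];                  /* already within 0..255 */
#else
    /* Portable fallback: handle b == 0 explicitly instead of relying on inf/NaN. */
    for (int i = 0; i < 4; i++) {
        float q = b[i] ? (float)a[i] / (float)b[i] * 255.0f : 255.0f;
        out[i] = (uint8_t)(q < 255.0f ? q : 255.0f);
    }
#endif
}
```

Without the min step, cvttps2dq would turn +inf or NaN into 0x80000000, and a
later saturating pack would produce 0 rather than 255.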
Timothy