[FFmpeg-devel] [PATCH] x86/vf_blend: Add SSE4.1 optimization for divide

Sun Feb 14 04:20:49 CET 2016

I've already answered these on IRC, but for the sake of completeness I'll include
the answers here as well.

On Sat, Feb 13, 2016 at 10:26:58PM -0300, James Almer wrote:
> On 2/13/2016 9:27 PM, Timothy Gu wrote:
> > ---
> > 
> > The reason why this function uses SSE4.1 is the roundps instruction. Would
> > love to find a way to truncate a float to integer in SSE2.

SSE2 already has an instruction for this: cvttps2dq (Convert with Truncation
Packed Single-Precision FP Values to Packed Dword Integers).
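
For reference, a minimal C sketch of the same truncation using the SSE2
intrinsic that maps to cvttps2dq. The helper name truncate4 is made up for
illustration and is not part of the patch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: truncate four floats toward zero using only SSE2.
 * _mm_cvttps_epi32 is the intrinsic form of cvttps2dq. */
static void truncate4(const float in[4], int32_t out[4])
{
    __m128  v = _mm_loadu_ps(in);        /* load 4 packed floats */
    __m128i t = _mm_cvttps_epi32(v);     /* convert with truncation */
    _mm_storeu_si128((__m128i *)out, t); /* store 4 packed dwords */
}
```

The conversion ignores the MXCSR rounding mode and always rounds toward zero,
which is what roundps with immediate 3 was doing.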

> > +        punpcklwd       m0, m2               ; 000x000x
> > +        punpcklwd       m1, m2
> > +
> > +        cvtdq2ps        m0, m0
> > +        cvtdq2ps        m1, m1
> > +        divps           m0, m1               ; a / b
> > +        mulps           m0, m3               ; a / b * 255
> > +        roundps         m0, m0, 3            ; truncate
> > +        minps           m0, m3
> 
> Are these two really needed? After a quick glance GCC seems to simply generate more
> or less the same code you're using here sans these two. (convert to float, div, mul,
> convert to int, saturate to uint8_t).

roundps becomes unnecessary once the conversion is done with cvttps2dq, since
that instruction already truncates. minps is still needed for the divide-by-0
cases.
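
To illustrate the divide-by-0 point, here is a rough C sketch of the same
sequence for four pixels (divide, scale, clamp, truncate). It is only a model
of the idea, not the code from the patch: blend_divide4 is a hypothetical
name, and the final narrowing is done in scalar code for brevity.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: dst = min(a / b * 255, 255), truncated, for 4 pixels. */
static void blend_divide4(const uint8_t a[4], const uint8_t b[4], uint8_t dst[4])
{
    const __m128 scale = _mm_set1_ps(255.0f);
    __m128 fa = _mm_cvtepi32_ps(_mm_setr_epi32(a[0], a[1], a[2], a[3]));
    __m128 fb = _mm_cvtepi32_ps(_mm_setr_epi32(b[0], b[1], b[2], b[3]));
    __m128 q  = _mm_mul_ps(_mm_div_ps(fa, fb), scale);  /* a / b * 255 */
    int32_t tmp[4];
    int i;

    /* minps: x / 0 gives +inf (or NaN for 0 / 0). With the constant as the
     * second operand, both end up as 255. Without this clamp, cvttps2dq
     * would turn +inf into 0x80000000 instead of a saturated value. */
    q = _mm_min_ps(q, scale);

    _mm_storeu_si128((__m128i *)tmp, _mm_cvttps_epi32(q)); /* truncate */
    for (i = 0; i < 4; i++)
        dst[i] = (uint8_t)tmp[i];  /* values are within 0..255 here */
}
```

The clamp has to come before the float-to-int conversion in this form, since
the out-of-range result of cvttps2dq can no longer be fixed up with minps.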

Timothy