[FFmpeg-devel] [PATCH] flac/x86: add ff_flac_lpc_32_sse4()

Loren Merritt lorenm at u.washington.edu
Tue Feb 4 14:34:02 CET 2014


On Tue, 4 Feb 2014, James Almer wrote:

> On 03/02/14 8:17 PM, Loren Merritt wrote:
> > benchmarked on sandybridge x86_64:
> > 1358232 decicycles in flac_lpc_32_c
> > 1244575 decicycles in flac_lpc_32_sse4, James Almer's patch
> >  650045 decicycles in flac_lpc_32_sse4, this patch
>
> Wonder why storing two samples at a time generates this kind of boost in C with
> the 16 bits function, but not this one.

I expect it's because 64-bit imul is slow on AMD (4 uops). The loop
bottlenecks on that, and no other inefficiencies matter.
pmuldq is faster than imul even if you have only one sample per xmm register.
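
For reference, here is a minimal scalar sketch of the kind of loop being
discussed. It is not the code from either patch: the function name
lpc32_sketch and its parameter layout are made up for illustration, and it
assumes the usual FLAC LPC form (a 64-bit sum of coefficient * sample
products, shifted right by qlevel and added to the residual). The widening
multiply in the inner loop is the operation that compiles to a 64-bit imul
in C, and that pmuldq performs in the SIMD version.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 32-bit LPC loop; names are illustrative.
 * On entry, decoded[0..pred_order-1] holds warm-up samples and the rest
 * holds residuals. On return, decoded[] holds reconstructed samples. */
static void lpc32_sketch(int32_t *decoded, const int32_t *coeffs,
                         int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++) {
        int64_t sum = 0;
        for (int j = 0; j < pred_order; j++) {
            /* 32x32 -> 64-bit signed multiply: the costly step. */
            sum += (int64_t)coeffs[j] * decoded[i - pred_order + j];
        }
        /* Assumes an arithmetic right shift for negative sums, which is
         * what common compilers do but is implementation-defined in C. */
        decoded[i] += (int32_t)(sum >> qlevel);
    }
}
```

Each output sample depends on the previous pred_order outputs, so the
multiplies for one sample can run in parallel but consecutive samples
cannot, which is why the multiply cost dominates.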

> > +.ret:
> > +    REP_RET
>
> Isn't this only necessary for functions < SSSE3? At least that's what
> x86inc mentions.

Yes. (I didn't bother to remember that exception.)

--Loren Merritt

