[FFmpeg-devel] [PATCH] flac/x86: add ff_flac_lpc_32_sse4()

Sun Feb 2 19:51:10 CET 2014

On 01/02/14 9:24 AM, Loren Merritt wrote:
> On Sat, 1 Feb 2014, James Almer wrote:
>> On 01/02/14 1:38 AM, James Almer wrote:
>>> x64
>>> 1261661 decicycles in flac_lpc_32_c, 32768 runs
>>> 1045689 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>>>
>>> 1431506 decicycles in flac_lpc_32_c, 32768 runs
>>> 1209322 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>>>
>>> x86
>>> 1429597 decicycles in flac_lpc_32_c, 32768 runs
>>> 953667 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>>>
>>> 1610348 decicycles in flac_lpc_32_c, 32768 runs
>>> 1079424 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>>>
>>> About 100 to 500 ms faster decoding using -threads 1 depending on song and arch.
>>> Tested using a few 24-bit samples on an AMD FX 6300, Win7 x64 and x86.
>>> Biggest speedup appears to be on x86 builds.
>>>
>>> Signed-off-by: James Almer <jamrial at gmail.com>
>>> ---
>>>  libavcodec/flacdsp.c          |  2 ++
>>>  libavcodec/flacdsp.h          |  1 +
>>>  libavcodec/x86/Makefile       |  2 ++
>>>  libavcodec/x86/flacdsp.asm    | 61 +++++++++++++++++++++++++++++++++++++++++++
>>>  libavcodec/x86/flacdsp_init.c | 39 +++++++++++++++++++++++++++
>>>  5 files changed, 105 insertions(+)
>>>  create mode 100644 libavcodec/x86/flacdsp.asm
>>>  create mode 100644 libavcodec/x86/flacdsp_init.c
>>>
>>
>> Couldn't test with Valgrind, or on a Linux box for that matter.
>> I have access to this FX 6300 for the time being so I used it to write this, but can't
>> install a VM.
>>
>> I originally wrote this doing two calculations per packed instruction (using all 128
>> bits of the xmm registers instead of 64), but after punpckldq-ing and pshufd-ing values
>> around and adding extra checks for odd pred_order values it somehow ended up slower
>> than the pure C implementation.
>> This will do until I get that other version working faster. If I can, of course.
> 
> Did you try applying the optimization from flac_lpc_16_c to flac_lpc_32_c?
> 

Yes, and it was slower, which I suppose is why the code is not shared between the two
functions the way it is in the encoder.

> A simd implementation shouldn't need any shuffles, just leave the samples
> in their natural order in the xmmregs and let a single pmuldq apply to
> nonadjacent samples. You also shouldn't need any check on the parity of
> pred_order if you zero-pad coefs[].
> 
> --Loren Merritt

So you mean read four values, pmuldq the first and third, accumulate, increase the
counter by one, and at the end of the loop sum the two partial results into one and
store that single sample? I tried that and it didn't seem faster.
If not that, what did you mean?
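
To make the zero-padding part of the suggestion concrete, here is a plain C model of it.
It is only a sketch, not the code from the patch: lpc32_ref is assumed to mirror the
scalar 32-bit LPC loop, while lpc32_padded, MAX_ORDER and the lane0/lane1 accumulators
are hypothetical names standing in for the two 64-bit halves of an xmm register. The
mapping of taps to lanes is simplified (a real pmuldq on a natural-order load multiplies
dwords 0 and 2); the point is only that a zero coefficient makes the parity check
unnecessary.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ORDER 32 /* assumed upper bound on pred_order */

/* Scalar reference: one sample per outer iteration, 64-bit accumulator.
 * Relies on >> of a negative int64_t being an arithmetic shift, which holds
 * on mainstream compilers but is implementation-defined in C. */
static void lpc32_ref(int32_t *decoded, const int32_t *coeffs,
                      int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t sum = 0;
        for (int j = 0; j < pred_order; j++)
            sum += (int64_t)coeffs[j] * decoded[j];
        decoded[pred_order] += sum >> qlevel;
    }
}

/* Two-accumulator model with the coefficients zero-padded to an even count.
 * For odd pred_order the extra tap reads decoded[pred_order] (the residual
 * about to be updated, so always in bounds) and multiplies it by zero. */
static void lpc32_padded(int32_t *decoded, const int32_t *coeffs,
                         int pred_order, int qlevel, int len)
{
    int32_t padded[MAX_ORDER + 1] = {0};
    int even_order = (pred_order + 1) & ~1; /* round up to a multiple of 2 */

    memcpy(padded, coeffs, (size_t)pred_order * sizeof(*coeffs));

    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t lane0 = 0, lane1 = 0;
        for (int j = 0; j < even_order; j += 2) {
            lane0 += (int64_t)padded[j]     * decoded[j];
            lane1 += (int64_t)padded[j + 1] * decoded[j + 1];
        }
        /* Horizontal add of the two lanes, done once per sample. */
        decoded[pred_order] += (lane0 + lane1) >> qlevel;
    }
}
```

Both functions should produce identical output for any pred_order from 1 to MAX_ORDER,
odd or even, without a branch on parity in the inner loop.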

The one way I see this becoming faster is by storing two samples at a time, like
flac_lpc_16_c does.
So: read two values, move the second to the third dword, pmuldq them, accumulate,
increase the counter by one, then at the end of the loop store those two samples.
So far I haven't succeeded in getting this to work, but hopefully it won't turn out
to be slow, the way trying to do that in C was, as mentioned above.
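
For clarity, this is the two-samples-per-iteration idea written out in C with 64-bit
accumulators. It is a sketch under assumptions, not tested against the decoder: the loop
shape is modelled on the approach used for the 16-bit path, the names lpc32_single and
lpc32_pairs are hypothetical, and pred_order is assumed to be at least 1.

```c
#include <stdint.h>

/* One sample per outer iteration, used as the baseline and for the tail. */
static void lpc32_single(int32_t *decoded, const int32_t *coeffs,
                         int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t sum = 0;
        for (int j = 0; j < pred_order; j++)
            sum += (int64_t)coeffs[j] * decoded[j];
        decoded[pred_order] += sum >> qlevel;
    }
}

/* Two samples per outer iteration. Each coefficient and each history sample
 * is loaded once and used for both running sums. Assumes pred_order >= 1. */
static void lpc32_pairs(int32_t *decoded, const int32_t *coeffs,
                        int pred_order, int qlevel, int len)
{
    int i, j;

    for (i = pred_order; i < len - 1; i += 2, decoded += 2) {
        int64_t c = coeffs[0];
        int64_t d = decoded[0];
        int64_t s0 = 0, s1 = 0;

        for (j = 1; j < pred_order; j++) {
            s0 += c * d;       /* coeffs[j-1] * decoded[j-1], for sample i   */
            d = decoded[j];
            s1 += c * d;       /* coeffs[j-1] * decoded[j],   for sample i+1 */
            c = coeffs[j];
        }
        s0 += c * d;           /* last tap for sample i */
        decoded[j] += s0 >> qlevel;
        d = decoded[j];        /* sample i+1 depends on the sample just written */
        s1 += c * d;           /* last tap for sample i+1 */
        decoded[j + 1] += s1 >> qlevel;
    }

    /* Odd number of remaining samples: finish the last one the simple way. */
    if (i < len) {
        int64_t sum = 0;
        for (j = 0; j < pred_order; j++)
            sum += (int64_t)coeffs[j] * decoded[j];
        decoded[j] += sum >> qlevel;
    }
}
```

The catch is that the second sample's last tap depends on the first sample's final
value, so the two sums cannot be finished fully in parallel; only the shared loads are
saved.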

In the meantime this version is a good speedup over the C implementation: going by the
numbers above, roughly 1.2x on x64 and 1.5x on x86.
