[FFmpeg-devel] [PATCH] flac/x86: add ff_flac_lpc_32_sse4()

James Almer jamrial at gmail.com
Sat Feb 1 05:45:38 CET 2014


On 01/02/14 1:38 AM, James Almer wrote:
> x64
> 1261661 decicycles in flac_lpc_32_c, 32768 runs
> 1045689 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> 
> 1431506 decicycles in flac_lpc_32_c, 32768 runs
> 1209322 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> 
> x86
> 1429597 decicycles in flac_lpc_32_c, 32768 runs
> 953667 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> 
> 1610348 decicycles in flac_lpc_32_c, 32768 runs
> 1079424 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> 
> About 100 to 500 ms faster decoding using -threads 1 depending on song and arch.
> Tested using a few 24 bits samples on an AMD FX 6300, Win7 x64 and x86.
> Biggest speedup appears to be on x86 builds.
> 
> Signed-off-by: James Almer <jamrial at gmail.com>
> ---
>  libavcodec/flacdsp.c          |  2 ++
>  libavcodec/flacdsp.h          |  1 +
>  libavcodec/x86/Makefile       |  2 ++
>  libavcodec/x86/flacdsp.asm    | 61 +++++++++++++++++++++++++++++++++++++++++++
>  libavcodec/x86/flacdsp_init.c | 39 +++++++++++++++++++++++++++
>  5 files changed, 105 insertions(+)
>  create mode 100644 libavcodec/x86/flacdsp.asm
>  create mode 100644 libavcodec/x86/flacdsp_init.c
> 

I couldn't test this with Valgrind, or on a Linux box for that matter.
I have access to this FX 6300 for the time being, so I used it to write this, but I can't
install a VM.

I originally wrote this doing two calculations per packed instruction (using all 128
bits of the xmm registers instead of 64), but after punpckldq-ing and pshufd-ing values
around and adding extra checks for odd pred_order values, it somehow ended up slower
than the pure C implementation.
This will do until I get that other version working faster. If I can, of course.
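
For readers following along, here is a rough C sketch of the kind of computation involved and of why odd pred_order values need extra handling once taps are processed two at a time. It is only an illustration, not the code from the patch: the names lpc32_scalar and lpc32_paired are made up, the signature is an assumption, and the two 64-bit accumulators merely stand in for the two 64-bit lanes of an xmm register.

```c
#include <stdint.h>

/* Hypothetical scalar reference: each output sample gets a 64-bit dot
 * product of the previous pred_order samples, shifted right by qlevel,
 * added to the residual already stored in the buffer. */
static void lpc32_scalar(int32_t *decoded, const int32_t *coeffs,
                         int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t sum = 0;
        for (int j = 0; j < pred_order; j++)
            sum += (int64_t)coeffs[j] * decoded[j];
        decoded[pred_order] += (int32_t)(sum >> qlevel);
    }
}

/* Hypothetical paired variant: taps are consumed two at a time into two
 * independent accumulators, so an odd pred_order leaves one tap over
 * that has to be handled separately. */
static void lpc32_paired(int32_t *decoded, const int32_t *coeffs,
                         int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t lane0 = 0, lane1 = 0;
        int j = 0;
        for (; j + 1 < pred_order; j += 2) {
            lane0 += (int64_t)coeffs[j]     * decoded[j];
            lane1 += (int64_t)coeffs[j + 1] * decoded[j + 1];
        }
        if (j < pred_order)                /* leftover tap for odd orders */
            lane0 += (int64_t)coeffs[j] * decoded[j];
        decoded[pred_order] += (int32_t)((lane0 + lane1) >> qlevel);
    }
}
```

Both functions should produce identical output for any pred_order; the paired form just makes the odd-order tail explicit, which is the extra check mentioned above.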

Regards.

