[FFmpeg-devel] [PATCH 1/2] swresample: Refactor resample asm and port it to yasm

Thu Mar 20 02:16:17 CET 2014

On 19/03/14 9:08 PM, Michael Niedermayer wrote:
> On Wed, Mar 19, 2014 at 06:45:03PM -0300, James Almer wrote:
>> This reduces code duplication and makes it easier to implement new asm 
>> functions in the future
>>
>> Signed-off-by: James Almer <jamrial at gmail.com>
>> ---
>>  libswresample/resample.c            | 96 ++++++++++---------------------------
>>  libswresample/resample_template.c   | 49 +++++++------------
>>  libswresample/swresample_internal.h | 24 ++++++++++
>>  libswresample/x86/Makefile          |  1 +
>>  libswresample/x86/resample.asm      | 64 +++++++++++++++++++++++++
>>  libswresample/x86/resample_mmx.h    | 74 ----------------------------
>>  libswresample/x86/swresample_x86.c  | 16 +++++++
>>  7 files changed, 148 insertions(+), 176 deletions(-)
>>  create mode 100644 libswresample/x86/resample.asm
>>  delete mode 100644 libswresample/x86/resample_mmx.h
> 
> benchmark:
> 
> before: 253482 decicycles in resample, 1024 runs, 0 skips
> after   356545 decicycles in resample, 1024 runs, 0 skips
> 
> tested using ffplay HAYLEY\ WESTENRA-WHISPERS\ IN\ A\ DREAM.webm -af aformat=s32,aresample=48000,aformat=s32
> 
> 

Where did you put the timer.h macros? I put them at the beginning and end of 
the swri_resample_<sampleformat> function/macro in resample_template.c.
And what about 16-bit 44100 Hz to 16-bit 22050 Hz (using the sse2 code), which 
is the one I tried and where I noticed a boost?
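
To make the placement concrete, here is a minimal sketch of bracketing a whole
resample-style function with a start/stop timer pair. The SKETCH_* macros and
resample_sketch() are hypothetical stand-ins, not the real START_TIMER /
STOP_TIMER from libavutil/timer.h nor the real swri_resample code; they only
accumulate clock() ticks so the example builds on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for FFmpeg's START_TIMER / STOP_TIMER(id) macros.
 * The real ones read a CPU cycle counter and print statistics; these only
 * accumulate clock() ticks and count completed timed regions. */
static clock_t sketch_ticks;   /* total ticks spent inside timed regions */
static int     sketch_runs;    /* number of timed regions completed      */

#define SKETCH_START_TIMER clock_t sketch_t0 = clock();
#define SKETCH_STOP_TIMER  do { sketch_ticks += clock() - sketch_t0; \
                                sketch_runs++; } while (0);

/* Hypothetical resample-like routine: one FIR dot product per output sample.
 * src must hold at least n + taps - 1 samples.
 * The timer brackets the whole function body, so one "run" covers the
 * complete filter loop of a call, not a single dot product. */
static void resample_sketch(int32_t *dst, const int16_t *src,
                            const int16_t *filter, int taps, int n)
{
    assert(taps > 0 && n >= 0);
    SKETCH_START_TIMER
    for (int i = 0; i < n; i++) {
        int32_t acc = 0;
        for (int j = 0; j < taps; j++)
            acc += src[i + j] * filter[j];
        dst[i] = acc;
    }
    SKETCH_STOP_TIMER
}
```

With the timer placed like this, anything executed per call (including any
cleanup at the end of the function) is part of the measured region, which is
why two people timing at different points can see very different numbers.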

Testing a 16-bit 44100 Hz file and using the command you mention above (but with 
ffmpeg) I get

before: 2606446 decicycles in resample, 65522 runs, 14 skips
after:  2642538 decicycles in resample, 65497 runs, 39 skips

Which is indeed slower, but not nearly as bad as in your test. Though without 
testing the same files I doubt we could get a proper picture.
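
For reference, the relative slowdown implied by the two sets of figures can be
worked out with a small helper. This is just arithmetic on the numbers quoted
in this thread, taken at face value; slowdown_percent() is a hypothetical name
used only for illustration.

```c
#include <assert.h>

/* Relative slowdown in percent: positive means "after" is slower
 * than "before". Both arguments are decicycle counts per run. */
static double slowdown_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

slowdown_percent(2606446, 2642538) comes out at roughly 1.4, while
slowdown_percent(253482, 356545) comes out at roughly 40.7.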

Nonetheless, we can drop this patch if it really affects performance that much 
in some scenarios. I mainly wrote it to reduce the considerable code duplication 
that exists and that will increase with each asm version added, and to remove 
arch-specific code that was outside the respective folders.

In that case I can port the float sse version to inline asm instead.

>> +%if mmsize == 8
>> +    emms
>> +%endif
> 
> this is not ok
> emms is slow and does not belong in the inner loop

This is a problem. I'm not sure how to make sure emms_c() is run from outside 
the loop, and only when an mmx version of scalarproduct is used.
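
One possible direction, purely as a sketch and not something the patch does:
record at init time whether the selected function needs the cleanup, and check
that flag once after the loop. Every name below (SketchDSP, needs_emms,
sketch_emms, sketch_dsp_init, sketch_resample) is hypothetical, and the "mmx"
path reuses the C dot product so the example runs anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical DSP context: alongside the function pointer, remember
 * whether the selected implementation leaves the FPU in MMX state. */
typedef struct SketchDSP {
    int32_t (*dot)(const int16_t *a, const int16_t *b, int len);
    int needs_emms;            /* nonzero only for an MMX implementation */
} SketchDSP;

static int sketch_emms_calls;  /* counts calls to the emms stand-in */

/* Stand-in for emms_c(); the real one executes the EMMS instruction. */
static void sketch_emms(void) { sketch_emms_calls++; }

/* Plain C dot product, used for every selection in this sketch. */
static int32_t dot_c(const int16_t *a, const int16_t *b, int len)
{
    int32_t acc = 0;
    for (int i = 0; i < len; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Init-time selection; use_mmx stands in for a CPU-flag check. */
static void sketch_dsp_init(SketchDSP *dsp, int use_mmx)
{
    dsp->dot        = dot_c;   /* a real init would pick the asm version */
    dsp->needs_emms = use_mmx;
}

/* src must hold at least n + taps - 1 samples. */
static void sketch_resample(const SketchDSP *dsp, int32_t *dst,
                            const int16_t *src, const int16_t *filter,
                            int taps, int n)
{
    assert(taps > 0 && n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = dsp->dot(src + i, filter, taps);
    if (dsp->needs_emms)
        sketch_emms();         /* once per call, not once per sample */
}
```

The cost of the extra flag is one branch per call to the outer function, while
the SSE2 and C paths never pay for the cleanup at all.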