[FFmpeg-devel] [PATCH] x86/me_cmp: port mmxext and sse2 sad functions to yasm
James Almer
jamrial at gmail.com
Mon Sep 15 00:35:26 CEST 2014
On 14/09/14 7:12 PM, Michael Niedermayer wrote:
> On Sat, Sep 13, 2014 at 10:12:12PM -0300, James Almer wrote:
>> Also add a missing c->pix_abs[0][0] initialization, and sse2 versions of
>> sad16_x2, sad16_y2 and sad16_xy2.
>> Since the _xy2 versions are not bitexact, they are accordingly marked as
>> approximate.
>>
>> Signed-off-by: James Almer <jamrial at gmail.com>
>> ---
>
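For readers wondering why an _xy2 variant would not be bitexact: pavgb computes a rounding average, (a + b + 1) >> 1, per byte, and nesting two such averages does not always equal the exact four-tap value (a + b + c + d + 2) >> 2. The xy2 body is not quoted in this mail, so the nested form below is only an assumption about the general technique, and the helper names avg2, avg4_exact, avg4_nested and count_mismatches are made up for illustration. A minimal sketch in C:

```c
#include <stdint.h>

/* Rounding byte average, the per-lane operation of pavgb: (a + b + 1) >> 1. */
static uint8_t avg2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Exact four-tap half-pel value with a single rounding step. */
static uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)((a + b + c + d + 2) >> 2);
}

/* Assumed approximation: two levels of rounding averages (rounds three times). */
static uint8_t avg4_nested(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return avg2(avg2(a, b), avg2(c, d));
}

/* Count the inputs in 0..3 for each tap where the two results disagree. */
static int count_mismatches(void)
{
    int n = 0;
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            for (int c = 0; c < 4; c++)
                for (int d = 0; d < 4; d++)
                    if (avg4_exact((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d) !=
                        avg4_nested((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d))
                        n++;
    return n;
}
```

For taps (0, 1, 0, 0) the exact value is 0 while the nested average gives 1, so any function built this way has to be treated as approximate unless it corrects for the extra rounding.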
>> Not benched.
>
> if the author of some code doesnt benchmark his code, how can he know
> which way it is faster ?
> what effect each difference has ? ...
I didn't bench because I didn't have the time and assumed it wasn't necessary,
considering this is a port from inline to yasm with little to no changes to
the asm.
I'll try to do some quick benchmarks later.
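As a rough idea of what such a check could look like outside the tree, here is a standalone C sketch with scalar models of sad16 and sad16_x2 plus a coarse clock()-based timer. It is not the project's benchmarking method. The names sad16_ref, sad16_x2_ref, sad_fn and time_sad are hypothetical, and the MpegEncContext argument of the real functions is dropped for simplicity.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Scalar model of a 16-pixel-wide sum of absolute differences over h rows. */
static int sad16_ref(const uint8_t *pix1, const uint8_t *pix2, int stride, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride;
        pix2 += stride;
    }
    return sum;
}

/* x2 model: each pix2 byte is first averaged with its right neighbour,
 * rounding up as pavgb does. Reads pix2[16], so rows need 17 valid bytes. */
static int sad16_x2_ref(const uint8_t *pix1, const uint8_t *pix2, int stride, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - ((pix2[x] + pix2[x + 1] + 1) >> 1));
        pix1 += stride;
        pix2 += stride;
    }
    return sum;
}

typedef int (*sad_fn)(const uint8_t *, const uint8_t *, int, int);

/* Average CPU seconds per call over `runs` calls. clock() is coarse, so
 * `runs` should be large; the volatile sink keeps the calls from being
 * optimized away. */
static double time_sad(sad_fn fn, const uint8_t *a, const uint8_t *b,
                       int stride, int h, int runs)
{
    volatile int sink = 0;
    clock_t t0 = clock();
    for (int i = 0; i < runs; i++)
        sink += fn(a, b, stride, h);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / runs;
}
```

Passing the old and new implementations through the same time_sad call, with identical buffers and run counts, would give a first-order comparison; the scalar models double as a reference for checking that the port returns the same sums.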
>
>
>>
>> libavcodec/x86/me_cmp.asm | 229 +++++++++++++++++++++++++++++++++++++++++++
>> libavcodec/x86/me_cmp_init.c | 203 +++++++++-----------------------------
>> 2 files changed, 278 insertions(+), 154 deletions(-)
>>
>> diff --git a/libavcodec/x86/me_cmp.asm b/libavcodec/x86/me_cmp.asm
>> index b0741f3..68dc701 100644
>> --- a/libavcodec/x86/me_cmp.asm
>> +++ b/libavcodec/x86/me_cmp.asm
>> @@ -23,6 +23,10 @@
>>
>> %include "libavutil/x86/x86util.asm"
>>
>> +SECTION_RODATA
>> +
>> +cextern pb_1
>> +
>> SECTION .text
>>
>> %macro DIFF_PIXELS_1 4
>> @@ -465,3 +469,228 @@ cglobal hf_noise%1, 3,3,0, pix1, lsize, h
>> INIT_MMX mmx
>> HF_NOISE 8
>> HF_NOISE 16
>> +
>> +;---------------------------------------------------------------------------------------
>> +;int ff_sad_<opt>(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2, int stride, int h);
>> +;---------------------------------------------------------------------------------------
>> +%macro SAD 1
>> +cglobal sad%1, 5, 5, 3, v, pix1, pix2, stride, h
>> +%if %1 == mmsize
>> + shr hd, 1
>> +%define STRIDE strideq
>> +%else
>> +%define STRIDE 8
>> +%endif
>> + pxor m2, m2
>> +
>> +align 16
>> +.loop:
>> + movu m0, [pix2q]
>> + movu m1, [pix2q+STRIDE]
>> + psadbw m0, [pix1q]
>> + psadbw m1, [pix1q+STRIDE]
>> + paddw m2, m0
>> + paddw m2, m1
>> +%if %1 == mmsize
>> + lea pix1q, [pix1q+strideq*2]
>> + lea pix2q, [pix2q+strideq*2]
>> +%else
>> + add pix1q, strideq
>> + add pix2q, strideq
>> +%endif
>
>> + dec hd
>> + jg .loop
>
> the other loops use jnz, why the difference ?
>
Probably a copy-paste remnant. I'll make them consistent.
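For reference, the two branch forms only behave differently when the counter is zero or negative on entry: after "dec", jnz repeats while the result is non-zero, while jg repeats while the result is greater than zero as a signed value. A small C model of the two loop tails, with hypothetical names iters_jnz and iters_jg and a cap argument added only so the model terminates:

```c
#include <stdint.h>

/* Models "dec hd / jnz .loop": the body runs, the counter is decremented,
 * and the loop repeats while the 32-bit result is non-zero. An unsigned
 * counter makes the wraparound from 0 to 0xffffffff well defined. */
static long iters_jnz(int32_t h, long cap)
{
    uint32_t c = (uint32_t)h;
    long n = 0;
    do {
        n++;
        c--;
    } while (c != 0 && n < cap);
    return n;
}

/* Models "dec hd / jg .loop": the loop repeats while the decremented
 * counter is greater than zero as a signed value. jg tests ZF=0 and SF=OF,
 * which reflects the true signed result even on overflow, so a wider type
 * is used here to avoid signed overflow in C. */
static long iters_jg(int32_t h, long cap)
{
    int64_t c = h;
    long n = 0;
    do {
        n++;
        c--;
    } while (c > 0 && n < cap);
    return n;
}
```

With h = 8 both forms run 8 iterations. With h = 0 the jg form stops after one iteration, while the jnz form keeps going until the counter wraps all the way around (the model stops at cap instead).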
>
>
>> +%if mmsize == 16
>> + movhlps m0, m2
>> + paddw m2, m0
>> +%endif
>> + movd eax, m2
>> + RET
>> +%endmacro
>> +
>> +INIT_MMX mmxext
>> +SAD 8
>> +SAD 16
>> +INIT_XMM sse2
>> +SAD 16
>> +
>> +;------------------------------------------------------------------------------------------
>> +;int ff_sad_x2_<opt>(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2, int stride, int h);
>> +;------------------------------------------------------------------------------------------
>> +%macro SAD_X2 1
>> +cglobal sad%1_x2, 5, 5, 5, v, pix1, pix2, stride, h
>> +%if %1 == mmsize
>> + shr hd, 1
>> +%define STRIDE strideq
>> +%else
>> +%define STRIDE 8
>> +%endif
>> + pxor m0, m0
>> +
>
>> +align 16
>
> do these improve or reduce the speed ?
No idea. I copied them from the inline version (where they were ".p2align 4")
to keep the resulting asm as similar as possible.
I'll check nonetheless.
>
>
>
>> +.loop:
>> + movu m1, [pix2q]
>> + movu m2, [pix2q+STRIDE]
>> +%if cpuflag(sse2)
>> + movu m3, [pix2q+1]
>> + movu m4, [pix2q+STRIDE+1]
>> + pavgb m1, m3
>> + pavgb m2, m4
>> +%else
>> + pavgb m1, [pix2q+1]
>> + pavgb m2, [pix2q+STRIDE+1]
>> +%endif
>> + psadbw m1, [pix1q]
>> + psadbw m2, [pix1q+STRIDE]
>> + paddw m0, m1
>> + paddw m0, m2
>> +%if %1 == mmsize
>> + lea pix1q, [pix1q+2*strideq]
>> + lea pix2q, [pix2q+2*strideq]
>> +%else
>> + add pix1q, strideq
>> + add pix2q, strideq
>> +%endif
>
>> + dec hd
>
> dec/inc has some speed penalties on some cpus
> see 16.2 in http://www.agner.org/optimize/optimizing_assembly.pdf
Ok, I'll use sub then.