[FFmpeg-devel] [PATCH 4/4] avcodec/h264: sse2, avx h luma mbaff deblock/loop filter

James Darnley jdarnley at obe.tv
Wed Feb 15 20:05:58 EET 2017


On 2017-02-15 17:55, James Almer wrote:
> On 2/13/2017 9:44 AM, James Darnley wrote:
>> x86-64 only
>>
>> Yorkfield:
>> - sse2: 2.16x (434 vs. 201 cycles)
>>
>> Skylake:
>> - sse2: 3.04x (378 vs. 124 cycles)
>> - avx:  3.29x (378 vs. 115 cycles)
>> ---
>>  libavcodec/x86/h264_deblock.asm | 119 ++++++++++++++++++++++++++++++++++++++++
>>  libavcodec/x86/h264dsp_init.c   |  10 ++++
>>  2 files changed, 129 insertions(+)
>>
>> diff --git a/libavcodec/x86/h264_deblock.asm b/libavcodec/x86/h264_deblock.asm
>> index 509a0dbe0c..f47a199e8f 100644
>> --- a/libavcodec/x86/h264_deblock.asm
>> +++ b/libavcodec/x86/h264_deblock.asm
>> @@ -377,10 +377,129 @@ cglobal deblock_h_luma_8, 5,9,0,0x60+16*WIN64
>>      RET
>>  %endmacro
>>  
>> +; TODO: use macro arguments
>> +%macro TRANSPOSE_8X8B_XMM 8
> 
> Why not put this in x86util? And using arguments, of course.
> Also, just call it TRANSPOSE_8X8B.

I could call it that but people might be tempted to think it would work
on mmx registers.  Perhaps I can add a check for mmsize and error in
that case.

>> +    punpcklbw m0, m1
>> +    punpcklbw m2, m3
>> +    punpcklbw m4, m5
>> +    punpcklbw m6, m7
>> +
>> +    punpckhwd m1, m0, m2
>> +    punpcklwd m0, m2
> 
> Use SBUTTERFLY here and below.

Will do.

>> +
>> +    punpckhwd m5, m4, m6
>> +    punpcklwd m4, m6
>> +
>> +    punpckhdq m2, m0, m4
>> +    punpckldq m0, m4
>> +
>> +    punpckhdq m6, m1, m5
>> +    punpckldq m1, m5
>> +
>> +    MOVHL     m4, m0
>> +    MOVHL     m3, m2
>> +    MOVHL     m7, m6
>> +    MOVHL     m5, m1
>> +    SWAP 1, 4
>> +%endmacro
>> +
>> +%macro TRANSPOSE_8X8B_XMM 0
>> +    TRANSPOSE_8X8B_XMM 0, 1, 2, 3, 4, 5, 6, 7
> 
> This seems wrong, or at least superfluous.

I had the idea of letting it be a default so I (or other users) wouldn't
have to enter 0-7 every time.  Perhaps I could define the macro to take
zero to seven parameters but that is slightly suboptimal because it
would allow one to six parameters to be given

>> +%endmacro
>> +
>> +%macro DEBLOCK_H_LUMA_MBAFF 0
>> +
>> +cglobal deblock_h_luma_mbaff_8, 5, 9, 10, 8*16, pix_, stride_, alpha_, beta_, tc0_
> 
> Why the underscores?

Because I grew sick of seeing "named", "nameq", etc.  So I decided I
wanted a separator.  I could change x264asm but that seems rather involved.

(Ideally I would change nasm/yasm to support unicode so I could use
subscripted numbers but that is even harder.  Plus I would struggle to
enter the characters using my keyboard.  Heheheh)

>> +    movsxd stride_q,  stride_d
>> +    dec    alpha_d
>> +    dec    beta_d
>> +    mov    r5,        pix_q
>> +    lea    r6,       [3*stride_q]
> 
> Call r6 stride3.

I'll consider it.

Thanks for the comments.



More information about the ffmpeg-devel mailing list