[FFmpeg-devel] [PATCH 5/5] avcodec/h264: add avx 8-bit h264_idct_dc_add

James Darnley jdarnley at obe.tv
Thu Apr 6 19:01:54 EEST 2017


On 2017-04-05 06:05, James Almer wrote:
> On 4/4/2017 10:53 PM, James Darnley wrote:
>> Haswell:
>>  - 1.02x faster (405±0.7 vs. 397±0.8 decicycles) compared with mmxext
>>
>> Skylake-U:
>>  - 1.06x faster (498±1.8 vs. 470±1.3 decicycles) compared with mmxext
>> ---
>>  libavcodec/x86/h264_idct.asm  | 20 ++++++++++++++++++++
>>  libavcodec/x86/h264dsp_init.c |  2 ++
>>  2 files changed, 22 insertions(+)
>>
>> diff --git a/libavcodec/x86/h264_idct.asm b/libavcodec/x86/h264_idct.asm
>> index 24fb4d2..7fd57d3 100644
>> --- a/libavcodec/x86/h264_idct.asm
>> +++ b/libavcodec/x86/h264_idct.asm
>> @@ -1158,7 +1158,27 @@ INIT_XMM avx
>>      movd  [%7+%8], %4
>>  %endmacro
>>  
>> +%macro DC_ADD_INIT 1
>> +    add      %1d, 32
>> +    sar      %1d, 6
>> +    movd     m0, %1d
>> +    SPLATW   m0, m0, 0
> 
> Considering DC_ADD_MMXEXT_OP works with dwords, a single pshuflw should be
> enough. This macro calls two instructions to fill the entire XMM register,
> and there's no need for that.

Noted, I made that change but it doesn't seem to change much in terms of
performance.
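
For readers following along, here is a minimal scalar sketch in C of what
the DC_ADD_INIT rounding plus the per-row add amounts to for a 4x4 block of
8-bit pixels. The names clip_u8 and idct_dc_add_4x4_ref are hypothetical and
are not taken from the patch; only the "+32, >>6" rounding comes from the
quoted macro, and any clearing of the coefficient is left out. Each row is
only four pixels wide, so four 16-bit copies of the DC value are enough,
which is the point behind the pshuflw suggestion.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit pixel. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference, not the FFmpeg function itself.
 * Rounds the DC coefficient as DC_ADD_INIT does (add 32, arithmetic
 * shift right by 6) and adds it to every pixel of a 4x4 block. */
static void idct_dc_add_4x4_ref(uint8_t *dst, const int16_t *block,
                                ptrdiff_t stride)
{
    /* Assumes >> on a negative int is an arithmetic shift, like sar. */
    int dc = (block[0] + 32) >> 6;

    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            dst[x] = clip_u8(dst[x] + dc);  /* saturating add per pixel */
        dst += stride;
    }
}
```

The SIMD versions get the same clamping from saturating adds and subtracts
on packed bytes rather than from an explicit clip.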

> You could for that matter try to optimize DC_ADD_MMXEXT_OP a bit, combining
> said dwords with punpk* into fewer registers to reduce the amount of padd*
> and psub* needed afterwards. See ADD_RES_MMX_4_8 in hevc_add_res.asm

Noted.  Maybe in the future.

> And again, SSE2 first, AVX only if measurably faster. But since you're not
> making use of the wider XMM regs here at all, the only chips that will see
> any real speed up are those slow in mmx (like Skylake seems to be).

Yorkfield gets no benefit from sse2 (575±0.4 vs. 574±0.3 decicycles).
Haswell gets most of its benefit from sse2 (404±0.6 vs. 390±0.3 vs.
388±0.3).
Skylake-U gets all of its speedup from sse2 (533±3.0 vs. 488±2.0 vs. 497±1.4).

Nehalem and 64-bit also get no benefit from sse2.

Again: SSE2 yay or nay?  Maybe I should just drop this; I'm not sure 5
cycles is worth it.

(I will now go and modify my script to divide the recorded decicycle
count by 10.)
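
As a rough illustration of that arithmetic, the sketch below converts
decicycle readings to cycles and computes a speedup ratio as old time over
new time. The helper names are hypothetical and this is not the script
mentioned above; the inputs in the note that follows are the figures quoted
in this thread.

```c
/* Hypothetical helpers for reading the benchmark figures. */

/* A decicycle is a tenth of a cycle. */
static double decicycles_to_cycles(double decicycles)
{
    return decicycles / 10.0;
}

/* Speedup of a candidate over a baseline: baseline time / candidate time. */
static double speedup(double baseline, double candidate)
{
    return baseline / candidate;
}

/* Absolute saving in cycles, given two decicycle readings. */
static double cycles_saved(double baseline_dc, double candidate_dc)
{
    return decicycles_to_cycles(baseline_dc - candidate_dc);
}
```

With the Haswell figures from the commit message, speedup(405, 397) is
about 1.02. With the Skylake-U figures, cycles_saved(533, 488) is 4.5,
which appears to be the "5 cycles" in question.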

>> +cglobal h264_idct_dc_add_8, 3, 4, 0, dst_, block_, stride_
                                      ^
Fixed this bug.


