[FFmpeg-devel] [PATCH] lpc: rewrite lpc_compute_autocorr in external asm
James Almer
jamrial at gmail.com
Sun May 26 02:41:16 EEST 2024
On 5/25/2024 8:24 PM, Lynne via ffmpeg-devel wrote:
> On 26/05/2024 00:31, James Almer wrote:
>> On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
>>> The inline asm function had issues running under checkasm.
>>> So I came to finish what I started, and wrote the last part
>>> of LPC computation in assembly.
>>>
>>> autocorr_10_c: 135525.8
>>> autocorr_10_sse2: 50729.8
>>> autocorr_10_fma3: 19007.8
>>> autocorr_30_c: 390100.8
>>> autocorr_30_sse2: 142478.8
>>> autocorr_30_fma3: 50559.8
>>> autocorr_32_c: 407058.3
>>> autocorr_32_sse2: 151633.3
>>> autocorr_32_fma3: 50517.3
>>> ---
>>> libavcodec/x86/lpc.asm | 91 +++++++++++++++++++++++++++++++++++++++
>>> libavcodec/x86/lpc_init.c | 87 ++++---------------------------------
>>> 2 files changed, 100 insertions(+), 78 deletions(-)
>>>
>>> diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
>>> index a585c17ef5..790841b7f4 100644
>>> --- a/libavcodec/x86/lpc.asm
>>> +++ b/libavcodec/x86/lpc.asm
>>> @@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
>>> dec_tab_scalar: times 2 dq -1.0
>>> seq_tab_sse2: dq 1.0, 0.0
>>> +autoc_init_tab: times 4 dq 1.0
>>> +
>>> SECTION .text
>>> %macro APPLY_WELCH_FN 0
>>> @@ -261,3 +263,92 @@ APPLY_WELCH_FN
>>> INIT_YMM avx2
>>> APPLY_WELCH_FN
>>> %endif
>>> +
>>> +%macro COMPUTE_AUTOCORR_FN 0
>>> +cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p
>>
>> Already mentioned, but it should be 3 not 8.
>
> Already done, as said on IRC not 10 minutes after I submitted it.
>
>>
>>> +
>>> + shl lagd, 3
>>> + shl lenq, 3
>>> + xor lag_pq, lag_pq
>>> +
>>> +.lag_l:
>>> + movaps m8, [autoc_init_tab]
>>
>> m2
>>
>>> +
>>> + mov len_pq, lag_pq
>>> +
>>> + lea data_lq, [lag_pq + mmsize - 8]
>>> + neg data_lq ; -j - mmsize
>>> + add data_lq, dataq ; data[-j - mmsize]
>>> +.len_l:
>>> + ; We waste the upper value here on SSE2,
>>> + ; but we use it on AVX.
>>> + movupd xm0, [dataq + len_pq] ; data[i]
>>
>> movsd
>
> Fixed.
>
>>
>>> + movupd m1, [data_lq + len_pq] ; data[i - j]
>>> +
>>> +%if cpuflag(avx)
>>
>> %if mmsize == 32 here and everywhere else.
>
> Done.
>
>>
>>> + vbroadcastsd m0, xm0
>>
>> The register-source form of vbroadcastsd is AVX2. AVX only has the
>> memory-source form. So use that and save the movsd from above for the
>> FMA3 version.
>>
>>> + vperm2f128 m1, m1, m1, 0x01
>>
>> Aren't you loading 16 extra bytes for no reason if you're just going
>> to use the upper 16 bytes from the load above?
>
> Lane swapped, like you mentioned.
>
>>> +%endif
>>> +
>>> + shufpd m0, m0, m0, 1100b
>>
>> The last argument has two bits, not four. What you're doing here is a
>> splat/broadcast, so you don't need it for FMA3.
>>
>>> + shufpd m1, m1, m1, 0101b
>>
>> The upper two bits of imm8 are ignored.
>
> Intentional. Not ignored on FMA3.
>
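To make the imm8 disagreement above concrete, here is a minimal plain-C model of the two shuffle widths. The function names shufpd128_model and vshufpd256_model are hypothetical and are not part of the patch or of FFmpeg. The sketch assumes the usual selection rule: the 128-bit form consumes imm8 bits 0-1, and the 256-bit form consumes bits 0-3, one pair per 128-bit lane. Bits the 128-bit form does not use are simply ignored here, as described in the thread.

```c
#include <stdio.h>

/* Hypothetical model of 128-bit SHUFPD: only imm8 bits 0 and 1 select.
 * out must not alias a or b. */
static void shufpd128_model(double out[2], const double a[2],
                            const double b[2], unsigned imm)
{
    out[0] = (imm & 1) ? a[1] : a[0]; /* bit 0 picks from the first source  */
    out[1] = (imm & 2) ? b[1] : b[0]; /* bit 1 picks from the second source */
}

/* Hypothetical model of 256-bit VSHUFPD: bits 0-1 act on the low 128-bit
 * lane, bits 2-3 on the high lane. out must not alias a or b. */
static void vshufpd256_model(double out[4], const double a[4],
                             const double b[4], unsigned imm)
{
    out[0] = (imm & 1) ? a[1] : a[0];
    out[1] = (imm & 2) ? b[1] : b[0];
    out[2] = (imm & 4) ? a[3] : a[2];
    out[3] = (imm & 8) ? b[3] : b[2];
}

/* Print a vector of n doubles, for eyeballing the selections. */
static void print_vec(const char *name, const double *v, int n)
{
    printf("%s:", name);
    for (int i = 0; i < n; i++)
        printf(" %g", v[i]);
    printf("\n");
}
```

Under this model, 0101b and 01b give the same element swap at 128 bits, while at 256 bits the upper two bits are what swap the high lane, which would be consistent with both remarks above. Calling print_vec on the outputs for a few immediates is a quick way to check a chosen value by hand.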
>>> +
>>> +%if cpuflag(fma3)
>>> + fmaddpd m8, m0, m1, m8 ; sum += data[i]*data[i-j]
>>> +%else
>>> + mulpd m0, m1
>>> + addpd m8, m0 ; sum += data[i]*data[i-j]
>>> +%endif
>>> +
>>> + add len_pq, 8
>>> + cmp len_pq, lenq
>>> + jl .len_l
>>> +
>>> + movups [autocq + lag_pq], m8 ; autoc[j] = sum
>>> + add lag_pq, mmsize
>>> + cmp lag_pq, lagq
>>> + jl .lag_l
>>> +
>>> + ; The tail computation is guaranteed never to happen
>>> + ; as long as we're doing multiples of 4, rather than 2.
>>> + ; It is trivial to convert this to avx if ever needed.
>>> +%if !cpuflag(avx)
>>
>> This doesn't seem to be tested as is. Maybe the checkasm should try
>> other lag values?
>
> That's for the checkasm patch. You can trigger this check with
> fate-alac-16-lpc-orders as-is.
Checkasm should test the entire function, so if an odd lag value
triggers this chunk, that case should be tested too.
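
As a rough illustration of the coverage point, here is a minimal C sketch. It is not the checkasm test and not FFmpeg's reference code. All three function names are hypothetical. It assumes the sum is seeded with 1.0 (mirroring autoc_init_tab in the patch), that lag coefficients are produced, and that reads stay inside the buffer. One version computes a single lag per iteration, the other computes two lags per iteration with a scalar tail, and a sweep over both even and odd lag values is what actually reaches the tail.

```c
#include <stddef.h>

/* Hypothetical scalar reference:
 * autoc[j] = 1.0 + sum over i in [j, len) of data[i] * data[i - j],
 * for j in [0, lag). */
static void autocorr_ref(const double *data, ptrdiff_t len, int lag,
                         double *autoc)
{
    for (int j = 0; j < lag; j++) {
        double sum = 1.0;
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];
        autoc[j] = sum;
    }
}

/* Hypothetical variant that handles two lags per outer iteration and
 * falls back to a scalar tail when lag is odd. */
static void autocorr_pairs(const double *data, ptrdiff_t len, int lag,
                           double *autoc)
{
    int j = 0;
    for (; j + 1 < lag; j += 2) {
        double sum0 = 1.0, sum1 = 1.0;
        for (ptrdiff_t i = j; i < len; i++)
            sum0 += data[i] * data[i - j];
        for (ptrdiff_t i = j + 1; i < len; i++)
            sum1 += data[i] * data[i - j - 1];
        autoc[j]     = sum0;
        autoc[j + 1] = sum1;
    }
    if (j < lag) { /* tail: only reached when lag is odd */
        double sum = 1.0;
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];
        autoc[j] = sum;
    }
}

/* Sweep lag = 1..max_lag (capped at 64) and count the lag values for
 * which the two versions disagree. Requires len >= max_lag. */
static int count_mismatching_lags(const double *data, ptrdiff_t len,
                                  int max_lag)
{
    double a[64], b[64];
    int bad = 0;
    for (int lag = 1; lag <= max_lag && lag <= 64; lag++) {
        autocorr_ref(data, len, lag, a);
        autocorr_pairs(data, len, lag, b);
        for (int j = 0; j < lag; j++) {
            double d = a[j] - b[j];
            if (d < 0)
                d = -d;
            if (d > 1e-9) {
                bad++;
                break;
            }
        }
    }
    return bad;
}
```

A test that only passes even lag values to autocorr_pairs would never enter the tail branch, so a bug there would go unnoticed. Sweeping every lag in a range is one cheap way to avoid that.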