[FFmpeg-devel] [PATCH v2 2/3] avfilter/x86/vf_exposure: add ff_exposure_avx2

Sun Nov 21 08:09:36 EET 2021

James Almer<mailto:jamrial at gmail.com>:
>On 11/20/2021 5:42 PM, Wu Jianhua wrote:
>> James Almer<mailto:jamrial at gmail.com>:
>> On 11/4/2021 1:18 AM, Wu Jianhua wrote:
>>>> Performance data(Less is better):
>>>>       exposure_sse:   500491
>>
>>> You reported a better result in the first patch.
>>
>> They were tested on different baselines, so I think it is better to compare only these two values.
>>
>>>>       exposure_avx2:  449122
>>
>>> This looks like a really low speed up for a function that processes
>>>   twice the amount of floats per loop.
>>
>>>>
>>>> Signed-off-by: Wu Jianhua <jianhua.wu at intel.com>
>>>> ---
>>>>    libavfilter/x86/vf_exposure.asm    | 15 +++++++++++++++
>>>>    libavfilter/x86/vf_exposure_init.c |  6 ++++++
>>>>    2 files changed, 21 insertions(+)
>>>>
>>>> diff --git a/libavfilter/x86/vf_exposure.asm b/libavfilter/x86/vf_exposure.asm
>>>> index 3351c6fb3b..f271167805 100644
>>>> --- a/libavfilter/x86/vf_exposure.asm
>>>> +++ b/libavfilter/x86/vf_exposure.asm
>>>> @@ -36,11 +36,21 @@ cglobal exposure, 2, 2, 4, ptr, length, black, scale
>>>>        VBROADCASTSS m1, xmm1
>>>>    %endif
>>>>
>>>> +%if cpuflag(fma3) || cpuflag(fma4)
>>
>>> Remove the fma4 check if you're not using it.
>>
>> No problem. The avx2 flag is only ever set together with fma3, so the fma4 check is indeed redundant.
>>
>>>> +    mulps       m0, m0, m1 ; black * scale
>>>> +%endif
>>>> +
>>>>    .loop:
>>>> +%if cpuflag(fma3) || cpuflag(fma4)
>>>> +    mova        m2, m0
>>>> +    vfmsub231ps m2, m1, [ptrq]
>>>> +    movu    [ptrq], m2
>>
>>> Have you tried to not use FMA for this and just kept the sub + mul even
>>> for AVX2 and see how it performs?
>>
>> Yes, definitely. I ran plenty of tests beforehand. The first version kept sub + mul
>> for AVX2. After that, I kept looking for a way to speed it up further. Using FMA
>> here is indeed faster than sub + mul, by roughly 4%-10%.
>> Not that much better, but still the best approach I have found so far.

> I tried the checkasm test you wrote, and when I made the AVX2 version use
> sub + mul instead of vfmsub231ps I noticed that I could change the
> epsilon value to FLT_EPSILON instead of 0.01f and the test would still
> succeed, meaning the output of the version using vfmsub231ps deviates a
> bit from the normal sub + mul one.

> The speed up is pretty small, so it may be worth just using the sub +
> mul version instead.

Yeah. Small, but it’s not called just once. Many a little makes a mickle, doesn’t it?
I would prefer to keep this.
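
For reference, here is a minimal C sketch of the two formulations being compared. It is not the code in the patch. The function names are hypothetical, and the fused multiply-subtract is approximated by doing the arithmetic in double and rounding to float once at the end, so it is close to but not guaranteed bit-identical with vfmsub231ps (fmaf() from <math.h> would be the direct scalar equivalent). It only illustrates why (x - black) * scale and scale * x - black * scale can differ in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form: subtract, then multiply. Each step rounds to float. */
static void exposure_sub_mul(float *ptr, size_t length, float black, float scale)
{
    for (size_t i = 0; i < length; i++)
        ptr[i] = (ptr[i] - black) * scale;
}

/* Fused-style form, following the structure of the patch: black * scale is
 * hoisted out of the loop and rounded to float once, then every sample is
 * computed as scale * x - (black * scale).
 * Assumption: the multiply-subtract is carried out in double and rounded to
 * float at the end, as a stand-in for the single rounding of a real FMA. */
static void exposure_fused(float *ptr, size_t length, float black, float scale)
{
    const float bs = black * scale;
    for (size_t i = 0; i < length; i++)
        ptr[i] = (float)((double)scale * (double)ptr[i] - (double)bs);
}

/* Largest absolute difference between two buffers of the same length. */
static float max_abs_diff(const float *a, const float *b, size_t length)
{
    float worst = 0.0f;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < length; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Running both functions on copies of the same buffer and passing the results to max_abs_diff() gives the deviation that the test's epsilon has to cover. The two forms are algebraically equal, so any difference comes only from where the rounding happens.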