[FFmpeg-devel] [PATCH v2 2/3] avfilter/x86/vf_exposure: add ff_exposure_avx2
Wu Jianhua
toqsxw at outlook.com
Sun Nov 21 08:09:36 EET 2021
James Almer<mailto:jamrial at gmail.com>:
>On 11/20/2021 5:42 PM, Wu Jianhua wrote:
>> James Almer<mailto:jamrial at gmail.com>:
>>> On 11/4/2021 1:18 AM, Wu Jianhua wrote:
>>>> Performance data(Less is better):
>>>> exposure_sse: 500491
>>
>>> You reported a better result in the first patch.
>>
>> They were measured against different baselines, so I think it is better to compare only these two values.
>>
>>>> exposure_avx2: 449122
>>
>>> This looks like a really low speed up for a function that processes
>>> twice the amount of floats per loop.
>>
>>>>
>>>> Signed-off-by: Wu Jianhua <jianhua.wu at intel.com>
>>>> ---
>>>> libavfilter/x86/vf_exposure.asm | 15 +++++++++++++++
>>>> libavfilter/x86/vf_exposure_init.c | 6 ++++++
>>>> 2 files changed, 21 insertions(+)
>>>>
>>>> diff --git a/libavfilter/x86/vf_exposure.asm b/libavfilter/x86/vf_exposure.asm
>>>> index 3351c6fb3b..f271167805 100644
>>>> --- a/libavfilter/x86/vf_exposure.asm
>>>> +++ b/libavfilter/x86/vf_exposure.asm
>>>> @@ -36,11 +36,21 @@ cglobal exposure, 2, 2, 4, ptr, length, black, scale
>>>> VBROADCASTSS m1, xmm1
>>>> %endif
>>>>
>>>> +%if cpuflag(fma3) || cpuflag(fma4)
>>
>>> Remove the fma4 check if you're not using it.
>>
>> No problem. The AVX2 path is only enabled together with fma3, so the fma4 check is indeed redundant.
>>
>>>> + mulps m0, m0, m1 ; black * scale
>>>> +%endif
>>>> +
>>>> .loop:
>>>> +%if cpuflag(fma3) || cpuflag(fma4)
>>>> + mova m2, m0
>>>> + vfmsub231ps m2, m1, [ptrq]
>>>> + movu [ptrq], m2
>>
>>> Have you tried to not use FMA for this and just kept the sub + mul even
>>> for AVX2 and see how it performs?
>>
>> Yes, definitely. I ran plenty of tests beforehand. The first version kept sub + mul
>> for AVX2. After that, I kept looking for a way to speed it up further. Using FMA
>> here is indeed faster than sub + mul, by approximately 4%-10%.
>> Not much better, but it is still the fastest approach I have found so far.
> I tried the checkasm test you wrote, and when I made the AVX2 version use
> sub + mul instead of vfmsub231ps I noticed that I could change the
> epsilon value to FLT_EPSILON instead of 0.01f and the test would still
> succeed, meaning the output of the version using vfmsub231ps deviates a
> bit from the normal sub + mul one.
> The speed up is pretty small, so it may be worth just using the sub +
> mul version instead.
Yeah, it is small, but the function is not called just once. Many a little makes a mickle, doesn’t it?
I would prefer to keep this.