[FFmpeg-devel] [PATCH 1/2] libavutil/cpu: Adds fast gather detection.

Mon Jul 12 13:46:08 EEST 2021

12 Jul 2021, 11:29 by alankelly-at-google.com at ffmpeg.org:

> On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alankelly at google.com> wrote:
>
>> On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev at lynne.ee> wrote:
>>
>>> Jun 25, 2021, 09:54 by alankelly-at-google.com at ffmpeg.org:
>>>
>>> > Broadwell and later and Zen3 and later have fast gather instructions.
>>> > ---
>>> >  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
>>> >  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>>> >  libavutil/cpu.h     |  2 ++
>>> >  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>>> >  libavutil/x86/cpu.h |  1 +
>>> >  3 files changed, 19 insertions(+), 2 deletions(-)
>>> >
>>>
>>> No, we really don't need more FAST/SLOW flags, especially for
>>> something like this which is just fixable by _not_using_vgather_.
>>> Take a look at libavutil/x86/tx_float.asm, we only use vgather
>>> if it's guaranteed to either be faster for what we're gathering or
>>> is just as fast "slow". If neither is true, we use manual lookups,
>>> which is actually advantageous since for AVX2 we can interleave
>>> the lookups that happen in each lane.
>>>
>>> Even if we disregard this, I've extensively benchmarked vgather
>>> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
>>> a great vgather improvement to be found in Zen 3 to justify
>>> using a new CPU flag for this.
>>
>> Thanks for your response. I'm not against finding a cleaner way of
>> enabling/disabling the code which will be protected by this flag. However,
>> the proposed manual-lookup solution will not work in this case: the avx2
>> version of hscale will only be faster if fast gathers are available;
>> otherwise, the ssse3 version should be used.
>>
>> I don't have access to a Zen 3, so I can't comment on its performance. I
>> have tested on a Zen 2 and it is slow. On Broadwell, hscale avx2 is about
>> 10% faster than the ssse3 version, and on Skylake about 40% faster. Haswell
>> has similar performance to Zen 2.
>>
>> Is there a proxy which could be used for detecting Broadwell or Skylake
>> and later? AVX512 seems too strict as there are Skylake chips without
>> AVX512. Thanks
>>
>
> Hi,
>
> I will paste the performance figures from the thread for the other part of
> this patch here so that the justification for this flag is clearer:
>
>                                 Skylake   Haswell
> hscale_8_to_15_width4_ssse3       761.2       760
> hscale_8_to_15_width4_avx2        468.7       957
> hscale_8_to_15_width8_ssse3      1170.7      1032
> hscale_8_to_15_width8_avx2        865.7      1979
> hscale_8_to_15_width12_ssse3     2172.2      2472
> hscale_8_to_15_width12_avx2      1245.7      2901
> hscale_8_to_15_width16_ssse3     2244.2      2400
> hscale_8_to_15_width16_avx2      1647.2      3681
>
> As you can see, it is catastrophic on Haswell and older chips but the gains
> on Skylake are impressive.
> As I don't have performance figures for Zen 3, I can disable this feature
> on all CPUs apart from Broadwell and later, since you say that there is no
> worthwhile improvement on Zen 3. Is this OK with you?
>

It's not that catastrophic. Since Haswell CPUs generally don't have
large AVX2 gains, could you just exclude Haswell only from
EXTERNAL_AVX2_FAST, and require EXTERNAL_AVX2_FAST
to enable those functions?
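
For illustration only, here is a rough sketch of what excluding Haswell could
look like. It does not reproduce FFmpeg's actual EXTERNAL_AVX2_FAST definition
or its CPU detection code. Every name below (CpuSignature,
decode_cpuid_signature, is_intel_haswell, avx2_counts_as_fast) is hypothetical,
and the sketch assumes the caller has already read the vendor string and the
EAX value of CPUID leaf 1. The Haswell model numbers used are family 6, models
0x3C, 0x3F, 0x45 and 0x46.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type, not FFmpeg API: display family and model
 * decoded from the EAX value returned by CPUID leaf 1. */
typedef struct CpuSignature {
    unsigned family;
    unsigned model;
} CpuSignature;

static CpuSignature decode_cpuid_signature(uint32_t eax)
{
    CpuSignature sig;
    sig.family = (eax >> 8) & 0xf;
    sig.model  = (eax >> 4) & 0xf;
    /* The extended model field only applies to base families 6 and 15. */
    if (sig.family == 6 || sig.family == 15)
        sig.model |= ((eax >> 16) & 0xf) << 4;
    /* The extended family field only applies to base family 15. */
    if (sig.family == 15)
        sig.family += (eax >> 20) & 0xff;
    return sig;
}

/* Assumption: the caller has already confirmed a "GenuineIntel" vendor
 * string, since other vendors also ship family 6 parts. */
static bool is_intel_haswell(uint32_t eax)
{
    CpuSignature sig = decode_cpuid_signature(eax);
    if (sig.family != 6)
        return false;
    switch (sig.model) {
    case 0x3c: case 0x3f: case 0x45: case 0x46:
        return true;
    default:
        return false;
    }
}

/* Hypothetical policy: AVX2 counts as "fast" only when it is present
 * and the CPU is not an Intel Haswell. */
static bool avx2_counts_as_fast(bool has_avx2, bool vendor_is_intel,
                                uint32_t eax)
{
    if (!has_avx2)
        return false;
    return !(vendor_is_intel && is_intel_haswell(eax));
}
```

With GCC or Clang, the EAX value could be obtained with __get_cpuid(1, ...)
from <cpuid.h>. A policy like this would keep the avx2 hscale path off on
Haswell without introducing a separate fast-gather flag.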