[FFmpeg-devel] [PATCH 1/2] libavutil/cpu: Adds fast gather detection.

Mon Jul 12 12:29:41 EEST 2021

On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alankelly at google.com> wrote:

> On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev at lynne.ee> wrote:
>
>> Jun 25, 2021, 09:54 by alankelly-at-google.com at ffmpeg.org:
>>
>> > Broadwell and later and Zen3 and later have fast gather instructions.
>> > ---
>> >  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
>> >  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>> >  libavutil/cpu.h     |  2 ++
>> >  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>> >  libavutil/x86/cpu.h |  1 +
>> >  3 files changed, 19 insertions(+), 2 deletions(-)
>> >
>>
>> No, we really don't need more FAST/SLOW flags, especially for
>> something like this which is just fixable by _not_using_vgather_.
>> Take a look at libavutil/x86/tx_float.asm: we only use vgather
>> if it's guaranteed either to be faster for what we're gathering or
>> to be just as fast even where it is "slow". If neither is true, we
>> use manual lookups, which is actually advantageous since for AVX2
>> we can interleave the lookups that happen in each lane.
>>
>> Even if we disregard this, I've extensively benchmarked vgather
>> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
>> a great vgather improvement to be found in Zen 3 to justify
>> using a new CPU flag for this.
>
> Thanks for your response. I'm not against finding a cleaner way of
> enabling/disabling the code which will be protected by this flag. However,
> the manual lookups solution proposed will not work in this case: the avx2
> version of hscale will only be faster if fast gathers are available;
> otherwise, the ssse3 version should be used.
>
> I haven't got access to a Zen 3 so I can't comment on its performance. I
> have tested on a Zen 2 and it is slow. On Broadwell hscale avx2 is about
> 10% faster than the ssse3 version and on Skylake about 40% faster. Haswell
> has similar performance to Zen 2.
>
> Is there a proxy which could be used for detecting Broadwell or Skylake
> and later? AVX512 seems too strict as there are Skylake chips without
> AVX512. Thanks
>

Hi,

I will paste the performance figures from the thread for the other part of
this patch here so that the justification for this flag is clearer:

                                Skylake   Haswell
hscale_8_to_15_width4_ssse3       761.2     760
hscale_8_to_15_width4_avx2        468.7     957
hscale_8_to_15_width8_ssse3      1170.7    1032
hscale_8_to_15_width8_avx2        865.7    1979
hscale_8_to_15_width12_ssse3     2172.2    2472
hscale_8_to_15_width12_avx2      1245.7    2901
hscale_8_to_15_width16_ssse3     2244.2    2400
hscale_8_to_15_width16_avx2      1647.2    3681
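
To make the comparison easier to read, the ratios implied by these figures can be computed with a small standalone helper. This is only a sketch: the struct and function names (hscale_row, avx2_speedup, print_speedups) are made up for illustration and are not part of the patch, and the numbers are copied from the figures above.

```c
#include <stdio.h>

/* One benchmark row: timings for the ssse3 and avx2 versions on each CPU.
 * Hypothetical helper type, not part of FFmpeg. */
struct hscale_row {
    const char *name;
    double skylake_ssse3, skylake_avx2;
    double haswell_ssse3, haswell_avx2;
};

/* Values copied verbatim from the figures in this mail. */
static const struct hscale_row rows[] = {
    { "width4",   761.2,  468.7,  760.0,  957.0 },
    { "width8",  1170.7,  865.7, 1032.0, 1979.0 },
    { "width12", 2172.2, 1245.7, 2472.0, 2901.0 },
    { "width16", 2244.2, 1647.2, 2400.0, 3681.0 },
};

enum { NUM_ROWS = sizeof(rows) / sizeof(rows[0]) };

/* Ratio > 1.0 means avx2 is faster than ssse3; < 1.0 means it is slower. */
double avx2_speedup(double ssse3, double avx2)
{
    return ssse3 / avx2;
}

/* Print one line per width with the avx2-over-ssse3 ratio on each CPU. */
void print_speedups(void)
{
    for (int i = 0; i < NUM_ROWS; i++)
        printf("%-8s Skylake %.2fx  Haswell %.2fx\n", rows[i].name,
               avx2_speedup(rows[i].skylake_ssse3, rows[i].skylake_avx2),
               avx2_speedup(rows[i].haswell_ssse3, rows[i].haswell_avx2));
}
```

Calling print_speedups() from main() prints one line per width. Every Skylake ratio comes out above 1.0 and every Haswell ratio below 1.0.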

As you can see, the avx2 version is catastrophic on Haswell and older chips
(on Haswell it is about 1.2x to 1.9x slower than ssse3), but the gains on
Skylake are impressive (about 1.35x to 1.75x faster).
As I don't have performance figures for Zen 3, I can disable this feature
on all CPUs apart from Broadwell and later, since you say that there is no
worthwhile improvement on Zen 3. Is this OK with you?
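
For clarity, the selection logic I have in mind is roughly the following. This is a minimal sketch, not the code from the patch: the flag bits (CPU_SSSE3, CPU_AVX2, CPU_FAST_GATHER) and the function name select_hscale are hypothetical and do not correspond to the real AV_CPU_FLAG_* values.

```c
/* Hypothetical CPU capability bits, for illustration only.
 * These are NOT FFmpeg's AV_CPU_FLAG_* values. */
enum {
    CPU_SSSE3       = 1 << 0,
    CPU_AVX2        = 1 << 1,
    CPU_FAST_GATHER = 1 << 2,
};

/* Which hscale implementation would be chosen. */
typedef enum { HSCALE_C, HSCALE_SSSE3, HSCALE_AVX2 } hscale_impl;

/* Pick the avx2 version only when gathers are known to be fast;
 * otherwise fall back to ssse3, then to the C version. */
hscale_impl select_hscale(int cpu_flags)
{
    if ((cpu_flags & CPU_AVX2) && (cpu_flags & CPU_FAST_GATHER))
        return HSCALE_AVX2;
    if (cpu_flags & CPU_SSSE3)
        return HSCALE_SSSE3;
    return HSCALE_C;
}
```

With this shape, a Haswell-like CPU (AVX2 without the fast-gather bit) keeps using the ssse3 version, and only CPUs flagged as having fast gathers get the avx2 version.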

Thanks