[FFmpeg-devel] [PATCH 1/2] libavutil/cpu: Adds fast gather detection.

Fri Jun 25 14:24:55 EEST 2021

On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev at lynne.ee> wrote:

> Jun 25, 2021, 09:54 by alankelly-at-google.com at ffmpeg.org:
>
> > Broadwell and later and Zen3 and later have fast gather instructions.
> > ---
> >  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
> >  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
> >  libavutil/cpu.h     |  2 ++
> >  libavutil/x86/cpu.c | 18 ++++++++++++++++--
> >  libavutil/x86/cpu.h |  1 +
> >  3 files changed, 19 insertions(+), 2 deletions(-)
> >
>
> No, we really don't need more FAST/SLOW flags, especially for
> something like this which is just fixable by _not_using_vgather_.
> Take a look at libavutil/x86/tx_float.asm, we only use vgather
> if it's guaranteed to either be faster for what we're gathering or
> is just as fast "slow". If neither is true, we use manual lookups,
> which is actually advantageous since for AVX2 we can interleave
> the lookups that happen in each lane.
>
> Even if we disregard this, I've extensively benchmarked vgather
> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
> a great vgather improvement to be found in Zen 3 to justify
> using a new CPU flag for this.

Thanks for your response. I'm not against finding a cleaner way of
enabling/disabling the code which will be protected by this flag. However,
the proposed manual-lookup solution will not work in this case: the AVX2
version of hscale is only faster when fast gathers are available; otherwise
the SSSE3 version should be used.
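
To make the intended selection concrete, here is a minimal sketch of the
dispatch. The flag bits and the names CPU_FAST_GATHER, HscaleImpl and
select_hscale are hypothetical, made up for this illustration; they are not
FFmpeg's AV_CPU_FLAG_* values or the identifiers used in the patch.

```c
/* Hypothetical flag bits for illustration only (not FFmpeg's values). */
enum {
    CPU_SSSE3       = 1 << 0,
    CPU_AVX2        = 1 << 1,
    CPU_FAST_GATHER = 1 << 2, /* assumed: set only where vgather is fast */
};

typedef enum { HSCALE_C, HSCALE_SSSE3, HSCALE_AVX2 } HscaleImpl;

/* Pick the hscale variant: AVX2 only pays off when gathers are fast,
 * otherwise fall back to SSSE3, and to plain C if neither is usable. */
static HscaleImpl select_hscale(unsigned flags)
{
    if ((flags & CPU_AVX2) && (flags & CPU_FAST_GATHER))
        return HSCALE_AVX2;
    if (flags & CPU_SSSE3)
        return HSCALE_SSSE3;
    return HSCALE_C;
}
```

With this ordering, a CPU that reports AVX2 but not fast gather (Haswell or
Zen 2 in my measurements) ends up on the SSSE3 path.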

I don't have access to a Zen 3, so I can't comment on its performance. I
have tested on a Zen 2 and it is slow there. On Broadwell, hscale AVX2 is
about 10% faster than the SSSE3 version, and on Skylake about 40% faster.
Haswell has similar performance to Zen 2.

Is there a proxy which could be used for detecting Broadwell or Skylake and
later? AVX512 seems too strict as there are Skylake chips without AVX512.
Thanks
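
One possible proxy, sketched below, is the family/model signature from CPUID
leaf 1 (EAX): since Haswell was the first Intel core with AVX2, any AVX2-capable
Intel family 6 part that is not one of the Haswell models would be Broadwell
or later. All names here (CpuSig, decode_sig, is_haswell, assume_fast_gather)
are hypothetical. The Haswell model numbers are as I recall them and should be
checked against Intel's documentation. The sketch does not account for
Atom-derived cores that report AVX2. The Zen 3 (family 19h) rule simply
follows the patch's premise, which is disputed earlier in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { unsigned family, model; } CpuSig;

/* Decode display family/model from CPUID leaf 1 EAX. */
static CpuSig decode_sig(uint32_t eax)
{
    CpuSig s;
    unsigned base_family = (eax >> 8) & 0xf;

    s.family = base_family;
    s.model  = (eax >> 4) & 0xf;
    /* The extended model field is used for base family 6 and 15. */
    if (base_family == 6 || base_family == 15)
        s.model |= ((eax >> 16) & 0xf) << 4;
    /* The extended family field is added only for base family 15. */
    if (base_family == 15)
        s.family += (eax >> 20) & 0xff;
    return s;
}

/* Assumed Haswell model list (family 6); verify before relying on it. */
static bool is_haswell(CpuSig s)
{
    return s.family == 6 &&
           (s.model == 0x3C || s.model == 0x3F ||
            s.model == 0x45 || s.model == 0x46);
}

/* Heuristic only: the caller supplies the vendor and AVX2 support. */
static bool assume_fast_gather(bool is_intel, bool is_amd,
                               bool has_avx2, uint32_t eax)
{
    CpuSig s = decode_sig(eax);

    if (!has_avx2)
        return false;
    if (is_intel)
        return s.family == 6 && !is_haswell(s);
    if (is_amd)
        return s.family >= 0x19; /* Zen 3 and later, per the patch */
    return false;
}
```

The drawback of a deny-list like this is that it encodes microarchitecture
knowledge in the detection code, which is the kind of thing a dedicated CPU
flag was meant to hide from callers.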