[FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.

Sun Jan 10 20:55:18 EET 2021

Jan 10, 2021, 17:43 by Reimar.Doeffinger at gmx.de:

> From: Reimar Döffinger <Reimar.Doeffinger at gmx.de>
>
> This requests loops to be vectorized using SIMD
> instructions.
> The performance increase is far from hand-optimized
> assembly but still significant over the plain C version.
> Typical values are a 2-4x speedup where a hand-written
> version would achieve 4x-10x.
> So it is far from a replacement; however, some architectures
> will get hand-written assembler quite late or not at all,
> and this is a good improvement for a trivial amount of work.
> The cause of the remaining gap, besides the compiler being a compiler, is
> usually that it does not manage to use saturating instructions
> and thus has to use 32-bit operations where actually
> saturating 16-bit operations would be sufficient.
> Other causes are for example the av_clip functions that
> are not ideal for vectorization (and even as scalar code
> not optimal for any modern CPU that has either CSEL or
> MAX/MIN instructions).
> And of course this only works for relatively simple
> loops; the IDCT functions, for example, did not seem possible
> to optimize that way.
> Also note that while clang may accept the code and sometimes
> produces warnings, it does not seem to do anything actually
> useful at all.
> Here are example measurements using gcc 10 under Linux (in a VM unfortunately)
> on AArch64 on Apple M1:
> Command:
> time ./ffplay_g LG\ 4K\ HDR\ Demo\ -\ New\ York.ts -t 10 -autoexit -threads 1 -noframedrop
>
> Original code:
> real    0m19.572s
> user    0m23.386s
> sys     0m0.213s
>
> Changing all put_hevc:
> real    0m15.648s
> user    0m19.503s (83.4% of original)
> sys     0m0.186s
>
> In addition changing add_residual:
> real    0m15.424s
> user    0m19.278s (82.4% of original)
> sys     0m0.133s
>
> In addition changing planar copy dither:
> real    0m15.040s
> user    0m18.874s (80.7% of original)
> sys     0m0.168s
>
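
For readers unfamiliar with the pragma, here is a minimal sketch of the kind of
loop under discussion. It is not taken from the patch: the names clip_u8 and
add_residual_sketch are hypothetical, and the loop is only written in the spirit
of an add_residual-style function. The clamp is spelled as two min/max-style
comparisons, and the sum is computed in int, which illustrates the widening the
quoted text refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not FFmpeg's av_clip): clamp to 0..255 using two
 * min/max-style comparisons. */
static inline uint8_t clip_u8(int v)
{
    v = v < 0 ? 0 : v;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical loop: add a 16-bit residual to 8-bit pixels.
 * dst[i] + res[i] is promoted to int before clamping, so without
 * saturating instructions the compiler has to work on wider lanes. */
static void add_residual_sketch(uint8_t *dst, const int16_t *res, size_t n)
{
    /* Asks the compiler to vectorize the loop; ignored if OpenMP SIMD
     * support is not enabled at compile time. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = clip_u8(dst[i] + res[i]);
}
```

With GCC the pragma takes effect when building with -fopenmp-simd, which does
not pull in the OpenMP runtime. Without that flag the pragma is ignored and the
function remains plain, valid C.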

I think I have to disagree.
The performance gains are marginal, it's definitely something the compiler should
be able to decide on its own, and it makes performance highly compiler-dependent.
And I'm not even resorting to the painfully obvious FUD arguments that could be made.

Most of the loops this is added to are trivially SIMDable. Just because no one has
had the motivation to do SIMD for a pretty unpopular codec doesn't mean we should
compromise.