[FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.

Reimar Döffinger Reimar.Doeffinger at gmx.de
Tue Jan 12 23:32:15 EET 2021



> On 12 Jan 2021, at 21:46, Lynne <dev at lynne.ee> wrote:
> 
> Jan 12, 2021, 19:28 by Reimar.Doeffinger at gmx.de:
> 
>> It’s almost 20%. At least for this combination of
>> codec and stream a large amount of time is spent in
>> non-DSP functions, so even hand-written assembler
>> won’t give you huge gains.
>> 
> It's non-guaranteed 20% on a single system. It could change, and it could very
> well mess up like gcc does with autovectorization, which we still explicitly disable
> because FATE fails (-fno-tree-vectorize, and I was the one who sent an RFC to
> try to undo it somewhat recently. Even though it was an RFC the reaction from devs
> was quite cold).

Oh, thanks for the reminder, I thought that was gone because it seems
it’s not used for clang, and MPlayer does not seem to set it.
I need to compare, but the problem with the auto-vectorization
is exactly that the compiler will try to apply it to everything,
which has at least 2 issues:
1) it gigantically increases the risk of bugs when it is applied to
every single loop instead of only the loops we already wrote
assembler for somewhere.
2) it will quite often make things worse, by vectorizing loops
that are rarely iterated more than a few times (and it needs to
emit a whole lot of extra code to handle loop counts that are not
a multiple of the vector size), because all too often the compiler
can only take a wild guess whether “width” is usually 1 or 1920,
while we DO know.
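
To make this concrete, here is a made-up sketch (not taken from the
actual patch) of the kind of loop I mean: we know “width” is
typically 1920 here, so hinting this one loop is a very different
thing from letting -ftree-vectorize guess for every loop in the tree.
With GCC/clang the pragma only takes effect when building with
-fopenmp or -fopenmp-simd and is otherwise ignored:

#include <stdint.h>

/* Hypothetical example, not from the patch: we know the trip count
 * is large here, so vectorization is worth the extra code. */
static void example_add_bytes(uint8_t *dst, const uint8_t *src, int width)
{
    int i;
#pragma omp simd
    for (i = 0; i < width; i++)
        dst[i] += src[i];
}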

>>> it's definitely something the compiler should
>>> be able to decide on its own,
>>> 
>> 
>> So you object to unlikely() macros as well?
>> It’s really just giving the compiler a hint it should try, though I admit the configure part makes it
>> look otherwise.
>> 
> I'm more against the macro and changes to the code itself. If you can make it
> work without adding a macro to individual loops or the likes of av_cold/av_hot or
> any other changes to the code, I'll be more welcoming.

I expect that will just run into the same issues as the tree-vectorization...
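
Just for context on the unlikely() part: such a hint is nothing more
than a thin wrapper around __builtin_expect, roughly like the sketch
below (not something we currently define as far as I know, this is
only to show what kind of hint I mean):

/* Usual definition of the hint, as e.g. the Linux kernel has it. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static int example_check(int ret)
{
    if (unlikely(ret < 0))  /* tells the compiler this path is cold */
        return ret;
    return 0;
}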

> I really _hate_ compiler hints. Take a look at the upipe source code to see what
> a cthulian monstrosity made of hint flags looks like. Every single branch had
> a cold/hot macro and it was the project's coding style. It's completely irredeemable.

I guess my suggested solution would be to require proof of a
clearly measurable performance benefit.
But I see the point that if it gets “randomly” added to loops,
that might turn into quite a mess.

>>> Most of the loops this is added to are trivially SIMDable.
>>> 
>> 
>> How many hours of effort do you consider “trivial”?
>> Especially if it’s someone not experienced?
>> It might be fairly trivial with intrinsics, however
>> many of your counter-arguments also apply
>> to intrinsics (and to a degree inline assembly).
>> That’s btw not just a rhetorical question because
>> I’m pretty sure I am not going to all the trouble
>> to port more of the arm 32-bit assembler functions
>> since it’s a huge PITA, and I was wondering if there
>> was a point to even have a try with intrinsics...
>> 
> Intrinsics and inline assembly are a whole different thing than magic
> macros that tell and force the compiler what a well written compiler
> should already very well know about.

There are no well written compilers, in a way ;)
I would also argue that most of what intrinsics do is something
such a compiler should figure out on its own, too.
And the first time I tried intrinsics they slowed the loop down
by a factor of 2 because the compiler stored and loaded the value
to the stack between every intrinsic, so they are not exactly
without problems either.
But I was actually thinking that it might be somewhat interesting
to have a kind of “generic SIMD intrinsics”.
Though I think I read that such a thing has already been tried,
so it might just be wasted time.
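
What I had in mind is roughly along the lines of the GCC/clang
vector extensions, which already are a kind of “generic SIMD”.
A rough sketch (made-up function name, and assuming width is a
multiple of 16 to keep it short):

#include <stdint.h>
#include <string.h>

/* 16 unsigned bytes; the compiler maps this onto SSE/NEON/... where
 * available and falls back to scalar code otherwise. */
typedef uint8_t v16u8 __attribute__((vector_size(16)));

static void example_add_bytes_vec(uint8_t *dst, const uint8_t *src,
                                  int width)
{
    int i;
    for (i = 0; i < width; i += 16) {   /* width % 16 == 0 assumed */
        v16u8 a, b;
        memcpy(&a, &dst[i], sizeof(a));
        memcpy(&b, &src[i], sizeof(b));
        a += b;                         /* element-wise byte add */
        memcpy(&dst[i], &a, sizeof(a));
    }
}

Of course that is a GCC/clang extension and does not cover MSVC,
so it would not be a complete answer either.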

> I already said all that can be said here: this will halt efforts on actually
> optimizing the code in exchange for naive trust in compilers.

I’m not sure it will discourage it more than having to write
the optimizations over and over: for Armv7 NEON, for Armv8 Linux,
for Armv8 Windows, then SVE/SVE2, and who knows, maybe Armv9
will also need a rewrite.
On x86 it’s SSE2, AVX2, AVX-512, and still so much stuff never
gets ported to the newer versions.
I’d also claim that anyone naively trusting compilers is unlikely
to write SIMD optimizations either way :)

> New platforms will be stuck at scalar performance anyway until
> the compilers for the arch are smart enough to deal with vectorization.

That seems to happen long before someone gets around to
optimising FFmpeg, though.
This is particularly true when the new platform is a new OS
rather than a new CPU architecture.
For example, on macOS we are lucky enough that the assembler etc.
are largely compatible with Linux.
But for Windows-on-Arm there is no GNU assembler, and the Microsoft
assembler needs a completely different syntax, so even the assembly
we DO have just doesn’t work.

Anyway, thanks for the discussion.
I still think the situation with SIMD optimizations should be improved
SOMEHOW, but I have nothing but wild ideas on the HOW.
If anyone feels the same, I’d welcome further discussion.

Thanks,
Reimar

