[FFmpeg-devel] [PATCH] lavu/tx: WIP add x86 assembly

Fri Feb 26 06:59:20 EET 2021

This commit adds SSE3 and AVX assembly optimizations for the 4-point and
8-point transforms only.
The code to recombine them into higher-level transforms is non-functional
currently, so it's not included here. This is just to get some feedback
on possible optimizations.

The 4-point assembly is based on this structure:
https://gist.github.com/cyanreg/665b9c79cbe51df9296a969257f2a16c
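
For readers without the gist at hand, a 4-point forward transform can be
sketched in scalar C as below. This is the generic textbook butterfly
layout, not necessarily the structure used in the gist or in the patch.
The names cplx and fft4_ref are made up for illustration and are not
lavu/tx identifiers.

```c
/* Hypothetical complex type for this sketch (not a lavu/tx type). */
typedef struct { float re, im; } cplx;

/*
 * Reference 4-point forward DFT, X[k] = sum_n x[n] * exp(-2*pi*i*k*n/4).
 * Assumption: textbook butterfly layout, written for clarity, not speed.
 */
static void fft4_ref(const cplx in[4], cplx out[4])
{
    /* First stage: sums and differences of the even and odd pairs. */
    cplx s02 = { in[0].re + in[2].re, in[0].im + in[2].im };
    cplx d02 = { in[0].re - in[2].re, in[0].im - in[2].im };
    cplx s13 = { in[1].re + in[3].re, in[1].im + in[3].im };
    cplx d13 = { in[1].re - in[3].re, in[1].im - in[3].im };

    /* Second stage: multiplying d13 by -i or +i is a swap plus a sign flip. */
    out[0].re = s02.re + s13.re;  out[0].im = s02.im + s13.im;
    out[1].re = d02.re + d13.im;  out[1].im = d02.im - d13.re;
    out[2].re = s02.re - s13.re;  out[2].im = s02.im - s13.im;
    out[3].re = d02.re - d13.im;  out[3].im = d02.im + d13.re;
}
```

The only "multiplications" are by +i and -i, which reduce to swapping the
real and imaginary parts and negating one of them. That is why sign
handling and shuffles dominate the instruction count at this size.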

The 8-point assembly is based on this structure:
https://gist.github.com/cyanreg/bbf25c8a8dfb910ed3b9ae7663983ca6

Both are implemented as macros because they're pasted a few times in
the recombination code.

All code here is faster than both our own current assembly (by around 40%)
and FFTW3 (by around 10% to 40%).

The 8-point core assembly is barely 20 instructions! That's 1 fewer
than our current code, and it saves a lot of shuffles!
It's 40% faster than FFTW3!

The 4-point core assembly is 10 instructions, which is 1 more than
our current code. However, it doesn't need to load a sign mask from
external memory; it trades that load for a shufps, which is faster.
It also uses an additional temporary storage register to reduce
latency.
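
The sign mask versus shufps trade-off can be illustrated with a small
scalar C model of a 4-float register. This is only a sketch of the
general idea and not the instruction sequence from the patch: vec4,
xor4, shuf4, addsub_mask and addsub_shuf are hypothetical names, and the
lane pattern (add in lanes 0-1, subtract in lanes 2-3) is chosen
arbitrarily. shuf4 follows the documented shufps selection rule: the two
low result lanes come from the first operand, the two high lanes from
the second.

```c
#include <stdint.h>
#include <string.h>

/* Four packed floats, standing in for one SSE register (hypothetical model). */
typedef struct { float v[4]; } vec4;

static vec4 add4(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

static vec4 sub4(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] - b.v[i];
    return r;
}

/* Bitwise XOR of each lane, modelling xorps. */
static vec4 xor4(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) {
        uint32_t x, y;
        memcpy(&x, &a.v[i], sizeof(x));
        memcpy(&y, &b.v[i], sizeof(y));
        x ^= y;
        memcpy(&r.v[i], &x, sizeof(x));
    }
    return r;
}

/* Models shufps: lanes 0-1 are picked from a, lanes 2-3 from b. */
static vec4 shuf4(vec4 a, vec4 b, unsigned imm)
{
    vec4 r;
    r.v[0] = a.v[imm & 3];
    r.v[1] = a.v[(imm >> 2) & 3];
    r.v[2] = b.v[(imm >> 4) & 3];
    r.v[3] = b.v[(imm >> 6) & 3];
    return r;
}

/* Variant A: flip sign bits with a constant that assembly would load from memory. */
static vec4 addsub_mask(vec4 a, vec4 b)
{
    static const vec4 mask = { { 0.0f, 0.0f, -0.0f, -0.0f } };
    return add4(a, xor4(b, mask));
}

/* Variant B: no constant; compute sum and difference, then select lanes. */
static vec4 addsub_shuf(vec4 a, vec4 b)
{
    vec4 s = add4(a, b);
    vec4 d = sub4(a, b);
    return shuf4(s, d, 0xE4); /* 0xE4 keeps lane order: s0, s1, d2, d3 */
}
```

Both variants produce the same result. In this model, variant B needs no
constant from memory but keeps both the sum and the difference live, so
it occupies one more register than variant A.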

I'll collect the suggestions and implement them when I'm ready
to post the full power-of-two assembly.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-lavu-tx-WIP-add-x86-assembly.patch
Type: text/x-patch
Size: 13152 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20210226/053f32e7/attachment.bin>