[FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations

Mon Mar 21 19:37:18 EET 2022

Hi Martin,

Thanks very much for taking a look.

On 19/03/2022 23:06, Martin Storsjö wrote:
> As you are writing assembly for these functions, I would very much 
> appreciate if you could add checkasm tests for all the functions you're 
> implementing. I see that there exists a test for the blockdsp functions, 
> but all the other ones are missing a test.

I think I'd have a bit of a learning curve ahead of me there! I did 
write my own fuzz testers to check the validity of my assembly 
implementations, and I could share them (they'd probably need a bit of 
tidying up since I wasn't intending them for public consumption) but 
they were written in ignorance of the checkasm framework, so probably 
wouldn't slot in neatly.

Is there any writeup of checkasm anywhere, discussing how it's used, 
what sorts of things it tests, any speed/memory limits that tests should 
try to adhere to - that sort of thing?

> The other main issue I'd like to request is to indent the assembly 
> similarly to the rest of the existing assembly. For the 32 bit assembly, 
> your patches do match the surrounding code, but for the 64 bit assembly, 
> your patches align the operands column differently than the rest.

Since I was creating new source files for the 64-bit stuff, I assumed I 
had a bit of leeway in indentation style - but I can easily change it.

For what it's worth, the opcodes in AArch64 are significantly shorter 
than in AArch32, since the vector element size qualifiers go on the 
operands instead of the opcodes, so there's less need for extra indentation.

> Finally, the 32 bit assembly fails to build for me both with (recent) 
> clang and old binutils, with errors like these:
> 
> src/libavcodec/arm/vc1dsp_neon.S: Assembler messages:
> src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- 
> `vmov r0,d4[1]'

Thanks - the Armv8-A ARM says (section F6.1.139) that the data type can 
be omitted here, and in that case it is equivalent to '32', so that's a 
bug in clang. But easy to work around.

> Oh, sidenote - I do see that the last patch in the set uses much more
> inconsistent indentation, with varying indentation between lines. Is
> this intentional to signify some structure in the code, or just
> accidental?

That was deliberate! The inner loop there is unrolled x2, and then 
adjacent iterations are overlapped 180 degrees out of phase. This is 
because each iteration starts off busy, with lots of instructions to 
execute, keeping pipelines full, and towards the end, it thins out, 
meaning we can benefit by using what would otherwise be stalls to 
speculatively start to process the next iteration before we've completed 
the current one.

Effectively, if you only read a series of instructions with matching 
indentation, you get one logical iteration of the loop - for example, in 
the AArch32 version, you can follow through the process from loading the 
source buffer into q10 (line 1849) until we store from it to the 
destination buffer, having determined that it doesn't contain the start 
of any escape sequences (line 1890).

It's a trick I've seen used a few times elsewhere, which is why I didn't 
bother explaining it in a comment. I could add one, or if you still 
don't like it once you've understood what it means, I'd be happy to take 
it out.

Ben