[FFmpeg-devel] [PATCH 0/6] avcodec/vc1: Arm optimisations
Martin Storsjö
martin at martin.st
Tue Mar 22 00:29:13 EET 2022
On Mon, 21 Mar 2022, Ben Avison wrote:
>
> On 19/03/2022 23:06, Martin Storsjö wrote:
>> As you are writing assembly for these functions, I would very much
>> appreciate if you could add checkasm tests for all the functions you're
>> implementing. I see that there exists a test for the blockdsp functions,
>> but all the other ones are missing a test.
>
> I think I'd have a bit of a learning curve ahead of me there! I did write my
> own fuzz testers to check the validity of my assembly implementations, and I
> could share them (they'd probably need a bit of tidying up since I wasn't
> intending them for public consumption) but they were written in ignorance of
> the checkasm framework, so probably wouldn't slot in neatly.
>
> Is there any writeup of checkasm anywhere, discussing how it's used, what
> sorts of things it tests, any speed/memory limits that tests should try to
> adhere to - that sort of thing?
I'm not aware of any guide in itself, but I can try to do a short writeup
here.
Checkasm is essentially a unit test framework for assembly functions. A
test runs two versions of the same function, one reference (e.g. C
implementation) and one test version (e.g. NEON implementation) against
randomish input data, and compares the output to make sure that it matches
(for the relevant part of the output buffer).
For each test, the main point lies in knowing what the expected/valid
ranges of inputs are, so that you pick random inputs that are valid - and
test specifically the e.g. buffer sizes that are used in practice by the
decoder.
If a function usually has e.g. different codepaths internally, depending
on some input values, you can make the test exercise all the different
codepaths. Or if the function has potential overflows in some case, you
can make part of the input buffer always contain the worst-case scenario,
so that each run always tests for the overflows, even if the rest of the
buffers are random data.
Another aspect of tests is what part of the output to check. E.g.
functions in video codecs often write a rectangular block in a buffer. At
the very least, a test can check that the contents within the expected
region of the output buffer matches. But tests can also (optionally) take
it one step further, and check that e.g. the function didn't accidentally
write outside of the edges of the indended rectangle in the output buffer.
Or in some cases it's expected that a function may overwrite e.g. up to 16
bytes past the end of the payload in each row - then you intentionally
wouldn't check that area.
Additionally, after comparing the tested version with the reference,
checkasm can also optionally benchmark your functions. This runs the
benchmark specifically of only the assembly function, nothing else, by
running the function with the same input parameters e.g. 1000 times. It
also measures the C version of the function, so that you can compare
against that and see the speedup of your assembly work, in isolation.
(On linux on ARM, it by default uses the perf timers. If you have enabled
user mode access to the cycle counter registers, which I highly recommend,
and configure with --disable-linux-perf, you get much more precise timing
- to the point that you can measure the impact of different instruction
scheduling setups on in-order cores, like the Cortex A53.)
Additionally, for the testing (but not for benchmarking), the functions
get wrapped with extra setup to try to find lingering nonfunctional issues
that don't show up when you just run the code.
E.g. one common isssue is with functions that take a 32 bit value as
argument. On a 64 bit architecture with arguments in registers, the upper
32 bits of such a register are undefined - while in practice they're often
zero. This can lead to issues later down the line, when e.g. a different
or updated compiler suddenly happens to pass nonzero bits in the undefined
part. The test wrapping tries to arrange so that these bits end up as
nonzero, to catch such hidden bugs.
Additionally, it checks to make sure you've restored all callee saved
registers. In most cases, if you happen to forget to restore e.g. a callee
saved SIMD register, the effects of it normally don't show up soon (or
at all), but may only show up much later depending on what the compiler
did in a calling function. But all functions covered in checkasm get this
checked for free.
After building checkasm, if you run e.g. ./tests/checkasm/checkasm, it
runs all the tests for all functions, for all SIMD instruction sets
available. (On ARM there's usually only NEON, but e.g. on X86 it first
enables only MMX, tests all functions available there, then increasing
levels with SSE2, SSSE3, etc, to test all potential implementations.) If
you run e.g. checkasm --test=blockdsp or --test=h264dsp, it will only run
the tests for that subsystem. If you further add --bench=h264_idct, it
will benchmark all functions with a name starting with h264_idct.
One of the simplest tests to have a look at, to understand the structure,
is tests/checkasm/blockdsp.c. First, the cpu feature mask is set so that
av_get_cpu_flags() returns 0. Then the test main function,
checkasm_check_blockdsp, is called, which initializes the DSP context
(which then only gets assigned the reference C implementation of the
function). In this case the test does nothing, as it compares the
C implementation with itself. The next time around, av_get_cpu_flags
returns NEON, and checkasm_check_blockdsp gets called again, where the DSP
context now gets the NEON function assigned.
In blockdsp.c, the first call to check_func(h.func,
"blockdsp.clear_block") stored a copy of the previous function pointer,
the C reference function, in a map. On the second call to it, it digs up
the previous version and keeps the new current version. These function
pointers are used via the macros call_ref() and call_new(), with the same
parameters as if you'd call the function directly. After running
both, you inspect the output of them to see if they match, and if not
you fail the test. Finally, the bench_new() macro checks if you've asked
to benchmark this particular function. If this function is one of the
functions to benchmark, it runs it N times with the provided parameters.
For your patches, the existing tests in h264dsp, vp8dsp, vp9dsp probably
are good examples of such tests.
A small gotcha when/if you're adding a new checkasm test in a new file,
under a new name. If you just run checkasm without parameters, it runs all
the tests by default (as long as the test is hooked up in the main tests[]
array in checkasm.c). But when running fate, it runs checkasm individually
with one module at a time. So if adding a new test module in checkasm, be
sure to add it to the test listing in tests/fate/checkasm.mak too.
The inverse transforms are tricky to test, because you probably can't feed
them any random input data. The existing h264dsp, vp8dsp and vp9dsp
inverse transform tests take random pixels and do a naive forward
transform of them, so that you only get transform coefficients within the
possible range.
For deblocking filters, those tests start with random-ish input data, but
try to arrange coefficients in a way so that each block contains all
possible combinations of data (above or below the threshold values). Or a
test can run the functions multiple times, with input data arranged to
trigger each special case.
For the unescape function, I'm not sure if we have any good examples of
existing testcases that work on a similar function though. Try to come up
with all cases of interesting input to the function (short buffers, long
buffers, mod-4/non-mod-4 length, nothing to unescape, lots of things to
unescape close together).
>> The other main issue I'd like to request is to indent the assembly
>> similarly to the rest of the existing assembly. For the 32 bit assembly,
>> your patches do match the surrounding code, but for the 64 bit assembly,
>> your patches align the operands column differently than the rest.
>
> Since I was creating new source files for the 64-bit stuff, I assumed I had a
> bit of leeway in indentation style - but I can easily change it.
Ok, thanks, that'd be appreciated. Yeah I try to maintain consistency
across files here.
> For what it's worth, the opcodes in AArch64 are significantly shorter than in
> AArch32, since the vector element size qualifiers go on the operands instead
> of the opcodes, so there's less need for extra indentation.
Yup, that's true. But for functions where instruction-like macros are
used, the macro names often are a bit longer than regular instructions, so
there the extra space is appreciated. And consistency is still nice when
both 32 and 64 bit arm use the same indentation style; in many cases, code
is ported between the two by just copying and slightly adjusting/rewriting
e.g. register names and tweaking instruction names.
>> Finally, the 32 bit assembly fails to build for me both with (recent) clang
>> and old binutils, with errors like these:
>>
>> src/libavcodec/arm/vc1dsp_neon.S: Assembler messages:
>> src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- `vmov
>> r0,d4[1]'
>
> Thanks - the Armv8-A ARM says (section F6.1.139) that the data type can be
> omitted here, and in that case it is equivalent to '32', so that's a bug in
> clang. But easy to work around.
Ok, good. Yeah, bugs or not, we try to stick with the subset of assembly
that builds on all toolchains that regularly are used to building.
>> Oh, sidenote - I do see that the last patch in the set uses much more
>> inconsistent indentation, with varying indentation between lines. Is
>> this intentional to signify some structure in the code, or just
>> accidental?
>
> That was deliberate! The inner loop there is unrolled x2, and then adjacent
> iterations are overlapped 180 degrees out of phase. This is because each
> iteration starts off busy, with lots of instructions to execute, keeping
> pipelines full, and towards the end, it thins out, meaning we can benefit by
> using what would otherwise be stalls to speculatively start to process the
> next iteration before we've completed the current one.
>
> Effectively, if you only read a series of instructions with matching
> indentation, you get one logical iteration of the loop - for example, in the
> AArch32 version, you can follow through the process from loading the source
> buffer into q10 (line 1849) until we store from it to the destination buffer,
> having determined that it doesn't contain the start of any escape sequences
> (line 1890).
>
> It's a trick I've seen used a few times elsewhere, which is why I didn't
> bother explaining it in a comment. I could add one, or if you still don't
> like it once you've understood what it means, I'd be happy to take it out.
Right, I see. (I didn't try to read the code and follow it yet, I just
browsed your patches and testbuilt them.) I think it can be valuable to
keep this nonstandard indentation as a readability/maintainability aid
then. (But do shift the operand column 8 chars to the right for the 64 bit
version.)
// Martin
More information about the ffmpeg-devel
mailing list