[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Sat May 24 17:43:04 CEST 2008

On Friday 23 May 2008, Dmitry Antipov wrote:
> Siarhei Siamashka wrote:
> > Second and more important. Your test buffer is a byte array and it is not
> > guaranteed to be 8-byte aligned. WLDRD instruction requires strict 8-byte
> > alignment at least for WMMX1 cores (if documentation is not completely
> > wrong). You can check '/proc/cpu/alignment' to configure behaviour of your
> > CPU when it performs unaligned memory reads/writes. Theoretically, it
> > could be that CPU just ignores unaligned WLDRD reads and your benchmark
> > does not make much sense. But it also could be that PXA3xx cores can
> > support unaligned memory access, don't know.
>
> As far as I know, all XScale cores strictly follow the rule 'N-byte access should
> be aligned on N-bytes boundary' (where N=1,2,4,8). An attempt to run
> something like:
>
> unsigned char __attribute__((aligned (8))) data[64];
> asm volatile ("wldrh wr0, [%0]\n\t"
>                "wldrh wr1, [%0, #1]\n\t"
>
>                : : "r"(data));
>
> as well as:
>
> asm volatile ("wldrd wr0, [%0]\n\t"
>                "wldrd wr1, [%0, #4]\n\t"
>
>                : : "r"(data));
>
> causes SIGBUS and some noise from the kernel:
>
> Alignment trap: align (723) PC=0x000083e8 Instr=0xeddd1101
> Address=0xbefffd94 FSR 0x013

... unless you have run "echo 0 > /proc/cpu/alignment", which *might*
configure the CPU to ignore alignment faults (with undefined behaviour).
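
To rule out the alignment question in a benchmark, the test buffer address can
be checked before it is handed to WLDRD. This is only a minimal sketch in
plain C: the names is_aligned and test_buf are made up for illustration, and
the attribute syntax is the GCC one already used in the quoted test above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is aligned to n bytes.
 * n is assumed to be a power of two (1, 2, 4, 8, ...). */
int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p & (n - 1)) == 0;
}

/* Hypothetical test buffer, forced to 8-byte alignment so that a
 * doubleword load from its start (or from any multiple of 8) is aligned. */
unsigned char __attribute__((aligned (8))) test_buf[64];
```

A benchmark could then bail out early if is_aligned(ptr, 8) is false for any
address it is going to load a doubleword from, instead of depending on how
the kernel is configured to handle the fault.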

Anyway, now you can benchmark the code and identify stalls, so you can
optimize all of these functions. If you encounter any pipeline stalls in your
implementation, don't assume they are unavoidable; better ask here, there
might be a solution. As for the level of loop unrolling, unroll only as far
as it helps to hide instruction result latencies and avoid stalls. Anything
more than that is not always a clear win, and you need to benchmark the code
to see whether loop unrolling improves performance for the whole program on
real tasks.
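
To illustrate what I mean by unrolling just enough to hide latencies, here is
a rough sketch in plain C, not IWMMXT code. The function names and the sum of
absolute differences example are only assumptions for illustration. The
unrolled variant keeps four independent accumulators, so one addition does
not have to wait for the result of the previous one. Whether this helps on a
given core is something only a benchmark can tell.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference version: one element per iteration, a single dependency chain. */
unsigned sad_rolled(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

/* Unrolled by 4 with independent accumulators, so consecutive additions
 * do not depend on each other. Leftover elements go through a tail loop. */
unsigned sad_unrolled4(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += (unsigned)abs(a[i]     - b[i]);
        s1 += (unsigned)abs(a[i + 1] - b[i + 1]);
        s2 += (unsigned)abs(a[i + 2] - b[i + 2]);
        s3 += (unsigned)abs(a[i + 3] - b[i + 3]);
    }
    for (; i < n; i++)
        s0 += (unsigned)abs(a[i] - b[i]);

    return s0 + s1 + s2 + s3;
}
```

Both variants must return the same result. Timing them on the target, with
the buffer sizes the real encoder uses, shows whether the extra code size
pays off.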

It would be very interesting to benchmark these functions as part of a real
video encoding process to see how they are affected by cache misses in typical
use cases.

How much memory do you have on your device? Is it possible to natively run
a standard FFmpeg regression test on it?

I have an N810 with an ARM11 core and 128MB of RAM and run regression tests
from time to time in a Debian armel rootfs. It takes quite a lot of time though.

In any case, I can only advise you on how to make the code fast. But in order
to get your code accepted in FFmpeg, you need approval from Michael
Niedermayer :)

-- 
Best regards,
Siarhei Siamashka