[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Sat May 17 20:36:31 CEST 2008

On Saturday 17 May 2008, Dmitry Antipov wrote:
[...]
> This version of pix_sum (1):
>
>      asm volatile("wzero wr0                 \n\t"
>                   "mov r1, #16               \n\t"
>                   "1: wldrd wr2, [%1, #8]    \n\t"
>                   "subs r1, r1, #1           \n\t" /* subs here */
>                   "wldrd wr1, [%1], %2       \n\t"
>                   "waccb wr2, wr2            \n\t"
>                   "waddw wr0, wr0, wr2       \n\t"
>                   "waccb wr1, wr1            \n\t"
>                   "waddw wr0, wr0, wr1       \n\t"
>                   "bne 1b                    \n\t"
>                   "textrmsw %0, wr0, #0      \n\t"
>
>                   : "=r"(s), "+r"(pix)
>                   : "r"(line_size)
>                   : "r1", "cc", "memory"); /* subs sets flags; asm reads *pix */
>
> is 6-7% faster than this (2):
>
>      asm volatile("wzero wr0                 \n\t"
>                   "mov r1, #16               \n\t"
>                   "1: wldrd wr2, [%1, #8]    \n\t"
>                   "wldrd wr1, [%1], %2       \n\t"
>                   "waccb wr2, wr2            \n\t"
>                   "waddw wr0, wr0, wr2       \n\t"
>                   "waccb wr1, wr1            \n\t"
>                   "waddw wr0, wr0, wr1       \n\t"
>                   "subs r1, r1, #1           \n\t" /* subs here */
>                   "bne 1b                    \n\t"
>                   "textrmsw %0, wr0, #0      \n\t"
>
>                   : "=r"(s), "+r"(pix)
>                   : "r"(line_size)
>                   : "r1", "cc", "memory"); /* subs sets flags; asm reads *pix */

Sure, it saves one cycle by executing the "subs" instruction in a slot where
the CPU would stall anyway (waiting for the result of the first load).
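
To make the effect concrete, here is a minimal sketch of a toy single-issue,
in-order pipeline model in C. Everything in it is an assumption made for
illustration: the latencies are hypothetical numbers and are NOT taken from
the XScale manual, the names (Insn, count_cycles, body_v1, body_v2) are made
up, and the model ignores base register writeback, branch cost and everything
else a real core does. It only shows how an independent instruction placed
after a load can fill a cycle that would otherwise be a stall.

```c
#include <assert.h>

/* Register ids used by the toy model (hypothetical, for illustration). */
enum { WR0, WR1, WR2, R1, FLAGS, PIX, NREGS, NONE = -1 };

typedef struct {
    const char *text;   /* mnemonic, for readability only */
    int dst;            /* register written, or NONE */
    int src1, src2;     /* registers read, or NONE */
    int latency;        /* cycles until dst can be used by a later insn */
} Insn;

/* HYPOTHETICAL latencies, chosen only to make the example readable. */
#define LD(d)  { "wldrd", d,     PIX,   NONE, 3 }
#define ACC(d) { "waccb", d,     d,     NONE, 2 }
#define ADD(s) { "waddw", WR0,   WR0,   s,    1 }
#define SUBS   { "subs",  FLAGS, R1,    NONE, 1 }
#define BNE    { "bne",   NONE,  FLAGS, NONE, 1 }

/* Loop body of version (1): "subs" sits between the two loads. */
const Insn body_v1[] = {
    LD(WR2), SUBS, LD(WR1), ACC(WR2), ADD(WR2), ACC(WR1), ADD(WR1), BNE
};

/* Loop body of version (2): "subs" sits right before the branch. */
const Insn body_v2[] = {
    LD(WR2), LD(WR1), ACC(WR2), ADD(WR2), ACC(WR1), ADD(WR1), SUBS, BNE
};

/* Cycles needed to issue all n instructions of one loop iteration.
 * An instruction issues one cycle after the previous one at the earliest,
 * and not before all of its source registers are ready. */
int count_cycles(const Insn *code, int n)
{
    int ready[NREGS] = {0};
    int issue = -1;

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        int t = issue + 1;
        if (code[i].src1 != NONE && ready[code[i].src1] > t)
            t = ready[code[i].src1];
        if (code[i].src2 != NONE && ready[code[i].src2] > t)
            t = ready[code[i].src2];
        issue = t;
        if (code[i].dst != NONE)
            ready[code[i].dst] = issue + code[i].latency;
    }
    return issue + 1;
}
```

With these made-up latencies, count_cycles(body_v1, 8) gives 10 and
count_cycles(body_v2, 8) gives 11: the same 8 instructions, but one stall
cycle less when "subs" is moved up. The remaining stalls in both orderings
come from the waccb/waddw dependency chains.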

> An interesting subject is the completely unrolled version (3):
>
>      #define SUM \
>       "wldrd wr2, [%1, #8] \n\t" \
>       "wldrd wr1, [%1], %2 \n\t" \
>       "waccb wr2, wr2      \n\t" \
>       "waddw wr0, wr0, wr2 \n\t" \
>       "waccb wr1, wr1      \n\t" \
>       "waddw wr0, wr0, wr1 \n\t"
>
>      asm volatile("wzero wr0             \n\t"
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   SUM
>                   "textrmsw %0, wr0, #0  \n\t"
>
>                   : "=r"(s), "+r"(pix)
>                   : "r"(line_size));
>
> There is just no instruction to insert between loads here. But, anyway,
> this function is 8-9% faster than even (1).

That's how loop unrolling works. You reduce loop overhead by removing the
"subs" instruction and the conditional jump (which takes 1 cycle if correctly
predicted on XScale). Because (1) only "removed" the "subs" instruction and
still has the conditional jump, it is slower. An interesting (but unrelated)
thing is that ARM11 supports "branch folding", which reduces the overhead of
a correctly predicted conditional jump to 0 cycles in some cases.
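
For readers not used to the assembly, the same transformation can be sketched
in portable C. This is only an illustration and not the iWMMXt code: the
function names (row_sum, pix_sum_looped, pix_sum_unrolled4) are made up, and
a compiler is free to unroll or not unroll either version on its own. The
point is just that the second loop updates its counter and branches 4 times
instead of 16 while doing the same work.

```c
#include <stdint.h>

/* Sum of the 16 pixels of one row. */
static inline int row_sum(const uint8_t *p)
{
    int s = 0;
    for (int j = 0; j < 16; j++)
        s += p[j];
    return s;
}

/* One row per iteration: 16 counter updates and 16 conditional branches. */
int pix_sum_looped(const uint8_t *pix, int line_size)
{
    int s = 0;
    for (int i = 0; i < 16; i++) {
        s += row_sum(pix);
        pix += line_size;
    }
    return s;
}

/* Four rows per iteration: the loop overhead is paid only 4 times. */
int pix_sum_unrolled4(const uint8_t *pix, int line_size)
{
    int s = 0;
    for (int i = 0; i < 16; i += 4) {
        s += row_sum(pix);
        s += row_sum(pix + line_size);
        s += row_sum(pix + 2 * line_size);
        s += row_sum(pix + 3 * line_size);
        pix += 4 * line_size;
    }
    return s;
}
```

Both functions return the same sum for any 16x16 block; only the amount of
loop bookkeeping differs.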

But your unrolled code still "sucks" :) It has a lot of pipeline stalls
that could be eliminated. Please read the optimization manual and find
the definition of instruction latency. That will help a lot in optimizing
code and understanding how the CPU works. The ARM pipeline is quite simple
to comprehend, and you will immediately spot potential stalls once you
get more practice with assembly code.

> But note the measurements were done within a loop like this:
>
>      for (i = 0, x = 0; i < 1000; i++)
>          x += pix_sum_iwmmxt(...);
>
> I.e. the function body definitely fills the instruction cache. In the real
> code, the function will most probably be called only from time to time, and
> the unrolled version may lose due to its huge size in comparison with the
> looped version.

Yes, unrolling needs to be reasonable in order not to increase code size too
much. You need to evaluate how much performance gain you would get from loop
unrolling and decide what level of unrolling is optimal for your code.
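
A minimal sketch of a harness for that kind of evaluation, in portable C, is
below. The names (pix_sum_fn, pix_sum_c, time_pix_sum) are hypothetical and
pix_sum_c is a plain C stand-in for whichever implementation is being tested.
Note that, exactly like the loop quoted above, this calls the function
back-to-back, so it measures the hot-cache case only and says nothing about
the cost of a larger function body when it is called rarely.

```c
#include <stdint.h>
#include <time.h>

typedef int (*pix_sum_fn)(const uint8_t *pix, int line_size);

/* Plain C stand-in for the implementation under test. */
int pix_sum_c(const uint8_t *pix, int line_size)
{
    int s = 0;
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++)
            s += pix[j];
        pix += line_size;
    }
    return s;
}

/* Calls fn `calls` times and returns the CPU time used, in seconds.
 * The accumulated result is written to *sink so that the calls cannot be
 * optimized away as dead code. */
double time_pix_sum(pix_sum_fn fn, const uint8_t *pix, int line_size,
                    int calls, long *sink)
{
    long x = 0;
    clock_t t0 = clock();

    for (int i = 0; i < calls; i++)
        x += fn(pix, line_size);

    clock_t t1 = clock();
    *sink = x;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Timing each candidate (looped, partially unrolled, fully unrolled) with the
same buffer and call count gives the numbers to weigh against the code size
of each variant.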

Let's try the following. We can start by getting a perfect version of the
'vsad_intra16_iwmmxt' function first. Once it is done, you can focus on
optimizing the 'pix_sum' function yourself without any assistance or
further hints. Once you manage to get an implementation that does not have
any pipeline stalls, you should have enough experience to move on to
optimizing the rest of the functions. Is this plan acceptable to you?

-- 
Best regards,
Siarhei Siamashka