[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Sat May 17 16:53:08 CEST 2008

Siarhei Siamashka wrote:

> Back to back 'wldrd' instructions use question is still not clear. If you
> can make a benchmark that would detect single cycle difference, you can
> experiment with inserting some dummy instruction here. If the performance
> will not change, you have a stall here. If the performance would get worse,
> back to back 'wldrd' instructions are most likely ok for your CPU core.

It looks like you're right here. This version of pix_sum (1):

     asm volatile("wzero wr0                 \n\t"
                  "mov r1, #16               \n\t"
                  "1: wldrd wr2, [%1, #8]    \n\t"
                  "subs r1, r1, #1           \n\t" /* subs here */
                  "wldrd wr1, [%1], %2       \n\t"
                  "waccb wr2, wr2            \n\t"
                  "waddw wr0, wr0, wr2       \n\t"
                  "waccb wr1, wr1            \n\t"
                  "waddw wr0, wr0, wr1       \n\t"
                  "bne 1b                    \n\t"
                  "textrmsw %0, wr0, #0      \n\t"
                  : "=r"(s), "+r"(pix)
                  : "r"(line_size)
                  : "r1", "cc");

is 6-7% faster than this (2):

     asm volatile("wzero wr0                 \n\t"
                  "mov r1, #16               \n\t"
                  "1: wldrd wr2, [%1, #8]    \n\t"
                  "wldrd wr1, [%1], %2       \n\t"
                  "waccb wr2, wr2            \n\t"
                  "waddw wr0, wr0, wr2       \n\t"
                  "waccb wr1, wr1            \n\t"
                  "waddw wr0, wr0, wr1       \n\t"
                  "subs r1, r1, #1           \n\t" /* subs here */
                  "bne 1b                    \n\t"
                  "textrmsw %0, wr0, #0      \n\t"
                  : "=r"(s), "+r"(pix)
                  : "r"(line_size)
                  : "r1", "cc");
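
For reference, here is a plain C sketch of what both loops appear to compute:
each iteration reads one 16-byte row (the two 8-byte wldrd loads), sums its
bytes (waccb) and adds them to a running total (waddw), for 16 rows. The name
pix_sum_ref is hypothetical and not part of the patch; it is only meant as
something to check the asm results against.

```c
#include <stdint.h>

/* Hypothetical scalar reference, not part of the patch:
 * sums all bytes of a 16x16 block whose rows are line_size bytes apart. */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int s = 0;

    for (int i = 0; i < 16; i++) {       /* 16 rows, like the r1 counter */
        for (int j = 0; j < 16; j++)     /* 16 bytes per row = two 8-byte loads */
            s += pix[j];
        pix += line_size;                /* same step as the post-indexed wldrd */
    }
    return s;
}
```

The largest possible result is 16 * 16 * 255 = 65280, so a plain int is wide
enough for the sum.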

An interesting case is the fully unrolled version (3):

     #define SUM \
      "wldrd wr2, [%1, #8] \n\t" \
      "wldrd wr1, [%1], %2 \n\t" \
      "waccb wr2, wr2      \n\t" \
      "waddw wr0, wr0, wr2 \n\t" \
      "waccb wr1, wr1      \n\t" \
      "waddw wr0, wr0, wr1 \n\t"

     asm volatile("wzero wr0             \n\t"
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  SUM
                  "textrmsw %0, wr0, #0  \n\t"
                  : "=r"(s), "+r"(pix)
                  : "r"(line_size));

There is simply no instruction to insert between the loads here. Even so, this
version is 8-9% faster than (1).

But note that the measurements were done in a loop like this:

     for (i = 0, x = 0; i < 1000; i++)
         x += pix_sum_iwmmxt(...);

I.e. the function body definitely stays in the instruction cache. In real code,
the function will most probably be called only from time to time, and the unrolled
version may lose because of its huge size compared with the looped version.
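
A minimal sketch of such a hot-loop measurement in portable C, under some
assumptions: pix_sum_ref is a hypothetical scalar stand-in for pix_sum_iwmmxt,
bench_hot is a made-up helper name, and clock() is used only because it is
standard C (it may well be too coarse to resolve a single-cycle difference).
The function is called through a volatile pointer so that the compiler cannot
fold the repeated calls into one.

```c
#include <stdint.h>
#include <time.h>

typedef int (*pix_sum_fn)(const uint8_t *pix, int line_size);

/* Hypothetical scalar stand-in for the function under test. */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int s = 0;

    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++)
            s += pix[j];
        pix += line_size;
    }
    return s;
}

/* Calls fn back to back, so its code stays warm in the instruction cache.
 * Returns the accumulated sum and stores the elapsed CPU time in *seconds. */
static long bench_hot(pix_sum_fn fn, const uint8_t *pix, int line_size,
                      int iterations, double *seconds)
{
    volatile pix_sum_fn f = fn;   /* reloaded on every call, blocks folding */
    long x = 0;
    clock_t start = clock();

    for (int i = 0; i < iterations; i++)
        x += f(pix, line_size);

    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return x;
}
```

A call such as bench_hot(pix_sum_ref, buf, 16, 1000, &secs) reproduces the
loop above. It only measures the warm-cache case; it says nothing about how
the unrolled version behaves when its code has to be fetched again.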

> Why are you introducing the extra "r1" register? You could just directly
> use "h" variable as a loop counter.

Agreed, r1 is needed only for pix_sum and pix_norm, and only if we prefer
the looped versions over the unrolled ones.

Dmitry