[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Dmitry Antipov dmantipov
Fri May 16 18:19:44 CEST 2008


Siarhei Siamashka wrote:

> Half of the data loaded on the second iteration of your loop has already been
> loaded on the first iteration. It could be reused to improve performance.
> This reuse can be achieved by unrolling the loop.

Argh, I see. Here two loads are avoided at the cost of having two moves:

asm volatile("mov r1, %3            \n\t"
              "wzero wr0             \n\t"
              "wldrd wr1, [%1]       \n\t"
              "wldrd wr2, [%1, #8]   \n\t"
              "1: add %1, %1, %2     \n\t"
              "wldrd wr3, [%1]       \n\t"
              "wldrd wr4, [%1, #8]   \n\t"
              "wsadbz wr1, wr1, wr3  \n\t"
              "wsadbz wr2, wr2, wr4  \n\t"
              "waddw wr0, wr0, wr1   \n\t"
              "waddw wr0, wr0, wr2   \n\t"
              "wmov wr1, wr3         \n\t"
              "wmov wr2, wr4         \n\t"
              "subs r1, r1, #1       \n\t"
              "bne 1b                \n\t"
              "textrmsw %0, wr0, #0  \n\t"
              : "=r"(s), "+r"(pix)
              : "r"(stride), "r"(h - 1)
              : "r1", "cc", "memory");
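
For readers without an IWMMXT reference at hand, here is a plain C sketch of
what the loop above appears to compute: the sum of absolute differences between
each pair of vertically adjacent rows. The name vsad16_ref and the 16-byte row
width are assumptions read off the asm (two 8-byte wldrd loads per row), not
something stated in the patch. As far as I can tell the asm needs h >= 2, since
the counter starts at h - 1 and is only tested after the decrement; the sketch
also accepts h == 1 and returns 0.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference, not part of the patch: sums |a - b| over
 * every pair of vertically adjacent rows, 16 bytes per row, h rows total. */
static int vsad16_ref(const uint8_t *pix, int stride, int h)
{
    int sum = 0;

    assert(h >= 1);
    for (int y = 0; y < h - 1; y++) {
        const uint8_t *cur  = pix + y * stride;  /* row kept in wr1/wr2 */
        const uint8_t *next = cur + stride;      /* row loaded into wr3/wr4 */

        for (int x = 0; x < 16; x++)
            sum += abs(cur[x] - next[x]);        /* wsadbz + waddw */
    }
    return sum;
}
```

Something like this could serve as a checker for the asm versions on random
input.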

As for unrolling, I don't believe it's a good idea here, since the number of
iterations of the outer loop isn't known in advance. Still, here is an unrolled
version:

asm volatile("mov r1, %3                \n\t"
              "wzero wr0                 \n\t"
              "wldrd wr1, [%1]           \n\t"
              "wldrd wr2, [%1, #8]       \n\t"
              "1: add %1, %1, %2         \n\t"
              "wldrd wr3, [%1]           \n\t"
              "wldrd wr4, [%1, #8]       \n\t"
              "wsadbz wr1, wr1, wr3      \n\t"
              "wsadbz wr2, wr2, wr4      \n\t"
              "waddw wr0, wr0, wr1       \n\t"
              "waddw wr0, wr0, wr2       \n\t"
              "subs r1, r1, #1           \n\t"
              "beq 2f                    \n\t"
              "add %1, %1, %2            \n\t"
              "wldrd wr5, [%1]           \n\t"
              "wldrd wr6, [%1, #8]       \n\t"
              "wsadbz wr3, wr3, wr5      \n\t"
              "wsadbz wr4, wr4, wr6      \n\t"
              "waddw wr0, wr0, wr3       \n\t"
              "waddw wr0, wr0, wr4       \n\t"
              "wmov wr1, wr5             \n\t"
              "wmov wr2, wr6             \n\t"
              "subs r1, r1, #1           \n\t"
              "bne 1b                    \n\t"
              "2: textrmsw %0, wr0, #0   \n\t"
              : "=r"(s), "+r"(pix)
              : "r"(stride), "r"(h - 1)
              : "r1", "cc", "memory");
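
The control flow of the unrolled loop, including the early exit taken when the
number of row pairs is odd, can be sketched in C roughly as follows. The names
row_sad16 and vsad16_unrolled are hypothetical, and the 16-byte row width is
again an assumption read off the asm; this only mirrors the structure and says
nothing about speed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: SAD of two 16-byte rows (one wsadbz pair in the asm). */
static int row_sad16(const uint8_t *a, const uint8_t *b)
{
    int s = 0;

    for (int x = 0; x < 16; x++)
        s += abs(a[x] - b[x]);
    return s;
}

/* Hypothetical C rendering of the unrolled-by-two loop; needs h >= 2. */
static int vsad16_unrolled(const uint8_t *pix, int stride, int h)
{
    int sum = 0;
    int n = h - 1;                            /* row pairs left, like r1 */
    const uint8_t *prev = pix;                /* wr1/wr2 */

    assert(h >= 2);
    do {
        const uint8_t *mid = prev + stride;   /* wr3/wr4 */
        sum += row_sad16(prev, mid);
        if (--n == 0)
            break;                            /* beq 2f */

        const uint8_t *last = mid + stride;   /* wr5/wr6 */
        sum += row_sad16(mid, last);
        prev = last;                          /* wmov wr1, wr5; wmov wr2, wr6 */
    } while (--n != 0);                       /* bne 1b */

    return sum;
}
```

Each row is still read only once; the unrolling just halves the number of
pointer copies (the wmov pair) per row.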

The granularity of the performance monitoring unit's clock cycle counter
isn't fine enough to show any performance difference between the two asm
versions :-).

Dmitry




