[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Sat May 17 12:10:30 CEST 2008

On Saturday 17 May 2008, Dmitry Antipov wrote:
> Michael Niedermayer wrote:
> > the iterations are always an even number IIRC, but dont hesitate to add a
> > assert(!(h&1));
>
> But both vsad_intra16_c() and vsse_intra16_c() have an outer loop
> 'for(y=1; y<h; y++)', so, if H is always even, the number of iterations
> is always odd. For loop unrolled by 2, this means we either need an
> additional check within the loop body, or move the last iteration outside
> of the loop.
>
> For an always even H and number of iterations H - 1:
>
> #define BODY(p,q,r,s) \
>          "add %1, %1, %2                      \n\t" \
>          "wldrd wr" #p ", [%1]                \n\t" \

It is still not clear whether back-to-back 'wldrd' instructions are a problem.
If you can make a benchmark that detects a single-cycle difference, you can
experiment with inserting a dummy instruction here. If the performance does
not change, you have a stall here. If the performance gets worse,
back-to-back 'wldrd' instructions are most likely ok for your CPU core.

>          "wldrd wr" #q ", [%1, #8]            \n\t" \
>          "wsadbz wr" #r ", wr" #r ", wr" #p " \n\t" \

Pipeline stall, because the #q register is used too early (load latency
is 3 cycles).

>          "wsadbz wr" #s ", wr" #s ", wr" #q " \n\t" \
>          "waddw wr0, wr0, wr" #r "            \n\t" \
>          "waddw wr0, wr0, wr" #s "            \n\t"
>
> int vsad_intra16_iwmmxt(void *c, uint8_t *pix, uint8_t *dummy, int stride,
> int h) {
>      int s;
>
>      assert(!(h&1));
>
>      asm volatile("mov r1, %3                \n\t"
>                   "wzero wr0                 \n\t"
>                   "wldrd wr1, [%1]           \n\t"
>                   "wldrd wr2, [%1, #8]       \n\t"
>                   /* main loop */
>                   "1:                        \n\t"
>                   BODY(3, 4, 1, 2)
>                   BODY(1, 2, 3, 4)
>                   "subs r1, r1, #2           \n\t"

As I said earlier, this "subs" instruction can be placed anywhere in the
loop, where it can help to eliminate at least one cycle of stall.

>                   "bne 1b                    \n\t"
>                   /* last step */
>                   BODY(3, 4, 1, 2)
>                   "textrmsw %0, wr0, #0      \n\t"
>
>                   : "=r"(s), "+r"(pix)
>                   : "r"(stride), "r"(h - 2)
>                   : "r1");

Why are you introducing the extra "r1" register? You could use the "h"
variable directly as the loop counter. By using a different conditional
branch ("bgt" instead of "bne"), you could try to eliminate the need for
doing arithmetic on "h" before the loop.
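
To make the counting concrete, here is a minimal plain C sketch of one possible
arrangement, not the patch itself. The names row_sad16(), vsad_intra16_ref()
and vsad_intra16_unrolled() are made up for illustration. The reference
function walks the h - 1 row pairs one at a time. The unrolled function uses
"h" itself as the counter with a "greater than" test, so no "h - 2" has to be
computed up front, and the odd remaining pair is handled after the loop:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two 16-pixel rows
 * (hypothetical helper, stands in for the wsadbz/waddw work). */
int row_sad16(const uint8_t *a, const uint8_t *b)
{
    int sum = 0;
    for (int x = 0; x < 16; x++)
        sum += abs(a[x] - b[x]);
    return sum;
}

/* Straightforward version: h - 1 row pairs, as in 'for(y=1; y<h; y++)'. */
int vsad_intra16_ref(const uint8_t *pix, int stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {
        score += row_sad16(pix, pix + stride);
        pix += stride;
    }
    return score;
}

/* Unrolled by 2. 'h' is the loop counter, the loop condition is a
 * "greater than" comparison, and the last row pair is peeled off. */
int vsad_intra16_unrolled(const uint8_t *pix, int stride, int h)
{
    int score = 0;

    assert(h >= 2 && !(h & 1));

    while (h > 2) {
        score += row_sad16(pix, pix + stride);
        score += row_sad16(pix + stride, pix + 2 * stride);
        pix += 2 * stride;
        h -= 2;
    }
    /* last step: even h means an odd number (h - 1) of row pairs */
    score += row_sad16(pix, pix + stride);
    return score;
}
```

For h = 2 the loop body is skipped and only the last step runs, which gives
exactly one row pair. How this maps onto "subs"/"bgt" in the asm version
depends on where the counter update ends up in the loop, so treat it only as
an illustration of the counting.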

>      return s;
> }
>
> #undef BODY
>
> If number of iterations H - 1 (not H) is always even, the last step
> BODY(3, 4, 1, 2) is not needed.
>
> As for the latencies, I don't see (avoidable) ones here. 

All the latencies are generally avoidable in this kind of loop. You may just
need deeper unrolling sometimes (not needed in this case).
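
The general idea is to issue the load for a row well before the instruction
that consumes it, and to do independent work in between. A rough plain C
sketch of that ordering follows. The row16 type, load_row(), sad_rows() and
vsad_intra16_pipelined() are hypothetical names used only here, and a C
compiler is free to reorder this, so it shows the intended order of
operations rather than guaranteed scheduling:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One 16-pixel row held in "registers" (stands in for a pair of wr regs). */
typedef struct { uint8_t b[16]; } row16;

row16 load_row(const uint8_t *p)
{
    row16 r;
    memcpy(r.b, p, sizeof(r.b));
    return r;
}

int sad_rows(row16 a, row16 b)
{
    int sum = 0;
    for (int x = 0; x < 16; x++)
        sum += abs(a.b[x] - b.b[x]);
    return sum;
}

int vsad_intra16_pipelined(const uint8_t *pix, int stride, int h)
{
    int score = 0;

    assert(h >= 2);

    row16 prev = load_row(pix);
    row16 cur  = load_row(pix + stride);

    for (int y = 2; y < h; y++) {
        /* Start loading the row needed in the NEXT iteration ... */
        row16 next = load_row(pix + y * stride);
        /* ... while consuming rows that were loaded earlier. */
        score += sad_rows(prev, cur);
        prev = cur;
        cur  = next;
    }
    /* Drain: the final pair was loaded but not yet consumed. */
    score += sad_rows(prev, cur);
    return score;
}
```

In the asm version the same effect comes from interleaving the 'wldrd' for
the next rows with the 'wsadbz'/'waddw' of the current ones, and from putting
unrelated instructions such as the counter update into the remaining gaps.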

> If someone does, please specify an exact instruction sequence which causes the
> latency - I'll check it against the manuals and try to redesign the code
> again. 

Instruction sequences are annotated above.

Thanks for continuing to post updates, your code is getting consistently better
and should become good enough soon. And XScale cores would definitely like it,
you don't have spare resources to waste there ;)

-- 
Best regards,
Siarhei Siamashka