[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Mon May 19 18:15:10 CEST 2008

Siarhei Siamashka wrote:

> Let's try the following. We can start with getting a perfect version of
> 'vsad_intra16_iwmmxt' function first. Once it is done, you can focus on
> optimizing 'pix_sum' function yourself without getting any assistance or
> further hints.

I've decided to start with pix_sum since it's the simplest function I've tried
to WMMXize :-).

Here you can find 15 WMMX versions of it: http://78.153.153.8/tmp/pix_sum.c
And here is some example output: http://78.153.153.8/tmp/pix_sum.txt (the
results are averaged over a few dozen runs).
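
For reference, here is a minimal plain C sketch of what every variant has to
compute, assuming the usual libavcodec meaning of pix_sum (the sum of all
pixels in a 16x16 block, with line_size bytes between rows). The name
pix_sum_ref is made up for this mail; it is only meant as something to check
the WMMX versions against, not as the code in pix_sum.c.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C reference (hypothetical name): sum of all pixels in a 16x16 block.
 * line_size is the distance in bytes between the starts of two rows. */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int sum = 0;

    assert(line_size >= 16);
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sum += pix[col];
        pix += line_size;   /* step to the next row */
    }
    return sum;
}
```

Any optimized version should return the same value as pix_sum_ref for the
same block and stride.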

Some conclusions from this work:

  - inserting some work (if there is any) between loads is always good;

  - backward loading (from addr + 8 first, then from addr) is not slower than
    forward loading (from addr first, then from addr + 8);

  - explicit prefetching with 'pld [addr]' introduces slowdowns both for
    forward and backward loading (BTW, I want to spend some more time
    learning about prefetching and other cache tricks);

  - WMMX2 pre-increment (wldrd wr0, [%0, %1]!) has no advantage over post-increment
    (wldrd wr0, [%0], %1), and neither has any advantage over a conventional explicit add
    if the latter can be done within the wldr stall slot (wldrd wr0, [%0], then add %0, %0, %1);

  - unrolling is good in general, but each level of unrolling needs to be carefully
    benchmarked. Note that pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll8 is a bit
    slower than pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll4 (I suspect
    this is a nasty (mis?)feature of the core's instruction cache).
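
Since these conclusions all come from timing, here is a rough sketch of
how a variant could be timed and averaged over several runs. It is not the
harness that produced pix_sum.txt. The names bench_avg_seconds and
pix_sum_candidate are invented for illustration, and clock() is a coarse
timer, so a cycle counter would be a better choice on the real target.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef int (*pix_sum_fn)(const uint8_t *pix, int line_size);

/* Stand-in for the variant under test (hypothetical, plain C). */
static int pix_sum_candidate(const uint8_t *pix, int line_size)
{
    int sum = 0;

    for (int row = 0; row < 16; row++, pix += line_size)
        for (int col = 0; col < 16; col++)
            sum += pix[col];
    return sum;
}

/* Time `runs` batches of `calls_per_run` calls each and return the
 * average batch time in seconds. The last result is stored in *checksum
 * so correctness can be checked together with speed. */
static double bench_avg_seconds(pix_sum_fn fn, const uint8_t *pix,
                                int line_size, int runs, int calls_per_run,
                                int *checksum)
{
    volatile int sink = 0;   /* keeps the calls from being optimized away */
    double total = 0.0;

    assert(runs > 0 && calls_per_run > 0);
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        for (int c = 0; c < calls_per_run; c++)
            sink = fn(pix, line_size);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    *checksum = sink;
    return total / runs;
}
```

Each unrolling level would be passed in as fn with the same block and stride,
and the averages compared only after the checksums agree.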

> Once you manage to get an implementation that does not have
> any pipeline stalls

Are you sure it's always possible? I'm not. Sometimes we can minimize
the number of stalls by reordering instructions, but that doesn't
necessarily mean reaching zero-stall code.

For example, see pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 in
the file above. This code:

#define SUM(x,y) \
     "wldrd wr" #x ", [%1, #8] \n\t" \   (0)
     "wldrd wr" #y ", [%1], %2 \n\t" \   (1)
     "wsadb wr0, wr" #x ", wr3 \n\t" \   (2)
     "wsadb wr0, wr" #y ", wr3 \n\t"     (3)

is 'bad' because 0) and 2) both require one cycle to execute, but 0) has a
result latency for #x (3 or 4 cycles, depending on the core), so #x is used 'too early'.
So the pair 0)-2) adds a latency of 2 or 3 cycles, and the pair 1)-3) doubles it.
But despite this, pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16
is one of the fastest versions of pix_sum.
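
To make the data flow of SUM easier to follow, here is a portable C model of
one SUM step and of the full 16-row sum. It rests on assumptions the snippet
above doesn't show: that %1 is the pixel pointer, %2 is the line size, wr3
holds zero, and wr0 is cleared before the first SUM. The helper names
(load64, wsadb_model, sum_row_model, pix_sum_model) are invented. The model
describes only what is computed and says nothing about timing or stalls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of "wldrd": load 8 bytes (one 64-bit WMMX register). */
static uint64_t load64(const uint8_t *p)
{
    uint64_t v;

    memcpy(&v, p, sizeof v);
    return v;
}

/* Model of "wsadb acc, a, b": add the sum of absolute differences of the
 * eight byte lanes of a and b to the accumulator. */
static uint32_t wsadb_model(uint32_t acc, uint64_t a, uint64_t b)
{
    for (int i = 0; i < 8; i++) {
        int x = (int)((a >> (8 * i)) & 0xff);
        int y = (int)((b >> (8 * i)) & 0xff);

        acc += (uint32_t)(x > y ? x - y : y - x);
    }
    return acc;
}

/* One SUM(x,y) step in the order used above; returns the advanced pointer. */
static const uint8_t *sum_row_model(const uint8_t *pix, int line_size,
                                    uint32_t *acc)
{
    const uint64_t zero = 0;              /* assumed content of wr3 */
    uint64_t x = load64(pix + 8);         /* (0) backward load: addr + 8 first */
    uint64_t y = load64(pix);             /* (1) then addr ... */

    pix += line_size;                     /* ... with post-increment */
    *acc = wsadb_model(*acc, x, zero);    /* (2) SAD against zero sums bytes */
    *acc = wsadb_model(*acc, y, zero);    /* (3) */
    return pix;
}

/* Sixteen SUM steps, one per row, as in the unroll16 variant. */
static int pix_sum_model(const uint8_t *pix, int line_size)
{
    uint32_t acc = 0;                     /* assumed initial wr0 */

    assert(line_size >= 16);
    for (int row = 0; row < 16; row++)
        pix = sum_row_model(pix, line_size, &acc);
    return (int)acc;
}
```

In the model, (2) depends only on (0) and (3) only on (1), which is the
dependency the stall argument above is about.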

Having fewer stalls doesn't always mean having faster code, I believe. It would be very
interesting to see a version of pix_sum that is faster than
pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 and has no stalls at all.

Dmitry