[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2
Dmitry Antipov
dmantipov
Mon May 19 18:15:10 CEST 2008
Siarhei Siamashka wrote:
> Let's try the following. We can start with getting a perfect version of
> 'vsad_intra16_iwmmxt' function first. Once it is done, you can focus on
> optimizing 'pix_sum' function yourself without getting any assistance or
> further hints.
I've decided to start with pix_sum since it's the simplest function I've tried
to WMMXize :-).
Here you can find 15 WMMX versions of it: http://78.153.153.8/tmp/pix_sum.c
And here is an example output: http://78.153.153.8/tmp/pix_sum.txt (the
results are averaged over a few tens of runs).
Some conclusions from this work:
- inserting some work (if there is any) between loads is always good;
- backward loading (from addr + 8 first, then from addr) is not slower than
forward loading (from addr first, then from addr + 8);
- explicit prefetching with 'pld [addr]' introduces slowdowns both for
forward and backward loading (BTW, I want to spend some more time on
learning prefetching and other cache tricks);
- WMMX2 pre-increment (wldrd wr0, [%0, %1]!) has no advantage over post-increment
(wldrd wr0, [%0], %1), and neither has an advantage over a conventional explicit add
if the latter can be done within the wldr stall slot (wldrd wr0, [%0], then add %0, %0, %1);
- unrolling is good in general, but each level of unrolling needs to be benchmarked
carefully. Note that pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll8 is a bit
slower than pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll4 (I suspect
this is a nasty (mis?)feature of the core's instruction cache).
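For reference, every variant has to return the same value as a plain scalar loop. Below is a minimal sketch in portable C, assuming the usual pix_sum semantics (the sum of all bytes of a 16x16 block whose rows are line_size bytes apart). The name pix_sum_ref is made up for illustration, and this is not the code from pix_sum.c:

```c
#include <stdint.h>

/* Hypothetical scalar reference (name is illustrative only):
 * sum of all pixels of a 16x16 block.
 * 'line_size' is the distance in bytes between two consecutive rows. */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int sum = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sum += pix[col];
        pix += line_size;   /* step to the next row */
    }
    return sum;
}
```

A check against such a reference is cheap to run before benchmarking, so that a faster variant is not accidentally a wrong one.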
> Once you manage to get an implementation that does not have
> any pipeline stalls
Are you sure it's always possible? I'm not - sometimes we can minimize
the number of stalls by using another instruction ordering, but that doesn't
necessarily mean reaching zero-stall code.
For example, see pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 from
the above. This code:
#define SUM(x,y) \
"wldrd wr" #x ", [%1, #8] \n\t" /* (0) */ \
"wldrd wr" #y ", [%1], %2 \n\t" /* (1) */ \
"wsadb wr0, wr" #x ", wr3 \n\t" /* (2) */ \
"wsadb wr0, wr" #y ", wr3 \n\t" /* (3) */
is 'bad' because 0) and 2) both require one cycle to execute, but 0) has a
result latency for #x (3 or 4 cycles, depending on the core), so #x is used 'too early'.
So the pair 0)-2) adds a latency of 2 or 3 cycles, and the pair 1)-3) doubles it.
But despite this, pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16
is one of the fastest versions of pix_sum.
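To make the data flow of one SUM(x,y) step concrete, here is a sketch that emulates it in portable C. It says nothing about timing or stalls. It assumes that %1 is the pixel pointer, %2 is the line size, wr3 holds zero and wr0 is the accumulator; none of this is shown in the fragment above. The helper names are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Emulates "wsadb acc, v, zero": adds the sum of the 8 bytes of v
 * (absolute differences against an all-zero register) to the accumulator.
 * Assumption: wr3 is zero in the real code. */
static uint32_t wsadb_zero(uint32_t acc, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        acc += (uint8_t)(v >> (8 * i));
    return acc;
}

/* Emulates the 64-bit load done by wldrd. Byte order does not matter
 * here, because all eight bytes are summed. */
static uint64_t load64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* One emulated SUM(x,y) step per row: the upper half is loaded first
 * ("backward" loading), then the lower half, followed by the pointer
 * increment that the post-increment form does as part of the load. */
static int pix_sum_emul(const uint8_t *pix, int line_size)
{
    uint32_t acc = 0;
    for (int row = 0; row < 16; row++) {
        uint64_t x = load64(pix + 8);   /* (0) */
        uint64_t y = load64(pix);       /* (1) */
        pix += line_size;
        acc = wsadb_zero(acc, x);       /* (2) */
        acc = wsadb_zero(acc, y);       /* (3) */
    }
    return (int)acc;
}
```

In this form the dependency is easy to see: (2) needs the result of (0), and (3) needs the result of (1), with only one instruction between each load and its use.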
Having fewer stalls does not always mean having faster code, I believe. It would be very
interesting to see a version of pix_sum that is faster than
pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 and has no stalls at all.
Dmitry