[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2
Tue May 20 00:42:24 CEST 2008
On Monday 19 May 2008, Siarhei Siamashka wrote:
> Dmitry Antipov wrote:
> > For pix_sum, it's definitely possible to write zero-stall code. See
> > pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16_by2 - it has
> > no stalls on the core when WLDR delay is 3 cycles.
> It has stalls because of back-to-back WLDRD instructions which are bad for
> your CPU. Interleaving WLDRD and WSADB will let you get maximal
> > If it's 4 cycles, we may try
> > pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16_by4, and
> > involve more and more registers until we're going out of the free ones
> > :-).
> Yes, you can avoid stalls by unrolling loops and pipelining operations.
> Fortunately ARM has a lot of registers :)
Just in case it is still not completely clear how to optimize this kind of
code: suppose we have the following sequence of operations (each Bi
uses the result of Ai, and the A operations have a latency of 3 cycles).
This code will take 16 cycles because of stalls. Now we can shift all
the B operations down relative to the A operations and get the following:
.. - empty slot
This code will take 9 cycles. The empty slot at the beginning of this sequence
can be filled with some unrelated operation. If the latency is higher than 3,
the operations can be shifted relative to each other even more to get rid of
stalls.
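
To make the cycle counting concrete, here is a rough sketch in Python. It
assumes a simplified in-order, single-issue core where Bi may issue no earlier
than `latency` cycles after Ai was issued; the function names are made up for
illustration. The schedules it generates are not the exact diagrams from this
message, so only the 16-cycle baseline is meant to match the numbers above.

```python
def interleave(n, shift):
    """Return A1..An and B1..Bn in issue order, each Bi moved `shift` A-slots later."""
    order = []
    for slot in range(n + shift):
        if slot < n:
            order.append(f"A{slot + 1}")
        if slot >= shift:
            order.append(f"B{slot - shift + 1}")
    return order


def cycles(schedule, latency=3):
    """Count cycles on an assumed in-order, single-issue core.

    Bi stalls until `latency` cycles have passed since Ai was issued.
    """
    issued = {}
    cycle = 0
    for op in schedule:
        if op.startswith("B"):
            # Wait for the result of the matching A operation if needed.
            cycle = max(cycle, issued["A" + op[1:]] + latency)
        issued[op] = cycle
        cycle += 1
    return cycle


if __name__ == "__main__":
    print(interleave(4, 0), cycles(interleave(4, 0)))      # A1 B1 A2 B2 ... -> 16
    print(interleave(4, 2), cycles(interleave(4, 2)))      # A1 A2 A3 B1 A4 B2 B3 B4 -> 8
    print(cycles(interleave(4, 2), latency=4))             # same order, latency 4 -> 10
    print(cycles(interleave(4, 3), latency=4))             # shifted further -> 8
```

With a latency of 4 the schedule shifted by 2 stalls again, and shifting by 3
removes the stalls, which is the "shift even more" case.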
Now imagine that the A operations are WLDRD and the B operations are WSADB. The
empty slot at the beginning can be filled with some initialization code, etc.
I still recommend having a look at the code attached to:
It needs a performance fix (replace each WSADBZ/WADDW pair with WSADB), but
it tries to implement exactly this pipelining optimization for "vsad_intra16".