[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2
Mon May 19 18:43:48 CEST 2008
Dmitry Antipov wrote:
>> Once you manage to get an implementation that does not have
>> any pipeline stalls
> Are you sure it's always possible? I'm not - sometimes we can minimize
> the number of stalls by using another instruction ordering, but that does
> not necessarily mean reaching zero-stall code.
> For example, see pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 from
> the above. This code:
> #define SUM(x,y) \
> "wldrd wr" #x ", [%1, #8] \n\t" \ (0)
> "wldrd wr" #y ", [%1], %2 \n\t" \ (1)
> "wsadb wr0, wr" #x ", wr3 \n\t" \ (2)
> "wsadb wr0, wr" #y ", wr3 \n\t" (3)
> is 'bad' because 0) and 2) each require one cycle to execute, but 0) introduces
> a result latency for #x (3 or 4 cycles, depending on the core) - #x is used 'too early'.
> So the pair 0)-2) adds a latency of 2 or 3 cycles, and the pair 1)-3) doubles it.
> But, despite this, pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16
> is one of the fastest versions of pix_sum.
> Having fewer stalls does not always mean having faster code, I believe. It would be very
> interesting to see a version of pix_sum which is faster than
> pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 and has no stalls at all.
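
For readers who want the SUM macro spelled out, here is a scalar C sketch of what 16 expansions of it appear to compute. It rests on assumptions not stated in the message: wr3 is taken to hold zero, wr0 is taken to be the running accumulator, %1 is the pixel pointer and %2 the line size, and wsadb is taken to add the sum of absolute byte differences of its two sources into the destination. The names sad8 and pix_sum_ref are hypothetical and are not part of the patch.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of |a[i] - b[i]| over 8 bytes: the per-register work assumed for wsadb. */
static unsigned sad8(const uint8_t *a, const uint8_t *b)
{
    unsigned acc = 0;
    for (int i = 0; i < 8; i++)
        acc += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return acc;
}

/* Scalar stand-in for 16 expansions of SUM: one 16-byte row per step. */
static unsigned pix_sum_ref(const uint8_t *pix, ptrdiff_t line_size)
{
    static const uint8_t zero[8] = {0};  /* assumed contents of wr3 */
    unsigned acc = 0;                    /* plays the role of wr0 */

    for (int row = 0; row < 16; row++) {
        acc += sad8(pix + 8, zero);      /* steps (0) and (2): bytes 8..15 */
        acc += sad8(pix, zero);          /* steps (1) and (3): bytes 0..7 */
        pix += line_size;                /* post-increment of %1 by %2 */
    }
    return acc;
}
```

With a zero reference register the absolute difference is just the byte value, so the result is the plain sum of a 16x16 block. The scalar version says nothing about timing; it only pins down the arithmetic that the differently ordered iwMMXt variants have to reproduce.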
As usual, an idea comes into my head _after_ pushing the 'Send' button :-(((.
For pix_sum, it's definitely possible to write zero-stall code. See
pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16_by2 - it has no
stalls on a core where the WLDR delay is 3 cycles.
If it's 4 cycles, we may try pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16_by4,
and involve more and more registers until we run out of free ones :-).
But I'm still sure that the possibility to write zero-stalls code in all cases is not
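
The trade-off between load latency and the number of registers in flight can be explored with a toy model. The sketch below is not the code from the patch and does not describe any real core: it assumes a single-issue, in-order pipeline in which only the load has a result latency, it ignores the address register update, and it uses a hypothetical schedule of `depth` loads into distinct registers followed by `depth` wsadb instructions accumulating into wr0. All names (Insn, count_stalls, stalls_for_depth) are made up for illustration.

```c
#define NREGS     16   /* iwMMXt data registers wr0..wr15 */
#define NONE      (-1)
#define MAX_DEPTH 12   /* wr4..wr15 are used as load targets */

/* One instruction in the toy model: a destination, up to three sources,
 * and the number of cycles after issue before the result is usable. */
typedef struct {
    int dst;
    int src[3];
    int latency;
} Insn;

/* Count stall cycles for a single-issue, in-order pipeline. */
static int count_stalls(const Insn *code, int n)
{
    int ready[NREGS] = {0};   /* cycle at which each register is usable */
    int cycle = 0, stalls = 0;

    for (int i = 0; i < n; i++) {
        int issue = cycle;
        for (int s = 0; s < 3; s++) {
            int r = code[i].src[s];
            if (r != NONE && ready[r] > issue)
                issue = ready[r];         /* wait for the latest source */
        }
        stalls += issue - cycle;
        if (code[i].dst != NONE)
            ready[code[i].dst] = issue + code[i].latency;
        cycle = issue + 1;
    }
    return stalls;
}

/* Hypothetical schedule: `depth` loads first, then `depth` wsadb into wr0.
 * Returns the stall count, or -1 if depth is out of range. */
static int stalls_for_depth(int depth, int load_latency)
{
    Insn code[2 * MAX_DEPTH];
    int n = 0;

    if (depth < 1 || depth > MAX_DEPTH)
        return -1;
    for (int k = 0; k < depth; k++) {
        Insn ld = { 4 + k, { NONE, NONE, NONE }, load_latency };
        code[n++] = ld;
    }
    for (int k = 0; k < depth; k++) {
        Insn sad = { 0, { 0, 4 + k, 3 }, 1 };  /* wr0 += sad(wrN, wr3) */
        code[n++] = sad;
    }
    return count_stalls(code, n);
}
```

Calling stalls_for_depth(depth, latency) for growing depth shows the counted stalls shrinking as more loads are placed ahead of the first use, at the price of one extra register per load. The model only counts stalls; it says nothing about whether the resulting code is actually faster on hardware.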