[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Mon May 19 19:35:51 CEST 2008

On Monday 19 May 2008, Dmitry Antipov wrote:
> Dmitry Antipov wrote:
> >> Once you manage to get an implementation that does not have
> >> any pipeline stalls
> >
> > Are you sure it's always possible? I'm not - sometimes we can minimize
> > the number of stalls by using another instruction ordering, but that does
> > not necessarily mean reaching zero-stall code.
> >
> > For example, see
> > pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 from the
> > above. This code:
> >
> > #define SUM(x,y) \
> >      "wldrd wr" #x ", [%1, #8] \n\t" \   (0)
> >      "wldrd wr" #y ", [%1], %2 \n\t" \   (1)
> >      "wsadb wr0, wr" #x ", wr3 \n\t" \   (2)
> >      "wsadb wr0, wr" #y ", wr3 \n\t"     (3)
> >
> > is 'bad' because 0) and 2) both require one cycle for execution, but 0)
> > introduces a result latency for #x (3 or 4 cycles, depending on the core),
> > so #x is used 'too early'. So the pair 0)-2) adds a latency of 2 or 3
> > cycles, and the pair 1)-3) doubles it.

No, for WMMX2 (which is your core) you get one stall because of back-to-back
wldrd instructions, and one more stall because (3) tries to use the result of
(1) too early.

If we did not have the penalty for back-to-back wldrd instructions, you would
get only one stall, between instructions (1) and (2). Instruction (3) would be
fine because, after this 1-cycle penalty between (1) and (2), instruction (3)
is 3 cycles away from (1).
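
To make this cycle counting easy to re-check, here is a rough sketch of a toy
model in Python. It is not taken from the patch or from any core
documentation. It assumes a single-issue in-order core, a wldrd result that is
usable 3 cycles after issue, and an optional 1-cycle penalty for back-to-back
wldrd. It ignores the wr0 accumulator and the address register dependencies.
The function name and the tuple layout are made up for illustration.

```python
def count_stalls(program, load_latency=3, back_to_back_load_penalty=1):
    """Count stall cycles of a straight-line program on a toy in-order model.

    program is a list of (kind, dest, sources) tuples, kind is "load" or "alu".
    Only load results have a latency here (a simplifying assumption).
    """
    ready = {}        # register -> first cycle at which a consumer may issue
    cycle = 0         # earliest cycle at which the next instruction may issue
    stalls = 0
    prev_kind = None
    for kind, dest, sources in program:
        earliest = cycle
        # Assumed penalty: a load directly after a load waits extra cycles.
        if kind == "load" and prev_kind == "load":
            earliest += back_to_back_load_penalty
        # Wait until every source register has been loaded.
        for src in sources:
            earliest = max(earliest, ready.get(src, 0))
        stalls += earliest - cycle
        if kind == "load":
            ready[dest] = earliest + load_latency
        cycle = earliest + 1
        prev_kind = kind
    return stalls


# The SUM(x, y) macro from the quoted mail, instructions (0)..(3).
SUM = [
    ("load", "x", []),     # (0) wldrd wrX, [%1, #8]
    ("load", "y", []),     # (1) wldrd wrY, [%1], %2
    ("alu", "wr0", ["x"]), # (2) wsadb wr0, wrX, wr3
    ("alu", "wr0", ["y"]), # (3) wsadb wr0, wrY, wr3
]

with_penalty = count_stalls(SUM, back_to_back_load_penalty=1)
without_penalty = count_stalls(SUM, back_to_back_load_penalty=0)
print(with_penalty, without_penalty)
```

Under these assumptions the model reports 2 stalls with the back-to-back
penalty and 1 stall without it, which matches the counting above.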

> > But, despite this,
> > pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 is one of the
> > fastest versions of pix_sum.
> >
> > Having fewer stalls does not always mean having faster code, I believe.
> > It would be very interesting to see the version of pix_sum which is
> > faster than
> > pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16 and has no
> > stalls at all.
>
> As usual, an idea comes into my head _after_ pushing the 'Send' button :-(((.
>
> For pix_sum, it's definitely possible to write zero-stall code. See
> pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16_by2 - it has no
> stalls on the core when WLDR delay is 3 cycles.

It has stalls because of back-to-back WLDRD instructions which are bad for
your CPU. Interleaving WLDRD and WSADB will let you get maximal performance.

> If it's 4 cycles, we may try
> pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll16_by4, and
> involve more and more registers until we run out of free ones :-).

Yes, you can avoid stalls by unrolling loops and pipelining operations.
Fortunately ARM has a lot of registers :)
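
As a back-of-the-envelope estimate of how many registers such pipelining
needs, here is a small sketch. It assumes a strictly alternating
wldrd/wsadb stream on a single-issue in-order core: load number i issues at
cycle 2*i, and the wsadb that consumes load i-k issues at cycle 2*i + 1. Both
function names are hypothetical, and the model ignores everything except the
load-to-use latency.

```python
import math


def min_lookahead(load_latency):
    """Smallest k such that the consumer of load i-k never waits.

    Load i-k issues at cycle 2*(i-k) and is usable load_latency cycles later;
    its consumer issues at cycle 2*i + 1. So we need
    2*i + 1 >= 2*(i-k) + load_latency, i.e. k >= (load_latency - 1) / 2.
    """
    return max(0, math.ceil((load_latency - 1) / 2))


def registers_needed(load_latency):
    """Data registers in flight: k loads pending plus the one being consumed."""
    return min_lookahead(load_latency) + 1


for latency in (3, 4):
    print(latency, min_lookahead(latency), registers_needed(latency))
```

In this simplified model a 3-cycle load latency needs two data registers and
a 4-cycle latency needs three, so register pressure grows slowly.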

> But I'm still sure that the possibility to write zero-stalls code in all
> cases is not proven.

It is not guaranteed for all cases, but for each particular case (at least
for cases as simple as the ones we have here) you can try to strictly prove
whether zero-stall code is possible or not. That applies to the inner loop, of
course; you may have some unavoidable stalls on the first iteration.
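
For a case this simple, such a check can even be done mechanically. The
sketch below is only an illustration under assumptions: a toy single-issue
in-order model, a 3-cycle load-to-use latency, a 1-cycle penalty for
back-to-back loads, and a hypothetical interleaved schedule that alternates
wsadb and wldrd on two registers. It is not the code from the patch.

```python
def stall_positions(program, load_latency=3, load_pair_penalty=1):
    """Return the indices of the instructions that stall in the toy model.

    program is a list of (is_load, register) pairs: a load writes the
    register, any other instruction (here: wsadb) reads it.
    """
    ready = {}            # register -> first cycle at which it can be read
    cycle = 0             # earliest issue cycle of the next instruction
    prev_was_load = False
    stalled = []
    for index, (is_load, reg) in enumerate(program):
        earliest = cycle
        if is_load and prev_was_load:
            earliest += load_pair_penalty      # assumed back-to-back penalty
        if not is_load:
            earliest = max(earliest, ready.get(reg, 0))
        if earliest > cycle:
            stalled.append(index)
        if is_load:
            ready[reg] = earliest + load_latency
        cycle = earliest + 1
        prev_was_load = is_load
    return stalled


def interleaved_loop(iterations):
    """Two priming loads, then wsadb/wldrd pairs alternating on wr1 and wr2."""
    program = [(True, "wr1"), (True, "wr2")]           # prologue
    for _ in range(iterations):
        program += [(False, "wr1"), (True, "wr1"),     # consume wr1, reload it
                    (False, "wr2"), (True, "wr2")]     # consume wr2, reload it
    return program


print(stall_positions(interleaved_loop(4)))
```

With these assumptions the only stall is at index 1, the second priming load
in the prologue; the loop body itself runs without stalls. Rerunning with
load_latency=4 shows stalls inside the loop, so that case would need the
loads to run further ahead.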

WMMX instructions seem to have a maximum latency of 4 cycles, so the code is
not so hard to analyze. This is nothing compared to ARM11 VFP, where you have
8 cycles of latency and need to take the parallel work of several pipelines
into account :)

-- 
Best regards,
Siarhei Siamashka