[FFmpeg-devel] [PATCH] Fix mm_flags, mm_support for ARM

Siarhei Siamashka
Tue Jul 1 10:00:43 CEST 2008


On Monday 30 June 2008, Måns Rullgård wrote:
> Siarhei Siamashka wrote:
> > On Saturday 28 June 2008, Måns Rullgård wrote:
[...]
> >> ((1<<(COL_SHIFT-1))/W4)*W4 doesn't fit in 16 bits, so that method
> >> can't easily be used when everything else is in 16-bit vectors.
> >
> > But (1 << (COL_SHIFT-1)) does not fit in 16 bits either. I don't see any
> > significant difference between:
> >
> > a0 = W4 * col[0] + (1 << (COL_SHIFT-1));
> >
> > and
> >
> > a0 = W4 * col[0] + ((1<<(COL_SHIFT-1))/W4)*W4;
> >
> > Sure, the first one can be encoded as a constant immediate operand in an
> > ARM instruction and the second one can't, but that's not a big deal (it
> > can be loaded from memory).
>
> Loading from memory is slower than using an immediate operand, and being
> wider than 16 bits it can't share a register with other values.

Loading from memory (actually from the L1 cache) is faster than using an
immediate operand, simply because you can initialize two registers per cycle
instead of just one (on ARM11). The only drawback is the higher latency: an
immediate operand can be used right away, while a constant loaded from memory
is only available after a short delay.

You have a number of constants to load from memory anyway; putting one more
constant into the same cache line will have essentially no impact on
performance.
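
To make this concrete, here is a minimal standalone C sketch of the two
rounding variants quoted above. W4 = 16383 and COL_SHIFT = 20 are only
illustrative values assumed for this sketch (roughly what simple_idct
uses); the exact numbers don't change the argument:

#include <stdint.h>
#include <stdio.h>

/* Illustrative values assumed for this sketch only */
#define W4        16383
#define COL_SHIFT 20

int main(void)
{
    int16_t col0 = 123;   /* example DC coefficient */

    /* Variant 1: plain rounding bias, a full 32-bit constant */
    int32_t a0_plain = W4 * col0 + (1 << (COL_SHIFT - 1));

    /* Variant 2: bias rounded down to a multiple of W4.  The quotient
     * (1 << (COL_SHIFT - 1)) / W4 is small (32 here) and fits in 16 bits,
     * so it could be pre-added to the 16-bit input before the multiply. */
    int32_t bias   = (1 << (COL_SHIFT - 1)) / W4;
    int32_t a0_alt = W4 * (col0 + bias);

    /* The two results differ by less than W4, far below one output LSB
     * (which is 1 << COL_SHIFT before the final shift). */
    printf("plain=%ld alt=%ld diff=%ld\n",
           (long)a0_plain, (long)a0_alt, (long)(a0_plain - a0_alt));
    return 0;
}

Either way the rounding constant itself is wider than 16 bits, so from the
point of view of the 32-bit accumulator there is no real difference between
the two forms.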

[...]

> >> > In any case, ARMv6 idct still needs heavy optimizations; it is not
> >> > very fast (on its target devices with ARM11 CPUs, of course).
> >>
> >> Well, it's considerably faster than the C IDCT, but I'm not denying it
> >> could be improved.  Are you talking about sparse data handling, or
> >> something else?
> >
> > It has quite a number of performance problems:
> > 1. it's not using 64-bit load instructions (and dual loads are almost as
> > useful as dual multiplies for IDCT)
>
> Odd, I can't remember why I didn't use those.
>
> > 2. heavy function call/return overhead (ordinary loops are a lot faster)
>
> Why are function calls so slow?

Well, let's check the ARM11 TRM and compare cycle counts.

A call/return sequence like:

"bl some_function
bx lr"

will take 1 + 4 = 5 cycles in the best case (correct prediction).

while a normal loop:

"subs <something>
b<cond> some_label"

will take 1 + 0 = 1 cycle in the best case (correct prediction and branch
folding).

So you lose 4 cycles per function call, which negates the effect of four
dual-multiply instructions in the code (each of them saves 1 cycle compared
to ARMv5TE multiplications).

Of course a loop has to maintain its counter somehow, but a function call
likewise has to deal with the return address.

In any case, function calls are much slower than normal loops on ARM11.
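
Purely as a structural illustration (hypothetical C below, not the actual
IDCT code), the difference between the two shapes looks like this: the
per-row call variant pays the bl/bx penalty on every row, while the loop
variant only pays the predicted loop branch.

#include <stdint.h>

/* Hypothetical stand-in for one row/column pass of the IDCT */
static void idct_pass(int16_t *row)
{
    (void)row;   /* pass body omitted */
}

/* Call-based shape: one bl/bx pair per row, ~5 cycles of control-flow
 * overhead each on ARM11 even when correctly predicted (a C compiler may
 * of course inline this; hand-written asm keeps the real call) */
static void idct_with_calls(int16_t block[64])
{
    for (int i = 0; i < 8; i++)
        idct_pass(block + 8 * i);
}

/* Loop-based shape: the pass body sits inside the loop, so the only
 * per-iteration control-flow cost is the subs + conditional branch,
 * ~1 cycle when correctly predicted (and folded) */
static void idct_with_loop(int16_t block[64])
{
    for (int i = 0; i < 8; i++) {
        int16_t *row = block + 8 * i;
        /* ... pass body inlined here ... */
        (void)row;
    }
}

int main(void)
{
    int16_t block[64] = { 0 };
    idct_with_calls(block);
    idct_with_loop(block);
    return 0;
}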

> > 3. sparse data handling
>
> Yes, that is a bit lacking.
>
> > This can be clearly seen if you benchmark it against an optimized
> > ARMv5TE IDCT on ARM11: the ARMv6 IDCT fails to provide any significant
> > performance improvement (performance is actually roughly the same and
> > I can't say which one is faster). Also, the ARMv5TE-optimized IDCT can
> > be easily modified to use ARMv6 saturation instructions instead of
> > table lookups. That should save at least 64 cycles per _add/_put
> > function, and I suspect that this hacked ARMv5TE IDCT would easily
> > leap ahead and outperform your ARMv6 IDCT.
> >
> > We have the following hardware features list:
> > "dual loads", "dual multiplies", "saturation"
> >
> > Your IDCT uses only "dual multiplies" and "saturation"
> >
> > ARMv5TE IDCT uses only "dual loads", but can be trivially modified to use
> > "saturation" too
>
> How fast are the v5TE instructions on v6 compared to v6 SIMD instructions?
> On Cortex-A8 they are slower than on ARM9, presumably because NEON is
> the favoured way of doing things there.

On ARM11, v5TE instructions don't introduce any unexpected slowdowns the way
they do on Cortex-A8, though v6 SIMD instructions are of course faster.

BTW, Cortex-A8 also downgrades VFP (it is replaced with VFP Lite, which is
not pipelined), so at least double-precision floating-point calculations
will be much slower per cycle on Cortex-A8 than on ARM11, and
single-precision calculations have to be moved to NEON:
http://www.design-reuse.com/articles/11580/architecture-and-implementation-of-the-arm-cortex-a8-microprocessor.html

-- 
Best regards,
Siarhei Siamashka



