[FFmpeg-devel] [PATCH] Fix mm_flags, mm_support for ARM

Tue Jul 1 09:30:16 CEST 2008

On Monday 30 June 2008, matthieu castet wrote:
> > Could you or anybody else having compatible ARM device just do some
> > benchmarking to confirm my results (I posted benchmarks here multiple
> > times already). It would be a really good help. Because I feel that
> > some people here still doubt that it provides a major performance
> > improvement.
>
> For dct-test (yes, I know it is not a benchmark) on an arm926ejs, the svn
> implementation got 126.7 kdct/s, yours 154.6 kdct/s.

In addition, the current SVN implementation does not skip over empty
coefficients when processing columns, which gives it a slight advantage in
dct-test. Special handling of sparse data makes the worst case (such as
dct-test) a bit slower for the proposed ARMv5TE IDCT, but improves performance
on typical data when decoding video.
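
To illustrate the kind of check involved, here is a rough C sketch (not the
actual ARMv5TE assembly; the function names and the row-major 8x8 layout are
assumptions made up for this example). A column whose AC coefficients are all
zero can take a cheap DC-only path, and the cost of testing for that case is
what shows up as a slowdown on dense data such as dct-test:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if rows 1..7 of the given column are all
 * zero, i.e. only the first coefficient of the column may be nonzero.
 * Assumes an 8x8 block stored row by row. */
static int column_is_dc_only(const int16_t *block, int col)
{
    int row;
    for (row = 1; row < 8; row++)
        if (block[row * 8 + col] != 0)
            return 0;
    return 1;
}

/* Hypothetical helper: counts how many of the 8 columns could take the
 * DC-only shortcut. On typical video data this is expected to be most of
 * them; on random dct-test input it is close to none. */
static int count_dc_only_columns(const int16_t *block)
{
    int col, count = 0;
    for (col = 0; col < 8; col++)
        count += column_is_dc_only(block, col);
    return count;
}
```
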

It is better to benchmark the overall performance improvement when decoding
some video files.

Also, this is not a defect of the current SVN implementation as such, but it
was optimized for ARM9E only. I made sure that the updated IDCT is also fast
on cores with longer pipelines (and higher latencies), such as ARM11 and
XScale. From a practical point of view, older XScale cores without iWMMXt
support (PXA255) exist, and newer XScale cores (PXA27x or better) do not have
an iWMMXt-optimized IDCT in FFmpeg yet. It is possible to use IPP to get
iWMMXt optimizations, but IPP has a non-GPL-compatible license and some
distributions prefer to avoid it.

> > Once/if the performance improvement is confirmed, a help with integration
> > would be really needed. That's not a joke, I really fail to see any
> > problems with the "balign/ASMALIGN/stack alignment" stuff, so I can't fix
> > them. A good example of a solution (a working patch) is very much
> > welcome.
>
> Could you list the integration problems that remain?

AFAIK the known problems are only alignment related. But I may be wrong.

> For the stack alignment, maybe for old eabi you could use ldm/stm
> instead of double load/store instructions, but still use double load/store
> instructions on EABI.

It is not possible to freely use LDM/STM as a replacement, because they have
different addressing capabilities. LDM/STM can only use a base register to
address memory, while LDR/STR/LDRD/STRD support flexible addressing modes. In
our case, pc-relative (relative to the instruction pointer) addressing with an
immediate offset is used, which is just impossible with LDM/STM.

Actually, LDRD/STRD instructions take 2 cycles anyway on arm926ejs (ARM9E),
so just using a pair of normal LDR/STR instructions would have the same
performance. But LDRD is faster on XScale/ARM11, where it takes 1 cycle to
load 2 registers.

Right now LDRD instructions do no harm on ARM9E, except for the alignment
controversy. Minor side effects are the split memory pools (slightly worse
data cache use), but on the other hand slightly better code density (slightly
better instruction cache use).

Technically, one possibility is to have separate variants of the IDCT for
ARM9E and XScale (and possibly ARM11), most likely making heavy use of GNU
assembler macros to share the common parts of the code and avoid unnecessary
duplication. But I would prefer to do that a bit later, when developing the
"ultimate" ARMv5/v6 IDCT, with an optimal permutation, making the best
possible use of the statistical distribution of zero coefficients (by the
way, it is different for MPEG4 ASP and MPEG2), etc. :)

> For the memory pool, why don't you use only one memory pool?
> With good packing, this could avoid lots of balign.

The problem is that normal LDR/STR instructions can have an immediate offset
of up to +-4095 (12 bits) when addressing memory, but LDRD/STRD can only have
an immediate offset of up to +-255 (8 bits). With pc-relative addressing, this
means that the memory pool needs to be very close to the code using it. So
having several pools is required when using LDRD/STRD instructions here.

> Did you benchmark the improvement from using double load/store instructions?
> My manual (DDI0222B_9EJS_r1p2.pdf) says that for arm9ejs:
> - The LDRD instruction behaves in the same way as an LDM of two registers.
> - The STRD instruction behaves in the same way as an STM of two registers.

They behave in the same way only from the performance point of view; other
issues still apply, see above.

Also, LDM/STM instructions are very slow on XScale ("2 + n" cycles, where n is
the number of registers to load/store).
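
As a back-of-the-envelope comparison, the cycle figures quoted in this mail
can be put into a tiny C model (a sketch only: the function names are made up,
and it counts issue cycles from the numbers above while ignoring result
latency and any pipeline stalls):

```c
/* XScale: LDM/STM of n registers takes "2 + n" cycles (figure quoted above). */
static int xscale_ldm_cycles(int nregs)
{
    return 2 + nregs;
}

/* XScale/ARM11: LDRD loads 2 registers in 1 cycle (figure quoted above). */
static int xscale_ldrd_cycles(void)
{
    return 1;
}

/* ARM9E: LDRD takes 2 cycles, same as a pair of plain LDR instructions. */
static int arm9e_ldrd_cycles(void)
{
    return 2;
}
```

With these numbers, loading two registers on XScale costs 4 cycles with LDM
versus 1 cycle with LDRD, while on ARM9E the LDRD choice is neutral.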

-- 
Best regards,
Siarhei Siamashka