[FFmpeg-devel] [PATCH] Fix mm_flags, mm_support for ARM

Sat Jul 5 11:31:30 CEST 2008

On Wednesday 02 July 2008, matthieu castet wrote:
> Hi,
>
> Siarhei Siamashka wrote:
> > On Monday 30 June 2008, matthieu castet wrote:
> >>> Could you or anybody else having compatible ARM device just do some
> >>> benchmarking to confirm my results (I posted benchmarks here multiple
> >>> times already). It would be a really good help. Because I feel that
> >>> some people here still doubt that it provides a major performance
> >>> improvement.
> >>
> >> For dct-test (yes I know it is not a benchmark) on a arm926ejs svn
> >> implementation got 126.7 kdct/s, your 154.6 kdct/s.
> >>
> >>> Once/if the performance improvement is confirmed, a help with
> >>> integration would be really needed. That's not a joke, I really fail to
> >>> see any problems with the "balign/ASMALIGN/stack alignment" stuff, so I
> >>> can't fix them. A good example of a solution (a working patch) is very
> >>> much welcome.
> >>
> >> Could you list the integration problem that remains ?
> >>
>  > AFAIK the known problems are only alignment related. But I may be wrong.
>
> May be also a way to keep MAX_NEG_CROP synchronised.

Well, initially this source file had its own crop table. It was a duplication,
but that made it completely reliable and independent from the rest of code.
There is no completely nice solution, some kind of compromise is required.
Michael Niedermayer said that human dependent synchronization of MAX_NEG_CROP
(by adding comments in both 'dsputil.h' and 'simple_idct_armv5te.S') should be
enough.

> A easy way could be to include dsputils.h and use #ifndef __ASSEMBLY__
> to hide C declaration

I also earlier suggested to have a common header file which defines this
constant and gets included by both assembly code and C, which is pretty much
the same as what you propose here, but it was rejected.

> >> For the alignement stack, may be for old eabi you could use ldm/stm
> >> instead of double load/store instruction but still use double load/store
> >> instruction on EABI.
>
> Another solution could be to use only your code on EABI and using the
> old code for not aligned stack.

Sure. I also thought about it. But that will be an obvious duplication of
essentially the same code. Also it is not very much different from forcing
legacy ABI users to use C version of code or some other IDCT as they will
have worse performance.

> >> For memory pool, why don't you do only one memory pool ?
> >> With a good packing, this could avoid lot's of balign.
> >
> > The problem is that normal LDR/STR instructions can have +-4096 as
> > immediate offset when addressing memory. But LDRD/STRD can only have
> > +-256 as immediate offset. When using pc-relative addressing, it means
> > that memory pool needs to be very close to the code using it. So having
> > several pools is required when using LDRD/STRD instructions here.
>
> Well you never exploit the fact that the offset is signed.
> I manage to make only one pool by :
> - making idct_two_col_armv5te a function instead of a macro (I know I
> introduce some extra cycle for call/ret, but there less code and data
> (fit better in cache)).
> - put all constant between idct_two_col_armv5te and idct_rows_armv5te

Did you benchmark this change? Function calls are quite slow and I would
rather avoid them (getting rid of unnecessary function calls is one of the
reasons why this new code is faster than the one in SVN). You still can
replace LDRD with LDR+LDR instruction pairs on ARM9E and it will allow you 
to have only one data pool without sacrificing performance at all.

But anyway, thanks for focusing attention on this part, currently data
occupies 3 cache lines, it is possible to fit it into just 2 lines and
two pools. And also we have just 1 extra constant in memory which does 
not allow to fit data into a single cache line. It is possible to get
rid of it, at the cost of slowing down code a bit.

Anyway, additional pressure on the instructions cache from duplicating columns
processing code in idct_put and idct_add functions may be not so high. Only
idct_put functions are called when decoding I-frames. In all other cases
idct_add function seem to be dominating. So they are almost never heavily
used together at the same time.

But instructions cache misses are really a problem. My benchmark with 
oprofile showed that ~20% of cycles when decoding mpeg4 asp video is wasted 
on instruction cache misses (for 32K of instructions cache). But I did this
test quite a long time ago and it might be worth checking again.

In order to improve instructions cache utilization, you may also want to try
patches to 'dsputil_arm_s.S' from maemo mplayer repository. This change helps
to get rid of intermixing instructions/data, also improving performance of
each function by a few cycles.

Additional optimization for better use of data cache is to force 'readable'
constant to 0 in 'MPV_decode_mb_internal' for ARM (or any other CPUs with 
non-write-allocate cache). This will make the code use intermediate buffer in
cache to process data and reduce the number of data cache misses. Performance
improvement is quite measurable. More performance can be obtained from a
better use if write buffer, by not copying data to memory at the end
of 'MPV_decode_mb_internal', but interleaving copying with IDCT calculation
(write buffer has a limited size, to it is better to write data to memory in
relatively small chunks and do some IDCT number crunching between writes,
letting write buffer to flush to memory in the background). It is still
important to write data using blocks of 16 bytes aligned at 16 byte boundary
with STM instruction in order to use write burst.

Data cache use simulation can be also done in valgrind, I had a patch for it
to force simulation of read-allocate cache behaviour.

-- 
Best regards,
Siarhei Siamashka