[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)

Michael Niedermayer michaelni
Sun Apr 20 23:41:57 CEST 2008

On Sun, Apr 20, 2008 at 11:33:43PM +0300, Siarhei Siamashka wrote:
> On Sunday 20 April 2008, matthieu castet wrote:
> > Siarhei Siamashka wrote:
> > > Also my additional test program 'test-vfp.c' from
> > > https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/test
> > >s/?root=mplayer runs successfully and verifies performance, correctness
> > > and absence of any incorrect memory accesses outside memory buffers.
> >
> > BTW do you plan to resubmit your armv5te optimisations [1] ?
> >
> > [1] http://article.gmane.org/gmane.comp.video.ffmpeg.devel/56672
> Yes, my intention is to sumbit all the ARM optimizations to FFmpeg
> and MPlayer eventually. It may take some time though.
> Regarding this ARMv5TE optimized IDCT. It is a refined version of ARMv5TE 
> IDCT from FFmpeg, initially implemented by M?ns Rullg?rd. Now almost all the
> code was modified (so making diff is useless, though the general structure is
> the same). The changes include:
> 1. Introduced the use of negative constants. ARMv5TE only supports fused
> multiply&accumulate instructions, but not multiply&subtract. Older 
> code used more instructions to do the same job (separate multiply and 
> subtract instructions), new code uses multiply&accumulate with negative
> constants.
> 2. Instructions scheduling optimized for longer pipeline (ARM11, XScale),
> better registers allocation. The same code is still optimal for ARM9E that 
> has shorter pipeline.
> 3. Dual loads using LDRD instructions (reading two 32-bit registers per cycle)
> for ARM11 and XScale. ARM9E also supports LDRD instruction, but it takes 2
> cycles, so this optimization does not provide any improvement for ARM9E cores.
> 4. Saturation using table lookups (it works faster)
> 5. Better shortcuts for zero coefficients and a tweak for nonoptimal code
> chunk identified by Michael Niedermayer when reviewing this code (the same
> problem is still unfixed in current FFMpeg ARMv5TE IDCT). 
> This all together provides a major speedup, especially when run on ARM11 and
> XScale cores. All the benchmarks have been posted long ago.

> Now about the problems why this code is still not in FFmpeg SVN. The most
> important problem that stalled everything is 64-bit alignment. In order to
> perform dual loads, data needs to be 64-bit (8 byte) aligned. This applies
> to constants loading and accessing temporary data on stack. Constants loading
> requires some kind of '.align' directive in the code. The problem is that
> '.align' is nonportable and '.balign' or '.p2align' are not supported by 
> broken Apple toolchain on x86 and apparently on ARM too (actually it has a 
> lot more issues in addition to this one). One more problem is related to
> stack, because legacy systems and toolchains (those that use OABI) support 
> only 4 byte stack alignment. There was a tweak in the code that performs 
> stack realignment when IDCT code is built for these broken legacy systems 
> with absolutely no overhead on modern systems (this code does not get compiled
> in for modern EABI systems that already provide 64-bit stack alignment), but
> it was rejected by Michael Niedermayer. I strongly disagree with this decision
> and breaking support for legacy systems for no reason is not a valid option
> for me.

I do not remember exactly but there was a serious problem with your suggested
"solution", IIRC it caused some serious speedloss on some systems.

1. Why do you need a aligned stack? That is why cant you use the heap?
2. Assuming you do need one, where was the problem with using a recent gcc
   which supports maintaining stack alignment?
3. What effect does your solution have on systems which do align the stack
   aka a recent gcc on pre EABI. Or even a non gcc compiler.

> The last (actually minor) problem is related to crop table that is defined in
> C code and used from assembly code. If anybody would ever change MAX_NEG_CROP
> constant in C code for whatever reason, assembly implementation of ARMv5TE
> IDCT will be silently broken (it will compile, but will break at runtime). I
> would prefer to have compilation failure here. A good example is lowres
> decoding support that is now broken for many less popular architectures
> (lowres decoding was added at some time in the past, but nobody tweaked sparc
> or alpha code to properly support it). Another option (which I initially used)
> is just to have a copy of crop table stored in assembly source file, it will
> take extra space, but will make code reliable and not dependent on any other
> stuff from outside.
> I really would like to hear some constructive feedback instead of "this is
> unacceptable" with no alternatives suggested :)

#if MAX_NEG_CROP != 12345
Please fix code in file foobar

> To sum it up, I would like to ask you to allow me to keep '.balign' directives
> (broken toolchains will just have compilation failure, that's not so bad, at
> least we will have a chance to see if it breaks in practice for anyone and 
> what toolchain they use).

we have a macro for alignment, use it, if it doesnt work, fix it
(see configure)


Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have often repented speaking, but never of holding my tongue.
-- Xenocrates
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080420/bc263f4c/attachment.pgp>

More information about the ffmpeg-devel mailing list