[FFmpeg-devel] [PATCH] Further optimization of base64 decode using AV_WB32.

Reimar Döffinger Reimar.Doeffinger at gmx.de
Sat Jan 21 21:29:02 CET 2012


On Sat, Jan 21, 2012 at 08:47:59PM +0100, Michael Niedermayer wrote:
> On Sat, Jan 21, 2012 at 06:53:53PM +0100, Reimar Döffinger wrote:
> > On Sat, Jan 21, 2012 at 06:39:29PM +0100, Reimar Döffinger wrote:
> > > On Sat, Jan 21, 2012 at 06:30:48PM +0100, Reimar Döffinger wrote:
> > > > On Sat, Jan 21, 2012 at 06:13:19PM +0100, Reimar Döffinger wrote:
> > > > > On Sat, Jan 21, 2012 at 05:56:32PM +0100, Michael Niedermayer wrote:
> > > > > > On Sat, Jan 21, 2012 at 05:52:27PM +0100, Reimar Döffinger wrote:
> > > > > > > This is somewhat questionable.
> > > > > > > The biggest issue is that av_bswap32 is not replaced
> > > > > > > with our asm version on gcc 4.5 or newer.
> > > > > > > This causes gcc to generate horrible code that is slower
> > > > > > > than the unoptimized variant.
> > > > > > > Old:                                  248852 decicycles
> > > > > > > New with gcc's attempt at av_bswap32: 256576 decicycles
> > > > > > > New with our bswap32:                 200260 decicycles
> > > > > > [...]
> > > > > > > diff --git a/libavutil/x86/bswap.h b/libavutil/x86/bswap.h
> > > > > > > index 52ffb4d..aa39d97 100644
> > > > > > > --- a/libavutil/x86/bswap.h
> > > > > > > +++ b/libavutil/x86/bswap.h
> > > > > > > @@ -37,7 +37,7 @@ static av_always_inline av_const unsigned av_bswap16(unsigned x)
> > > > > > >  }
> > > > > > >  #endif /* !AV_GCC_VERSION_AT_LEAST(4,1) */
> > > > > > >  
> > > > > > > -#if !AV_GCC_VERSION_AT_LEAST(4,5)
> > > > > > > +#if 1 || !AV_GCC_VERSION_AT_LEAST(4,5)
> > > > > > >  #define av_bswap32 av_bswap32
> > > > > > >  static av_always_inline av_const uint32_t av_bswap32(uint32_t x)
> > > > > > >  {
> > > > > > 
> > > > > > also make sure -cpu/arch/tune is set so gcc is allowed to use bswap
> > > > > > (its 486+) so not possible for gcc to use on strict x86
> > > > > 
> > > > > It is a x86_64 build, so I'd hope that gcc will not try to "optimize"
> > > > > of 486 on that...
> > > > 
> > > > gcc version is actually 4.6.2 and it fails to use the bswap instruction
> > > > regardless whether I use no extra options, -march=native, -m32, -m32
> > > > -march=native.
> > > > In all cases the code without our inline bswap is significantly slower
> > > > (ca. 20%).
> > > > I have no idea where the claim that gcc would recognize the bswap comes
> > > > from (hm, I haven't tested if the << 8 confuses it though, will now).
> > > 
> > > Yes, only completely removing the shift fixes it.
> > > One option would be to make the table 16 bit to avoid that shift.
> > > However my tests show that even though this saves the shift instruction
> > > the code does not become any faster in 64 bit mode and only maybe 2% in
> > > 32 bit mode (except of course for unbreaking the compiler), so it
> > > seems quite wasteful.
> > 
> > This works, too, it actually seems about 2% faster than with out bswap
> > asm (I assume better scheduling):
> 
> what effect does this have on ARM and or (lets randomly pick) mpeg2
> decoding speed ?

On my beagleboard it compiles to the rev instruction, just like our
inline asm for example.
For decoding speed, a quick test using "time" and the fate MPEG2 sample would
indicate no change, but the accuracy of that test is quite useless.
Why, what are the options to decide between?
However testing on ARM shows that we should not replace on the 32 bit
bswap with builtin, since that breaks bswap64.
Current bswap64 result (directly from C code, no inline asm etc.):
        rev     r3, r0
        rev     r0, r1
        mov     r1, r3
        bx      lr

Using __builtin_bswap64:
        rev     r3, r0
        rev     r0, r1
        mov     r1, r3
        bx      lr

Replacing only bswap32 with __builtin_bswap32:
        rev     r3, r0
        movs    r2, #0
        push    {r4}
        rev     r4, r1
        orr     r0, r2, r4
        mov     r1, r3
        pop     {r4}
        bx      lr

And yes, obviously gcc still excels at pushing things to stack for no good reason
at all, or'ing with a constant 0 (that of course must first be loaded) and moving
stuff to the right place afterwards instead of just using the right destination
register in the first place.


More information about the ffmpeg-devel mailing list