[FFmpeg-devel] Proposed patch

Sun Aug 5 10:18:00 CEST 2007

2007/8/4, Michael Niedermayer <michaelni at gmx.at>:
> Hi
>
> On Sat, Aug 04, 2007 at 09:31:35AM +0000, Arpi wrote:
> > Hi,
> >
> > > >   Anyway, if you consider ppc to be a critical one to benchmark on,
> > > > I would need somebody to do it for me.
> > >
> > > well, if noone steps forward to benchmark it on ppc then ill assume that
> > > there are no ppc users who care about a possible 50% speedloss
> > > (or speedgain...)
> >
> > Darwin g5.local 8.10.0 Darwin Kernel Version 8.10.0: Wed May 23 16:50:59 PDT
> > 2007; root:xnu-792.21.3~1/RELEASE_PPC Power Macintosh powerpc
> > 2x2.5ghz G5 cpu...
> >
> > gcc version 4.0.1 (Apple Computer, Inc. build 5341)
> [...]
> > MPlayer SVN-r24007 (C) 2000-2007 MPlayer Team
> > -vc ffmpeg12:
> > BENCHMARKs: VC:   3.138s VO:  13.480s A:   0.000s Sys:   0.487s =   17.104s
> > BENCHMARK%: VC: 18.3444% VO: 78.8109% A:  0.0000% Sys:  2.8447% = 100.0000%
> >
> > SVN-r24007 + the patch:
> > BENCHMARKs: VC:   3.189s VO:  13.302s A:   0.000s Sys:   0.499s =   16.990s
> > BENCHMARK%: VC: 18.7674% VO: 78.2959% A:  0.0000% Sys:  2.9367% = 100.0000%
> >
> > little slowdown...
>
> depends on which number you look at, VC yes, overall no, which is odd as the
> other parts shouldnt be affected by the patch ...
> maybe you could run these two benchmarks 3 times?
>
> also maybe you want to test the libmpeg2 bitstream reader, i think it hasnt
> been tested on ppc yet (or i dont remember ...)
> on x86 (duron 1x0.8 ghz, gcc 4.1.2)
> with the good old matrixbench mpeg2 its slower:
>
> (default)
> real    0m54.348s
> user    0m52.692s
> sys     0m1.349s
> real    0m54.193s
> user    0m52.755s
> sys     0m1.239s
> real    0m54.249s
> user    0m52.769s
> sys     0m1.209s
>
> #define LIBMPEG2_BITSTREAM_READER in bitstream.h (!this needs svn head due to
> bugs in mpeg12.c)
> real    0m55.479s
> user    0m53.548s
> sys     0m1.332s
> real    0m55.097s
> user    0m53.601s
> sys     0m1.271s
> real    0m55.542s
> user    0m53.628s
> sys     0m1.232s
>
> #define A32_BITSTREAM_READER in bitstream.h
> real    0m57.933s
> user    0m56.142s
> sys     0m1.261s
> real    0m58.778s
> user    0m56.474s
> sys     0m1.455s
>

#def CONFIG_ALIGN
/* avcodec/bitstream.h */
static inline int unaligned32_be(const void *v){
        const uint8_t *p=v;
        return (((p[0]<<8) | p[1])<<16) | (p[2]<<8) | (p[3]);
}

/* avutil/intreadwrite.h */
#define AV_RL32(x) ((((uint8_t*)(x))[3] << 24) | \
                    (((uint8_t*)(x))[2] << 16) | \
                    (((uint8_t*)(x))[1] <<  8) | \
                     ((uint8_t*)(x))[0])

If the function in bitstream.h is faster, then I think it would be worth
rewriting AV_RL32 in the same style (with the byte order swapped, since
unaligned32_be reads big-endian).
I am not familiar with PPC assembler, but if it has to load each distinct
shift count into a register, then using 8 twice would explain the
speedup. It may be worth benchmarking a version that shifts three
times by 8:

 (((((x3<<8)|x2)<<8)|x1)<<8)|x0

(If PPC executes two or more instructions in parallel or out of order, the
bitstream.h version may allow more parallelism, because its two halves do
not depend on each other, so it could still be faster.)
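
To make the comparison concrete, here is a minimal sketch of the three
shift patterns as standalone functions, all reading a little-endian
32-bit value. The function names are made up for illustration and are not
in FFmpeg. I also cast to uint32_t before shifting, which the original
code does not do, to avoid shifting a byte into the sign bit of an int.
Which form is fastest on PPC is exactly what would need benchmarking.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not part of FFmpeg. */

/* AV_RL32 style: four independent terms, shift counts 24, 16 and 8. */
static inline uint32_t rl32_flat(const void *v)
{
    const uint8_t *p = v;
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] <<  8) |  (uint32_t)p[0];
}

/* bitstream.h style with the byte order swapped: shift counts 8, 16, 8. */
static inline uint32_t rl32_paired(const void *v)
{
    const uint8_t *p = v;
    return ((((uint32_t)p[3] << 8) | p[2]) << 16) |
            ((uint32_t)p[1] << 8) | p[0];
}

/* Chained form: three shifts by 8, each depending on the previous one. */
static inline uint32_t rl32_chained(const void *v)
{
    const uint8_t *p = v;
    return ((((((uint32_t)p[3] << 8) | p[2]) << 8) | p[1]) << 8) | p[0];
}
```

All three return the same value for the same four bytes, so they can be
swapped freely in a benchmark. The chained form uses only one shift count
but forms a single dependency chain, while the paired form has two
independent halves.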

Whichever version turns out to be faster, I guess the other functions in
avutil could be redone in the same manner.
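
As a rough sketch of what "the same manner" could look like for other
byte-wise readers, here are chained versions of a big-endian 32-bit read
and of 24-bit reads. The names are hypothetical and I am not assuming
they match any existing avutil macro. The uint32_t casts are my addition.

```c
#include <stdint.h>

/* Hypothetical helpers; sketches only, not the avutil macros. */

/* Big-endian 32-bit read using three shifts by 8. */
static inline uint32_t rb32_chained(const void *v)
{
    const uint8_t *p = v;
    return ((((((uint32_t)p[0] << 8) | p[1]) << 8) | p[2]) << 8) | p[3];
}

/* Big-endian 24-bit read using two shifts by 8. */
static inline uint32_t rb24_chained(const void *v)
{
    const uint8_t *p = v;
    return ((((uint32_t)p[0] << 8) | p[1]) << 8) | p[2];
}

/* Little-endian 24-bit read using two shifts by 8. */
static inline uint32_t rl24_chained(const void *v)
{
    const uint8_t *p = v;
    return ((((uint32_t)p[2] << 8) | p[1]) << 8) | p[0];
}
```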