[FFmpeg-devel] Once again: Multithreaded H.264 decoding with ffmpeg?

Michael Niedermayer michaelni
Sat May 31 16:51:06 CEST 2008


On Sat, May 31, 2008 at 06:56:31AM -0600, Loren Merritt wrote:
> On Sat, 31 May 2008, Michael Niedermayer wrote:
> 
> >> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
> >> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
> >> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
> >> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
> >> QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
> >> QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
> >> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
> >> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
> >> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
> >> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
> >> QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
> >> QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
> >> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
> >> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
> >> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
> >> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
> >
> > #define F2(a,b,c,d,e,f) f (a,b,c,d,e,f) f (b,c,d,e,f,a)
> > #define F3(a,b,c,d,e,f) F2(a,b,c,d,e,f) F2(c,d,e,f,a,b) F2(e,f,a,b,c,d)
> > F3(a,b,c,d,e,f)
> > F3(a,b,c,d,e,f)
> > F2(a,b,c,d,e,f) F2(c,d,e,f,a,b)
> >
> > If that's more readable I dunno, if yasm is more readable I dunno either.
> > Especially for someone not familiar with yasm it likely could prove
> > confusing.
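
(To clarify, the F2/F3 thing above was only a rough sketch; written out it
would be something like this, untested, with F1 being a wrapper name made up
just for the example:

#define F1(a,b,c,d,e,f) QPEL_H264V(a,b,c,d,e,f, OP)
#define F2(a,b,c,d,e,f) F1(a,b,c,d,e,f) F1(b,c,d,e,f,a)
#define F3(a,b,c,d,e,f) F2(a,b,c,d,e,f) F2(c,d,e,f,a,b) F2(e,f,a,b,c,d)

so that the F3 F3 F2 F2 sequence expands to the 16 rotated QPEL_H264V calls
above.)
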
> 
> That requires a new define for every different permutation, and it often 
> isn't in a loop.
> 
> Try
>      LOAD          m0, m1, m2, m3, m4, m5, m6, m7
>      IDCT8_1D      m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
>      TRANSPOSE8x8W m8, m1, m7, m3, m4, m0, m2, m6, m5
>      IDCT8_1D      m8, m0, m6, m3, m5, m4, m7, m1, m9, m2
>      STORE         m9, m0, m1, m3, m5, m8, m6, m7
> vs
>      LOAD          m0, m1, m2, m3, m4, m5, m6, m7
>      IDCT8_1D      m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
>      TRANSPOSE8x8W m0, m1, m2, m3, m4, m5, m6, m7, m8
>      IDCT8_1D      m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
>      STORE         m0, m1, m2, m3, m4, m5, m6, m7
> 
> or
>      SBUTTERFLY wd,  %1, %2, %9
>      SBUTTERFLY wd,  %3, %4, %2
>      SBUTTERFLY wd,  %5, %6, %4
>      SBUTTERFLY wd,  %7, %8, %6
>      SBUTTERFLY dq,  %1, %3, %8
>      SBUTTERFLY dq,  %9, %2, %3
>      SBUTTERFLY dq,  %5, %7, %2
>      SBUTTERFLY dq,  %4, %6, %7
>      SBUTTERFLY qdq, %1, %5, %6
>      SBUTTERFLY qdq, %9, %4, %5
>      SBUTTERFLY qdq, %8, %2, %4
>      SBUTTERFLY qdq, %3, %7, %2
> vs
>      SBUTTERFLY wd,  %1, %2, %9
>      SBUTTERFLY wd,  %3, %4, %9
>      SBUTTERFLY wd,  %5, %6, %9
>      SBUTTERFLY wd,  %7, %8, %9
>      SBUTTERFLY dq,  %1, %3, %9
>      SBUTTERFLY dq,  %2, %4, %9
>      SBUTTERFLY dq,  %5, %7, %9
>      SBUTTERFLY dq,  %6, %8, %9
>      SBUTTERFLY qdq, %1, %5, %9
>      SBUTTERFLY qdq, %2, %6, %9
>      SBUTTERFLY qdq, %3, %7, %9
>      SBUTTERFLY qdq, %4, %8, %9
> 
> Worse yet, what if 2 implementations of butterfly have different
> permutations, but must be plugged into the same higher level code?
> (mov,add,sub is faster on core2 whereas add,add,sub takes fewer regs and
> is thus useful on x86_32.)
> 
> lavc's TRANSPOSE8 has an extra movdqa because I couldn't find any
> other way to make the 32bit and 64bit versions have the same permutation.
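
If I understand the add,add,sub trick correctly, the dataflow difference is
roughly the following (written as intrinsics-style C just to show where the
results end up, not the actual code; the SUMSUB_* names are made up for this
sketch):

#include <emmintrin.h>

/* mov,add,sub: needs a scratch register t; the sum ends up in a, a-b in t */
#define SUMSUB_MOV(a, b, t) do { (t) = (a);                     \
                                 (a) = _mm_add_epi16((a), (b)); \
                                 (t) = _mm_sub_epi16((t), (b)); } while (0)

/* add,add,sub: no scratch register, but the difference comes out in b and
 * as b-a, i.e. a different output register (and sign) than above */
#define SUMSUB_2REG(a, b)   do { (a) = _mm_add_epi16((a), (b)); \
                                 (b) = _mm_add_epi16((b), (b)); \
                                 (b) = _mm_sub_epi16((b), (a)); } while (0)

/* usage: __m128i a, b, t;  SUMSUB_MOV(a, b, t);  or  SUMSUB_2REG(a, b); */

So yes, plugging both into the same higher level code means the caller has
to track a different register permutation for each variant.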

The question is how many more things one could optimize by not being forced
to use the same source for 32 and 64 bit.

Switching between register names is one thing, but trying to use common code
where one case has 8 registers and the other has 16 just doesn't look like
such a clear case. Sometimes it's likely better to have common code, but
always? And if it's better in yasm to have common code for a specific
function, why does that also have to be the best way in gcc asm?

I am really a little curious whether cleanly written yasm code is so much
superior to cleanly written gcc inline asm code. I certainly am no fan of gcc
or its asm; it's mainly the extra dependency and the loss of support for many
platforms that annoys me most about this ...

TRANSPOSE8 is used in 2 spots ...

        TRANSPOSE8(%%xmm4, %%xmm1, %%xmm7, %%xmm3, %%xmm5, %%xmm0, %%xmm2, %%xmm6, (%1))
        "paddw          %4, %%xmm4 \n"
        "movdqa     %%xmm4, 0x00(%1) \n"
        "movdqa     %%xmm2, 0x40(%1) \n"
        H264_IDCT8_1D_SSE2(%%xmm4, %%xmm0, %%xmm6, %%xmm3, %%xmm2, %%xmm5, %%xmm7, %%xmm1)
        "movdqa     %%xmm6, 0x60(%1) \n"
        "movdqa     %%xmm7, 0x70(%1) \n"

These movdqa are not needed on x86-64, and I suspect that by not using
"common" code their number can be reduced on x86-32; more precisely, the
second one looks like it could be merged with something from TRANSPOSE8.

Also, the question of readability has been ignored entirely: is all the
preprocessor magic, be it yasm or C, really that good?
You use a lot of preprocessor tricks in your gcc asm; I just thought it
might be more flexible and readable with a little less.
After all, the code would be the same after the preprocessor anyway.

And lastly, ultra fine-tuned common 64/32 bit code has another problem:
when someone wants to change/optimize the code but does not have both a
32 and a 64 bit cpu. It could easily lead to a speed loss, or to considerably
more work waiting for others to do the benchmarking.

So in the end, IMHO, maybe less preprocessor-based asm code factorization
would be a better solution than yasm. Just my 2 cents; I am not opposing
yasm if people really want it ...

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Those who are best at talking realize last or never when they are wrong.