[FFmpeg-devel] Once again: Multithreaded H.264 decoding with ffmpeg?

Loren Merritt lorenm
Sat May 31 14:56:31 CEST 2008


On Sat, 31 May 2008, Michael Niedermayer wrote:

>> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
>> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
>> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
>> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
>> QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
>> QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
>> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
>> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
>> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
>> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
>> QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
>> QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
>> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
>> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
>> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
>> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
>
> #define F2(a,b,c,d,e,f) ROW(a,b,c,d,e,f) ROW(b,c,d,e,f,a) /* ROW: the per-row macro */
> #define F3(a,b,c,d,e,f) F2(a,b,c,d,e,f) F2(c,d,e,f,a,b) F2(e,f,a,b,c,d)
> F3(a,b,c,d,e,f)
> F3(a,b,c,d,e,f)
> F2(a,b,c,d,e,f) F2(c,d,e,f,a,b)
>
> If that's more readable, I dunno; if yasm is more readable, I dunno either.
> Especially for someone not familiar with yasm, it could likely prove
> confusing.

That requires a new define for every different permutation, and it often 
isn't in a loop.
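
For reference, the quoted QPEL_H264V block is regular: each row is the previous row's six registers rotated left by one, 16 rows in total. A minimal sketch that models this in Python (the helper name `rotated_rows` is made up for illustration, and the script only prints the text; it is not part of lavc):

```python
# Model of the quoted QPEL_H264V sequence: row i uses the six MMX registers
# rotated left by i positions. Illustration only, not generated lavc code.
REGS = ["%%mm0", "%%mm1", "%%mm2", "%%mm3", "%%mm4", "%%mm5"]


def rotated_rows(regs, count):
    """Return `count` argument lists, row i being `regs` rotated left by i."""
    n = len(regs)
    return [[regs[(i + k) % n] for k in range(n)] for i in range(count)]


rows = rotated_rows(REGS, 16)
for row in rows:
    # Trailing backslash mirrors the macro line continuation in the quote.
    print("QPEL_H264V(" + ", ".join(row) + ", OP)\\")
```

A pure rotation like this compresses into a couple of defines or a loop. The permutations in the listings below are not rotations, which is where that approach stops helping.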

Try
     LOAD          m0, m1, m2, m3, m4, m5, m6, m7
     IDCT8_1D      m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
     TRANSPOSE8x8W m8, m1, m7, m3, m4, m0, m2, m6, m5
     IDCT8_1D      m8, m0, m6, m3, m5, m4, m7, m1, m9, m2
     STORE         m9, m0, m1, m3, m5, m8, m6, m7
vs
     LOAD          m0, m1, m2, m3, m4, m5, m6, m7
     IDCT8_1D      m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
     TRANSPOSE8x8W m0, m1, m2, m3, m4, m5, m6, m7, m8
     IDCT8_1D      m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
     STORE         m0, m1, m2, m3, m4, m5, m6, m7

or
     SBUTTERFLY wd,  %1, %2, %9
     SBUTTERFLY wd,  %3, %4, %2
     SBUTTERFLY wd,  %5, %6, %4
     SBUTTERFLY wd,  %7, %8, %6
     SBUTTERFLY dq,  %1, %3, %8
     SBUTTERFLY dq,  %9, %2, %3
     SBUTTERFLY dq,  %5, %7, %2
     SBUTTERFLY dq,  %4, %6, %7
     SBUTTERFLY qdq, %1, %5, %6
     SBUTTERFLY qdq, %9, %4, %5
     SBUTTERFLY qdq, %8, %2, %4
     SBUTTERFLY qdq, %3, %7, %2
vs
     SBUTTERFLY wd,  %1, %2, %9
     SBUTTERFLY wd,  %3, %4, %9
     SBUTTERFLY wd,  %5, %6, %9
     SBUTTERFLY wd,  %7, %8, %9
     SBUTTERFLY dq,  %1, %3, %9
     SBUTTERFLY dq,  %2, %4, %9
     SBUTTERFLY dq,  %5, %7, %9
     SBUTTERFLY dq,  %6, %8, %9
     SBUTTERFLY qdq, %1, %5, %9
     SBUTTERFLY qdq, %2, %6, %9
     SBUTTERFLY qdq, %3, %7, %9
     SBUTTERFLY qdq, %4, %8, %9
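
The two transpose listings are related by a simple rule, which can be checked with a small model. This is a sketch under assumptions, not the actual macro: the convention that `SBUTTERFLY a, b, t` leaves its results in `a` and `t` and frees the old `b` is inferred from the first listing (each line's scratch operand is the previous line's second operand), and `RegMap`, `sbutterfly` and `transpose8x8` are hypothetical names. The Python below takes the second listing's logical names, renames `b` and `t` after every step, and prints the physical registers actually touched:

```python
# Hypothetical model of assemble-time register renaming (illustration only).
# Assumed convention: SBUTTERFLY a, b, t puts its results in a and t; the
# register that held b becomes the free scratch register.
class RegMap:
    def __init__(self, count):
        # Logical name i initially lives in physical register i (1-based).
        self.phys = {i: i for i in range(1, count + 1)}

    def swap(self, x, y):
        """Exchange the physical registers behind logical names x and y."""
        self.phys[x], self.phys[y] = self.phys[y], self.phys[x]


def sbutterfly(regs, size, a, b, t, out):
    # Record the operation with the physical registers it really touches...
    out.append("SBUTTERFLY %-4s %%%d, %%%d, %%%d"
               % (size + ",", regs.phys[a], regs.phys[b], regs.phys[t]))
    # ...then rename, so the results are called a and b and the scratch is t.
    regs.swap(b, t)


def transpose8x8(regs, out):
    # Written once with logical names only, as in the second listing.
    stages = (("wd", [(1, 2), (3, 4), (5, 6), (7, 8)]),
              ("dq", [(1, 3), (2, 4), (5, 7), (6, 8)]),
              ("qdq", [(1, 5), (2, 6), (3, 7), (4, 8)]))
    for size, pairs in stages:
        for a, b in pairs:
            sbutterfly(regs, size, a, b, 9, out)


regs = RegMap(9)
emitted = []
transpose8x8(regs, emitted)
print("\n".join(emitted))
```

Under that assumed convention the printed operands are the ones in the first listing, so the bookkeeping that was done by hand there can be left to the macro layer, and the source only needs the second form.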

Worse yet, what if two implementations of butterfly have different
permutations, but must be plugged into the same higher-level code?
(mov,add,sub is faster on Core 2, whereas add,add,sub takes fewer registers
and is thus useful on x86_32.)

lavc's TRANSPOSE8 has an extra movdqa because I couldn't find any
other way to make the 32-bit and 64-bit versions have the same permutation.

--Loren Merritt



