[FFmpeg-devel] Once again: Multithreaded H.264 decoding with ffmpeg?
Loren Merritt
lorenm
Sat May 31 14:56:31 CEST 2008
On Sat, 31 May 2008, Michael Niedermayer wrote:
>> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
>> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
>> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
>> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
>> QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
>> QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
>> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
>> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
>> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
>> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
>> QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
>> QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
>> QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
>> QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
>> QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
>> QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
>
> #define F2(a,b,c,d,e,f) f (a,b,c,d,e,f) f (b,c,d,e,f,a)
> #define F3(a,b,c,d,e,f) F2(a,b,c,d,e,f) F2(c,d,e,f,a,b) F2(e,f,a,b,c,d)
> F3(a,b,c,d,e,f)
> F3(a,b,c,d,e,f)
> F2(a,b,c,d,e,f) F2(c,d,e,f,a,b)
>
> Whether that's more readable, I dunno; whether yasm is more readable, I
> dunno either. Especially for someone not familiar with yasm, it could
> likely prove confusing.
That requires a new define for every different permutation, and it often
isn't in a loop.
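
For reference, the quoted F2/F3 idea can be made to compile roughly as follows. This is only a sketch: STEP, ROT2, ROT6, rotation_order and rotation_matches are hypothetical names, and STEP stands in for QPEL_H264V by recording the register order as a string instead of emitting asm. The callee gets a fixed name here because in the quoted form the macro parameter f would shadow a callee also named f.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for QPEL_H264V: records the register order
 * as a string such as "012345" instead of emitting asm. */
#define STEP(a,b,c,d,e,f) #a #b #c #d #e #f,

/* Same shape as the quoted F2/F3, with a fixed callee name (STEP). */
#define ROT2(a,b,c,d,e,f) STEP(a,b,c,d,e,f) STEP(b,c,d,e,f,a)
#define ROT6(a,b,c,d,e,f) ROT2(a,b,c,d,e,f) ROT2(c,d,e,f,a,b) ROT2(e,f,a,b,c,d)

/* 6 + 6 + 2 + 2 = 16 entries, one per QPEL_H264V line in the original. */
static const char *const rotation_order[] = {
    ROT6(0,1,2,3,4,5)
    ROT6(0,1,2,3,4,5)
    ROT2(0,1,2,3,4,5) ROT2(2,3,4,5,0,1)
};

#define ROTATION_COUNT (sizeof rotation_order / sizeof rotation_order[0])

/* Returns nonzero if the i-th generated call lists the registers in
 * the order given by `expected`, e.g. "123450" for mm1..mm5,mm0. */
static int rotation_matches(size_t i, const char *expected)
{
    assert(i < ROTATION_COUNT);
    return strcmp(rotation_order[i], expected) == 0;
}
```

This covers exactly one permutation, the rotate-by-one. Any other register shuffle would need its own set of ROTn-style defines.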
Try
LOAD m0, m1, m2, m3, m4, m5, m6, m7
IDCT8_1D m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
TRANSPOSE8x8W m8, m1, m7, m3, m4, m0, m2, m6, m5
IDCT8_1D m8, m0, m6, m3, m5, m4, m7, m1, m9, m2
STORE m9, m0, m1, m3, m5, m8, m6, m7
vs
LOAD m0, m1, m2, m3, m4, m5, m6, m7
IDCT8_1D m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
TRANSPOSE8x8W m0, m1, m2, m3, m4, m5, m6, m7, m8
IDCT8_1D m0, m1, m2, m3, m4, m5, m6, m7, m8, m9
STORE m0, m1, m2, m3, m4, m5, m6, m7
or
SBUTTERFLY wd, %1, %2, %9
SBUTTERFLY wd, %3, %4, %2
SBUTTERFLY wd, %5, %6, %4
SBUTTERFLY wd, %7, %8, %6
SBUTTERFLY dq, %1, %3, %8
SBUTTERFLY dq, %9, %2, %3
SBUTTERFLY dq, %5, %7, %2
SBUTTERFLY dq, %4, %6, %7
SBUTTERFLY qdq, %1, %5, %6
SBUTTERFLY qdq, %9, %4, %5
SBUTTERFLY qdq, %8, %2, %4
SBUTTERFLY qdq, %3, %7, %2
vs
SBUTTERFLY wd, %1, %2, %9
SBUTTERFLY wd, %3, %4, %9
SBUTTERFLY wd, %5, %6, %9
SBUTTERFLY wd, %7, %8, %9
SBUTTERFLY dq, %1, %3, %9
SBUTTERFLY dq, %2, %4, %9
SBUTTERFLY dq, %5, %7, %9
SBUTTERFLY dq, %6, %8, %9
SBUTTERFLY qdq, %1, %5, %9
SBUTTERFLY qdq, %2, %6, %9
SBUTTERFLY qdq, %3, %7, %9
SBUTTERFLY qdq, %4, %8, %9
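
The two SBUTTERFLY listings appear to describe the same 12 operations. Here is a small C model of how the second could map onto the first. It rests on an assumption that is not stated in this message: that each SBUTTERFLY ends by exchanging the names of its second operand and its scratch register. RegNames, PhysOperands, sbutterfly and transpose8x8_physical are hypothetical names for this sketch only.

```c
#include <assert.h>

/* name_to_phys[n] is the physical register currently called %n.
 * Index 0 is unused so the indices match the 1-based names %1..%9. */
typedef struct { int name_to_phys[10]; } RegNames;

/* Physical registers touched by one call. */
typedef struct { int a, b, tmp; } PhysOperands;

static void reg_names_init(RegNames *r)
{
    for (int i = 0; i < 10; i++)
        r->name_to_phys[i] = i;
}

/* Models one SBUTTERFLY call: reports the physical registers it uses,
 * then exchanges the names b and tmp (ASSUMED behaviour, see above). */
static PhysOperands sbutterfly(RegNames *r, int a, int b, int tmp)
{
    assert(a >= 1 && a <= 9 && b >= 1 && b <= 9 && tmp >= 1 && tmp <= 9);
    PhysOperands p = { r->name_to_phys[a], r->name_to_phys[b],
                       r->name_to_phys[tmp] };
    int t = r->name_to_phys[b];
    r->name_to_phys[b] = r->name_to_phys[tmp];
    r->name_to_phys[tmp] = t;
    return p;
}

/* Replays the 12 calls of the second listing (scratch is always %9)
 * and stores the physical operands of each call in out[]. */
static void transpose8x8_physical(PhysOperands out[12])
{
    static const int calls[12][2] = {
        {1,2}, {3,4}, {5,6}, {7,8},   /* wd  */
        {1,3}, {2,4}, {5,7}, {6,8},   /* dq  */
        {1,5}, {2,6}, {3,7}, {4,8},   /* qdq */
    };
    RegNames r;
    reg_names_init(&r);
    for (int i = 0; i < 12; i++)
        out[i] = sbutterfly(&r, calls[i][0], calls[i][1], 9);
}
```

Under that assumption, the physical operands come out as (1,2,9), (3,4,2), (5,6,4), (7,8,6), (1,3,8), (9,2,3) and so on. That matches the hand-permuted first listing line for line.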
Worse yet, what if 2 implementations of butterfly have different
permutations, but must be plugged into the same higher level code?
(mov,add,sub is faster on core2 whereas add,add,sub takes fewer regs and
is thus useful on x86_32.)
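
To make the "different permutations" concrete, here is a scalar C sketch of the two butterfly variants. The exact operand order of each instruction is not given above, so the sequences below are one plausible guess. ButterflyResult and both function names are hypothetical, and regs[] stands in for the register file.

```c
#include <assert.h>

/* Which register index ends up holding a+b and which holds a-b. */
typedef struct { int sum_reg, diff_reg; } ButterflyResult;

/* mov, add, sub: needs a scratch register t (assumed operand order). */
static ButterflyResult butterfly_mov_add_sub(int regs[], int a, int b, int t)
{
    assert(a != b && a != t && b != t);
    regs[t]  = regs[a];   /* mov t, a                         */
    regs[a] += regs[b];   /* add a, b  -> a = a + b           */
    regs[t] -= regs[b];   /* sub t, b  -> t = a - b           */
    return (ButterflyResult){ a, t };
}

/* add, add, sub: no scratch register (assumed operand order). */
static ButterflyResult butterfly_add_add_sub(int regs[], int a, int b)
{
    assert(a != b);
    regs[b] += regs[a];   /* add b, a  -> b = a + b           */
    regs[a] += regs[a];   /* add a, a  -> a = 2a              */
    regs[a] -= regs[b];   /* sub a, b  -> a = 2a - (a+b) = a-b */
    return (ButterflyResult){ b, a };
}
```

Both compute the same sum and difference but leave them in different registers, so code that calls either one has to know which layout it got.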
lavc's TRANSPOSE8 has an extra movdqa because I couldn't find any
other way to make the 32-bit and 64-bit versions have the same permutation.
--Loren Merritt