[FFmpeg-devel] Once again: Multithreaded H.264 decoding with ffmpeg?

Sat May 31 13:58:00 CEST 2008

On Fri, 30 May 2008, Michel Lespinasse wrote:
> On Fri, May 30, 2008 at 02:26:21PM -0600, Jason Garrett-Glaser wrote:
>> Main benefit of yasm:
>>
>> * vastly more powerful macro system that makes it far easier to
>> generalize a small function to dozens of specific cases
>>
>> This allows us to do the following:
>>
>> * abstraction between MMX and SSE code; write a single function that does both
>> * automatic handling of macros that permute their arguments; see
>> x264's DCT functions
>> * automatic handling of 32-bit vs 64-bit abstraction
>
> I think the above can also be achieved using mmx.h and the C preprocessor.
> At least that's what I used in libmpeg2's IDCT code.

Which point are you responding to?

Abstraction between MMX and SSE can be done in gcc, but it's more complex. 
Several things need to be defined (register prefix, register size, 
movdqu, movdqa, movq), gcc doesn't support #define inside macros, and 
gcc warns about redefinitions, so every switch between MMX and SSE costs 
a bunch of lines. Plus there are extra ugly quotes all over, since defines 
only apply in C context, not inside asm strings. Alternatively, instead of 
global defines, you can add all of those parameters to every macro, which 
is fewer LOC but hardly cleaner.

Macros that permute their arguments are simply impossible in gcc; they 
require %xdefine. So the only alternative is to not permute the arguments 
and keep track of all permutations by hand.

In case anyone is unclear about what I mean by permute, this is the 
difference between
     QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
     QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
     QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
     QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
     QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
     QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
     QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
     QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
     QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
     QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
     QPEL_H264V(%%mm4, %%mm5, %%mm0, %%mm1, %%mm2, %%mm3, OP)\
     QPEL_H264V(%%mm5, %%mm0, %%mm1, %%mm2, %%mm3, %%mm4, OP)\
     QPEL_H264V(%%mm0, %%mm1, %%mm2, %%mm3, %%mm4, %%mm5, OP)\
     QPEL_H264V(%%mm1, %%mm2, %%mm3, %%mm4, %%mm5, %%mm0, OP)\
     QPEL_H264V(%%mm2, %%mm3, %%mm4, %%mm5, %%mm0, %%mm1, OP)\
     QPEL_H264V(%%mm3, %%mm4, %%mm5, %%mm0, %%mm1, %%mm2, OP)\
and
     %rep 16
     QPEL_H264V m0, m1, m2, m3, m4, m5, OP
     SWAP 0, 1, 2, 3, 4, 5
     %endrep

32-bit vs 64-bit is of course handled automatically by gcc. It was included 
only to show that yasm can do what gcc does automatically, while the 
reverse is not true.

--Loren Merritt