[FFmpeg-devel] Once again: Multithreaded H.264 decoding with ffmpeg?

Loren Merritt lorenm
Sun Jun 1 20:46:02 CEST 2008

On Sat, 31 May 2008, Michael Niedermayer wrote:

> The question is how many more things one could optimize by not forcing to
> use the same source for 32 and 64bit.
> Switching between register names is one thing but trying to use common code
> where one case has 8 registers and one has 16 just doesn't look like such
> a clear case. Sometimes it's likely better to have common code, but always?
> Now if it's better in yasm to have common code for a specific function why
> does that also have to be the best way in gcc asm?

If it's better in yasm but not gcc to have common code for a specific 
function, then that means yasm successfully avoided code duplication while 
gcc's limitations made it impossible or unwieldy, which is a vote for yasm.

> I am really a little curious if cleanly written yasm code is so much superior
> over cleanly written gcc inline asm code. I certainly am no fan of gcc or
> its asm, it's mainly the extra dependency and the loss of support for many
> platforms that annoys me most on this ...

Which platforms?

> TRANSPOSE8 is used at 2 spots ...
>        TRANSPOSE8(%%xmm4, %%xmm1, %%xmm7, %%xmm3, %%xmm5, %%xmm0, %%xmm2, %%xmm6, (%1))
>        "paddw          %4, %%xmm4 \n"
>        "movdqa     %%xmm4, 0x00(%1) \n"
>        "movdqa     %%xmm2, 0x40(%1) \n"
>        H264_IDCT8_1D_SSE2(%%xmm4, %%xmm0, %%xmm6, %%xmm3, %%xmm2, %%xmm5, %%xmm7, %%xmm1)
>        "movdqa     %%xmm6, 0x60(%1) \n"
>        "movdqa     %%xmm7, 0x70(%1) \n"
> These movdqa are not needed on x86-64 and I suspect that by not using "common"
> code their number can be reduced on x86-32, more precisely the second looks
> like it could be merged with something from TRANSPOSE8.

Agreed. In x264 I have separate x86_32 and x86_64 versions of 8x8 dct. 
But in lavc I just wanted to do as little gcc-asm writing as possible, so 
I stopped after writing the minimal x86_32 version which can be 
compiled on x86_64 but doesn't make much use of the extra registers.
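
A minimal sketch of what such a 32/64 split can look like in gcc-asm style, assuming a hypothetical PARK macro and a scratch buffer addressed as 0x40(%1). Neither is taken from lavc or x264, and the register choices are arbitrary. The asm text is only built as a string here so that the expansion can be inspected on any platform; in real code it would sit inside an asm() statement.

```c
#include <assert.h>
#include <string.h>

/* x86-64 has xmm8..xmm15 available; x86-32 only has xmm0..xmm7. */
#if defined(__x86_64__)
#  define HAVE_16_XMM_REGS 1
/* Park a value in a spare high register: no memory traffic needed. */
#  define PARK(reg, spare, slot) "movdqa " #reg ", " #spare " \n\t"
#else
#  define HAVE_16_XMM_REGS 0
/* Only 8 XMM registers (or a non-x86 target): spill to the scratch buffer. */
#  define PARK(reg, spare, slot) "movdqa " #reg ", " #slot " \n\t"
#endif

/* Hypothetical call site: the same source line yields a register move on
 * x86-64 and a store to memory elsewhere. */
const char *park_row_asm(void)
{
    const char *text = PARK(%%xmm2, %%xmm8, 0x40(%1));
    assert(strncmp(text, "movdqa", 6) == 0); /* both variants are one movdqa */
    return text;
}

/* Reports which variant the preprocessor selected for this build. */
int has_16_xmm_regs(void)
{
    return HAVE_16_XMM_REGS;
}
```

The call site stays identical in both builds; only the macro definition differs, which is the trade-off between shared source and per-target tuning being discussed here.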

> Also the question of readability has been ignored entirely, is all the
> preprocessor magic, be it yasm or C, really that good?
> You use a lot of preprocessor tricks in your gcc-asm, I just thought it
> might be more flexible and readable with a little less.
> After all the code would be the same after the preprocessor anyway.

What is your alternative? Write code using preprocessor tricks but then
manually expand them before committing? Anything that reduces code
duplication is a win in terms of ease of writing (no matter how much
magic is involved), but I can understand optimizing for reading at the
expense of writing if you're reasonably sure that the function will
never change again.
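
To make the trade-off concrete, here is a small sketch of the kind of preprocessor trick in question. The SUMSUB_BA macro and the helper functions below are illustrative assumptions, not code quoted from this thread. The macro writes a three-instruction butterfly once and reuses it with different registers. The result is returned as a string rather than executed, so it can be inspected without MMX hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative macro in gcc-asm (AT&T) operand order: source, destination.
 * After the three instructions, a holds a+b and b holds b-a
 * (per 16-bit lane, using the original values of a and b). */
#define SUMSUB_BA(a, b) \
    "paddw " #b ", " #a " \n\t" \
    "paddw " #b ", " #b " \n\t" \
    "psubw " #a ", " #b " \n\t"

/* Two butterflies, each written as one macro call; the preprocessor pastes
 * the six instructions into a single string literal for an asm() statement. */
const char *butterfly_pair_asm(void)
{
    return SUMSUB_BA(%%mm0, %%mm1)
           SUMSUB_BA(%%mm2, %%mm3);
}

/* Count instructions by counting the newline that terminates each one. */
size_t count_instructions(const char *text)
{
    size_t n = 0;
    assert(text != NULL);
    for (; *text; text++)
        n += (*text == '\n');
    return n;
}
```

Expanding the macro by hand would give the compiler exactly the same six instructions; the difference is only in what a human has to read and maintain.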

> And last, ultra-finetuned common 64-32 code has another problem. That is
> when one wants to change/optimize the code but she does not have both a 32 and
> 64 bit cpu. It could easily lead to a speed loss or considerably more
> work waiting for others to do the benchmarking.

Essentially all asm I've written in the past 3 years was optimized for 64 
and for 64-in-32bit-mode, not for any 32bit cpu, so I guess that doesn't 
count as ultra finetuned. If you optimize for a specific old cpu and have 
reason to believe your change hurts new cpus, then that's another split, 
not just 32-64. If you don't have specific reason but just don't have any 
64bit cpus to test on, then you not only have code duplication but 
non-identical duplication without even being sure that the differences are 
beneficial.
If every difference between two near-duplicate functions is documented as 
to which cpus it's been tested on and the results thereof (what's the 
chance of that?), then my argument on this point is reduced.

> So in the end IMHO maybe less preprocessor based asm code factorization
> would be a better solution than yasm, just my 2 cents, I am not opposing yasm
> if people really want it ...

Better? It's a solution to a different problem. I'm asking for yasm so I 
can do more preprocessor stuff.
Well, syntax is another reason. I'd prefer
   pshufw mm0, [eax+ecx*4+16], 0
over
   "pshufw $0, 16(%%eax,%%ecx,4), %%mm0 \n\t"\
even if that were the only difference.

--Loren Merritt
