[Ffmpeg-devel] [PATCH] (4) building with --disable-opts on i386

Sun Aug 13 17:25:41 CEST 2006

On Sunday 13 August 2006 00:50, Michael Niedermayer wrote:

> > the optimizer
> > should remove this.
>
> can you check that it really does? gcc -S should produce compiled but not
> assembled output ...

I've written a transpose8x8 routine using transpose4x4 to study the output. 
Assembly revealed that none of my compilers removed the temporaries. I found 
that strange and replaced the "m" constraint with the "X" constraint and 
found completely different code, that performed much better. Replacing the 
input constraints of the original routine with "X" turns out to produce 
*worser* code.

I compared the runtimes of 1 Billion transpose8x8 on in-cache data. Tests were 
done on two Suse 10.0 installations. "Modified Routine" means the patch + "X" 
constraints instead of "m" for in_*.

Athlon XP 2000+ 
Original Routine: 20.59 sec (gcc-3.4.6 -O3)
Modified Routine: 16.03 sec (gcc-3.4.6 -O3)
Original Routine: 17.00 sec (gcc-4.0.4 -O3)
Modified Routine: 19.75 sec (gcc-4.0.3 -O3)
Original Routine: 16.80 sec (gcc-4.1.1 -O3)
Modified Routine: 16.80 sec (gcc-4.1.1 -O3)

Pentium-4 (2800MHZ) with EM64T
Original Routine: 27.48 sec (gcc-4.0.3 -O3)
Original Routine: 20.51 sec (gcc-4.0.3 -O3 -m32)
Modified Routine: 22.36 sec (gcc-4.0.3 -O3) (!!)
Modified Routine: 20.80 sec (gcc-4.0.3 -m32 -O3)

Original Routine: 27.56 sec (gcc-4.1.1 -O3)
Original Routine: 20.59 sec (gcc-4.1.1 -O3 -m32)
Modified Routine: 22.42 sec (gcc-4.1.1 -O3) (!!)
Modified Routine: 20.84 sec (gcc-4.1.1 -m32 -O3)

- The sentence "The optimizer should remove this" is true for the gcc-4 
release series and even truer for the case of the P4
- I tried to get numbers for a dual-core MacTel and Apples gcc-4.0.1, but it 
turned out that in the context of my routine the original transpose4x4 
suffered register starvation even *with* -O3!
- I checked transcoding a 30 Meg 10 times with "ffmpeg -y -flags +bitexact 
-dct fastint -idct simple -y -qscale 10 -i monty.avi -vcodec rv20 -an 
monty.rm". This is 5% slower with the original patch and has the same 
runtimes with the modified patch - but only on a 4.* compiler. The 3.* and 
2.* series perform faster with the original routine.

This is a bit discouraging. I see no way to fix this, without either risk 
performance degradation or switching between optimized and unoptimized 
builds.

An observation is, that the register spill does not happen, if the passed 
values are return values from function call. This would make the following 
pattern possible: 

#ifdef DEBUG_BUILD
static inline uint32_t fix_uint32_t(uint32_t t) { return t; }
#else 
#define fix_uint32(X) X
#endif

int src_stride){
    asm volatile( //FIXME could save 1 instruction if done as 8x4 ...
        "movd  %4, %%mm0                \n\t"
        "movd  %5, %%mm1                \n\t"
        "movd  %6, %%mm2                \n\t"
        "movd  %7, %%mm3                \n\t"
        "punpcklbw %%mm1, %%mm0         \n\t"
        "punpcklbw %%mm3, %%mm2         \n\t"
        "movq %%mm0, %%mm1              \n\t"
        "punpcklwd %%mm2, %%mm0         \n\t"
        "punpckhwd %%mm2, %%mm1         \n\t"
        "movd  %%mm0, %0                \n\t"
        "punpckhdq %%mm0, %%mm0         \n\t"
        "movd  %%mm0, %1                \n\t"
        "movd  %%mm1, %2                \n\t"
        "punpckhdq %%mm1, %%mm1         \n\t"
        "movd  %%mm1, %3                \n\t"

        : "=m" (*(uint32_t*)(dst + 0*dst_stride)),
          "=m" (*(uint32_t*)(dst + 1*dst_stride)),
          "=m" (*(uint32_t*)(dst + 2*dst_stride)),
          "=m" (*(uint32_t*)(dst + 3*dst_stride))
        :  "m" (fix_uint32(*(uint32_t*)(src + 0*src_stride))), /* "fix" input 
if it doesn't compile in -O0 */
           "m" (fix_uint32(*(uint32_t*)(src + 1*src_stride))),
           "m" (fix_uint32(*(uint32_t*)(src + 2*src_stride))),
           "m" (fix_uint32(*(uint32_t*)(src + 3*src_stride)))
    );
}

What do you think? 
Marco