[Ffmpeg-devel] [PATCH] (4) building with --disable-opts on i386
Marco Manfredini
mldb
Sun Aug 13 17:25:41 CEST 2006
On Sunday 13 August 2006 00:50, Michael Niedermayer wrote:
> > the optimizer
> > should remove this.
>
> can you check that it really does? gcc -S should produce compiled but not
> assembled output ...
I've written a transpose8x8 routine using transpose4x4 to study the output.
The assembly revealed that none of my compilers removed the temporaries. I found
that strange, so I replaced the "m" constraint with the "X" constraint and
got completely different code that performed much better. Replacing the
input constraints of the original routine with "X", however, turns out to
produce *worse* code.
I compared the runtimes of 1 billion transpose8x8 calls on in-cache data. Tests
were done on two Suse 10.0 installations. "Modified Routine" means the patch
plus "X" constraints instead of "m" for in_*.
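For readers who want to reproduce this kind of measurement, here is a minimal sketch of such a harness in portable C. It is not the routine that was timed above: transpose4x4_ref, transpose8x8_ref and time_transpose8x8 are hypothetical names, the 4x4 step is plain C rather than MMX inline asm, and the 8x8 tiling is only one plausible way to build it from 4x4 blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Plain C stand-in for the MMX transpose4x4 (assumption: not the FFmpeg code). */
static void transpose4x4_ref(uint8_t *dst, const uint8_t *src,
                             int dst_stride, int src_stride)
{
    assert(dst != src); /* in-place transposition is not handled here */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[x * dst_stride + y] = src[y * src_stride + x];
}

/* 8x8 transpose from four 4x4 tiles: source tile (r,c) lands at (c,r). */
static void transpose8x8_ref(uint8_t *dst, const uint8_t *src,
                             int dst_stride, int src_stride)
{
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++)
            transpose4x4_ref(dst + 4 * c * dst_stride + 4 * r,
                             src + 4 * r * src_stride + 4 * c,
                             dst_stride, src_stride);
}

/* Time `iterations` transposes on two 64-byte buffers that stay in cache.
 * Returns elapsed CPU time in seconds as reported by clock(). */
static double time_transpose8x8(long iterations)
{
    static uint8_t src[64], dst[64];
    volatile uint8_t sink = 0;

    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t)i;

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        src[i & 63] ^= (uint8_t)i;        /* vary the input between calls */
        transpose8x8_ref(dst, src, 8, 8);
        sink ^= dst[i & 63];              /* keep the result observable */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_transpose8x8(1000000000L) once per compiler and flag combination would give figures comparable in spirit to the ones below, though an optimizer may still simplify a loop like this, so the generated assembly should be checked with gcc -S as suggested above.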
Athlon XP 2000+
Original Routine: 20.59 sec (gcc-3.4.6 -O3)
Modified Routine: 16.03 sec (gcc-3.4.6 -O3)
Original Routine: 17.00 sec (gcc-4.0.4 -O3)
Modified Routine: 19.75 sec (gcc-4.0.3 -O3)
Original Routine: 16.80 sec (gcc-4.1.1 -O3)
Modified Routine: 16.80 sec (gcc-4.1.1 -O3)
Pentium-4 (2800 MHz) with EM64T
Original Routine: 27.48 sec (gcc-4.0.3 -O3)
Original Routine: 20.51 sec (gcc-4.0.3 -O3 -m32)
Modified Routine: 22.36 sec (gcc-4.0.3 -O3) (!!)
Modified Routine: 20.80 sec (gcc-4.0.3 -m32 -O3)
Original Routine: 27.56 sec (gcc-4.1.1 -O3)
Original Routine: 20.59 sec (gcc-4.1.1 -O3 -m32)
Modified Routine: 22.42 sec (gcc-4.1.1 -O3) (!!)
Modified Routine: 20.84 sec (gcc-4.1.1 -m32 -O3)
- The sentence "The optimizer should remove this" is true for the gcc-4
release series and even truer for the case of the P4
- I tried to get numbers for a dual-core MacTel with Apple's gcc-4.0.1, but it
turned out that in the context of my routine the original transpose4x4
suffered register starvation even *with* -O3!
- I checked transcoding a 30 MB file 10 times with "ffmpeg -y -flags +bitexact
-dct fastint -idct simple -y -qscale 10 -i monty.avi -vcodec rv20 -an
monty.rm". This is 5% slower with the original patch and has the same
runtime with the modified patch, but only on a 4.* compiler. The 3.* and
2.* series perform faster with the original routine.
This is a bit discouraging. I see no way to fix this without either risking
performance degradation or switching between optimized and unoptimized
builds.
One observation is that the register spill does not happen if the passed
values are return values from a function call. This would make the following
pattern possible:
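#ifdef DEBUG_BUILD
/* unoptimized build: route each input through a real function call */
static inline uint32_t fix_uint32(uint32_t t) { return t; }
#else
/* optimized build: the wrapper disappears entirely */
#define fix_uint32(X) (X)
#endif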
int src_stride){
asm volatile( //FIXME could save 1 instruction if done as 8x4 ...
"movd %4, %%mm0 \n\t"
"movd %5, %%mm1 \n\t"
"movd %6, %%mm2 \n\t"
"movd %7, %%mm3 \n\t"
"punpcklbw %%mm1, %%mm0 \n\t"
"punpcklbw %%mm3, %%mm2 \n\t"
"movq %%mm0, %%mm1 \n\t"
"punpcklwd %%mm2, %%mm0 \n\t"
"punpckhwd %%mm2, %%mm1 \n\t"
"movd %%mm0, %0 \n\t"
"punpckhdq %%mm0, %%mm0 \n\t"
"movd %%mm0, %1 \n\t"
"movd %%mm1, %2 \n\t"
"punpckhdq %%mm1, %%mm1 \n\t"
"movd %%mm1, %3 \n\t"
: "=m" (*(uint32_t*)(dst + 0*dst_stride)),
"=m" (*(uint32_t*)(dst + 1*dst_stride)),
"=m" (*(uint32_t*)(dst + 2*dst_stride)),
"=m" (*(uint32_t*)(dst + 3*dst_stride))
: "m" (fix_uint32(*(uint32_t*)(src + 0*src_stride))), /* "fix" input
if it doesn't compile in -O0 */
"m" (fix_uint32(*(uint32_t*)(src + 1*src_stride))),
"m" (fix_uint32(*(uint32_t*)(src + 2*src_stride))),
"m" (fix_uint32(*(uint32_t*)(src + 3*src_stride)))
);
}
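To make the switch itself easier to try in isolation, here is a minimal sketch in plain C without the inline asm. It only shows that both configurations compile and yield the same values; it does not reproduce the spill behaviour described above. load_row32 is a hypothetical helper introduced for illustration, and DEBUG_BUILD is assumed to be set by the build system (for example with -DDEBUG_BUILD).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef DEBUG_BUILD
/* Unoptimized builds: a real call, so the operand is a function return value. */
static inline uint32_t fix_uint32(uint32_t t) { return t; }
#else
/* Optimized builds: the wrapper expands to the bare expression. */
#define fix_uint32(X) (X)
#endif

/* Hypothetical helper: read the first four bytes of a row as one 32-bit word
 * and pass it through the wrapper, mirroring how the asm inputs are wrapped. */
static uint32_t load_row32(const uint8_t *src, int stride, int row)
{
    uint32_t v;

    assert(src != NULL);
    memcpy(&v, src + row * stride, sizeof v); /* avoids an unaligned cast */
    return fix_uint32(v);
}
```

Compiling this once with -O0 -DDEBUG_BUILD and once with -O3 exercises both branches of the #ifdef.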
What do you think?
Marco