[Ffmpeg-devel] gcc4 support & MMX fixups (from Debian)
Aurelien Jacobs
aurel
Wed Feb 1 00:01:14 CET 2006
On Tue, 31 Jan 2006 23:37:04 +0100
Paweł Sikora <pluto at pld-linux.org> wrote:
> On Tuesday, 31 January 2006 21:25, matthieu castet wrote:
> > Hi Paweł,
> >
> > Paweł Sikora wrote:
> > > Hi all,
> > >
> > > I have an implementation of transpose4x4 in C which uses gcc's vector
> > > extensions. It puts less pressure on the register allocator and lets
> > > the compiler schedule the code optimally.
> > >
> > > Instantiating the attached patch, e.g. as foo(dst, src, 4, 4),
> > > gives a nice piece of code:
> > >
> > > [ x86-64 example ]
> > >
> > > foo:
> > >         movd       4(%rsi), %mm0
> > >         movd       (%rsi), %mm1
> > >         movd       8(%rsi), %mm2
> > >         movd       12(%rsi), %mm3
> > >         punpcklbw  %mm0, %mm1
> > >         punpcklbw  %mm3, %mm2
> > >         movq       %mm1, %mm0
> > >         punpckhwd  %mm2, %mm1
> > >         punpcklwd  %mm2, %mm0
> > >         movd       %mm1, 8(%rdi)
> > >         punpckhdq  %mm1, %mm1
> > >         movd       %mm0, (%rdi)
> > >         punpckhdq  %mm0, %mm0
> > >         movd       %mm1, 12(%rdi)
> > >         movd       %mm0, 4(%rdi)
> > >         ret
> > >
> > > Actually, gcc-4.1 has a good optimizer and generates good asm.
> > > Hardcoding the asm doesn't bring any great performance boost; it only
> > > degrades code scheduling.
> >
> > Could you post a benchmark comparing the two versions?
>
> I did a simple benchmark with transpose4x4 marked with attribute noinline.
>
> results:
>
> orig: iters = 1000000000, dt = 7.92 [avg]
> fixed: iters = 1000000000, dt = 7.35 [avg]
>
> we gain: ~7.2%
That sounds interesting, but here, with gcc-4.0.2 on amd64, I get
rather different results:
orig: iters = 1000000000, dt = 12.16
fixed: iters = 1000000000, dt = 173.86
So it seems that gcc-4.1 brings a spectacular improvement in this area,
but this code really shouldn't be enabled with gcc-4.0.
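One way to express that restriction would be a compile-time version
check. The following is only a sketch: the macro name
USE_VECTOR_TRANSPOSE is made up for illustration and is not part of the
patch or of ffmpeg. Note that other compilers which define __GNUC__
will also be evaluated by this check.

```c
/* Hypothetical guard: enable the vector-extension transpose only when the
 * compiler reports itself as gcc 4.1 or newer. */
#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
#define USE_VECTOR_TRANSPOSE 1
#else
#define USE_VECTOR_TRANSPOSE 0
#endif
```

The existing hand-written MMX version would then stay in the
USE_VECTOR_TRANSPOSE == 0 branch.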
Aurel
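
For reference, the unpack sequence in the quoted listing can be modelled
in plain C. This is a minimal sketch, not the attached patch (which is
not reproduced here): the helper names, the function name
transpose4x4_sketch and the order of the two stride arguments are
assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Models punpcklbw: interleave four bytes of a with four bytes of b,
 * giving a0 b0 a1 b1 a2 b2 a3 b3. */
static void unpack_lo_bytes(uint8_t out[8],
                            const uint8_t a[4], const uint8_t b[4])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Models punpcklwd (half = 0) or punpckhwd (half = 1): interleave the
 * 16-bit words from the chosen half of a and b. */
static void unpack_words(uint8_t out[8],
                         const uint8_t a[8], const uint8_t b[8], int half)
{
    for (int i = 0; i < 2; i++) {
        memcpy(out + 4 * i,     a + 4 * half + 2 * i, 2);
        memcpy(out + 4 * i + 2, b + 4 * half + 2 * i, 2);
    }
}

/* Transpose a 4x4 block of bytes. The stride order (dst first, then src)
 * is an assumption; the quoted call passes 4 for both. */
static void transpose4x4_sketch(uint8_t *dst, const uint8_t *src,
                                int dst_stride, int src_stride)
{
    uint8_t ab[8], cd[8], lo[8], hi[8];

    unpack_lo_bytes(ab, src, src + src_stride);                    /* rows 0, 1 */
    unpack_lo_bytes(cd, src + 2 * src_stride, src + 3 * src_stride); /* rows 2, 3 */
    unpack_words(lo, ab, cd, 0);   /* columns 0 and 1 of the input */
    unpack_words(hi, ab, cd, 1);   /* columns 2 and 3 of the input */

    memcpy(dst,                  lo,     4);
    memcpy(dst + dst_stride,     lo + 4, 4);
    memcpy(dst + 2 * dst_stride, hi,     4);
    memcpy(dst + 3 * dst_stride, hi + 4, 4);
}
```

Each output row is one input column, which is what the movd stores to
(%rdi), 4(%rdi), 8(%rdi) and 12(%rdi) write out in the listing.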