[MPlayer-dev-eng] [RFC] disable fastmemcpy on x86-64 by default

Michael Niedermayer michaelni at gmx.at
Sun May 27 23:34:10 CEST 2007


On Sun, May 27, 2007 at 10:47:55PM +0200, Attila Kinali wrote:
> On Sun, 27 May 2007 18:19:48 +0200
> Reimar Döffinger <Reimar.Doeffinger at stud.uni-karlsruhe.de> wrote:
> 
> > Hello,
> > since SSE is part of the x86-64 architecture, at least glibc makes use
> > of it for its memcpy and some quick (and imprecise) tests indicate that
> > it's at least not slower.
> > So what do you think about the attached patch? Can someone do more precise
> > benchmarks?
> 
> Here are some benchmarks:
> 
> System:
> attila at jashugan:~ # uname -a
> Linux jashugan 2.6.18 #1 Wed Sep 27 17:50:21 CEST 2006 x86_64 GNU/Linux
> attila at jashugan:~ # cat /proc/cpuinfo 
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 15
> model           : 55
> model name      : AMD Athlon(tm) 64 Processor 3700+
> stepping        : 2
> cpu MHz         : 2202.856
> cache size      : 1024 KB
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 1
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm
> bogomips        : 4409.53
> TLB size        : 1024 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 40 bits physical, 48 bits virtual
> power management: ts fid vid ttp
> 
> attila at jashugan:~ # dpkg -s libc6|grep Version
> Version: 2.3.6.ds1-13
> attila at jashugan:~ # dpkg -s gcc|grep Version
> Version: 4:4.1.1-15
> attila at jashugan:~ # free -m
>              total       used       free     shared    buffers     cached
> Mem:          2012       1952         59          0          4       1125
> -/+ buffers/cache:        822       1190
> Swap:         7812          0       7812
> 
> 
> Graphics card is a Matrox G550, used vo: xmga
> 
> All benchmarks are best of 3 after one burn-in run, played from a local
> SATA disk (or, after the burn-in, from the page cache in RAM)
> 
> standard parameters: -quiet -nosound -benchmark
[...]
> 
> 
> I also ran single tests on a few other samples similar to benchmarks 1
> and 3 (i.e. anime encoded with the divx3, divx4, xvid and h.264 codecs);
> none of them showed a significant speed difference (<1%).
> 
> Benchmarks 2 and 5 are interesting: both are faster with the patch.
> They are also the only ones I came across that were decoded using the
> low_delay flag.
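
The "best of 3, with one burn in" procedure quoted above can be applied to a copy routine in isolation with a small C harness. This is a minimal sketch, not the setup used for the numbers above (those come from mplayer -benchmark runs); best_of() and copy_fn are hypothetical names, and clock() measures processor time rather than wall-clock time.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature of the copy routine under test (memcpy or a replacement). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Run `reps` copies of n bytes once as an untimed burn-in, then `runs`
 * more times, and return the fastest timed run in seconds of CPU time.
 * Returns -1.0 if runs < 1. */
static double best_of(copy_fn fn, void *dst, const void *src, size_t n,
                      int reps, int runs)
{
    double best = -1.0;
    for (int r = 0; r <= runs; r++) {      /* r == 0 is the burn-in */
        clock_t t0 = clock();
        for (int i = 0; i < reps; i++)
            fn(dst, src, n);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (r == 0)
            continue;                      /* warm caches, then discard */
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}
```

Usage: best_of(memcpy, dst, src, n, 100, 3) times 100 copies per run with glibc's memcpy; passing a fastmemcpy-style replacement instead compares the two under identical conditions.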

hypothesis 2

the width of the cases that are faster with the patch is not a multiple
of 64, which forces a few bytes at the end of each line of the mem2agp
copy to be copied by small_memcpy(). small_memcpy() does NOT use
non-temporal stores, which defeats the whole purpose of the mem2agp copy:
the dst is read into the cache, and reads from AGP are shitty slow
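
To make the arithmetic concrete: a copier that moves whole 64-byte blocks leaves (bytes per line mod 64) bytes over on every line. The sketch below assumes a hypothetical helper split_line() and ignores any destination-alignment prologue, which a real implementation may also route through small_memcpy().

```c
#include <stddef.h>

/* Hypothetical: how a 64-byte-block copier splits one line of `bytes`
 * bytes into the part handled by the non-temporal 64-byte loop and the
 * leftover tail that falls through to small_memcpy(). */
struct line_split {
    size_t bulk;   /* bytes copied in whole 64-byte blocks */
    size_t tail;   /* leftover bytes, always < 64 */
};

static struct line_split split_line(size_t bytes)
{
    struct line_split s;
    s.tail = bytes & 63;           /* same as bytes % 64 */
    s.bulk = bytes - s.tail;
    return s;
}
```

For example, a 720-byte line splits into 704 bulk bytes and a 16-byte tail, while a 640-byte line has no tail at all; with one tail per line, a 576-line plane performs 576 cached tail copies per frame.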

experiment to test this: comment out the small_memcpy() call

fix, if this is the problem: put a non-temporal memcpy for sizes
<64 there ...
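
One possible shape for such a non-temporal small copy, as a sketch only: nt_small_memcpy() is a hypothetical name, not MPlayer's code. It uses the SSE2 intrinsic _mm_stream_si32 (MOVNTI) for whole 4-byte words and falls back to ordinary stores for the unaligned head and the last 0..3 bytes, since SSE2 has no simple non-temporal byte store. It assumes line starts are at least 4-byte aligned, so after a 64-byte-multiple bulk the tail is aligned too.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <stdint.h>
#include <emmintrin.h>
#endif

/* Hypothetical replacement for small_memcpy() on the tail of a line:
 * copies n bytes (typically < 64) using non-temporal 4-byte stores where
 * possible. Without SSE2 it degrades to plain memcpy(). */
static void nt_small_memcpy(void *to, const void *from, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = to;
    const unsigned char *s = from;
    /* Plain stores until d is 4-byte aligned (no-op if already aligned). */
    while (n > 0 && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        n--;
    }
    while (n >= 4) {
        int w;
        memcpy(&w, s, sizeof w);          /* unaligned-safe 4-byte load */
        _mm_stream_si32((int *)d, w);     /* MOVNTI: bypasses the cache */
        d += 4;
        s += 4;
        n -= 4;
    }
    while (n > 0) {                       /* last 0..3 bytes */
        *d++ = *s++;
        n--;
    }
    _mm_sfence();  /* make the streaming stores globally visible */
#else
    memcpy(to, from, n);
#endif
}
```

The per-call sfence is conservative; a copier that already issues one sfence after the whole picture could drop it here.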

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Opposition brings concord. Out of discord comes the fairest harmony.
-- Heraclitus