[Mplayer-cvslog] CVS: main/DOCS/tech TODO,1.9,1.10

Wed Dec 12 21:41:49 CET 2001

Hi

On Wednesday 12 December 2001 17:02, Nick Kurshev wrote:
> Hello, Michael!
[...]
> > > I know only that manuals always suggested to replace conditional jumps
> > > with direct code ;)
> >
> > yes but that isn't possible here
> > they also suggest to avoid function pointers
>
> Did you read K7 manual?
> What about:
> JMP near mreg16/32 (indirect)    DirectPath
> JMP near mem16/32 (indirect)     DirectPath
DirectPath only means that it's decoded quickly; it says nothing about how fast
it is executed or about branch prediction, afaik.
btw, for function pointers you need call / ret, and they are VectorPath

[...]
> My tests show me that on Duron:
> direct call takes 4 clocks
> indirect call takes 5 clocks
> (these clocks include measuring of loop)
> So there is only a 20% difference, which is too small to matter next to the memcpy itself.
hmm, i looked into TFM and noticed that i wasn't completely correct in my
assumption that indirect calls are that slow (both should execute in
about 2 cycles on the ppro, p3, ... cpus) ... well, but it's slower in the
benchmark ... looking at the asm output ...

with if / else gcc generates code like
test ...
 jz L1
(function1)
 jmp L2
L1:
(function2)
L2:

with function pointers
movl ..., %eax
call *%eax

...

L1:
pushl %ebp
movl %esp,%ebp
(function1)
leave
ret

L2:
pushl %ebp
movl %esp,%ebp
(function2)
leave
ret

quite a bit longer and slower indeed
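
A rough C sketch of the two dispatch styles compared above, for readers who want to look at the compiler output themselves. This is not the actual MPlayer code: copy_variant1, copy_variant2, copy_branch, copy_select and copy_indirect are made-up names, and both variants simply forward to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-ins for "function1" and "function2" (hypothetical names). */
static void *copy_variant1(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *copy_variant2(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Style 1: if / else. The compiler is free to inline both bodies,
 * which gives the test / jz / jmp layout. */
static void *copy_branch(int use_variant1, void *dst, const void *src, size_t n)
{
    if (use_variant1)
        return copy_variant1(dst, src, n);
    else
        return copy_variant2(dst, src, n);
}

/* Style 2: pick a function pointer once, then call through it. */
static copy_fn copy_ptr;

static void copy_select(int use_variant1)
{
    copy_ptr = use_variant1 ? copy_variant1 : copy_variant2;
}

static void *copy_indirect(void *dst, const void *src, size_t n)
{
    assert(copy_ptr != NULL);   /* copy_select() must have run first */
    return copy_ptr(dst, src, n);
}
```

Compiling this with gcc -S should make it possible to compare the generated code for the two styles; the exact output depends on the compiler version and optimization flags.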

[...]

so if we tried to code it manually, it would look like:
movl flags, %eax
testl MMX2|3DNOW, %eax
 jz MMXorC
testl MMX2, %eax
 jz 3DNOW
(MMX2-memcpy)
 jmp end
3DNOW:
(3DNOW-memcpy)
 jmp end
MMXorC:
testl MMX, %eax
 jz C
(MMX-memcpy)
 jmp end
C:
(C-memcpy)
end:

that would execute 1 mov, 2 tests and 3 jmps at most
these would be decoded to 5 micro-ops on intel chips
they are all DirectPath on amd chips
the latency is 1 cycle on k7 for all except the mov, which has 3-cycle latency
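
The same test chain written in C, as a minimal sketch. The flag bit values and the names FLAG_MMX, FLAG_MMX2, FLAG_3DNOW, pick_variant and variant_name are assumptions for illustration; the real flag layout is not given here.

```c
#include <assert.h>

/* Hypothetical flag bits; the real values are not given in this mail. */
enum { FLAG_MMX = 1, FLAG_MMX2 = 2, FLAG_3DNOW = 4 };

enum variant { VARIANT_C, VARIANT_MMX, VARIANT_3DNOW, VARIANT_MMX2 };

/* Mirrors the asm: first test MMX2|3DNOW together, then split them,
 * and only fall back to the MMX test and plain C if neither bit is set. */
static enum variant pick_variant(unsigned flags)
{
    if (flags & (FLAG_MMX2 | FLAG_3DNOW)) {
        if (flags & FLAG_MMX2)
            return VARIANT_MMX2;
        return VARIANT_3DNOW;
    }
    if (flags & FLAG_MMX)
        return VARIANT_MMX;
    return VARIANT_C;
}

/* Helper for printing which variant was chosen. */
static const char *variant_name(enum variant v)
{
    static const char *const names[] = { "C", "MMX", "3DNOW", "MMX2" };
    assert((int)v >= 0 && (int)v < 4);
    return names[v];
}
```

Every path through pick_variant takes at most two flag tests, matching the count for the asm version.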

function pointers:
movl ..., %eax
call *%eax

...
MMX:
blah blah
ret

that would execute 1 mov, 1 indirect call, 1 return
these would be decoded to 10 micro-ops on intel chips
decoding itself is very likely slower here too
call / ret is VectorPath on k7
call / ret have 4-5 cycles latency on k7 and the mov has 3

numbers are simply taken from TFM (amd's and intel's), so i might have missed
some important exceptions

the only way to really figure out which is faster is to code both
and benchmark them on different cpus, but i doubt that it is worth it because
1. it only affects runtime cpu detection
2. a single memory access (one that misses the L1 & L2 caches) needs 50 cpu
cycles or so, so even if your variant turns out to be faster on some cpu, the
difference would be tiny
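
If someone does want to measure it, a benchmark could be sketched roughly like this. It is only an illustration: time_branch, time_pointer, variant_a, variant_b and BUF_SIZE are made-up names, clock() is a coarse timer, and the volatile selectors are there only to keep the compiler from resolving the choice at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

enum { BUF_SIZE = 64 };   /* small buffer so the copy stays in L1 cache */

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Two stand-in copy routines (hypothetical). */
static void *variant_a(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }
static void *variant_b(void *dst, const void *src, size_t n) { return memmove(dst, src, n); }

/* volatile: the selection has to be re-read at run time on every call */
static volatile int use_a = 1;
static volatile copy_fn selected = variant_a;

/* Times `iters` copies dispatched with if / else; returns seconds. */
static double time_branch(long iters, unsigned char *dst, const unsigned char *src)
{
    assert(iters > 0);
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        if (use_a)
            variant_a(dst, src, BUF_SIZE);
        else
            variant_b(dst, src, BUF_SIZE);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Times `iters` copies dispatched through a function pointer; returns seconds. */
static double time_pointer(long iters, unsigned char *dst, const unsigned char *src)
{
    assert(iters > 0);
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        copy_fn f = selected;
        f(dst, src, BUF_SIZE);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Both functions would need a large iteration count and several runs per cpu before the numbers mean anything, and with a buffer that misses the caches the dispatch cost would most likely disappear in the noise.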

Michael