[FFmpeg-devel] [flamefest-start] A little something on MMX/SSE intrinsics

Sat Mar 1 13:38:22 CET 2008

Ivan Kalvachev wrote:
> 
> How are PPCs so scheduling-sensitive?

In the specific case of G3 vs G4 vs G5 vs CELL, the main issue is that 
the people deciding how to spend silicon area and power had quite 
radically different ideas about what to do. The G3 ALU has certain 
features and optional stuff that aren't as fast on the G4, and there 
are similar differences between the G4 and G5 AltiVec implementations. 
CELL is a world apart: you have to consider branch hints, since its 
designers preferred to add another hardware thread instead of a 
complex branch predictor. (They cut other corners as well; the idea 
isn't bad, provided you and your compiler are aware of them.)
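
To make the branch hint point concrete, here is a minimal sketch of how 
a hint is usually spelled with gcc. It assumes gcc's __builtin_expect; 
the LIKELY/UNLIKELY macros and count_escapes() are made-up names for 
illustration, and whether the hint turns into an actual hint 
instruction depends on the target and the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macros around gcc's __builtin_expect. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Count 0xFF escape bytes, telling the compiler they are assumed rare,
 * so the fall-through path is the common one. */
size_t count_escapes(const uint8_t *buf, size_t n)
{
    size_t escapes = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(buf[i] == 0xFF))
            escapes++;
    }
    return escapes;
}
```

The hint only changes how the code is laid out and predicted, not the 
result, so a wrong guess costs speed but never correctness.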

> Usually you write instructions with as much parallelism as possible
> and the CPU is expected to execute as much instructions as it can.

That is fine; the problem is which instructions. Right now the best way 
to get sane code overall is to write branchless SIMD, use the cache 
hints but forget the stream cache hints (they work only on the G4), and 
keep in mind how deep the pipeline is, the load/store delay, and other 
interesting details that gcc should already know and should use (e.g. 
to reorder or substitute instructions appropriately, to generate 
constants from immediate instructions instead of loading the values, 
which can be faster, and to account for how AltiVec interacts with the 
scalar ALU).
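
As an illustration of what "branchless" means here, a minimal sketch in 
portable C: compare, turn the result into an all-ones or all-zeros 
mask, then blend the two inputs with it. This is the same pattern that 
a compare plus vec_sel gives you on 16 bytes at once with AltiVec; the 
function names below are made up for the example, and the scalar loop 
only stands in for the vector version.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless max(a, b) for one lane: build a mask from the comparison
 * and use it to select, instead of jumping on the comparison. */
static inline uint8_t max_u8_branchless(uint8_t a, uint8_t b)
{
    uint8_t mask = (uint8_t)-(a > b);   /* 0xFF if a > b, else 0x00 */
    return (uint8_t)((a & mask) | (b & (uint8_t)~mask));
}

/* Apply the select to whole arrays; no data-dependent branch inside
 * the loop body. */
void max_u8_array(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                  size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = max_u8_branchless(a[i], b[i]);
}
```

For a plain max AltiVec already has vec_max; the mask-and-select form 
is shown because it generalizes to any "if (cond) x else y" per element.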

The G4 has quite limited memory bandwidth but has ways to make the DMA 
engine behave (stream hints). The G5 has better access to memory but 
you lose the stream hints. CELL has even better memory management, BUT 
you pay higher penalties for mispredicted branches and it has other 
peculiarities.
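
For the plain cache hints (not the G4-only stream hints), a minimal 
sketch using gcc's __builtin_prefetch, which gcc can lower to a cache 
touch instruction on PowerPC. The function name, the 32-byte step and 
the 128-byte look-ahead are assumptions picked for the example, not 
tuned values; the right distance differs per CPU.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 128  /* bytes to look ahead; placeholder value */

/* Sum a byte buffer while hinting the cache about data needed soon. */
uint32_t sum_bytes_prefetch(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* One hint per 32-byte block, only while still inside the
         * buffer. Arguments: address, 0 = read, 0 = low temporal
         * locality (streaming data). */
        if ((i & 31) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 0);
        sum += src[i];
    }
    return sum;
}
```

The prefetch is only a hint: dropping it changes timing, not the sum.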

> I just want the summary, not reading 5-6 optimization manuals.

I hope I have given you an idea.

-- 

Luca Barbato
Gentoo Council Member
Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero