[FFmpeg-devel] Inline ASM vs. Intrinsics
Fri May 11 23:10:56 CEST 2007
On Fri, May 11, 2007 at 09:20:04PM +0200, Michael Niedermayer wrote:
> I thought it was significantly slower than comparable CPUs (same time period
> and same price range) even when both run natively compiled code
> (with similarly good compilers of course)
"similarly good compilers" being the tricky bit. I think IA64
compiler design remains an ongoing research problem. Last I checked I
was getting essentially identical performance from gcc and icc on the
same IA64 machine -- however I use almost no floating point in my
workload and I think that's where icc is supposed to shine.
> and from what I remember it's a nightmare for a compiler to generate
> good code for it ...
Instructions are 41 bits and are bundled three at a time into 128-bit
blocks; the remaining 5 bits are a template field that specifies which
execution units the three slots use. I don't think the chip does any
internal reordering, instead relying on the compiler to figure it all
out explicitly. If you look at IA64 assembly you often see a lot of
NOPs in the code, presumably because the compiler couldn't find
something to schedule in that spot. Instructions can also be hinted:
for example, you can explicitly request memory ordering and cache
behavior on individual loads and stores (and yes, getting these hints
right in hot code paths can have a measurable impact on performance).
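To make the bundle layout concrete, here is a rough C sketch of how the
128 bits split up. It assumes the bundle is handed over as two 64-bit
words (lo = bits 0..63, hi = bits 64..127), with the 5-bit template in
the lowest bits followed by the three 41-bit slots, which is how I
understand the encoding. The names ia64_bundle, ia64_unpack and
ia64_pack are made up for illustration and don't come from any real
library.

```c
#include <assert.h>
#include <stdint.h>

#define IA64_SLOT_BITS 41
#define IA64_SLOT_MASK ((UINT64_C(1) << IA64_SLOT_BITS) - 1)

/* Hypothetical container for one decoded bundle (illustration only). */
typedef struct {
    unsigned template_id;   /* 5-bit template: unit types for the slots */
    uint64_t slot[3];       /* three 41-bit instruction slots */
} ia64_bundle;

/* Split a 128-bit bundle, given as two 64-bit halves, into its fields.
 * Assumed layout: bits 0..4 template, 5..45 slot 0, 46..86 slot 1,
 * 87..127 slot 2. */
static ia64_bundle ia64_unpack(uint64_t lo, uint64_t hi)
{
    ia64_bundle b;
    b.template_id = (unsigned)(lo & 0x1f);
    b.slot[0] = (lo >> 5) & IA64_SLOT_MASK;
    /* Slot 1 straddles the halves: 18 bits from lo, 23 bits from hi. */
    b.slot[1] = ((lo >> 46) | (hi << 18)) & IA64_SLOT_MASK;
    b.slot[2] = (hi >> 23) & IA64_SLOT_MASK;
    return b;
}

/* Inverse of ia64_unpack: 5 + 3 * 41 = 128 bits, nothing left over. */
static void ia64_pack(const ia64_bundle *b, uint64_t *lo, uint64_t *hi)
{
    uint64_t s0 = b->slot[0] & IA64_SLOT_MASK;
    uint64_t s1 = b->slot[1] & IA64_SLOT_MASK;
    uint64_t s2 = b->slot[2] & IA64_SLOT_MASK;

    assert(b->template_id < 32);
    *lo = (uint64_t)b->template_id | (s0 << 5) | (s1 << 46);
    *hi = (s1 >> 18) | (s2 << 23);
}
```

Since the template applies to all three slots at once, a compiler that
can't find three instructions fitting one of the allowed combinations
has to pad the bundle, which would explain the NOPs.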
If an open source project requires a lot of assembly language to do
its job (such as a JIT compiler library), you'll typically find that
it's been ported to everything _but_ IA64, because nobody is
masochistic enough to try it.
The chips are large and expensive, but you can get them with 9MB of
on-chip cache if you've got the money. There are also some subtle
limitations on which particular CPU models you can combine in the same
machine, which can cause trouble if you build a system and then try to
add more CPUs later but can't get compatible chips (been there, done
that).
What IA64 _can_ do is scale way, way, WAY up. If you want a single
machine with hundreds of CPUs, several terabytes of shared,
cache-coherent RAM, 90 PCI-X slots, and 16 GPUs, you can have it right
now. That's the only reason I ended up dealing with IA64 for my
project: when we were working out the RAM and CPU requirements IA64
was simply the only practical option. Today there are several other
possibilities and I've kept my code mostly ready to drop onto x86_64
when the time comes.