[Libav-user] a little performance/optimisation headbreaker :)

Fri Feb 15 16:33:48 CET 2013

On Fri, Feb 15, 2013 at 6:37 AM, "René J.V. Bertin" <rjvbertin at gmail.com> wrote:
> On my 2.7 GHz dual-core i7 MBP, I get about 10000 Hz for the SSE version, and roughly half that for the generic, scalar function, using gcc-4.2 as well as MSVC 2010 Express running under WinXP in VirtualBox. The factor-of-2 speed gain for the SSE code also holds on 2 AMD machines (a mid-range laptop and a C62 netbook).
>
> Then I installed a new mingw32 cross-compiler based on gcc 4.7 and, for the heck of it, compiled my benchmark with it ... and found the same factor of 2 ... but in favour of the scalar code, on my i7. It's more like a factor of 2.5, actually. Same thing after installing the native OS X gcc 4.7 version.
>
> The question: is gcc-4.7 clever enough to do a better optimisation of the 2nd benchmark loop than the 1st loop, or does it really generate so much better assembly from the scalar function? NB, -fno-inline-functions has no effect here.

gcc 4.7 is clever enough to auto-vectorize a scalar loop into SSE code
by itself, so that may be what you're seeing. Compiler flags matter
too: the optimization level and the target instruction set decide
whether that happens.
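
To make that concrete, here is a minimal sketch of the kind of scalar
loop gcc can usually turn into packed SSE code on its own. The function
scale_add is hypothetical and is not the benchmark from this thread;
the flags in the comment are the ones I would try first, not a
guaranteed recipe.

```c
#include <stddef.h>

/* Hypothetical scalar kernel (not the code discussed in this thread):
 * y[i] += k * x[i].
 * __restrict is a GCC extension that promises x and y do not overlap,
 * which removes one common obstacle to auto-vectorization.
 *
 * Assumed way to check the result:
 *   gcc -O3 -msse2 -S scale_add.c
 * then look in scale_add.s for packed instructions such as mulps/addps
 * (vectorized) rather than only mulss/addss (scalar). */
void scale_add(float *__restrict y, const float *__restrict x,
               float k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] += k * x[i];
}
```

With gcc, -O3 turns on -ftree-vectorize; at lower levels you'd have to
pass that flag yourself, and a 32-bit target may also need -msse2.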

Have you inspected the generated assembly code? gcc -S will show you
exactly how the two loops differ, and I find it a very informative
exercise whenever something goes hinky performance-wise. That applies
especially here, since you've used gcc inline assembler, which tends
to inhibit many of the compiler's other optimizations. Why don't you
try gcc's vector primitives (its built-in vector types) instead?
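
For reference, a minimal sketch of what gcc's vector types look like in
practice. The typedef name v4sf and the muladd_vec kernel are made up
for illustration, since I don't know what your function computes; the
point is only that arithmetic on such types is written with ordinary
operators and stays visible to the optimizer, unlike an inline asm
block.

```c
#include <stddef.h>
#include <string.h>

/* GCC vector extension: four packed floats in one 16-byte value. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical kernel: dst[i] = a[i] * b[i] + c[i] for n floats. */
void muladd_vec(float *dst, const float *a, const float *b,
                const float *c, size_t n)
{
    size_t i;

    /* Process four floats per iteration. memcpy is used for loads and
     * stores so the arrays need no particular alignment; gcc normally
     * turns these copies into plain vector moves. */
    for (i = 0; i + 4 <= n; i += 4) {
        v4sf va, vb, vc, vr;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        memcpy(&vc, c + i, sizeof vc);
        vr = va * vb + vc;          /* element-wise multiply and add */
        memcpy(dst + i, &vr, sizeof vr);
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        dst[i] = a[i] * b[i] + c[i];
}
```

Because this is plain C as far as the compiler is concerned, it can
still inline, unroll and schedule around it, and the same source builds
for targets without SSE.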