[FFmpeg-devel] [PATCH] Move MLP's dot product to DSPContext

Tue Apr 21 11:19:06 CEST 2009

On Tue, Apr 21, 2009 at 05:31:14AM +0200, Michael Niedermayer wrote:
> ahh and note, gcc and 64-bit operations -> very poor code, naive asm
> will be much faster, at least it was that way in the past ...

Well, even on 64 bit it is bad enough. The current code results in:
     478:       89 c8                   mov    %ecx,%eax
     47a:       49 63 10                movslq (%r8),%rdx
     47d:       83 c1 01                add    $0x1,%ecx
     480:       48 63 44 84 d8          movslq -0x28(%rsp,%rax,4),%rax
     485:       49 83 c0 04             add    $0x4,%r8
     489:       48 0f af c2             imul   %rdx,%rax
     48d:       49 01 c1                add    %rax,%r9
     490:       44 39 d1                cmp    %r10d,%ecx
     493:       75 e3                   jne    478

Optimizing the C code a bit:
        int32_t *b = firbuf + fir->order;
        int32_t *c = fir->coeff + fir->order;
        int64_t o = -fir->order;
        do {
            accum += (int64_t)b[o] * c[o];
        } while (++o);

Then results in this:
     4c0:       49 63 00                movslq (%r8),%rax
     4c3:       48 63 11                movslq (%rcx),%rdx
     4c6:       49 83 c0 04             add    $0x4,%r8
     4ca:       48 83 c1 04             add    $0x4,%rcx
     4ce:       48 0f af c2             imul   %rdx,%rax
     4d2:       49 01 c1                add    %rax,%r9
     4d5:       49 83 c2 01             add    $0x1,%r10
     4d9:       75 e5                   jne    4c0

This includes two completely useless adds: the increments of %r8 and
%rcx, since %r10 could index both arrays directly.
gcc is just incredibly useless for inner loops: loops that could be done
with three instructions end up using ten or more and such crap (this
is actually one of the better cases).
Admittedly the x86_64 imul is braindead, too, for not providing a
32x32->64 multiply (except for the one-operand edx:eax crap).
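
For anyone who wants to try this outside the tree, here is a rough
self-contained sketch of the negative-index loop above, plus the
one-operand imul via GCC inline asm. The names mul32x32_64 and
dot_product_neg_index are made up for illustration and are not part of
the patch; the asm path is only an assumption about what "naive asm"
could look like, and it falls back to plain C on other compilers or
architectures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): signed 32x32->64 multiply. */
static inline int64_t mul32x32_64(int32_t a, int32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo;
    int32_t  hi;
    /* One-operand imul: edx:eax = eax * operand (signed). */
    __asm__ ("imull %3"
             : "=a" (lo), "=d" (hi)
             : "a" (a), "rm" (b)
             : "cc");
    /* Glue the two 32-bit halves back into one 64-bit value. */
    return (int64_t)(((uint64_t)(uint32_t)hi << 32) | lo);
#else
    /* Portable fallback: let the compiler pick the multiply. */
    return (int64_t)a * b;
#endif
}

/* Hypothetical standalone version of the loop: both pointers are moved
 * to the end of their arrays and one negative index counts up to zero,
 * so the loop condition needs no separate compare. */
static int64_t dot_product_neg_index(const int32_t *buf,
                                     const int32_t *coeff, int order)
{
    /* The do/while form runs at least once, so order must be > 0. */
    assert(order > 0);

    const int32_t *b = buf   + order;
    const int32_t *c = coeff + order;
    int64_t o     = -(int64_t)order;
    int64_t accum = 0;

    do {
        accum += mul32x32_64(b[o], c[o]);
    } while (++o);

    return accum;
}
```

Calling dot_product_neg_index(x, y, n) on two int32_t arrays of length
n returns their 64-bit dot product. Whether the asm variant actually
beats what gcc emits for the plain C multiply would have to be measured.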