[FFmpeg-devel] [PATCH] Move MLP's dot product to DSPContext
Reimar Döffinger
Reimar.Doeffinger
Tue Apr 21 11:19:06 CEST 2009
On Tue, Apr 21, 2009 at 05:31:14AM +0200, Michael Niedermayer wrote:
> ahh, and note: gcc and 64-bit operations -> very poor code; naive asm
> will be much faster, at least it was that way in the past ...
Well, even on 64 bit it is bad enough. The current code results in:
 478:   89 c8                   mov    %ecx,%eax
 47a:   49 63 10                movslq (%r8),%rdx
 47d:   83 c1 01                add    $0x1,%ecx
 480:   48 63 44 84 d8          movslq -0x28(%rsp,%rax,4),%rax
 485:   49 83 c0 04             add    $0x4,%r8
 489:   48 0f af c2             imul   %rdx,%rax
 48d:   49 01 c1                add    %rax,%r9
 490:   44 39 d1                cmp    %r10d,%ecx
 493:   75 e3                   jne    478
Optimizing the C code a bit:
int32_t *b = firbuf + fir->order;
int32_t *c = fir->coeff + fir->order;
int64_t o = -fir->order;
do {
    accum += (int64_t)b[o] * c[o];
} while (++o);
Then results in this:
 4c0:   49 63 00                movslq (%r8),%rax
 4c3:   48 63 11                movslq (%rcx),%rdx
 4c6:   49 83 c0 04             add    $0x4,%r8
 4ca:   48 83 c1 04             add    $0x4,%rcx
 4ce:   48 0f af c2             imul   %rdx,%rax
 4d2:   49 01 c1                add    %rax,%r9
 4d5:   49 83 c2 01             add    $0x1,%r10
 4d9:   75 e5                   jne    4c0
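Which includes two completely useless adds: the pointer increments of
%r8 and %rcx would not be needed if both loads were simply indexed by
the counter in %r10.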
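For anyone who wants to try this with their own compiler, here is a minimal standalone sketch of the two loop shapes. The function names and the flat (buf, coeff, order) parameters are made up for illustration and are not the actual MLP code; dot_indexed is only a guess at a plain counting loop, and dot_negidx mirrors the snippet above. Note that the do/while form assumes order >= 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical plain form: a separate index counting up from 0. */
static int64_t dot_indexed(const int32_t *buf, const int32_t *coeff, int order)
{
    int64_t accum = 0;
    for (int i = 0; i < order; i++)
        accum += (int64_t)buf[i] * coeff[i];
    return accum;
}

/* Negative-index form: point both arrays at their end and run a single
 * 64-bit counter from -order up to 0, so the loop condition is just
 * "counter became zero". */
static int64_t dot_negidx(const int32_t *buf, const int32_t *coeff, int order)
{
    assert(order > 0);              /* do/while would overrun for order == 0 */
    const int32_t *b = buf + order;
    const int32_t *c = coeff + order;
    int64_t o = -(int64_t)order;
    int64_t accum = 0;
    do {
        accum += (int64_t)b[o] * c[o];
    } while (++o);
    return accum;
}
```

Both functions should return the same sum for the same input. Compiling with something like "gcc -O2 -S" and comparing the two inner loops shows what a given compiler version makes of each form.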
gcc is just incredibly useless for inner loops: loops that could be done
with three instructions end up using ten or more (this is actually one
of the better cases).
Admittedly the x86_64 imul is braindead, too, for not providing a
32x32->64 multiply (except for the edx:eax crap).
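To illustrate the imul point, here is a rough sketch of a signed 32x32->64 multiply. The helper name mul32x32_64 is hypothetical. On x86 with a GCC-compatible compiler it uses the one-operand imul, which takes one factor in eax and leaves the result in edx:eax; elsewhere it falls back to the usual C cast, which on x86_64 typically compiles to two sign extensions plus a 64-bit imul as in the listings above.

```c
#include <stdint.h>

/* Hypothetical helper: signed 32x32->64 multiply. */
static int64_t mul32x32_64(int32_t a, int32_t b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t lo;
    int32_t hi;
    /* One-operand imul: edx:eax = eax * operand (signed). */
    __asm__ ("imull %3"
             : "=a"(lo), "=d"(hi)
             : "a"(a), "rm"(b)
             : "cc");
    /* Reassemble the 64-bit result from the two 32-bit halves. */
    return (int64_t)(((uint64_t)(uint32_t)hi << 32) | lo);
#else
    /* Portable fallback: widen one operand before multiplying. */
    return (int64_t)a * b;
#endif
}
```

The asm variant pins both eax and edx, which is exactly the register pressure being complained about; it is shown only to make the instruction's behaviour concrete, not as a suggested replacement.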