[FFmpeg-devel] [PATCH 2/3] x86/float_dsp: unroll loop in vector_fmac_scalar

Wed Apr 16 18:12:23 CEST 2014

On 16/04/14 7:06 AM, Christophe Gisquet wrote:
> Hi,
> 
>> ~6% faster SSE2 performance. AVX/FMA3 are unaffected.
> 
> What CPU, environment and test case have you used?
> 
> For SSE2, if I'm not mistaken, the difference in the code is having
> different regs used in the unrolled part. When I tested that with AAC,
> which often performs calls for 64 elements, this was not a win for mingw64.
> 
> But a 6% win for most typical systems is certainly better than a <1% loss
> for a few. I'm OK with the change otherwise.
> 
> Best regards,
> Christophe

Athlon 64 7750+, mingw-w64. It went from 274 cycles to 257 when I benched
with the dts-es sample I uploaded for the FATE test.
Also, does aac even use vector_fmac_scalar? A grep on libavcodec shows
results only in dcadec.c.

The objdump disassembly for win64 looks like this pre-patch:

 movaps (%rdx,%r9,1),%xmm1
 mulps  %xmm0,%xmm1
 movaps 0x10(%rdx,%r9,1),%xmm2
 mulps  %xmm0,%xmm2
 addps  (%rcx,%r9,1),%xmm1
 addps  0x10(%rcx,%r9,1),%xmm2
 movaps %xmm1,(%rcx,%r9,1)
 movaps %xmm2,0x10(%rcx,%r9,1)
 movaps 0x20(%rdx,%r9,1),%xmm1
 mulps  %xmm0,%xmm1
 movaps 0x30(%rdx,%r9,1),%xmm2
 mulps  %xmm0,%xmm2
 addps  0x20(%rcx,%r9,1),%xmm1
 addps  0x30(%rcx,%r9,1),%xmm2
 movaps %xmm1,0x20(%rcx,%r9,1)
 movaps %xmm2,0x30(%rcx,%r9,1)

And post-patch:

 movaps (%rdx,%r9,1),%xmm1
 mulps  %xmm2,%xmm1
 movaps 0x10(%rdx,%r9,1),%xmm0
 mulps  %xmm2,%xmm0
 movaps 0x20(%rdx,%r9,1),%xmm3
 mulps  %xmm2,%xmm3
 movaps 0x30(%rdx,%r9,1),%xmm4
 mulps  %xmm2,%xmm4
 addps  (%rcx,%r9,1),%xmm1
 addps  0x10(%rcx,%r9,1),%xmm0
 addps  0x20(%rcx,%r9,1),%xmm3
 addps  0x30(%rcx,%r9,1),%xmm4
 movaps %xmm1,(%rcx,%r9,1)
 movaps %xmm0,0x10(%rcx,%r9,1)
 movaps %xmm3,0x20(%rcx,%r9,1)
 movaps %xmm4,0x30(%rcx,%r9,1)

The difference in the resulting code is the order of the instructions, thanks
to the unrolling of the loop. The mulps now have enough room to finish before
the addps that depend on them are executed, and so do the addps before the
movaps stores to memory.
Currently each addps comes almost right after the mulps it depends on, which
is afaik not optimal.
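
For anyone who prefers C to disassembly, here is a rough sketch of the
post-patch ordering. It is only an illustration, not the actual asm or the
patch itself: the function names are made up, the requirement that len be a
multiple of 16 is an assumption taken from the 0x40 bytes handled per
iteration above, and a C compiler is free to reschedule all of this, so it
shows the grouping but proves nothing about timing.

```c
#include <assert.h>
#include <stddef.h>

/* Reference behaviour: dst[i] += src[i] * mul. */
static void fmac_scalar_ref(float *dst, const float *src, float mul,
                            size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] += src[i] * mul;
}

/* Hypothetical illustration of the post-patch ordering. Each iteration
 * handles 16 floats (four 4-float vectors): all multiplies first, then
 * all adds, then all stores. Assumes len is a multiple of 16. */
static void fmac_scalar_unrolled(float *dst, const float *src, float mul,
                                 size_t len)
{
    assert(len % 16 == 0);
    for (size_t i = 0; i < len; i += 16) {
        float t[16];
        for (int j = 0; j < 16; j++)   /* stands in for 4x movaps + mulps */
            t[j] = src[i + j] * mul;
        for (int j = 0; j < 16; j++)   /* stands in for 4x addps */
            t[j] += dst[i + j];
        for (int j = 0; j < 16; j++)   /* stands in for 4x movaps stores */
            dst[i + j] = t[j];
    }
}
```

Both functions compute the same result. The second one just keeps each
dependent operation further away from the one producing its input, which is
what the unrolled asm does with xmm0, xmm1, xmm3 and xmm4.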