[FFmpeg-devel] [PATCH 7/9] sbcenc: add MMX optimizations

Sat Dec 23 22:52:11 EET 2017

On Sat, Dec 23, 2017 at 05:47:04PM -0300, James Almer wrote:
> On 12/23/2017 5:44 PM, Aurelien Jacobs wrote:
> > On Sat, Dec 23, 2017 at 03:35:28PM -0300, James Almer wrote:
> >> On 12/23/2017 3:01 PM, Aurelien Jacobs wrote:
> >>> This was originally based on libsbc, and was fully integrated into ffmpeg.
> >>>
> >>> Rough speed test:
> >>> C version:    speed= 592x
> >>> MMX version:  speed= 785x
> >>> ---
> >>>  libavcodec/sbcdsp.c          |   3 +
> >>>  libavcodec/sbcdsp.h          |   2 +
> >>>  libavcodec/x86/Makefile      |   2 +
> >>>  libavcodec/x86/sbcdsp.asm    | 284 +++++++++++++++++++++++++++++++++++++++++++
> >>>  libavcodec/x86/sbcdsp_init.c |  51 ++++++++
> >>>  5 files changed, 342 insertions(+)
> >>>  create mode 100644 libavcodec/x86/sbcdsp.asm
> >>>  create mode 100644 libavcodec/x86/sbcdsp_init.c
> >>
> >> [...]
> >>
> >>> +;*******************************************************************
> >>> +;void ff_sbc_calc_scalefactors(int32_t sb_sample_f[16][2][8],
> >>> +;                              uint32_t scale_factor[2][8],
> >>> +;                              int blocks, int channels, int subbands)
> >>> +;*******************************************************************
> >>> +INIT_MMX mmx
> >>> +cglobal sbc_calc_scalefactors, 5, 7, 3, sb_sample_f, scale_factor, blocks, channels, subbands, ptr, blk
> >>> +    ; subbands = 4 * subbands * channels
> >>> +    shl  subbandsd, 2
> >>> +    cmp  channelsd, 2
> >>> +    jl   .loop_1
> >>> +    shl  subbandsd, 1
> >>> +
> >>> +.loop_1:
> >>> +    sub           subbandsq, 8
> >>> +    lea           ptrq, [sb_sample_fq + subbandsq]
> >>> +
> >>> +    ; blk = (blocks - 1) * 64;
> >>> +    lea           blkq, [blocksq - 1]
> >>> +    shl           blkd, 6
> >>> +
> >>> +    movq          m0, [scale_mask]
> >>
> >> I insist, this can be easily loaded outside the loop. You have enough
> >> spare regs to store a copy.
> > 
> > Oh, I forgot to reply to this. There isn't any register left available
> > on x86_32, which is why I kept that load inside the loop.
> 
> You're not using a GPR to store the mask, nor do you need to. You're using
> MMX regs and have 5 left.

Oh, indeed! Not sure why it didn't even cross my mind...
I will have a look at this.
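
For reference, the change being asked for can be sketched in scalar C. This
is only a rough model, not the patch itself: the helper name
or_accumulate_sketch, the constant SCALE_MASK_SKETCH (standing in for
scale_mask, value assumed) and the per-sample operation (OR-ing in
|sample| - 1, assumed from the libsbc-style C code) are all hypothetical,
and the final conversion to a scale factor is left out. The point is only
that the mask is read once, before the loops, and copied into the
accumulator on each outer iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, standing in for the asm constant scale_mask. */
#define SCALE_MASK_SKETCH 0x8000u

/* One block of sb_sample_f[16][2][8] spans 2 * 8 * 4 = 64 bytes,
 * which is the stride behind "blk = (blocks - 1) * 64" in the asm. */
_Static_assert(sizeof(int32_t[2][8]) == 64, "block stride is 64 bytes");

/* Hypothetical helper: for each (channel, subband), OR-accumulate
 * (|sample| - 1) over all blocks, seeded with the mask. */
static void or_accumulate_sketch(int32_t sb_sample_f[16][2][8],
                                 uint32_t acc[2][8],
                                 int blocks, int channels, int subbands)
{
    assert(blocks >= 0 && blocks <= 16);
    assert(channels >= 0 && channels <= 2);
    assert(subbands >= 0 && subbands <= 8);

    /* Loop-invariant: read once here instead of once per iteration,
     * like keeping the mask in a spare MMX register. */
    const uint32_t mask = SCALE_MASK_SKETCH;

    for (int ch = 0; ch < channels; ch++) {
        for (int sb = 0; sb < subbands; sb++) {
            uint32_t x = mask; /* register copy, no memory load */
            for (int blk = 0; blk < blocks; blk++) {
                int32_t v = sb_sample_f[blk][ch][sb];
                /* Magnitude computed in unsigned arithmetic to avoid
                 * overflow on INT32_MIN. */
                uint32_t mag = v < 0 ? 0u - (uint32_t)v : (uint32_t)v;
                if (mag != 0)
                    x |= mag - 1;
            }
            acc[ch][sb] = x;
        }
    }
}
```

In the asm this would correspond to one movq from [scale_mask] into a spare
register before .loop_1, and a register-to-register movq into m0 at the top
of each iteration.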