[FFmpeg-devel] [WIP] add sse4 flac lpc encoder

James Almer jamrial at gmail.com
Tue Feb 4 06:48:45 CET 2014


On 02/02/14 10:18 PM, James Darnley wrote:
> A rather hacked together patch adding an sse4 version of the flac lpc
> encoder for 16-bit samples, flac_lpc_encode_c_16().  But it works correctly.
> 
> I have been using gprof to measure the time taken in functions.
> 
>> Each sample counts as 0.01 seconds.
>>   %   cumulative   self              self     total           
>>  time   seconds   seconds    calls  ms/call  ms/call  name    
> Original code:
>>  43.94     19.45    19.45                             flac_lpc_encode_c_16
> This patch:
>>  25.74     17.10     8.54                             ff_flac_enc_lpc_16_sse4
> 
> The fraction of total time is down from nearly half to just over a
> quarter.  The time reported by `time` is also lower, by about 12 seconds.
> 
> Original: 0m52.318s
> Patch:    0m40.198s
> 
> These tests were done with compression level 8 which does skew the time
> spent in these functions to be in my favour.
> 
> I already see that I can use 4 more xmm regs to unroll the loop more.

I tested just now, and the code is crashing for me.

> +INIT_XMM sse4
> +cglobal flac_enc_lpc_16, 3, 5, 4, 0, res, smp, coefs ; len, order, shift

You're calling the function with six arguments, but this declaration only
expects three. You're also reserving five general-purpose registers instead
of six.

> +                                   ; r0   r1   r2      r3   r4     r5
> +
> +%define posj r3
> +%define negj r4
> +
> +movd m3, r5m ; shift
> +loop_len:
> +    pxor m0,  m0
> +    xor posj, posj
> +    xor negj, negj

You're overwriting the len and order values when you reuse their registers
(r3 and r4) as counters.

You could instead use

cglobal flac_enc_lpc_16, 6, 8, 4, res, smp, coefs, len, order, shift, pos, neg

above, and rename things accordingly. That uses eight registers, though, which
makes the code unsuitable for 32-bit x86.

> +    loop_order:
> +        movd   m2, [coefsq+posj*4] ; c = coefs[j]
> +        SPLATD m2
> +        movu   m1, [smpq+negj*4-4] ; s = smp[i-j-1]
> +        pmulld m1,  m2
> +        paddd  m0,  m1             ; p += c * s
> +
> +        add posj, 1
> +        sub negj, 1
> +        cmp posj, r4m
> +    jne loop_order
> +
> +    psrad m0, m3                   ; p >>= shift
> +    movu  m1, [smpq]
> +    psubd m1, m0                   ; smp[i] - p
> +    movu  [resq], m1               ; res[i] = smp[i] - (p >> shift)
> +
> +    add resq, mmsize
> +    add smpq, mmsize
> +    sub DWORD r3m, mmsize/4
> +jg loop_len
> +RET

After changing what I mentioned above, the code worked for me, though the speed
gains in my tests weren't as good as what you reported. (I used the default
compression level, however.)
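
For reference, here is a rough C sketch of what the asm should compute, going
by the comments in the quoted code. It is not the actual
flac_lpc_encode_c_16() source: the names lpc_residual_ref and lpc_residual_x4
are made up, copying the first `order` samples unchanged is an assumption
about the warm-up handling, and it relies on >> being an arithmetic shift for
negative values (as psrad is). The second function produces four outputs per
outer iteration the way the xmm lanes do, with a scalar tail for a len that
isn't a multiple of four.

```c
#include <stdint.h>

/* Hypothetical scalar reference: res[i] = smp[i] - (p >> shift),
 * where p = sum over j of coefs[j] * smp[i-j-1]. */
static void lpc_residual_ref(int32_t *res, const int32_t *smp,
                             const int32_t *coefs,
                             int len, int order, int shift)
{
    /* Assumption: the first `order` samples have no full history,
     * so they are passed through unchanged. */
    for (int i = 0; i < order; i++)
        res[i] = smp[i];

    for (int i = order; i < len; i++) {
        int32_t p = 0;
        for (int j = 0; j < order; j++)
            p += coefs[j] * smp[i - j - 1];   /* p += c * s */
        res[i] = smp[i] - (p >> shift);
    }
}

/* Same computation, four outputs per outer iteration, mirroring the
 * four 32-bit lanes of one xmm register. */
static void lpc_residual_x4(int32_t *res, const int32_t *smp,
                            const int32_t *coefs,
                            int len, int order, int shift)
{
    int i;

    for (i = 0; i < order; i++)
        res[i] = smp[i];

    for (i = order; i + 4 <= len; i += 4) {
        int32_t p[4] = { 0, 0, 0, 0 };        /* pxor m0, m0 */
        for (int j = 0; j < order; j++) {
            int32_t c = coefs[j];             /* movd + SPLATD */
            for (int k = 0; k < 4; k++)       /* pmulld + paddd */
                p[k] += c * smp[i + k - j - 1];
        }
        for (int k = 0; k < 4; k++)           /* psrad + psubd */
            res[i + k] = smp[i + k] - (p[k] >> shift);
    }

    /* Scalar tail for the last len % 4 samples. */
    for (; i < len; i++) {
        int32_t p = 0;
        for (int j = 0; j < order; j++)
            p += coefs[j] * smp[i - j - 1];
        res[i] = smp[i] - (p >> shift);
    }
}
```

Running both on the same input and comparing the results is a cheap way to
check a vectorised loop before timing it.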

Regards

