[FFmpeg-devel] [PATCH] SSE3/4 implementation of flac_encode_residual_lpc

Fri May 29 19:00:12 CEST 2009

On Thu, 28 May 2009, Bobby Bingham wrote:

> Attached is a version I hope is about ready for inclusion.  Provides an
> overall encoding speedup of ~30% at compression_level=12.

> "movdqa     %%xmm3,  %%xmm6 \n\t" // verify that 16 bits is enough
> "movdqa     %%xmm5,  %%xmm7 \n\t"
> "pslld      $16,     %%xmm6 \n\t"
> "pslld      $16,     %%xmm7 \n\t"
> "psrad      $16,     %%xmm6 \n\t"
> "psrad      $16,     %%xmm7 \n\t"
> "pcmpeqd    %%xmm3,  %%xmm6 \n\t"
> "pcmpeqd    %%xmm5,  %%xmm7 \n\t"
> "pand       %%xmm6,  %%xmm7 \n\t"
> "pmovmskb   %%xmm7,  %2     \n\t"
> "cmp        $0xffff, %2     \n\t"
> "jne        2f              \n\t"

About half of the invocations of flac_encode_residual_lpc know in 
advance that all of the samples fit in 16 bits, so those shouldn't 
check this at all. For the remainder, this logic should be doable with 
just 1 paddd and 1 por per vector: merge several vectors, then branch 
once on the merged result.
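
One possible reading of the paddd/por suggestion, as a scalar C model 
rather than the actual asm: bias each sample by 0x8000 so that every 
value that fits in a signed 16-bit range lands in [0, 0xffff], OR the 
biased values together, and test the upper 16 bits once at the end. 
The function name all_fit_int16 is made up for illustration; in the 
SIMD version each loop iteration would cover one vector of four samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every sample lies in [-32768, 32767], else 0.
 * Hypothetical helper, not part of the patch under review. */
static int all_fit_int16(const int32_t *smp, size_t n)
{
    uint32_t acc = 0;

    assert(smp != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Models paddd with a constant 0x8000 in each lane:
         * in-range samples map to 0x0000..0xffff. */
        uint32_t biased = (uint32_t)smp[i] + 0x8000u;

        /* Models por into an accumulator: any out-of-range sample
         * leaves at least one bit set in the upper half. */
        acc |= biased;
    }

    /* Single test (and single branch) after merging everything. */
    return (acc & 0xffff0000u) == 0;
}
```

The point of the accumulator is that the compare and branch happen once 
per batch of vectors instead of once per vector.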

The double branch is inelegant. It could be removed if you either wrote 
the whole loop in asm, or split the asm block and branched in C. 
Branching in C fits especially well if the 16-bit check is moved to a 
separate loop, which is appropriate anyway since the check doesn't 
always need to run.
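
A rough sketch of the "separate check loop, branch in C" structure, 
under several assumptions: all function names here are hypothetical, 
the known_16bit flag stands in for whatever the caller knows in 
advance, and both residual loops are plain C stand-ins (the 16-bit one 
marks where the asm fast path would go). Right-shifting a negative sum 
is assumed to be arithmetic, as on the usual compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Separate pre-check loop: 1 if all samples fit in signed 16 bits. */
static int samples_fit_int16(const int32_t *smp, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc |= (uint32_t)smp[i] + 0x8000u;
    return (acc & 0xffff0000u) == 0;
}

/* Generic path: full 32-bit samples. */
static void residual_lpc_32(int32_t *res, const int32_t *smp, int n,
                            int order, const int32_t *coefs, int shift)
{
    for (int i = 0; i < order; i++)
        res[i] = smp[i];                    /* warm-up samples */
    for (int i = order; i < n; i++) {
        int64_t sum = 0;
        for (int j = 0; j < order; j++)
            sum += (int64_t)coefs[j] * smp[i - j - 1];
        res[i] = smp[i] - (int32_t)(sum >> shift);
    }
}

/* Stand-in for the 16-bit fast path; no range check inside the loop. */
static void residual_lpc_16(int32_t *res, const int32_t *smp, int n,
                            int order, const int32_t *coefs, int shift)
{
    for (int i = 0; i < order; i++)
        res[i] = smp[i];
    for (int i = order; i < n; i++) {
        int64_t sum = 0;
        for (int j = 0; j < order; j++)
            sum += (int64_t)coefs[j] * (int16_t)smp[i - j - 1];
        res[i] = smp[i] - (int32_t)(sum >> shift);
    }
}

/* Branch once, in C. Returns 16 or 32 to show which path ran. */
static int encode_residual_sketch(int32_t *res, const int32_t *smp, int n,
                                  int order, const int32_t *coefs,
                                  int shift, int known_16bit)
{
    assert(order >= 0 && order <= n);
    if (known_16bit || samples_fit_int16(smp, n)) {
        residual_lpc_16(res, smp, n, order, coefs, shift);
        return 16;
    }
    residual_lpc_32(res, smp, n, order, coefs, shift);
    return 32;
}
```

Callers that already know the input is 16-bit pass known_16bit=1 and 
skip the check loop entirely; everyone else pays for it once per call 
rather than once per vector.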

With 6 "r" constraints, you need #if HAVE_6REGS.

--Loren Merritt