[FFmpeg-devel] [PATCH] SSE3/4 implementation of flac_encode_residual_lpc
Fri May 29 19:00:12 CEST 2009
On Thu, 28 May 2009, Bobby Bingham wrote:
> Attached is a version I hope is about ready for inclusion. Provides an
> overall encoding speedup of ~30% at compression_level=12.
> "movdqa %%xmm3, %%xmm6 \n\t" // verify that 16 bits is enough
> "movdqa %%xmm5, %%xmm7 \n\t"
> "pslld $16, %%xmm6 \n\t"
> "pslld $16, %%xmm7 \n\t"
> "psrad $16, %%xmm6 \n\t"
> "psrad $16, %%xmm7 \n\t"
> "pcmpeqd %%xmm3, %%xmm6 \n\t"
> "pcmpeqd %%xmm5, %%xmm7 \n\t"
> "pand %%xmm6, %%xmm7 \n\t"
> "pmovmskb %%xmm7, %2 \n\t"
> "cmp $0xffff, %2 \n\t"
> "jne 2f \n\t"
About half of the calls to flac_encode_residual_lpc will know in
advance that all of the samples fit in 16 bits, so those shouldn't check
this at all. For the remainder, this logic should be doable with just
1 paddd and 1 por per vector. Merge several vectors before branching.
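
One possible reading of the paddd/por suggestion, written as a scalar C model
rather than SSE: bias each 32-bit sample by 0x8000 (the paddd), OR the results
into an accumulator (the por), and test the upper 16 bits of the accumulator
once per block. The function name all_fit_int16 and the block size CHECK_BLOCK
are made up for illustration and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define CHECK_BLOCK 16  /* samples merged per branch; arbitrary illustrative value */

/* Returns 1 if every sample fits in a signed 16-bit integer, else 0.
 * A value x is in [-32768, 32767] exactly when (uint32_t)x + 0x8000
 * has no bits set above bit 15. */
static int all_fit_int16(const int32_t *v, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t end = (n - i < CHECK_BLOCK) ? n : i + CHECK_BLOCK;
        uint32_t acc = 0;
        for (; i < end; i++)
            acc |= (uint32_t)v[i] + 0x8000u;  /* add bias, then OR-merge */
        if (acc & 0xFFFF0000u)                /* one branch per block */
            return 0;
    }
    return 1;
}
```

In a SIMD version the bias would sit in a constant register and the final test
would look at the high halves of the accumulated dwords; the scalar loop above
only shows the arithmetic.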
The double branch is inelegant. It could be removed if you either wrote
the whole loop in asm, or split the asm block and branched in C.
That applies especially if the 16-bit check is moved to a separate loop,
which is appropriate since it doesn't always need to run.
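
A rough sketch of the "separate loop, branch in C" structure, under stated
assumptions: known_16bit stands for whatever advance knowledge the caller has,
and the enum, the helper names, and the plain comparison loop are all
hypothetical. Nothing here is taken from the patch.

```c
#include <stdint.h>

enum residual_path { PATH_16BIT, PATH_GENERIC };

/* Separate pre-pass: runs only when the caller has no advance knowledge.
 * Any range check would do; a plain comparison keeps the sketch simple. */
static int samples_fit_16(const int32_t *smp, int n)
{
    for (int i = 0; i < n; i++)
        if (smp[i] < INT16_MIN || smp[i] > INT16_MAX)
            return 0;
    return 1;
}

/* Decide once, in C, which kernel to run. The kernels themselves would
 * then contain no range check and no second branch. */
static enum residual_path choose_residual_path(const int32_t *smp, int n,
                                               int known_16bit)
{
    if (known_16bit || samples_fit_16(smp, n))
        return PATH_16BIT;
    return PATH_GENERIC;
}
```

When known_16bit is nonzero the pre-pass is skipped entirely, which is the
"shouldn't check this at all" case.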
With 6 "r" constraints, you need #if HAVE_6REGS.