[FFmpeg-devel] [PATCH] VP3 DC-only IDCT

Michael Niedermayer michaelni
Sat Mar 13 20:18:11 CET 2010


On Sat, Mar 13, 2010 at 01:36:20AM -0500, David Conrad wrote:
> Hi,
> 
> This gives 2-4% faster overall decode for normal files.
> 
> Some thoughts:
> I can't think of any shortcuts that could make the IDCT faster with 128-bit SIMD that don't rely on knowing the last non-zero coefficient.
> 
> Knowing that before calling the idct, you could do a slightly faster IDCT that assumes the right and bottom of the block are all 0. This seems to be significantly faster only for mmx; for sse2 it's nearly a wash between the added check vs. the time saved.
> 
> For an average video, around a third of all idcts are DC-only, a third more could be done with that shortcut (i.e. last_nnz is under 10), and the rest require a full IDCT.
> 
> libtheora only does the 10-element shortcut, not DC-only. It also only has an MMX IDCT.
> 
> I also haven't really looked at whether a DC-only IDCT is beneficial for mpeg codecs, thus the vp3-specific dsputil function.
> 
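The three cases described in the quoted text can be sketched as a small dispatch on the last non-zero coefficient. This is only an illustration: the enum, the function name, and the convention that last_nnz is a scan-order index where 0 means "only the DC coefficient is set" are assumptions, not part of the patch. Only the threshold of 10 comes from the text above.

```c
/* Hypothetical names; only the "under 10" threshold is from the quoted text. */
enum idct_kind {
    IDCT_DC_ONLY,   /* only block[0] is non-zero */
    IDCT_TOP_LEFT,  /* right and bottom of the block assumed to be 0 */
    IDCT_FULL       /* no assumptions about the block */
};

/* Assumes last_nnz is the scan-order index of the last non-zero
 * coefficient, so 0 means the block carries nothing but DC. */
static enum idct_kind pick_idct(int last_nnz)
{
    if (last_nnz == 0)
        return IDCT_DC_ONLY;
    if (last_nnz < 10)
        return IDCT_TOP_LEFT;
    return IDCT_FULL;
}
```

Whether the extra branch pays off depends on the SIMD flavour, as noted above for mmx vs. sse2.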

[...]
> diff --git a/libavcodec/vp3dsp.c b/libavcodec/vp3dsp.c
> index 87b64de..606e361 100644
> --- a/libavcodec/vp3dsp.c
> +++ b/libavcodec/vp3dsp.c
> @@ -223,6 +223,25 @@ void ff_vp3_idct_add_c(uint8_t *dest/*align 8*/, int line_size, DCTELEM *block/*
>      idct(dest, line_size, block, 2);
>  }
>  
> +void ff_vp3_idct_dc_add_c(uint8_t *dest/*align 8*/, int line_size, DCTELEM *block/*align 16*/){
> +    const uint8_t *cm = ff_cropTbl + MAX_NEG_CROP;
> +    int i, dc = block[0];

> +    dc = (46341*dc)>>16;
> +    dc = (46341*dc)>>16;

me searches for a bag to vomit into ...
do they do all x>>1 in theora that way or just selected ones?
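For context on the question above: 46341/65536 is roughly 1/sqrt(2), so the two multiplications together scale by about 1/2. The sketch below, with hypothetical helper names, compares the two-step form from the patch against a plain dc >> 1. It is restricted to non-negative inputs because right-shifting a negative int is implementation-defined in C.

```c
/* Scaling as written in the quoted patch: two multiplications by
 * 46341/65536, each followed by a truncating shift. */
static int dc_scale_two_step(int dc)
{
    dc = (46341 * dc) >> 16;
    dc = (46341 * dc) >> 16;
    return dc;
}

/* The plain halving asked about in the review. */
static int dc_scale_shift(int dc)
{
    return dc >> 1;
}

/* Count the inputs in [lo, hi] for which the two forms disagree.
 * Keep the range within 0..32767 so 46341 * dc fits in a 32-bit int. */
static int count_mismatches(int lo, int hi)
{
    int n = 0;
    for (int dc = lo; dc <= hi; dc++)
        n += dc_scale_two_step(dc) != dc_scale_shift(dc);
    return n;
}
```

The two forms are not interchangeable bit for bit: for dc = 2 the two-step version yields 0 while dc >> 1 yields 1, because each intermediate shift truncates.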


[...]
> diff --git a/libavcodec/x86/vp3dsp_mmx.c b/libavcodec/x86/vp3dsp_mmx.c
> index fead8e8..e39d0a1 100644
> --- a/libavcodec/x86/vp3dsp_mmx.c
> +++ b/libavcodec/x86/vp3dsp_mmx.c
> @@ -395,3 +395,65 @@ void ff_vp3_idct_add_mmx(uint8_t *dest, int line_size, DCTELEM *block)
>      ff_vp3_idct_mmx(block);
>      add_pixels_clamped_mmx(block, dest, line_size);
>  }
> +

> +void ff_vp3_idct_dc_add_mmx2(uint8_t *dest, int linesize, DCTELEM *block)
> +{
> +    int dc = block[0];
> +    dc = (46341*dc)>>16;

> +    dc = (46341*dc)>>16;
> +    dc = (dc + 8) >> 4;

you can merge these 2
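One way to read this, as a sketch rather than the form that was necessarily intended: fold the rounding constant into the product and shift once by 20. The function names are hypothetical, and the code assumes an arithmetic right shift for negative values (which gcc provides).

```c
/* The two statements as written in the patch. */
static int dc_round_separate(int dc)
{
    dc = (46341 * dc) >> 16;
    dc = (dc + 8) >> 4;
    return dc;
}

/* Merged form: adding 8 after a shift by 16 is the same as adding
 * 8 << 16 before it, and two truncating shifts by 16 and 4 collapse
 * into one shift by 20. */
static int dc_round_merged(int dc)
{
    return (46341 * dc + (8 << 16)) >> 20;
}

/* Returns 1 if both forms agree for every dc in [lo, hi]. */
static int merged_matches(int lo, int hi)
{
    for (int dc = lo; dc <= hi; dc++)
        if (dc_round_separate(dc) != dc_round_merged(dc))
            return 0;
    return 1;
}
```

The first (46341*dc)>>16 stays as it is; only the second multiplication and the final rounding are combined.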


> +    __asm__ volatile(
> +        "movd          %0, %%mm0 \n\t"
> +        "pshufw $0, %%mm0, %%mm0 \n\t"
> +        "pxor       %%mm1, %%mm1 \n\t"
> +        "psubw      %%mm0, %%mm1 \n\t"
> +        "packuswb   %%mm0, %%mm0 \n\t"
> +        "packuswb   %%mm1, %%mm1 \n\t"
> +        ::"r"(dc)
> +    );
> +    __asm__ volatile(
> +        "movq          %0, %%mm2 \n\t"
> +        "movq          %1, %%mm3 \n\t"
> +        "movq          %2, %%mm4 \n\t"
> +        "movq          %3, %%mm5 \n\t"
> +        "paddusb    %%mm0, %%mm2 \n\t"
> +        "paddusb    %%mm0, %%mm3 \n\t"
> +        "paddusb    %%mm0, %%mm4 \n\t"
> +        "paddusb    %%mm0, %%mm5 \n\t"
> +        "psubusb    %%mm1, %%mm2 \n\t"
> +        "psubusb    %%mm1, %%mm3 \n\t"
> +        "psubusb    %%mm1, %%mm4 \n\t"
> +        "psubusb    %%mm1, %%mm5 \n\t"
> +        "movq       %%mm2, %0    \n\t"
> +        "movq       %%mm3, %1    \n\t"
> +        "movq       %%mm4, %2    \n\t"
> +        "movq       %%mm5, %3    \n\t"
> +        :"+m"(*(uint32_t*)(dest+0*linesize)),
> +         "+m"(*(uint32_t*)(dest+1*linesize)),
> +         "+m"(*(uint32_t*)(dest+2*linesize)),
> +         "+m"(*(uint32_t*)(dest+3*linesize))
> +    );
> +    dest += 4*linesize;
> +    __asm__ volatile(
> +        "movq          %0, %%mm2 \n\t"
> +        "movq          %1, %%mm3 \n\t"
> +        "movq          %2, %%mm4 \n\t"
> +        "movq          %3, %%mm5 \n\t"
> +        "paddusb    %%mm0, %%mm2 \n\t"
> +        "paddusb    %%mm0, %%mm3 \n\t"
> +        "paddusb    %%mm0, %%mm4 \n\t"
> +        "paddusb    %%mm0, %%mm5 \n\t"
> +        "psubusb    %%mm1, %%mm2 \n\t"
> +        "psubusb    %%mm1, %%mm3 \n\t"
> +        "psubusb    %%mm1, %%mm4 \n\t"
> +        "psubusb    %%mm1, %%mm5 \n\t"
> +        "movq       %%mm2, %0    \n\t"
> +        "movq       %%mm3, %1    \n\t"
> +        "movq       %%mm4, %2    \n\t"
> +        "movq       %%mm5, %3    \n\t"
> +        :"+m"(*(uint32_t*)(dest+0*linesize)),
> +         "+m"(*(uint32_t*)(dest+1*linesize)),
> +         "+m"(*(uint32_t*)(dest+2*linesize)),
> +         "+m"(*(uint32_t*)(dest+3*linesize))
> +    );

please write it as a single asm block, gcc had the habit of putting unneeded
instructions between asm blocks
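For readers following the asm, a per-byte scalar model of what the quoted paddusb/psubusb pair computes may help. It is a sketch with hypothetical names, not code from the patch: mm0 holds dc saturated to 0..255, mm1 holds -dc saturated to 0..255, and each pixel gets one saturating add followed by one saturating subtract.

```c
#include <stdint.h>

/* Clamp to the unsigned byte range, as packuswb/paddusb/psubusb do. */
static uint8_t sat_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Model of the asm for one pixel. */
static uint8_t dc_add_saturating(uint8_t pixel, int dc)
{
    uint8_t pos = sat_u8(dc);        /* packuswb of  dc (mm0) */
    uint8_t neg = sat_u8(-dc);       /* packuswb of -dc (mm1) */
    uint8_t t   = sat_u8(pixel + pos);   /* paddusb */
    return sat_u8(t - neg);              /* psubusb */
}

/* What the C version does via the crop table: clamp(pixel + dc). */
static uint8_t dc_add_reference(uint8_t pixel, int dc)
{
    return sat_u8(pixel + dc);
}

/* Returns 1 if both models agree for all pixels and dc in [-512, 512]. */
static int models_agree(void)
{
    for (int p = 0; p < 256; p++)
        for (int dc = -512; dc <= 512; dc++)
            if (dc_add_saturating((uint8_t)p, dc) !=
                dc_add_reference((uint8_t)p, dc))
                return 0;
    return 1;
}
```

Since one of pos and neg is always 0, the add/subtract pair handles either sign of dc without a branch.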


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The worst form of inequality is to try to make unequal things equal.
-- Aristotle