[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions

Michael Niedermayer michaelni
Tue Oct 2 23:19:42 CEST 2007


Hi

On Sat, Sep 15, 2007 at 03:03:13PM +0200, Christophe GISQUET wrote:
> Hello,
> 
> Christophe GISQUET wrote:
> > Hello,
> > 
> > here are the MMX functions now licensed under the MIT license.
> 
> An updated version matching the new C implementation.
> 
> IMHO, I/O of the functions being different (8/16 bits), it's difficult
> to have only 3 functions doing the filtering, but at least I tried to
> factorize code.
[...]
> +/** By sacrifying mm5 and mm6, all 11 (12) pixels can be done in one pass */
> +static void vc1_put_ver_16b_shift2_mmx(int16_t *dst,
> +                                       const uint8_t *src, long int stride,
> +                                       int rnd, int64_t shift)
> +{
> +    int h = 8;
> +    src -= stride;
> +    asm volatile(
> +        LOAD_ROUNDER_MMX("%5")

> +        ASMALIGN(3)
> +        "1:                                \n\t"

how much speed is gained by the align?


> +        "movd      0(%1,%3  ), %%mm1       \n\t"
> +        "movd      4(%1,%3  ), %%mm2       \n\t"
> +        "movd      8(%1,%3  ), %%mm5       \n\t"
> +        "movd      0(%1,%3,2), %%mm3       \n\t"
> +        "movd      4(%1,%3,2), %%mm4       \n\t"
> +        "movd      8(%1,%3,2), %%mm6       \n\t"

some CPUs (P4) don't like shifts, not even when building addresses; that is,
they are SLOW IIRC

if %1 pointed to line 1,
these could be read with (%1) and (%1, %3)

and after
add       %3, %1

you could read the other 2 with (%1,%4) and (%1, %3) (%4 = -2*stride)

there are of course other variants
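
A minimal C sketch of the addressing scheme suggested above, for a single column. The function name is hypothetical, and the (-1, 9, 9, -1) taps are taken from the pseudocode further down in this mail; rounding and the final shift are left out. Each access mirrors one of the proposed operands, so no scaled index such as (%1,%3,2) is needed:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: computes one unrounded output sample of the
 * (-1, 9, 9, -1) vertical filter for a single column.
 * On entry *pp points at line 1 (the first of the two centre taps).
 * On exit *pp has advanced by one line, ready for the next output.
 */
static int filter_step_unscaled(const uint8_t **pp, ptrdiff_t stride)
{
    const uint8_t *p = *pp;
    const ptrdiff_t neg2 = -2 * stride; /* %4, would be set up once outside the loop */

    int l1 = p[0];       /* (%1)     -> line 1 */
    int l2 = p[stride];  /* (%1,%3)  -> line 2 */

    p += stride;         /* add %3, %1 */

    int l0 = p[neg2];    /* (%1,%4)  -> line 0 */
    int l3 = p[stride];  /* (%1,%3)  -> line 3 */

    *pp = p;
    return 9 * (l1 + l2) - l0 - l3;
}
```

The pointer increment that sits between the two pairs of loads doubles as the per-line advance of the loop, so it costs nothing extra.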

also you read the data and unpack it 4 times, which is not good
half of that could be avoided with code like this:
(and maybe there are more efficient variants ...)

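b= read_and_unpack(i+1);
c= read_and_unpack(i+2);
for(){                          /* 4 output lines per iteration, i+=4 */
    b+=c;
    b*=9;
    a= read_and_unpack(i+0);
    d= read_and_unpack(i+3);
    b-=a;
    b-=d;                       /* b is output line i+0, store it here */
    c+=d;
    c*=9;
    b= read_and_unpack(i+1);
    a= read_and_unpack(i+4);
    c-=b;
    c-=a;                       /* c is output line i+1, store it here */
    d+=a;
    d*=9;
    c= read_and_unpack(i+2);
    b= read_and_unpack(i+5);
    d-=c;
    d-=b;                       /* d is output line i+2, store it here */
    a+=b;
    a*=9;
    d= read_and_unpack(i+3);
    c= read_and_unpack(i+6);
    a-=d;
    a-=c;                       /* a is output line i+3, store it here */
}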
and my suggestion above can use a macro to avoid the 4x code duplication
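
The rotation above can be sketched in plain C with such a macro. This is only an illustration under assumptions: READ stands in for read_and_unpack on a single column, the names STEP, filter_rotating and filter_reference are made up here, the stores (which the pseudocode leaves implicit) are written out, h is assumed to be a multiple of 4, and rounding and shifting are omitted:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for read_and_unpack: one sample of line n of a column. */
#define READ(src, stride, n) ((int)(src)[(ptrdiff_t)(n) * (stride)])

/*
 * One filter step. On entry B and C hold lines n+1 and n+2.
 * On exit B has been stored as output line n, A holds line n and
 * D holds line n+3, so the next step can run with the roles rotated.
 * Uses dst, src and stride from the enclosing scope.
 */
#define STEP(A, B, C, D, n)                  \
    do {                                     \
        B += C;                              \
        B *= 9;                              \
        A = READ(src, stride, (n));          \
        D = READ(src, stride, (n) + 3);      \
        B -= A;                              \
        B -= D;                              \
        *dst++ = (int16_t)B;                 \
    } while (0)

/* Two reads per output line; src points at line 0, lines 0..h+2 are read. */
static void filter_rotating(int16_t *dst, const uint8_t *src,
                            ptrdiff_t stride, int h)
{
    int a, b, c, d;

    b = READ(src, stride, 1);
    c = READ(src, stride, 2);
    for (int i = 0; i < h; i += 4) {
        STEP(a, b, c, d, i);
        STEP(b, c, d, a, i + 1);
        STEP(c, d, a, b, i + 2);
        STEP(d, a, b, c, i + 3);
    }
}

/* Straightforward version with four reads per output line, for comparison. */
static void filter_reference(int16_t *dst, const uint8_t *src,
                             ptrdiff_t stride, int h)
{
    for (int i = 0; i < h; i++)
        dst[i] = (int16_t)(9 * (READ(src, stride, i + 1) +
                                READ(src, stride, i + 2))
                           - READ(src, stride, i)
                           - READ(src, stride, i + 3));
}
```

Both functions should produce identical output; the rotating one simply reuses the two centre lines that the previous step already loaded.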


[...]
> +        "movq      %%mm1, %%mm3            \n\t"
> +        "movq      %%mm2, %%mm4            \n\t"
> +        "movq      %%mm5, %%mm6            \n\t"
> +        "psllw     $3, %%mm1               \n\t"
> +        "psllw     $3, %%mm2               \n\t"
> +        "psllw     $3, %%mm5               \n\t"
> +        "paddsw    %%mm3, %%mm1            \n\t"
> +        "paddsw    %%mm4, %%mm2            \n\t"
> +        "paddsw    %%mm6, %%mm5            \n\t"

have you tried 3 pmullw instead of this?
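
For reference, a scalar C model of one 16-bit lane of the two variants. The helper names are hypothetical, and the sketch assumes the operand is small enough that 9*x fits in 16 bits (for example a sum of two 8-bit pixels); outside that range paddsw saturates while pmullw wraps, so the results would differ. It says nothing about which variant is faster, which has to be measured:

```c
#include <stdint.h>

/* One lane of paddsw: signed 16-bit add with saturation. */
static int16_t adds16(int16_t x, int16_t y)
{
    int32_t s = (int32_t)x + y;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* movq + psllw $3 + paddsw: x*8 + x.
 * The casts model 16-bit wraparound of the shift; converting back to
 * int16_t assumes the usual two's complement behaviour of the compiler. */
static int16_t times9_shift_add(int16_t x)
{
    int16_t x8 = (int16_t)(uint16_t)((uint16_t)x << 3);
    return adds16(x8, x);
}

/* pmullw by a constant 9: low 16 bits of the product. */
static int16_t times9_mul(int16_t x)
{
    return (int16_t)(uint16_t)((int32_t)x * 9);
}
```

With pmullw the three movq copies and three shift/add pairs would collapse into three multiplies by a register holding 9 in every lane, at the cost of keeping that constant somewhere.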

also these comments apply to more than just this function

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato 