[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions
Michael Niedermayer
michaelni
Tue Oct 2 23:19:42 CEST 2007
Hi
On Sat, Sep 15, 2007 at 03:03:13PM +0200, Christophe GISQUET wrote:
> Hello,
>
> Christophe GISQUET wrote:
> > Hello,
> >
> > here are the MMX functions now licensed under the MIT license.
>
> An updated version matching the new C implementation.
>
> IMHO, I/O of the functions being different (8/16 bits), it's difficult
> to have only 3 functions doing the filtering, but at least I tried to
> factorize code.
[...]
> +/** By sacrifying mm5 and mm6, all 11 (12) pixels can be done in one pass */
> +static void vc1_put_ver_16b_shift2_mmx(int16_t *dst,
> + const uint8_t *src, long int stride,
> + int rnd, int64_t shift)
> +{
> + int h = 8;
> + src -= stride;
> + asm volatile(
> + LOAD_ROUNDER_MMX("%5")
> + ASMALIGN(3)
> + "1: \n\t"
how much speed is gained by the align?
> + "movd 0(%1,%3 ), %%mm1 \n\t"
> + "movd 4(%1,%3 ), %%mm2 \n\t"
> + "movd 8(%1,%3 ), %%mm5 \n\t"
> + "movd 0(%1,%3,2), %%mm3 \n\t"
> + "movd 4(%1,%3,2), %%mm4 \n\t"
> + "movd 8(%1,%3,2), %%mm6 \n\t"
Some CPUs (P4) don't like shifts, not even when building addresses; that is,
they are SLOW IIRC.
If %1 pointed to line 1,
these could be read with (%1) and (%1,%3),
and after
add %3, %1
you could read the other 2 with (%1,%4) and (%1,%3) (%4 = -2*stride).
There are of course other variants.
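To make the index arithmetic concrete, here is a minimal C sketch of that access pattern. The function name and the rows[] layout are made up for illustration, and a C compiler is free to pick any addressing mode it likes; the sketch only shows that lines 0..3 are reachable with a plain base+index and one precomputed -2*stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fetch one byte from each of lines 0..3 using only
 * unscaled base+index accesses. line1 plays the role of %1, stride of %3. */
static void read_four_rows(const uint8_t *line1, ptrdiff_t stride,
                           uint8_t rows[4])
{
    const uint8_t *p = line1;       /* %1 points at line 1 */
    ptrdiff_t neg2 = -2 * stride;   /* %4 = -2*stride, computed once */

    rows[1] = p[0];                 /* (%1)    -> line 1 */
    rows[2] = p[stride];            /* (%1,%3) -> line 2 */
    p += stride;                    /* add %3, %1: now at line 2 */
    rows[0] = p[neg2];              /* (%1,%4) -> line 0 */
    rows[3] = p[stride];            /* (%1,%3) -> line 3 */
}

/* Self-check on a tiny 4-line buffer; returns the number of lines checked. */
static int check_read_four_rows(void)
{
    enum { STRIDE = 16 };
    uint8_t buf[4 * STRIDE] = {0};
    uint8_t rows[4];

    for (int r = 0; r < 4; r++)
        buf[r * STRIDE] = (uint8_t)(10 + r);   /* tag the first byte of each line */
    read_four_rows(buf + STRIDE, STRIDE, rows);
    for (int r = 0; r < 4; r++)
        assert(rows[r] == 10 + r);
    return 4;
}
```

Whether this is actually faster than the scaled form would still have to be benchmarked on the CPUs in question.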
Also, you read the data and unpack it 4 times; this is not good.
Half of that could be avoided by code like this
(and maybe there are more efficient variants ...):
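b= read_and_unpack(i+1);
c= read_and_unpack(i+2);
for(){                        /* 4 output lines per iteration */
    b+=c;
    b*=9;
    a= read_and_unpack(i+0);
    d= read_and_unpack(i+3);
    b-=a;
    b-=d;                     /* b = result for line i, store it before b is reused */
    c+=d;
    c*=9;
    b= read_and_unpack(i+1);
    a= read_and_unpack(i+4);
    c-=b;
    c-=a;                     /* c = result for line i+1 */
    d+=a;
    d*=9;
    c= read_and_unpack(i+2);
    b= read_and_unpack(i+5);
    d-=c;
    d-=b;                     /* d = result for line i+2 */
    a+=b;
    a*=9;
    d= read_and_unpack(i+3);
    c= read_and_unpack(i+6);
    a-=d;
    a-=c;                     /* a = result for line i+3; b, c now hold lines i+5, i+6 */
}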
And my suggestion above can use a macro to avoid the 4x code duplication.
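As a rough illustration of the macro idea, here is a scalar C sketch for a single column. READ and STEP are hypothetical names, READ stands in for read_and_unpack, and rounding and shifting are left out, so this is not the MMX code itself. Each STEP does 2 reads per output line instead of 4, and the four steps differ only in which variable plays which role.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: out[y] = 9*(p[y+1] + p[y+2]) - p[y] - p[y+3], 4 reads per line. */
static void filter_naive(int16_t *dst, const uint8_t *src,
                         ptrdiff_t stride, int h)
{
    for (int y = 0; y < h; y++)
        dst[y] = (int16_t)(9 * (src[(y + 1) * stride] + src[(y + 2) * stride])
                           - src[y * stride] - src[(y + 3) * stride]);
}

/* Hypothetical stand-in for "read one line and unpack to 16 bit". */
#define READ(n) ((int)src[(n) * stride])

/* One output line. On entry B and C hold lines j+1 and j+2; on exit
 * C and D hold lines j+2 and j+3, which is what the next step expects. */
#define STEP(A, B, C, D, j)        \
    do {                           \
        B += C;                    \
        B *= 9;                    \
        A = READ((j) + 0);         \
        D = READ((j) + 3);         \
        B -= A;                    \
        B -= D;                    \
        dst[j] = (int16_t)B;       \
    } while (0)

static void filter_rotating(int16_t *dst, const uint8_t *src,
                            ptrdiff_t stride, int h)
{
    int a, b, c, d;

    assert(h % 4 == 0);            /* the loop is unrolled 4 times */
    b = READ(1);
    c = READ(2);
    for (int i = 0; i < h; i += 4) {
        STEP(a, b, c, d, i);       /* same step, variable roles rotated */
        STEP(b, c, d, a, i + 1);
        STEP(c, d, a, b, i + 2);
        STEP(d, a, b, c, i + 3);
    }
}

/* Compare both versions on 8 output lines; returns the number compared. */
static int check_rotating(void)
{
    uint8_t src[11];               /* 8 output lines need 8 + 3 input lines */
    int16_t ref[8], out[8];

    for (int i = 0; i < 11; i++)
        src[i] = (uint8_t)(i * 37 + 11);
    filter_naive(ref, src, 1, 8);
    filter_rotating(out, src, 1, 8);
    for (int i = 0; i < 8; i++)
        assert(ref[i] == out[i]);
    return 8;
}
```

In the asm version the same rotation would be done by passing different register names to one macro, so the loop body is written once and instantiated 4 times.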
[...]
> + "movq %%mm1, %%mm3 \n\t"
> + "movq %%mm2, %%mm4 \n\t"
> + "movq %%mm5, %%mm6 \n\t"
> + "psllw $3, %%mm1 \n\t"
> + "psllw $3, %%mm2 \n\t"
> + "psllw $3, %%mm5 \n\t"
> + "paddsw %%mm3, %%mm1 \n\t"
> + "paddsw %%mm4, %%mm2 \n\t"
> + "paddsw %%mm6, %%mm5 \n\t"
Have you tried 3 pmullw instead of this?
Also, these comments apply to more than just this function.
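For reference, the two ways of getting x*9 can be modelled per 16-bit lane in plain C. This is only a sketch: the helper names are made up, and it assumes each lane holds a small non-negative value such as the sum of two unpacked 8-bit pixels (0..510), where neither the saturation of paddsw nor the truncation of pmullw kicks in.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of paddsw: signed 16-bit add with saturation. */
static int16_t adds16(int16_t x, int16_t y)
{
    int s = x + y;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* movq copy, psllw $3, paddsw: (x << 3) + x. */
static int16_t times9_shift_add(int16_t x)
{
    int16_t x8 = (int16_t)(uint16_t)((uint16_t)x << 3);
    return adds16(x8, x);
}

/* pmullw by a constant 9: low 16 bits of the product. */
static int16_t times9_mul(int16_t x)
{
    return (int16_t)(uint16_t)((uint16_t)x * 9u);
}

/* Both variants give 9*x over the assumed input range 0..510;
 * returns the number of values checked. */
static int check_times9(void)
{
    int n = 0;
    for (int x = 0; x <= 510; x++, n++) {
        assert(times9_shift_add((int16_t)x) == 9 * x);
        assert(times9_mul((int16_t)x) == 9 * x);
    }
    return n;
}
```

So the results agree in that range; which form is faster depends on the CPU and has to be measured.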
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato