[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions

Michael Niedermayer michaelni
Thu Oct 11 23:06:06 CEST 2007


Hi

On Thu, Oct 11, 2007 at 09:02:15PM +0200, Christophe GISQUET wrote:
[...]
> > also you read the data and unpack it 4 times; this is not good.
> > half of that could be avoided by code like this:
> > (and maybe there are more efficient variants ...)
> > 
> > b= read_and_unpack(i+1);
> > c= read_and_unpack(i+2);
> > for(){
> >     b+=c;
> >     b*=9;
> >     a= read_and_unpack(i+0);
> >     d= read_and_unpack(i+3);
> >     b-=a;
> >     b-=d;
> >     c+=d;
> >     c*=9;
> >     b= read_and_unpack(i+1);
> >     a= read_and_unpack(i+4);
> >     c-=b;
> >     c-=a;
> >     d+=a;
> >     d*=9;
> >     c= read_and_unpack(i+2);
> >     b= read_and_unpack(i+5);
> >     d-=c;
> >     d-=b;
> >     a+=b;
> >     a*=9;
> >     d= read_and_unpack(i+3);
> >     c= read_and_unpack(i+6);
> >     a-=d;
> >     a-=c;
> > }
> > and my suggestion above can use a macro to avoid the 4x code duplication
> 
> Agreed. However, you trade memory loads/unpacks for potentially worse
> code parallelism/pairing and size (the loop body is unrolled 4 times
> here). I wonder if that'll be a win. I'll leave that for a later patch.

you have unrolled the loops in the horizontal direction, which also increased
the code size, and instruction pairing is specific to the good old Pentium;
it has no relevance today
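The register-rotation idea in the quoted sketch can be checked in plain C. This is a minimal sketch, assuming the filter is the 4-tap (-1, 9, 9, -1) kernel implied by the pseudocode, i.e. dst[i] = 9*(s[i+1] + s[i+2]) - s[i] - s[i+3], with rounding, shifting and clamping left out. `read_and_unpack` here is a hypothetical scalar stand-in for the MMX load/unpack, and `filter_naive`/`filter_rotating` are illustrative names, not functions from the patch. The sketch adds the stores that the pseudocode leaves implicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for read_and_unpack(): fetch one 8-bit sample
 * and widen it (in the MMX code, a load plus an unpack). */
static int read_and_unpack(const uint8_t *src, int i)
{
    return src[i];
}

/* Straightforward version: 4 loads per output sample. */
static void filter_naive(const uint8_t *src, int16_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (int16_t)(9 * (read_and_unpack(src, i + 1) + read_and_unpack(src, i + 2))
                           - read_and_unpack(src, i) - read_and_unpack(src, i + 3));
}

/* Rotating version following the quoted sketch: a, b, c, d swap roles
 * between accumulator and input, so each output needs only 2 loads.
 * Invariant at the loop top: b == src[i+1], c == src[i+2].
 * n must be a multiple of 4 and src must hold n + 3 samples. */
static void filter_rotating(const uint8_t *src, int16_t *dst, int n)
{
    int a, b, c, d;

    assert(n % 4 == 0);
    b = read_and_unpack(src, 1);
    c = read_and_unpack(src, 2);
    for (int i = 0; i < n; i += 4) {
        b += c; b *= 9;                     /* 9*(s[i+1] + s[i+2]) */
        a = read_and_unpack(src, i + 0);
        d = read_and_unpack(src, i + 3);
        b -= a; b -= d;
        dst[i + 0] = (int16_t)b;

        c += d; c *= 9;                     /* 9*(s[i+2] + s[i+3]) */
        b = read_and_unpack(src, i + 1);
        a = read_and_unpack(src, i + 4);
        c -= b; c -= a;
        dst[i + 1] = (int16_t)c;

        d += a; d *= 9;                     /* 9*(s[i+3] + s[i+4]) */
        c = read_and_unpack(src, i + 2);
        b = read_and_unpack(src, i + 5);
        d -= c; d -= b;
        dst[i + 2] = (int16_t)d;

        a += b; a *= 9;                     /* 9*(s[i+4] + s[i+5]) */
        d = read_and_unpack(src, i + 3);
        c = read_and_unpack(src, i + 6);
        a -= d; a -= c;
        dst[i + 3] = (int16_t)a;
        /* b == s[i+5], c == s[i+6]: the invariant holds for i + 4 */
    }
}
```

Each iteration produces 4 outputs from 8 loads, versus 16 for filter_naive. The four stanzas differ only in which variable plays which role, which is the part a macro could generate.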


> 
> >> +        "movq      %%mm1, %%mm3            \n\t"
> >> +        "movq      %%mm2, %%mm4            \n\t"
> >> +        "movq      %%mm5, %%mm6            \n\t"
> >> +        "psllw     $3, %%mm1               \n\t"
> >> +        "psllw     $3, %%mm2               \n\t"
> >> +        "psllw     $3, %%mm5               \n\t"
> >> +        "paddsw    %%mm3, %%mm1            \n\t"
> >> +        "paddsw    %%mm4, %%mm2            \n\t"
> >> +        "paddsw    %%mm6, %%mm5            \n\t"
> > 
> > have you tried 3 pmullw instead of this?
> 
> Well, I'd lose one register that is used in the current code, unless I
> leave the *9 factor in memory. However, with your idea of unrolling
> loops to factor out memory loads, I won't have enough free registers to
> continue doing this.

no, you do have enough registers; just order the instructions more
efficiently
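The quoted psllw/paddsw sequence computes x*9 as (x << 3) + x in each 16-bit lane. pmullw with a constant 9 in every lane (held in a spare register or read from memory, which is the register cost mentioned above) computes the same product. The sketch below models one lane of each in scalar C; the function names are hypothetical. The two agree whenever 9*x fits in a signed 16-bit value and differ only on overflow: paddsw saturates, while pmullw keeps the low 16 bits of the product.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of paddsw: signed 16-bit add with saturation. */
static int16_t adds16(int16_t x, int16_t y)
{
    int32_t s = (int32_t)x + y;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* One lane of: movq x -> t; psllw $3, x; paddsw t, x  =>  8x + x.
 * psllw drops the bits shifted out, so the shift wraps. */
static int16_t mul9_shift_add(int16_t x)
{
    int16_t x8 = (int16_t)(uint16_t)((uint16_t)x << 3);
    return adds16(x8, x);
}

/* One lane of pmullw by 9: low 16 bits of the product (wraps). */
static int16_t mul9_pmullw(int16_t x)
{
    return (int16_t)(uint16_t)((int32_t)x * 9);
}

/* Asserts that both variants give exactly 9*x for every x in [lo, hi]. */
static void check_agree(int lo, int hi)
{
    for (int x = lo; x <= hi; x++) {
        assert(mul9_shift_add((int16_t)x) == 9 * x);
        assert(mul9_pmullw((int16_t)x) == 9 * x);
    }
}
```

check_agree(-3640, 3640) passes because 9 * 3640 = 32760 still fits in int16_t. At x = 4000 the shift-and-add variant returns 32767 (saturated) and the pmullw variant returns -29536 (wrapped).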

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have never wished to cater to the crowd; for what I know they do not
approve, and what they approve I do not know. -- Epicurus