[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions

Thu Oct 11 21:02:15 CEST 2007

Note: attached patch is only to verify what I'm discussing in this mail.

Michael Niedermayer wrote:
>> +        ASMALIGN(3)
>> +        "1:                                \n\t"
> 
> how much speed is gained by the align?

All shift1 and shift3 functions benefit from it, but the shift2
versions do not. I'll therefore remove the ASMALIGN from the latter.

> some CPUs (P4) don't like shifts, not even when building addresses; that is,
> they are SLOW there, IIRC
> 
> if %1 would point to line 1
> these could be read with (%1) and (%1, %3)
> 
> and after 
> add       %3, %1
> 
> you could read the other 2 by (%1,%4) and (%1, %3) (%4 = -2*stride)
> 
> there are of course other variants

For the record, on my Core 2, playing Robotica_720,
with shifts:
3310 dezicycles in ver, 524218 runs, 70 skips
2574 dezicycles in hor, 524155 runs, 133 skips

without:
3340 dezicycles in ver, 524230 runs, 58 skips
2643 dezicycles in hor, 524160 runs, 128 skips

So the version with shifts is the faster one on this CPU: around 2.5%
in hor, and a bit under 1% in ver. Code attached for anyone to confirm
this. These values can also serve as a reference in later mails.

Do you also think it's worth modifying the shift1/3 versions?
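
To make the addressing variant concrete, here is a scalar C model of it.
This is only a sketch: the function names are hypothetical, the 4-tap sum
is an illustrative choice, and the comments map each access to the asm
operands from the suggestion above rather than to the code in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the no-shift variant: p points at line 1
 * (the second of the four input lines), so no scaled index such as
 * stride*2 is needed in any address. */
static int taps_noshift(const uint8_t *p, ptrdiff_t stride)
{
    const ptrdiff_t neg2 = -2 * stride;  /* the "%4" operand, computed once */
    int b = p[0];                        /* (%1)     -> line 1 */
    int c = p[stride];                   /* (%1,%3)  -> line 2 */
    p += stride;                         /* add %3, %1: p is now at line 2 */
    int a = p[neg2];                     /* (%1,%4)  -> line 0 */
    int d = p[stride];                   /* (%1,%3)  -> line 3 */
    return 9 * (b + c) - a - d;
}

/* Reference: address every line from line 0 with scaled offsets. */
static int taps_scaled(const uint8_t *src, ptrdiff_t stride)
{
    return 9 * (src[stride] + src[2 * stride]) - src[0] - src[3 * stride];
}
```

Both functions return the same value when taps_noshift() is given
src + stride; whether the asm form is faster is what the numbers above
are about, and the C model says nothing on that.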

> also you read the data and unpack it 4 times, this is not good
> half of that could be avoided by code like that:
> (and maybe there are more efficient variants ...)
> 
> b= read_and_unpack(i+1);
> c= read_and_unpack(i+2);
> for(){
>     b+=c;
>     b*=9;
>     a= read_and_unpack(i+0);
>     d= read_and_unpack(i+3);
>     b-=a;
>     b-=d;
>     c+=d;
>     c*=9;
>     b= read_and_unpack(i+1);
>     a= read_and_unpack(i+4);
>     c-=b;
>     c-=a;
>     d+=a;
>     d*=9;
>     c= read_and_unpack(i+2);
>     b= read_and_unpack(i+5);
>     d-=c;
>     d-=b;
>     a+=b;
>     a*=9;
>     d= read_and_unpack(i+3);
>     c= read_and_unpack(i+6);
>     a-=d;
>     a-=c;
> }
> and my suggestion above can use a macro to avoid the 4x code duplication

Agreed. However, you trade memory loads/unpacks for potentially worse
code parallelism/pairing and larger code (four loop iterations are
unrolled here). I wonder whether that will be a win. I'll leave that to
a later patch.
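
For anyone who wants to check the register rotation before it is turned
into MMX, here is a scalar C model of the quoted pseudocode. It is a
sketch under assumptions: read_and_unpack() stands in for a load plus
unpack of one sample, the function names, the STEP macro and the n_loads
counter are hypothetical, and the height is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static unsigned long n_loads;  /* counts read_and_unpack() calls, for comparison only */

/* Stand-in for a load + unpack pair: fetch one sample of line `row`. */
static int read_and_unpack(const uint8_t *src, ptrdiff_t stride, int row)
{
    n_loads++;
    return src[row * stride];
}

/* Naive form: four loads per output line. */
static void filter_naive(int *dst, const uint8_t *src, ptrdiff_t stride, int h)
{
    for (int i = 0; i < h; i++) {
        int a = read_and_unpack(src, stride, i);
        int b = read_and_unpack(src, stride, i + 1);
        int c = read_and_unpack(src, stride, i + 2);
        int d = read_and_unpack(src, stride, i + 3);
        dst[i] = 9 * (b + c) - a - d;
    }
}

/* One output line: x and y hold lines k+1 and k+2 on entry; lo and hi
 * are reloaded with lines k and k+3, which are the two outer taps. */
#define STEP(x, y, lo, hi, k) do {                    \
        x += y;                                       \
        x *= 9;                                       \
        lo = read_and_unpack(src, stride, (k));       \
        hi = read_and_unpack(src, stride, (k) + 3);   \
        x -= lo;                                      \
        x -= hi;                                      \
        dst[(k)] = x;                                 \
    } while (0)

/* Rotating form: the macro removes the 4x source duplication, and the
 * variables rotate roles so that b and c are already loaded on loop entry. */
static void filter_rolling(int *dst, const uint8_t *src, ptrdiff_t stride, int h)
{
    int a, b, c, d;

    assert(h % 4 == 0);
    b = read_and_unpack(src, stride, 1);
    c = read_and_unpack(src, stride, 2);
    for (int i = 0; i < h; i += 4) {
        STEP(b, c, a, d, i);
        STEP(c, d, b, a, i + 1);
        STEP(d, a, c, b, i + 2);
        STEP(a, b, d, c, i + 3);
    }
}
```

In this model the rotating form performs 2*h + 2 loads instead of 4*h and
produces the same sums. It says nothing about pairing or code size, which
is the part I am unsure about.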

>> +        "movq      %%mm1, %%mm3            \n\t"
>> +        "movq      %%mm2, %%mm4            \n\t"
>> +        "movq      %%mm5, %%mm6            \n\t"
>> +        "psllw     $3, %%mm1               \n\t"
>> +        "psllw     $3, %%mm2               \n\t"
>> +        "psllw     $3, %%mm5               \n\t"
>> +        "paddsw    %%mm3, %%mm1            \n\t"
>> +        "paddsw    %%mm4, %%mm2            \n\t"
>> +        "paddsw    %%mm6, %%mm5            \n\t"
> 
> have you tried 3 pmullw instead of this?

Well, I'd lose one register that is used in the current code, unless I
leave the *9 factor in memory. However, with your idea of unrolling
loops to factor out memory loads, I won't have enough free registers to
continue doing this.
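
For reference, the two ways of computing the *9 factor can be modelled
per 16-bit lane in scalar C. A sketch only: the function names are
hypothetical, and the claim of equal results is limited to the value
range checked here (sums of two unpacked 8-bit samples, 0..510). Outside
that range the saturating add and the truncating multiply can differ.

```c
#include <stdint.h>

/* movq + psllw $3 + paddsw: needs a spare register for the copy. */
static int16_t times9_shift_add(int16_t x)
{
    int16_t copy = x;                                /* movq   %%mm1, %%mm3 */
    int16_t shifted = (int16_t)((uint16_t)x << 3);   /* psllw  $3, %%mm1    */
    int sum = shifted + copy;                        /* paddsw %%mm3, %%mm1 */

    if (sum > INT16_MAX) sum = INT16_MAX;            /* paddsw saturates */
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* pmullw: keeps the low 16 bits of the product; the constant 9 has to
 * live in a register or be read from memory. */
static int16_t times9_mul(int16_t x)
{
    return (int16_t)(uint16_t)(x * 9);
}
```

So the arithmetic is interchangeable for the inputs that matter here, and
the choice comes down to register pressure and instruction cost.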

Best regards,
-- 
Christophe GISQUET
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vc1dsp.diff
Type: text/x-patch
Size: 27528 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071011/e400d486/attachment.bin>