[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions
Christophe GISQUET
christophe.gisquet
Thu Oct 11 21:02:15 CEST 2007
Note: attached patch is only to verify what I'm discussing in this mail.
Michael Niedermayer wrote:
>> + ASMALIGN(3)
>> + "1: \n\t"
>
> how much speed is gained by the align?
All shift1 and shift3 functions benefit from it, but not the shift2
versions. I'll therefore remove the unneeded ones.
> some CPUs (P4) don't like shifts, not even when building addresses; that is,
> they are SLOW IIRC
>
> if %1 would point to line 1
> these could be read with (%1) and (%1, %3)
>
> and after
> add %3, %1
>
> you could read the other 2 by (%1,%4) and (%1, %3) (%4 = -2*stride)
>
> there are of course other variants
For the record, on my Core 2, playing Robotica_720:
with shifts:
3310 dezicycles in ver, 524218 runs, 70 skips
2574 dezicycles in hor, 524155 runs, 133 skips
without:
3340 dezicycles in ver, 524230 runs, 58 skips
2643 dezicycles in hor, 524160 runs, 128 skips
So the version with shifts is about 1% faster in ver and about 2.6% faster
in hor on this machine. Code attached for anyone to confirm this.
These values can also serve as a reference in any later mail.
Do you also think it's worth modifying the shift1/3 versions?
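
To make the addressing variant quoted above concrete, here is a minimal C
sketch of the same pointer arithmetic. It is only an illustration, not code
from the patch: the helper name read_four_lines and its out[] array are
hypothetical. The point is that all four lines are reached with a plain
base + index address and no scaled (shifted) index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): fetch one pixel from each of
 * four vertically adjacent lines using only base + offset addressing.
 * `line1` points into line 1, i.e. the second of the four lines. */
static void read_four_lines(const uint8_t *line1, ptrdiff_t stride,
                            uint8_t out[4])
{
    const ptrdiff_t neg2 = -2 * stride; /* plays the role of %4 */
    const uint8_t *p = line1;           /* plays the role of %1 */

    out[1] = p[0];       /* (%1)      -> line 1 */
    out[2] = p[stride];  /* (%1, %3)  -> line 2 */
    p += stride;         /* add %3, %1: p now points into line 2 */
    out[0] = p[neg2];    /* (%1, %4)  -> line 0 */
    out[3] = p[stride];  /* (%1, %3)  -> line 3 */
}
```

Whether this beats the scaled-index form is exactly what the numbers above
measure; the sketch only shows that the two read the same data.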
> also you read the data and unpack it 4 times, this is not good
> half of that could be avoided by code like that:
> (and maybe there are more efficient variants ...)
>
> b= read_and_unpack(i+1);
> c= read_and_unpack(i+2);
> for(){
> b+=c;
> b*=9;
> a= read_and_unpack(i+0);
> d= read_and_unpack(i+3);
> b-=a;
> b-=d;
> c+=d;
> c*=9;
> b= read_and_unpack(i+1);
> a= read_and_unpack(i+4);
> c-=b;
> c-=a;
> d+=a;
> d*=9;
> c= read_and_unpack(i+2);
> b= read_and_unpack(i+5);
> d-=c;
> d-=b;
> a+=b;
> a*=9;
> d= read_and_unpack(i+3);
> c= read_and_unpack(i+6);
> a-=d;
> a-=c;
> }
> and my suggestion above can use a macro to avoid the 4x code duplication
Agreed. However, you trade memory loads/unpacks for potentially worse
code parallelism/pairing and larger code size (four loop iterations are
unrolled here). I wonder if that'll be a win. I'll leave that to a later patch.
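
For reference, the rotating scheme quoted above can be modelled in scalar C
as below. This is a sketch under assumptions, not the MMX code: the function
names, the int16_t output buffer, and the plain (-1, 9, 9, -1) filter without
rounding or shifting are all chosen for illustration. Plain array reads stand
in for read_and_unpack(). Each unrolled group of four outputs does 8 reads
instead of the 16 the naive version needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: every output re-reads its four inputs. */
static void filter_naive(const uint8_t *src, int16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)(9 * (src[i + 1] + src[i + 2])
                           - src[i] - src[i + 3]);
}

/* Rotating version: a, b, c, d swap roles on each of the four unrolled
 * steps, so half of the reads are reused.
 * Assumes n is a multiple of 4 and src holds n + 3 samples. */
static void filter_rotating(const uint8_t *src, int16_t *dst, size_t n)
{
    int a, b, c, d;

    if (n == 0)
        return;
    b = src[1];
    c = src[2];
    for (size_t i = 0; i < n; i += 4) {
        b = (b + c) * 9; a = src[i + 0]; d = src[i + 3];
        dst[i + 0] = (int16_t)(b - a - d);
        c = (c + d) * 9; b = src[i + 1]; a = src[i + 4];
        dst[i + 1] = (int16_t)(c - b - a);
        d = (d + a) * 9; c = src[i + 2]; b = src[i + 5];
        dst[i + 2] = (int16_t)(d - c - b);
        a = (a + b) * 9; d = src[i + 3]; c = src[i + 6];
        dst[i + 3] = (int16_t)(a - d - c);
    }
}
```

The four steps differ only in which variable plays which role, which is what
would let a macro generate them in the asm version.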
>> + "movq %%mm1, %%mm3 \n\t"
>> + "movq %%mm2, %%mm4 \n\t"
>> + "movq %%mm5, %%mm6 \n\t"
>> + "psllw $3, %%mm1 \n\t"
>> + "psllw $3, %%mm2 \n\t"
>> + "psllw $3, %%mm5 \n\t"
>> + "paddsw %%mm3, %%mm1 \n\t"
>> + "paddsw %%mm4, %%mm2 \n\t"
>> + "paddsw %%mm6, %%mm5 \n\t"
>
> have you tried 3 pmullw instead of this?
Well, I'd lose one register that is used in the current code, unless I
leave the *9 factor in memory. However, with your idea of unrolling
loops to factor out memory loads, I won't have enough free registers to
continue doing this.
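
As a sanity check on the arithmetic, here is a per-lane C model of the two
ways of multiplying by 9 discussed above. The function names are made up for
this mail. The model assumes the lanes hold sums of two unpacked 8-bit pixels
(0..510), where neither the 16-bit wrap of psllw nor the saturation of paddsw
comes into play. It says nothing about speed or register pressure, only that
both sequences give the same result on that range.

```c
#include <stdint.h>

/* Model of one 16-bit lane of: movq (copy), psllw $3, paddsw.
 * psllw wraps on 16 bits; paddsw is a signed saturating add. */
static int16_t mul9_shift_add(int16_t x)
{
    int16_t shifted = (int16_t)(uint16_t)((uint16_t)x << 3); /* x * 8 */
    int32_t sum = (int32_t)shifted + x;                      /* + x   */

    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* Model of one 16-bit lane of pmullw with a constant 9:
 * the low 16 bits of the product are kept. */
static int16_t mul9_pmullw(int16_t x)
{
    return (int16_t)(uint16_t)((int32_t)x * 9);
}
```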
Best regards,
--
Christophe GISQUET
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vc1dsp.diff
Type: text/x-patch
Size: 27528 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071011/e400d486/attachment.bin>