[FFmpeg-soc] [PATCH] 4/4 Split sws_getContext_altivec_alloc_context from sws_getContext

Sun Jun 15 14:26:23 CEST 2008

On Sun, Jun 15, 2008 at 12:26:32PM +0200, Luca Barbato wrote:
> Michael Niedermayer wrote:
> > On Sun, Jun 15, 2008 at 01:55:57AM +0200, Luca Barbato wrote:
> >> Keiji Costantini wrote:
> >>> Luca Barbato wrote:
> >>>> Michael Niedermayer wrote:
> >>>>> On Wed, Jun 11, 2008 at 02:36:08AM +0200, Keiji Costantini wrote:
> >>>>>> -                p[j] = c->vLumFilter[i];
> >>>>>> -                p[j] = c->vChrFilter[i];
> >>>>> Whichever way this is done and wherever, it should be done at the
> >>>>> same place where lum/chrMmxFilter is initialized.
> >>>>> And of course both altivec & mmx should use the same array for the same data.
> >>>>>
> >>>>> But looking again it seems these arrays are practically unused and the
> >>>>> code using them looks like it shouldn't use them in the first place.
> >>>>>
> >>>>> So, correct cleanup seems to be to remove vCCoeffsBank and vYCoeffsBank.
> >>>> The *Banks are just a copy from aligned memory to another, so just using 
> >>>> the vLumFilter and vChrFilter directly won't cause problems.
> >>>>
> >>>> lu
> >>>>
> >>> extract from code:
> >>>
> >>>      for (i=0;i<c->vLumFilterSize*c->dstH;i++) {
> >>>          int j;
> >>>          short *p = (short *)&c->vYCoeffsBank[i];
> >>>          for (j=0;j<8;j++)
> >>>              p[j] = c->vLumFilter[i];
> >>>      }
> >>>
> >>> I see *Banks are *filters copied 8 times each...
> >> I'm an idiot =P
> > 
> > At least I now know why I didn't understand your earlier reply :)
> 
> Happens when I try to read code and I'm just awake or about to sleep ^^;
> 
> >> Well they could go away adding 2 vec_splats, but I'm pretty sure it 
> >> would slow things down. I'd consider this later -_-
> > 
> > I wouldn't be so sure that the splats are slower than the cache thrashing the
> > array causes.
> > Also if done properly (like in the mmx code) then there are rather few splats.
> 
> Now I'm just awake so I'll write something stupid again but:
> 
> if I just use the original vector I'd have:
> 
> (dumb way)
> - one full unaligned load (2 loads, 1 table lookup, 1 permute)
> - a splat
> 
> or
> (smarter way)
> - one simple load
> - address mask to get which element I care about
> - a splat
> 
> right now I have a simple load and what's equivalent to the address mask
> more or less (one &15 more), so you are right I should be able to kill
> those vectors and not lose much.
> 
> lu - am I insane?

yes

I said it should be done like the MMX code does. What you propose to
change doesn't need to be changed at all.

Currently yuv2rgb_altivec.c builds a filter_size * height table during
init. It should instead build a filter_size table for each line.
What you propose redundantly performs the splat filter_size * width * height
times, so the speed difference is proportional to the width.
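
A minimal sketch of one reading of the per-line table described above,
written as portable scalar C and not as real AltiVec or MMX code. The
names vec_s16, splat_s16, build_line_coeffs, vscale_line and
MAX_FILTER_SIZE are hypothetical, and the layout of the filter array
(filter_size coefficients per output line, stored consecutively) is an
assumption. The point is only that the splats happen filter_size times
per line, outside the loop over the width:

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8             /* 8 x int16 in a 128-bit vector */
#define MAX_FILTER_SIZE 16  /* hypothetical bound for this sketch */

/* Scalar stand-in for "vector signed short" (assumption, not AltiVec). */
typedef struct { int16_t v[LANES]; } vec_s16;

/* Stand-in for a splat: replicate one coefficient into every lane. */
static vec_s16 splat_s16(int16_t x)
{
    vec_s16 r;
    for (int j = 0; j < LANES; j++)
        r.v[j] = x;
    return r;
}

/* Build the small table for ONE output line: filter_size splats.
 * Assumes v_filter holds filter_size coefficients per line, consecutively. */
static void build_line_coeffs(vec_s16 *line_coeffs, const int16_t *v_filter,
                              int filter_size, int dst_y)
{
    assert(filter_size > 0 && filter_size <= MAX_FILTER_SIZE);
    for (int i = 0; i < filter_size; i++)
        line_coeffs[i] = splat_s16(v_filter[dst_y * filter_size + i]);
}

/* Vertical filter for one line. The inner loops only read line_coeffs;
 * no splat is performed per pixel, so the splat cost is independent of width. */
static void vscale_line(int32_t *dst, const int16_t *const *src_rows,
                        const vec_s16 *line_coeffs, int filter_size, int width)
{
    for (int x = 0; x < width; x += LANES) {
        for (int j = 0; j < LANES && x + j < width; j++) {
            int32_t acc = 0;
            for (int i = 0; i < filter_size; i++)
                acc += line_coeffs[i].v[j] * src_rows[i][x + j];
            dst[x + j] = acc;
        }
    }
}
```

Per frame this performs filter_size * height splats and keeps only a
filter_size-entry table live at any time, compared with a
filter_size * height table built at init or a splat repeated for every
group of pixels across the width.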

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No snowflake in an avalanche ever feels responsible. -- Voltaire