[FFmpeg-devel] [PATCH] swscale: round on planar2x C code

Sun Sep 12 14:02:24 CEST 2010

On Sun, Sep 12, 2010 at 12:11:15PM +0100, M?ns Rullg?rd wrote:
> Michael Niedermayer <michaelni at gmx.at> writes:
> 
> > On Sun, Sep 12, 2010 at 10:30:18AM +0100, M?ns Rullg?rd wrote:
> >> Ramiro Polla <ramiro.polla at gmail.com> writes:
> >> 
> >> > Hi,
> >> >
> >> > The C and MMX2/3dnow code differ in planar2x due to pavgb's rounding.
> >> > Attached patch makes the output similar.
> >> >
> >> > I couldn't measure any speed difference. gcc ends up using "leal
> >> > 3(xxx)" instead of "leal (xxx)" which doesn't seem to have a speed
> >> > penalty.
> >> 
> >> What it does on x86 is irrelevant since the mmx code will always be
> >> used in practice.
> >
> > no, the C code is used for the right border on mmx chips when the
> > width is not a multiple of 8 and the C code
> > used for that must round the same way as mmx.
> 
> That C code doesn't have to be the same as the reference.

no of course not but its simpler if it is


> 
> >> > Otherwise we could put the MMX2/3dnow code under some if(flags &bitexact).
> >> >
> >> > Ramiro Polla
> >> >
> >> > Index: rgb2rgb_template.c
> >> > ===================================================================
> >> > --- rgb2rgb_template.c	(revision 32166)
> >> > +++ rgb2rgb_template.c	(working copy)
> >> > @@ -1820,10 +1820,10 @@ static inline void RENAME(planar2x)(const uint8_t
> >> >          dst[dstStride]= (  src[0] + 3*src[srcStride])>>2;
> >> >  
> >> >          for (x=mmxSize-1; x<srcWidth-1; x++) {
> >> > -            dst[2*x          +1]= (3*src[x+0] +   src[x+srcStride+1])>>2;
> >> > -            dst[2*x+dstStride+2]= (  src[x+0] + 3*src[x+srcStride+1])>>2;
> >> > -            dst[2*x+dstStride+1]= (  src[x+1] + 3*src[x+srcStride  ])>>2;
> >> > -            dst[2*x          +2]= (3*src[x+1] +   src[x+srcStride  ])>>2;
> >> > +            dst[2*x          +1]= ((3*src[x+0] +   src[x+srcStride+1])+3)>>2;
> >> > +            dst[2*x+dstStride+2]= ((  src[x+0] + 3*src[x+srcStride+1])+3)>>2;
> >> > +            dst[2*x+dstStride+1]= ((  src[x+1] + 3*src[x+srcStride  ])+3)>>2;
> >> > +            dst[2*x          +2]= ((3*src[x+1] +   src[x+srcStride  ])+3)>>2;
> >> 
> >> WTF +3?  Does mmx round like that?  Most other CPUs with a rounding
> >> average instruction do +4.
> 
> I obviously meant +2.
> 
> More generally, they do a rounding shift as (x + (1 << s-1)) >> s.

that wont help us to implement this
we need 3*a + b + whatever >>2 and we need it for 8bit unsigned values
once we have to convert to 16bit for rounding shift instructions we already
have lost 50% speed

with mmx we use 2 a+b+1>>1 instructions


> 
> >> IMO the C version should be easily implemented exactly on the
> >> majority of systems, not bow to the quirks of intel.
> >
> > The most widespread cpu architecture our users use cannot be ignored even if
> > there was a problem but iam not seeing one atm, other SIMD systems are quite
> > likely to round like mmx
> 
> I'm telling you they do not.

then they dont support even mpeg MC which needs a+b+1>>1
and i dont belive that


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Complexity theory is the science of finding the exact solution to an
approximation. Benchmarking OTOH is finding an approximation of the exact
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100912/5ec4a420/attachment.pgp>