[FFmpeg-devel] [RFC] snow SSE2 optimizations (was: Re: [FFmpeg-cvslog] r10223 - in trunk/libavcodec/i386: dsputil_mmx.c snowdsp_mmx.c)

Thu Aug 30 13:10:59 CEST 2007

Hello,
On Tue, Aug 28, 2007 at 05:32:04AM +0200, Michael Niedermayer wrote:
> On Tue, Aug 28, 2007 at 12:07:02AM +0200, Reimar Döffinger wrote:
> > On Mon, Aug 27, 2007 at 11:34:44PM +0200, Michael Niedermayer wrote:
> > > > > also there's a shift by 4 missing here
> > > > 
> > > > I don't think so, there is a "psraw $4, %%xmm0               \n\t"
> > > > further down. And I know the code is an unreadable mess. I'll try to
> > > > reimplement it sometime if no one else does...
> > > 
> > > the data after OBMC is 16-bit unsigned, the data after the IDWT is 13-bit
> > > signed; the white point differs by a factor of 16, so a shift by 4 is
> > > needed to get them on the same level before adding ...
> > 
> > Right, right, I just missed a few lines of code while reading the C
> > version, thus the confusion.
> > Since the diff is unreadable, do you think the following is better than
> > the current code (I mean visually, it does decode correctly after all ;-),
> > though it is not measurably faster than the mmx code on my PC):
> 
> SSE2 is rarely faster than MMX because most CPUs need 2x as long to
> execute an SSE2 instruction as an MMX one ...
> 
> and yes the code is MUCH more readable than before

Can you tell which option to set (preferably for mencoder) to get a
block width of 16?
Currently the inner_add_yblock_bw_16_obmc_32_sse2 never gets used for me
so I can't test it...
And I see why SSE2 hardly makes a difference anyway: it is only used for
block widths of 8 and 16, but in my sample almost all blocks are 4 wide...
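
For the archives, here is how I understand the shift by 4 from the quoted
discussion, as a scalar C sketch. This is not the actual snow code: the
function name, SCALE_SHIFT, and the final round-and-clamp step to 8 bits are
my own assumptions for illustration. Only the "factor of 16, so shift by 4
before adding" part comes from the explanation above.

```c
#include <stdint.h>

/* Hypothetical name: log2 of the factor of 16 between the two scales. */
#define SCALE_SHIFT 4

/* Combine one OBMC prediction sample (16-bit unsigned scale) with one
 * IDWT sample (13-bit signed scale) and return an 8-bit pixel.
 * The rounding and clamping at the end are assumptions of this sketch. */
static uint8_t combine_sample(uint16_t obmc, int16_t idwt)
{
    /* Bring the prediction down to the IDWT scale (divide by 16). */
    int v = obmc >> SCALE_SHIFT;

    /* Both values now share the same white point, so they can be added. */
    v += idwt;

    /* Clamp negatives first so the shift below never sees a negative value. */
    if (v < 0)
        v = 0;

    /* Assumed final step: drop the 4 fractional bits with rounding. */
    v = (v + (1 << (SCALE_SHIFT - 1))) >> SCALE_SHIFT;

    if (v > 255)
        v = 255;
    return (uint8_t)v;
}
```

In the SIMD version the same scaling would be done on whole registers at
once, which is presumably what the "psraw $4" mentioned above is doing.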

Greetings,
Reimar Döffinger