[Ffmpeg-devel] [PATCH] Snow mmx+sse2 asm optimizations

Robert Edele yartrebo
Sat Feb 25 14:22:07 CET 2006


On Mon, 2006-02-20 at 12:49 -0500, Robert Edele wrote:
> On Mon, 2006-02-06 at 14:12 +0100, Michael Niedermayer wrote:
> > Hi
> > 
> > On Sun, Feb 05, 2006 at 12:47:14PM -0500, Robert Edele wrote:
> > > I've written assembly SIMD optimizations (MMX, SSE2) for three parts of
> > > snow. These changes include:
> > > 
> > >  - MMX and SSE2 code for the bottom part of add_yblock_buffered.
> > >    - Left shifting the OBMC tables by 2, and updating parts of the code
> > > to work with the change. This makes for somewhat faster code by
> > > eliminating some shift operations in the innermost loop of
> > > add_yblock_buffered.
> > > 
> > >  - vertical compose has a straightforward SIMD implementation.
> > > 
> > >  - horizontal compose has been substantially modified internally to
> > > allow for an efficient SIMD implementation and to improve cache
> > > performance. For plain C code, it may be faster or slower on your
> > > system (faster on mine). The largest change is that it now works
> > > almost entirely in place and the temp buffer is only half used,
> > > which enables the SIMD optimization and improves cache behavior. A
> > > new step, interleave_line(), has been added because the in-place
> > > lifts do not leave the coefficients in their proper places. This
> > > code is extremely fast in SIMD.
> > > 
> > > I am aware that conditional compilation for SIMD code is frowned upon,
> > > so could someone give me some feedback on how my code could be
> > > efficiently done using function pointers like the other SIMD
> > > optimizations in ffmpeg? Some functions (interleave_line, 8x8 obmc) take
> > > barely 500 clocks to finish.
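
[Editor's note: a minimal C sketch of the interleave_line() step described
above, showing just the data movement. The real patch version works in
place and backwards; this forward version assumes the same layout (low-pass
half first, high-pass half second) and the same odd-width tail handling.]

```c
#include <assert.h>

typedef int DWTELEM;

/* After the in-place lifts, the low-pass coefficients occupy the first
 * half of the line and the high-pass coefficients the second half; this
 * merges them back into the interleaved low/high/low/high... layout. */
static void interleave_line(const DWTELEM *low, const DWTELEM *high,
                            DWTELEM *b, int width)
{
    int i;
    for (i = 0; i < width / 2; i++) {
        b[2 * i]     = low[i];
        b[2 * i + 1] = high[i];
    }
    if (width & 1)                /* odd width: one extra low-pass sample */
        b[width - 1] = low[width / 2];
}
```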
> > 
> > 1. how much speed do we lose if you convert them to naive function pointers?
> > also keep in mind that gcc has serious difficulty optimizing large functions
> 
> We lose a few hundred dezicycles (a few dozen clock cycles) per function
> pointer call. This hits add_yblock the hardest (about 5%). Losses for
> horizontal and vertical compose are only about 1%. The penalty is probably
> largest on my processor (a Pentium 4) and should be smaller on processors
> with cheaper indirect calls, such as earlier Pentiums and AMDs.
> 
> > 2. if you want to decrease the overhead:
> > then change:
> > for(){
> >  func_ptr()
> > }
> > to
> > func_mmx(){
> >  for(){
> >   mmx()
> >  }
> > }
> > func_c(){
> >  for(){
> >   c()
> >  }
> > }
> 
> done
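
[Editor's note: a minimal sketch of the pattern suggested above, with
hypothetical names rather than the real snow/dsputil hooks. The point is
that the variant is chosen once and the whole loop lives inside it, so
the indirect call happens per line rather than per element.]

```c
#include <assert.h>

typedef int DWTELEM;

typedef void (*lift_fn)(DWTELEM *b, int width);

static void lift_c(DWTELEM *b, int width)
{
    int i;
    for (i = 0; i < width; i++)   /* loop lives inside the C variant */
        b[i] -= b[i] >> 1;        /* stand-in for the real lift math */
}

/* An MMX build would supply a lift_mmx() with the same loop wrapped
 * around the SIMD body; the selection happens once, e.g. at init: */
static lift_fn select_lift(int have_mmx)
{
    (void)have_mmx;               /* no real MMX variant in this sketch */
    return lift_c;
}
```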
> 
> > 
> > yeah you duplicate a few lines of code, but it's MUCH cleaner
> > and if there is a lot of other stuff in the loop that needs to be duplicated,
> > then that should be split into its own inline function ...
> 
> As far as duplicated code goes:
> 
> There might be room for such improvement in add_yblock, but it will
> require undoing some optimizations in the asm, costing about 5% extra
> clock cycles in inner_add_yblock (about 1-2% of overall snow speed).
> 
> The other two functions have no room for such code condensation, or such
> condensation would cost far too much (>25% extra clocks).
> 
> > 
> > 
> > [...]
> > > @@ -1409,6 +1484,121 @@
> > >          spatial_compose53i_dy(&cs, buffer, width, height, stride);
> > >  }
> > >  
> > > +static void interleave_line(DWTELEM * low, DWTELEM * high, DWTELEM *b, int width){
> > > +    int i = width - 2;
> > > +    
> > > +    if (width & 1)
> > > +    {
> > > +        b[i+1] = low[(i+1)/2];
> > 
> > dividing signed integers by 2^x is slow due to the special case of negative
> > numbers, so use a >>1 here or use unsigned
> 
> fixed
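
[Editor's note: a two-line illustration of the point above. Signed
division by 2 rounds toward zero, so the compiler must emit a fixup for
negative operands, while an arithmetic shift floors; for non-negative
values like the index (i+1) the two agree, so the shift (or an unsigned
type) is both correct and cheaper here.]

```c
#include <assert.h>

/* /2 truncates toward zero; >>1 floors (arithmetic shift).
 * They differ only for negative odd values. */
static int div2(int x) { return x / 2; }
static int shr1(int x) { return x >> 1; }
```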
> 
> > 
> > 
> > [...]
> > > +static void horizontal_compose97i_unified(DWTELEM *b, int width){
> > > +    const int w2= (width+1)>>1;
> > > +#ifdef HAVE_SSE2
> > > +// SSE2 code runs faster with pointers aligned on a 16-byte boundary.
> > > +    DWTELEM temp_buf[width + 4];
> > > +    DWTELEM * const temp = temp_buf + 4 - (((long)temp_buf & 0xF) / 4);
> > > +#else
> > > +    DWTELEM temp[width];
> > > +#endif
> > > +    const int w_l= (width>>1);
> > > +    const int w_r= w2 - 1;
> > > +    int i;
> > > +        
> > > +    {
> > > +    DWTELEM * const ref = b + w2 - 1;
> > > +    DWTELEM b_0 = b[0]; //By allowing the first entry in b[0] to be calculated twice
> > > +    // (the first time erroneously), we allow the SSE2 code to run an extra pass.
> > > +    // The savings in code and time are well worth having to store this value and
> > > +    // calculate b[0] correctly afterwards.
> > > +
> > > +    i = 0;
> > > +#ifdef HAVE_MMX
> > > +horizontal_compose97i_lift0_asm
> > > +#endif
> > > +    for(; i<w_l; i++){
> > > +      b[i] = b[i] - ((W_DM * (ref[i] + ref[i + 1]) + W_DO) >> W_DS);
> > > +    }
> > > +
> > > +    if(width&1){
> > > +        b[w_l] = b[w_l] - ((W_DM * 2 * ref[w_l] + W_DO) >> W_DS);
> > > +    }
> > > +    b[0] = b_0 - ((W_DM * 2 * ref[1]+W_DO)>>W_DS);
> > > +    }
> > > +    
> > > +    {
> > > +    DWTELEM * const dst = b+w2;
> > > +    DWTELEM * const src = dst;
> > > +    
> > > +    i = 0;
> > > +    for(; (((long)&dst[i]) & 0xF) && i<w_r; i++){
> > > +        dst[i] = src[i] - (b[i] + b[i + 1]);
> > > +    }
> > > +#ifdef HAVE_SSE2    
> > > +    horizontal_compose97i_lift1_asm
> > > +#endif
> > > +    for(; i<w_r; i++){
> > > +        dst[i] = src[i] - (b[i] + b[i + 1]);
> > > +    }
> > > +
> > > +    if(!(width&1)){
> > > +        dst[w_r] = src[w_r] - (2 * b[w_r]);
> > > +    }
> > > +    }
> > 
> > umm, indention ...
> 
> Cleaned up, but not sure if it's perfect yet.
> 
> > and this is quite repetitive code, so it should be in its own function/macro
> > similar to lift()
> 
> I've thought about this, and each piece is different enough that little
> to no space is saved by doing as you say. The original code you wrote
> (using more versatile macros) is substantially slower (about 10%) than
> the same code with the macros expanded and the obvious simplifications
> done. The asm takes even more advantage of the values of the constants,
> allowing for more optimization.
> 
> On an aside, if the lift constants change, the asm will break, as some
> value-specific optimizations are done (and lacking a 32-bit SIMD integer
> multiplier, I have little choice).
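
[Editor's note: a sketch of the kind of value-specific optimization meant
above. SSE2 has no full packed 32-bit multiply (pmulld only arrived with
SSE4.1), so a lift multiplier must be hard-wired as shifts and adds. The
multiplier 3 here is illustrative, not necessarily the actual lift
constant; on the vector side each element would cost one pslld plus one
paddd.]

```c
#include <assert.h>

/* Strength-reduced multiply by the (hypothetical) constant 3:
 * one shift and one add instead of a multiply instruction. */
static int mul3(int x)
{
    return (x << 1) + x;   /* 2*x + x == 3*x */
}
```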
> 
> > [...]
> > 
> > -- 
> > Michael
> > 
Could someone please have a look at my edited patch? It's not ready to
commit, but it shouldn't be too hard to transfer the asm code to dsputil
now.

Sincerely,
Robert Edele
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snow_asm.patch
Type: text/x-patch
Size: 89481 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20060225/c72674f3/attachment.bin>