[Ffmpeg-devel] [PATCH] Snow mmx+sse2 asm optimizations
Robert Edele
yartrebo
Sat Feb 25 14:22:07 CET 2006
On Mon, 2006-02-20 at 12:49 -0500, Robert Edele wrote:
> On Mon, 2006-02-06 at 14:12 +0100, Michael Niedermayer wrote:
> > Hi
> >
> > On Sun, Feb 05, 2006 at 12:47:14PM -0500, Robert Edele wrote:
> > > I've written assembly SIMD optimizations (MMX, SSE2) for three parts of
> > > snow. These changes include:
> > >
> > > - MMX and SSE2 code for the bottom part of add_yblock_buffered.
> > > - Left shifting the OBMC tables by 2, and updating parts of the code
> > > to work with the change. This makes for somewhat faster code by
> > > eliminating some shift operations in the innermost loop of
> > > add_yblock_buffered.
> > >
> > > - vertical compose has a straightforward SIMD implementation.
> > >
> > > - horizontal compose has been substantially modified internally to allow
> > > for an efficient SIMD implementation and to improve cache performance. For
> > > plain C code, it may be faster or slower on your system (faster on
> > > mine). The largest change is that it is almost entirely in-place and the
> > > temp buffer is only half used now, allowing for SIMD optimization and
> > > improving cache performance. An added step, interleave_line(), has been
> > > added because the in-place lifts do not leave the coefficients in the
> > > proper places. This code is extremely fast in SIMD.
> > >
> > > I am aware that conditional compilation for SIMD code is frowned upon,
> > > so could someone give me some feedback on how my code could be
> > > efficiently done using function pointers like the other SIMD
> > > optimizations in ffmpeg? Some functions (interleave_line, 8x8 obmc) take
> > > barely 500 clocks to finish.
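For readers unfamiliar with the interleave step described above, here is a minimal C sketch. It is illustrative only: the real routine in the patch operates on the patch's DWTELEM buffers (and largely in place), while this version assumes separate, non-overlapping low/high/output arrays.

```c
#include <assert.h>

typedef int DWTELEM;

/* Illustrative sketch: after in-place lifting, low-pass coefficients sit
 * in one buffer and high-pass in another; this pass re-interleaves them
 * as b[0]=L0, b[1]=H0, b[2]=L1, ...  Assumes b does not alias low/high. */
static void interleave_line(const DWTELEM *low, const DWTELEM *high,
                            DWTELEM *b, int width)
{
    int i;
    for (i = 0; i + 1 < width; i += 2) {
        b[i]     = low[i >> 1];   /* even slots: low-pass */
        b[i + 1] = high[i >> 1];  /* odd slots: high-pass */
    }
    if (width & 1)                /* odd width: one trailing low coefficient */
        b[width - 1] = low[(width - 1) >> 1];
}
```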
> >
> > 1. how much speed do we lose if you convert them to naive function pointers?
> > also keep in mind that gcc has serious difficulty optimizing large functions
>
> We lose a few hundred dezicycles (a few dozen clock cycles) per function
> pointer. This hits add_yblock the hardest (about 5%). Losses for
> horizontal and vertical compose are only about 1%. The penalty is probably
> largest on my CPU (a Pentium 4) and should be smaller on processors with
> cheaper indirect calls, such as earlier Pentiums and AMD chips.
>
> > 2. if you want to decrease the overhead:
> > then change:
> > for(){
> > func_ptr()
> > }
> > to
> > func_mmx(){
> > for(){
> > mmx()
> > }
> > }
> > func_c(){
> > for(){
> > c()
> > }
> > }
>
> done
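Concretely, the restructuring suggested above moves the loop inside each variant so the function pointer is resolved once per call rather than once per iteration. A sketch under assumed names (the kernel here is a stand-in, not snow's actual inner loop):

```c
#include <assert.h>

/* Stand-in per-row kernel; in snow this would be the real C or MMX code. */
static void row_op_c(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}

/* The whole loop lives inside the C variant; an MMX variant would carry
 * its own copy of this loop.  The caller picks func_c vs func_mmx once,
 * so no indirect call happens per iteration. */
static void process_rows_c(int *dst, const int *src, int rows, int n)
{
    for (int r = 0; r < rows; r++)
        row_op_c(dst + r * n, src + r * n, n);
}
```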
>
> >
> > yeah you duplicate a few lines of code, but it's MUCH cleaner
> > and if there is lots of other stuff in the loop which needs to be duplicated
> > then that should be split into its own inline function ...
>
> As far as duplicated code goes:
>
> There might be room for such improvement in add_yblock, but it will
> require undoing some optimizations in the asm, costing about 5% extra
> clock cycles in inner_add_yblock (about 1-2% of overall snow speed).
>
> The other two functions have no room for such code condensation, or such
> condensation would cost far too much (>25% extra clocks).
>
> >
> >
> > [...]
> > > @@ -1409,6 +1484,121 @@
> > > spatial_compose53i_dy(&cs, buffer, width, height, stride);
> > > }
> > >
> > > +static void interleave_line(DWTELEM * low, DWTELEM * high, DWTELEM *b, int width){
> > > + int i = width - 2;
> > > +
> > > + if (width & 1)
> > > + {
> > > + b[i+1] = low[(i+1)/2];
> >
> > dividing signed integers by 2^x is slow due to the special case of negative
> > numbers, so use a >>1 here or use unsigned
>
> fixed
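The reason for the reviewer's remark: for non-negative values `/2` and `>>1` agree, but on signed ints they differ for negatives, so the compiler must emit a sign fixup before shifting when you write `/2`. A scalar illustration:

```c
#include <assert.h>

/* Signed division truncates toward zero; the arithmetic right shift
 * rounds toward minus infinity.  (Right-shifting a negative value is
 * strictly implementation-defined in C, but is an arithmetic shift on
 * the compilers and targets relevant here.)  Because the results differ
 * on negatives, `i / 2` compiles to shift-plus-fixup, not a bare shift. */
static int half_div(int i)   { return i / 2;  }
static int half_shift(int i) { return i >> 1; }
```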
>
> >
> >
> > [...]
> > > +static void horizontal_compose97i_unified(DWTELEM *b, int width){
> > > + const int w2= (width+1)>>1;
> > > +#ifdef HAVE_SSE2
> > > +// SSE2 code runs faster with pointers aligned on a 16-byte boundary.
> > > + DWTELEM temp_buf[width + 4];
> > > + DWTELEM * const temp = temp_buf + 4 - (((int)temp_buf & 0xF) / 4);
> > > +#else
> > > + DWTELEM temp[width];
> > > +#endif
> > > + const int w_l= (width>>1);
> > > + const int w_r= w2 - 1;
> > > + int i;
> > > +
> > > + {
> > > + DWTELEM * const ref = b + w2 - 1;
> > > + DWTELEM b_0 = b[0]; //By allowing the first entry in b[0] to be calculated twice
> > > + // (the first time erroneously), we allow the SSE2 code to run an extra pass.
> > > + // The savings in code and time are well worth having to store this value and
> > > + // calculate b[0] correctly afterwards.
> > > +
> > > + i = 0;
> > > +#ifdef HAVE_MMX
> > > +horizontal_compose97i_lift0_asm
> > > +#endif
> > > + for(; i<w_l; i++){
> > > + b[i] = b[i] - ((W_DM * (ref[i] + ref[i + 1]) + W_DO) >> W_DS);
> > > + }
> > > +
> > > + if(width&1){
> > > + b[w_l] = b[w_l] - ((W_DM * 2 * ref[w_l] + W_DO) >> W_DS);
> > > + }
> > > + b[0] = b_0 - ((W_DM * 2 * ref[1]+W_DO)>>W_DS);
> > > + }
> > > +
> > > + {
> > > + DWTELEM * const dst = b+w2;
> > > + DWTELEM * const src = dst;
> > > +
> > > + i = 0;
> > > + for(; (((long)&dst[i]) & 0xF) && i<w_r; i++){
> > > + dst[i] = src[i] - (b[i] + b[i + 1]);
> > > + }
> > > +#ifdef HAVE_SSE2
> > > + horizontal_compose97i_lift1_asm
> > > +#endif
> > > + for(; i<w_r; i++){
> > > + dst[i] = src[i] - (b[i] + b[i + 1]);
> > > + }
> > > +
> > > + if(!(width&1)){
> > > + dst[w_r] = src[w_r] - (2 * b[w_r]);
> > > + }
> > > + }
> >
> > umm, indention ...
>
> Cleaned up, but not sure if it's perfect yet.
>
> > and this is quite repetitive code, so should be in its own function/macro
> > similar to lift()
>
> I've thought about this, and each piece is different enough that little
> to no space is saved by doing as you say. The original code you wrote
> (using more versatile macros) is substantially slower (about 10%)
> compared to the same code, but with the macros expanded and obvious
> simplifications done. The asm takes even more advantage of the value of
> the constants, allowing for more optimization.
>
> On an aside, if the lift constants change, the asm will break, as some
> value-specific optimizations are done (and lacking a 32-bit SIMD integer
> multiplier, I have little choice).
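The value-specific trick alluded to above is synthesizing a constant multiply from shifts and adds, since no packed 32-bit multiply instruction is available in MMX/SSE2-era SIMD. A scalar equivalent, using 3 as a purely illustrative constant (not necessarily one of snow's actual lift coefficients):

```c
#include <assert.h>

/* Multiply by the constant 3 without a multiplier: 3*x = x + 2*x.
 * In SIMD this becomes a vector shift plus a vector add.  As the text
 * warns, such code breaks silently if the lift constant ever changes;
 * here x is assumed non-negative so the left shift is well defined. */
static int mul3(int x)
{
    return x + (x << 1);
}
```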
>
> > [...]
> >
> > --
> > Michael
> >
Could someone please have a look at my edited patch? It's not ready to
commit, but it shouldn't be too hard to transfer the asm code to dsputil
now.
Sincerely,
Robert Edele
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snow_asm.patch
Type: text/x-patch
Size: 89481 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20060225/c72674f3/attachment.bin>