[Ffmpeg-devel] [PATCH] Snow mmx+sse2 asm optimizations
Robert Edele (yartrebo)
Mon Feb 20 18:49:28 CET 2006
On Mon, 2006-02-06 at 14:12 +0100, Michael Niedermayer wrote:
> Hi
>
> On Sun, Feb 05, 2006 at 12:47:14PM -0500, Robert Edele wrote:
> > I've written assembly SIMD optimizations (MMX, SSE2) for three parts of
> > snow. These changes include:
> >
> > - MMX and SSE2 code for the bottom part of add_yblock_buffered.
> > - Left shifting the OBMC tables by 2, and updating parts of the code
> > to work with the change. This makes for somewhat faster code by
> > eliminating some shift operations in the innermost loop of
> > add_yblock_buffered.
> >
> > - vertical compose has a straightforward SIMD implementation.
> >
> > - horizontal compose has been substantially modified internally to allow
> > for an efficient SIMD implementation and to improve cache performance. For
> > plain C code, it may be faster or slower on your system (faster on
> > mine). The largest change is that it is now almost entirely in-place and the
> > temp buffer is only half used, allowing for SIMD optimization and
> > improving cache performance. A new step, interleave_line(), has been
> > added because the in-place lifts do not leave the coefficients in the
> > proper places. This code is extremely fast in SIMD.
> >
> > I am aware that conditional compilation for SIMD code is frowned upon,
> > so could someone give me some feedback on how my code could be
> > efficiently done using function pointers like the other SIMD
> > optimizations in ffmpeg? Some functions (interleave_line, 8x8 obmc) take
> > barely 500 clocks to finish.
>
> 1. how much speed do we lose if you convert them to naive function pointers
> also keep in mind that gcc has serious difficulty optimizing large functions
We lose a few hundred dezicycles (a few dozen clock cycles) per function
pointer call. This hits add_yblock the hardest (about 5%). Losses for
horizontal and vertical compose are only about 1%. The overhead is probably
largest on my CPU (a Pentium 4) and should be smaller on better
processors, such as earlier Pentiums and AMD chips.
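
To make the numbers concrete, this is the kind of dispatch being
discussed; a minimal sketch, with made-up names rather than the actual
snow.c/dsputil interface:

/* one pointer per kernel, filled in once at init time based on CPU
 * capabilities; after that every call goes through an indirect branch,
 * which is where the few dozen cycles per call come from on a P4 */
typedef struct SnowDSPContext {
    void (*vertical_compose)(int *b0, int *b1, int *b2, int width);
} SnowDSPContext;

static void vertical_compose_c  (int *b0, int *b1, int *b2, int width){ /* plain C loop   */ }
static void vertical_compose_mmx(int *b0, int *b1, int *b2, int width){ /* MMX inner loop */ }

static void snowdsp_init(SnowDSPContext *dsp, int has_mmx){
    dsp->vertical_compose = has_mmx ? vertical_compose_mmx
                                    : vertical_compose_c;
}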
> 2. if you want to decrease the overhead:
> then change:
> for(){
> func_ptr()
> }
> to
> func_mmx(){
> for(){
> mmx()
> }
> }
> func_c(){
> for(){
> c()
> }
> }
done
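
Concretely (same made-up names as the sketch above), the pointer now
selects a whole loop rather than being called from inside one, so the
indirect branch is paid once per call site instead of once per line:

static void vertical_compose_lines_c(int *buf, int lines, int width){
    int y;
    for (y = 0; y < lines; y++){
        /* plain C kernel for one line */
    }
}

static void vertical_compose_lines_mmx(int *buf, int lines, int width){
    int y;
    for (y = 0; y < lines; y++){
        /* MMX kernel for one line */
    }
}

/* still a function pointer, but only hit once per plane/slice */
void (*vertical_compose_lines)(int *buf, int lines, int width);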
>
> yeah you duplicate a few lines of code, but it's MUCH cleaner
> and if there is lots of other stuff in the loop which needs to be duplicated
> then that should be split into its own inline function ...
As far as duplicated code goes:
There might be room for such improvement in add_yblock, but it will
require undoing some optimizations in the asm, costing about 5% extra
clock cycles in inner_add_yblock (about 1-2% of overall snow speed).
The other two functions have no room for such code condensation, or such
condensation would cost far too much (>25% extra clocks).
>
>
> [...]
> > @@ -1409,6 +1484,121 @@
> > spatial_compose53i_dy(&cs, buffer, width, height, stride);
> > }
> >
> > +static void interleave_line(DWTELEM * low, DWTELEM * high, DWTELEM *b, int width){
> > + int i = width - 2;
> > +
> > + if (width & 1)
> > + {
> > + b[i+1] = low[(i+1)/2];
>
> dividing signed integers by 2^x is slow due to the special case of negative
> numbers, so use a >>1 here or use unsigned
fixed
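
For reference, the fix amounts to using a plain shift (or an unsigned
index): the compiler cannot prove that i+1 is non-negative, so a signed
/2 has to be compiled as a shift plus a sign fix-up, while >>1 is a
single shift. A sketch of the routine with that change (only the
width&1 branch is taken from the patch, the rest is reconstructed, and
plain int stands in for DWTELEM):

static void interleave_line(int *low, int *high, int *b, int width){
    int i = width - 2;

    if (width & 1){
        b[i + 1] = low[(i + 1) >> 1]; /* >>1 instead of /2 */
        i--;
    }
    /* zip the two half-length subbands back into natural order:
     * even positions get lowpass, odd positions get highpass */
    for (; i >= 0; i -= 2){
        b[i + 1] = high[i >> 1];
        b[i]     = low [i >> 1];
    }
}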
>
>
> [...]
> > +static void horizontal_compose97i_unified(DWTELEM *b, int width){
> > + const int w2= (width+1)>>1;
> > +#ifdef HAVE_SSE2
> > +// SSE2 code runs faster with pointers aligned on a 16-byte boundary.
> > + DWTELEM temp_buf[width + 4];
> > + DWTELEM * const temp = temp_buf + 4 - (((int)temp_buf & 0xF) / 4);
> > +#else
> > + DWTELEM temp[width];
> > +#endif
> > + const int w_l= (width>>1);
> > + const int w_r= w2 - 1;
> > + int i;
> > +
> > + {
> > + DWTELEM * const ref = b + w2 - 1;
> > + DWTELEM b_0 = b[0]; //By allowing the first entry in b[0] to be calculated twice
> > + // (the first time erroneously), we allow the SSE2 code to run an extra pass.
> > + // The savings in code and time are well worth having to store this value and
> > + // calculate b[0] correctly afterwards.
> > +
> > + i = 0;
> > +#ifdef HAVE_MMX
> > +horizontal_compose97i_lift0_asm
> > +#endif
> > + for(; i<w_l; i++){
> > + b[i] = b[i] - ((W_DM * (ref[i] + ref[i + 1]) + W_DO) >> W_DS);
> > + }
> > +
> > + if(width&1){
> > + b[w_l] = b[w_l] - ((W_DM * 2 * ref[w_l] + W_DO) >> W_DS);
> > + }
> > + b[0] = b_0 - ((W_DM * 2 * ref[1]+W_DO)>>W_DS);
> > + }
> > +
> > + {
> > + DWTELEM * const dst = b+w2;
> > + DWTELEM * const src = dst;
> > +
> > + i = 0;
> > + for(; (((long)&dst[i]) & 0xF) && i<w_r; i++){
> > + dst[i] = src[i] - (b[i] + b[i + 1]);
> > + }
> > +#ifdef HAVE_SSE2
> > + horizontal_compose97i_lift1_asm
> > +#endif
> > + for(; i<w_r; i++){
> > + dst[i] = src[i] - (b[i] + b[i + 1]);
> > + }
> > +
> > + if(!(width&1)){
> > + dst[w_r] = src[w_r] - (2 * b[w_r]);
> > + }
> > + }
>
> umm, indention ...
Cleaned up, but not sure if it's perfect yet.
> and this is quite repetitive code, so should be in its own function/macro
> similar to lift()
I've thought about this, and each piece is different enough that little
to no space would be saved by doing as you say. The original code you wrote
(using more versatile macros) is substantially slower (about 10%)
than the same code with the macros expanded and the obvious
simplifications done. The asm takes even more advantage of the values of
the constants, allowing for further optimization.
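
To illustrate the difference (schematic, not the actual lift() from
snow.c, and the constants below are placeholders rather than snow's
real W_DM/W_DO/W_DS values):

/* generic form: multiplier, rounding offset and shift are run-time
 * parameters, so gcc has to keep them in registers and emit an imul */
static void lift_generic(int *dst, const int *ref, int w,
                         int mul, int add, int shift){
    int i;
    for (i = 0; i < w; i++)
        dst[i] -= (mul * (ref[i] + ref[i + 1]) + add) >> shift;
}

/* specialized form: with the constants visible at compile time the
 * multiply can be strength-reduced to shift+add and the offset and
 * shift become immediates; that is roughly where the ~10% comes from */
static void lift_specialized(int *dst, const int *ref, int w){
    int i;
    for (i = 0; i < w; i++)
        dst[i] -= (3 * (ref[i] + ref[i + 1]) + 4) >> 3;
}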
As an aside, if the lift constants change, the asm will break, as some
value-specific optimizations are done (and, lacking a 32-bit SIMD integer
multiply, I have little choice).
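
To show what "value-specific" means here: SSE2 has no packed
32x32->32 multiply (pmuludq only gives 64-bit products of the even
lanes), so a constant multiplier like 3 has to be built from a shift
and an add, which only works for that particular value. A sketch with
intrinsics rather than the patch's inline asm, again with placeholder
constants:

#include <emmintrin.h>

/* (3*x + 4) >> 3 for four 32-bit coefficients at once; a different
 * lift constant needs a different instruction sequence, which is why
 * changing the constants would break the hand-written asm */
static inline __m128i lift_step_by3(__m128i x){
    __m128i x3 = _mm_add_epi32(x, _mm_slli_epi32(x, 1)); /* x + 2*x */
    __m128i r  = _mm_add_epi32(x3, _mm_set1_epi32(4));   /* + round */
    return _mm_srai_epi32(r, 3);                         /* >> 3    */
}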
> [...]
>
> --
> Michael
>
Attachment: snow_asm.patch (text/x-patch, 89481 bytes)
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20060220/203904ff/attachment.bin>