[MPlayer-dev-eng] Re: [PATCH] Re: spp deblocking GREAT optimization !!!

Nikolaj nialpof at pisem.net
Thu Sep 23 22:02:41 CEST 2004


On Thu, 23 Sep 2004 02:23:06 +0200, Diego Biurrun <diego at biurrun.de> wrote:

> Nikolaj writes:
>> On Sun, 12 Sep 2004 10:03:36 +0900, Attila Kinali <attila at kinali.ch> 
>> wrote:
>>
>> >> Can somebody compare PSNR of this code and original flat threshold ?
>> >> Or even determine optimum matrix (which will be multiplied by
>> >> quantizer) PSNR-wise? This will be very interesting. Probably genetic
>> >> algorithms can help :). I practically can't do this myself.
>> >
>> > Can you make a patch out of it?
>> > Copy-and-pasting code into the right position isn't something I like to do.
>> Yes. First, please review the included patch. I'm not really sure it's OK,
>> but without it, the low part of the %%mm2 register is never read at the
>> corresponding places (the threshold functions).
>
> It's already in CVS, Michael fixed this a few days ago.

:) OK. It seems Gmane doesn't carry the CVS-log list, and an ordinary
subscription somehow tends to switch to inactive status. (Probably a bad
mailbox :( )
(I had sent this in "verbal" form to Michael long before.)

Currently I'm wondering: is it possible to make this even faster?
(Not by much, but even 1.2x would be very good.)
In detail, I'm trying to find common parts in DCT[x[0]..x[7]] and
DCT[x[2]..x[9]]. It seems likely that something shared can be found.
For example, one can zero x[0], x[1], x[8], x[9], find some sort of common
part, and then add the differences due to the nonzero x[0], x[1], x[8], x[9].
Also, since the DCT here is an AAN DCT, it is really Re((i)FFT), and for the
FFT there are quite simple formulas for a shift in the time domain.
Then the IDCT step. It has the following graph form:

   - [      ] -
   - [ IDCT ] - + - [      ] -
   - [      ] - + - [ IDCT ] -
   - [      ] - + - [      ] -
                  - [      ] -
(Here the "-" are just dataflows; in fact, one "-" = 2 dataflows.)

And since the IDCT is a linear transform, maybe it is possible to construct
a transform that accepts 16 (2*8) values, gives 10 result values, and is
simpler than 2 IDCTs.

Below is the code to optimize, if possible. Could someone please take a look? :)
Or would converting an optimized version to ASM be too much of a nightmare,
so that all of the above is not worth bothering about?

(Below, T() is a threshold function, which is _non-linear_.
Also,
FIX_0_707106781 = 0.70710678118654752438 = cos(pi*4/16)
FIX_0_541196100 = 0.54119610014619698435 = cos(pi*6/16)*sqrt(2)
FIX_0_382683433 = 0.38268343236508977170 = cos(pi*6/16)
FIX_1_306562965 = 1.30656296487637652774 = cos(pi*2/16)*sqrt(2)
FIX_1_414213562 = sqrt(2)
FIX_1_847759065 = 2*cos(pi/8)
FIX_1_082392200 = 2*sqrt(2)*sin(pi/8)
FIX_2_613125930 = 2*(cos(pi/8)+sin(pi/8))
)


//----------------
       // FDCT

       tmp0 = dataptr[0] + dataptr[7];
       tmp7 = dataptr[0] - dataptr[7];

       tmp1 = dataptr[1] + dataptr[6];
       tmp6 = dataptr[1] - dataptr[6];

       tmp2 = dataptr[2] + dataptr[5];
       tmp5 = dataptr[2] - dataptr[5];

       tmp3 = dataptr[3] + dataptr[4];
       tmp4 = dataptr[3] - dataptr[4];

       // Even part

       tmp10 = tmp0 + tmp3;
       tmp13 = tmp0 - tmp3;
       tmp11 = tmp1 + tmp2;
       tmp12 = tmp1 - tmp2;

       output1[0] = tmp10 + tmp11;
       output1[4] = tmp10 - tmp11;

       z1 = (tmp12 + tmp13) * FIX_0_707106781;
       output1[2] = tmp13 + z1;
       output1[6] = tmp13 - z1;

       // Odd part

       tmp10 = (tmp4 + tmp5);
       tmp11 = (tmp5 + tmp6);
       tmp12 = (tmp6 + tmp7);


       z5 = (tmp10 - tmp12) * FIX_0_382683433;
       z2 = tmp10 * FIX_0_541196100 + z5;
       z4 = tmp12 * FIX_1_306562965 + z5;
       z3 = tmp11 * FIX_0_707106781;

       z11 = tmp7 + z3;
       z13 = tmp7 - z3;

       output1[5] = z13 + z2;
       output1[3] = z13 - z2;
       output1[1] = z11 + z4;
       output1[7] = z11 - z4;
//-------- second window: the same FDCT on dataptr[2]..dataptr[9]
       tmp0 = dataptr[0+2] + dataptr[7+2];
       tmp7 = dataptr[0+2] - dataptr[7+2];

       tmp1 = dataptr[1+2] + dataptr[6+2];
       tmp6 = dataptr[1+2] - dataptr[6+2];

       tmp2 = dataptr[2+2] + dataptr[5+2];
       tmp5 = dataptr[2+2] - dataptr[5+2];

       tmp3 = dataptr[3+2] + dataptr[4+2];
       tmp4 = dataptr[3+2] - dataptr[4+2];

       // Even part

       tmp10 = tmp0 + tmp3;
       tmp13 = tmp0 - tmp3;
       tmp11 = tmp1 + tmp2;
       tmp12 = tmp1 - tmp2;

       output2[0] = tmp10 + tmp11;
       output2[4] = tmp10 - tmp11;

       z1 = (tmp12 + tmp13) * FIX_0_707106781;
       output2[2] = tmp13 + z1;
       output2[6] = tmp13 - z1;

       // Odd part

       tmp10 = (tmp4 + tmp5);
       tmp11 = (tmp5 + tmp6);
       tmp12 = (tmp6 + tmp7);


       z5 = (tmp10 - tmp12) * FIX_0_382683433;
       z2 = tmp10 * FIX_0_541196100 + z5;
       z4 = tmp12 * FIX_1_306562965 + z5;
       z3 = tmp11 * FIX_0_707106781;

       z11 = tmp7 + z3;
       z13 = tmp7 - z3;

       output2[5] = z13 + z2;
       output2[3] = z13 - z2;
       output2[1] = z11 + z4;
       output2[7] = z11 - z4;

//-------
       output1[0] = T(output1[0]);
       output1[1] = T(output1[1]);
       output1[2] = T(output1[2]);
       output1[3] = T(output1[3]);
       output1[4] = T(output1[4]);
       output1[5] = T(output1[5]);
       output1[6] = T(output1[6]);
       output1[7] = T(output1[7]);

       output2[0] = T(output2[0]);
       output2[1] = T(output2[1]);
       output2[2] = T(output2[2]);
       output2[3] = T(output2[3]);
       output2[4] = T(output2[4]);
       output2[5] = T(output2[5]);
       output2[6] = T(output2[6]);
       output2[7] = T(output2[7]);
//-------

       // IDCT
       // Even part

       tmp0 = output1[0];
       tmp1 = output1[2];
       tmp2 = output1[4];
       tmp3 = output1[6];

       tmp10 = tmp0 + tmp2;
       tmp11 = tmp0 - tmp2;

       tmp13 = tmp1 + tmp3;
       tmp12 = ((tmp1 - tmp3) * FIX_1_414213562) - tmp13;

       tmp0 = tmp10 + tmp13;
       tmp3 = tmp10 - tmp13;
       tmp1 = tmp11 + tmp12;
       tmp2 = tmp11 - tmp12;

       // Odd part

       tmp4 = output1[1];
       tmp5 = output1[3];
       tmp6 = output1[5];
       tmp7 = output1[7];

       z13 = tmp6 + tmp5;
       z10 = tmp6 - tmp5;
       z11 = tmp4 + tmp7;
       z12 = tmp4 - tmp7;

       tmp7 = z11 + z13;
       tmp11 = (z11 - z13) * FIX_1_414213562;
       z5 =    (z10 + z12) * FIX_1_847759065;
       tmp10 = (z12 * FIX_1_082392200) - z5;
       tmp12 = (z10 * -FIX_2_613125930) + z5;

       tmp6 = tmp12 - tmp7;
       tmp5 = tmp11 - tmp6;
       tmp4 = tmp10 + tmp5;

       wsptr1[0]=  (tmp0 + tmp7);
       wsptr1[1]=  (tmp1 + tmp6);
       wsptr1[2]=  (tmp2 + tmp5);
       wsptr1[3]=  (tmp3 - tmp4);
       wsptr1[4]=  (tmp3 + tmp4);
       wsptr1[5]=  (tmp2 - tmp5);
       wsptr1[6]=  (tmp1 - tmp6);
       wsptr1[7]=  (tmp0 - tmp7);
//------- IDCT of the second window (output2 -> wsptr2)
       // Even part

       tmp0 = output2[0];
       tmp1 = output2[2];
       tmp2 = output2[4];
       tmp3 = output2[6];

       tmp10 = tmp0 + tmp2;
       tmp11 = tmp0 - tmp2;

       tmp13 = tmp1 + tmp3;
       tmp12 = ((tmp1 - tmp3) * FIX_1_414213562) - tmp13;

       tmp0 = tmp10 + tmp13;
       tmp3 = tmp10 - tmp13;
       tmp1 = tmp11 + tmp12;
       tmp2 = tmp11 - tmp12;

       // Odd part

       tmp4 = output2[1];
       tmp5 = output2[3];
       tmp6 = output2[5];
       tmp7 = output2[7];

       z13 = tmp6 + tmp5;
       z10 = tmp6 - tmp5;
       z11 = tmp4 + tmp7;
       z12 = tmp4 - tmp7;

       tmp7 = z11 + z13;
       tmp11 = (z11 - z13) * FIX_1_414213562;
       z5 =    (z10 + z12) * FIX_1_847759065;
       tmp10 = (z12 * FIX_1_082392200) - z5;
       tmp12 = (z10 * -FIX_2_613125930) + z5;

       tmp6 = tmp12 - tmp7;
       tmp5 = tmp11 - tmp6;
       tmp4 = tmp10 + tmp5;

       wsptr2[0]=  (tmp0 + tmp7);
       wsptr2[1]=  (tmp1 + tmp6);
       wsptr2[2]=  (tmp2 + tmp5);
       wsptr2[3]=  (tmp3 - tmp4);
       wsptr2[4]=  (tmp3 + tmp4);
       wsptr2[5]=  (tmp2 - tmp5);
       wsptr2[6]=  (tmp1 - tmp6);
       wsptr2[7]=  (tmp0 - tmp7);
//--- overlap-add: the two windows are offset by 2 samples
       result[0]= wsptr1[0]            ;
       result[1]= wsptr1[1]            ;
       result[2]= wsptr1[2] + wsptr2[0];
       result[3]= wsptr1[3] + wsptr2[1];
       result[4]= wsptr1[4] + wsptr2[2];
       result[5]= wsptr1[5] + wsptr2[3];
       result[6]= wsptr1[6] + wsptr2[4];
       result[7]= wsptr1[7] + wsptr2[5];
       result[8]=             wsptr2[6];
       result[9]=             wsptr2[7];

//-----------------
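
For anyone attempting the optimization, a compact baseline to compare against may help. The sketch below wraps the listing above into functions. It rests on several assumptions: float arithmetic stands in for whatever sample type the real code uses, the FIX_* constants are written as float literals, T() is taken to be a plain hard threshold with a single limit thr, and the names fdct8, idct8, hard_threshold and spp_row10 are all hypothetical.

```c
#include <math.h>

/* One 8-point AAN forward pass, transcribed from the listing above. */
static void fdct8(const float *d, float *o)
{
    float tmp0 = d[0] + d[7], tmp7 = d[0] - d[7];
    float tmp1 = d[1] + d[6], tmp6 = d[1] - d[6];
    float tmp2 = d[2] + d[5], tmp5 = d[2] - d[5];
    float tmp3 = d[3] + d[4], tmp4 = d[3] - d[4];
    /* Even part */
    float tmp10 = tmp0 + tmp3, tmp13 = tmp0 - tmp3;
    float tmp11 = tmp1 + tmp2, tmp12 = tmp1 - tmp2;
    float z1 = (tmp12 + tmp13) * 0.707106781f;
    o[0] = tmp10 + tmp11;  o[4] = tmp10 - tmp11;
    o[2] = tmp13 + z1;     o[6] = tmp13 - z1;
    /* Odd part */
    tmp10 = tmp4 + tmp5;
    tmp11 = tmp5 + tmp6;
    tmp12 = tmp6 + tmp7;
    float z5 = (tmp10 - tmp12) * 0.382683433f;
    float z2 = tmp10 * 0.541196100f + z5;
    float z4 = tmp12 * 1.306562965f + z5;
    float z3 = tmp11 * 0.707106781f;
    float z11 = tmp7 + z3, z13 = tmp7 - z3;
    o[5] = z13 + z2;  o[3] = z13 - z2;
    o[1] = z11 + z4;  o[7] = z11 - z4;
}

/* One 8-point AAN inverse pass, transcribed from the listing above. */
static void idct8(const float *c, float *w)
{
    /* Even part */
    float tmp10 = c[0] + c[4], tmp11 = c[0] - c[4];
    float tmp13 = c[2] + c[6];
    float tmp12 = (c[2] - c[6]) * 1.414213562f - tmp13;
    float tmp0 = tmp10 + tmp13, tmp3 = tmp10 - tmp13;
    float tmp1 = tmp11 + tmp12, tmp2 = tmp11 - tmp12;
    /* Odd part */
    float z13 = c[5] + c[3], z10 = c[5] - c[3];
    float z11 = c[1] + c[7], z12 = c[1] - c[7];
    float tmp7 = z11 + z13;
    float z5 = (z10 + z12) * 1.847759065f;
    tmp11 = (z11 - z13) * 1.414213562f;
    tmp10 = z12 * 1.082392200f - z5;
    tmp12 = z10 * -2.613125930f + z5;
    float tmp6 = tmp12 - tmp7;
    float tmp5 = tmp11 - tmp6;
    float tmp4 = tmp10 + tmp5;
    w[0] = tmp0 + tmp7;  w[7] = tmp0 - tmp7;
    w[1] = tmp1 + tmp6;  w[6] = tmp1 - tmp6;
    w[2] = tmp2 + tmp5;  w[5] = tmp2 - tmp5;
    w[4] = tmp3 + tmp4;  w[3] = tmp3 - tmp4;
}

/* Assumed form of T(): keep a coefficient only if its magnitude exceeds thr. */
static float hard_threshold(float v, float thr)
{
    return fabsf(v) > thr ? v : 0.0f;
}

/* Whole pipeline for one row of 10 samples: FDCT, threshold and IDCT on
 * x[0..7] and on x[2..9], then overlap-add into result[0..9]. */
static void spp_row10(const float x[10], float thr, float result[10])
{
    float coef[8], w1[8], w2[8];

    fdct8(x, coef);
    for (int k = 0; k < 8; k++)
        coef[k] = hard_threshold(coef[k], thr);
    idct8(coef, w1);

    fdct8(x + 2, coef);
    for (int k = 0; k < 8; k++)
        coef[k] = hard_threshold(coef[k], thr);
    idct8(coef, w2);

    for (int i = 0; i < 10; i++)
        result[i] = 0.0f;
    for (int n = 0; n < 8; n++) {
        result[n]     += w1[n];
        result[n + 2] += w2[n];
    }
}
```

With thr = 0 nothing is removed, and each FDCT/IDCT pair returns 8 times its input. The result is then 8*x[i] for the four samples covered by one window and 16*x[i] for the six covered by both, which gives a quick check that a rewritten version still computes the same thing.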

-- 
Best regards,
      Nikolaj                          mailto:nialpof at pisem.net



