[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #2

Wed Aug 22 20:53:21 CEST 2007

Hi!

New patch attached, which hopefully addresses your concerns. Replies to the 
issues below.

Wednesday 22 August 2007 03:47-kor Michael Niedermayer ezt ?rta:
> On Tue, Aug 21, 2007 at 11:37:38PM +0200, Balatoni Denes wrote:
> > side. Although the mlib version is faster, this is more accurate - and I
> > honestly don't know why it is not faster on the sparc than it is, it 
> > should 
> > be according to my estimates, but it is not.
>
> it would be very interresting to find out why its not faster ...

I benchmarked it now with gethrtime(), and apparently it is as fast as it 
could be (cca. 1 clock/instruction). So probably I underestimated the speed 
of the C version.

> > Also, although it's the same algorithm as simple_idct, it might be less
> > accurate on rare occasions, because of the primitive slow split 8+8 bit
> > multply operation in VIS. I am hoping the idct won't overflow, too.

> what happens with the regression tests if its forced to be used where the
> normal C simple idct normally is?
>
> and what does dct-test.c output for the idct?

I won't try the regression test, as it would be very complicated - ffmpeg 
doesn't support solaris 8 (I don't have acces to anything else), so I am 
using an older hacked version of mplayer.

But I did try dct-test, here is the output:

------------
ednebal mwux119> ./dct-test -i 0
ffmpeg DCT/IDCT test

    0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0
IDCT REF-DBL: err_inf=0 err2=0.00000000 syserr=0.00000000 maxout=260 
blockSumErr=0
IDCT REF-DBL: 50.2 kdct/s

   -4   -24    -1   -15     4    24   -29   -47
  -35   -13    21    -2    24     8    -9    19
   30   -17   -38     7   -18    43   -28    -5
   31   -31   -13    -9    23     5    16    11
  -16    20   -43    18     4   -11    -9     0
   -3    -6    -9    -3    -5     4   -25    11
  -43    10    -6    -7    24    22   -12    23
    4   -23   -11     0    10    11    13    24
IDCT INT: err_inf=1 err2=0.01606250 syserr=0.00235000 maxout=260 blockSumErr=6
IDCT INT: 629.0 kdct/s

   -1    13    15    -2    30     9    10   -11
  -22   -27     9    11    15    -1    31     8
    2    -2     4   -21     0     9   -21   -13
   21   -17   -10    12     4     1     9    -1
   -8     4    -5    -6    20    -3    12     0
   21   -15     9    18    -9    19   -14     3
    1    22   -10    -6    11    -6     6     1
   21    18    -7   -11     2    17    16    -8
IDCT SIMPLE-C: err_inf=1 err2=0.00840078 syserr=0.00155000 maxout=260 
blockSumErr=5
IDCT SIMPLE-C: 463.2 kdct/s

  651   363   821  1416  1116   686    64   492
  247   380   736   237   441   323   534   420
 1094   716   359   676   442   484   371   426
  726   296   437   302   399   380   413   411
  762   622   357   548   476   429   350   463
  451   386   407   382   437   386   415   391
  384   579   421   523   482   492   403   454
  304   455   360   399   344   407   396   375
IDCT SIMPLE-VIS: err_inf=1 err2=0.06097266 syserr=0.07080000 maxout=259 
blockSumErr=12
IDCT SIMPLE-VIS: 1024.6 kdct/s

 2810  1725  3883  3476  1445  2977  1362  1943
 1263  1244  3402  2061  3266  2497  2936  2741
 3763  3555  1640  2765  1864  2615  2039  2391
 3648  1897  2893  2304  2447  2391  2482  2613
 1796  3132  2055  2615  2115  2426  2246  2373
 2690  2483  2701  2527  2544  2556  2554  2520
 1361  2738  2237  2556  2252  2520  2360  2349
 1882  2594  2395  2605  2279  2759  2274  2578
IDCT MLIBidct: err_inf=3 err2=0.38151094 syserr=0.19415000 maxout=260 
blockSumErr=40
IDCT MLIBidct: 2380.8 kdct/s
------------------

So it is less accurate than simple_idct, but definietly more accurate than the 
mlib version. I think I also can confirm that the inaccuracy is from the 
split two instruction multiply+rounding - because I have an earlier version 
of simple_idct_vis that uses 32 bit for the calculations (and hence does only 
two rows parallel, so it's half as fast, only a bit faster than the C 
version),  which has the same good accuracy as simple_idct. As the main 
algorithm is the same, I think the conclusion, that the two instruction 
multiply+rounding is guilty, is correct.

> > +     6393,  6393,  6393,  6393
> > +};
>
> const static

Fixed.

> also you permute the input explicitly instead of setting 
> idct_permutation_type 
> properly
>
> please dont sumbit trash

I don't think it is trash. Again, there is no left shift. It could be 
substituted with 4 additions and no fpackfix mess, but than it would only be 
marginally faster, and there would be more code in this file (as this 
permutating is still needed for the other half of the idct), in other words it 
would be more complex for little gain. So I prefer this version.

> > +void ff_simple_idct_add_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > { 
> > +    ff_simple_idct_vis(data);
> > +    ff_add_pixels_clamped_vis(data, dest, line_size);
> > +}
>
> check that gcc inlines these 4 calls, if not do something so it does, they
> should be inlined

I checked it, it didn't, now - thanks to the "inline" keyword - it does, I 
checked it too.

Thanks for reviewing
bye
Denes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_idct_vis_try3.diff
Type: text/x-diff
Size: 21821 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070822/1724ee15/attachment.diff>