[FFmpeg-devel] [PATCH] SPARC VIS simple_idct
Balatoni Denes
dbalatoni
Fri Aug 24 20:55:55 CEST 2007
Hi Michael!
So yes, I decided that I don't want all my work so far to be abandoned.
The chance of somebody picking up the patch and resubmitting it is
zero, there are very few sparc desktop users, and even less developers
obiously - maybe it's because the desktop SPARC CPUs are crap and expensive
too.
I gave a deep thought to each of your sugestions:
Friday 24 August 2007 01:07-kor Michael Niedermayer ezt ?rta:
> > > ok ill accept the fpadd16 for now if you move them after the for /fcmpd
> > > as
> > > they arent needed before and after that they would often not be
> > > executed :)
> > >
> >
> > That would be right in the middle of the idct4row macro. So there would be
> > insane amounts of code duplication - well unless, I cut up idct4row by
> > columns. If this was the last show stopper, I would do it, though ;)
>
> all you have to do is to put a ADDBLAH macro after the branches in the
> idct4row
> then define it appropriately before each idct4row
I might have misunderstood you, but are you saying to shift only the two
registers that are used in each column at the start of every column
calculation? I think it would make the code slower, as these VIS instructions
have a 4 clock latency (at least), so the cpu would be idling for two clocks
after each fpadd16. Also I really don't want to butcher the code, intruducing
more and more nested macros, until everything is a mess. So for these reasons
I didn't implement this part.
> > > anyway, please try to negate the high 8 bit or the coeffs and
> > > try
> > > fmul8sucks16
> > > fmul8ulcks16
> > > fpsub16
> > >
> > > maybe this reduces the rounding errors and its the same amount
> > > of code ...
> >
> > I see what you mean - and a good idea, which didn't occur to me - but I am
> > skeptical. The problem is, as I see it, is because of the two
> > instructions,
> > the error can be 1 (vs. 0.5 if it was one instruction with rounding). Now
> > playing with the signs doesn't change this unfortunatelly.
>
> its an idea, testing takes 5minutes for someone with a sparc, its just a
> matter of *-1 the 8bits and running dct-test
> a theoretical analysis of the rounding behavior would take more time
here it is. It is not really better, so we should stick to the original
algorithm IMHO.
-1056 1035 -606 -324 1266 1041 -228 700
730 94 1094 597 393 279 640 368
441 924 90 347 462 557 272 437
1061 311 627 496 449 385 441 447
369 726 324 476 519 490 385 449
555 404 471 460 522 493 514 477
192 551 254 395 418 482 314 467
287 498 338 378 379 468 373 391
IDCT SIMPLE-VIS: err_inf=1 err2=0.06218359 syserr=0.06330000 maxout=260
blockSumErr=13
IDCT SIMPLE-VIS: 1431.0 kdct/s
> > > hmm what happens if you just round the coeffs down to 8bit? / throw the
> > > half the multiplies out, the code as it is is inaccurate anyway due to
> > > sparc misdesign maybe the loss from 8bit coeffs isnt that big ...
> >
> > The current situtation is not that bad (of course the idct is switched of
> > by
> > default, but for just viewing a movie it should be mostly adequate). 8bit
> > coeffs imho would make the situation very bad though.
>
> again its a matter of <5min to test this (hey just add +128>>8 to the
> coeffs)
> you needed more time to write this mail
results don't look encouraging - it's worse in some respect than mlib, and
still slower
944 -817 1855 1574 -711 732 -433 -343
-1018 -514 1656 10 1197 561 978 553
1747 1428 -299 690 -102 617 341 284
1518 1 814 256 523 392 639 585
-492 1336 -3 687 173 566 476 487
577 544 646 494 463 770 459 587
-584 1083 273 746 204 486 449 499
-84 813 301 518 429 666 399 538
IDCT SIMPLE-VIS: err_inf=4 err2=1.00296406 syserr=0.09275000 maxout=262
blockSumErr=80
IDCT SIMPLE-VIS: 1248.8 kdct/s
Btw checking these two things was not 5 minutes, but two hours - the first
problems appeared, when I saw that dct-test didn't recognize TRANSPOSE_PERM.
> > > it seems that the sparc has twice as many registers as what would be
> > > needed
> > > to keep the coefficients in registers between teh rows and columns
> > > transforms that would avoid 32 instructions ...
> >
> > They are kept in registers, nothing ever touches the coefficients.
>
> STOREROWS writes them, LOADTRANS reads them, maybe i should have called
> them something else than coefficients but thats not really changing anything
Almost all registers are used, check the code. Either I don't understand what
you are saying, or you are suggesting that I rewrite the whole stuff.
> > > the idct should not store the output in memory but leave it in registers
> > > the ff_simple_idct_put/add then should call the idct (or have it
> > > inlined)
> > > and the clamping code should just work with the registers
> > > this avoids another 32 instructions
> >
> > This could be done too. Although there would be some great code
> > duplication
> > (unless one uses macros),
Because only half of the values are in registers, the other half has to be
loaded. The net effect is that the XXX_pixels_clamped have to be mostly
rewritten - that is renaming all the registers in some convoluted ways, or
doing the same to IDCT4ROW and the other macros - all for about 1% speed up
(cca. 0.3% overall best). I am sorry, but I am not going to do that, because
I think it's just stupid - loads of work for no real benefit (on the
contrary, the result is some quite convoluted messy code, with some weird
register order).
bye
Denes
More information about the ffmpeg-devel
mailing list