[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#6

Michael Niedermayer michaelni
Tue Aug 28 21:33:46 CEST 2007


Hi

On Tue, Aug 28, 2007 at 07:44:56PM +0200, Balatoni Denes wrote:
> Hi Michael!
> 
> On Tuesday 28 August 2007 at 06:04, Michael Niedermayer wrote:
> > > You forgot to give a good reason, because your argument seems flawed.
> > 
> > the code is suboptimal speedwise and you try to convince me that it can't
> > be improved instead of trying to improve the code
> > your code does a lot of stores which are followed by loads, many of them
> > can be avoided with no changes to the available registers yet you don't
> > you rather concentrate on arguing what in your opinion can't be done
> 
> Are you saying, that if I add/put_clamped the result in the second half of the
> transform directly, instead of just storing it for a later add/put_clamped
> operation, then you would accept my patch? If your answer is yes, then it is
> feasible, and I think I will do that.

you are forgetting that there's also 25% between the horizontal and vertical
idcts which can be reused with no store/load and no changes to the registers


> 
> > > Ok, I understand what you mean. I did some calculations. On the ultrasparc
> > > III (4 clock latency) about 14 clocks would be spent waiting - that's not
> > > too bad, that's still an 18 clock speed improvement. However on the
> > > ultrasparc T2 (Niagara 2, 6 clock latency) about 36 clocks would be spent
> > > waiting - that would be slower than before the rewrite. So it's a bad idea.
> > 
> > well and what if you combine the code for 2 columns? that is 2 even ones
> > or 2 odd ones, not an even/odd mix ...
> 
> Ok, I underestimated the speed loss on UltraSPARC T2 (US III was fine IMO),
> because I forgot to count the odd columns. Second, I think there are not
> enough registers to properly calculate two odd columns at once (8 registers
> would be needed for that).

how common is the US T2 ? if it's a rare and old CPU i don't see a reason to
care about it ...
also you seem to ignore that most of the 8x8 blocks will have nearly all
of their elements 0, so a slowdown in code which is only executed for
non-zero parts does not weigh the same as a slowdown in code which is
always executed
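
To put a number on that weighting, here is a minimal C sketch. It assumes the
skipping happens per row, in the style of a DC-only row shortcut; the
granularity in the VIS code may well differ, and the function names and the
per-row cost parameter are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if any of the 7 AC coefficients of the row is non-zero. */
static int row_has_ac(const int16_t row[8])
{
    int acc = 0;
    for (int i = 1; i < 8; i++)
        acc |= row[i];
    return acc != 0;
}

/* Number of rows that cannot take a cheap DC-only path (hypothetical helper). */
static int rows_needing_full_transform(const int16_t block[8][8])
{
    int n = 0;
    for (int r = 0; r < 8; r++)
        n += row_has_ac(block[r]);
    return n;
}

/* Extra clocks actually paid for one block when only the full (non-zero)
 * path got slower by extra_clocks_per_row. Rows that take the shortcut
 * pay nothing, so a sparse block sees only a fraction of the slowdown. */
static int weighted_extra_clocks(const int16_t block[8][8],
                                 int extra_clocks_per_row)
{
    return rows_needing_full_transform(block) * extra_clocks_per_row;
}
```

With a block that has a single row containing AC coefficients, only one
eighth of the worst-case penalty is paid, whereas a slowdown in
unconditional code is paid in full for every block.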


> And I also realized, that the transpose operation 
> needs 32 32bit registers (that is, all 32 bit registers are needed), which 
> means that sometime half the data has to be stored in 64 bit registers, and 
> then moved to 32 bit registers before transpose, which is an additional 8 
> instructions. So with these taken into account, I believe (and I tried to 
> make an accurate estimate) the rewrite would still be a few clocks slower on 
> the US T2. 

1. load left half of the 8x8 block in 8 64 bit registers
2. do idct of that into 8 2x32bit registers
3. transpose these
 a. 8 2x32 -> 8 2x32
 b. 8 2x32 -> 8 2x32
 c. 8 2x32 -> 8 64bit
4. load right half of the 8x8 block in 8 64 bit registers
5. do idct of that into 8 2x32bit registers
(here all 32bit registers are available for the transpose)
6. transpose these
 a. 8 2x32 -> 8 2x32
 b. 8 2x32 -> 8 2x32
 c. 8 2x32 -> 8 64bit
...

so there are no additional 8 instructions, at least i can't see where ...
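
The ordering of steps 1-6 can be sketched in plain C. This only shows the
data flow (transform one 8x4 half, write it out transposed, then do the
other half); it does not model the VIS registers or the three transpose
stages. The 1-D transform is a trivial stand-in, not the real IDCT, and
all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the real 8-point 1-D IDCT (assumption: any column
 * transform is enough to show the data flow); here a running sum. */
static void column_transform_1d(int32_t col[8])
{
    for (int i = 1; i < 8; i++)
        col[i] += col[i - 1];
}

/* Process one 8x4 half of the block: transform its 4 columns and store
 * each result transposed. half = 0 is the left half, 1 the right half. */
static void half_pass(const int16_t in[8][8], int32_t out[8][8], int half)
{
    int32_t tmp[8];
    for (int c = 0; c < 4; c++) {
        int col = half * 4 + c;
        for (int r = 0; r < 8; r++)
            tmp[r] = in[r][col];       /* "load" one column of this half */
        column_transform_1d(tmp);      /* "idct" of that column */
        for (int r = 0; r < 8; r++)
            out[col][r] = tmp[r];      /* transposed: column becomes row */
    }
}

/* First pass over the whole block, one half at a time. */
static void first_pass_two_halves(const int16_t in[8][8], int32_t out[8][8])
{
    half_pass(in, out, 0);  /* steps 1-3: left half */
    half_pass(in, out, 1);  /* steps 4-6: right half */
}
```

Each half is finished, including its transpose, before the next half is
loaded, which is why the second half can use all of the working registers
again.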


> 
> Also, in the rewritten code, source and destination registers would always be 
> changing between the 1/4 transformations, so it would be a convoluted mess. 
> Also writing the code would not be very easy, (because each register has to
> be handpicked from what is available at any given time, I think there 
> wouldn't really be a simple pattern - like it is now - to what register is 
> used when).
> 
> > It is dangerous to be right in matters on which the established
> > authorities are wrong. -- Voltaire
> 
> So I should start to be afraid now ? :)

no, you aren't right :)

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Breaking DRM is a little like attempting to break through a door even
though the window is wide open and the only thing in the house is a bunch
of things you don't want and which you would get tomorrow for free anyway