[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#6
Tue Aug 28 19:44:56 CEST 2007
On Tuesday 28 August 2007 at 06:04, Michael Niedermayer wrote:
> > You forgot to give a good reason, because your argument seems flawed.
> the code is suboptimal speedwise and you try to convince me that it can't
> be improved instead of trying to improve the code
> your code does a lot of stores which are followed by loads, many of them
> can be avoided with no changes to the available registers, yet you don't
> you rather concentrate on arguing what in your opinion can't be done
Are you saying that if I add/put_clamped the result in the second half of the
transform directly, instead of just storing it for a later add/put_clamped
operation, then you would accept my patch? If your answer is yes, then it is
feasible, and I think I will do that.
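To make the question concrete, here is a minimal C sketch of the two variants, not the VIS code from the patch. Everything in it is an assumption made for illustration: `second_half()` is only a placeholder for the real second pass of the transform, and `clamp_u8()`, `put_two_pass()`, `put_fused()` and `outputs_match()` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Clamp a value to the 0..255 pixel range. */
static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Placeholder for the second half of the transform (NOT the real IDCT). */
static int second_half(int16_t coeff)
{
    return (coeff + 4) / 8;
}

/* Variant 1: store the result to a temporary block, then load it again
 * in a separate put_clamped pass. */
static void put_two_pass(uint8_t *dst, int stride, const int16_t *block)
{
    int16_t tmp[64];

    for (int i = 0; i < 64; i++)
        tmp[i] = (int16_t)second_half(block[i]);

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = clamp_u8(tmp[y * 8 + x]);
}

/* Variant 2: clamp and write each result directly, so the intermediate
 * store and the later load both disappear. */
static void put_fused(uint8_t *dst, int stride, const int16_t *block)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = clamp_u8(second_half(block[y * 8 + x]));
}

/* Returns 1 if both variants produce the same 8x8 output. */
static int outputs_match(void)
{
    int16_t block[64];
    uint8_t a[64], b[64];

    /* Test data chosen so that results fall below 0 and above 255. */
    for (int i = 0; i < 64; i++)
        block[i] = (int16_t)(i * 80 - 1000);

    put_two_pass(a, 8, block);
    put_fused(b, 8, block);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Both variants write the same pixels; the fused one only drops the temporary block, which is the store followed by a load that the review complains about.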
> > Ok, I understand what you mean. I did some calculations. On the UltraSPARC
> > III (4 clock latency) about 14 clocks would be spent waiting - that's not
> > too bad, that's still an 18 clock speed improvement. However on the
> > UltraSPARC T2 (Niagara 2, 6 clock latency) about 36 clocks would be spent
> > waiting - that would be slower than before the rewrite. So it's a bad idea.
> well and what if you combine the code for 2 columns? that is 2 even ones
> or 2 odd ones not even odd mix ...
Ok, I underestimated the speed loss on UltraSPARC T2 (US III was fine IMO),
because I forgot to count the odd columns. Second, I think there are not
enough registers to properly calculate two odd columns at once (8 registers
would be needed for that). I also realized that the transpose operation
needs 32 32-bit registers (that is, all of the 32-bit registers), which
means that at times half the data has to be held in 64-bit registers and
then moved to 32-bit registers before the transpose, which costs an
additional 8 instructions. With these taken into account, I believe (and I
tried to make an accurate estimate) the rewrite would still be a few clocks
slower on the US T2.
Also, in the rewritten code, source and destination registers would always be
changing between the 1/4 transformations, so it would be a convoluted mess.
Writing the code would not be very easy either (because each register has to
be handpicked from what is available at any given time, I think there
wouldn't really be a simple pattern - like there is now - to what register is
> It is dangerous to be right in matters on which the established
> authorities are wrong. -- Voltaire
So I should start to be afraid now ? :)