[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #2

Thu Aug 23 03:31:42 CEST 2007

Hi

On Thu, Aug 23, 2007 at 03:01:34AM +0200, Michael Niedermayer wrote:
[...]
> > > also I am realizing now that you read and work with just 32 bits at a time
> > > while the registers really are 64 bit
> > > so unless SPARC needs 2x as much time for 64-bit instructions this is very
> > > inefficient
> > 
> > Now I am kind of puzzled. I am using 64 bit registers. Like f0+f1 is one 64bit 
> > register. f32, f34, ...f62 are 64 bit registers (these can't even be accessed 
> > in 32 bit parts). So I really don't understand what you are saying. The big 
> > macro computes 4 rows in parallel, how could it do that, without using 64 bit 
> > registers?
> 
> hmm ok the registers are split in 2, I didn't know that (this design is
> extremely bad for a RISC CPU as it makes out-of-order execution very hard)
> 
> still the loads are 32 bit
> so let's check if I finally figured out how SPARC asm works
> 1. you load everything by using 32-bit loads into the low and high
>    halves of 64-bit registers
> 2. you duplicate the input and on each side mask half of the 16-bit
>    values away
> 3. you use fpackfix to shift half the input left by 4 bits and pack the two
>    16-bit values, which are separated by 16 zero bits, into a 32-bit register
> 4. you subtract the other half from 2048
> 5. you do the same fpackfix on the second half
> ...
> 
> ok let's see
> 1. 8 instructions are useless, you can use 64-bit loads
> 2. all 16 instructions are unneeded
> 3+5 (16 instructions) are unneeded, you can quickly shift the coeffs up to
>     block_last_index by using C code
> 4. the subtract is effectively done in 32 bits (half of the registers
>    are 0, aka unused)
> 
> so again
> fix the permutation
> shift left by 4 bits using C code or asm, both stopping after block_last_index
> do the 2048*(1<<4) subtraction, if needed, 64 bits at a time
> 
> at this point you should have pretty much the same data as in your case
> but very significantly faster
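
To make the C-side pre-shift concrete, here is a minimal sketch. It is not
taken from the patch: prescale_coeffs, scan and last_index are hypothetical
names. scan[] is assumed to map a scan position to an index in the block, and
last_index plays the role of block_last_index (with -1 meaning an empty
block). It also assumes every scaled coefficient still fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale the coded coefficients by 16 (i.e. << 4)
 * before handing the block to the IDCT, touching only the coefficients
 * up to and including scan position last_index. */
static void prescale_coeffs(int16_t block[64], const uint8_t scan[64],
                            int last_index)
{
    assert(last_index >= -1 && last_index < 64);

    for (int i = 0; i <= last_index; i++) {
        /* Multiply instead of shifting: << on a negative value is
         * undefined behaviour in C. Assumes the result fits in int16_t. */
        block[scan[i]] = (int16_t)(block[scan[i]] * 16);
    }
}
```

Because the loop stops at last_index, a mostly empty block costs only a few
iterations, which is the point of doing this outside the SIMD code.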

also, if you need to transpose the data between the row and column transforms,
the following can be used (barring typos, high/low half mix-ups and such)

f0  A0A1 A2A3 A4A5 A6A7
f2  B0B1 B2B3 B4B5 B6B7
f4  C0C1 C2C3 C4C5 C6C7
f6  D0D1 D2D3 D4D5 D6D7
f8  E0E1 E2E3 E4E5 E6E7
f10 F0F1 F2F3 F4F5 F6F7
f12 G0G1 G2G3 G4G5 G6G7
f14 H0H1 H2H3 H4H5 H6H7

fpmerge f0,  f4,  f16   A0C0 A1C1 A2C2 A3C3
fpmerge f2,  f6,  f18   B0D0 B1D1 B2D2 B3D3

fpmerge f16, f18, f0    A0B0 C0D0 A1B1 C1D1
fpmerge f17, f19, f2    A2B2 C2D2 A3B3 C3D3

fpmerge f0,  f1,  f16   A0A1 B0B1 C0C1 D0D1
fpmerge f2,  f3,  f18   A2A3 B2B3 C2C3 D2D3

...
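
The merge sequence above can be checked with a small C model. This is only a
sketch under assumptions: fpmerge() below models the instruction as a byte
interleave of two 32-bit operands (most significant byte first), and hi32,
lo32 and transpose_2cols are hypothetical helper names. It covers just the six
instructions shown, i.e. the high 32-bit halves of rows A to D.

```c
#include <stdint.h>

/* Model of fpmerge: a = a0 a1 a2 a3, b = b0 b1 b2 b3 (MSB first)
 * gives a0 b0 a1 b1 a2 b2 a3 b3. */
static uint64_t fpmerge(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 3; i >= 0; i--) {
        r = (r << 8) | ((a >> (8 * i)) & 0xff);
        r = (r << 8) | ((b >> (8 * i)) & 0xff);
    }
    return r;
}

/* High and low 32-bit halves of a 64-bit register (f16/f17 style pairs). */
static uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }
static uint32_t lo32(uint64_t v) { return (uint32_t)v; }

/* row[0..3] hold the first two 16-bit coefficients of rows A, B, C, D.
 * col[0] receives coefficient 0 of A, B, C, D; col[1] coefficient 1. */
static void transpose_2cols(const uint32_t row[4], uint64_t col[2])
{
    uint64_t ac = fpmerge(row[0], row[2]);     /* A0C0 A1C1 A2C2 A3C3 */
    uint64_t bd = fpmerge(row[1], row[3]);     /* B0D0 B1D1 B2D2 B3D3 */

    uint64_t t0 = fpmerge(hi32(ac), hi32(bd)); /* A0B0 C0D0 A1B1 C1D1 */
    uint64_t t1 = fpmerge(lo32(ac), lo32(bd)); /* A2B2 C2D2 A3B3 C3D3 */

    col[0] = fpmerge(hi32(t0), lo32(t0));      /* A0A1 B0B1 C0C1 D0D1 */
    col[1] = fpmerge(hi32(t1), lo32(t1));      /* A2A3 B2B3 C2C3 D2D3 */
}
```

Feeding it the rows 0xA0A1A2A3, 0xB0B1B2B3, 0xC0C1C2C3 and 0xD0D1D2D3 gives
0xA0A1B0B1C0C1D0D1 and 0xA2A3B2B3C2C3D2D3, which matches the last two lines of
the diagram.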


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

If you really think that XML is the answer, then you definitely misunderstood
the question -- Attila Kinali