[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #4

Fri Aug 24 00:32:42 CEST 2007

Hi Michael!

Well, honestly, I don't want to spend more time on this - at least for now. I 
have already spent like one and a half week with this, when I should have 
been doing my job. Today all I did at work was fixing it according to your 
suggestions, and I want to do something else now (if all else fails, my 
job;) ). To clarify, this is not my dayjob, hacking ffmpeg, and I am not a 
university student anymore either (unfortunatelly).
I would be happy if the idct finally went to svn, and I would be sad if it 
didn't (after spending so much time with it), but this was what I could do 
for now. 

Anyway, I answer to some of your new suggestions, where appropriate.

Thursday 23 August 2007 23:14-kor Michael Niedermayer ezt ?rta:
> ok ill accept the fpadd16 for now if you move them after the for /fcmpd as
> they arent needed before and after that they would often not be executed :)
>

That would be right in the middle of the idct4row macro. So there would be 
insane amounts of code duplication - well unless, I cut up idct4row by 
columns. If this was the last show stopper, I would do it, though ;)

> > I am quite positive, that the 2 instruction fmul is the problem. Both
> > halves of the multiply do rounding, so this explains everything. And as I
> > mentioned, the version that used 16x16->32 bit muls had the same good
> > accuracy as simple_idct.
>
> hmm, are the people who designed the vis instruction set still alive or did
> they die on alzheimer already?

Yes, VIS is very basic, primitive and slow compared to mmx/sse2 :(

> anyway, please try to negate the high 8 bit or the coeffs and
> try
> fmul8sucks16
> fmul8ulcks16
> fpsub16
>
> maybe this reduces the rounding errors and its the same amount
> of code ...

I see what you mean - and a good idea, which didn't occur to me - but I am 
skeptical. The problem is, as I see it, is because of the two instructions, 
the error can be 1 (vs. 0.5 if it was one instruction with rounding). Now 
playing with the signs doesn't change this unfortunatelly.

> i also reject code which can be changed to achive a 0.1% overall speedup
> (not rejecting it is like accepting a 0.1% slowdown ...)

No because then it is 0% speedup vs. X-0.1% speedup, where X > 0.1.

> > multiplies. So with the simple_idct algorithm, I don't expect major
> > speedups.
>
> hmm what happens if you just round the coeffs down to 8bit? / throw the
> half the multiplies out, the code as it is is inaccurate anyway due to
> sparc misdesign maybe the loss from 8bit coeffs isnt that big ...

The current situtation is not that bad (of course the idct is switched of by 
default, but for just viewing a movie it should be mostly adequate). 8bit 
coeffs imho would make the situation very bad though.

> > +    asm volatile(
> > +        "ldd [%2], %%f32         \n\t"
> > +        "ldd [%2+8], %%f34       \n\t"
> > +        "ldd [%2+16], %%f36      \n\t"
> > +        "ldd [%2+24], %%f38      \n\t"
> > +        "ldd [%2+32], %%f40      \n\t"
> > +        "ldd [%2+40], %%f42      \n\t"
> > +        "ldd [%2+48], %%f44      \n\t"
> > +
> > +        // shift right 16-4=12
> > +        LOADSCALE("%3+0")
> > +        IDCT4ROWS
> > +        STOREROWS("%4+0")
> > +        LOADSCALE("%3+8")
> > +        IDCT4ROWS
> > +        STOREROWS("%4+8")
> > +
> > +        // shift right 16+4
> > +        LOADTRANS("%4+0")
> > +        IDCT4ROWS
> > +        SCALEROWS
> > +        STOREROWS("%3+0")
> > +        LOADTRANS("%4+64")
> > +        IDCT4ROWS
> > +        SCALEROWS
> > +        STOREROWS("%3+8")
>
> it seems that the sparc has twice as many registers as what would be needed
> to keep the coefficients in registers between teh rows and columns
> transforms that would avoid 32 instructions ...

They are kept in registers, nothing ever touches the coefficients.

> is fpmerge (with 0) faster, slower or the same speed as fmul8x16? what
> about fpexpand, both could be used here alternatively
> btw do you have some document which lists the speed if all these
> instructions?

Same speed, troughput 1/clock, latency 4 (on ultrasparc3, 6 on ultrasparc T2).
These numbers can be found in the CPU user manuals.
http://www.sun.com/processors/documentation.html

> > +        "ldd [%0], %%f32      \n\t"
> > +        "ldd [%0+8], %%f34    \n\t"
> > +        "ldd [%0+16], %%f36   \n\t"
> > +        "ldd [%0+24], %%f38   \n\t"
> > +        "ldd [%0+32], %%f40   \n\t"
> > +        "ldd [%0+40], %%f42   \n\t"
> > +        "ldd [%0+48], %%f44   \n\t"
> > +        "ldd [%0+56], %%f46   \n\t"
> > +        "ldd [%0+64], %%f48   \n\t"
> > +        "ldd [%0+72], %%f50   \n\t"
> > +        "ldd [%0+80], %%f52   \n\t"
> > +        "ldd [%0+88], %%f54   \n\t"
> > +        "ldd [%0+96], %%f56   \n\t"
> > +        "ldd [%0+104], %%f58  \n\t"
> > +        "ldd [%0+112], %%f60  \n\t"
> > +        "ldd [%0+120], %%f62  \n\t"
> > +        "fpadd16 %%f0, %%f32, %%f0   \n\t"
> > +        "fpadd16 %%f2, %%f34, %%f2   \n\t"
> > +        "fpadd16 %%f4, %%f36, %%f4   \n\t"
> > +        "fpadd16 %%f6, %%f38, %%f6   \n\t"
> > +        "fpadd16 %%f8, %%f40, %%f8   \n\t"
> > +        "fpadd16 %%f10, %%f42, %%f10 \n\t"
> > +        "fpadd16 %%f12, %%f44, %%f12 \n\t"
> > +        "fpadd16 %%f14, %%f46, %%f14 \n\t"
> > +        "fpadd16 %%f16, %%f48, %%f16 \n\t"
> > +        "fpadd16 %%f18, %%f50, %%f18 \n\t"
> > +        "fpadd16 %%f20, %%f52, %%f20 \n\t"
> > +        "fpadd16 %%f22, %%f54, %%f22 \n\t"
> > +        "fpadd16 %%f24, %%f56, %%f24 \n\t"
> > +        "fpadd16 %%f26, %%f58, %%f26 \n\t"
> > +        "fpadd16 %%f28, %%f60, %%f28 \n\t"
> > +        "fpadd16 %%f30, %%f62, %%f30 \n\t"
> > +        "fpack16 %%f0, %%f0   \n\t"
> > +        "fpack16 %%f4, %%f4   \n\t"
> > +        "fpack16 %%f8, %%f8   \n\t"
> > +        "fpack16 %%f12, %%f12 \n\t"
> > +        "fpack16 %%f16, %%f16 \n\t"
> > +        "fpack16 %%f20, %%f20 \n\t"
> > +        "fpack16 %%f24, %%f24 \n\t"
> > +        "fpack16 %%f28, %%f28 \n\t"
> > +        "fpack16 %%f2, %%f1   \n\t"
> > +        "fpack16 %%f6, %%f5   \n\t"
> > +        "fpack16 %%f10, %%f9  \n\t"
> > +        "fpack16 %%f14, %%f13 \n\t"
> > +        "fpack16 %%f18, %%f17 \n\t"
> > +        "fpack16 %%f22, %%f21 \n\t"
> > +        "fpack16 %%f26, %%f25 \n\t"
> > +        "fpack16 %%f30, %%f29 \n\t"
> > +        "sub %1, %4, %1       \n\t"
> > +        "std %%f0, [%1]       \n\t"
> > +        "add %1, %2, %1       \n\t"
> > +        "std %%f4, [%1]       \n\t"
> > +        "add %1, %2, %1       \n\t"
> > +        "std %%f8, [%1]       \n\t"
> > +        "add %1, %2, %1       \n\t"
> > +        "std %%f12, [%1]      \n\t"
> > +        "add %1, %2, %1       \n\t"
> > +        "std %%f16, [%1]      \n\t"
> > +        "add %1, %2, %1       \n\t"
> > +        "std %%f20, [%1]      \n\t"
> > +        "add %1, %2, %1       \n\t"
> > +        "std %%f24, [%1]      \n\t"
> > +        "add %1, %2, %1       \n\t"
> > +        "std %%f28, [%1]      \n\t"
>
> as the sparc has a shitload of registers you could use a new register for
> each result of the adds used during reading and then reuse these during
> writing
> avoids 8 instructions

This could be done. But I am afraid - didn't check - gcc is already sticking 
some crazy prologues and epiloges to the asm section because of the lot of 
input and output constrains(and btw, I give output registers too, because I 
am afraid gcc miscompiles the input, it the output is not given).

> > +        : "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4), "=r"
> > (out5) +        : "0" (block), "1" (pixels), "2" (line_size), "3"
> > (expand), "4" (line_size*7) +    );
> > +}
> > +
> > +void ff_simple_idct_put_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > { +    ff_simple_idct_vis(data);
> > +    ff_put_pixels_clamped_vis(data, dest, line_size);
> > +}
> > +
> > +void ff_simple_idct_add_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > { +    ff_simple_idct_vis(data);
> > +    ff_add_pixels_clamped_vis(data, dest, line_size);
> > +}
>
> the idct should not store the output in memory but leave it in registers
> the ff_simple_idct_put/add then should call the idct (or have it inlined)
> and the clamping code should just work with the registers
> this avoids another 32 instructions

This could be done too. Although there would be some great code duplication 
(unless one uses macros), because the xxx_pixels_clamped instructions are put 
into the context in dsputil, so there should be a version which reads from 
memory.

bye
Denes