[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #4

Thu Aug 23 23:14:48 CEST 2007

Hi

On Thu, Aug 23, 2007 at 09:01:01PM +0200, Balatoni Denes wrote:
> Hi!
> 
> So here is a new patch, I implemented your suggestions. One change is, that I 
> used 4 fpadd16 to do the shifting left by four, because what gcc made of the 
> C code didn't look all that good - or maybe I misunerstood it, I don't know. 
> Anyhow it shouldn't be much slower, as the c version also needed 1 load, 1 
> store, 1 shift and 1 logical and, and also increasing the loop variable, and 
> checking it (although block_last_index could have made it slightly faster). I 
> hope it's ok.

ok ill accept the fpadd16 for now if you move them after the for /fcmpd as
they arent needed before and after that they would often not be executed :)

> 
> Thursday 23 August 2007 14:00-kor Michael Niedermayer ezt ?rta:
> > > HDTV). Also as the idct is rather inaccurate, 
> >
> > ive not yet looked at how to make it more accurate :)
> 
> I am quite positive, that the 2 instruction fmul is the problem. Both halves 
> of the multiply do rounding, so this explains everything. And as I mentioned, 
> the version that used 16x16->32 bit muls had the same good accuracy as 
> simple_idct.

hmm, are the people who designed the vis instruction set still alive or did
they die on alzheimer already?

anyway, please try to negate the high 8 bit or the coeffs and
try
fmul8sucks16
fmul8ulcks16
fpsub16

maybe this reduces the rounding errors and its the same amount
of code ...

> 
> > its like leaving 100euro laying at the street saying its not enough to buy
> > a car ...
> > [...]
> > 2% overall speedup is huge ive rejected patches which would have introduced
> > new features because they slowed the code down by 0.1%
> 
> Yes, ok, I did it after all (and it didn't hurt :) ). Unfortunatelly I can't 
> benchmark properly because of many background processes, but dct-test says - 
> though it seems a bit too optimistic - there is a 20% speed-up of the idct. I 
> think 5-10% is more realistic and probable, but anyway there should be 
> measurable improvment. BTW I do think your rejecting features because they 
> slowed the code down by 0.1% is a bit harsh, but that's none of my 
> business :)

well it is because this works both ways ;)
i also reject code which can be changed to achive a 0.1% overall speedup
(not rejecting it is like accepting a 0.1% slowdown ...)
and yes 0.1% on its own is irrelevant but this accumulates and after
100 such patches you lost 10% speed ...

> 
> > also mlib does the idct at half the speed, so i think theres more than 5% of
> > gain possible
> 
> IMO the idct is not too slow right now. But also imho mlib's speed is because 
> of a faster, mpeg (derived) algorithm, which uses half as many multiplies. So 
> with the simple_idct algorithm, I don't expect major speedups.

hmm what happens if you just round the coeffs down to 8bit? / throw the
half the multiplies out, the code as it is is inaccurate anyway due to sparc
misdesign maybe the loss from 8bit coeffs isnt that big ...

also maybe try a mpegidct.c (see ffmpeg svn r9601) based solution

> 
> bye
> Denes
> 
> ps: it would be great if this could be committed as is, because I already 

no but my comments in this mail should be the last, after dealing with them
the idct should be ok for svn

[...]
> +    asm volatile(
> +        "ldd [%2], %%f32         \n\t"
> +        "ldd [%2+8], %%f34       \n\t"
> +        "ldd [%2+16], %%f36      \n\t"
> +        "ldd [%2+24], %%f38      \n\t"
> +        "ldd [%2+32], %%f40      \n\t"
> +        "ldd [%2+40], %%f42      \n\t"
> +        "ldd [%2+48], %%f44      \n\t"
> +
> +        // shift right 16-4=12
> +        LOADSCALE("%3+0")
> +        IDCT4ROWS
> +        STOREROWS("%4+0")
> +        LOADSCALE("%3+8")
> +        IDCT4ROWS
> +        STOREROWS("%4+8")
> +
> +        // shift right 16+4
> +        LOADTRANS("%4+0")
> +        IDCT4ROWS
> +        SCALEROWS
> +        STOREROWS("%3+0")
> +        LOADTRANS("%4+64")
> +        IDCT4ROWS
> +        SCALEROWS
> +        STOREROWS("%3+8")

it seems that the sparc has twice as many registers as what would be needed to
keep the coefficients in registers between teh rows and columns transforms
that would avoid 32 instructions ...

[...]
> +inline void ff_add_pixels_clamped_vis(const DCTELEM *block, uint8_t *pixels, int line_size) {
> +    int out1, out2, out3, out4, out5;
> +
> +    asm volatile(
> +        "ldd [%3], %%f32      \n\t"
> +        "wr %%g0, 0x38, %%gsr \n\t"
> +        "ldd [%1], %%f0       \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "ldd [%1], %%f4       \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "ldd [%1], %%f8       \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "ldd [%1], %%f12      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "ldd [%1], %%f16      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "ldd [%1], %%f20      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "ldd [%1], %%f24      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "ldd [%1], %%f28      \n\t"
> +        "fmul8x16 %%f1, %%f32, %%f2   \n\t"
> +        "fmul8x16 %%f5, %%f32, %%f6   \n\t"
> +        "fmul8x16 %%f9, %%f32, %%f10  \n\t"
> +        "fmul8x16 %%f13, %%f32, %%f14 \n\t"
> +        "fmul8x16 %%f17, %%f32, %%f18 \n\t"
> +        "fmul8x16 %%f21, %%f32, %%f22 \n\t"
> +        "fmul8x16 %%f25, %%f32, %%f26 \n\t"
> +        "fmul8x16 %%f29, %%f32, %%f30 \n\t"
> +        "fmul8x16 %%f0, %%f32, %%f0   \n\t"
> +        "fmul8x16 %%f4, %%f32, %%f4   \n\t"
> +        "fmul8x16 %%f8, %%f32, %%f8   \n\t"
> +        "fmul8x16 %%f12, %%f32, %%f12 \n\t"
> +        "fmul8x16 %%f16, %%f32, %%f16 \n\t"
> +        "fmul8x16 %%f20, %%f32, %%f20 \n\t"
> +        "fmul8x16 %%f24, %%f32, %%f24 \n\t"
> +        "fmul8x16 %%f28, %%f32, %%f28 \n\t"

is fpmerge (with 0) faster, slower or the same speed as fmul8x16? what about
fpexpand, both could be used here alternatively
btw do you have some document which lists the speed if all these instructions?

> +        "ldd [%0], %%f32      \n\t"
> +        "ldd [%0+8], %%f34    \n\t"
> +        "ldd [%0+16], %%f36   \n\t"
> +        "ldd [%0+24], %%f38   \n\t"
> +        "ldd [%0+32], %%f40   \n\t"
> +        "ldd [%0+40], %%f42   \n\t"
> +        "ldd [%0+48], %%f44   \n\t"
> +        "ldd [%0+56], %%f46   \n\t"
> +        "ldd [%0+64], %%f48   \n\t"
> +        "ldd [%0+72], %%f50   \n\t"
> +        "ldd [%0+80], %%f52   \n\t"
> +        "ldd [%0+88], %%f54   \n\t"
> +        "ldd [%0+96], %%f56   \n\t"
> +        "ldd [%0+104], %%f58  \n\t"
> +        "ldd [%0+112], %%f60  \n\t"
> +        "ldd [%0+120], %%f62  \n\t"
> +        "fpadd16 %%f0, %%f32, %%f0   \n\t"
> +        "fpadd16 %%f2, %%f34, %%f2   \n\t"
> +        "fpadd16 %%f4, %%f36, %%f4   \n\t"
> +        "fpadd16 %%f6, %%f38, %%f6   \n\t"
> +        "fpadd16 %%f8, %%f40, %%f8   \n\t"
> +        "fpadd16 %%f10, %%f42, %%f10 \n\t"
> +        "fpadd16 %%f12, %%f44, %%f12 \n\t"
> +        "fpadd16 %%f14, %%f46, %%f14 \n\t"
> +        "fpadd16 %%f16, %%f48, %%f16 \n\t"
> +        "fpadd16 %%f18, %%f50, %%f18 \n\t"
> +        "fpadd16 %%f20, %%f52, %%f20 \n\t"
> +        "fpadd16 %%f22, %%f54, %%f22 \n\t"
> +        "fpadd16 %%f24, %%f56, %%f24 \n\t"
> +        "fpadd16 %%f26, %%f58, %%f26 \n\t"
> +        "fpadd16 %%f28, %%f60, %%f28 \n\t"
> +        "fpadd16 %%f30, %%f62, %%f30 \n\t"
> +        "fpack16 %%f0, %%f0   \n\t"
> +        "fpack16 %%f4, %%f4   \n\t"
> +        "fpack16 %%f8, %%f8   \n\t"
> +        "fpack16 %%f12, %%f12 \n\t"
> +        "fpack16 %%f16, %%f16 \n\t"
> +        "fpack16 %%f20, %%f20 \n\t"
> +        "fpack16 %%f24, %%f24 \n\t"
> +        "fpack16 %%f28, %%f28 \n\t"
> +        "fpack16 %%f2, %%f1   \n\t"
> +        "fpack16 %%f6, %%f5   \n\t"
> +        "fpack16 %%f10, %%f9  \n\t"
> +        "fpack16 %%f14, %%f13 \n\t"
> +        "fpack16 %%f18, %%f17 \n\t"
> +        "fpack16 %%f22, %%f21 \n\t"
> +        "fpack16 %%f26, %%f25 \n\t"
> +        "fpack16 %%f30, %%f29 \n\t"
> +        "sub %1, %4, %1       \n\t"
> +        "std %%f0, [%1]       \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "std %%f4, [%1]       \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "std %%f8, [%1]       \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "std %%f12, [%1]      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "std %%f16, [%1]      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "std %%f20, [%1]      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "std %%f24, [%1]      \n\t"
> +        "add %1, %2, %1       \n\t"
> +        "std %%f28, [%1]      \n\t"

as the sparc has a shitload of registers you could use a new register for
each result of the adds used during reading and then reuse these during
writing
avoids 8 instructions

> +        : "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4), "=r" (out5)
> +        : "0" (block), "1" (pixels), "2" (line_size), "3" (expand), "4" (line_size*7)
> +    );
> +}
> +
> +void ff_simple_idct_put_vis(uint8_t *dest, int line_size, DCTELEM *data) {
> +    ff_simple_idct_vis(data);
> +    ff_put_pixels_clamped_vis(data, dest, line_size);
> +}
> +
> +void ff_simple_idct_add_vis(uint8_t *dest, int line_size, DCTELEM *data) {
> +    ff_simple_idct_vis(data);
> +    ff_add_pixels_clamped_vis(data, dest, line_size);
> +}

the idct should not store the output in memory but leave it in registers
the ff_simple_idct_put/add then should call the idct (or have it inlined)
and the clamping code should just work with the registers
this avoids another 32 instructions

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Let us carefully observe those good qualities wherein our enemies excel us
and endeavor to excel them, by avoiding what is faulty, and imitating what
is excellent in them. -- Plutarch
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070823/ddbad929/attachment.pgp>