[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#7

Thu Aug 30 01:25:32 CEST 2007

Hi

On Wed, Aug 29, 2007 at 11:37:19PM +0200, Balatoni Denes wrote:
> Hi!
> 
> On Wednesday 29 August 2007 at 00:13, Michael Niedermayer wrote:
> > On Tue, Aug 28, 2007 at 10:38:23PM +0200, Balatoni Denes wrote:
> > > > > you are forgetting that theres also 25% between the horizontal and
> > > > > vertical idcts which can be reused with no store/load and no changes
> > > > > to the registers
> > > >
> > > > Indeed, I didn't take that into account. So if I fix that 25% and the
> > > > clamping part, will you accept the patch?
> > >
> > > Better yet: that would be 4 instructions. How about I gain 4 clocks in
> > > some other way instead - how, let it be my secret. Okay?
> >
> > hmm no but you have to do that secret optimization too now at minimum for
> > it to be considered for svn
> >
> > let me remind you, code has to be optimal to be accepted
> >
> > ill investigate the register shortage vs. avoidable load/stores vs. latency
> > after (the unlikely) case that you do correct the undisputed
> > suboptimalities
> 
> Here is a new patch. I fixed all "undisputed suboptimalities". I also 
> eliminated many adds, as you suggested before, because I found that gcc 
> optimized away all unneeded prologue and epilogue code around the asm block. 
> I also eliminated the temporary 128 byte storage where it is not needed.

:)

some more hard to dispute ideas below ...

[...]
> @@ -4045,6 +4049,13 @@
>    int accel = vis_level ();
>  
>    if (accel & ACCEL_SPARC_VIS) {
> +      if(avctx->idct_algo==FF_IDCT_SIMPLEVIS){
> +                c->idct_put = ff_simple_idct_put_vis;
> +                c->idct_add = ff_simple_idct_add_vis;
> +                c->idct     = ff_simple_idct_vis;
> +                c->idct_permutation_type = FF_TRANSPOSE_IDCT_PERM;
> +      }
> +

this should be indented by 4 spaces

[...]
> +#define IDCT4ROWS \
> +    /* 1. column */\
> +        "fmul8ulx16 %%f0, %%f38, %%f58 \n\t"\
> +        "fmul8ulx16 %%f2, %%f32, %%f18 \n\t"\
> +        "fmul8ulx16 %%f2, %%f36, %%f22 \n\t"\
> +        "fmul8ulx16 %%f2, %%f40, %%f26 \n\t"\
> +        "fmul8ulx16 %%f2, %%f44, %%f30 \n\t"\
> +\
> +        "fmul8sux16 %%f0, %%f38, %%f48 \n\t"\
> +        "fmul8sux16 %%f2, %%f32, %%f50 \n\t"\
> +        "fmul8sux16 %%f2, %%f36, %%f52 \n\t"\
> +        "fmul8sux16 %%f2, %%f40, %%f54 \n\t"\
> +        "fmul8sux16 %%f2, %%f44, %%f56 \n\t"\
> +\
> +        "fpadd16 %%f48, %%f58, %%f58 \n\t"\
> +        "fpadd16 %%f50, %%f18, %%f18 \n\t"\
> +        "fpadd16 %%f52, %%f22, %%f22 \n\t"\
> +        "fpadd16 %%f54, %%f26, %%f26 \n\t"\
> +        "fpadd16 %%f56, %%f30, %%f30 \n\t"\
> +\
> +        "fpadd16 %%f58, %%f0, %%f16  \n\t"\
> +        "fpadd16 %%f58, %%f0, %%f20  \n\t"\
> +        "fpadd16 %%f58, %%f0, %%f24  \n\t"\
> +        "fpadd16 %%f58, %%f0, %%f28  \n\t"\
> +        "fpadd16 %%f18, %%f2, %%f18  \n\t"\
> +        "fpadd16 %%f22, %%f2, %%f22  \n\t"\
> +        "fpadd16 %%f26, %%f2, %%f26  \n\t"\
> +    /* 2. column */\
> +        "for %%f4, %%f6, %%f60         \n\t"\
> +        "fcmpd %%fcc0, %%f62, %%f60    \n\t"\

the for and fcmpd can be moved up (with some distance from each other,
so as to avoid the 10 cycle stall; you said all instructions have a latency
of 6 on the US T2). nothing above touches any of f4, f6, f60, f62 or fcc0,
so this should work

> +        "fbe 3f                        \n\t"\
> +        "nop                           \n\t"\

you can move an instruction into the nop slot; according to the docs it is
always executed if the annul bit is not set, so the
fpadd16 %%f26, %%f2, %%f26 from above would be a choice
this applies to all the other nops as well

> +        "fmul8ulx16 %%f4, %%f34, %%f48 \n\t"\
> +        "fmul8ulx16 %%f4, %%f42, %%f50 \n\t"\
> +        "fmul8ulx16 %%f6, %%f36, %%f52 \n\t"\
> +        "fmul8ulx16 %%f6, %%f44, %%f54 \n\t"\
> +        "fmul8ulx16 %%f6, %%f32, %%f56 \n\t"\
> +        "fmul8ulx16 %%f6, %%f40, %%f58 \n\t"\
> +\
> +        "fpadd16 %%f16, %%f48, %%f16 \n\t"\
> +        "fpadd16 %%f20, %%f50, %%f20 \n\t"\
> +        "fpsub16 %%f24, %%f50, %%f24 \n\t"\
> +        "fpsub16 %%f28, %%f48, %%f28 \n\t"\
> +        "fpadd16 %%f18, %%f52, %%f18 \n\t"\
> +        "fpsub16 %%f22, %%f54, %%f22 \n\t"\
> +        "fpsub16 %%f26, %%f56, %%f26 \n\t"\
> +        "fpsub16 %%f30, %%f58, %%f30 \n\t"\
> +\
> +        "fmul8sux16 %%f4, %%f34, %%f48 \n\t"\
> +        "fmul8sux16 %%f4, %%f42, %%f50 \n\t"\
> +        "fmul8sux16 %%f6, %%f36, %%f52 \n\t"\
> +        "fmul8sux16 %%f6, %%f44, %%f54 \n\t"\
> +        "fmul8sux16 %%f6, %%f32, %%f56 \n\t"\
> +        "fmul8sux16 %%f6, %%f40, %%f58 \n\t"\
> +\
> +        "fpadd16 %%f16, %%f48, %%f16 \n\t"\
> +        "fpadd16 %%f20, %%f50, %%f20 \n\t"\
> +        "fpsub16 %%f24, %%f50, %%f24 \n\t"\
> +        "fpsub16 %%f28, %%f48, %%f28 \n\t"\
> +        "fpadd16 %%f18, %%f52, %%f18 \n\t"\
> +        "fpsub16 %%f22, %%f54, %%f22 \n\t"\
> +        "fpsub16 %%f26, %%f56, %%f26 \n\t"\
> +        "fpsub16 %%f30, %%f58, %%f30 \n\t"\
> +\
> +        "fpadd16 %%f16, %%f4, %%f16  \n\t"\
> +        "fpsub16 %%f28, %%f4, %%f28  \n\t"\
> +        "fpadd16 %%f18, %%f6, %%f18  \n\t"\
> +        "fpsub16 %%f26, %%f6, %%f26  \n\t"\
> +        "fpsub16 %%f30, %%f6, %%f30  \n\t"\
> +    /* 3. column */\
> +        "3:                             \n\t"\
> +        "for %%f8, %%f10, %%f60         \n\t"\
> +        "fcmpd %%fcc0, %%f62, %%f60     \n\t"\

the for and fcmpd can similarly be moved up, you have to switch to fcc1 though
to avoid a conflict with the ones above
this applies to the other for/fcmpd pairs as well

[...]
> +        TRANSPOSE
> +        IDCT4ROWS
> +        SCALEROWS
> +        PUTPIXELSCLAMPED("0")
> +        LOAD("%2+64")
> +        TRANSPOSE
> +        IDCT4ROWS
> +        SCALEROWS
> +        PUTPIXELSCLAMPED("4")

the SCALEROWS is unneeded: the fpack16 can do the downshift, and a single
addition to the 0,0 coefficient before the idct (or to the first column after
the transpose) can compensate for the rounding difference
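
to illustrate the idea, here is a minimal C sketch. it is a toy model, not the
VIS code: SHIFT, N and all function names are made up for the example, and the
DC term is assumed to reach every output with unit weight (with a different
weight the constant added to the 0,0 coefficient would have to be adjusted).
it only shows that adding the rounding constant once to the DC term gives the
same pixels as adding it to every output in a separate scaling step, so the
pack step can be left with just a shift and a clamp.

```c
#include <assert.h>
#include <stdint.h>

#define SHIFT 4   /* hypothetical final downshift, not taken from the patch */
#define N     8   /* outputs per toy column */

/* floor(v / 2^s), written so it does not depend on >> of negative ints */
static int floor_shift(int v, int s)
{
    return v >= 0 ? v >> s : -((-v + (1 << s) - 1) >> s);
}

static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* variant 1: a separate scaling step adds the rounding constant per output */
static void scale_each(const int ac[N], int dc, uint8_t out[N])
{
    for (int i = 0; i < N; i++)
        out[i] = clamp_u8(floor_shift(dc + ac[i] + (1 << (SHIFT - 1)), SHIFT));
}

/* variant 2: the bias is added once to DC, the pack only shifts and clamps */
static void bias_in_dc(const int ac[N], int dc, uint8_t out[N])
{
    dc += 1 << (SHIFT - 1);
    for (int i = 0; i < N; i++)
        out[i] = clamp_u8(floor_shift(dc + ac[i], SHIFT));
}

/* returns 1 if both variants produce identical pixels over a range of DC */
static int variants_agree(void)
{
    static const int ac[N] = { 0, 37, -37, 500, -500, 4095, -4095, 7 };
    for (int dc = -2048; dc <= 2047; dc++) {
        uint8_t a[N], b[N];
        scale_each(ac, dc, a);
        bias_in_dc(ac, dc, b);
        for (int i = 0; i < N; i++)
            if (a[i] != b[i])
                return 0;
    }
    return 1;
}
```

in the toy model the saving is one addition per output versus one addition per
block; whether the same holds for the real code depends on how fpack16 is set
up there.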

[...]
> +        TRANSPOSE
> +        IDCT4ROWS
> +        SCALEROWS
> +        ADDPIXELSCLAMPED("0")
> +        LOAD("%2+64")
> +        TRANSPOSE
> +        IDCT4ROWS
> +        SCALEROWS
> +        ADDPIXELSCLAMPED("4")

same here, the SCALEROWS can be avoided by changing the shift used in fpack16
and the expansion value for the added pixels, as well as adding a bias with a
single instruction further above
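
a minimal C sketch of the add case, again a toy model with made-up names
(SHIFT, add_after_scale, add_before_scale) and not the VIS code: if the 8 bit
pixel is expanded to the same fixed-point scale as the unscaled idct output,
the add can happen before the downshift, and one shift plus clamp at the end
gives the same result as scaling first and adding afterwards. the residual
passed to the second variant is assumed to already contain the rounding bias.

```c
#include <assert.h>
#include <stdint.h>

#define SHIFT 4   /* hypothetical downshift, not taken from the patch */

/* floor(v / 2^s), written so it does not depend on >> of negative ints */
static int floor_shift(int v, int s)
{
    return v >= 0 ? v >> s : -((-v + (1 << s) - 1) >> s);
}

static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* variant 1: round and scale the residual first, then add the pixel */
static uint8_t add_after_scale(uint8_t pix, int resid)
{
    int r = floor_shift(resid + (1 << (SHIFT - 1)), SHIFT);
    return clamp_u8(pix + r);
}

/* variant 2: expand the pixel to the residual's scale, add, then do a
 * single shift and clamp; resid_biased already includes the rounding bias */
static uint8_t add_before_scale(uint8_t pix, int resid_biased)
{
    return clamp_u8(floor_shift(pix * (1 << SHIFT) + resid_biased, SHIFT));
}

/* returns 1 if both variants agree for every pixel and a range of residuals */
static int add_variants_agree(void)
{
    for (int pix = 0; pix <= 255; pix++)
        for (int resid = -8192; resid <= 8191; resid++) {
            int biased = resid + (1 << (SHIFT - 1));
            if (add_after_scale((uint8_t)pix, resid) !=
                add_before_scale((uint8_t)pix, biased))
                return 0;
        }
    return 1;
}
```

the two agree because the expanded pixel is an exact multiple of 2^SHIFT. the
sketch uses plain int; in 16 bit lanes the expanded pixel plus residual also
has to fit without overflow, which limits the usable shift.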

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data