[FFmpeg-devel] [ping] [PATCH] mmx implementation of vc-1 inverse transformations

Ronald S. Bultje rsbultje
Wed Oct 6 16:08:26 CEST 2010


Hi,

As a start: I agree with Jason that I would prefer new code like this to
be written in yasm, but since I didn't write it, I won't object.

2010/9/30 Yuriy Kaminskiy <yumkam at mail.ru>:
[..]
> Index: libavcodec/vc1dec.c
> @@ -2086,7 +2086,7 @@ static int vc1_decode_p_block(VC1Context
>          for(j = 0; j < 2; j++) {
>              last = subblkpat & (1 << (1 - j));
>              i = 0;
> -            off = j * 32;
> +            off = s->dsp.vc1_inv_trans_8x4_transposed ? j * 4 : j * 32;
>              while (!last) {
>                  vc1_decode_ac_coeff(v, &last, &skip, &value, v->codingset2);
>                  i += skip;

I would prefer that we not add ad-hoc fields to DSPContext; that will
quickly get out of hand. Better to check (as h264 does) whether the
function pointer is still the C function (which then needs to be exported
in a header) and select the behaviour based on that.
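
Untested sketch of what I mean (ff_vc1_inv_trans_8x4_c is whatever name
you end up exporting the C version under, and I'm quoting the prototype
from memory, so double-check it):

    /* in a header, e.g. vc1dsp.h: export the C transform */
    void ff_vc1_inv_trans_8x4_c(uint8_t *dest, int linesize, DCTELEM *block);

    /* in vc1_decode_p_block(): pick the coefficient offset depending on
     * whether the (non-transposed) C version is still in use */
    if (s->dsp.vc1_inv_trans_8x4 == ff_vc1_inv_trans_8x4_c)
        off = j * 32;
    else
        off = j * 4;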

> -#define LOAD4(stride,in,a,b,c,d)\
> -    "movq 0*"#stride"+"#in", "#a"\n\t"\
> -    "movq 1*"#stride"+"#in", "#b"\n\t"\
> -    "movq 2*"#stride"+"#in", "#c"\n\t"\
> -    "movq 3*"#stride"+"#in", "#d"\n\t"
> +#define LOAD4(m,stride,in,a,b,c,d)\
> +    "mov"#m" 0*"#stride"+"#in", "#a"\n\t"\
> +    "mov"#m" 1*"#stride"+"#in", "#b"\n\t"\
> +    "mov"#m" 2*"#stride"+"#in", "#c"\n\t"\
> +    "mov"#m" 3*"#stride"+"#in", "#d"\n\t"

This is one of those cases where, yes, it saves space, but you have to
change a lot of code for it and it doesn't really become more readable,
so I wouldn't mind if you just wrote out the SSE macros as separate
macros. That is, if you want this kept in dsputil_mmx.h.
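
For reference, the written-out SSE2 variant would just be something like
this (untested; LOAD4_SSE2 is only a suggested name, and movdqa assumes
aligned loads, use movdqu otherwise):

    #define LOAD4_SSE2(stride,in,a,b,c,d)\
        "movdqa 0*"#stride"+"#in", "#a"\n\t"\
        "movdqa 1*"#stride"+"#in", "#b"\n\t"\
        "movdqa 2*"#stride"+"#in", "#c"\n\t"\
        "movdqa 3*"#stride"+"#in", "#d"\n\t"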

Also, should VC1-specific macros go in dsputil_mmx.h at all, or in a
VC1-specific file (e.g. the file where they are used)? LOAD4/STORE4 are
otherwise unused. If they move there, I don't mind the above macro,
since it would no longer sit in a ubiquitously-included header.

> +/*
> +    precodition:

Precondition. Please run a spell-checker; I don't intend to nitpick too
much, but this sort of thing you can easily catch yourself.

> +#define OPC_SS_AB(opc, src, dst0, dst1)\
> +    #opc" "#src","#dst0"\n\t"\
> +    #opc" "#src","#dst1"\n\t"
> +
> +#define OPC_SSSS_ABCD(opc, src, dst0, dst1, dst2, dst3)\
> +    OPC_SS_AB(opc, src, dst0, dst1)\
> +    OPC_SS_AB(opc, src, dst2, dst3)

I think this borders on obfuscation. It doesn't really hurt in any way,
but it hides the actual operations behind an abstraction layer, at
essentially no gain: there is very little chance that we would write
slower code without this macro. I'd prefer it removed and the resulting
code written out fully; it's not that much longer.
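
That is, instead of e.g. OPC_SS_AB(psraw,$1,r5,r7), the caller would just
contain what it expands to:

    "psraw $1, "#r5"\n\t"\
    "psraw $1, "#r7"\n\t"\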

> +#define TRANSFORM_4X4_COMMON(m,r0,r1,r2,r3,r4,r5,r6,r7,c)\

> +    "mov"#m"  "#c", "#r5"\n\t" /* c */\
> +    "mov"#m" "#r5", "#r7"\n\t" /* c */\
> +    SUMSUB_BA(r2,r0)\
> +    "paddw   "#r0", "#r5"\n\t" /* r0 - r2 + c */\

Note the dependency chain on r5: putting the second mov#m below the
SUMSUB_BA should be faster.
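
Something like this (SUMSUB_BA only touches r2/r0 here, so moving the
mov past it doesn't change the result):

    "mov"#m"  "#c", "#r5"\n\t" /* c */\
    SUMSUB_BA(r2,r0)\
    "mov"#m" "#r5", "#r7"\n\t" /* c */\
    "paddw   "#r0", "#r5"\n\t" /* r0 - r2 + c */\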

> +    "paddw   "#r2", "#r7"\n\t" /* r0 + r2 + c */\
> +    OPC_SS_AB(psraw,$1,r5,r7)\
> +    "mov"#m" "#r1", "#r4"\n\t" /* r1 */\
> +    "mov"#m" "#r3", "#r6"\n\t" /* r3 */\

> +    "paddw   "#r1", "#r1"\n\t" /* 2 * r1 */\
> +    "paddw   "#r6", "#r1"\n\t" /* 2 * r1 + r3 */\
> +    "paddw   "#r3", "#r3"\n\t" /* 2 * r3 */\
> +    "psubw   "#r4", "#r3"\n\t" /* 2 * r3 - r1 */\

These two can be interleaved.
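
i.e. something like:

    "paddw   "#r1", "#r1"\n\t" /* 2 * r1 */\
    "paddw   "#r3", "#r3"\n\t" /* 2 * r3 */\
    "paddw   "#r6", "#r1"\n\t" /* 2 * r1 + r3 */\
    "psubw   "#r4", "#r3"\n\t" /* 2 * r3 - r1 */\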

> +    "paddw   "#r1", "#r4"\n\t" /* 3 * r1 + r3 */\
> +    "paddw   "#r3", "#r6"\n\t" /* 3 * r3 - r1 */\

> +    SUMSUB_BA(r4,r7)\
> +    SUMSUB_BA(r6,r5)\

These two as well; feel free to add a SUMSUB_BAx2 macro for it. (Manual
interleaving mostly matters on in-order CPUs; out-of-order cores schedule
around this themselves, so depending on your CPU you may not see any
performance change.)
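
I don't remember the exact shape of our inline SUMSUB_BA off-hand, but
assuming it is the usual paddw/paddw/psubw sequence, the x2 version would
be along these lines (interleaved):

    #define SUMSUB_BAx2(a, b, c, d)\
        "paddw "#b", "#a"\n\t"\
        "paddw "#d", "#c"\n\t"\
        "paddw "#b", "#b"\n\t"\
        "paddw "#d", "#d"\n\t"\
        "psubw "#a", "#b"\n\t"\
        "psubw "#c", "#d"\n\t"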

> +    OPC_SSSS_ABCD(psraw,$2,r4,r7,r5,r6)\
> +    "paddw   "#r1", "#r4"\n\t" /* ((((r0 + r2 + c) >> 1) + (3 * r1 + r3)) >> 2) + (2 * r1 + r3) */\
> +    "psubw   "#r3", "#r5"\n\t" /* ((((r0 - r2 + c) >> 1) - (3 * r3 - r1)) >> 2) - (2 * r3 - r1) */\
> +    "paddw   "#r3", "#r6"\n\t" /* ((((r0 - r2 + c) >> 1) + (3 * r3 - r1)) >> 2) + (2 * r3 - r1) */\
> +    "psubw   "#r1", "#r7"\n\t" /* ((((r0 + r2 + c) >> 1) - (3 * r1 + r3)) >> 2) - (2 * r1 + r3) */
[..]
> +#define TRANSFORM_8X4_ROW_COMMON(m,r0,r1,r2,r4,r5,r6,r7)\
> +    "mov"#m" "#r2", "#r4"\n\t" /* r6 */\
> +    "paddw   "#r4", "#r4"\n\t" /* 2 * r6 */\
> +    "paddw   "#r4", "#r2"\n\t" /* 3 * r6 */\
> +    "mov"#m" "#r0", "#r6"\n\t" /* r0 */\
> +    "paddw   "#r1", "#r6"\n\t" /* r0 + r1 */\

Again, the bottom two are independent of the first three: interleave them
as in the example above.

> +    "mov"#m" "#r1", "#r4"\n\t" /* r1 */\
> +    "psubw   "#r0", "#r4"\n\t" /* r1 - r0 */\
> +    "paddw   "#r6", "#r0"\n\t" /* 2 * r0 + r1 */\
> +    "paddw   "#r4", "#r1"\n\t" /* 2 * r1 - r0 */\

> +    "pxor    "#r5", "#r5"\n\t" /* 0 */\
> +    "mov"#m" "#r5", "#r7"\n\t" /* 0 */\

I'm quite sure pxor m7, m7 would be faster since then there's no
dependency chain.
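
i.e.:

    "pxor    "#r5", "#r5"\n\t" /* 0 */\
    "pxor    "#r7", "#r7"\n\t" /* 0 */\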

> +    "psubw   "#r6", "#r5"\n\t" /* -(r1 + r0) */\
> +    "psubw   "#r4", "#r7"\n\t" /* -(r1 - r0) */\
> +    OPC_SSSS_ABCD(psraw,$1,r4,r5,r6,r7)
[..]

> +#define TRANSFORM_4X8_COL_COMMON(m,r0,r1,r2,r4,r5,r6,r7,t,c)\
> +    "pcmpeqw  "#t",  "#t"\n\t" /* -1 */\
> +    "mov"#m" "#r2", "#r4"\n\t" /* r2 */\
> +    "paddw   "#r4", "#r4"\n\t" /* 2 * r2 */\
> +    "paddw   "#r4", "#r2"\n\t" /* 3 * r2 */\
> +    "mov"#m" "#r0", "#r6"\n\t" /* r0 */\
> +    "paddw   "#r1", "#r6"\n\t" /* r0 + r1 */\

Interleave (as per above).

> +    "mov"#m" "#r1", "#r4"\n\t" /* r1 */\
> +    "psubw   "#r0", "#r4"\n\t" /* r1 - r0 */\
> +    "paddw   "#r6", "#r0"\n\t" /* 2 * r0 + r1 */\
> +    "paddw   "#r4", "#r1"\n\t" /* 2 * r1 - r0 */\
> +    "mov"#m"  "#c", "#r5"\n\t" /* c */\
> +    "mov"#m" "#r5", "#r7"\n\t" /* c */\
> +    SUMSUB_BA(r6,r5)\
> +    SUMSUB_BA(r4,r7)\

SUMSUB_BAx2

> +    OPC_SS_AB(psubw,t,r6,r7)\
> +    OPC_SSSS_ABCD(psraw,$1,r4,r5,r6,r7)
[..]

> +#define LOAD_ADD_CLAMP_STORE1(m,io,t,z,r)\
> +    "mov"#m"   "#io",  "#t"\n\t"\
> +    "punpcklbw  "#z",  "#t"\n\t"\
> +    "paddw      "#t",  "#r"\n\t"\
> +    "packuswb   "#z",  "#r"\n\t"\
> +    "mov"#m"    "#r", "#io"\n\t"

I could swear we had an inline STORE_DIFF macro somewhere. Please at
least rename this to STORE_DIFF. Maybe this is a duplicate of
ff_put_pixels_clamped_mmx() (if you simply "call" it, gcc will inline it,
i.e. no performance penalty but smaller source).

> +#define LOAD_ADD_CLAMP_STORE4(m,stride,io,t,z,r0,r1,r2,r3)\
> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r0)\
> +    "add "#stride", "#io"\n\t"\
> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r1)\
> +    "add "#stride", "#io"\n\t"\
> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r2)\
> +    "add "#stride", "#io"\n\t"\
> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r3)

If somehow the above doesn't work, then please at least use a
STORE_DIFFx2, as in x86util.asm (yes, that's yasm, but the logic is the
same). Again, that's interleaving, and it also saves you two of the three
add-stride instructions.
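
Roughly like this (untested; it needs a second temporary register, and it
assumes stride is passed in a general-purpose register so it can go into
the addressing mode):

    #define LOAD_ADD_CLAMP_STORE2(m,stride,io,t0,t1,z,r0,r1)\
        "mov"#m"            ("#io"), "#t0"\n\t"\
        "mov"#m" ("#io","#stride"), "#t1"\n\t"\
        "punpcklbw  "#z", "#t0"\n\t"\
        "punpcklbw  "#z", "#t1"\n\t"\
        "paddw     "#t0", "#r0"\n\t"\
        "paddw     "#t1", "#r1"\n\t"\
        "packuswb   "#z", "#r0"\n\t"\
        "packuswb   "#z", "#r1"\n\t"\
        "mov"#m" "#r0",            ("#io")\n\t"\
        "mov"#m" "#r1", ("#io","#stride")\n\t"

A single "lea ("#io","#stride",2), "#io"\n\t" between the two invocations
then replaces the three adds.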

Please rename the file to vc1dsp_mmx.h; there's nothing i386-generic
about these functions, and that name is in line with the rest (e.g.
dsputil_mmx.h).

> +static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
> +{
> +    DECLARE_ALIGNED(16, int16_t, temp)[64];
> +    __asm__ volatile(
> +    LOAD4(q,0x10,0x00(%0),%%mm5,%%mm1,%%mm0,%%mm3)

Please fix the indenting here; as written, this is barely readable.

> +    TRANSPOSE4(%%mm5,%%mm1,%%mm0,%%mm3,%%mm4)
> +    STORE4(q,0x10,0x00(%0),%%mm5,%%mm3,%%mm4,%%mm0)
> +
> +    LOAD4(q,0x10,0x08(%0),%%mm6,%%mm5,%%mm7,%%mm1)
> +    TRANSPOSE4(%%mm6,%%mm5,%%mm7,%%mm1,%%mm2)
> +    STORE4(q,0x10,0x08(%0),%%mm6,%%mm1,%%mm2,%%mm7)

> +    TRANSFORM_8X4_ROW_H1
> +    (
> +        q,q,
> +        0x00(%0),0x20(%0),0x08(%0),0x38(%0),
> +        %%mm0,%%mm1,%%mm2,%%mm3,%%mm4,%%mm5,%%mm6,%%mm7
> +    )

This is even worse; it is absolutely unreadable. :-) Same for all the
other functions in this file.

This function does the same operation four times; can you turn that into
a loop without slowing it down significantly? (gcc should unroll it by
itself, but the source becomes 4x smaller.)
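
Purely as an illustration of the structure (vc1_inv_trans_8x8_1pass_mmx
is a hypothetical forced-inline helper holding one instance of the block
that is currently written out four times, and the 16 * i offset is a
guess at the layout):

    static av_always_inline void vc1_inv_trans_8x8_1pass_mmx(DCTELEM *b)
    {
        __asm__ volatile(
            /* LOAD4 + TRANSPOSE4 + transform + STORE4 for one sub-block */
            "" :: "r"(b) : "memory"
        );
    }

    static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
    {
        int i;
        for (i = 0; i < 4; i++)
            vc1_inv_trans_8x8_1pass_mmx(block + 16 * i);
    }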

Same comments apply to the SSE2 version. Please fix the indenting there
too, and do the proper thing for SSE2 vs. SSE2SLOW; if you have a Core 1
or similar, we can then test whether SSE2_SLOW is actually faster than
MMX2.

For SSE2, please mark xmm6 and xmm7 as clobbered (see
doc/optimization.txt); otherwise it breaks on Win64.
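
i.e. list them in the clobber list of each SSE2 asm block, something like
this (operands abbreviated):

    __asm__ volatile(
        /* ... transform body ... */
        :: "r"(block)
        : "%xmm6", "%xmm7", "memory"
    );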

The getenv hack shouldn't be needed.

Once you fix all the above, let's do a second round of review.

Ronald


