[Ffmpeg-devel] [RFC] VC1 Transform in AltiVec

Michael Niedermayer michaelni
Tue Jul 18 12:05:58 CEST 2006


Hi

On Tue, Jul 18, 2006 at 06:46:23AM +0300, Kostya wrote:
> Here is my first attempt to optimize something with processor-specific instructions.
> A patch to vc1.c provided.
> 
> Please note that:
> a) It is AltiVec-only, so don't try to compile on x86 or machine without AltiVec support
> b) It's just a hack to demonstrate it works, in future this will go to ppc/vc1_altivec.c
> 
> TRANSPOSE8() macro was taken from ppc/mpegvideo_altivec.c
> 
> I'd like to hear from people who know this stuff if I took the right approach (and further
> suggestions of optimization).
> 
> MMX version will follow.

> --- vc1_svn.c	2006-07-16 07:47:53.000000000 +0300
> +++ vc1.c	2006-07-17 19:09:12.000000000 +0300
> @@ -716,6 +716,192 @@
>      return 0;
>  }
>  
> +#define TRANSPOSE8(a,b,c,d,e,f,g,h) \
> +do { \
> +    __typeof__(a)  _A1, _B1, _C1, _D1, _E1, _F1, _G1, _H1; \
> +    __typeof__(a)  _A2, _B2, _C2, _D2, _E2, _F2, _G2, _H2; \

identifiers beginning with _ followed by an uppercase letter (or a second _)
are reserved in C, so _A1, _B1, ... should be renamed
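
For illustration, a minimal sketch of a statement macro whose temporary stays out of the reserved namespace. The macro name SWAP_SHORT and the local swap_tmp_ are hypothetical and not part of the patch; the same renaming idea would apply to the temporaries inside TRANSPOSE8().

```c
/* Hypothetical helper, not from the patch: swap two short values.
 * The temporary ends with an underscore instead of starting with one,
 * because names like _A1 (underscore + uppercase letter) are reserved
 * for the implementation in C. */
#define SWAP_SHORT(a, b) \
do { \
    short swap_tmp_ = (a); \
    (a) = (b); \
    (b) = swap_tmp_; \
} while (0)
```

A trailing underscore (or a macro-specific prefix) still makes a clash with the caller's variables unlikely, which is presumably what the leading underscore was meant to achieve.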


[...]

> +static void vc1_8x8_altivec(DCTELEM block[64])
> +{
> +    vector signed short ssrc0, ssrc1, ssrc2, ssrc3, ssrc4, ssrc5, ssrc6, ssrc7;
> +    vector signed int s0, s1, s2, s3, s4, s5, s6, s7;
> +    vector signed int s8, s9, sA, sB, sC, sD, sE, sF;
> +    vector signed int t0, t1, t2, t3, t4, t5, t6, t7;
> +    const vector signed int vec_64 = {64, 64, 64, 64};
> +    const vector signed int vec_7 = {7, 7, 7, 7};
> +    const vector signed int vec_4 = {4, 4, 4, 4};
> +    const vector signed int vec_3 = {3, 3, 3, 3};
> +    const vector signed int vec_2 = {2, 2, 2, 2};
> +    const vector signed int vec_1 = {1, 1, 1, 1};
> +
> +    ssrc0 = vec_ld(  0, block);
> +    ssrc1 = vec_ld( 16, block);
> +    ssrc2 = vec_ld( 32, block);
> +    ssrc3 = vec_ld( 48, block);
> +    ssrc4 = vec_ld( 64, block);
> +    ssrc5 = vec_ld( 80, block);
> +    ssrc6 = vec_ld( 96, block);
> +    ssrc7 = vec_ld(112, block);
> +
> +    TRANSPOSE8(ssrc0, ssrc1, ssrc2, ssrc3, ssrc4, ssrc5, ssrc6, ssrc7);

the TRANSPOSE is unneeded: if the scantables are transposed, the coefficients
are already stored transposed in the block, which gets the same effect
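
A minimal sketch of that idea, assuming an 8x8 block stored row-major with indices 0..63. The function names here are hypothetical; the point is only that filling the block through a transposed scan table gives the same layout as filling it normally and transposing afterwards.

```c
#include <stdint.h>

/* Swap the row and column parts of every index in an 8x8 scan table. */
static void transpose_scantable(uint8_t dst[64], const uint8_t src[64])
{
    for (int i = 0; i < 64; i++)
        dst[i] = (uint8_t)(((src[i] & 7) << 3) | (src[i] >> 3));
}

/* Place coefficients, given in scan order, into an 8x8 block. */
static void place_coeffs(int16_t block[64], const int16_t coeffs[64],
                         const uint8_t scan[64])
{
    for (int i = 0; i < 64; i++)
        block[scan[i]] = coeffs[i];
}

/* Explicit transpose: the step the transposed table makes unnecessary. */
static void transpose_block(int16_t dst[64], const int16_t src[64])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[c * 8 + r] = src[r * 8 + c];
}
```

The table is transposed once at init time, so the per-block cost of the transpose disappears entirely.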


> +    s0 = vec_unpackl(ssrc0);
> +    s1 = vec_unpackl(ssrc1);
> +    s2 = vec_unpackl(ssrc2);
> +    s3 = vec_unpackl(ssrc3);
> +    s4 = vec_unpackl(ssrc4);
> +    s5 = vec_unpackl(ssrc5);
> +    s6 = vec_unpackl(ssrc6);
> +    s7 = vec_unpackl(ssrc7);
> +    s8 = vec_unpackh(ssrc0);
> +    s9 = vec_unpackh(ssrc1);
> +    sA = vec_unpackh(ssrc2);
> +    sB = vec_unpackh(ssrc3);
> +    sC = vec_unpackh(ssrc4);
> +    sD = vec_unpackh(ssrc5);
> +    sE = vec_unpackh(ssrc6);
> +    sF = vec_unpackh(ssrc7);
> +
> +    STEP8(s0, s1, s2, s3, s4, s5, s6, s7, vec_4);
> +    SHIFT_HOR(s0, s1, s2, s3, s4, s5, s6, s7);
> +    STEP8(s8, s9, sA, sB, sC, sD, sE, sF, vec_4);
> +    SHIFT_HOR(s8, s9, sA, sB, sC, sD, sE, sF);

the horizontal transform fits in 16 bits as is, so no unpack/pack is needed
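
A rough sanity check of the 16-bit claim, as a sketch only. It assumes the usual 8-point VC-1 matrix (coefficients 12, 16, 15, 9, 6, 4; not quoted in this mail) and a rounding term of 4 in the first pass (suggested by vec_4 in the patch). It computes a deliberately pessimistic bound over arbitrary inputs of a given magnitude. Which input range real streams actually produce is not stated here, so max_in is a hypothetical parameter.

```c
#include <stdint.h>

/* 8-point VC-1 inverse transform matrix (assumed, see lead-in). */
static const int vc1_t8[8][8] = {
    { 12,  12,  12,  12,  12,  12,  12,  12 },
    { 16,  15,   9,   4,  -4,  -9, -15, -16 },
    { 16,   6,  -6, -16, -16,  -6,   6,  16 },
    { 15,  -4, -16,  -9,   9,  16,   4, -15 },
    { 12, -12, -12,  12,  12, -12, -12,  12 },
    {  9, -16,   4,  15, -15,  -4,  16,  -9 },
    {  6, -16,  16,  -6,  -6,  16, -16,   6 },
    {  4,  -9,  15, -16,  16, -15,   9,  -4 },
};

/* Largest sum any first-pass output can reach before the shift,
 * if every input coefficient has magnitude at most max_in. */
static long worst_case_row_sum(long max_in)
{
    long worst = 0;
    for (int n = 0; n < 8; n++) {
        long gain = 0;
        for (int k = 0; k < 8; k++)
            gain += vc1_t8[k][n] < 0 ? -vc1_t8[k][n] : vc1_t8[k][n];
        if (gain * max_in + 4 > worst)   /* + 4: assumed rounding term */
            worst = gain * max_in + 4;
    }
    return worst;
}

/* Nonzero if that worst case still fits a signed 16-bit lane. */
static int row_pass_fits_int16(long max_in)
{
    return worst_case_row_sum(max_in) <= INT16_MAX;
}
```

Every column of the matrix has the same absolute sum, so the bound is the same for all eight outputs. Actual blocks are likely to stay well below it, since it requires every coefficient to be at the extreme with the worst possible signs.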


> +    ssrc0 = vec_pack(s8, s0);
> +    ssrc1 = vec_pack(s9, s1);
> +    ssrc2 = vec_pack(sA, s2);
> +    ssrc3 = vec_pack(sB, s3);
> +    ssrc4 = vec_pack(sC, s4);
> +    ssrc5 = vec_pack(sD, s5);
> +    ssrc6 = vec_pack(sE, s6);
> +    ssrc7 = vec_pack(sF, s7);
> +
> +    TRANSPOSE8(ssrc0, ssrc1, ssrc2, ssrc3, ssrc4, ssrc5, ssrc6, ssrc7);
> +    s0 = vec_unpackl(ssrc0);
> +    s1 = vec_unpackl(ssrc1);
> +    s2 = vec_unpackl(ssrc2);
> +    s3 = vec_unpackl(ssrc3);
> +    s4 = vec_unpackl(ssrc4);
> +    s5 = vec_unpackl(ssrc5);
> +    s6 = vec_unpackl(ssrc6);
> +    s7 = vec_unpackl(ssrc7);
> +    s8 = vec_unpackh(ssrc0);
> +    s9 = vec_unpackh(ssrc1);
> +    sA = vec_unpackh(ssrc2);
> +    sB = vec_unpackh(ssrc3);
> +    sC = vec_unpackh(ssrc4);
> +    sD = vec_unpackh(ssrc5);
> +    sE = vec_unpackh(ssrc6);
> +    sF = vec_unpackh(ssrc7);
> +    STEP8(s0, s1, s2, s3, s4, s5, s6, s7, vec_4);
> +    SHIFT_VERT(s0, s1, s2, s3, s4, s5, s6, s7);
> +    STEP8(s8, s9, sA, sB, sC, sD, sE, sF, vec_4);
> +    SHIFT_VERT(s8, s9, sA, sB, sC, sD, sE, sF);

the vertical transform can also be done in 16 bits, though it's a little trickier:

            t1 = 6 * (src[ 0] + src[32]);
            t2 = 6 * (src[ 0] - src[32]);
            t3 = 8 * src[16] +  3 * src[48];
            t4 = 3 * src[16] -  8 * src[48];

            t5 = t1 + t3;
            t6 = t2 + t4;
            t7 = t2 - t4;
            t8 = t1 - t3;

            t1 = (8 * src[ 8] + 8 * src[24] + 4 * src[40] + 2 * src[56]) + ((- src[24] + src[40])>>1);
            t2 = (8 * src[ 8] - 2 * src[24] - 8 * src[40] - 4 * src[56]) + ((- src[ 8] - src[56])>>1);
            t3 = (4 * src[ 8] - 8 * src[24] + 2 * src[40] + 8 * src[56]) + ((  src[ 8] - src[56])>>1);
            t4 = (2 * src[ 8] - 4 * src[24] + 8 * src[40] - 8 * src[56]) + ((- src[24] - src[40])>>1);

            dst[ 0] = (t5 + t1 + 32) >> 6;
            dst[ 8] = (t6 + t2 + 32) >> 6;
            dst[16] = (t7 + t3 + 32) >> 6;
            dst[24] = (t8 + t4 + 32) >> 6;
            dst[32] = (t8 - t4 + 32) >> 6;
            dst[40] = (t7 - t3 + 32) >> 6;
            dst[48] = (t6 - t2 + 32) >> 6;
            dst[56] = (t5 - t1 + 32) >> 6;

it's also interesting to note that Microsoft must be aware of this, given the
way rounding is done on the second half of the coefficients, but they
apparently don't mention it in the spec ... I am wondering what other stuff
they have hidden ...
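
To make the rounding remark concrete, here is a sketch comparing the halved column pass above with a full-precision form. The full-precision version (coefficients doubled, bias 64, an extra + 1 on the four lower outputs, shift by 7) is an assumption about the reference transform and is not quoted in this mail. Each odd sum in the full form equals twice the halved sum plus the bit dropped by the >> 1. In the upper four outputs that bit cannot change the result, and in the lower four it is subtracted, which the extra + 1 compensates for. Both functions use int so the comparison isolates the rounding, and src[8*k] is written as s[k].

```c
#include <stdint.h>

/* Full-precision column pass (assumed reference form, see lead-in).
 * Relies on >> of negative values being an arithmetic shift. */
static void vc1_col_ref(int dst[8], const int16_t s[8])
{
    int e0 = 12 * (s[0] + s[4]) + 64;
    int e1 = 12 * (s[0] - s[4]) + 64;
    int e2 = 16 * s[2] +  6 * s[6];
    int e3 =  6 * s[2] - 16 * s[6];
    int t5 = e0 + e2, t6 = e1 + e3, t7 = e1 - e3, t8 = e0 - e2;
    int o1 = 16 * s[1] + 15 * s[3] +  9 * s[5] +  4 * s[7];
    int o2 = 15 * s[1] -  4 * s[3] - 16 * s[5] -  9 * s[7];
    int o3 =  9 * s[1] - 16 * s[3] +  4 * s[5] + 15 * s[7];
    int o4 =  4 * s[1] -  9 * s[3] + 15 * s[5] - 16 * s[7];

    dst[0] = (t5 + o1) >> 7;
    dst[1] = (t6 + o2) >> 7;
    dst[2] = (t7 + o3) >> 7;
    dst[3] = (t8 + o4) >> 7;
    dst[4] = (t8 - o4 + 1) >> 7;   /* extra + 1 on the second half */
    dst[5] = (t7 - o3 + 1) >> 7;
    dst[6] = (t6 - o2 + 1) >> 7;
    dst[7] = (t5 - o1 + 1) >> 7;
}

/* Halved form from the mail: all coefficients divided by two,
 * with the half-integer ones (15/2, 9/2) built from a >> 1 term. */
static void vc1_col_half(int dst[8], const int16_t s[8])
{
    int t1 = 6 * (s[0] + s[4]);
    int t2 = 6 * (s[0] - s[4]);
    int t3 = 8 * s[2] + 3 * s[6];
    int t4 = 3 * s[2] - 8 * s[6];
    int t5 = t1 + t3, t6 = t2 + t4, t7 = t2 - t4, t8 = t1 - t3;
    int o1 = (8 * s[1] + 8 * s[3] + 4 * s[5] + 2 * s[7]) + ((-s[3] + s[5]) >> 1);
    int o2 = (8 * s[1] - 2 * s[3] - 8 * s[5] - 4 * s[7]) + ((-s[1] - s[7]) >> 1);
    int o3 = (4 * s[1] - 8 * s[3] + 2 * s[5] + 8 * s[7]) + (( s[1] - s[7]) >> 1);
    int o4 = (2 * s[1] - 4 * s[3] + 8 * s[5] - 8 * s[7]) + ((-s[3] - s[5]) >> 1);

    dst[0] = (t5 + o1 + 32) >> 6;
    dst[1] = (t6 + o2 + 32) >> 6;
    dst[2] = (t7 + o3 + 32) >> 6;
    dst[3] = (t8 + o4 + 32) >> 6;
    dst[4] = (t8 - o4 + 32) >> 6;
    dst[5] = (t7 - o3 + 32) >> 6;
    dst[6] = (t6 - o2 + 32) >> 6;
    dst[7] = (t5 - o1 + 32) >> 6;
}
```

Without the + 1 in the full-precision form, the lower four outputs could differ by one from the halved form whenever the dropped bit is set.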

and the + 32 can be added to t1/t2 instead of at the end
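
A sketch of that last point, using the halved column pass with src[8*k] written as s[k]. The fold_bias flag is hypothetical and exists only so both variants can be compared. Every output contains exactly one of the even-part t1 or t2 with a positive sign, so adding 32 there once reaches all eight outputs with two additions instead of eight.

```c
#include <stdint.h>

/* Halved column pass; fold_bias selects where the rounding bias is added.
 * Relies on >> of negative values being an arithmetic shift. */
static void vc1_col_half_bias(int dst[8], const int16_t s[8], int fold_bias)
{
    int early = fold_bias ? 32 : 0;   /* bias folded into t1/t2 */
    int late  = fold_bias ? 0 : 32;   /* bias added to every output */

    int t1 = 6 * (s[0] + s[4]) + early;
    int t2 = 6 * (s[0] - s[4]) + early;
    int t3 = 8 * s[2] + 3 * s[6];
    int t4 = 3 * s[2] - 8 * s[6];
    int t5 = t1 + t3, t6 = t2 + t4, t7 = t2 - t4, t8 = t1 - t3;
    int o1 = (8 * s[1] + 8 * s[3] + 4 * s[5] + 2 * s[7]) + ((-s[3] + s[5]) >> 1);
    int o2 = (8 * s[1] - 2 * s[3] - 8 * s[5] - 4 * s[7]) + ((-s[1] - s[7]) >> 1);
    int o3 = (4 * s[1] - 8 * s[3] + 2 * s[5] + 8 * s[7]) + (( s[1] - s[7]) >> 1);
    int o4 = (2 * s[1] - 4 * s[3] + 8 * s[5] - 8 * s[7]) + ((-s[3] - s[5]) >> 1);

    dst[0] = (t5 + o1 + late) >> 6;
    dst[1] = (t6 + o2 + late) >> 6;
    dst[2] = (t7 + o3 + late) >> 6;
    dst[3] = (t8 + o4 + late) >> 6;
    dst[4] = (t8 - o4 + late) >> 6;
    dst[5] = (t7 - o3 + late) >> 6;
    dst[6] = (t6 - o2 + late) >> 6;
    dst[7] = (t5 - o1 + late) >> 6;
}
```

In a vector version the same folding would apply per lane, with the bias added to the two even-part vectors once.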

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In the past you could go to a library and read, borrow or copy any book
Today you'd get arrested for mere telling someone where the library is



