[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations
Victor Pollex
victor.pollex
Thu Jul 31 14:50:44 CEST 2008
Michael Niedermayer wrote:
[...]
>
>> @@ -467,7 +469,256 @@
>> DECLARE_FUNCTION(3, 2)
>> DECLARE_FUNCTION(3, 3)
>>
>> +static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
>> +{
>> + DECLARE_ALIGNED_16(int16_t, temp[64]);
>> + asm volatile(
>> + LOAD4(q,0x10,0x00(%0),%%mm5,%%mm1,%%mm0,%%mm3)
>> + TRANSPOSE4(%%mm5,%%mm1,%%mm0,%%mm3,%%mm4)
>> + STORE4(q,0x10,0x00(%0),%%mm5,%%mm3,%%mm4,%%mm0)
>> +
>> + LOAD4(q,0x10,0x08(%0),%%mm6,%%mm5,%%mm7,%%mm1)
>> + TRANSPOSE4(%%mm6,%%mm5,%%mm7,%%mm1,%%mm2)
>> + STORE4(q,0x10,0x08(%0),%%mm6,%%mm1,%%mm2,%%mm7)
>>
>
> it is still transposing the data at the beginning of the functions.
> I thought you transposed the scantables ...
>
I did transpose the scantables, except the one for the 8x8
transformation: it is used in several places, and a lot more code would
have to be changed to accommodate the scantable change.
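
To illustrate why a transposed scantable makes the transpose at the start of
the transform unnecessary, here is a minimal sketch in plain C. It is not the
code from the patch: the helper names are hypothetical, and it assumes a scan
table maps the i-th decoded coefficient to a raster position (row * 8 + col).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not FFmpeg code): build a transposed copy of an
 * 8x8 scan table by swapping the row and column of every entry. */
static void transpose_scantable_8x8(uint8_t dst[64], const uint8_t src[64])
{
    for (int i = 0; i < 64; i++) {
        int row = src[i] >> 3;
        int col = src[i] & 7;
        dst[i] = (uint8_t)(col * 8 + row);
    }
}

/* Store decoded coefficients into the block through a scan table. */
static void scatter_coeffs(int16_t block[64], const int16_t coeffs[64],
                           const uint8_t scan[64])
{
    for (int i = 0; i < 64; i++)
        block[scan[i]] = coeffs[i];
}

/* Explicit in-place transpose: the step a transposed scan table avoids. */
static void transpose_block_8x8(int16_t block[64])
{
    for (int r = 0; r < 8; r++) {
        for (int c = r + 1; c < 8; c++) {
            int16_t t        = block[r * 8 + c];
            block[r * 8 + c] = block[c * 8 + r];
            block[c * 8 + r] = t;
        }
    }
}
```

Scattering through the transposed table yields the same block as scattering
through the original table and then transposing, so the transform can start
directly on the stored data.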
>
> [...]
>
>> + :
>> + : "r"(block), "m"(temp[0])
>> + : "memory"
>> + );
>> +
>> + asm volatile(
>>
>
> why is this asm () block split?
>
For some stupid reason my gcc adds a "push ebx" and a "pop ebx" at the
start and the end of the function if I use more than 3 general purpose
registers in an asm block. I'm using gcc 4.3.1. Is this some sort of bug,
perhaps a known one?
>
> [...]
>
>> + STORE4(q,0x10,0x40%1,%%mm4,%%mm7,%%mm0,%%mm6)
>> + :
>> + : "r"(block), "m"(temp[0]), "m"(ff_pw_4)
>> + : "memory"
>> + );
>> +
>> + asm volatile(
>> + "movq 0x30%3, %%mm1\n\t" /* b[3] */
>> + TRANSFORM_4X8_COL_H1
>> + (
>> + q,q,
>> + 0x00%3,0x10%3,0x20%3,0x40%3,0x70%3,
>>
>
> this store and later load seems redundant
>
I need them later, in the second half of the 4x8 column transformation.
For the first half I need b[3], b[5] and b[6], of which only b[5] and
b[6] are already in registers, so I have to load b[3]. By the time I
need any further data I have used all of the remaining registers, so I
have to load that data as well.
> and the asm should not be split
>
See above.
>
> [...]
>
>> + STORE4(dqa,0x10,0x00(%0),%%xmm0,%%xmm5,%%xmm7,%%xmm3)
>> + STORE4(dqa,0x10,0x40(%0),%%xmm6,%%xmm4,%%xmm2,%%xmm1)
>> + TRANSFORM_8X4_ROW_H1
>> + (
>> + dqa,dqa,
>> + 0x00(%0),0x20(%0),0x40(%0),0x70(%0),
>>
>
> some of these stores and loads seem redundant
>
I need them for the second half of the 8x4 row transformation, and again,
by the time I need any further data I have used all the remaining registers.
>
> [...]
>
>> +void ff_vc1dsp_init_sse2(DSPContext* dsp, AVCodecContext *avctx) {
>> + if(!(mm_flags & MM_SSE2))
>> + return;
>> +
>> + dsp->vc1_inv_trans_8x8 = vc1_inv_trans_8x8_sse2;
>> + dsp->vc1_inv_trans_4x8 = vc1_inv_trans_4x8_sse2;
>> + dsp->vc1_inv_trans_8x4 = vc1_inv_trans_8x4_sse2;
>> +}
>>
>
> are all of the SSE2 variants faster than mmx?
>
For me the 8x8 sse2 variant is faster than the mmx one, but as I
mentioned in an earlier post, the 4x8 isn't, and the 8x4 is only a bit
faster. That is why I asked if someone else could benchmark them, to see
whether they behave like that only for me.
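
If the benchmarks confirm this, one option would be to install only the sse2
variants that actually win. The following is a rough sketch under that
assumption, not the init code from the patch: the flag values, the table type
and the stub functions are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU feature flags, for illustration only. */
#define CPU_MMX  0x1
#define CPU_SSE2 0x2

typedef void (*inv_trans_fn)(int16_t *block);

/* Hypothetical function table standing in for the DSP context. */
typedef struct {
    inv_trans_fn inv_trans_8x8;
    inv_trans_fn inv_trans_4x8;
    inv_trans_fn inv_trans_8x4;
} TransformTable;

/* Stubs standing in for the real transforms; each only tags the block. */
static void inv_trans_8x8_mmx(int16_t *b)  { b[0] = 1; }
static void inv_trans_4x8_mmx(int16_t *b)  { b[0] = 2; }
static void inv_trans_8x4_mmx(int16_t *b)  { b[0] = 3; }
static void inv_trans_8x8_sse2(int16_t *b) { b[0] = 4; }

static void transform_init(TransformTable *t, int cpu_flags)
{
    if (cpu_flags & CPU_MMX) {
        t->inv_trans_8x8 = inv_trans_8x8_mmx;
        t->inv_trans_4x8 = inv_trans_4x8_mmx;
        t->inv_trans_8x4 = inv_trans_8x4_mmx;
    }
    if (cpu_flags & CPU_SSE2) {
        /* Override only the size where sse2 was measurably faster;
         * 4x8 and 8x4 keep the mmx versions until more benchmarks
         * are available. */
        t->inv_trans_8x8 = inv_trans_8x8_sse2;
    }
}
```

Which sizes get overridden here is an assumption based on my own numbers and
would have to follow whatever the benchmarks on other machines show.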