[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Alexander Strange astrange
Fri Apr 11 00:42:40 CEST 2008


On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>> This adds skal's sse2 idct and uses it as the xvid idct when  
>> available.
>>
>> I merged two shuffles into the permutation and changed the
>> zero-skipping somewhat - it's fastest in MMX and not really worth
>> doing for the first three rows. Their right halves are still usually
>> all zero, but adding a branch to check for that is a net loss. The
>> best thing for speed would be switching IDCTs based on the position
>> of the last nonzero coefficient, but that's something for later.
>>
>> xvididctheader - adds a new header so I don't add any more extern
>> declarations in .c files.
>> sse2-permute - the new permutation; it might not have a specific
>> enough name, but it should work just as well for simpleidct as for
>> this one if I can get back to that.
>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>
>> The URLs in the header (copied from idct_mmx_xvid and the original
>> nasm source) are broken at the moment, but the archive.org URLs are
>> longer than 80 characters, so I left them as they are.
>>
>> skal agreed it could be under LGPL in the last thread.
> [...]
>> #define SKIP_ROW_CHECK(src)                 \
>>    "movq     "src", %%mm0            \n\t" \
>>    "por    8+"src", %%mm0            \n\t" \
>>    "packssdw %%mm0, %%mm0            \n\t" \
>>    "movd     %%mm0, %%eax            \n\t" \
>>    "testl    %%eax, %%eax            \n\t" \
>>    "jz 1f                            \n\t"
>
> You could try to check pairs of rows; this might be faster for some
> rows. Also, the code should be interleaved rather than forming such
> nasty dependency chains - you do have enough MMX registers.

I think the movq breaks the dependency chain, at least on my CPU. But
moving stuff above the branch is a good idea - I changed it to check two
rows at once for rows 3-6 and to use MMX pmovmskb.
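
Roughly, the idea in C intrinsics (illustration only - the helper name
and exact layout are made up, the patch does it in inline asm):

#include <stdint.h>
#include <mmintrin.h>   /* MMX */
#include <xmmintrin.h>  /* _mm_movemask_pi8, i.e. pmovmskb on __m64 */

/* Nonzero if both 8-coefficient rows are entirely zero, so the column
 * pass can be skipped for the pair.  Caller still has to execute emms. */
static inline int row_pair_is_zero(const int16_t *row_a, const int16_t *row_b)
{
    const __m64 *a = (const __m64 *)row_a;
    const __m64 *b = (const __m64 *)row_b;
    /* OR the four 64-bit halves of the two rows together ... */
    __m64 acc = _mm_or_si64(_mm_or_si64(a[0], a[1]),
                            _mm_or_si64(b[0], b[1]));
    /* ... then compare bytewise against zero and gather the sign bits;
     * a mask of 0xFF means every byte was zero. */
    int mask = _mm_movemask_pi8(_mm_cmpeq_pi8(acc, _mm_setzero_si64()));
    return mask == 0xFF;
}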

>> #define iMTX_MULT(src, table, rounder)      \
>>    "movdqa   "src", %%xmm0         \n\t"   \
>
>>    "pshufd      $0, %%xmm0, %%xmm4 \n\t"   \
>>    "pshufd   $0x55, %%xmm0, %%xmm6 \n\t"   \
>>    "pshufd   $0xAA, %%xmm0, %%xmm5 \n\t"   \
>>    "pshufd   $0xFF, %%xmm0, %%xmm7 \n\t"   \
>
> You can replace 2 of the pshufd by 1 movdqa, 1 punpcklqdq and
> 1 punpckhqdq; considering that pshufd seems to be slower, this
> _could_ be faster. Here are my notes about it:
> 02461357
> 02461357 mov
> 02460246 unpck
> 13571357 unpck
> 46024602 shufld
> 57135713 shufld

Done, it seems a bit faster.
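
The two variants in SSE2 intrinsics, for clarity (illustration only -
the function names are made up, and the pmaddwd tables just have to be
laid out to match whichever lane order comes out):

#include <emmintrin.h>

/* (a) four pshufd broadcasts, as in the posted iMTX_MULT */
static inline void fan_out_pshufd(__m128i row, __m128i out[4])
{
    out[0] = _mm_shuffle_epi32(row, 0x00);
    out[1] = _mm_shuffle_epi32(row, 0x55);
    out[2] = _mm_shuffle_epi32(row, 0xAA);
    out[3] = _mm_shuffle_epi32(row, 0xFF);
}

/* (b) unpck + two shuffles, matching the lane layouts in the notes
 * above; row holds the words 0 2 4 6 1 3 5 7 after the permutation
 * (the movdqa copy from the notes is left to the compiler here) */
static inline void fan_out_unpck(__m128i row, __m128i out[4])
{
    __m128i lo = _mm_unpacklo_epi64(row, row);  /* 0246 0246 */
    __m128i hi = _mm_unpackhi_epi64(row, row);  /* 1357 1357 */
    out[0] = lo;
    out[1] = hi;
    out[2] = _mm_shuffle_epi32(lo, 0xB1);       /* 4602 4602 */
    out[3] = _mm_shuffle_epi32(hi, 0xB1);       /* 5713 5713 */
}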

> [...]
>> #define iLLM_PASS(dct)                      \
>>    "movdqa   "MANGLE(tan3)", %%xmm0  \n\t" \
>>    "movdqa      3*16("dct"), %%xmm3  \n\t" \
>>    "movdqa           %%xmm0, %%xmm1  \n\t" \
>>    "movdqa      5*16("dct"), %%xmm5  \n\t" \
>>    "movdqa   "MANGLE(tan1)", %%xmm4  \n\t" \
>>    "movdqa        16("dct"), %%xmm6  \n\t" \
>>    "movdqa      7*16("dct"), %%xmm7  \n\t" \
>
> If I didn't miscalculate, then you can keep 4 of the above in
> registers from the row transform (and all 8 DCT values for x86_64).

Done (with some macroing). This leaves a few (2-4, maybe) unnecessary
movdqa instructions on x86_64, but I don't want to make it unreadably
magical. I changed iMTX_MULT to use xmm0-3 instead of xmm4-7 since I
thought it looked better.
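
The shape of that macroing, very roughly (hypothetical sketch - the
names, the register assignment and the %0 operand are made up here, not
necessarily what the attached file does):

#ifdef ARCH_X86_64
/* the row pass left this row alive in a high register, so "loading" it
 * is just a register move (or nothing, if the consumer reads it there) */
#define iLLM_LOAD(dst, src_reg, n)  "movdqa  " src_reg ", " dst "    \n\t"
#else
/* only xmm0-7 available: reload the row from the block (%0) in memory */
#define iLLM_LOAD(dst, src_reg, n)  "movdqa  " #n "*16(%0), " dst "  \n\t"
#endif

/* e.g. iLLM_LOAD("%%xmm3", "%%xmm11", 3) expands to
 *   "movdqa  %%xmm11, %%xmm3    \n\t"   on x86_64, or
 *   "movdqa  3*16(%0), %%xmm3   \n\t"   on 32-bit x86 */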

>
> [...]
>>    "movdqa   %%xmm2, ("dct")         \n\t" \
>>    "movdqa   %%xmm3, %%xmm2          \n\t" \
>>    "psubsw   %%xmm6, %%xmm3          \n\t" \
>>    "paddsw   %%xmm2, %%xmm6          \n\t" \
>>    "movdqa   %%xmm6, %%xmm2          \n\t" \
>>    "psubsw   %%xmm7, %%xmm6          \n\t" \
>>    "paddsw   %%xmm2, %%xmm7          \n\t" \
>>    "movdqa   %%xmm3, %%xmm2          \n\t" \
>>    "psubsw   %%xmm5, %%xmm3          \n\t" \
>>    "paddsw   %%xmm2, %%xmm5          \n\t" \
>>    "movdqa   %%xmm5, %%xmm2          \n\t" \
>>    "psubsw   %%xmm0, %%xmm5          \n\t" \
>>    "paddsw   %%xmm2, %%xmm0          \n\t" \
>>    "movdqa   %%xmm3, %%xmm2          \n\t" \
>>    "psubsw   %%xmm4, %%xmm3          \n\t" \
>>    "paddsw   %%xmm2, %%xmm4          \n\t" \
>>    "movdqa  ("dct"), %%xmm2          \n\t" \
>
> I suspect this can be written without the load/store by using
> add, add, sub butterflies (of course only if it is faster).

Avoided on x86-64.
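
For reference, the two butterfly shapes in SSE2 intrinsics (illustration
only - the function names are made up, and this is just one reading of
the add, add, sub suggestion):

#include <emmintrin.h>

/* what the quoted code does: a' = a - b, b' = a + b via a spare copy,
 * which is why it spills xmm2 to the block when no register is free */
static inline void butterfly_copy(__m128i *a, __m128i *b)
{
    __m128i t = *a;
    *a = _mm_subs_epi16(*a, *b);
    *b = _mm_adds_epi16(*b, t);
}

/* add, add, sub: no copy and no spill, at the cost of forming 2*a,
 * which needs enough headroom not to saturate in paddsw */
static inline void butterfly_add_add_sub(__m128i *a, __m128i *b)
{
    *b = _mm_adds_epi16(*a, *b);   /* b' = a + b           */
    *a = _mm_adds_epi16(*a, *a);   /*      2a              */
    *a = _mm_subs_epi16(*a, *b);   /* a' = 2a - b' = a - b */
}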

These changed:
-------------- next part --------------
Attachments:
sse2-permute2.diff (1335 bytes):
<http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080410/98ddc0d2/attachment.obj>
dcttest-xvidsse2-2.diff (1834 bytes):
<http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080410/98ddc0d2/attachment-0001.obj>
idct_sse2_xvid.c (11495 bytes):
<http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080410/98ddc0d2/attachment-0002.obj>


