[FFmpeg-devel] [PATCH] SSE2 Xvid idct
Alexander Strange
astrange
Fri Apr 11 00:42:40 CEST 2008
On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>> This adds skal's SSE2 IDCT and uses it as the Xvid IDCT when available.
>>
>> I merged two shuffles into the permutation and changed the
>> zero-skipping somewhat - it's fastest in MMX and not really worth doing
>> for the first three rows. Their right halves are still usually all zero,
>> but adding the branch to check for it is a net loss. The best thing for
>> speed would be switching IDCTs by counting the last nonzero coefficient
>> position, but that's something for later.
>>
>> xvididctheader - makes a new header so I don't add any more extern
>> declarations in .c files.
>> sse2-permute - the new permutation; it might not have a specific enough
>> name, but it should work as well for simpleidct as this if I can get
>> back to that.
>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>
>> The URLs in the header (copied from idct_mmx_xvid and the original nasm
>> source) are broken at the moment, but archive.org URLs are longer than
>> 80 characters, so I left them as they are.
>>
>> skal agreed it could be under LGPL in the last thread.
> [...]
>> #define SKIP_ROW_CHECK(src) \
>> "movq "src", %%mm0 \n\t" \
>> "por 8+"src", %%mm0 \n\t" \
>> "packssdw %%mm0, %%mm0 \n\t" \
>> "movd %%mm0, %%eax \n\t" \
>> "testl %%eax, %%eax \n\t" \
>> "jz 1f \n\t"
>
> You could try to check pairs of rows; this might be faster for some rows.
> Also, the code should be interleaved, not form such nasty dependency
> chains. You do have enough MMX registers.
I think the movq breaks the dependence chain, at least on my CPU. But
moving stuff above the branch is good - changed to check two rows at
once for rows 3-6 and use MMX pmovmskb.
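
For readers following along, the two-rows-at-once zero test can be modeled in portable C. This is only a scalar sketch of the idea, not the MMX sequence from the patch; the name row_pair_is_zero and the row-major int16_t[64] block layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if rows `row` and `row + 1` of an 8x8
 * block of 16-bit coefficients (row-major, assumed layout) are all zero. */
static int row_pair_is_zero(const int16_t block[64], int row)
{
    uint64_t q[4];

    assert(row >= 0 && row <= 6);
    /* Two consecutive rows are 2 * 8 * 2 = 32 bytes, i.e. four 64-bit words. */
    memcpy(q, block + 8 * row, sizeof(q));
    /* OR everything together so one compare and one branch cover both rows. */
    return (q[0] | q[1] | q[2] | q[3]) == 0;
}
```

The point of pairing is that the cost of the test and the branch is paid once per two rows, which matters most where both rows are usually empty.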
>> #define iMTX_MULT(src, table, rounder) \
>> "movdqa "src", %%xmm0 \n\t" \
>
>> "pshufd $0, %%xmm0, %%xmm4 \n\t" \
>> "pshufd $0x55, %%xmm0, %%xmm6 \n\t" \
>> "pshufd $0xAA, %%xmm0, %%xmm5 \n\t" \
>> "pshufd $0xFF, %%xmm0, %%xmm7 \n\t" \
>
> you can replace 2 of the pshufd by 1 movdqa, 1 unpckldqd and 1 unpckhdqd;
> considering that pshufd seems to be slower, this _could_ be faster.
> here are my notes about it:
> 02461357
> 02461357 mov
> 02460246 unpck
> 13571357 unpck
> 46024602 shufld
> 57135713 shufld
Done, it seems a bit faster.
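
To make the notes above easier to check, here is a small portable C model of the sequence. It assumes the digits in the notes are 16-bit lanes listed from lane 0 upward, that the two "unpck" steps duplicate the low and high 64-bit halves (the behavior of punpcklqdq / punpckhqdq with the same register as both operands), and that "shufld" is a pshufd swapping the dwords of each pair (immediate 0xB1). The type and function names are hypothetical, and this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Eight 16-bit lanes standing in for one XMM register (a model, not real SIMD). */
typedef struct { uint16_t w[8]; } vec128;

/* Copy the low 64 bits into both halves. */
static vec128 dup_low_qword(vec128 a)
{
    vec128 r;
    memcpy(r.w, a.w, 8);
    memcpy(r.w + 4, a.w, 8);
    return r;
}

/* Copy the high 64 bits into both halves. */
static vec128 dup_high_qword(vec128 a)
{
    vec128 r;
    memcpy(r.w, a.w + 4, 8);
    memcpy(r.w + 4, a.w + 4, 8);
    return r;
}

/* pshufd-style shuffle: result dword i is source dword (imm >> 2*i) & 3. */
static vec128 shuffle_dwords(vec128 a, unsigned imm)
{
    vec128 r;

    assert(imm < 256);
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm >> (2 * i)) & 3;
        r.w[2 * i]     = a.w[2 * sel];
        r.w[2 * i + 1] = a.w[2 * sel + 1];
    }
    return r;
}

/* Render lane contents as digits, e.g. "02461357", to compare with the notes. */
static void lanes_to_str(vec128 v, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)('0' + v.w[i]);
    out[8] = '\0';
}
```

Starting from lanes 0,2,4,6,1,3,5,7, dup_low_qword and dup_high_qword give 02460246 and 13571357, and shuffle_dwords with 0xB1 on each of those gives 46024602 and 57135713, matching the four rows in the notes. Note that this layout differs from what four broadcasting pshufd produce, so the coefficient tables have to be arranged to match.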
> [...]
>> #define iLLM_PASS(dct) \
>> "movdqa "MANGLE(tan3)", %%xmm0 \n\t" \
>> "movdqa 3*16("dct"), %%xmm3 \n\t" \
>> "movdqa %%xmm0, %%xmm1 \n\t" \
>> "movdqa 5*16("dct"), %%xmm5 \n\t" \
>> "movdqa "MANGLE(tan1)", %%xmm4 \n\t" \
>> "movdqa 16("dct"), %%xmm6 \n\t" \
>> "movdqa 7*16("dct"), %%xmm7 \n\t" \
>
> if I didn't miscalculate, then you can keep 4 of the above in registers
> from the row transform (and all 8 dct values for x86_64)
Done (with some macroing). This leaves a few (2-4, maybe) unnecessary
movdqa for x86_64, but I don't want it to be unreadably magical.
I changed iMTX_MULT to use xmm0-3 instead of 4-7 since I thought it
looked better.
>
> [...]
>> "movdqa %%xmm2, ("dct") \n\t" \
>> "movdqa %%xmm3, %%xmm2 \n\t" \
>> "psubsw %%xmm6, %%xmm3 \n\t" \
>> "paddsw %%xmm2, %%xmm6 \n\t" \
>> "movdqa %%xmm6, %%xmm2 \n\t" \
>> "psubsw %%xmm7, %%xmm6 \n\t" \
>> "paddsw %%xmm2, %%xmm7 \n\t" \
>> "movdqa %%xmm3, %%xmm2 \n\t" \
>> "psubsw %%xmm5, %%xmm3 \n\t" \
>> "paddsw %%xmm2, %%xmm5 \n\t" \
>> "movdqa %%xmm5, %%xmm2 \n\t" \
>> "psubsw %%xmm0, %%xmm5 \n\t" \
>> "paddsw %%xmm2, %%xmm0 \n\t" \
>> "movdqa %%xmm3, %%xmm2 \n\t" \
>> "psubsw %%xmm4, %%xmm3 \n\t" \
>> "paddsw %%xmm2, %%xmm4 \n\t" \
>> "movdqa ("dct"), %%xmm2 \n\t" \
>
> I suspect this can be written without the load/store by using
> add,add,sub butterflies (of course only if it is faster)
Avoided on x86-64.
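
As an illustration of the butterfly suggestion, here is a one-lane scalar model in C. The first function mirrors the quoted movdqa/psubsw/paddsw pattern, which needs a scratch register; the second does the same butterfly with three arithmetic operations and no scratch. The ordering shown (sub, add, add) is just one possible arrangement and an assumption on my part, as are the function names. Because the instructions saturate, the scratch-free form can give a different result when the doubling step clips.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the int16_t range, like one lane of paddsw/psubsw. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Butterfly with a temporary: (a, b) -> (a - b, a + b). */
static void butterfly_tmp(int16_t *a, int16_t *b)
{
    assert(a && b);
    int16_t t = *a;          /* movdqa a, tmp */
    *a = sat16(*a - *b);     /* psubsw b, a   */
    *b = sat16(*b + t);      /* paddsw tmp, b */
}

/* Same butterfly without a temporary (assumed ordering: sub, add, add). */
static void butterfly_no_tmp(int16_t *a, int16_t *b)
{
    assert(a && b);
    *a = sat16(*a - *b);     /* a = a - b                 */
    *b = sat16(*b + *b);     /* b = 2b, may saturate here */
    *b = sat16(*b + *a);     /* b = 2b + (a - b) = a + b  */
}
```

For inputs well inside the 16-bit range the two forms agree (for example, 100 and 30 give 70 and 130 either way). With a = 0 and b = 20000 the scratch-free form clips 2b to 32767 and ends with b = 12767 instead of 20000, so it is only safe where the value range is known to leave headroom.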
These changed:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sse2-permute2.diff
Type: application/octet-stream
Size: 1335 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080410/98ddc0d2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dcttest-xvidsse2-2.diff
Type: application/octet-stream
Size: 1834 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080410/98ddc0d2/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: idct_sse2_xvid.c
Type: application/octet-stream
Size: 11495 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080410/98ddc0d2/attachment-0002.obj>