[MPlayer-dev-eng] [PATCH] SSE2 optimizations for libmpeg2
Loren Merritt
lorenm at u.washington.edu
Sun Feb 17 18:26:07 CET 2008
On Sat, 16 Feb 2008, Diego Biurrun wrote:
> I found this patch on the libmpeg2 mailing list, here it is, slightly
> adapted and cleaned up. I could only test compilation without SSE2 as I
> don't have a SSE2 processor. I'd be happy to hear about test results
> and benchmarks.
CPU detection is broken; the attached patch fixes it (to be applied on top of
the previous patch). Furthermore, CPU detection is doubly redundant: not
only is libmpeg2's own detection unused, but libmpeg2's wrapper for
mplayer's detection is also unused. vd_libmpeg2 overrides both.
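For reference, a minimal sketch of where the SSE2 test has to sit. CPUID leaf 0 returns the vendor string in ebx/ecx/edx, so feature bits are only meaningful in the edx returned by leaf 1. The CAP_* flags and caps_from_leaf1_edx() are made-up names for illustration, not libmpeg2's; the SSE and SSE2 masks are the ones in cpu_accel.c, and the MMX mask is the standard CPUID bit 23.

```c
#include <stdint.h>

/* Hypothetical capability flags, for illustration only
 * (stand-ins for the MPEG2_ACCEL_X86_* constants). */
enum {
    CAP_MMX    = 1u << 0,
    CAP_MMXEXT = 1u << 1,
    CAP_SSE2   = 1u << 2
};

/* Decode the edx value returned by CPUID leaf 1 (feature flags).
 * 0x00800000 = bit 23 (MMX), 0x02000000 = bit 25 (SSE),
 * 0x04000000 = bit 26 (SSE2). */
static uint32_t caps_from_leaf1_edx(uint32_t edx)
{
    uint32_t caps = 0;

    if (!(edx & 0x00800000u))   /* no MMX: report nothing */
        return 0;
    caps |= CAP_MMX;
    if (edx & 0x02000000u)      /* SSE includes the MMX extensions */
        caps |= CAP_MMXEXT;
    if (edx & 0x04000000u)      /* SSE2 */
        caps |= CAP_SSE2;
    return caps;
}
```

Testing the same mask against a leaf 0 edx only inspects vendor-string bytes: for the AMD value compared in cpu_accel.c, 0x69746e65 & 0x04000000 is 0.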
What is DO_NOT_MIX_MMX_AND_SSE2 for? There's nothing wrong with mixing mmx
and sse2. Pick whichever is fastest for any given function, or even mix
them in the same function.
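A rough sketch of per-function selection, assuming a table of function pointers filled in once at init time. All names here (ACCEL_*, struct dsp, dsp_init, the stub kernels) are hypothetical and only stand in for real implementations; the stubs return their own name so the choice is visible.

```c
/* Hypothetical capability flags, for illustration only. */
enum { ACCEL_MMXEXT = 1u << 0, ACCEL_SSE2 = 1u << 1 };

typedef const char *(*kernel_fn)(void);

/* Stub kernels standing in for real idct / motion compensation code. */
static const char *idct_c(void)      { return "idct_c"; }
static const char *idct_mmxext(void) { return "idct_mmxext"; }
static const char *idct_sse2(void)   { return "idct_sse2"; }
static const char *mc_c(void)        { return "mc_c"; }
static const char *mc_mmxext(void)   { return "mc_mmxext"; }

struct dsp {
    kernel_fn idct;
    kernel_fn mc;
};

/* Choose each kernel independently instead of switching the whole
 * decoder to a single instruction set. */
static struct dsp dsp_init(unsigned accel)
{
    struct dsp d = { idct_c, mc_c };

    if (accel & ACCEL_MMXEXT) {
        d.idct = idct_mmxext;
        d.mc   = mc_mmxext;
    }
    if (accel & ACCEL_SSE2)
        d.idct = idct_sse2;     /* mc keeps whatever was chosen above */
    return d;
}
```

With both flags set, dsp_init() returns the sse2 idct next to the mmxext mc, which is the kind of mix a DO_NOT_MIX_MMX_AND_SSE2 switch would forbid.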
sse2 output is not identical to mmx output (65 dB). Is the idct supposed
to differ?
quick test on a single 414 second dvd source, core2 e6600:
ffmpeg2:        18.33 +/- .11 sec (678 fps)
libmpeg2 mmx:   16.91 +/- .09 sec (734 fps)
libmpeg2 sse2:  14.75 +/- .05 sec (842 fps)
oprofile (filtered to merge the many mc functions into a single line)
ffmpeg2:
samples  %        symbol name
128869   38.4785  mpeg_decode_mb
58947    17.6007  ff_simple_idct_add_mmx
32611     9.7372  ff_simple_idct_put_mmx
30083     8.9823  put_pixels_mmx2
12113     3.6168  MPV_decode_mb
11071     3.3056  MPV_motion
10564     3.1543  clear_blocks_mmx
9792      2.9238  add_pixels_clamped_mmx
8167      2.4386  demux_pattern_3
8034      2.3988  mpeg_decode_motion
7522      2.2460  decode_dc
6130      1.8303  mpeg_decode_slice
3191      0.9528  prefetch_mmx2
3157      0.9426  fast_memcpy
1136      0.3393  avg_pixels_mmx2

libmpeg2 mmx:
75716    25.2977  mmxext_idct
75500    25.2255  get_non_intra_block
33795    11.2913  MC_put_mmxext
30040    10.0368  get_intra_block_B15
20569     6.8724  mpeg2_slice
15603     5.2132  mpeg2_idct_add_mmxext
8583      2.8677  mpeg2_parse
8566      2.8620  slice_intra_DCT
8067      2.6953  demux_pattern_3
6589      2.2015  motion_fr_frame_420
5965      1.9930  mpeg2_idct_copy_mmxext
3080      1.0291  motion_fr_field_420
2955      0.9873  fast_memcpy
1489      0.4975  MC_avg_mmxext
649       0.2168  motion_reuse_420

libmpeg2 sse2:
73447    28.2600  get_non_intra_block
41666    16.0317  mpeg2_idct_add_sse2
33042    12.7135  MC_put_sse2
29754    11.4484  get_intra_block_B15
22896     8.8096  mpeg2_idct_copy_sse2
19205     7.3895  mpeg2_slice
8828      3.3967  mpeg2_parse
8248      3.1736  demux_pattern_3
6420      2.4702  motion_fr_frame_420
6355      2.4452  slice_intra_DCT
2952      1.1358  motion_fr_field_420
2918      1.1228  fast_memcpy
1422      0.5473  MC_avg_sse2
613       0.2359  motion_reuse_420

... seems to show that the mc part is useless, and speedup is due entirely
to idct.
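The comparison can be checked by summing the idct and mc rows of the two libmpeg2 listings. This is only a sketch: it assumes the sample counts of the two runs are directly comparable (same input, same sampling rate), and the helper names are made up for illustration.

```c
/* Sample counts copied from the oprofile listings above. */
static const long idct_mmx[]  = { 75716, 15603, 5965 }; /* mmxext_idct, idct_add, idct_copy */
static const long idct_sse2[] = { 41666, 22896 };       /* idct_add, idct_copy */
static const long mc_mmx[]    = { 33795, 1489 };        /* MC_put, MC_avg */
static const long mc_sse2[]   = { 33042, 1422 };        /* MC_put, MC_avg */

/* Sum the first n entries of v. */
static long sum(const long *v, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Samples saved by the sse2 build relative to the mmx build. */
static long idct_saved(void) { return sum(idct_mmx, 3) - sum(idct_sse2, 2); }
static long mc_saved(void)   { return sum(mc_mmx, 2)   - sum(mc_sse2, 2); }
```

idct drops from 97284 to 64562 samples (32722 fewer), while mc drops from 35284 to 34464 (820 fewer).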
--Loren Merritt
-------------- next part --------------
--- libmpeg2/cpu_accel.c~ 2008-02-17 08:54:13.000000000 -0700
+++ libmpeg2/cpu_accel.c 2008-02-17 09:06:06.000000000 -0700
@@ -97,8 +97,6 @@
     if (!eax)                  /* vendor string only */
         return 0;
-    if (edx & 0x04000000)      /* SSE2 */
-        accel |= MPEG2_ACCEL_X86_SSE2;
     AMD = (ebx == 0x68747541) && (ecx == 0x444d4163) && (edx == 0x69746e65);
     cpuid (0x00000001, eax, ebx, ecx, edx);
@@ -108,6 +106,8 @@
     caps = MPEG2_ACCEL_X86_MMX;
     if (edx & 0x02000000)      /* SSE - identical to AMD MMX extensions */
         caps = MPEG2_ACCEL_X86_MMX | MPEG2_ACCEL_X86_MMXEXT;
+    if (edx & 0x04000000)      /* SSE2 */
+        caps |= MPEG2_ACCEL_X86_SSE2;
     cpuid (0x80000000, eax, ebx, ecx, edx);
     if (eax < 0x80000001)      /* no extended capabilities */
Index: libmpcodecs/vd_libmpeg2.c
===================================================================
--- libmpcodecs/vd_libmpeg2.c (revision 26016)
+++ libmpcodecs/vd_libmpeg2.c (working copy)
@@ -72,6 +72,8 @@
        accel |= MPEG2_ACCEL_X86_MMX;
     if(gCpuCaps.hasMMX2)
        accel |= MPEG2_ACCEL_X86_MMXEXT;
+    if(gCpuCaps.hasSSE2)
+       accel |= MPEG2_ACCEL_X86_SSE2;
     if(gCpuCaps.has3DNow)
        accel |= MPEG2_ACCEL_X86_3DNOW;
     if(gCpuCaps.hasAltiVec)