[FFmpeg-devel] MMX accelerated DSP functions for VC1/WMV3 decoders
Christophe GISQUET
christophe.gisquet
Sat Jun 30 14:37:53 CEST 2007
Hello,
the attached patch provides some mmx functions (pshufw from mmx2 would
only marginally be faster) for those decoders. They could also be used
in the encoder, but I didn't bother with this, as there are probably
people more fit than me to accommodate this with the build system.
Tests and benchmarks were performed on
http://samples.mplayerhq.hu/V-codecs/WMV9/highdef/Robotica_720.wmv
I have tested decoding accuracy with a cmp (as I neither know how nor plan
to introduce this in the regression tests), and used the following command
to measure speed/profile:
./ffmpeg -benchmark -i Robotica_720.wmv -an -f rawvideo -y /dev/null
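The accuracy check with cmp mentioned above amounts to a byte-for-byte comparison of two decoder outputs. A minimal Python sketch of that kind of check follows, assuming the reference and patched decodes were each written to a raw video file; the function name and the file names in the comment are hypothetical, not taken from the patch.

```python
import filecmp


def decodes_match(reference_path, candidate_path):
    """Return True if both raw video dumps are byte-for-byte identical."""
    # shallow=False forces a comparison of the file contents rather than
    # trusting the os.stat() signatures (size, mtime) alone.
    return filecmp.cmp(reference_path, candidate_path, shallow=False)


# Hypothetical usage, with one dump per build:
#   decodes_match("robotica_ref.yuv", "robotica_mmx.yuv")
```

Any difference, even in a single byte, makes the check fail, which is the right behaviour here since the MMX code is meant to be bit-exact with the C code.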
And now for the raw figures...
without patch, utime: 7.44 7.35 7.16 7.37 7.27
with: 5.32 5.37 5.33 5.31 5.41
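To summarise these runs, here is a small Python sketch that averages the five utime values on each side and derives the speedup. The variable names are mine; the numbers are copied from the two lines above.

```python
from statistics import mean

# utime values in seconds, copied from the benchmark runs above
without_patch = [7.44, 7.35, 7.16, 7.37, 7.27]
with_patch = [5.32, 5.37, 5.33, 5.31, 5.41]

mean_without = mean(without_patch)
mean_with = mean(with_patch)

# Ratio of mean user times, and the fraction of user time saved
speedup = mean_without / mean_with
time_saved = 1.0 - mean_with / mean_without

print(f"mean utime without patch: {mean_without:.3f} s")
print(f"mean utime with patch:    {mean_with:.3f} s")
print(f"speedup: {speedup:.2f}x, user time reduced by {time_saved:.1%}")
```

That works out to a mean of about 7.32 s without the patch against 5.35 s with it, i.e. roughly a 1.37x speedup, or about 27% less user time.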
And the profiling (oprofile results)...
without patch:
samples  %        symbol name
129666   40.5939  vc1_mspel_mc
45812    14.3422  vc1_inv_trans_8x8_c
26404    8.2662   vc1_decode_p_blocks
21967    6.8771   put_no_rnd_h264_chroma_mc8_c
21336    6.6796   vc1_decode_ac_coeff
8582     2.6867   vc1_decode_intra_block
8273     2.5900   vc1_decode_p_block
8157     2.5537   clear_blocks_mmx
6896     2.1589   put_h264_chroma_mc8_mmx
6748     2.1126   vc1_inv_trans_8x4_c
6254     1.9579   vc1_inv_trans_4x8_c
with:
samples  %        symbol name
6095     17.8169  vc1_inv_trans_8x8_c
3769     11.0176  vc1_decode_p_blocks
3565     10.4212  put_no_rnd_h264_chroma_mc8_c
3380     9.8804   vc1_decode_ac_coeff
1365     3.9902   vc1_inv_trans_8x4_c
1348     3.9405   vc1_decode_p_block
1260     3.6832   clear_blocks_mmx
1146     3.3500   put_h264_chroma_mc8_mmx
1046     3.0577   vc1_inv_trans_4x8_c
938      2.7420   ff_emulated_edge_mc
849      2.4818   ff_put_vc1_mspel_mc22_mmx
791      2.3123   vc1_mc_1mv
774      2.2626   vc1_decode_intra_block
746      2.1807   ff_put_vc1_mspel_mc00_mmx
698      2.0404   ff_put_vc1_mspel_mc20_mmx
576      1.6838   ff_put_vc1_mspel_mc21_mmx
500      1.4616   ff_put_vc1_mspel_mc23_mmx
481      1.4061   ff_put_vc1_mspel_mc12_mmx
476      1.3914   ff_put_vc1_mspel_mc32_mmx
339      0.9910   ff_put_vc1_mspel_mc31_mmx
334      0.9764   ff_put_vc1_mspel_mc11_mmx
334      0.9764   ff_put_vc1_mspel_mc13_mmx
333      0.9734   ff_put_vc1_mspel_mc02_mmx
318      0.9296   ff_init_block_index
305      0.8916   ff_put_vc1_mspel_mc10_mmx
304      0.8887   add_pixels_clamped_mmx
274      0.8010   ff_put_vc1_mspel_mc33_mmx
267      0.7805   ff_put_vc1_mspel_mc30_mmx
233      0.6811   vc1_decode_i_blocks
180      0.5262   ff_put_vc1_mspel_mc01_mmx
165      0.4823   ff_put_vc1_mspel_mc03_mmx
154      0.4502   vc1_inv_trans_4x4_c
The ff_put_vc1_mspel_mc* functions now total just above 20% (vc1_mspel_mc
alone was at 40.6% before). There is some suboptimal stuff left of course,
like filter 0 being just a source/destination modification,
put_pixels8_mmx being duplicated, or some useless register loads, but
fixing these would increase code complexity beyond what I'm willing to
put in.
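For anyone wanting to re-check that total, here is a minimal Python sketch that sums the ff_put_vc1_mspel_mc* rows of the "with" listing. The rows are copied from the oprofile output above, plus two unrelated symbols to show the filtering; the function name and the parsing approach are my own assumptions, not part of oprofile.

```python
# Rows copied from the "with" oprofile listing: samples, percent, symbol name
PROFILE_WITH_PATCH = """\
6095 17.8169 vc1_inv_trans_8x8_c
849 2.4818 ff_put_vc1_mspel_mc22_mmx
791 2.3123 vc1_mc_1mv
746 2.1807 ff_put_vc1_mspel_mc00_mmx
698 2.0404 ff_put_vc1_mspel_mc20_mmx
576 1.6838 ff_put_vc1_mspel_mc21_mmx
500 1.4616 ff_put_vc1_mspel_mc23_mmx
481 1.4061 ff_put_vc1_mspel_mc12_mmx
476 1.3914 ff_put_vc1_mspel_mc32_mmx
339 0.9910 ff_put_vc1_mspel_mc31_mmx
334 0.9764 ff_put_vc1_mspel_mc11_mmx
334 0.9764 ff_put_vc1_mspel_mc13_mmx
333 0.9734 ff_put_vc1_mspel_mc02_mmx
305 0.8916 ff_put_vc1_mspel_mc10_mmx
274 0.8010 ff_put_vc1_mspel_mc33_mmx
267 0.7805 ff_put_vc1_mspel_mc30_mmx
180 0.5262 ff_put_vc1_mspel_mc01_mmx
165 0.4823 ff_put_vc1_mspel_mc03_mmx
"""


def share_of(profile_text, prefix):
    """Sum samples and percentages of all symbols starting with prefix."""
    samples = 0
    percent = 0.0
    for line in profile_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip headers or malformed lines
        count, pct, symbol = fields
        if symbol.startswith(prefix):
            samples += int(count)
            percent += float(pct)
    return samples, percent


mspel_samples, mspel_percent = share_of(PROFILE_WITH_PATCH, "ff_put_vc1_mspel_mc")
print(f"{mspel_samples} samples, {mspel_percent:.4f}% of the profile")
```

The sixteen functions add up to 6857 samples, or about 20.04% of the profile.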
vc1_inv_trans_8x8_c would be the next candidate, but the code looks
bothersome. On the other hand, put_no_rnd_h264_chroma_mc8_c would
benefit other codecs. I do have an mmx1/2 implementation for it, but I'm
holding it until this patch gets into svn, if it ever does.
Best regards,
Christophe GISQUET
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vc1dsp_mmx.diff
Type: text/x-patch
Size: 14848 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070630/40e0816f/attachment.bin>