[FFmpeg-devel] [PATCH] Higher bit-depth x86 SIMD assembly for yadif
Michael Niedermayer
michaelni at gmx.at
Thu Jan 19 22:44:51 CET 2012
Hi
CC-ing Dark Shikari and Loren, as they might want to review this too.
On Thu, Jan 19, 2012 at 08:55:58PM +0100, James Darnley wrote:
> Attached are five patches which add code for:
> mmx to sse4 instruction sets for 15 and 16 bits per sample
> mmx to ssse3 instruction sets for 9 to 14 bits per sample
> actual support of 9 bits per sample
>
> I know that 11 to 15 bits per sample don't exist at present but
> support might be added since h264 allows up to 14 bits per sample.
> Anyway, all the code added here is used for existing features.
>
> Below, I have copied the commit messages for convenience.
>
> Something else to think about. The source code clarity could be
> greatly improved by using yasm and its preprocessor. I wonder how
> much abstraction it would need to roll the source to all three
> functions together and whether it would save source code size.
If you want to convert it to yasm, that's fine; if not, that's fine too.
Whichever way you prefer.
>
> Subject: [PATCH 1/5] x86 SIMD for 16 bits per sample in yadif
>
> It might be a rather dumb copy of the 8-bit SIMD but it works and
> produces identical output to the C. The MMX and SSE2 has been tested on
> my Athlon64. The SSSE3 and SSE4.1 needs testing and benching elsewhere.
>
> Benchmarks on the Athlon64 using a 704px wide video, per line:
> 1693075 decicycles in C, 521977 runs, 2311 skips
> 1029468 decicycles in mmx, 523347 runs, 941 skips
> 730504 decicycles in sse2, 523474 runs, 814 skips
>
> Subject: [PATCH 2/5] x86 SIMD for 9 to 14 bits per sample in yadif
>
> These lower bit depths do not need unpacking to double words letting the
> code process more pixels per iteration (still 2 in mmx but 6 in sse2)
> and avoiding emulating the missing double word instructions on older
> instruction sets.
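
For readers unfamiliar with the emulation mentioned above: signed
double-word max/min (pmaxsd/pminsd) only exist from SSE4.1 on, so older
instruction sets usually build them from a compare, a mask and a blend.
The following is only a scalar C model of that idea, not the patch's
actual PMAXSD/PMINSD macros; the function names are made up for
illustration, and the clamp order is an assumption based on the quoted
asm further down.

```c
#include <stdint.h>

/* Scalar model of a signed 32-bit max without pmaxsd.
 * mask is all ones where a > b (what pcmpgtd would produce per lane),
 * then the result is blended with and / and-not / or. */
static int32_t max_s32_masked(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);      /* 0 or -1 (all bits set) */
    return (a & mask) | (b & ~mask);       /* pand, pandn, por */
}

/* Same trick with the comparison reversed gives the min. */
static int32_t min_s32_masked(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(b > a);
    return (a & mask) | (b & ~mask);
}

/* Clamp x into [lo, hi]: a max against the lower bound followed by a
 * min against the upper bound (assumed order, hypothetical helper). */
static int32_t clamp_s32_masked(int32_t x, int32_t lo, int32_t hi)
{
    return min_s32_masked(max_s32_masked(x, lo), hi);
}
```

Each emulated max or min costs a compare plus three logic operations
and a temporary register, which is the overhead the 9 to 14 bit path
avoids by staying in words.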
>
> Benchmarks on my Athlon64 using a 704 pixel wide video, per line:
> 1695927 decicycles in C, 260986 runs, 1158 skips
> 854770 decicycles in mmx, 261717 runs, 427 skips
> 440202 decicycles in sse2, 261829 runs, 315 skips
>
> Works out at:
> mmx - 1.20 times faster than the 16 bit
> sse2 - 1.66 times faster than the 16 bit
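
The two ratios above follow directly from the quoted benchmark numbers
(16-bit timing divided by 9 to 14 bit timing). A small sketch to
reproduce them; the helper name is hypothetical:

```c
#include <assert.h>

/* Ratio of two per-line timings in decicycles.
 * A result above 1.0 means `after` is the faster variant. */
static double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}

/* Figures taken from the two benchmark tables quoted above:
 *   speedup(1029468.0, 854770.0)   mmx,  16 bit vs 9 to 14 bit
 *   speedup( 730504.0, 440202.0)   sse2, 16 bit vs 9 to 14 bit */
```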
[...]
> + "paddd "MM"6, "MM"3 \n\t" /* d+diff */\
> + PMAXSD(MM"2",MM"1",MM"7")\
> + PMINSD(MM"3",MM"1",MM"7")\
> + PACK(MM"1")\
> +\
> + :\
> + :[tmpA] "r"(tmpA),\
> + [prev] "r"(prev),\
> + [cur] "r"(cur),\
> + [next] "r"(next),\
> + [prefs]"r"(prefs),\
> + [mrefs]"r"(mrefs),\
> + [mode] "g"(mode)\
This should list the SIMD registers written to in the clobber list;
otherwise, with SSE* there may be issues on win64 and, in theory,
elsewhere too.
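
To illustrate what is meant by the clobber list (a minimal sketch, not
the yadif code; the function name is made up): every xmm register the
asm writes is named after the third colon, so the compiler knows it may
not keep live values there. This matters on win64 in particular because
xmm6 to xmm15 are callee-saved in that ABI. The plain C branch is only
there so the sketch builds on non-x86 targets.

```c
#include <stdint.h>

/* dst[i] = a[i] + b[i] for four 32-bit integers. */
static void add4_s32(int32_t *dst, const int32_t *a, const int32_t *b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile(
        "movdqu (%[a]),  %%xmm0   \n\t"  /* load four ints from a      */
        "movdqu (%[b]),  %%xmm1   \n\t"  /* load four ints from b      */
        "paddd  %%xmm1,  %%xmm0   \n\t"  /* xmm0 += xmm1, per dword    */
        "movdqu %%xmm0,  (%[dst]) \n\t"  /* store the four sums        */
        :                                /* no output operands         */
        : [dst] "r"(dst), [a] "r"(a), [b] "r"(b)
        : "xmm0", "xmm1",                /* SIMD registers written     */
          "memory"                       /* dst is written via pointer */
    );
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
#endif
}
```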
> + );\
> + __asm__ volatile(MOVH" "MM"1, %0" :"=m"(*dst));\
I guess it should be OK in practice, but it's not guaranteed that
SIMD registers don't change between asm blocks.
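
One way to avoid depending on that (again only a sketch with made-up
names, not a drop-in change for the patch) is to do the final store
inside the same asm statement, with the destination as an "=m" output
operand, so no value has to survive in a SIMD register from one
statement to the next:

```c
#include <stdint.h>

/* Rounded per-byte average of four packed 8-bit values. */
static uint32_t avg_packed_u8(uint32_t a, uint32_t b)
{
    uint32_t out;
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile(
        "movd  %[a],   %%xmm0 \n\t"   /* low dword of xmm0 = a          */
        "movd  %[b],   %%xmm1 \n\t"   /* low dword of xmm1 = b          */
        "pavgb %%xmm1, %%xmm0 \n\t"   /* (x + y + 1) >> 1 for each byte */
        "movd  %%xmm0, %[out] \n\t"   /* store inside the same block    */
        : [out] "=m"(out)
        : [a] "r"(a), [b] "r"(b)
        : "xmm0", "xmm1"
    );
#else
    /* Portable fallback computing the same rounded average. */
    out = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFF;
        uint32_t y = (b >> (8 * i)) & 0xFF;
        out |= ((x + y + 1) >> 1) << (8 * i);
    }
#endif
    return out;
}
```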
[...]
Also, feel free to add yourself as yadif SIMD maintainer in the
MAINTAINERS file if you like.
And very nice work and speedup.
Thanks!
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data