[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2

Ronald S. Bultje rsbultje
Tue Jul 20 00:40:14 CEST 2010


Hi,

On Mon, Jul 19, 2010 at 2:19 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Sun, Jul 18, 2010 at 02:21:12PM -0400, Ronald S. Bultje wrote:
>> On Sun, Jul 11, 2010 at 2:47 PM, Loren Merritt <lorenm at u.washington.edu> wrote:
>> > On Sun, 11 Jul 2010, Michael Niedermayer wrote:
>> >> On Sun, Jul 11, 2010 at 04:52:04PM +0000, Loren Merritt wrote:
>> >>> On Sun, 11 Jul 2010, Ronald S. Bultje wrote:
>> >>>> You'll notice that the sse2 is significantly slower here, my rough
>> >>>> guess is that this is because of my shitty CPU which pretty much
>> >>>> emulates xmm-ops through mmx-ops, so it doesn't add a lot of benefit
>> >>>> other than not having to setup the loop for doing the second 8 pixels,
>> >>>> combined with the added complexity of a 8x16 transpose before the
>> >>>> actual filter. I'm betting that on an actual sse2-supporting CPU
>> >>>> (Jason?), this would still be faster, but we might want to put this
>> >>>> under a FF_MM_SSE2_NOT_SHITTY flag or something along those lines. If
>> >>>> you think my code is shitty, comments are welcome also. ;-).
>> >>>
>> >>> Rather than special-casing most of the functions, we at x264 declared
>> >>> that Core1 doesn't have sse2, and changed the cpuid parser accordingly.
>> >>> If you want to support the few cases where sse2 is slightly faster than
>> >>> mmx, I recommend picking a different flag for that and applying it only
>> >>> when you've tested on Core1, so that FF_MM_SSE2 can be trusted to dwim in
>> >>> the usual case.
>> >>>
>> >>> --Loren Merritt
>> >>
>> >>>  cpuid.c |   14 +++++++++++++-
>> >>>  1 file changed, 13 insertions(+), 1 deletion(-)
>> >>> 7ba0916766645e2de9330e9ba8f30d815da14c91  cpuid.diff
>> >>
>> >> Do we have any float SSE2 code that this could affect negatively?
>> >> If not, I am ok with this patch.
>> >
>> > ff_lpc_compute_autocorr_sse2
>>
>> Attached patch implements FF_MM_SSE2/3SLOW for this purpose.
> [...]
>> @@ -108,13 +112,25 @@
>>              rval |= FF_MM_MMX2;
>>      }
>>
>> +    if (!strncmp(vendor.c, "GenuineIntel", 12) &&
>> +        family == 6 && (model == 9 || model == 13 || model == 14)) {
>> +        /* 6/9 (pentium-m "banias"), 6/13 (pentium-m "dothan"), and 6/14 (core1 "yonah")
>> +         * theoretically support sse2, but it's usually slower than mmx,
>> +         * so let's just pretend they don't. */
>
>> +        if (rval & FF_MM_SSE2) rval |= FF_MM_SSE2SLOW;
>> +        if (rval & FF_MM_SSE3) rval |= FF_MM_SSE3SLOW;
>> +        rval &= ~(FF_MM_SSE2|FF_MM_SSE3);
>
> if (rval & FF_MM_SSE2) rval ^= FF_MM_SSE2SLOW | FF_MM_SSE2;
> ...
>
> ok otherwise

Applied with the suggested modification.
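
For anyone reading this in the archive, here is a rough standalone sketch of why the xor form gives the same result as the three lines in the posted patch. It is not the committed code. The flag bit values are illustrative placeholders rather than the ones in the FFmpeg headers, the helper names (is_slow_sse2_cpu, demote_set_clear, demote_xor) are made up for this example, and the SSE3 line of the xor form is my assumption of what the "..." in the review stands for. The equivalence also assumes the *SLOW bits are not already set when the check runs.

```c
#include <string.h>

/* Illustrative placeholder bit values; NOT copied from FFmpeg's headers. */
#define FF_MM_SSE2      0x0010
#define FF_MM_SSE3      0x0040
#define FF_MM_SSE3SLOW  0x20000000
#define FF_MM_SSE2SLOW  0x40000000

/* Hypothetical helper: the vendor/family/model test from the quoted patch
 * (Intel 6/9 "banias", 6/13 "dothan", 6/14 "yonah"). */
static int is_slow_sse2_cpu(const char *vendor, int family, int model)
{
    return !strncmp(vendor, "GenuineIntel", 12) &&
           family == 6 && (model == 9 || model == 13 || model == 14);
}

/* Form from the patch as posted: set the SLOW bits, then clear the fast ones. */
static int demote_set_clear(int rval)
{
    if (rval & FF_MM_SSE2) rval |= FF_MM_SSE2SLOW;
    if (rval & FF_MM_SSE3) rval |= FF_MM_SSE3SLOW;
    rval &= ~(FF_MM_SSE2 | FF_MM_SSE3);
    return rval;
}

/* Form suggested in review: when the fast bit is set (and the SLOW bit is
 * assumed clear), one xor clears the fast bit and sets the SLOW bit.
 * The SSE3 line is assumed to be analogous to the SSE2 line. */
static int demote_xor(int rval)
{
    if (rval & FF_MM_SSE2) rval ^= FF_MM_SSE2SLOW | FF_MM_SSE2;
    if (rval & FF_MM_SSE3) rval ^= FF_MM_SSE3SLOW | FF_MM_SSE3;
    return rval;
}
```

In both forms, a flag set with neither SSE2 nor SSE3 passes through unchanged, and unrelated bits are never touched.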

Ronald


