[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2

Ronald S. Bultje rsbultje
Sun Jul 11 19:53:54 CEST 2010


hi,

On Jul 11, 2010, at 12:59 PM, Michael Niedermayer <michaelni at gmx.at> wrote:

> On Sun, Jul 11, 2010 at 04:52:04PM +0000, Loren Merritt wrote:
>> On Sun, 11 Jul 2010, Ronald S. Bultje wrote:
>>
>>> You'll notice that the sse2 version is significantly slower here; my
>>> rough guess is that this is because of my shitty CPU, which pretty much
>>> emulates xmm-ops through mmx-ops, so it doesn't add a lot of benefit
>>> other than not having to set up the loop for doing the second 8 pixels,
>>> combined with the added complexity of an 8x16 transpose before the
>>> actual filter. I'm betting that on an actual sse2-supporting CPU
>>> (Jason?), this would still be faster, but we might want to put this
>>> under a FF_MM_SSE2_NOT_SHITTY flag or something along those lines. If
>>> you think my code is shitty, comments are welcome also. ;-)
>>
>> Rather than special-casing most of the functions, we at x264 declared
>> that Core1 doesn't have sse2, and changed the cpuid parser accordingly.
>> If you want to support the few cases where sse2 is slightly faster than
>> mmx, I recommend picking a different flag for that and applying it only
>> when you've tested on Core1, so that FF_MM_SSE2 can be trusted to dwim
>> in the usual case.
>>
>> --Loren Merritt
>
>> cpuid.c |   14 +++++++++++++-
>> 1 file changed, 13 insertions(+), 1 deletion(-)
>> 7ba0916766645e2de9330e9ba8f30d815da14c91  cpuid.diff
>
> Do we have any float SSE2 code that this could affect negatively?
> If not, I'm OK with this patch.
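For readers following the thread, here is a minimal sketch of the two ideas Loren describes: demoting SSE2 in the cpuid parser on Core1, and a separate "slow SSE2" flag that individual functions can opt into once they have been benchmarked there. This is not the contents of cpuid.diff, nor x264's or FFmpeg's code. The flag names (`CPU_FLAG_SSE2_SLOW` etc.), the helper functions, and the dummy filter functions are hypothetical. The assumption that Core1 ("Yonah") reports CPUID family 6, model 14 is labeled as such in the comments; the family/model decoding follows Intel's documented CPUID leaf 1 EAX layout.

```c
#include <stdint.h>

/* Hypothetical flag bits for illustration only (not FFmpeg/x264 names). */
#define CPU_FLAG_MMX       0x0001
#define CPU_FLAG_SSE2      0x0002 /* SSE2 present and trusted to be fast */
#define CPU_FLAG_SSE2_SLOW 0x0004 /* SSE2 present, usually slower than MMX */

/* Decode display family/model from CPUID leaf 1 EAX (Intel's scheme). */
static void cpuid_family_model(uint32_t eax, int *family, int *model)
{
    int base_family = (eax >> 8) & 0xf;
    int base_model  = (eax >> 4) & 0xf;

    *family = base_family;
    *model  = base_model;
    if (base_family == 0xf)
        *family += (eax >> 20) & 0xff;           /* extended family */
    if (base_family == 0x6 || base_family == 0xf)
        *model += ((eax >> 16) & 0xf) << 4;      /* extended model */
}

/* Assumption: Core1 ("Yonah") is family 6, model 14. On it, move SSE2 to
 * SSE2_SLOW so code testing only CPU_FLAG_SSE2 falls back to MMX. */
static int adjust_cpu_flags(uint32_t eax, int flags)
{
    int family, model;

    cpuid_family_model(eax, &family, &model);
    if (family == 6 && model == 14 && (flags & CPU_FLAG_SSE2))
        flags = (flags & ~CPU_FLAG_SSE2) | CPU_FLAG_SSE2_SLOW;
    return flags;
}

/* Dummy stand-ins for the MMX and SSE2 loop filter versions. */
typedef const char *(*filter_fn)(void);
static const char *filter_mmx(void)  { return "mmx";  }
static const char *filter_sse2(void) { return "sse2"; }

/* Per-function opt-in: a routine measured to be faster on Core1 passes
 * sse2_ok_when_slow = 1 and also accepts CPU_FLAG_SSE2_SLOW. */
static filter_fn pick_filter(int flags, int sse2_ok_when_slow)
{
    if (flags & CPU_FLAG_SSE2)
        return filter_sse2;
    if (sse2_ok_when_slow && (flags & CPU_FLAG_SSE2_SLOW))
        return filter_sse2;
    return filter_mmx;
}
```

With this split, a family 6 / model 14 CPU keeps the MMX luma filter by default (`pick_filter(flags, 0)`), while a function such as the other VP8 SSE2 routines Ronald mentions below could pass `1` and keep SSE2 on that CPU. A family 6 / model 15 CPU keeps `CPU_FLAG_SSE2` and gets SSE2 everywhere.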

All other VP8 SSE2 funcs are faster than MMX on my Core1, so we might
want to actually test a little before applying this?

Otherwise ok for me.

Ronald