[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2
Ronald S. Bultje
rsbultje
Sun Jul 11 19:53:54 CEST 2010
hi,
On Jul 11, 2010, at 12:59 PM, Michael Niedermayer <michaelni at gmx.at>
wrote:
> On Sun, Jul 11, 2010 at 04:52:04PM +0000, Loren Merritt wrote:
>> On Sun, 11 Jul 2010, Ronald S. Bultje wrote:
>>
>>> You'll notice that the sse2 is significantly slower here, my rough
>>> guess is that this is because of my shitty CPU which pretty much
>>> emulates xmm-ops through mmx-ops, so it doesn't add a lot of benefit
>>> other than not having to setup the loop for doing the second 8
>>> pixels, combined with the added complexity of a 8x16 transpose
>>> before the actual filter. I'm betting that on an actual
>>> sse2-supporting CPU (Jason?), this would still be faster, but we
>>> might want to put this under a FF_MM_SSE2_NOT_SHITTY flag or
>>> something along those lines. If you think my code is shitty,
>>> comments are welcome also. ;-).
>>
>> Rather than special-casing most of the functions, we at x264 declared
>> that Core1 doesn't have sse2, and changed the cpuid parser
>> accordingly. If you want to support the few cases where sse2 is
>> slightly faster than mmx, I recommend picking a different flag for
>> that and applying it only when you've tested on Core1, so that
>> FF_MM_SSE2 can be trusted to dwim in the usual case.
>>
>> --Loren Merritt
>
>> cpuid.c | 14 +++++++++++++-
>> 1 file changed, 13 insertions(+), 1 deletion(-)
>> 7ba0916766645e2de9330e9ba8f30d815da14c91 cpuid.diff
>
> do we have any float SSE2 code that this could affect negatively?
> if not, I am ok with this patch
All the other VP8 SSE2 functions are faster than their MMX versions on
my Core1, so we might want to actually test a little before applying
this. Otherwise OK with me.
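
For reference, the approach Loren describes (hide SSE2 on Core1 and
expose a separate flag for the functions that were actually tested to
be faster) could be sketched roughly as below. This is only an
illustration and is not taken from the cpuid.diff quoted above: the flag
names and values, the helper names, and the family 6 / model 14 check
used to identify Core1 are all assumptions.

```c
/* Hypothetical flag values, for illustration only. */
#define MM_MMX      0x0001
#define MM_SSE2     0x0010
#define MM_SSE2SLOW 0x0020  /* SSE2 exists, but is usually no faster than MMX */

/* Assumption: Core1 reports CPUID family 6, model 14. */
static int cpu_has_slow_sse2(unsigned family, unsigned model)
{
    return family == 6 && model == 14;
}

/* Takes the flags as parsed from CPUID and returns the flags that the
 * dsp init code should see. On a slow-SSE2 CPU the normal SSE2 bit is
 * cleared, so generic "if (flags & MM_SSE2)" checks fall back to MMX,
 * and a separate bit is set for functions that opt in explicitly. */
static int adjust_cpu_flags(int raw_flags, unsigned family, unsigned model)
{
    int flags = raw_flags;

    if ((flags & MM_SSE2) && cpu_has_slow_sse2(family, model)) {
        flags &= ~MM_SSE2;
        flags |= MM_SSE2SLOW;
    }
    return flags;
}
```

With something like this, a function that was measured to be faster on
Core1 would be enabled for (MM_SSE2 | MM_SSE2SLOW), and everything else
would keep checking plain MM_SSE2.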
Ronald