[FFmpeg-devel] [PATCH] Use av_clip_uint8 in swscale.
Måns Rullgård
mans
Tue Aug 18 01:51:33 CEST 2009
Frank Barchard <fbarchard at google.com> writes:
> 2009/8/17 M?ns Rullg?rd <mans at mansr.com>
>
>> And you might know it's unbounded.
>>
> yes. So a general purpose function has to do it the slow way. But this
> function is applied within swscaler on YUV data with a know range.
This function is meant to be generic, not only for use in libswscale.
>> > If you combined 3 bytes, its 768 values.
>>
>> "combined"?
>
> Image operations are usually a function of 2 images, or 4 channels, which
> puts a limit on how far out of range they can be.
Yes, *USUALLY*. Do you know the difference between something that
*usually* works and something that *always* works?
>> > I know if statements are increasingly efficient, and memory less
>> > efficient, but the original code had 4 to 6 instructions and
>> > potentially 2 branches taken per clipped value. av_clip_uint8()
>> > can be optimized to a single instruction on most CPU's
>>
>> Yes, on those with dedicated clip instructions. Others will need
>> several instructions to support the full 32-bit range. Even if the
>> range is known to be smaller, a table lookup can be slower than a few
>> compares and conditional instructions, and it poisons the cache
>> needlessly.
>
> Here's a benchmark on my code that is very similar. This version, including
> YUV conversion, runs in 2.97ms
>
> static inline uint32 clip(int32 value) {
>
> if (value < 0) return 0u; if (value > 65535) return 255u;
> return static_cast<uint32>(value >> 8);}
>
> *This code runs in 2.11ms*
>
> static inline uint32 clip(int32 value) { return
> static_cast<uint32>(g_rgb_clip_table[((value) >> 8) +
> kClipOverflow]);}
That means only that your compiler sucks.
> The table is read only, so the cache lines are not dirty, and image
> data tends to be coherent and only use a portion of table. The tables
> for simple YUV clipping are 832 bytes.
That's a significant portion of a 16k cache.
>> >> > On x86, there is cmov, but in the above code it would take cmp,
>> >> > cmov, cmp, cmov to do each value, whereas the table method takes
>> >> > one mov instruction.
>> >>
>> >> You're forgetting the address calculation.
>> >
>> > movzx eax,cliptbl[eax*4]
>>
>> Now you're back at the 4GB table. And where did the value of
>> "cliptbl" come from? It would have to be loaded from somewhere.
>
> cliptbl is an array. You can index directly off arrays on x86.
But not on x86-64 or most other modern CPUs.
--
M?ns Rullg?rd
mans at mansr.com
More information about the ffmpeg-devel
mailing list