[FFmpeg-devel] [PATCH] libavutil: add x86 optimized av_popcount

James Almer jamrial at gmail.com
Wed Feb 25 19:21:57 CET 2015


On 25/02/15 2:58 PM, Ronald S. Bultje wrote:
> Hi,
> 
> On Wed, Feb 25, 2015 at 12:52 PM, James Almer <jamrial at gmail.com> wrote:
> 
>> On 25/02/15 2:36 PM, Ronald S. Bultje wrote:
>>> Hi James,
>>>
>>> On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamrial at gmail.com> wrote:
>>>
>>>> On 25/02/15 9:41 AM, Ronald S. Bultje wrote:
>>>>> Hi,
>>>>>
>>>>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamrial at gmail.com>
>> wrote:
>>>>>>
>>>>>> +#if HAVE_FAST_POPCNT
>>>>>> +#if AV_GCC_VERSION_AT_LEAST(4,5)
>>>>>> +#ifndef av_popcount
>>>>>> +    #define av_popcount   __builtin_popcount
>>>>>> +#endif /* av_popcount */
>>>>>> +#if HAVE_FAST_64BIT
>>>>>> +#ifndef av_popcount64
>>>>>> +    #define av_popcount64 __builtin_popcountll
>>>>>> +#endif /* av_popcount64 */
>>>>>> +#endif /* HAVE_FAST_64BIT */
>>>>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */
>>>>>> +#endif /* HAVE_FAST_POPCNT */
>>>>>>
>>>>>
>>>>> Is this just to get the sse4 popcnt instruction if we compile with
>>>>> -mcpu=sse4? The slightly odd thing is that we're using a built-in, yet
>>>>> configure still does an arch/cpu check. I'd expect the
>> built-in/compiler
>>>> to
>>>>> do that for us based on -mcpu, and we could always unconditionally use
>>>> this
>>>>> (as long as gcc >= 4.5); alternatively, you could use inline asm and
>> then
>>>>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a
>>>> little
>>>>> odd. I have no objection to it, patch is still fine, just odd.
>>>>>
>>>>> Ronald
>>>>
>>>> I purposely made the checks for gcc 4.5 and in configure for cpus that
>>>> support popcnt
>>>> because otherwise __builtin_popcount (at least gcc's) is slower than our
>>>> generic
>>>> av_popcount_c function from lavu/common.h.
>>>> When the CPU supports popcnt the builtin becomes a single inlined
>>>> instruction.
>>>>
>>>> I tried the __asm__ approach, but the code generated by the builtin
>> seemed
>>>> better.
>>>
>>>
>>> That's interesting, can you show the code you tried?
>>>
>>> Ronald
>>
>> If instead of builtins i use
>>
>> +#if HAVE_INLINE_ASM
>> +
>> +#ifdef __POPCNT__
>> +
>> +#define av_popcount av_popcount_abm
>> +static av_always_inline av_const int av_popcount_abm(uint32_t a)
>> +{
>> +    int x;
>> +
>> +    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
>> +    return x;
>> +}
>> +
>> +#if ARCH_X86_64
>> +#define av_popcount64 av_popcount64_abm
>> +static av_always_inline av_const int av_popcount64_abm(uint64_t a)
>> +{
>> +    uint64_t x;
>> +
>> +    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
>> +    return x;
>> +}
>> +#endif
>> +
>> +#endif /* __POPCNT__ */
>> +
>> +#endif /* HAVE_INLINE_ASM */
>>
>> in the x86/ header from the second version i sent, on x86_32 av_cpu_count
>> from
>> libavutil/cpu.o on Windows (Will not work with other OSes because of
>> HAVE_GETPROCESSAFFINITYMASK) generates
>>
>>  1af:   f3 0f b8 5c 24 18       popcnt ebx,DWORD PTR [esp+0x18]
>>  1b5:   31 c0                   xor    eax,eax
>>  1b7:   f3 0f b8 c0             popcnt eax,eax
>>  1bb:   01 c3                   add    ebx,eax
>>
>> Whereas with the builtins i instead get
>>
>>  1af:   f3 0f b8 5c 24 18       popcnt ebx,DWORD PTR [esp+0x18]
>>
>> This is probably because of av_popcount64_c (Used unconditionally on
>> x86_32) calling
>> av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes in
>> x86_32.
>> The builtin seems to know proc_aff can't be right shifted 32 bits so it
>> generates
>> a single popcnt.
> 
> 
> It seems to me rather than the second version is defined as "pure" (or
> whatever the keyword was), i.e. "we don't depend on external state" -> "if
> you call this with a constant, we can calculate this at compile-time".
> 
> Ronald

Do you know how to get the asm version working right for this one case? 
Otherwise i think the second version with builtins i sent is good to apply. Much
cleaner than the configure approach to be honest.
Someone else can change it to add the relevant clang check/s because i can't test 
that.


More information about the ffmpeg-devel mailing list