[FFmpeg-devel] [PATCH] libavutil: add x86 optimized av_popcount
Ronald S. Bultje
rsbultje at gmail.com
Wed Feb 25 19:29:18 CET 2015
Hi,
On Wed, Feb 25, 2015 at 1:21 PM, James Almer <jamrial at gmail.com> wrote:
> On 25/02/15 2:58 PM, Ronald S. Bultje wrote:
> > Hi,
> >
> > On Wed, Feb 25, 2015 at 12:52 PM, James Almer <jamrial at gmail.com> wrote:
> >
> >> On 25/02/15 2:36 PM, Ronald S. Bultje wrote:
> >>> Hi James,
> >>>
> >>> On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamrial at gmail.com>
> wrote:
> >>>
> >>>> On 25/02/15 9:41 AM, Ronald S. Bultje wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamrial at gmail.com>
> >> wrote:
> >>>>>>
> >>>>>> +#if HAVE_FAST_POPCNT
> >>>>>> +#if AV_GCC_VERSION_AT_LEAST(4,5)
> >>>>>> +#ifndef av_popcount
> >>>>>> + #define av_popcount __builtin_popcount
> >>>>>> +#endif /* av_popcount */
> >>>>>> +#if HAVE_FAST_64BIT
> >>>>>> +#ifndef av_popcount64
> >>>>>> + #define av_popcount64 __builtin_popcountll
> >>>>>> +#endif /* av_popcount64 */
> >>>>>> +#endif /* HAVE_FAST_64BIT */
> >>>>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */
> >>>>>> +#endif /* HAVE_FAST_POPCNT */
> >>>>>>
> >>>>>
> >>>>> Is this just to get the sse4 popcnt instruction if we compile with
> >>>>> -mcpu=sse4? The slightly odd thing is that we're using a built-in,
> yet
> >>>>> configure still does an arch/cpu check. I'd expect the
> >> built-in/compiler
> >>>> to
> >>>>> do that for us based on -mcpu, and we could always unconditionally
> use
> >>>> this
> >>>>> (as long as gcc >= 4.5); alternatively, you could use inline asm and
> >> then
> >>>>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a
> >>>> little
> >>>>> odd. I have no objection to it, patch is still fine, just odd.
> >>>>>
> >>>>> Ronald
> >>>>
> >>>> I purposely made the checks for gcc 4.5 and in configure for cpus that
> >>>> support popcnt
> >>>> because otherwise __builtin_popcount (at least gcc's) is slower than
> our
> >>>> generic
> >>>> av_popcount_c function from lavu/common.h.
> >>>> When the CPU supports popcnt the builtin becomes a single inlined
> >>>> instruction.
> >>>>
> >>>> I tried the __asm__ approach, but the code generated by the builtin
> >> seemed
> >>>> better.
> >>>
> >>>
> >>> That's interesting, can you show the code you tried?
> >>>
> >>> Ronald
> >>
> >> If instead of builtins i use
> >>
> >> +#if HAVE_INLINE_ASM
> >> +
> >> +#ifdef __POPCNT__
> >> +
> >> +#define av_popcount av_popcount_abm
> >> +static av_always_inline av_const int av_popcount_abm(uint32_t a)
> >> +{
> >> + int x;
> >> +
> >> + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
> >> + return x;
> >> +}
> >> +
> >> +#if ARCH_X86_64
> >> +#define av_popcount64 av_popcount64_abm
> >> +static av_always_inline av_const int av_popcount64_abm(uint64_t a)
> >> +{
> >> + uint64_t x;
> >> +
> >> + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
> >> + return x;
> >> +}
> >> +#endif
> >> +
> >> +#endif /* __POPCNT__ */
> >> +
> >> +#endif /* HAVE_INLINE_ASM */
> >>
> >> in the x86/ header from the second version i sent, on x86_32
> av_cpu_count
> >> from
> >> libavutil/cpu.o on Windows (Will not work with other OSes because of
> >> HAVE_GETPROCESSAFFINITYMASK) generates
> >>
> >> 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18]
> >> 1b5: 31 c0 xor eax,eax
> >> 1b7: f3 0f b8 c0 popcnt eax,eax
> >> 1bb: 01 c3 add ebx,eax
> >>
> >> Whereas with the builtins i instead get
> >>
> >> 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18]
> >>
> >> This is probably because of av_popcount64_c (Used unconditionally on
> >> x86_32) calling
> >> av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes
> in
> >> x86_32.
> >> The builtin seems to know proc_aff can't be right shifted 32 bits so it
> >> generates
> >> a single popcnt.
> >
> >
> > It seems to me rather than the second version is defined as "pure" (or
> > whatever the keyword was), i.e. "we don't depend on external state" ->
> "if
> > you call this with a constant, we can calculate this at compile-time".
> >
> > Ronald
>
> Do you know how to get the asm version working right for this one case?
>
No, const should do it but looks like inline asm breaks that assumption.
> Otherwise i think the second version with builtins i sent is good to
> apply. Much
> cleaner than the configure approach to be honest.
OK.
Ronald
More information about the ffmpeg-devel
mailing list