[FFmpeg-devel] [PATCH] avutil/mathematics: speed up av_gcd by using Stein's binary GCD algorithm

Michael Niedermayer michael at niedermayer.cc
Sun Oct 11 02:13:59 CEST 2015


On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote:
> On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer
> <michael at niedermayer.cc> wrote:
> > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote:
> >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde
> >> <gajjanagadde at gmail.com> wrote:
> >> > This uses Stein's binary GCD algorithm:
> >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm
> >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard architectures.
> >> > Have not benchmarked, so I can't comment
> >>
> >> Before you submit a patch that's supposed to make something faster,
> >> you should benchmark it to verify that it is in fact faster. Do this
> >> with inputs of various sizes on both 32- and 64-bit architectures and
> >> both with and without compilers that support __builtin_ctzll(v).
> >
> > without __builtin_ctzll() the old code seems faster in a simple test
> > like:
> > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov
> 
> >
> > also it seems a simple
> >     while (u != v) {
> >         if (u > v)
> >             FFSWAP(int64_t, v, u);
> >         v -= u;
> >         do {
> >             v >>= 1;
> >         } while(!(v & 1));
> >     }
> > is faster than the non builtin ctzll in its place but still not as
> > fast as the current code
> 
> So I checked out the various implementation of (non built in) ctzll at
> https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch.
> The funny thing is, among the looping ones, the fastest is the simplest :):
> static inline int ff_ctzll_c(long long v)
> {
>     int c = 0;
>     while (!(v & 1)) {
>         c++;
>         v >>= 1;
>     }
>     return c;
> }

its possible gcc/clang recognizes this and replaces it by the
instruction for the builtin
this is perfectly fine when such an instruction is available
but when not this might be slower than what we would expect from
just testing on architectures with such instructions


> 
> The De Bruijn based approach is intriguing:
> https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup,
> since there is no loop there, and assuming multiplies are fast, it
> should be good.


> 
> I guess one could in principle simply write the asm instruction, but I
> can't do that with my current knowledge.

this would only work on architectures that do have such
instruction, i dont know to which non x86 architectures this applies

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

If a bugfix only changes things apparently unrelated to the bug with no
further explanation, that is a good sign that the bugfix is wrong.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20151011/906ff256/attachment.sig>


More information about the ffmpeg-devel mailing list