[FFmpeg-devel] [PATCHv2] lavc/cbrt_tablegen: speed up tablegen
Ganesh Ajjanagadde
gajjanag at mit.edu
Wed Jan 6 18:08:17 CET 2016
On Wed, Jan 6, 2016 at 7:38 AM, Daniel Serpell <dserpell at gmail.com> wrote:
> Hi!,
>
> El Tue, Jan 05, 2016 at 09:01:40PM -0800, Ganesh Ajjanagadde escribio:
>> On Tue, Jan 5, 2016 at 10:46 AM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
>> > On Tue, Jan 5, 2016 at 10:10 AM, Daniel Serpell <dserpell at gmail.com> wrote:
>> >> Hi!,
>> >>
>> >> El Tue, Jan 05, 2016 at 08:08:35AM -0800, Ganesh Ajjanagadde escribio:
>> >>> On Tue, Jan 5, 2016 at 7:44 AM, Daniel Serpell <dserpell at gmail.com> wrote:
>> >>> > Hi!,
>> >>> >
>> >>> > El Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde escribio:
>> >>> >> This exploits an approach based on the sieve of Eratosthenes, a popular
>> >>> >> method for generating prime numbers.
>> >>> >>
>> >>> >> Tables are identical to previous ones.
>> >>> >>
>> >>> >> Tested with FATE with/without --enable-hardcoded-tables.
>> >>> >>
>> >>> >> Sample benchmark (Haswell, GNU/Linux+gcc):
>> >>> >> prev:
>> >>> >> 7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> >>> >> 7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> >>> >> [...]
>> >>> >> 7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> >>> >> 7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips
>> >>> >>
>> >>> >> new:
>> >>> >> 2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> >>> >> 2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> >>> >> [...]
>> >>> >> 1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> >>> >> 1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips
>> >>> >>
>> >>> >
>> >>> > See attached code, function "test1", based on an approximation of:
>> >>> >
>> >>> > (i+1)^(1/3) ~= i^(1/3) * ( 1 + 1/(3i) - 1/(9i) + 5/(81i) - .... )
>> [...]
>>
>> So I looked up Mr. Fog's manuals, and it seems like divides are
>> considerably slower than multiplies and adds. This made me somewhat
>> skeptical of your idea, and so I benched it.
>> Seems like your patch is actually significantly slower on my setup
>> (gcc 5.3.0, GNU libm):
>> 3349080 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> 3466605 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> [...]
>> 3425939 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> 3414528 decicycles in cbrt_tableinit, 512 runs, 0 skips
>>
>> (clang 3.7.0):
>> 3337590 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> 3276225 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> [...]
>> 2871065 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> 2832137 decicycles in cbrt_tableinit, 512 runs, 0 skips
>>
>> The divides seem to be what is killing your approach.
>> Times (just for the divisions, clang):
>> 1085430 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> 1005165 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> [...]
>> 863732 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> 864907 decicycles in cbrt_tableinit, 512 runs, 0 skips
>>
>> Another illustration, even if I change the 1.0/(3*i) to simply i to
>> get rid of the multiplication as well, obviously not possible but
>> really wishful thinking, you still only get:
>> (clang)
>> 2013390 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> 1950135 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> [...]
>> 1668963 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> 1653179 decicycles in cbrt_tableinit, 512 runs, 0 skips
>>
>> To complete my curiousity,
>> (gcc)
>> 1485240 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> 1522115 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> [...]
>> 1324325 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> 1322110 decicycles in cbrt_tableinit, 512 runs, 0 skips
>>
>> i.e not a whole lot faster than the benchmark I posted.
>>
>> If you feel this can't be right, I can do give a higher level
>> justification, which shows that for a reasonable libm cbrt, and
>> standard architectural assumptions, the approach I have is faster.
>>
>
> Thanks for your tests.
>
> In my PC I measured a higher speedup, perhaps you can replace the main
> loop with:
>
> cb = 7;
> for(i=343; i<8192; i++)
> {
> double s, r, t;
> out[i] = cb * i;
> // For big values, we use 4 terms:
> r = (1.0/3.0)/i; // 0 A , 1 M , 1 D
> t = r; //
> s = 1.0 + r; // 1 A , 1 M , 1 D
> r = r * r; // 1 A , 2 M , 1 D
> s = s - r; // 2 A , 2 M , 1 D
> r = r * t; // 2 A , 3 M , 1 D
> s = s + 1.6644 *r; // 3 A , 4 M , 1 D
> cb = cb * s; // 3 A , 5 M , 1 D
> }
All of the below done with gcc.
Better, but still slower than proposed:
2620600 decicycles in cbrt_tableinit, 1 runs, 0 skips
2666850 decicycles in cbrt_tableinit, 2 runs, 0 skips
[...]
2476271 decicycles in cbrt_tableinit, 256 runs, 0 skips
2469658 decicycles in cbrt_tableinit, 512 runs, 0 skips
>
>
> I know that divisions are considerably slower than multiplications, but
> at least in my PC, the following is slower still (based on newton's
> method):
>
> // "cb" is the cube-root approximation to "1/i²"
> cb = 1.0/49.0;
> for(i=343; i<8192; i++)
> {
> double s = cb*cb*i;
> cb = (1.0/3.0) * (4*cb - s*s);
> s = cb*cb*i;
> cb = (1.0/3.0) * (4*cb - s*s);
> out[i] = cb * i * i;
> }
As you suspected, this is slower still:
2949680 decicycles in cbrt_tableinit, 1 runs, 0 skips
2990155 decicycles in cbrt_tableinit, 2 runs, 0 skips
[...]
2768156 decicycles in cbrt_tableinit, 256 runs, 0 skips
2747967 decicycles in cbrt_tableinit, 512 runs, 0 skips
>
> Perhaps in your PC is faster?
I am running a standard Haswell laptop setup, i7-4700MQ.
Thank you for trying these ideas.
BTW, if it helps you in identifying where (if at all) you can improve
your approach, I will present a somewhat simplified analysis of what I
think are the strengths of each, purely from a speed perspective,
since here at least I find readability subjective. For instance,
admittedly the algorithm I use is a more complex one, but you also
have the issue of hard to read inner loop (many variables), and
carefully handpicked perturbations of the mathematical 5/3, 10/3
constants. I tried writing an eval_poly for simplifying your logic
(based off Horner's); it results in approximately 10% slowdown of your
approach, likely due to some unnecessary multiplies for the unity
coefficients. I made it inline and the poly a static const, so in
principle multiplications by 1 can be avoided, but whatever.
Strengths of yours:
1. Minimal calls to cbrt. For an expensive cbrt, your approach may be
better. Number of calls for you is 64, may be tightened with some
better polynomials possibly down to 40 or so. For me, it is ~ 900
calls, essentially number of primes < 8192 ~ 8192/ln(8192).
2. Less branching.
Strengths of mine:
1. No divides. Your approach has nearly 8000 of them.
2. No additions. You have roughly 20,000 to 24,000 of them.
3. Fewer multiplications. This requires some analysis (or brute force
checking). Analysis given here. The number of multiplies is
essentially sum(exponents of primes) in 8192! (factorial). This is
sum(floor(n/p^i)) over p, may be upper bounded by n*sum(1/(p-1)) over
p, which is approximately n*(sum(1/p)+small constant). Now, sum(1/p) ~
ln(ln(n)), so there are roughly 2.2+c, definitely less than 4, likely
around 3 multiplications per element. You have approximately 5
multiplications per element.
4. Possible vectorization of these multiplies. Loops may be unrolled,
and multiplications done simultaneously. Your approach is quite
sequential, each cb value depending on previous one. In this approach,
once the prime is identified, the iteration is very straightforward.
>
> Thanks,
>
> Daniel.
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
More information about the ffmpeg-devel
mailing list