[FFmpeg-devel] [PATCHv2] lavc/cbrt_tablegen: speed up tablegen

Daniel Serpell dserpell at gmail.com
Wed Jan 6 16:38:27 CET 2016


Hi!,

El Tue, Jan 05, 2016 at 09:01:40PM -0800, Ganesh Ajjanagadde escribio:
> On Tue, Jan 5, 2016 at 10:46 AM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
> > On Tue, Jan 5, 2016 at 10:10 AM, Daniel Serpell <dserpell at gmail.com> wrote:
> >> Hi!,
> >>
> >> El Tue, Jan 05, 2016 at 08:08:35AM -0800, Ganesh Ajjanagadde escribio:
> >>> On Tue, Jan 5, 2016 at 7:44 AM, Daniel Serpell <dserpell at gmail.com> wrote:
> >>> > Hi!,
> >>> >
> >>> > El Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde escribio:
> >>> >> This exploits an approach based on the sieve of Eratosthenes, a popular
> >>> >> method for generating prime numbers.
> >>> >>
> >>> >> Tables are identical to previous ones.
> >>> >>
> >>> >> Tested with FATE with/without --enable-hardcoded-tables.
> >>> >>
> >>> >> Sample benchmark (Haswell, GNU/Linux+gcc):
> >>> >> prev:
> >>> >> 7860100 decicycles in cbrt_tableinit,       1 runs,      0 skips
> >>> >> 7777490 decicycles in cbrt_tableinit,       2 runs,      0 skips
> >>> >> [...]
> >>> >> 7582339 decicycles in cbrt_tableinit,     256 runs,      0 skips
> >>> >> 7563556 decicycles in cbrt_tableinit,     512 runs,      0 skips
> >>> >>
> >>> >> new:
> >>> >> 2099480 decicycles in cbrt_tableinit,       1 runs,      0 skips
> >>> >> 2044470 decicycles in cbrt_tableinit,       2 runs,      0 skips
> >>> >> [...]
> >>> >> 1796544 decicycles in cbrt_tableinit,     256 runs,      0 skips
> >>> >> 1791631 decicycles in cbrt_tableinit,     512 runs,      0 skips
> >>> >>
> >>> >
> >>> > See attached code, function "test1", based on an approximation of:
> >>> >
> >>> >   (i+1)^(1/3) ~= i^(1/3) * ( 1 + 1/(3i) - 1/(9i) + 5/(81i) - .... )
> [...]
> 
> So I looked up Mr. Fog's manuals, and it seems like divides are
> considerably slower than multiplies and adds. This made me somewhat
> skeptical of your idea, and so I benched it.
> Seems like your patch is actually significantly slower on my setup
> (gcc 5.3.0, GNU libm):
> 3349080 decicycles in cbrt_tableinit,       1 runs,      0 skips
> 3466605 decicycles in cbrt_tableinit,       2 runs,      0 skips
> [...]
> 3425939 decicycles in cbrt_tableinit,     256 runs,      0 skips
> 3414528 decicycles in cbrt_tableinit,     512 runs,      0 skips
> 
> (clang 3.7.0):
> 3337590 decicycles in cbrt_tableinit,       1 runs,      0 skips
> 3276225 decicycles in cbrt_tableinit,       2 runs,      0 skips
> [...]
> 2871065 decicycles in cbrt_tableinit,     256 runs,      0 skips
> 2832137 decicycles in cbrt_tableinit,     512 runs,      0 skips
> 
> The divides seem to be what is killing your approach.
> Times (just for the divisions, clang):
> 1085430 decicycles in cbrt_tableinit,       1 runs,      0 skips
> 1005165 decicycles in cbrt_tableinit,       2 runs,      0 skips
> [...]
>  863732 decicycles in cbrt_tableinit,     256 runs,      0 skips
>  864907 decicycles in cbrt_tableinit,     512 runs,      0 skips
> 
> Another illustration, even if I change the 1.0/(3*i) to simply i to
> get rid of the multiplication as well, obviously not possible but
> really wishful thinking, you still only get:
> (clang)
> 2013390 decicycles in cbrt_tableinit,       1 runs,      0 skips
> 1950135 decicycles in cbrt_tableinit,       2 runs,      0 skips
> [...]
> 1668963 decicycles in cbrt_tableinit,     256 runs,      0 skips
> 1653179 decicycles in cbrt_tableinit,     512 runs,      0 skips
> 
> To complete my curiousity,
> (gcc)
> 1485240 decicycles in cbrt_tableinit,       1 runs,      0 skips
> 1522115 decicycles in cbrt_tableinit,       2 runs,      0 skips
> [...]
> 1324325 decicycles in cbrt_tableinit,     256 runs,      0 skips
> 1322110 decicycles in cbrt_tableinit,     512 runs,      0 skips
> 
> i.e not a whole lot faster than the benchmark I posted.
> 
> If you feel this can't be right, I can do give a higher level
> justification, which shows that for a reasonable libm cbrt, and
> standard architectural assumptions, the approach I have is faster.
> 

Thanks for your tests.

In my PC I measured a higher speedup, perhaps you can replace the main
loop with:

    cb = 7;
    for(i=343; i<8192; i++)
    {
        double s, r, t;
        out[i] = cb * i;
        // For big values, we use 4 terms:
        r = (1.0/3.0)/i;      // 0 A , 1 M , 1 D
        t = r;                //
        s = 1.0 + r;          // 1 A , 1 M , 1 D
        r = r * r;            // 1 A , 2 M , 1 D
        s = s - r;            // 2 A , 2 M , 1 D
        r = r * t;            // 2 A , 3 M , 1 D
        s = s + 1.6644 *r;    // 3 A , 4 M , 1 D
        cb = cb * s;          // 3 A , 5 M , 1 D
    }


I know that divisions are considerably slower than multiplications, but
at least in my PC, the following is slower still (based on newton's
method):

    // "cb" is the cube-root approximation to "1/i²"
    cb = 1.0/49.0;
    for(i=343; i<8192; i++)
    {
        double s = cb*cb*i;
        cb = (1.0/3.0) * (4*cb - s*s);
        s = cb*cb*i;
        cb = (1.0/3.0) * (4*cb - s*s);
        out[i] = cb * i * i;
    }

Perhaps in your PC is faster?

Thanks,

    Daniel.


More information about the ffmpeg-devel mailing list