[FFmpeg-devel] [PATCH] PPC64: Add IBM POWER8 SIMD Implementation

Dan Parrot dan.parrot at mail.com
Tue Jun 28 22:40:13 CEST 2016


On Wed, 2016-06-22 at 20:33 -0300, James Almer wrote:
> On 6/22/2016 8:15 PM, Dan Parrot wrote:
> > On Thu, 2016-06-23 at 01:03 +0200, Michael Niedermayer wrote:
> >> On Tue, Jun 21, 2016 at 12:04:42AM -0500, Dan Parrot wrote:
> >>> On Tue, 2016-06-21 at 02:22 +0200, Michael Niedermayer wrote:
> >>>> On Mon, Jun 20, 2016 at 06:38:18PM -0500, Dan Parrot wrote:
> >>>>> On Tue, 2016-06-21 at 01:06 +0200, Michael Niedermayer wrote:
> >>>>>> On Mon, Jun 20, 2016 at 05:55:47PM -0500, Dan Parrot wrote:
> >>>>>>> On Tue, 2016-06-21 at 00:45 +0200, Michael Niedermayer wrote:
> >>>>>>>> On Sun, Jun 19, 2016 at 09:57:42PM +0000, Dan Parrot wrote:
> >>>>>>>>> First commit addressing Trac ticket #5570. Functions defined in libswscale/input.c
> >>>>>>>>> have corresponding SIMD definitions in libswscale/ppc/input_vsx.c
> >>>>>>>>> ---
> >>>>>>>>>  libswscale/ppc/Makefile       |    1 +
> >>>>>>>>>  libswscale/ppc/input_vsx.c    | 1070 +++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>>  libswscale/swscale.c          |    3 +
> >>>>>>>>>  libswscale/swscale_internal.h |    1 +
> >>>>>>>>>  4 files changed, 1075 insertions(+)
> >>>>>>>>>  create mode 100644 libswscale/ppc/input_vsx.c
> >>>>>>>>>
> >>>>>>>>> diff --git a/libswscale/ppc/Makefile b/libswscale/ppc/Makefile
> >>>>>>>>> index d1b596e..2482893 100644
> >>>>>>>>> --- a/libswscale/ppc/Makefile
> >>>>>>>>> +++ b/libswscale/ppc/Makefile
> >>>>>>>>> @@ -1,3 +1,4 @@
> >>>>>>>>>  OBJS += ppc/swscale_altivec.o                                           \
> >>>>>>>>> +        ppc/input_vsx.o                                                 \
> >>>>>>>>>          ppc/yuv2rgb_altivec.o                                           \
> >>>>>>>>>          ppc/yuv2yuv_altivec.o                                           \
> >>>>>>>>> diff --git a/libswscale/ppc/input_vsx.c b/libswscale/ppc/input_vsx.c
> >>>>>>>>> new file mode 100644
> >>>>>>>>> index 0000000..adb0e38
> >>>>>>>>> --- /dev/null
> >>>>>>>>> +++ b/libswscale/ppc/input_vsx.c
> >>>>>>>>> @@ -0,0 +1,1070 @@
> >>>>>>>>> +/*
> >>>>>>>>> + * Copyright (C) 2016 Dan Parrot <dan.parrot at mail.com>
> >>>>>>>>> + *
> >>>>>>>>> + * This file is part of FFmpeg.
> >>>>>>>>> + *
> >>>>>>>>> + * FFmpeg is free software; you can redistribute it and/or
> >>>>>>>>> + * modify it under the terms of the GNU Lesser General Public
> >>>>>>>>> + * License as published by the Free Software Foundation; either
> >>>>>>>>> + * version 2.1 of the License, or (at your option) any later version.
> >>>>>>>>> + *
> >>>>>>>>> + * FFmpeg is distributed in the hope that it will be useful,
> >>>>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >>>>>>>>> + * Lesser General Public License for more details.
> >>>>>>>>> + *
> >>>>>>>>> + * You should have received a copy of the GNU Lesser General Public
> >>>>>>>>> + * License along with FFmpeg; if not, write to the Free Software
> >>>>>>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> >>>>>>>>> + */
> >>>>>>>>> +
> >>>>>>>>> +#include <math.h>
> >>>>>>>>> +#include <stdint.h>
> >>>>>>>>> +#include <stdio.h>
> >>>>>>>>> +#include <string.h>
> >>>>>>>>> +
> >>>>>>>>> +#include "libavutil/avutil.h"
> >>>>>>>>> +#include "libavutil/bswap.h"
> >>>>>>>>> +#include "libavutil/cpu.h"
> >>>>>>>>> +#include "libavutil/intreadwrite.h"
> >>>>>>>>> +#include "libavutil/mathematics.h"
> >>>>>>>>> +#include "libavutil/pixdesc.h"
> >>>>>>>>> +#include "libavutil/avassert.h"
> >>>>>>>>> +#include "config.h"
> >>>>>>>>> +#include "libswscale/rgb2rgb.h"
> >>>>>>>>> +#include "libswscale/swscale.h"
> >>>>>>>>> +#include "libswscale/swscale_internal.h"
> >>>>>>>>> +
> >>>>>>>>> +#define input_pixel(pos) (isBE(origin) ? AV_RB16(pos) : AV_RL16(pos))
> >>>>>>>>> +
> >>>>>>>>> +#define r ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? b_r : r_b)
> >>>>>>>>> +#define b ((origin == AV_PIX_FMT_BGR48BE || origin == AV_PIX_FMT_BGR48LE || origin == AV_PIX_FMT_BGRA64BE || origin == AV_PIX_FMT_BGRA64LE) ? r_b : b_r)
> >>>>>>>>> +
> >>>>>>>>> +#if HAVE_VSX
> >>>>>>>>> +
> >>>>>>>>> +// This is a SIMD version for IBM POWER8 of function rgb64ToY_c_template
> >>>>>>>>> +// in file libswscale/input.c
> >>>>>>>>> +static av_always_inline void
> >>>>>>>>> +rgb64ToY_c_template_vsx(uint16_t *dst, const uint16_t *src, int width,
> >>>>>>>>> +                        enum AVPixelFormat origin, int32_t *rgb2yuv)
> >>>>>>>>> +{
> >>>>>>>>> +    int32_t ry = rgb2yuv[RY_IDX], gy = rgb2yuv[GY_IDX], by = rgb2yuv[BY_IDX];
> >>>>>>>>> +    int i, j;
> >>>>>>>>> +    int num_vec, frag;
> >>>>>>>>> +
> >>>>>>>>> +    num_vec = width / 8;
> >>>>>>>>> +    frag    = width % 8;
> >>>>>>>>> +
> >>>>>>>>> +    vector int v_ry = vec_splats((int)ry);
> >>>>>>>>> +    vector int v_gy = vec_splats((int)gy);
> >>>>>>>>> +    vector int v_by = vec_splats((int)by);
> >>>>>>>>> +
> >>>>>>>>> +    int s_opr2;
> >>>>>>>>> +    s_opr2 = (int)(0x2001 << (RGB2YUV_SHIFT-1));
> >>>>>>>>> +
> >>>>>>>>> +    vector int v_opr1 = vec_splats((int)RGB2YUV_SHIFT);
> >>>>>>>>> +    vector int v_opr2 = vec_splats((int)s_opr2);
> >>>>>>>>> +
> >>>>>>>>> +    vector int v_r, v_g, v_b, v_tmp;
> >>>>>>>>> +    vector short v_tmpi, v_dst;
> >>>>>>>>> +
> >>>>>>>>> +    for (i = 0; i < num_vec; i++) {
> >>>>>>>>> +        for (j = 7; j >= 0  ; j--) {
> >>>>>>>>> +            int r_b = input_pixel(&src[(i*8+j)*4+0]);
> >>>>>>>>> +            int g   = input_pixel(&src[(i*8+j)*4+1]);
> >>>>>>>>> +            int b_r = input_pixel(&src[(i*8+j)*4+2]);
> >>>>>>>>> +
> >>>>>>>>> +            v_r[j % 4] = r;
> >>>>>>>>> +            v_g[j % 4] = g;
> >>>>>>>>> +            v_b[j % 4] = b;
> >>>>>>>>> +
> >>>>>>>>> +            if (!(j % 4)) {
> >>>>>>>>                        ^
> >>>>>>>>
> >>>>>>>>> +                v_tmp = v_ry * v_r;
> >>>>>>>>> +                v_tmp = v_tmp + v_gy * v_g;
> >>>>>>>>> +                v_tmp = v_tmp + v_by * v_b;
> >>>>>>>>> +                v_tmp = v_tmp + v_opr2;
> >>>>>>>>> +                v_tmp = vec_sr(v_tmp, (vector unsigned int)v_opr1);
> >>>>>>>>> +
> >>>>>>>>> +                v_tmpi  = (vector short)v_tmp;
> >>>>>>>>> +                v_dst[(j / 4) * 4 + 3]  = v_tmpi[6];
> >>>>>>>>                             ^
> >>>>>>>> What is the speed of a division and modulo on PPC compared to a
> >>>>>>>> bitwise AND?
> >>>>>>>>
> >>>>>>>> It's also not trivial for the compiler to optimize them out, as it
> >>>>>>>> has to prove the variables are never negative.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [...]
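
(For reference, the substitution in question is the usual strength
reduction. Below is a minimal standalone sketch, assuming a non-negative
index; with a signed j the compiler must first prove j >= 0 before it can
emit the cheap forms.)

    #include <stdio.h>

    int main(void)
    {
        unsigned j;

        /* For non-negative j, j % 4 == (j & 3) and j / 4 == (j >> 2);
         * the AND/shift forms avoid the divide/remainder sequences the
         * question above is about. */
        for (j = 0; j < 8; j++)
            printf("j=%u  %%4=%u  &3=%u  /4=%u  >>2=%u\n",
                   j, j % 4, j & 3, j / 4, j >> 2);
        return 0;
    }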
> >>>>>>>
> >>>>>>> I don't know the answer to those questions, but I see your point.
> >>>>>>> Bitwise operations should be faster than multiplications and divisions.
> >>>>>>> So I shall change the code to use bitwise ops and compare the execution
> >>>>>>> time against its present value (well, the average value over multiple runs).
> >>>>>>
> >>>>>> It might not make a difference if the compiler does optimize them out,
> >>>>>> but you should know the approximate speed of the instructions
> >>>>>> that your code is likely/potentially going to use. I mean, this is
> >>>>>> code optimized for PPC.
> >>>>>>
> >>>>>> [...]
> >>>>>
> >>>>> I take exception to the tone in that last sentence and I shall respond
> >>>>> in the same spirit.
> >>>>>
> >>>>> 1. I could spend time obtaining the detailed POWER8 microarchitecture
> >>>>> description and compare the execution time of each machine instruction.
> >>>>> 2. Study the gcc source to find out exactly which machine instructions it
> >>>>> generates for each C language operator.
> >>>>> 3. Use 1 and 2 above to determine which C operators to use here.
> >>>>>
> >>>>> OR
> >>>>>
> >>>>> I could go ahead and run 2 simulations and compare their average
> >>>>> execution times.
> >>>>>
> >>>>> Seems to me pretty clear which is a better use of time.
> >>>>
> >>>> Knowing the execution times of instructions is quite useful.
> >>>> It certainly takes a lot more time to search for and read that than to
> >>>> benchmark once, but once you know the timings approximately you can
> >>>> roughly guess how fast/slow some code is just by looking at it.
> >>>> If knowing that doesn't interest you then please ignore my comment;
> >>>> to me these things always feel interesting, and knowing this kind of
> >>>> stuff for x86 was certainly quite useful when optimizing code.
> >>>>
> >>>> Also, what are you testing on?
> >>>>
> >>>> [...]
> >>>
> >>> Changing the operations to use bitwise operators instead of
> >>> multiplications, divisions and modulo arithmetic did not appreciably
> >>> change execution times. The execution times of the two versions were
> >>> within 1.5 seconds of each other in the worst, best and average times
> >>> reported by the command "/usr/bin/time -p make -j 4 fate
> >>> SAMPLES=fate-suite/". Worst-case real times clustered around 310s;
> >>> best-case times clustered around 296s.
> >>>
> >>
> >>> Do you want me to submit a patch using the bitwise operators to replace
> >>> the previous patch?
> >>
> >> I would slightly prefer that, but it doesn't really matter if they are
> >> the same speed.
> >>
> >> How much faster is your patch compared to before your patch?
> >>
> >> [...]
> > 
> > The averages over 10 runs for "make -j 4 fate" are:
> > 
> > Pre-patch: 308.0s
> > Post-patch: 304.5s
> 
> Ideally you'd use the timer.h macros to wrap calls to these functions in
> order to test them alone without all the overhead, or at least try
> "ffmpeg -benchmark -threads 1 -i INPUT -f null -" or similar to decode,
> one at a time, files that use some or all of these functions.
> As mentioned before, there are tons of variables that can change the
> results of a make fate run.
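
(For reference, the timer.h route would look roughly like the sketch
below. The wrapped routine here is a hypothetical stand-in, not the exact
static function or prototype from libswscale/input.c; START_TIMER and
STOP_TIMER log the elapsed cycle counts.)

    #include <stdint.h>
    #include <string.h>
    #include "libavutil/timer.h"

    /* Stand-in for one of the input.c conversion routines; in the real
     * test the wrapped call would be the function under review. */
    static void convert(uint8_t *dst, const uint8_t *src, int width)
    {
        int i;
        for (i = 0; i < width; i++)
            dst[i] = src[2 * i];          /* yuy2ToY-style luma copy */
    }

    int main(void)
    {
        uint8_t src[2 * 1024], dst[1024];
        int run;

        memset(src, 0x80, sizeof(src));
        for (run = 0; run < 100; run++) { /* repeat so the numbers settle */
            START_TIMER
            convert(dst, src, 1024);
            STOP_TIMER("convert")         /* logs elapsed cycles */
        }
        return dst[0];                    /* keep the result live */
    }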

I used SystemTap probes to measure the time between function entry and
function exit for all functions in libswscale/input.c. The functions in
the patch were all slower than the non-SIMD versions currently in the
repository. Having examined the generated assembly, I don't believe they
can be improved on PPC64. The main reason is that the generated code
contains no register-register moves between the integer unit registers
and the VSX SIMD unit registers: every transfer between the integer
registers and the SIMD registers goes through memory load and store
instructions. So if either the input or the output data is
non-contiguous in memory, splicing scalar data together to form a SIMD
vector, or splitting a vector back into scalars, increases memory
traffic and hence the running time.
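
To make that concrete, the sketch below (illustrative only, not code from
the patch; the helper names are made up) contrasts the two access
patterns. The element-by-element inserts in gather_r() are what the
strided paths compile down to, i.e. scalar values pushed through memory
and reloaded into the vector register, while load_contiguous() shows the
single 16-byte load available when the samples are adjacent.

    #include <altivec.h>
    #include <stdint.h>

    /* Gather path: four 16-bit samples that are not adjacent in memory
     * (e.g. the R components of packed 64-bit RGBA pixels). Each element
     * insert below goes through memory (scalar store plus vector reload)
     * because the values start out in integer registers. */
    static vector signed int gather_r(const uint16_t *src)
    {
        vector signed int v_r = vec_splats(0);
        int j;

        for (j = 0; j < 4; j++)
            v_r[j] = src[j * 4];          /* strided scalar reads, spliced in */
        return v_r;
    }

    /* Contiguous path: when eight 16-bit samples are adjacent, a single
     * vector load brings them in without touching the integer unit. */
    static vector unsigned short load_contiguous(const uint16_t *src)
    {
        return vec_vsx_ld(0, src);
    }

    int main(void)
    {
        uint16_t buf[32] = { 0 };
        vector signed int     a = gather_r(buf);
        vector unsigned short b = load_contiguous(buf);
        return a[0] + b[0];               /* keep the results live */
    }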

The above means that the number of functions in libswscale/input.c that
benefit from SIMD is cut by about half. Here are the times reported by
SystemTap for the functions whose running times improve (SIMD versions
have the suffix _vsx):
==================================================================
function          calls   min (ns)  avg (ns)  max (ns)  total (ns)
------------------------------------------------------------------
yuy2ToY_c_vsx       864       1880      2014     29844     1740366
yuy2ToY_c           864       2326      2451     15950     2118226

yvy2ToUV_c_vsx      288       1891      1989     13644      573038
yvy2ToUV_c          288       2089      2131      2462      613813

rgbaToA_c_vsx      1152       1975      2123     31356     2446276
rgbaToA_c          1152       2368      2448     12496     2820401

uyvyToUV_c_vsx      288       1901      1932      2122      556697
uyvyToUV_c          288       2088      2129      2370      613202

uyvyToY_c_vsx       576       1877      1956     15821     1127222
uyvyToY_c           576       2325      2408     15332     1387168

nv12ToUV_c_vsx      144       1869      2006     15480      288867
nv12ToUV_c          144       2101      2273     19774      327432

abgrToA_c_vsx      1152       1949      2060     15496     2373206
abgrToA_c          1152       2374      2471     52452     2847044

yuy2ToUV_c_vsx      288       1873      1972     16608      568154
yuy2ToUV_c          288       2087      2123      2252      611621

nv21ToUV_c_vsx      144       1879      2019     14290      290860
nv21ToUV_c          144       2098      2233     14750      321692
==================================================================

The dataset used was fate-filter-pixfmts-scale. Let me know if the
performance numbers are acceptable for submission of a patch
incorporating the changes.

Thanks.
Dan.



