[FFmpeg-devel] [PATCH v4] libswscale/ppc: VSX-optimize 9-16 bit yuv2planeX

Michael Niedermayer michael at niedermayer.cc
Fri Jan 11 10:56:15 EET 2019


On Thu, Jan 10, 2019 at 11:55:34AM +0200, Lauri Kasanen wrote:
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
> -s 1920x1728 -f null -vframes 100 -v error -nostats -
> 
> 9-14 bit funcs get about 6x speedup, 16-bit gets about 15x.
> Fate passes, each format tested with an image to video conversion.
> 
> Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
> of the 16-bit function. This includes the vec_mulo/mule functions too,
> not just vmuluwm.
> 
> yuv420p9le
>   12341 UNITS in planarX,  130976 runs,     96 skips
>   73752 UNITS in planarX,  131066 runs,      6 skips
> yuv420p9be
>   12364 UNITS in planarX,  131025 runs,     47 skips
>   73001 UNITS in planarX,  131055 runs,     17 skips
> yuv420p10le
>   12386 UNITS in planarX,  131042 runs,     30 skips
>   72735 UNITS in planarX,  131062 runs,     10 skips
> yuv420p10be
>   12337 UNITS in planarX,  131045 runs,     27 skips
>   72734 UNITS in planarX,  131057 runs,     15 skips
> yuv420p12le
>   12236 UNITS in planarX,  131058 runs,     14 skips
>   73029 UNITS in planarX,  131062 runs,     10 skips
> yuv420p12be
>   12218 UNITS in planarX,  130973 runs,     99 skips
>   72402 UNITS in planarX,  131069 runs,      3 skips
> yuv420p14le
>   12168 UNITS in planarX,  131067 runs,      5 skips
>   72480 UNITS in planarX,  131069 runs,      3 skips
> yuv420p14be
>   12358 UNITS in planarX,  130948 runs,    124 skips
>   73772 UNITS in planarX,  131063 runs,      9 skips
> yuv420p16le
>   10439 UNITS in planarX,  130911 runs,    161 skips
>  157923 UNITS in planarX,  131068 runs,      4 skips
> yuv420p16be
>   10463 UNITS in planarX,  130874 runs,    198 skips
>  154405 UNITS in planarX,  131061 runs,     11 skips
> 
> Signed-off-by: Lauri Kasanen <cand at gmx.com>
> ---
> 
> v2: Separate macros so that yuv2plane1_16_vsx remains available for power7
> v3: Remove accidental tabs, switch to HAVE_POWER8 from configure + runtime check
> v4: #if HAVE_POWER8
> 
>  libswscale/ppc/swscale_ppc_template.c |   4 +-
>  libswscale/ppc/swscale_vsx.c          | 195 +++++++++++++++++++++++++++++++++-
>  2 files changed, 193 insertions(+), 6 deletions(-)
[...]
> +static void yuv2planeX_16_vsx(const int16_t *filter, int filterSize,
> +                              const int32_t **src, uint16_t *dest, int dstW,
> +                              int big_endian, int output_bits)
> +{
> +    const int dst_u = -(uintptr_t)dest & 7;
> +    const int shift = 15;
> +    const int bias = 0x8000;
> +    const int add = (1 << (shift - 1)) - 0x40000000;
> +    const uint16_t swap = big_endian ? 8 : 0;
> +    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
> +    const vector uint32_t vshift = (vector uint32_t) {shift, shift, shift, shift};
> +    const vector uint16_t vswap = (vector uint16_t) {swap, swap, swap, swap, swap, swap, swap, swap};
> +    const vector uint16_t vbias = (vector uint16_t) {bias, bias, bias, bias, bias, bias, bias, bias};
> +    vector int32_t vfilter[MAX_FILTER_SIZE];
> +    vector uint16_t v;
> +    vector uint32_t vleft, vright, vtmp;
> +    vector int32_t vin32l, vin32r;
> +    int i, j;
> +
> +    for (i = 0; i < filterSize; i++) {
> +        vfilter[i] = (vector int32_t) {filter[i], filter[i], filter[i], filter[i]};
> +    }
> +
> +    yuv2planeX_16_u(filter, filterSize, src, dest, dst_u, big_endian, output_bits, 0);
> +
> +    for (i = dst_u; i < dstW - 7; i += 8) {
> +        vleft = vright = vadd;
> +
> +        for (j = 0; j < filterSize; j++) {
> +            vin32l = vec_vsx_ld(0, &src[j][i]);
> +            vin32r = vec_vsx_ld(0, &src[j][i + 4]);
> +

> +#ifdef __GNUC__
> +            // GCC does not support vmuluwm yet. Bug open.

This should probably be tested by configure, similar to how other
compiler limitations are tested.
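
For readers following the quoted loop, here is a minimal scalar sketch of
the per-pixel arithmetic the vector code appears to perform, using the
shift, bias and add constants from the patch. The names yuv2planeX_16_ref
and clip_int16 are illustrative assumptions, not part of the patch. The
explicit byte stores stand in for the vswap rotate, and the unused
output_bits argument is omitted.

```c
#include <stdint.h>

/* Clip to the int16_t range; a stand-in for a helper such as av_clip_int16. */
static int clip_int16(int32_t v)
{
    if (v < INT16_MIN) return INT16_MIN;
    if (v > INT16_MAX) return INT16_MAX;
    return v;
}

/* Hypothetical scalar reference for the 16-bit vertical filter. */
static void yuv2planeX_16_ref(const int16_t *filter, int filterSize,
                              const int32_t **src, uint16_t *dest, int dstW,
                              int big_endian)
{
    const int shift = 15;
    const int bias = 0x8000;
    /* Rounding term minus an offset that keeps the sum in signed range. */
    const int32_t add = (1 << (shift - 1)) - 0x40000000;

    for (int i = 0; i < dstW; i++) {
        /* Accumulate in unsigned so 32-bit wraparound is well defined. */
        uint32_t acc = (uint32_t)add;
        for (int j = 0; j < filterSize; j++)
            acc += (uint32_t)src[j][i] * (uint32_t)filter[j];

        /* Assumes two's complement conversion and an arithmetic right
         * shift, which holds on the usual compilers. */
        int32_t val = (int32_t)acc >> shift;

        /* The bias re-adds the offset: 0x40000000 >> 15 == 0x8000. */
        uint16_t out = (uint16_t)(bias + clip_int16(val));

        /* Store bytes explicitly so the result is host-endian independent. */
        uint8_t *p = (uint8_t *)&dest[i];
        if (big_endian) {
            p[0] = (uint8_t)(out >> 8);
            p[1] = (uint8_t)(out & 0xff);
        } else {
            p[0] = (uint8_t)(out & 0xff);
            p[1] = (uint8_t)(out >> 8);
        }
    }
}
```

With a single tap of 4096 this reduces to (src + 4) >> 3, saturated to
16 bits, which gives a quick sanity check against any vectorized version.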


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Old school: Use the lowest level language in which you can solve the problem
            conveniently.
New school: Use the highest level language in which the latest supercomputer
            can solve the problem without the user falling asleep waiting.