[FFmpeg-devel] libavcodec/exr : add x86 SIMD for predictor
James Almer
jamrial at gmail.com
Sun Oct 1 17:14:45 EEST 2017
On 10/1/2017 9:47 AM, Henrik Gramner wrote:
> On Fri, Sep 22, 2017 at 11:12 PM, Martin Vignali
> <martin.vignali at gmail.com> wrote:
>> +static void predictor_scalar(uint8_t *src, ptrdiff_t size)
>> +{
>> + uint8_t *t = src + 1;
>> + uint8_t *stop = src + size;
>> +
>> + while (t < stop) {
>> + int d = (int) t[-1] + (int) t[0] - 128;
>> + t[0] = d;
>> + ++t;
>> + }
>> +}
>
> Can be simplified quite a bit:
>
> static void predictor_scalar(uint8_t *src, ptrdiff_t size)
> {
> for (size_t i = 1; i < size; i++)
We normally use int for counters, and don't mix declarations and statements.
And in any case, ptrdiff_t would be "more correct" for this.
> src[i] += src[i-1] - 128;
> }
>
>> +SECTION_RODATA 32
>> +
>> +neg_128: times 16 db -128
>> +shuffle_15: times 16 db 15
>
> Drop the 32-byte alignment from the section directive, we don't need it here.
>
> db -128 is weird since it's identical to +128. I would rename those as such:
>
> pb_128: times 16 db 128
> pb_15: times 16 db 15
We have both of those in constants.c, so use instead
cextern pb_15
cextern pb_80
>
>> +INIT_XMM ssse3
>> +cglobal predictor, 2,3,5, src, size, tmp
>> +
>> + mov tmpb, [srcq]
>> + xor tmpb, -128
>> + mov [srcq], tmpb
>> +
>> +;offset src by size
>> + add srcq, sizeq
>> + neg sizeq ; size = offset for src
>> +
>> +;init mm
>> + mova m0, [neg_128] ; m0 = const for xor high byte
>> + mova m1, [shuffle_15] ; m1 = shuffle mask
>> + pxor m2, m2 ; m2 = prev_buffer
>> +
>> +
>> +.loop:
>> + mova m3, [srcq + sizeq]
>> + pxor m3, m0
>> +
>> + ;compute prefix sum
>> + mova m4, m3
>> + pslldq m4, 1
>> +
>> + paddb m4, m3
>> + mova m3, m4
>> + pslldq m3, 2
>> +
>> + paddb m3, m4
>> + mova m4, m3
>> + pslldq m4, 4
>> +
>> + paddb m4, m3
>> + mova m3, m4
>> + pslldq m3, 8
>> +
>> + paddb m4, m2
>> + paddb m4, m3
>> +
>> + mova [srcq + sizeq], m4
>> +
>> + ;broadcast high byte for next iter
>> + pshufb m4, m1
>> + mova m2, m4
>> +
>> + add sizeq, mmsize
>> + jl .loop
>> + RET
>
> %macro PREDICTOR 0
> cglobal predictor, 2,3,5, src, size, tmp
> %if mmsize == 32
> vbroadcasti128 m0, [pb_128]
> %else
> mova xm0, [pb_128]
> %endif
> mova xm1, [pb_15]
> mova xm2, xm0
> add srcq, sizeq
> neg sizeq
> .loop:
> pxor m3, m0, [srcq + sizeq]
> pslldq m4, m3, 1
> paddb m3, m4
> pslldq m4, m3, 2
> paddb m3, m4
> pslldq m4, m3, 4
> paddb m3, m4
> pslldq m4, m3, 8
> %if mmsize == 32
> paddb m3, m4
> paddb xm2, xm3
> vextracti128 xm4, m3, 1
> mova [srcq + sizeq], xm2
> pshufb xm2, xm1
> paddb xm2, xm4
> mova [srcq + sizeq + 16], xm2
> %else
> paddb m2, m3
> paddb m2, m4
> mova [srcq + sizeq], m2
> %endif
> pshufb xm2, xm1
> add sizeq, mmsize
> jl .loop
> RET
> %endmacro
>
> INIT_XMM ssse3
> PREDICTOR
>
> INIT_XMM avx
> PREDICTOR
>
> %if HAVE_AVX2_EXTERNAL
> INIT_YMM avx2
> PREDICTOR
> %endif
>
> predictor_c: 15351.5
> predictor_ssse3: 1206.5
> predictor_avx: 1207.5
> predictor_avx2: 880.0
>
> On SKL-X. Only tested in checkasm.
>
> AVX is the same speed as SSSE3 since modern Intel CPUs eliminate reg-reg
> moves in the register renaming stage, but somewhat older CPUs such as
> Sandy Bridge, which is still quite popular, do not, so it should help
> there.
Does that apply to Haswell and newer? I was wondering why so many of the
AVX functions that only use three-operand instructions were reported to be
as fast as or even slower than the <= SSE4 versions for me.