[FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)

Sun Aug 6 02:23:02 EEST 2017

On 8/5/17, Martin Vignali <martin.vignali at gmail.com> wrote:
> Hello,
>
> Based on pull request in the official openexr library :
> https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf
>
> i try to port the reorder part (used in zip and rle decompression), to asm
> Tested on OS X
> pass fate test for me
>
> I test with this command :
> ./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null -;./ffmpeg -i
> ./fate-suite/exr/rgba_slice_zip16.exr -f null - -cpuflags 0
>
> the result of the timer are :
> scalar : 960355 decicycles in reorder_pixels_zip,      64 runs,      0
> skips
> sse :       84763 decicycles in reorder_pixels_zip,      64 runs,      0
> skips
>
> Comments Welcome
>
> Martin
> Jokyo Images

64 runs seems way too few runs for reliable result.
Please try to run ffmpeg encoding for about a minute.

About the patch:

> +%include "libavutil/x86/x86util.asm"
> +
> +SECTION .text

Please include "x86inc.asm" explicitly.

> +INIT_XMM sse2
> +cglobal reorder_pixels, 3,5,3, src, dst, size
> +
> +shr    r2,    1;half_size

It's better to use the labels you define in the cglobal,
If I count correctly, "r2" would be "sizeq".
("srcq" and "dstq" would be the correct labels for the pointers).

> +;calc loop count for simd reorder
> +mov    r3,    r2;
> +shr    r3,    4;calc loop count simd

Align the instruction at 4 spaces.
Align the first operands so that the ending commas are vertically aligned.
For the follow up operands, just add one space after the comma.

Put some spaces before the comment, to separate it from the operands.
BTW, this is not C, you don't need to end every line with ";"
e.g.
> +    mov  r3, r2
> +    shr  r3, 4  ;calc loop count simd

> +;jump to afterloop2 if loop count simd is 0
> +cmp    r3,    0;
> +jle    afterloop2;

If you only check for size==0, then
the "shr" has already set the correct Z flag.

However, if r3/size==1, you'd still read
16 bytes at once in the first loop.
Aka, overread.

> +;simd loop
> +loop1:
> +movdqa    xmm0,    [r0];load first 16 bytes

Use "m0" instead.

> +lea r4, [r0 + r2];
> +
> +movdqu    xmm1, [r4]; unaligned load

r4 doesn't seem to be used for anything else.
Just move the address calculation directly into
the "movdqu", it can take it.

> +movdqa     xmm2,    xmm0;copy xmm0
> +
> +punpcklbw     xmm2,   xmm1;
> +movdqa     [r1],   xmm2;
> +add     r1, 16;

use "mmsize" instead of 16.

> +movdqa     xmm2,   xmm0;
> +punpckhbw     xmm2,   xmm1;
> +movdqa    [r1],   xmm2;
> +add    r1,    16;

You can actually avoid one of the "add"
if you use [r1+mmsize] and add 32 at the second
"add" (or 2*mmsize).

> +dec    r3;
> +add    r0,    16;

> +; test repeat
> +cmp    r3,   0;
> +jge    loop1;

First, "dec" instructions are avoided,
because they do a partial update
on the flag register and
this causes interdependence.

Second, you can use the flags from
the "sub" to check if it has reached
zero or has gone negative.

> +afterloop2:
> +;scalar part
> +
> +mov r3, r2;
> +and r3, 15     ;half_size % 16
> +lea r4, [r0 + r2];
> +
> +;initial condition loop
> +cmp    r3,    0;
> +jle    end;
> +
> +loop2:
> +mov r1, r0;
> +inc r1;
> +mov r1, r4;
> +inc r1;
> +inc r0;
> +inc r4;
> +dec r3;
> +; test repeat
> +cmp    r3,   0;
> +jg    loop2;

O_o
This loop does not read or write to memory.

You can replace the second "cmp" in this
loop with unconditional jump to the
"initial condition loop" compare.
(aka move loop2: two instruction above).

This is how "for(;;)" is usually implemented in C compilers.

Be sure to test your function with different sizes,
to salt your output buffers and check for underwrites, overwrites
etc...

Best Regards.