[FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)

Martin Vignali martin.vignali at gmail.com
Mon Aug 7 12:34:48 EEST 2017


>
> 64 runs seems way too few runs for reliable result.
> Please try to run ffmpeg encoding for about a minute.
>
> About the patch:
>
> > +%include "libavutil/x86/x86util.asm"
> > +
> > +SECTION .text
>
> Please include "x86inc.asm" explicitly.
>
> > +INIT_XMM sse2
> > +cglobal reorder_pixels, 3,5,3, src, dst, size
> > +
> > +shr    r2,    1;half_size
>
> It's better to use the labels you define in the cglobal,
> If I count correctly, "r2" would be "sizeq".
> ("srcq" and "dstq" would be the correct labels for the pointers).
>
>
> > +;calc loop count for simd reorder
> > +mov    r3,    r2;
> > +shr    r3,    4;calc loop count simd
>
> Align the instruction at 4 spaces.
> Align the first operands so that the ending commas are vertically aligned.
> For the follow up operands, just add one space after the comma.
>
> Put some spaces before the comment, to separate it from the operands.
> BTW, this is not C, you don't need to end every line with ";"
> e.g.
> > +    mov  r3, r2
> > +    shr  r3, 4  ;calc loop count simd
>
>
> > +;jump to afterloop2 if loop count simd is 0
> > +cmp    r3,    0;
> > +jle    afterloop2;
>
> If you only check for size==0, then
> the "shr" has already set the correct Z flag.
>
> However, if r3/size==1, you'd still read
> 16 bytes at once in the first loop.
> Aka, overread.
>
>
> > +;simd loop
> > +loop1:
> > +movdqa    xmm0,    [r0];load first 16 bytes
>
> Use "m0" instead.
>
>
> > +lea r4, [r0 + r2];
> > +
> > +movdqu    xmm1, [r4]; unaligned load
>
> r4 doesn't seem to be used for anything else.
> Just move the address calculation directly into
> the "movdqu", it can take it.
>
> > +movdqa     xmm2,    xmm0;copy xmm0
> > +
> > +punpcklbw     xmm2,   xmm1;
> > +movdqa     [r1],   xmm2;
> > +add     r1, 16;
>
> use "mmsize" instead of 16.
>
> > +movdqa     xmm2,   xmm0;
> > +punpckhbw     xmm2,   xmm1;
> > +movdqa    [r1],   xmm2;
> > +add    r1,    16;
>
> You can actually avoid one of the "add"
> if you use [r1+mmsize] and add 32 at the second
> "add" (or 2*mmsize).
>
> > +dec    r3;
> > +add    r0,    16;
>
> > +; test repeat
> > +cmp    r3,   0;
> > +jge    loop1;
>
> First, "dec" instructions are avoided,
> because they do a partial update
> on the flag register and
> this causes interdependence.
>
> Second, you can use the flags from
> the "sub" to check if it has reached
> zero or has gone negative.
>
>
> > +afterloop2:
> > +;scalar part
> > +
> > +mov r3, r2;
> > +and r3, 15     ;half_size % 16
> > +lea r4, [r0 + r2];
> > +
> > +;initial condition loop
> > +cmp    r3,    0;
> > +jle    end;
> > +
> > +loop2:
> > +mov r1, r0;
> > +inc r1;
> > +mov r1, r4;
> > +inc r1;
> > +inc r0;
> > +inc r4;
> > +dec r3;
> > +; test repeat
> > +cmp    r3,   0;
> > +jg    loop2;
>
> O_o
> This loop does not read or write to memory.
>
> You can replace the second "cmp" in this
> loop with unconditional jump to the
> "initial condition loop" compare.
> (aka move loop2: two instruction above).
>
> This is how "for(;;)" is usually implemented in C compilers.
>
> Be sure to test your function with different sizes,
> to salt your output buffers and check for underwrites, overwrites
> etc...
>
> Best Regards.
>
>
> Thanks for your comments,

i will take a look, on all of this.

Martin


More information about the ffmpeg-devel mailing list