[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

Tue Jun 16 14:34:11 CEST 2015

On Tue, Jun 16, 2015 at 2:30 PM, Stefano Sabatini <stefasab at gmail.com> wrote:
> On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
>> Hi,
>>
>> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer <michaelni at gmx.at>:
> [...]
>> >> +#if HAVE_SSE2
>> >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 instruction
>> >> + * load and storing data with the SSE>=2 instruction store.
>> >> + */
>> >> +#define COPY16(dstp, srcp, load, store) \
>> >> +    __asm__ volatile (                  \
>> >> +        load "  0(%[src]), %%xmm1\n"    \
>> >> +        store " %%xmm1,    0(%[dst])\n" \
>> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
>> >> +
>> >> +#define COPY64(dstp, srcp, load, store) \
>> >> +    __asm__ volatile (                  \
>> >> +        load "  0(%[src]), %%xmm1\n"    \
>> >> +        load " 16(%[src]), %%xmm2\n"    \
>> >> +        load " 32(%[src]), %%xmm3\n"    \
>> >> +        load " 48(%[src]), %%xmm4\n"    \
>> >> +        store " %%xmm1,    0(%[dst])\n" \
>> >> +        store " %%xmm2,   16(%[dst])\n" \
>> >> +        store " %%xmm3,   32(%[dst])\n" \
>> >> +        store " %%xmm4,   48(%[dst])\n" \
>> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
>> >> +#endif
>> >> +
>> >> +#define COPY_LINE(dstp, srcp, size, load)                               \
>> >> +    const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f;               \
>> >> +    unsigned x = unaligned;                                             \
>> >> +                                                                        \
>> >> +    av_assert0(((intptr_t)dstp & 0x0f) == 0);                           \
>> >> +                                                                        \
>> >> +    __asm__ volatile ("mfence");                                        \
>> >> +    if (!unaligned) {                                                   \
>> >> +        for (; x+63 < size; x += 64)                                    \
>> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqa");                 \
>> >> +    } else {                                                            \
>> >> +        COPY16(dstp, srcp, "movdqu", "movdqa");                         \
>> >> +        for (; x+63 < size; x += 64)                                    \
>> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqu");                 \
>> >
>> > To use SSE registers in inline asm operands or in the clobber list, you
>> > need to build with -msse (which is probably enabled by default on x86-64).
>> >
>> > Files built with -msse will result in undefined behavior if anything
>> > in them is executed on a pre-SSE CPU, since -msse allows gcc to put
>> > SSE instructions directly into the code wherever it likes.
>> >
>> > The way out of this "design" is not to tell gcc that it is passing
>> > a string with SSE code to the assembler, that is, not to use SSE
>> > registers in operands and not to put them on the clobber list unless
>> > gcc actually is in SSE mode and can use them and needs them there.
>> > See XMM_CLOBBERS*.
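
To illustrate the point about clobbers, here is a minimal sketch of the idea.
MY_XMM_CLOBBERS and copy16_sse2 are hypothetical names made up for this
example (a stand-in for the XMM_CLOBBERS* helpers, not the real definitions),
and keying off __SSE__ is an assumption about how "gcc is in SSE mode" would
be detected:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: expand to the XMM register names only when the
 * compiler itself is allowed to emit SSE code (__SSE__ is predefined by
 * GCC and Clang when building with -msse). Otherwise expand to nothing,
 * so the compiler is never told about registers it does not know. */
#if defined(__SSE__)
#   define MY_XMM_CLOBBERS(...) __VA_ARGS__
#else
#   define MY_XMM_CLOBBERS(...)
#endif

/* Copy 16 bytes with unaligned SSE2 load/store.
 * The caller is expected to have checked at runtime that the CPU
 * supports SSE2 before calling this on 32-bit x86. */
static void copy16_sse2(uint8_t *dst, const uint8_t *src)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile (
        "movdqu 0(%[src]), %%xmm1\n"
        "movdqu %%xmm1,    0(%[dst])\n"
        : /* no outputs */
        : [dst]"r"(dst), [src]"r"(src)
        : MY_XMM_CLOBBERS("xmm1",) "memory");
#else
    /* Non-x86 fallback so the sketch still builds everywhere. */
    memcpy(dst, src, 16);
#endif
}
```

The trailing comma inside MY_XMM_CLOBBERS("xmm1",) is deliberate: it keeps
the clobber list well-formed whether or not the macro expands to anything.
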
>>
>> Well, from past experience, lying to gcc is generally not a good thing
>> either. There are multiple interesting ways it could fail from time to
>> time. :)
>>
>> Other approaches:
>> - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
>> "ssse3", "sse4.1", etc. This is the easiest way;
>> - Split into several separate files per target. Though one could then
>> argue that, while we are at it, why not just start moving to yasm.
>>
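
For reference, the target attribute approach could look roughly like the
sketch below. It uses intrinsics instead of inline asm, and the function
names (copy_blocks16_sse2, copy_blocks16) are hypothetical. It assumes a
GCC or Clang recent enough to provide __builtin_cpu_supports() and to allow
including <emmintrin.h> without -msse2:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_TARGET_SSE2_SKETCH 1
#include <emmintrin.h>

/* Only this function is compiled with SSE2 enabled; the rest of the
 * file keeps the default instruction set. */
static __attribute__((target("sse2")))
void copy_blocks16_sse2(uint8_t *dst, const uint8_t *src, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 16 * i));
        _mm_storeu_si128((__m128i *)(dst + 16 * i), v);
    }
}
#endif

/* Dispatcher: copies n_blocks * 16 bytes, using the SSE2 version only
 * when the running CPU reports SSE2 support. */
static void copy_blocks16(uint8_t *dst, const uint8_t *src, size_t n_blocks)
{
#ifdef HAVE_TARGET_SSE2_SKETCH
    if (__builtin_cpu_supports("sse2")) {
        copy_blocks16_sse2(dst, src, n_blocks);
        return;
    }
#endif
    memcpy(dst, src, 16 * n_blocks);
}
```

The runtime check stays outside the SSE2 function, so no SSE instruction is
reached on a CPU that lacks it.
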
>
>> The former approach looks more appealing to me, considering there may
>> be an effort to migrate to yasm afterwards.
>
> I plan to port this patch to yasm. I'll ask for help on IRC, since
> it will probably take too much time without any guidance.
> --

If you accept a few restrictions (like requiring aligned and padded
input/output) and maybe give it a more specific name so that people
won't try to replace generic memcpy with it, yasm'ing it would be
pretty simple.
If you want it to be generic like the C version, supporting unaligned
input and whatnot, the asm is going to get a bit more verbose.
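
To make the restrictions concrete, the contract of the restricted version
could be sketched in C like this. It is only an illustration of the
aligned-and-padded idea, not the yasm code itself; the name copy_aligned64
and the exact numbers (16-byte alignment, size padded to a multiple of 64)
are assumptions taken from the COPY64 loop in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name, deliberately specific so that nobody mistakes it
 * for a generic memcpy replacement.
 * Contract: dst and src are 16-byte aligned, size is a multiple of 64
 * (the buffers are padded), and the two regions do not overlap. */
static void copy_aligned64(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(((uintptr_t)dst & 0x0f) == 0);
    assert(((uintptr_t)src & 0x0f) == 0);
    assert((size & 63) == 0);

#if defined(__SSE2__)
    /* No head or tail handling needed: every iteration moves exactly
     * 64 bytes with aligned loads and stores. */
    for (size_t x = 0; x < size; x += 64) {
        __m128i a = _mm_load_si128((const __m128i *)(src + x));
        __m128i b = _mm_load_si128((const __m128i *)(src + x + 16));
        __m128i c = _mm_load_si128((const __m128i *)(src + x + 32));
        __m128i d = _mm_load_si128((const __m128i *)(src + x + 48));
        _mm_store_si128((__m128i *)(dst + x),      a);
        _mm_store_si128((__m128i *)(dst + x + 16), b);
        _mm_store_si128((__m128i *)(dst + x + 32), c);
        _mm_store_si128((__m128i *)(dst + x + 48), d);
    }
#else
    /* Fallback when the file is not built with SSE2 enabled. */
    memcpy(dst, src, size);
#endif
}
```

With those three assertions as preconditions, the unaligned prologue and
the byte-wise tail of the generic version disappear entirely.
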

I could probably whip up a basic implementation of the restricted
version, and the yasm experts can then suggest improvements.

- Hendrik