[FFmpeg-devel] [FFmpeg-cvslog] r12171 - trunk/doc/optimization.txt

Trent Piepho xyzzy
Fri Feb 22 00:47:28 CET 2008


On Thu, 21 Feb 2008, Michael Niedermayer wrote:
> +    long offset = -128;
> +    MOVQ_ZERO(mm7);
> +    do {
> +        asm volatile(
> +            "movq (%0), %%mm0         \n\t"
> +            "movq (%1), %%mm2         \n\t"
> +            "movq %%mm0, %%mm1        \n\t"
> +            "movq %%mm2, %%mm3        \n\t"
> +            "punpcklbw %%mm7, %%mm0   \n\t"
> +            "punpckhbw %%mm7, %%mm1   \n\t"
> +            "punpcklbw %%mm7, %%mm2   \n\t"
> +            "punpckhbw %%mm7, %%mm3   \n\t"
> +            "psubw %%mm2, %%mm0       \n\t"
> +            "psubw %%mm3, %%mm1       \n\t"
> +            "movq %%mm0, (%2, %4)     \n\t"
> +            "movq %%mm1, 8(%2, %4)    \n\t"
> +            : : "r" (s1), "r" (s2), "r" (block+64), "r" (stride), "r" (offset)
> +            : "memory");
> +        s1 += stride;
> +        s2 += stride;
> +        offset += 16;
> +    } while (offset < 0);

That asm block doesn't make a lot of sense.  Why is stride an input when
it's not used in the asm?  It should also do better if you let gcc handle
the addressing and avoid the memory clobber.  Of course, "should" is not
the same as "does".

I'm not sure what types block, s1 and s2 are so this might not be exactly
right.

uint64_t *s1, *s2;
uint64_t *block;
stride>>=3; /* was stride in bytes or qwords? */

       asm volatile(
           "movq %2, %%mm0          \n\t"
           "movq %3, %%mm2          \n\t"
           "movq %%mm0, %%mm1       \n\t"
           "movq %%mm2, %%mm3       \n\t"
           "punpcklbw %%mm7, %%mm0  \n\t"
           "punpckhbw %%mm7, %%mm1  \n\t"
           "punpcklbw %%mm7, %%mm2  \n\t"
           "punpckhbw %%mm7, %%mm3  \n\t"
           "psubw %%mm2, %%mm0      \n\t"
           "psubw %%mm3, %%mm1      \n\t"
           "movq %%mm0, %0          \n\t"
           "movq %%mm1, %1          \n\t"
           : "=m" (block[8+offset]), "=m" (block[9+offset])  /* outputs: %0, %1 */
           : "m" (*s1), "m" (*s2) );                         /* inputs:  %2, %3 */
	   /* Maybe that should be 64+offset and 65+offset? */
       s1 += stride;
       s2 += stride;
       offset+=2;

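That should let gcc use better addressing modes.  The "m" operands also
tell gcc exactly which memory is read and written, so the "memory" clobber
is no longer needed.  Maybe it will be smart enough to keep the initial
values of s1 and s2 in registers and index them with another register that
is incremented by stride each iteration.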
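
For anyone who wants to try this, here is a self-contained sketch of the
"m"-constraint idea next to a plain C reference to compare against.  It is
not the code from the commit.  It assumes block holds 16-bit values and
s1/s2 point at 8-bit pixels with stride in bytes, which the thread does not
confirm.  The names diff_pixels, diff_pixels_ref, row8 and quad16 are made
up for illustration, and the zeroing of mm7 is moved inside the asm so that
each statement stands alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain C reference: block[8*y + x] = s1[x] - s2[x] over an 8x8 area. */
static void diff_pixels_ref(int16_t *block, const uint8_t *s1,
                            const uint8_t *s2, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            block[8 * y + x] = (int16_t)(s1[x] - s2[x]);
        s1 += stride;
        s2 += stride;
    }
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Hypothetical wrapper types so each "m" operand covers exactly 8 bytes. */
typedef struct { uint8_t b[8]; } row8;
typedef struct { int16_t w[4]; } quad16;

static void diff_pixels(int16_t *block, const uint8_t *s1,
                        const uint8_t *s2, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        __asm__ volatile(
            "pxor      %%mm7, %%mm7 \n\t" /* zero register used for unpacking */
            "movq      %2, %%mm0    \n\t" /* 8 pixels from s1 */
            "movq      %3, %%mm2    \n\t" /* 8 pixels from s2 */
            "movq      %%mm0, %%mm1 \n\t"
            "movq      %%mm2, %%mm3 \n\t"
            "punpcklbw %%mm7, %%mm0 \n\t" /* low 4 bytes  -> 4 words */
            "punpckhbw %%mm7, %%mm1 \n\t" /* high 4 bytes -> 4 words */
            "punpcklbw %%mm7, %%mm2 \n\t"
            "punpckhbw %%mm7, %%mm3 \n\t"
            "psubw     %%mm2, %%mm0 \n\t"
            "psubw     %%mm3, %%mm1 \n\t"
            "movq      %%mm0, %0    \n\t" /* differences 0..3 of this row */
            "movq      %%mm1, %1    \n\t" /* differences 4..7 of this row */
            : "=m" (*(quad16 *)(block + 8 * y)),
              "=m" (*(quad16 *)(block + 8 * y + 4))
            : "m" (*(const row8 *)s1), "m" (*(const row8 *)s2)
            : "mm0", "mm1", "mm2", "mm3", "mm7");
        s1 += stride;
        s2 += stride;
    }
    __asm__ volatile("emms"); /* leave MMX state before any x87 code runs */
}
#else
/* Non-x86 or non-GNU compilers: fall back to the reference version. */
static void diff_pixels(int16_t *block, const uint8_t *s1,
                        const uint8_t *s2, ptrdiff_t stride)
{
    diff_pixels_ref(block, s1, s2, stride);
}
#endif
```

Whether this is actually faster than the version with explicit registers
and a memory clobber depends on what gcc does with the addressing, so
compare the generated code or benchmark it before trusting either one.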



