[Ffmpeg-devel] building without optimizations, but with mmx enabled - II

Thu Aug 10 17:39:14 CEST 2006

The last thing is transpose4x4, which has 4 inputs and 4 outputs, resulting in 
register starvation. The obvious idea is to use "y" constraints, since the 
values go into MMX registers anyway:

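static inline void transpose4x4(uint8_t *dst, uint8_t *src, int dst_stride, 
int src_stride){
    /* %4, %5 and %6 are overwritten by the asm, so they have to be
       read-write ("+y") operands on local copies; gcc does not allow
       modifying input-only operands. Only the last row is a pure input. */
    uint32_t r0 = *(uint32_t*)(src + 0*src_stride);
    uint32_t r1 = *(uint32_t*)(src + 1*src_stride);
    uint32_t r2 = *(uint32_t*)(src + 2*src_stride);
    asm volatile(
        "punpcklbw %5, %4         \n\t" /* %4 = src rows 0,1 interleaved */
        "punpcklbw %7, %6         \n\t" /* %6 = src rows 2,3 interleaved */
        "movq %4, %5              \n\t"
        "punpcklwd %6, %4         \n\t" /* %4 = dst rows 0,1 */
        "punpckhwd %6, %5         \n\t" /* %5 = dst rows 2,3 */
        "movd  %4, %0             \n\t"
        "punpckhdq %4, %4         \n\t"
        "movd  %4, %1             \n\t"
        "movd  %5, %2             \n\t"
        "punpckhdq %5, %5         \n\t"
        "movd  %5, %3             \n\t"
        : "=m" (*(uint32_t*)(dst + 0*dst_stride)),
          "=m" (*(uint32_t*)(dst + 1*dst_stride)),
          "=m" (*(uint32_t*)(dst + 2*dst_stride)),
          "=m" (*(uint32_t*)(dst + 3*dst_stride)),
          "+y" (r0),
          "+y" (r1),
          "+y" (r2)
        :  "y" (*(uint32_t*)(src + 3*src_stride))
    );
}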
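
For checking the result, what the asm is supposed to compute can be written 
in plain C. This is only a sketch for comparison; transpose4x4_ref is a name 
I made up here, it is not part of the patch:

```c
#include <stdint.h>

/* Reference only: byte (x, y) of the 4x4 source block ends up at (y, x)
   in the destination. transpose4x4_ref is a hypothetical helper name. */
static void transpose4x4_ref(uint8_t *dst, const uint8_t *src,
                             int dst_stride, int src_stride)
{
    int x, y;
    for (y = 0; y < 4; y++)
        for (x = 0; x < 4; x++)
            dst[y*dst_stride + x] = src[x*src_stride + y];
}
```

Running both versions on the same 4x4 block and comparing the 16 output 
bytes should be enough to catch a wrong operand number in the asm.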

Theoretically, this should also lead to better optimisation, because the 
compiler can arrange the loads itself. On the downside, "y" constraints only 
work if the compiler is told that MMX is available, i.e. it gets at least the 
-mmmx switch (or a -march setting that implies it). So this requires a change 
to the configuration. Is that viable?
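
One possible way to handle this in the source, as a sketch only: gcc 
predefines __MMX__ when MMX code generation is enabled, so the "y" version 
could be selected only in that case and the current version kept otherwise. 
HAVE_MMX_Y and transpose4x4_variant() are made-up names for illustration; 
this is not something configure does today:

```c
/* gcc predefines __MMX__ when MMX code generation is enabled
   (-mmmx, or a -march setting that implies it). */
#if defined(__GNUC__) && defined(__MMX__)
#define HAVE_MMX_Y 1    /* "y" constraints can be used */
#else
#define HAVE_MMX_Y 0    /* keep the current version */
#endif

/* Hypothetical helper: reports which variant this build would pick. */
static const char *transpose4x4_variant(void)
{
#if HAVE_MMX_Y
    return "y-constraint";
#else
    return "current";
#endif
}
```

That would avoid breaking builds where the switch is missing, at the price 
of keeping two versions of the function around.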

Marco