[Ffmpeg-devel] building without optimizations, but with mmx enabled - II
Marco Manfredini
mldb
Thu Aug 10 17:39:14 CEST 2006
The last thing is transpose4x4, which has 4 inputs and 4 outputs. Every memory
operand needs its address in a general-purpose register, so this results in
register starvation. The obvious idea is to use "y" constraints, since the
values go into mmx registers anyway:
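static inline void transpose4x4(uint8_t *dst, uint8_t *src, int dst_stride,
                                int src_stride){
    /* The asm overwrites %4, %5 and %6, and GCC forbids modifying
     * input-only operands, so rows 0-2 are passed as read-write ("+y")
     * operands through locals; row 3 is only read. */
    uint32_t r0 = *(uint32_t*)(src + 0*src_stride);
    uint32_t r1 = *(uint32_t*)(src + 1*src_stride);
    uint32_t r2 = *(uint32_t*)(src + 2*src_stride);
    uint32_t r3 = *(uint32_t*)(src + 3*src_stride);

    asm volatile(
        "punpcklbw %5, %4               \n\t" /* %4 = rows 0/1 interleaved */
        "punpcklbw %7, %6               \n\t" /* %6 = rows 2/3 interleaved */
        "movq      %4, %5               \n\t"
        "punpcklwd %6, %4               \n\t" /* %4 = columns 0 and 1 */
        "punpckhwd %6, %5               \n\t" /* %5 = columns 2 and 3 */
        "movd      %4, %0               \n\t"
        "punpckhdq %4, %4               \n\t"
        "movd      %4, %1               \n\t"
        "movd      %5, %2               \n\t"
        "punpckhdq %5, %5               \n\t"
        "movd      %5, %3               \n\t"
        : "=m" (*(uint32_t*)(dst + 0*dst_stride)),
          "=m" (*(uint32_t*)(dst + 1*dst_stride)),
          "=m" (*(uint32_t*)(dst + 2*dst_stride)),
          "=m" (*(uint32_t*)(dst + 3*dst_stride)),
          "+y" (r0), "+y" (r1), "+y" (r2)
        : "y" (r3)
    );
}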
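For checking the asm, the operation itself is just a 4x4 byte transpose: row i of dst receives column i of the block at src, i.e. dst[i*dst_stride + j] = src[j*src_stride + i]. A minimal portable C sketch with the same argument order might look like this; transpose4x4_c is a hypothetical name, not an existing ffmpeg function. Copying the block into a local first is a choice made here so that dst may overlap src, mirroring the asm, which has all four rows in registers before it stores anything.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable reference for transpose4x4:
 * dst[i*dst_stride + j] = src[j*src_stride + i] for 0 <= i, j < 4. */
static void transpose4x4_c(uint8_t *dst, const uint8_t *src,
                           int dst_stride, int src_stride)
{
    uint8_t tmp[4][4];
    int i, j;

    /* Read the whole 4x4 block first, so dst may overlap src. */
    for (i = 0; i < 4; i++)
        memcpy(tmp[i], src + i * src_stride, 4);

    /* Row i of dst is column i of the source block. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            dst[i * dst_stride + j] = tmp[j][i];
}
```

Comparing its output with the MMX version on a few random blocks is a cheap way to validate the asm after changing its constraints.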
Theoretically, the "y" version should also lead to better optimisation, because
the compiler can schedule the loads itself. On the downside, "y" constraints
only work if the compiler is given at least the -mmmx switch, so this would
require a change to the configuration. Is that viable?
Marco