[MPlayer-dev-eng] [RFC] Color management filter

Reimar Döffinger Reimar.Doeffinger at stud.uni-karlsruhe.de
Mon Aug 25 20:05:55 CEST 2008


On Mon, Aug 25, 2008 at 01:40:23PM -0400, Yue Shi Lai wrote:
> The algorithm uses internally a 256^3 color LUT (which occupies ~ 50 MB 
> RAM), and by default does not calculate the entire LUT using Little CMS. 
> Instead, a Catmull-Rom spline interpolation is applied on the LUT, which 
> for HAVE_SSE2=1 is implemented using SSE2 assembly to speed up the 
> calculation and reduce the latency it causes during MPlayer startup.

Hm, you might be interested in knowing that -vo gl:yuv=6 uses that
approach for yuv -> rgb conversion already. I never got around to
implement loading of a custom 3D texture (also because I did not
know of a suitable image format), but this might be a way to make
this available with less CPU usage, too...
That was actually the only point I wanted to make originally, but since
I had already been skimming the code...

> Index: libmpcodecs/vf_cm.c
> ===================================================================
> --- libmpcodecs/vf_cm.c	(revision 0)
> +++ libmpcodecs/vf_cm.c	(revision 0)
> @@ -0,0 +1,2898 @@
> +/*
> + * vf_cm - Experimental color management filter
> + *
> + * This filter matches the video color to the color management profile
> + * of the hardware display using two components: A interface to the
> + * Little CMS and Argyll color management modules (CMM) a fast Y'CbCr
> + * color look up table (CLUT).

A proper license would probably be a good idea. In theory, you could
copy one of the license header on some other file, depending on which
license you prefer (MPlayer files are most GPL, libav* stuff most LGPL)...

> +#define SMPTE_R_X		0.630
> +#define	SMPTE_R_Y		0.340

stray tab.

> +/* Unsigned 8 bit saturations */
> +#define UINT8_T_SATURATE(x)		\
> +	(uint8_t)((x) < 0 ? 0 : (x) > 255 ? 255 : (x))

libavutil/common.h: av_clip_uint8. Also likely to be faster.

> +	u = (111 * clut[i] - 9 * clut[i - 12] +
> +		 31 * clut[i + 12] - 5 * clut[i + 21]) >> 7;
> +	clut[i + 3] = UINT8_T_SATURATE(u);
> +	u = (111 * clut[i + 1] - 9 * clut[i - 11] +
> +		 31 * clut[i + 13] - 5 * clut[i + 22]) >> 7;
> +	clut[i + 4] = UINT8_T_SATURATE(u);
> +	u = (111 * clut[i + 2] - 9 * clut[i - 10] +
> +		 31 * clut[i + 14] - 5 * clut[i + 23]) >> 7;
> +	clut[i + 5] = UINT8_T_SATURATE(u);
> +	u = (71 * clut[i] - (clut[i - 12] << 3) +
> +		 77 * clut[i + 12] - 12 * clut[i + 21]) >> 7;
> +	clut[i + 6] = UINT8_T_SATURATE(u);
> +	u = (71 * clut[i + 1] - (clut[i - 11] << 3) +
> +		 77 * clut[i + 13] - 12 * clut[i + 22]) >> 7;
> +	clut[i + 7] = UINT8_T_SATURATE(u);
> +	u = (71 * clut[i + 2] - (clut[i - 10] << 3) +
> +		 77 * clut[i + 14] - 12 * clut[i + 23]) >> 7;
> +	clut[i + 8] = UINT8_T_SATURATE(u);
> +	u = (28 * clut[i] - 3 * clut[i - 12] +
> +		 117 * clut[i + 12] - 14 * clut[i + 21]) >> 7;
> +	clut[i + 9] = UINT8_T_SATURATE(u);
> +	u = (28 * clut[i + 1] - 3 * clut[i - 11] +
> +		 117 * clut[i + 13] - 14 * clut[i + 22]) >> 7;
> +	clut[i + 10] = UINT8_T_SATURATE(u);
> +	u = (28 * clut[i + 2] - 3 * clut[i - 10] +
> +		 117 * clut[i + 14] - 14 * clut[i + 23]) >> 7;
> +	clut[i + 11] = UINT8_T_SATURATE(u);

I think you should take advantage of that wonderful new invention called
"loop" ;-)
If you think speed is a problem, forcing gcc to fully unroll the loop
and using static const arrays for the coefficients should result in the
same code - though due to reduced code cache pressure a non-unrolled
loop may well be faster.

> +	/* Coefficients */
> +	/* These constant are in the format __n = n + (n << 16), which
> +	   when loaded, avoids an additional pshuflw (i.e. before pshufd)
> +	   to generate the correct vector layout. */
> +	static uint32_t __3 = 0x30003;
> +	static uint32_t __5 = 0x50005;
> +	static uint32_t __7 = 0x70007;
> +	static uint32_t __9 = 0x90009;
> +	static uint32_t __12 = 0xc000c;
> +	static uint32_t __14 = 0xe000e;
> +	static uint32_t __28 = 0x1c001c;
> +	static uint32_t __29 = 0x1d001d;
> +	static uint32_t __31 = 0x1f001f;
> +	static uint32_t __33 = 0x210021;
> +	static uint32_t __60 = 0x3c003c;
> +	static uint32_t __61 = 0x3d003d;
> +	static uint32_t __71 = 0x470047;
> +	static uint32_t __77 = 0x4d004d;
> +	static uint32_t __107 = 0x6b006b;
> +	static uint32_t __111 = 0x6f006f;
> +	static uint32_t __117 = 0x750075;

identifiers starting with __ are reserved for the system/system headers.

> +	unsigned long stride_j;
> +	unsigned long stride_2_j;
> +	unsigned long stride_3_j;
> +	unsigned long stride_4_j;
> +	unsigned long stride_8_j;
> +	unsigned long i_m_4_stride;
> +	unsigned long i_stride;
> +	unsigned long i_p_1_stride;
> +	unsigned long i_p_2_stride;
> +	unsigned long i_p_3_stride;
> +	unsigned long i_p_4_stride;
> +	unsigned long i_p_8_stride;
> +	unsigned long i_m_4_stride_j;
> +	unsigned long i_stride_j;
> +	unsigned long i_p_1_stride_j;
> +	unsigned long i_p_2_stride_j;
> +	unsigned long i_p_3_stride_j;
> +	unsigned long i_p_4_stride_j;
> +	unsigned long i_p_5_stride_j;
> +	unsigned long i_p_6_stride_j;
> +	unsigned long i_p_7_stride_j;

Using arrays might make it quite a bit less ugly, even if you might have
to use either less logical indices or leave a few elements unused.

> +	for(j = head; j < tail; j += 8) {
> +		__asm__ __volatile__ (
[...]
> +			: "%xmm0", "%xmm1", "%xmm2", "%xmm3",
> +			  "%xmm4", "%xmm5", "%xmm6", "%xmm7");
> +		stride_3_j += 8;
> +		stride_4_j += 8;
> +		stride_8_j += 8;
> +	}

Note that quite a few people recommend to put the loop code inside the
asm, gcc is quite likely to do something stupid with the loop variable
otherwise. Not sure if it matters enough.
IMHO the code also could severely need things to structure it and avoid
duplication and if possible separate the C and SSE implementations
- macros, "static inline" functions, whatever.
Also not sure if it does the right thing if runtime cpudetection is
enabled.

Greetings,
Reimar Döffinger



More information about the MPlayer-dev-eng mailing list