[MPlayer-dev-eng] [RFC] Color management filter

Yue Shi Lai ylai at users.sourceforge.net
Tue Aug 26 15:22:34 CEST 2008


Dear Reimar, dev-eng list,

This new patch addresses all issues have been raised (i.e. including 
interface to vo_gl). There is still few comment gaps and comments in 
need of update (especially where the modifications have been applied), 
which I will fill up, but I would appreciate any other issues you might 
find.

I also made the following additional changes:

- Remove the left over wording "experimental" (in fact, I have been 
using this implementation for ~ 1 year now on daily basis)

- Replace dpy_... by dest_... This is also a legacy naming, and 
dest/destination is the proper term in color management (also, dpy does 
not make sense when using with mencoder)

- The default BPC black level is too conservative, adjusted it to 10.

Best,

Yue Shi Lai

Reimar Döffinger schrieb:
> On Mon, Aug 25, 2008 at 01:40:23PM -0400, Yue Shi Lai wrote:
>> The algorithm uses internally a 256^3 color LUT (which occupies ~ 50 MB 
>> RAM), and by default does not calculate the entire LUT using Little CMS. 
>> Instead, a Catmull-Rom spline interpolation is applied on the LUT, which 
>> for HAVE_SSE2=1 is implemented using SSE2 assembly to speed up the 
>> calculation and reduce the latency it causes during MPlayer startup.
> 
> Hm, you might be interested in knowing that -vo gl:yuv=6 uses that
> approach for yuv -> rgb conversion already. I never got around to
> implement loading of a custom 3D texture (also because I did not
> know of a suitable image format), but this might be a way to make
> this available with less CPU usage, too...
> That was actually the only point I wanted to make originally, but since
> I had already been skimming the code...
> 
>> Index: libmpcodecs/vf_cm.c
>> ===================================================================
>> --- libmpcodecs/vf_cm.c	(revision 0)
>> +++ libmpcodecs/vf_cm.c	(revision 0)
>> @@ -0,0 +1,2898 @@
>> +/*
>> + * vf_cm - Experimental color management filter
>> + *
>> + * This filter matches the video color to the color management profile
>> + * of the hardware display using two components: A interface to the
>> + * Little CMS and Argyll color management modules (CMM) a fast Y'CbCr
>> + * color look up table (CLUT).
> 
> A proper license would probably be a good idea. In theory, you could
> copy one of the license header on some other file, depending on which
> license you prefer (MPlayer files are most GPL, libav* stuff most LGPL)...
> 
>> +#define SMPTE_R_X		0.630
>> +#define	SMPTE_R_Y		0.340
> 
> stray tab.
> 
>> +/* Unsigned 8 bit saturations */
>> +#define UINT8_T_SATURATE(x)		\
>> +	(uint8_t)((x) < 0 ? 0 : (x) > 255 ? 255 : (x))
> 
> libavutil/common.h: av_clip_uint8. Also likely to be faster.
> 
>> +	u = (111 * clut[i] - 9 * clut[i - 12] +
>> +		 31 * clut[i + 12] - 5 * clut[i + 21]) >> 7;
>> +	clut[i + 3] = UINT8_T_SATURATE(u);
>> +	u = (111 * clut[i + 1] - 9 * clut[i - 11] +
>> +		 31 * clut[i + 13] - 5 * clut[i + 22]) >> 7;
>> +	clut[i + 4] = UINT8_T_SATURATE(u);
>> +	u = (111 * clut[i + 2] - 9 * clut[i - 10] +
>> +		 31 * clut[i + 14] - 5 * clut[i + 23]) >> 7;
>> +	clut[i + 5] = UINT8_T_SATURATE(u);
>> +	u = (71 * clut[i] - (clut[i - 12] << 3) +
>> +		 77 * clut[i + 12] - 12 * clut[i + 21]) >> 7;
>> +	clut[i + 6] = UINT8_T_SATURATE(u);
>> +	u = (71 * clut[i + 1] - (clut[i - 11] << 3) +
>> +		 77 * clut[i + 13] - 12 * clut[i + 22]) >> 7;
>> +	clut[i + 7] = UINT8_T_SATURATE(u);
>> +	u = (71 * clut[i + 2] - (clut[i - 10] << 3) +
>> +		 77 * clut[i + 14] - 12 * clut[i + 23]) >> 7;
>> +	clut[i + 8] = UINT8_T_SATURATE(u);
>> +	u = (28 * clut[i] - 3 * clut[i - 12] +
>> +		 117 * clut[i + 12] - 14 * clut[i + 21]) >> 7;
>> +	clut[i + 9] = UINT8_T_SATURATE(u);
>> +	u = (28 * clut[i + 1] - 3 * clut[i - 11] +
>> +		 117 * clut[i + 13] - 14 * clut[i + 22]) >> 7;
>> +	clut[i + 10] = UINT8_T_SATURATE(u);
>> +	u = (28 * clut[i + 2] - 3 * clut[i - 10] +
>> +		 117 * clut[i + 14] - 14 * clut[i + 23]) >> 7;
>> +	clut[i + 11] = UINT8_T_SATURATE(u);
> 
> I think you should take advantage of that wonderful new invention called
> "loop" ;-)
> If you think speed is a problem, forcing gcc to fully unroll the loop
> and using static const arrays for the coefficients should result in the
> same code - though due to reduced code cache pressure a non-unrolled
> loop may well be faster.
> 
>> +	/* Coefficients */
>> +	/* These constant are in the format __n = n + (n << 16), which
>> +	   when loaded, avoids an additional pshuflw (i.e. before pshufd)
>> +	   to generate the correct vector layout. */
>> +	static uint32_t __3 = 0x30003;
>> +	static uint32_t __5 = 0x50005;
>> +	static uint32_t __7 = 0x70007;
>> +	static uint32_t __9 = 0x90009;
>> +	static uint32_t __12 = 0xc000c;
>> +	static uint32_t __14 = 0xe000e;
>> +	static uint32_t __28 = 0x1c001c;
>> +	static uint32_t __29 = 0x1d001d;
>> +	static uint32_t __31 = 0x1f001f;
>> +	static uint32_t __33 = 0x210021;
>> +	static uint32_t __60 = 0x3c003c;
>> +	static uint32_t __61 = 0x3d003d;
>> +	static uint32_t __71 = 0x470047;
>> +	static uint32_t __77 = 0x4d004d;
>> +	static uint32_t __107 = 0x6b006b;
>> +	static uint32_t __111 = 0x6f006f;
>> +	static uint32_t __117 = 0x750075;
> 
> identifiers starting with __ are reserved for the system/system headers.
> 
>> +	unsigned long stride_j;
>> +	unsigned long stride_2_j;
>> +	unsigned long stride_3_j;
>> +	unsigned long stride_4_j;
>> +	unsigned long stride_8_j;
>> +	unsigned long i_m_4_stride;
>> +	unsigned long i_stride;
>> +	unsigned long i_p_1_stride;
>> +	unsigned long i_p_2_stride;
>> +	unsigned long i_p_3_stride;
>> +	unsigned long i_p_4_stride;
>> +	unsigned long i_p_8_stride;
>> +	unsigned long i_m_4_stride_j;
>> +	unsigned long i_stride_j;
>> +	unsigned long i_p_1_stride_j;
>> +	unsigned long i_p_2_stride_j;
>> +	unsigned long i_p_3_stride_j;
>> +	unsigned long i_p_4_stride_j;
>> +	unsigned long i_p_5_stride_j;
>> +	unsigned long i_p_6_stride_j;
>> +	unsigned long i_p_7_stride_j;
> 
> Using arrays might make it quite a bit less ugly, even if you might have
> to use either less logical indices or leave a few elements unused.
> 
>> +	for(j = head; j < tail; j += 8) {
>> +		__asm__ __volatile__ (
> [...]
>> +			: "%xmm0", "%xmm1", "%xmm2", "%xmm3",
>> +			  "%xmm4", "%xmm5", "%xmm6", "%xmm7");
>> +		stride_3_j += 8;
>> +		stride_4_j += 8;
>> +		stride_8_j += 8;
>> +	}
> 
> Note that quite a few people recommend to put the loop code inside the
> asm, gcc is quite likely to do something stupid with the loop variable
> otherwise. Not sure if it matters enough.
> IMHO the code also could severely need things to structure it and avoid
> duplication and if possible separate the C and SSE implementations
> - macros, "static inline" functions, whatever.
> Also not sure if it does the right thing if runtime cpudetection is
> enabled.
> 
> Greetings,
> Reimar Döffinger
> _______________________________________________
> MPlayer-dev-eng mailing list
> MPlayer-dev-eng at mplayerhq.hu
> https://lists.mplayerhq.hu/mailman/listinfo/mplayer-dev-eng

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mplayer-vf_cm-20080826.patch
Type: text/x-patch
Size: 91845 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20080826/0bcd818f/attachment.bin>


More information about the MPlayer-dev-eng mailing list