[MPlayer-dev-eng] [RFC] Color management filter
Yue Shi Lai
ylai at users.sourceforge.net
Tue Aug 26 15:22:34 CEST 2008
Dear Reimar, dev-eng list,
This new patch addresses all issues have been raised (i.e. including
interface to vo_gl). There is still few comment gaps and comments in
need of update (especially where the modifications have been applied),
which I will fill up, but I would appreciate any other issues you might
find.
I also made the following additional changes:
- Remove the left over wording "experimental" (in fact, I have been
using this implementation for ~ 1 year now on daily basis)
- Replace dpy_... by dest_... This is also a legacy naming, and
dest/destination is the proper term in color management (also, dpy does
not make sense when using with mencoder)
- The default BPC black level is too conservative, adjusted it to 10.
Best,
Yue Shi Lai
Reimar Döffinger schrieb:
> On Mon, Aug 25, 2008 at 01:40:23PM -0400, Yue Shi Lai wrote:
>> The algorithm uses internally a 256^3 color LUT (which occupies ~ 50 MB
>> RAM), and by default does not calculate the entire LUT using Little CMS.
>> Instead, a Catmull-Rom spline interpolation is applied on the LUT, which
>> for HAVE_SSE2=1 is implemented using SSE2 assembly to speed up the
>> calculation and reduce the latency it causes during MPlayer startup.
>
> Hm, you might be interested in knowing that -vo gl:yuv=6 uses that
> approach for yuv -> rgb conversion already. I never got around to
> implement loading of a custom 3D texture (also because I did not
> know of a suitable image format), but this might be a way to make
> this available with less CPU usage, too...
> That was actually the only point I wanted to make originally, but since
> I had already been skimming the code...
>
>> Index: libmpcodecs/vf_cm.c
>> ===================================================================
>> --- libmpcodecs/vf_cm.c (revision 0)
>> +++ libmpcodecs/vf_cm.c (revision 0)
>> @@ -0,0 +1,2898 @@
>> +/*
>> + * vf_cm - Experimental color management filter
>> + *
>> + * This filter matches the video color to the color management profile
>> + * of the hardware display using two components: A interface to the
>> + * Little CMS and Argyll color management modules (CMM) a fast Y'CbCr
>> + * color look up table (CLUT).
>
> A proper license would probably be a good idea. In theory, you could
> copy one of the license header on some other file, depending on which
> license you prefer (MPlayer files are most GPL, libav* stuff most LGPL)...
>
>> +#define SMPTE_R_X 0.630
>> +#define SMPTE_R_Y 0.340
>
> stray tab.
>
>> +/* Unsigned 8 bit saturations */
>> +#define UINT8_T_SATURATE(x) \
>> + (uint8_t)((x) < 0 ? 0 : (x) > 255 ? 255 : (x))
>
> libavutil/common.h: av_clip_uint8. Also likely to be faster.
>
>> + u = (111 * clut[i] - 9 * clut[i - 12] +
>> + 31 * clut[i + 12] - 5 * clut[i + 21]) >> 7;
>> + clut[i + 3] = UINT8_T_SATURATE(u);
>> + u = (111 * clut[i + 1] - 9 * clut[i - 11] +
>> + 31 * clut[i + 13] - 5 * clut[i + 22]) >> 7;
>> + clut[i + 4] = UINT8_T_SATURATE(u);
>> + u = (111 * clut[i + 2] - 9 * clut[i - 10] +
>> + 31 * clut[i + 14] - 5 * clut[i + 23]) >> 7;
>> + clut[i + 5] = UINT8_T_SATURATE(u);
>> + u = (71 * clut[i] - (clut[i - 12] << 3) +
>> + 77 * clut[i + 12] - 12 * clut[i + 21]) >> 7;
>> + clut[i + 6] = UINT8_T_SATURATE(u);
>> + u = (71 * clut[i + 1] - (clut[i - 11] << 3) +
>> + 77 * clut[i + 13] - 12 * clut[i + 22]) >> 7;
>> + clut[i + 7] = UINT8_T_SATURATE(u);
>> + u = (71 * clut[i + 2] - (clut[i - 10] << 3) +
>> + 77 * clut[i + 14] - 12 * clut[i + 23]) >> 7;
>> + clut[i + 8] = UINT8_T_SATURATE(u);
>> + u = (28 * clut[i] - 3 * clut[i - 12] +
>> + 117 * clut[i + 12] - 14 * clut[i + 21]) >> 7;
>> + clut[i + 9] = UINT8_T_SATURATE(u);
>> + u = (28 * clut[i + 1] - 3 * clut[i - 11] +
>> + 117 * clut[i + 13] - 14 * clut[i + 22]) >> 7;
>> + clut[i + 10] = UINT8_T_SATURATE(u);
>> + u = (28 * clut[i + 2] - 3 * clut[i - 10] +
>> + 117 * clut[i + 14] - 14 * clut[i + 23]) >> 7;
>> + clut[i + 11] = UINT8_T_SATURATE(u);
>
> I think you should take advantage of that wonderful new invention called
> "loop" ;-)
> If you think speed is a problem, forcing gcc to fully unroll the loop
> and using static const arrays for the coefficients should result in the
> same code - though due to reduced code cache pressure a non-unrolled
> loop may well be faster.
>
>> + /* Coefficients */
>> + /* These constant are in the format __n = n + (n << 16), which
>> + when loaded, avoids an additional pshuflw (i.e. before pshufd)
>> + to generate the correct vector layout. */
>> + static uint32_t __3 = 0x30003;
>> + static uint32_t __5 = 0x50005;
>> + static uint32_t __7 = 0x70007;
>> + static uint32_t __9 = 0x90009;
>> + static uint32_t __12 = 0xc000c;
>> + static uint32_t __14 = 0xe000e;
>> + static uint32_t __28 = 0x1c001c;
>> + static uint32_t __29 = 0x1d001d;
>> + static uint32_t __31 = 0x1f001f;
>> + static uint32_t __33 = 0x210021;
>> + static uint32_t __60 = 0x3c003c;
>> + static uint32_t __61 = 0x3d003d;
>> + static uint32_t __71 = 0x470047;
>> + static uint32_t __77 = 0x4d004d;
>> + static uint32_t __107 = 0x6b006b;
>> + static uint32_t __111 = 0x6f006f;
>> + static uint32_t __117 = 0x750075;
>
> identifiers starting with __ are reserved for the system/system headers.
>
>> + unsigned long stride_j;
>> + unsigned long stride_2_j;
>> + unsigned long stride_3_j;
>> + unsigned long stride_4_j;
>> + unsigned long stride_8_j;
>> + unsigned long i_m_4_stride;
>> + unsigned long i_stride;
>> + unsigned long i_p_1_stride;
>> + unsigned long i_p_2_stride;
>> + unsigned long i_p_3_stride;
>> + unsigned long i_p_4_stride;
>> + unsigned long i_p_8_stride;
>> + unsigned long i_m_4_stride_j;
>> + unsigned long i_stride_j;
>> + unsigned long i_p_1_stride_j;
>> + unsigned long i_p_2_stride_j;
>> + unsigned long i_p_3_stride_j;
>> + unsigned long i_p_4_stride_j;
>> + unsigned long i_p_5_stride_j;
>> + unsigned long i_p_6_stride_j;
>> + unsigned long i_p_7_stride_j;
>
> Using arrays might make it quite a bit less ugly, even if you might have
> to use either less logical indices or leave a few elements unused.
>
>> + for(j = head; j < tail; j += 8) {
>> + __asm__ __volatile__ (
> [...]
>> + : "%xmm0", "%xmm1", "%xmm2", "%xmm3",
>> + "%xmm4", "%xmm5", "%xmm6", "%xmm7");
>> + stride_3_j += 8;
>> + stride_4_j += 8;
>> + stride_8_j += 8;
>> + }
>
> Note that quite a few people recommend to put the loop code inside the
> asm, gcc is quite likely to do something stupid with the loop variable
> otherwise. Not sure if it matters enough.
> IMHO the code also could severely need things to structure it and avoid
> duplication and if possible separate the C and SSE implementations
> - macros, "static inline" functions, whatever.
> Also not sure if it does the right thing if runtime cpudetection is
> enabled.
>
> Greetings,
> Reimar Döffinger
> _______________________________________________
> MPlayer-dev-eng mailing list
> MPlayer-dev-eng at mplayerhq.hu
> https://lists.mplayerhq.hu/mailman/listinfo/mplayer-dev-eng
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mplayer-vf_cm-20080826.patch
Type: text/x-patch
Size: 91845 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20080826/0bcd818f/attachment.bin>
More information about the MPlayer-dev-eng
mailing list