[FFmpeg-devel] ffprobe can produce invalid XML

Sat Apr 7 17:48:31 CEST 2012

On date Tuesday 2012-04-03 10:54:19 +0000, Ian Ashley encoded:
> My original post seemed to be missing the body and only have the
> attachments so here goes again. Sorry if the full message arrives
> more than once.
> 

> ffprobe can put strings in assets directly into the XML
> returned. This can result in invalid UTF-8 strings being included in
> the XML meaning that the XML cannot be parsed by, for example,
> libxml.

Do you have a command for reproducing the failure with a libxml
application?

> The problem can be reproduced on the latest git head using the attached movie and the command
> ffprobe -print_format xml -show_streams -loglevel 0 cnet.mp4
> [ alas the movie is too big to post to the mailing list and I don't have a smaller sample that causes the problem, do you have an FTP site where I can post the movie? ]
> 
> The output is attached. The invalid character is before the V on the line
> <tag key="handler_name" value=" VideoHandler"/>
> 
> A patch file produced using diff -u old/ffprobe.c new/ffprobe.c is
> attached. The replaces the xml_escape_str function by one that
> replaces any invalid UTF-8 characters by the inverted question mark.
> 
> Regards,
> Ian

> --- old/ffprobe.c	2012-04-03 09:11:27.000000000 +0100
> +++ new/ffprobe.c	2012-04-03 10:59:06.000000000 +0100
> @@ -1058,22 +1058,77 @@
>  static const char *xml_escape_str(char **dst, size_t *dst_size, const char *src,
>                                    void *log_ctx)
>  {
> +    // the unknown character (inverted question mark)
> +    const unsigned char BAD_CHARACTER_1 = 194, BAD_CHARACTER_2 = 191;
> +
>      const char *p;
>      char *q;
> +    int copyAll = 1;
>      size_t size = 1;
>  
>      /* precompute size */
> -    for (p = src; *p; p++, size++) {
> +    for (p = src; *p;) {
> +		int badChar = 0;
> +		unsigned char byte;
> +
>          ESCAPE_CHECK_SIZE(src, size, SIZE_MAX-10);
> -        switch (*p) {
> -        case '&' : size += strlen("&");  break;
> -        case '<' : size += strlen("<");   break;
> -        case '>' : size += strlen(">");   break;
> -        case '\"': size += strlen("""); break;
> -        case '\'': size += strlen("'"); break;
> -        default: size++;
> -        }
> -    }
> +

> +		byte = (unsigned char)*p;

why?

> +		if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
> +			badChar = 1;

so this is the condition to check for a "bad char"

> +			++p;
> +		} else if (byte < 128) {
> +			switch (byte) {
> +				case '&' : size += 5; /* & */  copyAll = 0; break;
> +				case '<' : size += 4; /* < */   copyAll = 0; break;
> +				case '>' : size += 4; /* > */   copyAll = 0; break;
> +				case '\"': size += 6; /* " */ copyAll = 0; break;
> +				case '\'': size += 6; /* ' */ copyAll = 0; break;
> +				default: size++;
> +				}
> +			++p;
> +			++size;
> +			}
> +		else if (byte < 0xC0)
> +			{
> +			badChar = 1;
> +			++p;
> +			}
> +		else
> +			{
> +			int extra;
> +
> +			copyAll = 0;
> +			if (byte < 0xe0)
> +				extra = 1;
> +			else if (byte < 0xf0)
> +				extra = 2;
> +			else if (byte < 0xf8)
> +				extra = 3;
> +			else
> +				badChar = 1;
> +
> +			if (badChar)
> +				++p;
> +			else
> +				{
> +				++p;
> +				for (int i = 0; i < extra && *p; ++i, ++p)
> +					{
> +					byte = (unsigned char)*p;
> +					if ((byte & 0xc0) != 0x80)
> +						badChar = 1;
> +					}
> +				if (!badChar)
> +					size += extra;
> +				}
> +			}
> +		if (badChar) {
> +			size += 2;
> +			copyAll = 0;
> +			}
> +		}
> +
>      ESCAPE_REALLOC_BUF(dst_size, dst, src, size);
>  
>  #define COPY_STR(str) {      \
> @@ -1084,18 +1139,74 @@
>  
>      p = src;
>      q = *dst;
> -    while (*p) {
> -        switch (*p) {
> -        case '&' : COPY_STR("&");  break;
> -        case '<' : COPY_STR("<");   break;
> -        case '>' : COPY_STR(">");   break;
> -        case '\"': COPY_STR("""); break;
> -        case '\'': COPY_STR("'"); break;
> -        default: *q++ = *p;
> -        }
> -        p++;
> -    }
> -    *q = 0;

> +    if (copyAll)
> +    	COPY_STR(p)

the copyAll optimization seems unrelated to the fix (and makes much
harder to follow the logic of the patch)

> +    else {
> +		while (*p) {
> +			int badChar = 0;
> +			unsigned char byte;
> +	
> +			byte = (unsigned char)*p;
> +			if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
> +				badChar = 1;
> +				++p;
> +			} else if (byte < 128) {
> +				switch (byte) {
> +					case '&' : COPY_STR("&");  break;
> +					case '<' : COPY_STR("<");   break;
> +					case '>' : COPY_STR(">");   break;
> +					case '\"': COPY_STR("""); break;
> +					case '\'': COPY_STR("'"); break;
> +					default: *q++ = *p;
> +					}
> +				++p;
> +				++size;
> +				}
> +			else if (byte < 0xC0)
> +				{
> +				badChar = 1;
> +				++p;
> +				}
> +			else
> +				{
> +				int extra;
> +	
> +				copyAll = 0;
> +				if (byte < 0xe0)
> +					extra = 1;
> +				else if (byte < 0xf0)
> +					extra = 2;
> +				else if (byte < 0xf8)
> +					extra = 3;
> +				else
> +					badChar = 1;
> +	
> +				if (badChar)
> +					++p;
> +				else
> +					{
> +					const char *startChar = p;
> +					int i;
> +					++p;
> +					for (i = 0; i < extra && *p; ++i, ++p)
> +						{
> +						byte = (unsigned char)*p;
> +						if ((byte & 0xc0) != 0x80)
> +							badChar = 1;
> +						}
> +					if (!badChar) {
> +						for (i = 0; i < extra;)
> +							*q++ = *startChar++;
> +					}
> +				}
> +			}
> +			if (badChar) {
> +				*q++ = BAD_CHARACTER_1;
> +				*q++ = BAD_CHARACTER_2;
> +			}
> +		}
> +	}
> +	*q = 0;

Also this looks more like UTF-8 validation rather than something
related to XML escaping.

Wait before to send a new patch, since I'm changing the escaping code
in a recent patch (ffprobe: use avbprint API).

For creating a patch:
git diff > patch.diff

is fine, git format-patch is better.
-- 
FFmpeg = Fundamental & Fundamental Magic Powered Evangelical Game