[FFmpeg-devel] ffprobe can produce invalid XML
Stefano Sabatini
stefasab at gmail.com
Sat Apr 7 17:48:31 CEST 2012
On date Tuesday 2012-04-03 10:54:19 +0000, Ian Ashley encoded:
> My original post seemed to be missing the body and only have the
> attachments so here goes again. Sorry if the full message arrives
> more than once.
>
> ffprobe can put strings in assets directly into the XML
> returned. This can result in invalid UTF-8 strings being included in
> the XML meaning that the XML cannot be parsed by, for example,
> libxml.
Do you have a command for reproducing the failure with a libxml
application?
> The problem can be reproduced on the latest git head using the attached movie and the command
> ffprobe -print_format xml -show_streams -loglevel 0 cnet.mp4
> [ alas the movie is too big to post to the mailing list and I don't have a smaller sample that causes the problem, do you have an FTP site where I can post the movie? ]
>
> The output is attached. The invalid character is before the V on the line
> <tag key="handler_name" value=" VideoHandler"/>
>
> A patch file produced using diff -u old/ffprobe.c new/ffprobe.c is
> attached. The replaces the xml_escape_str function by one that
> replaces any invalid UTF-8 characters by the inverted question mark.
>
> Regards,
> Ian
> --- old/ffprobe.c 2012-04-03 09:11:27.000000000 +0100
> +++ new/ffprobe.c 2012-04-03 10:59:06.000000000 +0100
> @@ -1058,22 +1058,77 @@
> static const char *xml_escape_str(char **dst, size_t *dst_size, const char *src,
> void *log_ctx)
> {
> + // the unknown character (inverted question mark)
> + const unsigned char BAD_CHARACTER_1 = 194, BAD_CHARACTER_2 = 191;
> +
> const char *p;
> char *q;
> + int copyAll = 1;
> size_t size = 1;
>
> /* precompute size */
> - for (p = src; *p; p++, size++) {
> + for (p = src; *p;) {
> + int badChar = 0;
> + unsigned char byte;
> +
> ESCAPE_CHECK_SIZE(src, size, SIZE_MAX-10);
> - switch (*p) {
> - case '&' : size += strlen("&"); break;
> - case '<' : size += strlen("<"); break;
> - case '>' : size += strlen(">"); break;
> - case '\"': size += strlen("""); break;
> - case '\'': size += strlen("'"); break;
> - default: size++;
> - }
> - }
> +
> + byte = (unsigned char)*p;
why?
> + if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
> + badChar = 1;
so this is the condition to check for a "bad char"
> + ++p;
> + } else if (byte < 128) {
> + switch (byte) {
> + case '&' : size += 5; /* & */ copyAll = 0; break;
> + case '<' : size += 4; /* < */ copyAll = 0; break;
> + case '>' : size += 4; /* > */ copyAll = 0; break;
> + case '\"': size += 6; /* " */ copyAll = 0; break;
> + case '\'': size += 6; /* ' */ copyAll = 0; break;
> + default: size++;
> + }
> + ++p;
> + ++size;
> + }
> + else if (byte < 0xC0)
> + {
> + badChar = 1;
> + ++p;
> + }
> + else
> + {
> + int extra;
> +
> + copyAll = 0;
> + if (byte < 0xe0)
> + extra = 1;
> + else if (byte < 0xf0)
> + extra = 2;
> + else if (byte < 0xf8)
> + extra = 3;
> + else
> + badChar = 1;
> +
> + if (badChar)
> + ++p;
> + else
> + {
> + ++p;
> + for (int i = 0; i < extra && *p; ++i, ++p)
> + {
> + byte = (unsigned char)*p;
> + if ((byte & 0xc0) != 0x80)
> + badChar = 1;
> + }
> + if (!badChar)
> + size += extra;
> + }
> + }
> + if (badChar) {
> + size += 2;
> + copyAll = 0;
> + }
> + }
> +
> ESCAPE_REALLOC_BUF(dst_size, dst, src, size);
>
> #define COPY_STR(str) { \
> @@ -1084,18 +1139,74 @@
>
> p = src;
> q = *dst;
> - while (*p) {
> - switch (*p) {
> - case '&' : COPY_STR("&"); break;
> - case '<' : COPY_STR("<"); break;
> - case '>' : COPY_STR(">"); break;
> - case '\"': COPY_STR("""); break;
> - case '\'': COPY_STR("'"); break;
> - default: *q++ = *p;
> - }
> - p++;
> - }
> - *q = 0;
> + if (copyAll)
> + COPY_STR(p)
the copyAll optimization seems unrelated to the fix (and makes much
harder to follow the logic of the patch)
> + else {
> + while (*p) {
> + int badChar = 0;
> + unsigned char byte;
> +
> + byte = (unsigned char)*p;
> + if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
> + badChar = 1;
> + ++p;
> + } else if (byte < 128) {
> + switch (byte) {
> + case '&' : COPY_STR("&"); break;
> + case '<' : COPY_STR("<"); break;
> + case '>' : COPY_STR(">"); break;
> + case '\"': COPY_STR("""); break;
> + case '\'': COPY_STR("'"); break;
> + default: *q++ = *p;
> + }
> + ++p;
> + ++size;
> + }
> + else if (byte < 0xC0)
> + {
> + badChar = 1;
> + ++p;
> + }
> + else
> + {
> + int extra;
> +
> + copyAll = 0;
> + if (byte < 0xe0)
> + extra = 1;
> + else if (byte < 0xf0)
> + extra = 2;
> + else if (byte < 0xf8)
> + extra = 3;
> + else
> + badChar = 1;
> +
> + if (badChar)
> + ++p;
> + else
> + {
> + const char *startChar = p;
> + int i;
> + ++p;
> + for (i = 0; i < extra && *p; ++i, ++p)
> + {
> + byte = (unsigned char)*p;
> + if ((byte & 0xc0) != 0x80)
> + badChar = 1;
> + }
> + if (!badChar) {
> + for (i = 0; i < extra;)
> + *q++ = *startChar++;
> + }
> + }
> + }
> + if (badChar) {
> + *q++ = BAD_CHARACTER_1;
> + *q++ = BAD_CHARACTER_2;
> + }
> + }
> + }
> + *q = 0;
Also this looks more like UTF-8 validation rather than something
related to XML escaping.
Wait before to send a new patch, since I'm changing the escaping code
in a recent patch (ffprobe: use avbprint API).
For creating a patch:
git diff > patch.diff
is fine, git format-patch is better.
--
FFmpeg = Fundamental & Fundamental Magic Powered Evangelical Game
More information about the ffmpeg-devel
mailing list