[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Wed Nov 13 15:17:19 CET 2013

Le tridi 23 brumaire, an CCXXII, Stefano Sabatini a écrit :
> >From ea516d668152f7c5f59615c40748fc28e510c488 Mon Sep 17 00:00:00 2001
> From: Stefano Sabatini <stefasab at gmail.com>
> Date: Thu, 3 Oct 2013 01:21:40 +0200
> Subject: [PATCH] lavu/avstring: add av_utf8_decode() function
> 
> ---
>  doc/APIchanges       |  3 +++
>  libavutil/Makefile   |  1 +
>  libavutil/avstring.c | 53 ++++++++++++++++++++++++++++++++++++++++
>  libavutil/avstring.h | 26 ++++++++++++++++++++
>  libavutil/utf8.c     | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  libavutil/version.h  |  2 +-
>  6 files changed, 153 insertions(+), 1 deletion(-)
>  create mode 100644 libavutil/utf8.c
> 
> diff --git a/doc/APIchanges b/doc/APIchanges
> index dfdc159..b292d19 100644
> --- a/doc/APIchanges
> +++ b/doc/APIchanges
> @@ -15,6 +15,9 @@ libavutil:     2012-10-22
>  
>  API changes, most recent first:
>  
> +2013-11-12 - xxxxxxx - lavu 52.53.100 - avstring.h
> +  Add av_utf8_decode() function.
> +

Completely unrelated to this patch, but I wonder whether we should continue
maintaining this APIchanges file: maybe adding "API-Changes" pseudo-headers
to the Git commit message, and a script to scan the history, would be more
practical, especially when it comes to branches for releases.

>  2013-11-xx - xxxxxxx - lavc 55.41.100 / 55.25.0 - avcodec.h
>                         lavu 52.51.100 - frame.h
>    Add ITU-R BT.2020 and other not yet included values to color primaries,
> diff --git a/libavutil/Makefile b/libavutil/Makefile
> index 7b3b439..19540e4 100644
> --- a/libavutil/Makefile
> +++ b/libavutil/Makefile
> @@ -155,6 +155,7 @@ TESTPROGS = adler32                                                     \
>              sha                                                         \
>              sha512                                                      \
>              tree                                                        \
> +            utf8                                                        \
>              xtea                                                        \
>  
>  TESTPROGS-$(HAVE_LZO1X_999_COMPRESS) += lzo
> diff --git a/libavutil/avstring.c b/libavutil/avstring.c
> index eed58fa..69a0230 100644
> --- a/libavutil/avstring.c
> +++ b/libavutil/avstring.c
> @@ -307,6 +307,59 @@ int av_isxdigit(int c)
>      return av_isdigit(c) || (c >= 'a' && c <= 'f');
>  }
>  

> +int av_utf8_decode(int32_t *code, const uint8_t **buf, size_t left, int flags)

I am a bit unsure about the "left" argument: that means *buf and left must
be updated together. Usually, someone (either in the function or calling it)
ends up forgetting to update one and it results in a catastrophe.

I have found more practical to use a pointer to the end of the byte array:
it is constant across the whole process.

Another solution would be to make left a pointer and update its value, but
that is less practical (frequently, the caller will have the size of the
buffer stored in a different type).

And very minor nit: I like variable used with bit arithmetic, such as flags,
to be unsigned.

> +{
> +    const uint8_t *p = *buf;
> +    uint32_t top;
> +    int ret = 0;
> +
> +    if (!left)
> +        return 0;
> +
> +    *code = *p++;
> +
> +    /* first sequence byte starts with 10, or is 1111-1110 or 1111-1111,
> +       which is not admitted */
> +    if ((*code & 0xc0) == 0x80 || *code >= 0xFE) {
> +        ret = AVERROR(EILSEQ);
> +        goto end;
> +    }
> +    top = (*code & 128) >> 1;
> +
> +    while (*code & top) {
> +        int tmp;

> +        if (!--left) {

For example, here it would just become "if (p >= end)"

> +            ret = AVERROR(EILSEQ); /* incomplete sequence */
> +            goto end;
> +        }
> +
> +        /* we assume the byte to be in the form 10xx-xxxx */
> +        tmp = *p++ - 128;   /* strip leading 1 */
> +        if (tmp>>6) {
> +            ret = AVERROR(EILSEQ);
> +            goto end;
> +        }
> +        *code = (*code<<6) + tmp;
> +        top <<= 5;
> +    }
> +    *code &= (top << 1) - 1;
> +

> +    if (flags & AV_UTF8_DECODE_FLAG_CHECK_RANGE) {
> +        /* only accepts valid Unicode points:
> +           #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> +           that is any Unicode character, excluding the surrogate blocks, FFFE, and FFFF */
> +        if (*code != 0x9 && *code != 0xA && *code != 0xD &&
> +            (*code <    0x20 || *code >   0xD7FF) &&
> +            (*code <  0xE000 || *code >   0xFFFD) &&
> +            (*code < 0x10000 || *code > 0x10FFFF))
> +            ret = AVERROR(EILSEQ);

I do not think this is really correct. All code points in the range 0-31 are
valid control characters. They are invalid in some contexts, such as XML,
but that is not on the same level as surrogates or malformed characters.

I suggest to enable Unicode range check by default, with a flag to disable
it, and to have a second flag to check XML validity.

Also, did you check whether the control codes in the 0x7F-0x9F range are
acceptable in XML?

> +    }
> +

> +end:
> +    *buf = p;
> +    return ret;

Is there a reason you are working directly with *code but using a separate
variable for *buf?

> +}
> +
>  #ifdef TEST
>  
>  int main(void)
> diff --git a/libavutil/avstring.h b/libavutil/avstring.h
> index 438ef79..b4f3b07 100644
> --- a/libavutil/avstring.h
> +++ b/libavutil/avstring.h
> @@ -22,6 +22,7 @@
>  #define AVUTIL_AVSTRING_H
>  
>  #include <stddef.h>
> +#include <stdint.h>
>  #include "attributes.h"
>  
>  /**
> @@ -296,6 +297,31 @@ int av_escape(char **dst, const char *src, const char *special_chars,
>                enum AVEscapeMode mode, int flags);
>  
>  /**

> + * Check UNICODE range validity of the decoded code.

Nit: Unicode.

> + */
> +#define AV_UTF8_DECODE_FLAG_CHECK_RANGE 1
> +
> +/**

> + * Read and decode a single UTF-8 character sequence from buffer in

Nit: character -> code point, possibly with character in parentheses. The
issue is that some people consider an Unicode character to be made of the
base character plus all the following combinant characters.

> + * *buf, and update *buf to point to the next byte after the parsed
> + * sequence.
> + *
> + * In case of invalid sequence, the pointer will be updated to the
> + * next byte after the invalid sequence.
> + *
> + * @param code pointer whose pointed value is updated to keep the
> + * parsed code in case of success

"Pointer used to return the parsed code in case of success"

> + * @param left bytes left to read in the buffer. By default it won't
> + * read more than 6 bytes (maximum number of bytes in an UTF-8
> + * sequence).
> + * @param flags a collection of AV_UTF8_DECODE_FLAG_* flags
> + * @return >= 0 in case a sequence was successfully read, a negative
> + * value in case of invalid sequence

Nit: indenting the descriptions after the parameters names will make the
doxygen comment itself more readable.

> + * @see GET_UTF8()
> + */
> +int av_utf8_decode(int32_t *code, const uint8_t **buf, size_t left, int flags);
> +
> +/**
>   * @}
>   */
>  
> diff --git a/libavutil/utf8.c b/libavutil/utf8.c
> new file mode 100644
> index 0000000..4be0e95
> --- /dev/null
> +++ b/libavutil/utf8.c
> @@ -0,0 +1,69 @@
> +/*
> + * Copyright (c) 2013 Stefano Sabatini
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <stdio.h>
> +
> +#include "libavutil/avstring.h"
> +#include "libavutil/file.h"
> +
> +static void print_sequence(const char *p, int l, int indent)
> +{
> +    int i;
> +    for (i = 0; i < l; i++)
> +        printf("%02X", (uint8_t)p[i]);

> +    for (i = l*2; i < indent; i++)
> +        printf("%c", ' ');

Nuf? What is wrong with printf(" ")? Or putchar(' ')? But to print a
sequence of spaces, there is even simpler using printf's power:

	printf("%*s", l * 2, "");

> +}
> +
> +int main(int argc, char **argv)
> +{
> +    int ret;
> +    char *filename = argv[1];
> +    uint8_t *file_buf;
> +    size_t file_buf_size;
> +    uint32_t code;
> +    const uint8_t *p;
> +    int i = 0;
> +
> +    ret = av_file_map(filename, &file_buf, &file_buf_size, 0, NULL);

> +    if (ret < 0)
> +        goto end;

Should you be unmaping a file that was not successfully mapped?

> +
> +    p = file_buf;
> +    while (i < file_buf_size) {
> +        int l, r;
> +        const uint8_t *p0 = p;
> +        r = av_utf8_decode(&code, &p, file_buf_size - i, 0);
> +        l = (int)(p-p0);
> +        print_sequence(p0, l, 20);
> +        if (r >= 0) {
> +            printf("%-10d 0x%-10X %-5d ", code, code, l);
> +            if (*p0 == '\n') { printf("\\n\n"); }
> +            else             { printf ("%.*s\n", l, p0); }
> +        } else {
> +            printf("-> illegal sequence\n");
> +        }
> +        i += l;
> +    }
> +
> +end:
> +    av_file_unmap(file_buf, file_buf_size);
> +    return ret < 0;
> +}
> diff --git a/libavutil/version.h b/libavutil/version.h
> index 2e2f571..c6ec6e0 100644
> --- a/libavutil/version.h
> +++ b/libavutil/version.h
> @@ -75,7 +75,7 @@
>   */
>  
>  #define LIBAVUTIL_VERSION_MAJOR  52
> -#define LIBAVUTIL_VERSION_MINOR  52
> +#define LIBAVUTIL_VERSION_MINOR  53
>  #define LIBAVUTIL_VERSION_MICRO 100
>  
>  #define LIBAVUTIL_VERSION_INT   AV_VERSION_INT(LIBAVUTIL_VERSION_MAJOR, \

Regards,

-- 
  Nicolas George
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20131113/3ae9650c/attachment.asc>