[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Wed Nov 13 17:17:53 CET 2013

On date Wednesday 2013-11-13 15:17:19 +0100, Nicolas George encoded:
> Le tridi 23 brumaire, an CCXXII, Stefano Sabatini a écrit :
> > >From ea516d668152f7c5f59615c40748fc28e510c488 Mon Sep 17 00:00:00 2001
> > From: Stefano Sabatini <stefasab at gmail.com>
> > Date: Thu, 3 Oct 2013 01:21:40 +0200
> > Subject: [PATCH] lavu/avstring: add av_utf8_decode() function
> > 
> > ---
> >  doc/APIchanges       |  3 +++
> >  libavutil/Makefile   |  1 +
> >  libavutil/avstring.c | 53 ++++++++++++++++++++++++++++++++++++++++
> >  libavutil/avstring.h | 26 ++++++++++++++++++++
> >  libavutil/utf8.c     | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  libavutil/version.h  |  2 +-
> >  6 files changed, 153 insertions(+), 1 deletion(-)
> >  create mode 100644 libavutil/utf8.c
> > 
> > diff --git a/doc/APIchanges b/doc/APIchanges
> > index dfdc159..b292d19 100644
> > --- a/doc/APIchanges
> > +++ b/doc/APIchanges
> > @@ -15,6 +15,9 @@ libavutil:     2012-10-22
> >  
> >  API changes, most recent first:
> >  
> > +2013-11-12 - xxxxxxx - lavu 52.53.100 - avstring.h
> > +  Add av_utf8_decode() function.
> > +
> 

> Completely unrelated to this patch, but I wonder whether we should continue
> maintaining this APIchanges file: maybe adding "API-Changes" pseudo-headers
> to the Git commit message, and a script to scan the history, would be more
> practical, especially when it comes to branches for releases.

Yes, on the other hand developers would easily forget to update their
commit log, resulting in missing entries in the resulting output (and
we're not allowed to change log commits). I don't know if git allows
to markup specific commits after they have been committed.

> 
> >  2013-11-xx - xxxxxxx - lavc 55.41.100 / 55.25.0 - avcodec.h
> >                         lavu 52.51.100 - frame.h
> >    Add ITU-R BT.2020 and other not yet included values to color primaries,
> > diff --git a/libavutil/Makefile b/libavutil/Makefile
> > index 7b3b439..19540e4 100644
> > --- a/libavutil/Makefile
> > +++ b/libavutil/Makefile
> > @@ -155,6 +155,7 @@ TESTPROGS = adler32                                                     \
> >              sha                                                         \
> >              sha512                                                      \
> >              tree                                                        \
> > +            utf8                                                        \
> >              xtea                                                        \
> >  
> >  TESTPROGS-$(HAVE_LZO1X_999_COMPRESS) += lzo
> > diff --git a/libavutil/avstring.c b/libavutil/avstring.c
> > index eed58fa..69a0230 100644
> > --- a/libavutil/avstring.c
> > +++ b/libavutil/avstring.c
> > @@ -307,6 +307,59 @@ int av_isxdigit(int c)
> >      return av_isdigit(c) || (c >= 'a' && c <= 'f');
> >  }
> >  
> 
> > +int av_utf8_decode(int32_t *code, const uint8_t **buf, size_t left, int flags)
> 
> I am a bit unsure about the "left" argument: that means *buf and left must
> be updated together. Usually, someone (either in the function or calling it)
> ends up forgetting to update one and it results in a catastrophe.
> 
> I have found more practical to use a pointer to the end of the byte array:
> it is constant across the whole process.
> 
> Another solution would be to make left a pointer and update its value, but
> that is less practical (frequently, the caller will have the size of the
> buffer stored in a different type).
> 
> And very minor nit: I like variable used with bit arithmetic, such as flags,
> to be unsigned.

Changed both, but with the only difference that endp points to the
last byte in the buffer, in order to avoid overflow issues.

> 
> > +{
> > +    const uint8_t *p = *buf;
> > +    uint32_t top;
> > +    int ret = 0;
> > +
> > +    if (!left)
> > +        return 0;
> > +
> > +    *code = *p++;
> > +
> > +    /* first sequence byte starts with 10, or is 1111-1110 or 1111-1111,
> > +       which is not admitted */
> > +    if ((*code & 0xc0) == 0x80 || *code >= 0xFE) {
> > +        ret = AVERROR(EILSEQ);
> > +        goto end;
> > +    }
> > +    top = (*code & 128) >> 1;
> > +
> > +    while (*code & top) {
> > +        int tmp;
> 
> > +        if (!--left) {
> 
> For example, here it would just become "if (p >= end)"
> 
> > +            ret = AVERROR(EILSEQ); /* incomplete sequence */
> > +            goto end;
> > +        }
> > +
> > +        /* we assume the byte to be in the form 10xx-xxxx */
> > +        tmp = *p++ - 128;  /* strip leading 1 */
> > +        if (tmp>>6) {
> > +            ret = AVERROR(EILSEQ);
> > +            goto end;
> > +        }
> > +        *code = (*code<<6) + tmp;
> > +        top <<= 5;
> > +    }
> > +    *code &= (top << 1) - 1;
> > +
> 
> > +    if (flags & AV_UTF8_DECODE_FLAG_CHECK_RANGE) {
> > +        /* only accepts valid Unicode points:
> > +           #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> > +           that is any Unicode character, excluding the surrogate blocks, FFFE, and FFFF */

> > +        if (*code != 0x9 && *code != 0xA && *code != 0xD &&
> > +            (*code <    0x20 || *code >   0xD7FF) &&
> > +            (*code <  0xE000 || *code >   0xFFFD) &&
> > +            (*code < 0x10000 || *code > 0x10FFFF))
> > +            ret = AVERROR(EILSEQ);
> 
> I do not think this is really correct. All code points in the range 0-31 are
> valid control characters. They are invalid in some contexts, such as XML,
> but that is not on the same level as surrogates or malformed characters.
> 

> I suggest to enable Unicode range check by default, with a flag to disable
> it, and to have a second flag to check XML validity.

I implemented the code < (1<<31) check in the patch. I don't know what
you exactly mean by "Unicode range check", indeed there is a lot of
documentation about which code points should be considered valid, and
for some it is not entirely clear (for example surrogates).

Which flags do you propose to support?

> Also, did you check whether the control codes in the 0x7F-0x9F range are
> acceptable in XML?

I cheated, indeed this list is directly taken from the XML specs:
http://www.w3.org/TR/xml/#charsets

after much time spent browsing various Unicode documents. Thus I
suppose these ranges should be universally accepted by XML parsers.

On the other hand I'm not sure what we should really disallow by
default, for example JSON parsers are usually much less strict than
XML parsers with regards to accepted code-points.

[...]

Other issues fixed.
-- 
FFmpeg = Fast Fundamentalist Most Puritan Evangelical Governor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-lavu-avstring-add-av_utf8_decode-function.patch
Type: text/x-diff
Size: 7998 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20131113/b497e27f/attachment.bin>