[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Stefano Sabatini stefasab at gmail.com
Thu Nov 14 10:57:47 CET 2013


On date Wednesday 2013-11-13 19:15:31 +0100, Stefano Sabatini encoded:
> On date Wednesday 2013-11-13 17:51:32 +0100, Nicolas George encoded:
> > On tridi 23 Brumaire, year CCXXII, Stefano Sabatini wrote:
> > > Yes, on the other hand developers would easily forget to update their
> > > commit log, resulting in missing entries in the resulting output (and
> > > we're not allowed to change log commits). I don't know if git allows
> > > to markup specific commits after they have been committed.
> > 
> > There is git-notes, which lets you attach a note that is displayed along
> > with the commit message. Unfortunately, notes are not cloned by default. Another
> > solution would be to add an empty commit to the history with the APIchanges
> > tag; unfortunately, in that case the commit would not appear in the log for
> > the corresponding files.
> > 
> > But enough of these digressions.
> > 
> > > Changed both, but with the only difference that endp points to the
> > > last byte in the buffer, in order to avoid overflow issues.
> > 
> > The C standard specifically allows pointers to the first byte after an
> > object, probably exactly for this kind of situation. And it is easier to
> > write:
> > 
> >     end = buf + size;
> > 
> > ... than to subtract one, because then you must check size for 0 (C does not
> > allow a pointer to the byte before an object, and anyway size is probably
> > unsigned).
> 
> Suppose that PTR+1 overflows: then PTR+1 = 0 < PTR, and in this case
> the code will misbehave. I don't know whether the specs explicitly
> rule this out (that is, whether PTR+1 is guaranteed not to overflow
> for every pointer to an allocated byte).
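
For reference, C99 6.5.6 permits forming a pointer one past the last element of an array object and comparing it against pointers into that object, as long as it is never dereferenced, so buf + size cannot wrap for a valid buffer. A minimal sketch of the resulting loop idiom follows; count_non_continuation() is a hypothetical helper made up for illustration, not something from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: counts the bytes in
 * buf[0..size) that are not UTF-8 continuation bytes (10xxxxxx). */
static size_t count_non_continuation(const uint8_t *buf, size_t size)
{
    assert(buf != NULL);
    /* One past the last byte: legal to compute and to compare against,
     * but never dereferenced. Works unchanged when size == 0. */
    const uint8_t *end = buf + size;
    size_t n = 0;

    for (const uint8_t *p = buf; p < end; p++)
        if ((*p & 0xC0) != 0x80)
            n++;
    return n;
}
```

With end = buf + size there is no size == 0 special case, whereas a "last byte" pointer would need one before subtracting.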
> 
> > > I implemented the code < (1<<31) check in the patch. I don't know what
> > > exactly you mean by "Unicode range check"; there is a lot of
> > > documentation about which code points should be considered valid, and
> > > for some it is not entirely clear (for example surrogates).
> > 
> > There is absolutely no doubt about surrogates: they are only valid in
> > UTF-16.
> > The most ambiguous issue is the upper bound: it was initially 0xFFFF, then
> > became 0x7FFFFFFF when thousands of ideograms were found in old books, and
> > > then was lowered to 0x10FFFF when it became apparent that Microsoft and Sun
> > had once again made a mess with UTF-16.
> > 
> > > Which flags do you propose to support?
> > 
> > Default: accept any code that is structurally valid in current Unicode:
> > 0x000000-0x10FFFF except the surrogates (0xD800-0xDFFF), 0xFFFE and 0xFFFF.
> 
> > Flag #1: accept any code that is structurally possible in UTF-8, i.e.
> > 0x00000000-0x7FFFFFFF.
> > Flag #2: reject codes that would make XML choke.
> 
> That is: exclude various ASCII control codes, UTF-16 surrogates, and
> codes above the 0x10FFFF upper bound.
> 
> > (Flag #3: toggle the default check for overlong encodings.)
> 
> ?
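
For context, an overlong encoding is a UTF-8 sequence that uses more bytes than the code point needs, for example 0xC0 0x80 for U+0000. A minimal sketch of such a check, assuming the decoder already knows the decoded code and the number of bytes it consumed; both function names are hypothetical and not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: fewest bytes that the original (up to 6-byte)
 * UTF-8 scheme needs for a code in 0..0x7FFFFFFF. */
static int utf8_min_len(uint32_t code)
{
    assert(code <= 0x7FFFFFFF);
    if (code < 0x80)      return 1;
    if (code < 0x800)     return 2;
    if (code < 0x10000)   return 3;
    if (code < 0x200000)  return 4;
    if (code < 0x4000000) return 5;
    return 6;
}

/* Non-zero if a sequence of used_len bytes that decoded to code
 * is longer than necessary, i.e. overlong. */
static int utf8_is_overlong(uint32_t code, int used_len)
{
    return used_len > utf8_min_len(code);
}
```

A flag toggling this check would then only decide whether a non-zero result is treated as an error.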
> 
> Or we could have something like:
> AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_OVERLONG       ///< exclude codepoints over 0x10FFFF
> AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_CONTROL        ///< exclude invalid XML control codes
> AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_SURROGATES     ///< exclude UTF-16 surrogates codes
> AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_NON_CHARACTERS ///< exclude non-characters - 0xFFFE and 0xFFFF
> 
> and so we could define:
> #define AV_UTF8_CHECK_RANGE_FLAG_XML \
>         EXCLUDE_SURROGATES|EXCLUDE_OVERLONG|EXCLUDE_NON_CHARACTERS|EXCLUDE_CONTROL
> 
> A safe default could be:
> #define AV_UTF8_CHECK_RANGE_FLAG_LOOSE \
>         EXCLUDE_SURROGATES|EXCLUDE_OVERLONG|EXCLUDE_NON_CHARACTERS
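
A minimal sketch of how such flags could drive the range check. All SKETCH_* names and sketch_code_is_allowed() are hypothetical and only mirror the proposal above; they are not an existing libavutil API. The control-code rule assumes the XML 1.0 Char production, which below 0x20 permits only 0x09, 0x0A and 0x0D:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names modelled on the proposal above; not an existing API. */
enum {
    SKETCH_EXCLUDE_OVERLONG       = 1 << 0, /* codes over 0x10FFFF (name as proposed) */
    SKETCH_EXCLUDE_CONTROL        = 1 << 1, /* control codes invalid in XML */
    SKETCH_EXCLUDE_SURROGATES     = 1 << 2, /* 0xD800-0xDFFF */
    SKETCH_EXCLUDE_NON_CHARACTERS = 1 << 3, /* 0xFFFE and 0xFFFF */

    SKETCH_LOOSE = SKETCH_EXCLUDE_SURROGATES | SKETCH_EXCLUDE_OVERLONG |
                   SKETCH_EXCLUDE_NON_CHARACTERS,
    SKETCH_XML   = SKETCH_LOOSE | SKETCH_EXCLUDE_CONTROL,
};

/* Returns 1 if code passes every check selected in flags, 0 otherwise. */
static int sketch_code_is_allowed(uint32_t code, unsigned flags)
{
    assert(code <= 0x7FFFFFFF); /* structural UTF-8 limit */

    if ((flags & SKETCH_EXCLUDE_OVERLONG) && code > 0x10FFFF)
        return 0;
    if ((flags & SKETCH_EXCLUDE_CONTROL) && code < 0x20 &&
        code != 0x09 && code != 0x0A && code != 0x0D)
        return 0;
    if ((flags & SKETCH_EXCLUDE_SURROGATES) && code >= 0xD800 && code <= 0xDFFF)
        return 0;
    if ((flags & SKETCH_EXCLUDE_NON_CHARACTERS) &&
        (code == 0xFFFE || code == 0xFFFF))
        return 0;
    return 1;
}
```

Keeping each exclusion as its own bit lets a caller such as a JSON writer pick the loose set, while an XML writer adds the control-code exclusion on top.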
> 
> > > I cheated, indeed this list is directly taken from the XML specs:
> > > http://www.w3.org/TR/xml/#charsets
> > > 
> > > after much time spent browsing various Unicode documents. Thus I
> > > suppose these ranges should be universally accepted by XML parsers.
> > 
> > Ok.
> > 
> > > On the other hand I'm not sure what we should really disallow by
> > > default, for example JSON parsers are usually much less strict than
> > > XML parsers with regard to accepted code-points.
> > 
> > I agree, but surrogates, 0xFFFE, 0xFFFF and codes beyond 0x10FFFF should
> > really not be there.

Updated.
-- 
FFmpeg = Funny Fanciful Meaningful Pacific Exxagerate Ghost
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-lavu-avstring-add-av_utf8_decode-function.patch
Type: text/x-diff
Size: 8660 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20131114/25b40ad3/attachment.bin>

