[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Mon Nov 11 00:43:17 CET 2013

On Sun, Nov 10, 2013 at 01:16:44PM +0100, Stefano Sabatini wrote:
> On date Friday 2013-11-08 11:41:36 +0100, Stefano Sabatini encoded:
> > On date Thursday 2013-11-07 15:40:54 -0800, Timothy Gu encoded:
> > > On Nov 7, 2013 2:22 PM, "Lukasz M" <lukasz.m.luki at gmail.com> wrote:
> > [...]
> > > > > av_get_utf8_code()?
> > > > > av_get_code_from_utf8()?
> > > > > av_decode_utf8()?
> > > > >
> > > > > I don't mind changing the name of the function.
> > > > >
> > > >
> > 
> > > > When I posted I had av_get_code_from_utf8() in mind, but all seems OK.
> > > 
> > > I prefer av_decode_utf8 which seems to be the shortest while preserving the
> > > true meaning.
> > 
> > or av_utf8_decode(), in case we add av_utf8_encode().
> > 
> > Still waiting for a complete review, will auto-review and push in
> > three days if I see none.
> 
> Updated, renamed to av_utf8_decode() to provide hierarchical naming
> (and help intellisense in case we add av_utf8_encode()).
> 
> I also added another parameter to deal with buffer overreads in case
> of unterminated sequences towards the end of a buffer.
> 
> Please comment, I'd like to push it soon.
> -- 
> FFmpeg = Fanciful and Friendly Monstrous Political Erroneous Geisha

>  Makefile   |    1 
>  avstring.c |   40 +++++++++++++++++++++++++++++++++++
>  avstring.h |   19 ++++++++++++++++
>  utf8.c     |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 129 insertions(+)
> b4aae72461f90e3d30a7f831761172cdacf84089  0002-lavu-avstring-add-av_utf8_decode-function.patch
> From 55d076a23a9e41d2d5af0222221471af1d2a30e5 Mon Sep 17 00:00:00 2001
> From: Stefano Sabatini <stefasab at gmail.com>
> Date: Thu, 3 Oct 2013 01:21:40 +0200
> Subject: [PATCH] lavu/avstring: add av_utf8_decode() function
> 
> TODO: minor bump, APIchanges entry
> ---
>  libavutil/Makefile   |  1 +
>  libavutil/avstring.c | 40 ++++++++++++++++++++++++++++++
>  libavutil/avstring.h | 19 +++++++++++++++
>  libavutil/utf8.c     | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 129 insertions(+)
>  create mode 100644 libavutil/utf8.c
> 
> diff --git a/libavutil/Makefile b/libavutil/Makefile
> index 7b3b439..19540e4 100644
> --- a/libavutil/Makefile
> +++ b/libavutil/Makefile
> @@ -155,6 +155,7 @@ TESTPROGS = adler32                                                     \
>              sha                                                         \
>              sha512                                                      \
>              tree                                                        \
> +            utf8                                                        \
>              xtea                                                        \
>  
>  TESTPROGS-$(HAVE_LZO1X_999_COMPRESS) += lzo
> diff --git a/libavutil/avstring.c b/libavutil/avstring.c
> index eed58fa..8666c86 100644
> --- a/libavutil/avstring.c
> +++ b/libavutil/avstring.c
> @@ -307,6 +307,46 @@ int av_isxdigit(int c)
>      return av_isdigit(c) || (c >= 'a' && c <= 'f');
>  }
>  
> +int av_utf8_decode(int32_t *code, const uint8_t **buf, size_t left)
> +{
> +    const uint8_t *p = *buf;
> +    uint32_t top;
> +
> +    if (!left)
> +        return 0;
> +
> +    *code = *p++;
> +
> +    /* first sequence byte starts with 10, or is 1111-1110 or 1111-1111,
> +       which is not admitted */
> +    if ((*code & 0xc0) == 0x80 || *code >= 0xFE) {
> +        *buf = p;
> +        return AVERROR(EINVAL);
> +    }
> +    top = (*code & 128) >> 1;
> +
> +    while (*code & top) {
> +        int tmp;
> +        if (!--left) {
> +            *buf = p;
> +            return AVERROR(EINVAL); /* incomplete sequence */
> +        }
> +
> +        /* we assume the byte to be in the form 10xx-xxxx */
> +        tmp = *p++ - 128;   /* strip leading 1 */
> +        if (tmp>>6) {
> +            *buf = p;
> +            return AVERROR(EINVAL);
> +        }
> +        *code = (*code<<6) + tmp;
> +        top <<= 5;
> +        left--;
> +    }
> +    *code &= (top << 1) - 1;
> +    *buf = p;
> +    return 0;
> +}
> +
>  #ifdef TEST
>  
>  int main(void)
> diff --git a/libavutil/avstring.h b/libavutil/avstring.h
> index 438ef79..b3ddc3a 100644
> --- a/libavutil/avstring.h
> +++ b/libavutil/avstring.h
> @@ -22,6 +22,7 @@
>  #define AVUTIL_AVSTRING_H
>  
>  #include <stddef.h>
> +#include <stdint.h>
>  #include "attributes.h"
>  
>  /**
> @@ -296,6 +297,24 @@ int av_escape(char **dst, const char *src, const char *special_chars,
>                enum AVEscapeMode mode, int flags);
>  
>  /**
> + * Read and decode a single UTF-8 character sequence from buffer in
> + * *buf, and update *buf to point to the next byte after the parsed
> + * sequence.
> + *
> + * In case of invalid sequence, the pointer will be updated to the
> + * next byte after the invalid sequence.
> + *
> + * @param code pointer whose pointed value is updated to keep the
> + * parsed code in case of success
> + * @param left bytes left to read in the buffer. By default it won't
> + * read more than 6 characters (maximum number of bytes in a UTF-8
> + * sequence).
> + * @return >= 0 in case a sequence was successfully read, a negative
> + * value in case of invalid sequence
> + */
> +int av_utf8_decode(int32_t *code, const uint8_t **buf, size_t left);

what is the relation of this to GET_UTF8()
how should a developer choose which of the 2 to use ?
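
To make the question concrete, here is a minimal sketch of the two calling
styles side by side. Both sketch_utf8_decode() and SKETCH_GET_UTF8() are
hypothetical stand-ins written for this illustration: the first is modeled on
the function in the patch above, the second on the GET_UTF8() macro in
libavutil/common.h, and either may differ from the real code in detail. One
deliberate difference: the posted loop decrements left twice per continuation
byte (once in "!--left" and once in "left--"), which looks unintended since a
3-byte sequence with left == 3 would be rejected; the sketch counts one per
byte consumed. It also returns -1 rather than AVERROR(EINVAL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed function: bounded by `left`. */
static int sketch_utf8_decode(int32_t *code, const uint8_t **buf, size_t left)
{
    const uint8_t *p = *buf;
    uint32_t val, top;

    if (!left)
        return 0;
    val = *p++;
    left--;
    /* a lead byte of the form 10xx-xxxx, 0xFE or 0xFF is not allowed */
    if ((val & 0xc0) == 0x80 || val >= 0xFE) {
        *buf = p;
        return -1;
    }
    top = (val & 128) >> 1;
    while (val & top) {
        uint32_t tmp;
        if (!left) {                 /* sequence runs past the buffer */
            *buf = p;
            return -1;
        }
        tmp = (uint32_t)*p++ - 128;  /* strip leading 1 */
        left--;
        if (tmp >> 6) {              /* not of the form 10xx-xxxx */
            *buf = p;
            return -1;
        }
        val = (val << 6) + tmp;
        top <<= 5;
    }
    val &= (top << 1) - 1;
    *code = (int32_t)val;
    *buf = p;
    return 0;
}

/* Hypothetical stand-in for the macro style: no length argument, so the
 * caller's GET_BYTE expression has to do any bounds checking itself. */
#define SKETCH_GET_UTF8(val, GET_BYTE, ERROR)                 \
    val = (GET_BYTE);                                          \
    {                                                          \
        uint32_t top = (val & 128) >> 1;                       \
        if ((val & 0xc0) == 0x80 || val >= 0xFE)               \
            ERROR                                              \
        while (val & top) {                                    \
            uint32_t tmp = (uint32_t)(GET_BYTE) - 128;         \
            if (tmp >> 6)                                      \
                ERROR                                          \
            val = (val << 6) + tmp;                            \
            top <<= 5;                                         \
        }                                                      \
        val &= (top << 1) - 1;                                 \
    }

/* Decode the same bytes both ways; returns the number of mismatches,
 * or -1 if either decoder reports an invalid sequence. */
static int compare_decoders(void)
{
    /* "a", U+00E9 and U+20AC encoded as UTF-8 */
    static const uint8_t text[] = { 0x61, 0xC3, 0xA9, 0xE2, 0x82, 0xAC };
    const uint8_t *fp = text, *mp = text, *end = text + sizeof(text);
    int mismatches = 0;

    while (fp < end) {
        int32_t fcode = -1;
        uint32_t mcode;

        if (sketch_utf8_decode(&fcode, &fp, (size_t)(end - fp)) < 0)
            return -1;
        /* feed a 0 byte past the end so a truncated sequence fails */
        SKETCH_GET_UTF8(mcode, (mp < end ? *mp++ : 0u), return -1;)
        assert(fp <= end);
        if ((uint32_t)fcode != mcode || fp != mp)
            mismatches++;
    }
    return mismatches;
}
```

Calling compare_decoders() from a small main() should report no mismatches
for this input. The practical difference the sketch tries to show is where
the bounds check lives: in the function it is the left argument, in the macro
it is whatever the caller writes for GET_BYTE.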

also is there a performance difference ?
a benchmark might be interesting
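
A rough idea of what such a benchmark could look like, as a sketch only.
Everything here is an assumption made for illustration: the decoder and the
macro are local, hypothetical re-creations (not the libavutil ones), the
input is one repeated 3-byte character, and clock() is a coarse timer. A
static function in the same file may also be inlined, so a real measurement
should call the actual library function and macro on realistic input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical function-style decoder, bounded by `left`. */
static int sketch_utf8_decode(int32_t *code, const uint8_t **buf, size_t left)
{
    const uint8_t *p = *buf;
    uint32_t val, top;

    if (!left)
        return 0;
    val = *p++;
    left--;
    if ((val & 0xc0) == 0x80 || val >= 0xFE) { *buf = p; return -1; }
    top = (val & 128) >> 1;
    while (val & top) {
        uint32_t tmp;
        if (!left) { *buf = p; return -1; }
        tmp = (uint32_t)*p++ - 128;
        left--;
        if (tmp >> 6) { *buf = p; return -1; }
        val = (val << 6) + tmp;
        top <<= 5;
    }
    *code = (int32_t)(val & ((top << 1) - 1));
    *buf = p;
    return 0;
}

/* Hypothetical macro-style decoder modeled on GET_UTF8(). */
#define SKETCH_GET_UTF8(val, GET_BYTE, ERROR)                 \
    val = (GET_BYTE);                                          \
    {                                                          \
        uint32_t top = (val & 128) >> 1;                       \
        if ((val & 0xc0) == 0x80 || val >= 0xFE)               \
            ERROR                                              \
        while (val & top) {                                    \
            uint32_t tmp = (uint32_t)(GET_BYTE) - 128;         \
            if (tmp >> 6)                                      \
                ERROR                                          \
            val = (val << 6) + tmp;                            \
            top <<= 5;                                         \
        }                                                      \
        val &= (top << 1) - 1;                                 \
    }

/* Sum of all code points, or UINT64_MAX on an invalid sequence. */
static uint64_t sum_with_function(const uint8_t *buf, size_t size)
{
    const uint8_t *p = buf, *end = buf + size;
    uint64_t sum = 0;
    while (p < end) {
        int32_t code = 0;
        if (sketch_utf8_decode(&code, &p, (size_t)(end - p)) < 0)
            return UINT64_MAX;
        sum += (uint32_t)code;
    }
    return sum;
}

static uint64_t sum_with_macro(const uint8_t *buf, size_t size)
{
    const uint8_t *p = buf, *end = buf + size;
    uint64_t sum = 0;
    while (p < end) {
        uint32_t code;
        SKETCH_GET_UTF8(code, (p < end ? *p++ : 0u), return UINT64_MAX;)
        sum += code;
    }
    return sum;
}

/* Time `rounds` passes of fn over buf; the checksum keeps the work alive. */
static double seconds_for(uint64_t (*fn)(const uint8_t *, size_t),
                          const uint8_t *buf, size_t size, int rounds,
                          uint64_t *checksum)
{
    clock_t start = clock();
    uint64_t sum = 0;
    for (int i = 0; i < rounds; i++)
        sum += fn(buf, size);
    *checksum = sum;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Returns 0 if both decoders produced the same checksum, -1 otherwise. */
static int run_benchmark(int rounds)
{
    static uint8_t buf[3 * 4096];
    uint64_t fsum, msum;
    double ft, mt;

    assert(rounds > 0);
    /* fill the buffer with repeated U+20AC (E2 82 AC) */
    for (size_t i = 0; i < sizeof(buf); i += 3) {
        buf[i] = 0xE2; buf[i + 1] = 0x82; buf[i + 2] = 0xAC;
    }
    ft = seconds_for(sum_with_function, buf, sizeof(buf), rounds, &fsum);
    mt = seconds_for(sum_with_macro, buf, sizeof(buf), rounds, &msum);
    printf("function: %.3f s, macro: %.3f s\n", ft, mt);
    return fsum == msum ? 0 : -1;
}
```

Something like run_benchmark(10000) called from main() prints the two
timings. The checksum comparison is only a sanity check that both paths
decoded the same code points.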

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates