[FFmpeg-devel] [PATCH 2/4] avcodec/samidec: use ff_htmlmarkup_to_ass()

Fri Aug 28 08:05:11 CEST 2015

Thank you for the suggestion.. It took me a while to get it but I
think it looks much better now.

Please let me know what you think!

Cheers,

On Sat, Aug 15, 2015 at 11:24 AM, Clément Bœsch <u at pkh.me> wrote:
> On Sat, Aug 08, 2015 at 09:24:03PM -0700, Yayoi Ukai wrote:
>> On Sat, Aug 8, 2015 at 8:23 AM, Clément Bœsch <u at pkh.me> wrote:
>> > On Fri, Aug 07, 2015 at 11:03:29PM -0700, Yayoi wrote:
>> >> ---
>> >>  libavcodec/samidec.c    | 59 +++++++++++++++++++++++++++++--------------------
>> >>  tests/ref/fate/sub-sami | 18 +++++++--------
>> >>  2 files changed, 44 insertions(+), 33 deletions(-)
>> >>
>> >> diff --git a/libavcodec/samidec.c b/libavcodec/samidec.c
>> >> index 47850e2..41826a7 100644
>> >> --- a/libavcodec/samidec.c
>> >> +++ b/libavcodec/samidec.c
>> >> @@ -27,6 +27,7 @@
>> >>  #include "ass.h"
>> >>  #include "libavutil/avstring.h"
>> >>  #include "libavutil/bprint.h"
>> >> +#include "htmlsubtitles.h"
>> >>
>> >>  typedef struct {
>> >>      AVBPrint source;
>> >> @@ -41,11 +42,12 @@ static int sami_paragraph_to_ass(AVCodecContext *avctx, const char *src)
>> >>      char *tag = NULL;
>> >>      char *dupsrc = av_strdup(src);
>> >>      char *p = dupsrc;
>> >> +    char *pcopy = NULL;
>> >> +    int index = 0;
>> >> +    int second_paragraph = 0;
>> >>
>> >> -    av_bprint_clear(&sami->content);
>> >>      for (;;) {
>> >>          char *saveptr = NULL;
>> >> -        int prev_chr_is_space = 0;
>> >>          AVBPrint *dst = &sami->content;
>> >>
>> >>          /* parse & extract paragraph tag */
>> >> @@ -77,37 +79,46 @@ static int sami_paragraph_to_ass(AVCodecContext *avctx, const char *src)
>> >>              goto end;
>> >>          }
>> >>
>> >
>> >> -        /* extract the text, stripping most of the tags */
>> >> +        /* check for the second paragrph */
>> >
>> > Why change the comment? What does "check" mean here? What is the "second
>> > paragraph"?
>>
>>
>> I answer it below with the other questions you have because you are
>> basically asking the same things.
>> And I will document better too.
>>
>> >
>> >> +        pcopy = av_strdup(p);
>> >>          while (*p) {
>> >>              if (*p == '<') {
>> >> -                if (!av_strncasecmp(p, "<P", 2) && (p[2] == '>' || av_isspace(p[2])))
>> >> +                if (!av_strncasecmp(p, "<p", 2) && (p[2] == '>' || av_isspace(p[2]))) {
>> >> +                    second_paragraph = 1;
>> >>                      break;
>> >> -                if (!av_strncasecmp(p, "<BR", 3))
>> >> -                    av_bprintf(dst, "\\N");
>> >> -                p++;
>> >> -                while (*p && *p != '>')
>> >> -                    p++;
>> >> -                if (!*p)
>> >> -                    break;
>> >> -                if (*p == '>')
>> >> -                    p++;
>> >> -                continue;
>> >> +                }
>> >>              }
>> >> -            if (!av_isspace(*p))
>> >> -                av_bprint_chars(dst, *p, 1);
>> >> -            else if (!prev_chr_is_space)
>> >> -                av_bprint_chars(dst, ' ', 1);
>> >> -            prev_chr_is_space = av_isspace(*p);
>> >>              p++;
>> >> +            index++;
>> >> +        }
>> >> +        p = p - index;
>> >> +        if (second_paragraph) {
>> >> +            p[index] = 0;
>> >>          }
>> >> -    }
>> >>
>> >> -    av_bprint_clear(&sami->full);
>> >> -    if (sami->source.len)
>> >> -        av_bprintf(&sami->full, "{\\i1}%s{\\i0}\\N", sami->source.str);
>> >> -    av_bprintf(&sami->full, "%s", sami->content.str);
>> >> +        ff_htmlmarkup_to_ass(avctx, dst, p);
>> >> +
>> >> +        /* add the source if there are any. */
>> >> +        av_bprint_clear(&sami->full);
>> >> +        if (sami->source.len) {
>> >> +            av_bprintf(&sami->full, "{\\i1}%s{\\i0}\\N", sami->source.str);
>> >> +            av_bprintf(&sami->full, "%s", sami->content.str);
>> >> +            if (second_paragraph) {
>> >> +                second_paragraph = 0;
>> >> +                p = pcopy;
>> >> +                p += index;
>> >> +                index = 0;
>> >> +                continue;
>> >> +            }
>> >> +        } else {
>> >> +            av_bprintf(&sami->full, "%s", sami->content.str);
>> >> +        }
>> >> +        av_bprint_clear(&sami->content);
>> >> +
>> >> +    }
>> >>
>> >
>> > This looks clumsy at best: you are finalizing the subtitle event inside
>> > the paragraph loop when it should be outside. It also seems there is a
>> > duplicating "second paragraph" logic even though the loop is supposed to
>> > handle one paragraph at a time.
>>
>> I know. It is clumsy.. I explain it below as well.
>>
>> >
>> > If you are uncomfortable with the current logic or believe it's badly
>> > designed for the logic you're trying to mangle, feel free to rewrite it;
>> > it's better than trying to inhibit the old behaviour with hacks.
>>
>> It's not that your code was badly designed to begin with. I feel like
>> it is the nature
>> of this format.
>>
>> So let me explain what was going on your old code and what I had changed.
>> Here is the example: (This one is from current fate sample.)
>>
>> <SYNC Start=73000>
>>       <P Class=ENUSCC ID=Source>End of:
>>       <P Class=ENUSCC>President John F. Kennedy Speech
>>
>> So this is one subtitle event and there are two paragraphs. One
>> paragraph has a new source and
>> the other paragraph has its content.
>> Since it is decided that this code uses html
>> parser(ff_htmlmarkup_to_ass) and use it to parse the contents of the
>> paragraph. Each paragraph content needs to go to to html parser,
>> regardless whether it is source or subtitle content.
>>
>> So I need to tell html_parser, ff_htmlmarkup_to_ass that where the
>> destination is.. (whether it is source or content).
>> I can't no longer specify the destination which I was able to do in
>> your old code. I mean maybe I can change the function header and add
>> additional flag to tell our parser where to put in the dst but I don't
>> want to because it may require to change srtdec and I do not want to
>> do that..
>>
>> Anyways, so that's the basic idea. I honestly think it is good that
>> subtitle event handle is also in loop because one loop iteration per
>> one subtitle event and subtitle event does not guarantee that it
>> contains only one paragraph to begin with. It was a possibility either
>> adding additional loop or having goto to honor your original logic but
>> after I noticed that subtitle event contains more than one paragraph,
>> I gave up on the idea to keep your logic. The comment says "check
>> second paragraph" but it should be able to handle multiple paragraphs
>> more than 2 in one subtitle event. (I definitely tested while I was
>> working on this part.. I can modify sample or attach test file if that
>> makes you feel happier. And I make a better comment then.)
>>
>> So I want to leave it as it is if you do not see major flaws or you do
>> not have very very strong opinion about how it needs to be modified.
>> Well, given the information, if you have some ideas, you can tell me
>> too. Besides, I can't cross out the possibility if you or me or
>> someone wants to add more enhancement (maybe one day you decided to
>> add support for img tag? who knows..) , and it may need to change code
>> structure again. And as it is now, it is still enhancement that works.
>> ;-).
>>
>
> I think it's breakable one way or another without that much effort.
>
> Here is a suggestion based on the original code:
>
>  - make sure you have 5 bprint buffers: source_sami, content_sami,
>    source_encoded, content_encoded, full (you already have 3 of them in
>    the context)
>
>  - remove the current markup processing from the loop and make sure the
>    loop just adds the content of each paragraph where it belongs (that is,
>    in either source_sami or content_sami). The code already does that, so
>    there is probably no change to do (code already concats in each bprint
>    buffer if more than 2 paragraphs happen).
>
>  - after the loop ends, you now have the SAMI data of your packet
>    split into content_sami and eventually in source_sami, without the SAMI
>    crap (such as <P ...>). So what you need to do is just call the markup
>    on content_sami (and eventually source_sami if there is data) to create
>    content_encoded (and eventually source_encoded).
>
>  - the existing code will then finish things up by concatenating source &
>    content as it is already doing
>
> This is probably way more solid than trying to interlace another partial
> sub-loop in a loop of the same purpose like you proposed.
>
> If you do not want to do that, I will have to push your version and do the
> above or similar before pushing the whole set.
>
> [...]
>
> --
> Clément B.
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>