[FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame for subtitle handling

Michael Niedermayer michael at niedermayer.cc
Sun Dec 12 23:26:24 EET 2021


On Sun, Dec 12, 2021 at 02:21:42AM +0000, Soft Works wrote:
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Daniel
> > Cantarín
> > Sent: Sunday, December 12, 2021 12:39 AM
> > To: ffmpeg-devel at ffmpeg.org
> > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> > for subtitle handling
> > 
> >  > One of the important points to understand is that - in case of subtitles,
> >  > the AVFrame IS NOT the subtitle event. The subtitle event is actually
> >  > a different and separate entity. (...)
> > 
> > 
> > Wouldn't it qualify then as a different abstraction?
> > 
> > I mean: instead of avframe.subtitle_property, perhaps something along the
> > lines of avframe.some_property_used_for_linked_abstractions, which in
> > turn lets you access some proper Subtitle abstraction instance.
> > 
> > That way, devs would not need to defend AVFrame, and Subtitle could
> > have whatever properties needed.
> > 
> > I see there's AVSubtitle, as you mention:
> > https://ffmpeg.org/doxygen/trunk/structAVSubtitle.html
> > 
> > Isn't it less socially problematic to just link an instance of AVSubtitle,
> > instead of adding a subtitle timing property to AVFrame?
> > IIUC, that AVSubtitle instance could live in filter context, and be linked
> > by the filter doing the heartbeat frames.
> > 
> > Please note I'm not saying the property is wrong, or even that I understand
> > the best way to deal with it, but that I recognize some social problem here.
> > Devs don't like that property, that's a fact. And, technical or not, that
> > seems to be a problem.
> > 
> >  > (...)
> >  > The chairs are obviously AVFrames. They need to be numbered monotonically
> >  > increasing - that's the frame.pts. Without increasing numbering, the
> >  > transport would get stuck. We are filling the chairs with copies
> >  > of the most recent subtitle event, so an AVSubtitle could be repeated,
> >  > for example, 5 times. It's always the exact same AVSubtitle event
> >  > sitting in those 5 chairs. The subtitle event always has the same
> >  > start time (subtitle_pts), but each frame has a different pts.
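> > 
> > If I read that model right, in code it would look roughly like this
> > (a sketch; the names sub_frame, n_heartbeats and push_downstream are
> > mine, not the patchset's API):
> > 
> >     // The same subtitle event is duplicated into N "chair" frames:
> >     // every copy gets a new, increasing frame pts (the transport
> >     // time), while subtitle_pts (the event's start time) stays put.
> >     for (int i = 0; i < n_heartbeats; i++) {
> >         AVFrame *copy = av_frame_clone(sub_frame); // same event payload
> >         copy->pts = ++last_pts;                    // transport advances
> >         // copy->subtitle_pts is unchanged from sub_frame
> >         push_downstream(copy);                     // hypothetical helper
> >     }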
> > 
> > I can see AVSubtitle has a "start_display_time" property, as well as a
> > "pts" property "in AV_TIME_BASE":
> > 
> > https://ffmpeg.org/doxygen/trunk/structAVSubtitle.html#af7cc390bba4f9d6c32e391ca59d117a2
> > 
> > Is it too much trouble to reuse that while persisting an AVSubtitle instance
> > in filter context? I guess it could even be used in decoder context.
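> > 
> > For instance, the absolute display window can already be derived from
> > those fields (a rough sketch; start/end_display_time are in ms relative
> > to the AVSubtitle's pts, which is in AV_TIME_BASE units):
> > 
> >     #include <libavutil/mathematics.h> // av_rescale_q, AV_TIME_BASE_Q
> > 
> >     // absolute start/end of the display window, in AV_TIME_BASE units
> >     int64_t start = sub->pts + av_rescale_q(sub->start_display_time,
> >                                             (AVRational){1, 1000},
> >                                             AV_TIME_BASE_Q);
> >     int64_t end   = sub->pts + av_rescale_q(sub->end_display_time,
> >                                             (AVRational){1, 1000},
> >                                             AV_TIME_BASE_Q);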
> > 
> > I also see a quirky property in AVFrame: "best_effort_timestamp":
> > https://ffmpeg.org/doxygen/trunk/structAVFrame.html#a0943e85eb624c2191490862ececd319d
> > Perhaps some of the "various heuristics" it claims to have could be
> > extended, this time relating to a linked AVSubtitle, so that an extra
> > property is not needed?
> > 
> > 
> >  > (...)
> >  > Considering the relation between AVFrame and subtitle event as laid out
> >  > above, it should be apparent that there's no guarantee for a certain
> >  > kind of relation between the subtitle_pts and the pts of the frame
> >  > that carries it. Such a relation _can_ exist, but doesn't necessarily.
> >  > It can easily happen that the frame pts is just increased by 1
> >  > on subsequent frames. The time_base may change from filter to filter
> >  > and may be oriented towards the transport of the subtitle events, which
> >  > might have nothing to do with the subtitle display time at all.
> > 
> > This confuses me.
> > I understand the difference between filler frame pts and subtitle pts.
> > That's ok.
> > But if the transport timebase changes, I understand the subtitle pts also
> > changes.
> > 
> > I mean: "transport timebase" means "video timebase", and if subs are synced
> > to video, then that sync needs to be maintained. If subs are synced, then
> > their timing is never independent. And if they're not synced, then their
> > AVFrames are independent from video frames and thus don't need any extra
> > property.
> > 
> > Here's what I do right now with the filler frames. I'm talking about current
> > ffmpeg with no subs frames in lavfi, and real-time conversion from dvbsub
> > to WEBVTT using OCR. What I do is quite dirty:
> >    - Change FPS to a low value, let's say 1.
> >    - Apply OCR to the dvbsub, using vf_ocr.
> >    - Read the metadata downstream and write VTT to file or pipe output
> >      (sketched below).
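> > 
> > That last step is roughly this (a sketch; WebVTT header and error
> > handling omitted; vf_ocr stores its result in the "lavfi.ocr.text"
> > frame metadata entry):
> > 
> >     #include <math.h>
> >     #include <stdio.h>
> >     #include <libavutil/dict.h>
> >     #include <libavutil/frame.h>
> > 
> >     static void write_cue(FILE *out, const AVFrame *frame, AVRational tb)
> >     {
> >         AVDictionaryEntry *e =
> >             av_dict_get(frame->metadata, "lavfi.ocr.text", NULL, 0);
> >         if (!e || !e->value[0])
> >             return;                         // empty line: no active text
> >         double t = frame->pts * av_q2d(tb); // at fps=1: one cue per second
> >         fprintf(out, "%02d:%06.3f --> %02d:%06.3f\n%s\n\n",
> >                 (int)t / 60, fmod(t, 60.0),
> >                 (int)(t + 1) / 60, fmod(t + 1, 60.0), e->value);
> >     }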
> > 
> > As there's no sub frame capability in lavfi, I can't use the vtt encoder
> > downstream. Therefore, the output is raw C string and file manipulation.
> > And since I set the FPS to 1 first, I get 1 line per second, no matter the
> > timestamp of the subs, the video, or the filler frame. The point then is
> > to check for text diffs instead of pts to detect the frame's nature. And I
> > can even naively just write the frame's pts once per second with the same
> > text, and empty lines when there's no text, without caring about the
> > frame's nature (filler or not).
> > 
> > There's a similar behaviour when dealing with CEA-608: I need to check
> > text differences instead of any pts, as the inner workings of these
> > captions are more related to video than to subs. I assume in my filters
> > that the frame PTS is correct.
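> > 
> > In code, that check is roughly the following (emit_cue and the state
> > struct s are hypothetical):
> > 
> >     // needs <string.h>; s->last_text is filter state, initialized to ""
> >     AVDictionaryEntry *e =
> >         av_dict_get(frame->metadata, "lavfi.ocr.text", NULL, 0);
> >     const char *text = e ? e->value : "";
> >     if (strcmp(text, s->last_text)) {  // text changed: a new cue starts
> >         emit_cue(s, text, frame->pts); // hypothetical helper
> >         av_freep(&s->last_text);
> >         s->last_text = av_strdup(text);
> >     }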
> > 
> > I understand the idea behind PTS, I get that there's also DTS, and so I
> > can accept that there could be a use case where another timing is needed.
> > But I still don't see the need for this particular extra timing, as the
> > distance between subtitle_pts and filler.pts does not mean something like
> > "now clear the current subtitle line" downstream. What will happen if
> > there's no subtitle_pts is that the same line will still be active, and
> > will only change when there's an actual subtitle difference. So, I believe
> > this value is more theoretically useful than factual.
> > 
> > I understand that there are subs formats that need precise start and end
> > timing, but I fail to see the case where that timing avoids the need for
> > text-difference checking, be it in a filter or an encoder. And if filters
> > or encoders naively use PTS, then the filler frames would not break
> > anything: they would just show the same text line repeatedly, at the
> > current FPS speed. And if the sparseness problem is finally solved by your
> > logic somehow, and there's no need for filler frames, then there's also no
> > need for subtitle_pts, as pts alone would be fine.
> > 
> > So, I'm confused, given that you state that this property is very important.
> > Would you please tell us some actual, non-theoretical use case for it?
> > 
> > 
> >  >
> >  > Also, subtitle events are sometimes duplicated. When we would convert
> >  > the subtitle_pts to the time_base that is negotiated between two filters,
> >  > then it could happen that multiple copies of a single subtitle event have
> >  > different subtitle_pts values.
> >  >
> > 
> > If it's repeated, doesn't it have different pts?
> > I get repeated lines from time to time. But they have slightly different
> > PTS.
> > 
> > "Repeated event" != "same event".
> > If you check for repeated events, then you're doing some extra checking,
> > as I noted with the "text difference checks" in previous paragraphs, so
> > PTS is not ruling all the logic. Otherwise, in the worst-case scenario you
> > get the same PTS twice, which will discard some frame; and in the most
> > likely scenario, you get two identical frames with different PTS, which
> > actually changes nothing in the viewer's experience.
> > 
> >  >
> >  > Besides that, there are practical considerations: the subtitle_pts
> >  > is almost nowhere needed in any time_base other than AV_TIME_BASE_Q.
> >  >
> >  > All decoders expect it to be like this, as do all encoders and all
> >  > filters. Conversion would need to happen all over the place.
> >  > Every filter would need to take care of rescaling the subtitle_pts
> >  > value (when the time_base differs between in and out).
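> >  >
> >  > For illustration, the boilerplate each filter would need if
> >  > subtitle_pts were carried in the link's time_base instead would be
> >  > something like (a sketch):
> >  >
> >  >     // hypothetically, in each filter's filter_frame():
> >  >     if (av_cmp_q(inlink->time_base, outlink->time_base))
> >  >         frame->subtitle_pts = av_rescale_q(frame->subtitle_pts,
> >  >                                            inlink->time_base,
> >  >                                            outlink->time_base);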
> >  >
> > 
> > I'm not well versed enough in ffmpeg/libav to understand that.
> > But I'll tell you what. Do you think it's possible for you to do some
> > practical test?
> > I mean this:
> >    - Take some short video example with dvbsubs (or whatever graphical format).
> >    - Apply graphicsub2text, converting to webvtt, srt, or something.
> >    - Do the same, but taking away subtitle_pts from AVFrame.
> > 
> > Let's compare both text outputs.
> > I propose text because it's easier to share. But if you think of any other
> > practical example like this, it's also welcome. The point is to understand
> > the relevance of subtitle_pts by looking at the problem of not having it.
> > 
> > If there's no big deal, then screw it: you take it away, devs get pleased,
> > and everybody in the world gets the blessing of having subtitle frames in
> > lavfi. If there's some big deal, then the devs should understand.
> 
> I'm afraid, the only reply that I have to this is:
> 
> - Take my patchset 
> - Remove subtitle_pts
> - Get everything working
>   (all example command lines in filters.texi)
> 
> => THEN start talking
> 
> The same goes out to everybody else who keeps saying it can be 
> removed and that it's an unnecessary duplication.

Maybe some ASCII art representation would make things clearer?
So that everyone would better understand the problem, what people
want, what the patch set does, and why.
I didn't review this set, so I will likely fail to capture it fully,
but here's a start.

So something like this:
There's an AVFrame which represents a timespan from its pts
to its pts + duration. And there's an underlying subtitle event
which overlaps with that period and generally is longer.

Do I understand it correctly that:
subtitle event:  A---------|
                        B------|   C----|
Frames:
               F0--F1--F2--F3--F4--F5--F6--F7
                
F0,F1 contain A with A's subtitle_pts
F2    contains A and B with A's or B's subtitle_pts (it's not able to contain both)
F3    contains B with B's subtitle_pts
F4    is an empty heartbeat frame of some sort
F5,F6 contain C with its subtitle_pts
all subtitle_pts are relative to the AVFrame pts in a fixed timebase

What I would very naively have expected if someone had asked me a year ago:
(we enforce a maximum duration to generate heartbeats, and we introduce new
 AVFrames every time there's an end or start of a new event, so that the
 AVFrame pts alone can represent all times)

subtitle event:  A---------|
                        B------|   C----|
Frames:
                 F0--F1-F2-F3--F4--F5--F6-F7

and with this the AVFrame pts would basically be the subtitle_pts,
but this runs into a problem, I think:
if we consider a subtitle event which has effects at a granularity finer
than the AVFrame spacing, the renderer needs to know where in the original
subtitle event the AVFrame is placed, otherwise that effect will not be
rendered correctly. Here we need the time between the event and the
AVFrame.pts. I would have tried to put that in side data. It's the time
at which the event was cut by the AVFrame, so that a renderer seeing just
this AVFrame and no other knows which part (timewise) of the event in
AVFrame.data[] it needs to render into the AVFrame's pts + duration.
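
For illustration, such side data might look like this (the type name
AV_FRAME_DATA_SUBTITLE_OFFSET and the struct are hypothetical, not an
existing API):

    // needs <string.h> and <libavutil/frame.h>
    // hypothetical side data: the frame's offset into the underlying
    // subtitle event, so a renderer seeing only this AVFrame knows which
    // time slice of the event in AVFrame.data[] to render
    typedef struct SubtitleOffset {
        int64_t    offset; // AVFrame.pts minus the event's start time
        AVRational tb;     // time base of that offset
    } SubtitleOffset;

    SubtitleOffset offs = { frame->pts - event_start_pts, time_base };
    AVFrameSideData *sd =
        av_frame_new_side_data(frame, AV_FRAME_DATA_SUBTITLE_OFFSET,
                               sizeof(offs));
    if (sd)
        memcpy(sd->data, &offs, sizeof(offs));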

Likely I am missing many things and I am wrong about many things, but it
surprises me that in my thought-experimental design above a similar second
time/duration arises.
I wonder if side data would be a nicer place for this than a
subtitle_pts in AVFrame?

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

If a bugfix only changes things apparently unrelated to the bug with no
further explanation, that is a good sign that the bugfix is wrong.