[FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame for subtitle handling

Daniel Cantarín canta at canta.com.ar
Sun Dec 12 01:38:30 EET 2021


 > One of the important points to understand is that - in case of subtitles,
 > the AVFrame IS NOT the subtitle event. The subtitle event is actually
 > a different and separate entity. (...)


Wouldn't it qualify then as a different abstraction?

I mean: instead of avframe.subtitle_property, perhaps something along the
lines of avframe.some_property_used_for_linked_abstractions, which in
turn lets you access some proper Subtitle abstraction instance.

That way, devs would not need to defend AVFrame, and Subtitle could
have whatever properties it needs.

I see there's AVSubtitle, as you mention:
https://ffmpeg.org/doxygen/trunk/structAVSubtitle.html

Isn't it less socially problematic to just link an instance of AVSubtitle,
instead of adding a subtitle timing property to AVFrame?
IIUC, that AVSubtitle instance could live in filter context, and be linked
by the filter doing the heartbeat frames.
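
Just to sketch what I mean (purely hypothetical; SubHeartbeatContext and
the way a frame would point at it are made up, only AVSubtitle and AVClass
are real types):

    #include <libavcodec/avcodec.h>   // AVSubtitle
    #include <libavutil/log.h>        // AVClass

    // Hypothetical context of the filter doing the heartbeat frames: it
    // owns the most recent subtitle event, instead of AVFrame growing
    // subtitle-specific timing fields of its own.
    typedef struct SubHeartbeatContext {
        const AVClass *class;
        AVSubtitle     last_sub;   // most recent decoded event
        int            have_sub;   // whether last_sub currently holds one
    } SubHeartbeatContext;

    // The filler frames emitted by the filter would then only need some
    // generic way to reference that instance (the "linked abstraction"
    // property mentioned above).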

Please note I'm not saying the property is wrong, or even that I understand
the best way to deal with it, but that I recognize some social problem here.
Devs don't like that property, that's a fact. And technical or not, it seems
to be a problem.

 > (...)
 > The chairs are obviously AVFrames. They need to be numbered monotonically
 > increasing - that's the frame.pts. without increasing numbering the
 > transport would get stuck. We are filling the chairs with copies
 > of the most recent subtitle event, so an AVSubtitle could be repeated
 > like for example 5 times. It's always the exact same AVSubtitle event
 > sitting in those 5 chairs. The subtitle event has always the same start
 > time (subtitle_pts) but each frame has a different pts.

I can see AVSubtitle has a "start_display_time" property, as well as a
"pts" property "in AV_TIME_BASE":

https://ffmpeg.org/doxygen/trunk/structAVSubtitle.html#af7cc390bba4f9d6c32e391ca59d117a2

Is it too much trouble to reuse that while persisting an AVSubtitle instance
in filter context? I guess it could even be used in decoder context.
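
For reference, the timing-related fields there (paraphrasing the doxygen
page / avcodec.h; the comments are mine):

    typedef struct AVSubtitle {
        uint16_t format;               // 0 = graphics
        uint32_t start_display_time;   // relative to packet pts, in ms
        uint32_t end_display_time;     // relative to packet pts, in ms
        unsigned num_rects;
        AVSubtitleRect **rects;
        int64_t  pts;                  // same as packet pts, in AV_TIME_BASE
    } AVSubtitle;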

I also see a quirky property in AVFrame: "best_effort_timestamp"
https://ffmpeg.org/doxygen/trunk/structAVFrame.html#a0943e85eb624c2191490862ececd319d
Perhaps the "various heuristics" it claims to apply could be extended with
one that looks at a linked AVSubtitle, so an extra property is not needed?


 > (...)
 > Considering the relation between AVFrame and subtitle event as laid out
 > above, it should be apparent that there's no guarantee for a certain
 > kind of relation between the subtitle_pts and the frame's pts who
 > is carrying it. Such relation _can_ exist, but doesn't necessarily.
 > It can easily be possible that the frame pts is just increased by 1
 > on subsequent frames. The time_base may change from filter to filter
 > and can be oriented on the transport of the subtitle events which
 > might have nothing to do with the subtitle display time at all.

This confuses me.
I understand the difference between filler frame pts and subtitle pts.
That's ok.
But if the transport timebase changes, I understand the subtitle pts also changes.

I mean: "transport timebase" means "video timebase", and if subs are synced
to video, then that sync needs to be maintained. If subs are synced, then
their timing is never independent. And if they're not synced, then their
AVFrames are independent from video frames, and thus don't need any extra prop.

Here's what I do right now with the filler frames. I'm talking about current
ffmpeg with no subs frames in lavfi, and real-time conversion from dvbsub
to WEBVTT using OCR. Quite dirty stuff, what I do:
   - Change FPS to a low value, let's say 1.
   - Apply OCR to the dvb sub, using vf_ocr.
   - Read the metadata downstream, and write vtt to file or pipe output.

As there's no sub frame capability in lavfi, I can't use the vtt encoder
downstream. Therefore, the output is raw C string and file manipulation.
And given that I first set the FPS to 1, I get 1 line per second, no matter
the timestamp of either subs or video or filler frame. The point then is to
check for text diffs instead of pts to detect the frame nature. And I can
even naively just put the frame's pts once per second with the same text,
and with empty lines when there's no text, without caring about the frame
nature (filler or not).
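
The downstream step is roughly this (a minimal sketch; write_vtt_cue() is
a placeholder for my own file/pipe output code, and "lavfi.ocr.text" is the
metadata key that vf_ocr attaches to the frames):

    #include <string.h>
    #include <libavutil/avstring.h>
    #include <libavutil/dict.h>
    #include <libavutil/frame.h>

    void write_vtt_cue(int64_t pts, const char *text);   // placeholder

    static char prev_text[1024];

    static void on_frame(const AVFrame *frame)
    {
        AVDictionaryEntry *e =
            av_dict_get(frame->metadata, "lavfi.ocr.text", NULL, 0);
        const char *text = e ? e->value : "";

        // Emit a cue only when the OCR'd text actually changes; filler
        // frames carrying the same text are simply ignored.
        if (strcmp(text, prev_text)) {
            write_vtt_cue(frame->pts, text);
            av_strlcpy(prev_text, text, sizeof(prev_text));
        }
    }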

There's a similar behaviour when dealing with CEA-608: I need to check text
differences instead of any pts, as the inner workings of these captions are
more related to video than to subs. I assume in my filters that the frame
PTS is correct.

I understand the idea behind PTS, I get that there's also DTS, and so I can
accept that there could be a use case where another timing is needed. But I
still don't see the need for this particular extra timing, as the distance
between subtitle_pts and filler.pts does not mean downstream something like
"now clear the current subtitle line". What will happen if there's no
subtitle_pts is that the same line will still be active, and it will only
change when there's an actual subtitle difference. So, I believe this value
is more theoretically useful than factual.

I understand that there are subs formats that need precise start and end
timing, but I fail to see the case where that timing avoids the need for
text difference checking, be it in a filter or an encoder. And if filters or
encoders naively use PTS, then the filler frames would not break anything:
they will just show the same text line repeatedly, at the current FPS speed.
And if the sparseness problem is finally solved by your logic somehow, and
there's no need for filler frames, then there's also no need for
subtitle_pts, as pts would actually be fine.

So, I'm confused, given that you describe this property as very important.
Would you please tell us some actual, non-theoretical use case for the prop?


 >
 > Also, subtitle events are sometimes duplicated. When we would convert
 > the subtitle_pts to the time_base that is negotiated between two filters,
 > then it could happen that multiple copies of a single subtitle event have
 > different subtitle_pts values.
 >

If it's repeated, doesn't it have a different pts?
I get repeated lines from time to time. But they have slightly different PTS.

"Repeated event" != "same event".
If you check for repeated events, then you're doing some extra checking,
as I pointed out with the "text difference checks" in previous paragraphs,
and so PTS is not ruling all the logic. Otherwise, in the worst case
scenario you get the same PTS twice, which will discard some frame. And in
the most likely scenario, you get two identical frames with different PTS,
which actually changes nothing in the viewer's experience.
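
For what it's worth, I assume the conversion you mean is av_rescale_q();
a quick sketch of why a round trip does not give back the exact value
(the 1/90 time_base is just a made-up coarse example):

    #include <libavutil/avutil.h>
    #include <libavutil/mathematics.h>

    static void rescale_example(void)
    {
        int64_t    sub_pts = 1234567;               // in AV_TIME_BASE (us)
        AVRational tb      = (AVRational){ 1, 90 }; // made-up link time_base

        // Converting to the link time_base rounds, so converting back
        // does not reproduce the original value:
        int64_t in_tb = av_rescale_q(sub_pts, AV_TIME_BASE_Q, tb); // -> 111
        int64_t back  = av_rescale_q(in_tb, tb, AV_TIME_BASE_Q);   // -> 1233333
        (void)back;
    }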

 >
 > Besides that, there are practical considerations: The subtitle_pts
 > is almost nowhere needed in any other time_base than AV_TIMEBASE_Q.
 >
 > All decoders expect it to be like this, all encoders and all filters.
 > Conversion would need to happen all over the place.
 > Every filter would need to take care of rescaling the subtitle_pts
 > value (when time_base is different between in and out).
 >

I'm not well versed enough in ffmpeg/libav to understand all of that.
But I tell you what: do you think it's possible for you to do some practical
test? I mean this:
   - Take some short video example with dvbsubs (or whatever graphical format).
   - Apply graphicsub2text, converting to webvtt, srt, or something.
   - Do the same, but taking subtitle_pts away from AVFrame.

Let's compare both text outputs.
I propose text because it is easier to share. But if you think of any other
practical example like this, it's also welcome. The point is to understand
the relevance of subtitle_pts by looking at the problem of not having it.

If there's no big deal, then screw it: you take it away, devs get pleased,
and everybody in the world gets the blessing of having subtitle frames in
lavfi. If there's some big deal, then the devs should understand.



