[FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame for subtitle handling

Soft Works softworkz at hotmail.com
Mon Dec 6 01:23:11 EET 2021



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Soft Works
> Sent: Sunday, December 5, 2021 6:58 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
> > Sent: Sunday, December 5, 2021 5:40 PM
> > To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> > for subtitle handling
> >
> > 5 Dec 2021, 17:23 by softworkz at hotmail.com:
> >
> > > @@ -491,6 +499,39 @@ typedef struct AVFrame {
> > >  */
> > >  uint64_t channel_layout;
> > >
> > > +    /**
> > > +     * Display start time, relative to packet pts, in ms.
> > > +     */
> > > +    uint32_t subtitle_start_time;
> > > +
> > > +    /**
> > > +     * Display end time, relative to packet pts, in ms.
> > > +     */
> > > +    uint32_t subtitle_end_time;
> > >
> >
> > Milliseconds? Our entire API's based around timestamps
> > with time bases. Plus, we all know what happened when
> > Matroska settled onto milliseconds and ruined a perfectly
> > complex but good container.
> > Make this relative to the PTS field, with the same timebase
> > as the PTS field.
> > There's even a new AVFrame->time_base field for you to
> > set so you wouldn't forget it.
> 
> The internal format for text subtitles is ASS, and this uses
> a timebase of milliseconds.
> 
> All existing decoders and encoders use this, and I'm afraid
> I will not go and change them all.
> 
> > > +    /**
> > > +     * Number of items in the @ref subtitle_areas array.
> > > +     */
> > > +    unsigned num_subtitle_areas;
> > > +
> > > +    /**
> > > +     * Array of subtitle areas, may be empty.
> > > +     */
> > > +    AVSubtitleArea **subtitle_areas;
> > >
> >
> > There's no reason why this cannot be handled using the buffer
> > and data fields. If you need more space there, you're free to bump
> > the maximum number of pointers, plus this removes the horrid
> > malloc(1) hack. We've discussed this, and I couldn't follow why
> > this was bad in the email discussion you linked.
> 
> There are reasons. I mentioned some of them in an earlier
> discussion with Hendrik.
> The effort alone to relate the buffers to subtitle areas (which area
> 'owns' which buffer) is not reasonable. There are too many cases to
> consider: what happens when there are 3 areas and the second area
> doesn't have a buffer? The convention is that the buffers should be
> used contiguously.
> Managing those relations is error-prone and would require a lot of code.
> 
> > > +    /**
> > > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > > +     *
> > > +     * @deprecated This is kept for compatibility reasons and
> corresponds
> > to
> > > +     * AVSubtitle->pts. Might be removed in the future.
> > > +     */
> > > +    int64_t subtitle_pts;
> > >
> >
> > I'm not going to accept a field which is instantly deprecated.
> > As we've discussed multiple times, please merge this into
> > the regular frame PTS field. We already have _2_ necessary
> > start/end fields.
> 
> --
> 
> > I agree with this entirely. Even ignoring the fact that adding a new
> > field that's deprecated is instantly a disqualification, AVSubtitle had
> > one pts field, AVFrame already has one pts field - both are even
> > documented to have the same semantic. They should just contain the
> > exact same data; that's how you achieve compatibility, not by claiming
> > you need a new field for compatibility reasons.
> >
> > - Hendrik
> 
> I think the mistake is to declare subtitle_pts as deprecated. I had
> added the deprecation at a very early point in time, when I still
> thought that it could be eliminated.
> 
> Even though we are driving subtitle data through the graph attached
> to AVFrame, the behavior involved is very different from audio and
> video frames. Actually, there is not one but many different ways in
> which subtitle data can appear in a source and travel through a
> filtergraph:
> 
> - Sometimes, subtitle events are muxed into a stream many seconds
>   ahead of display time. In this case, AVFrame.pts is the mux position
>   and AVFrame.subtitle_pts is the actual presentation time.
>   When filtering subtitles to modify something, it is still desirable
>   to retain the offset between mux time and display start.
> 
> - Sometimes, subtitle events occur in the mux "live" - right at
>   the moment when they are meant to be shown. An example of this is
>   closed captions, and when extracting those via the new splitcc filter,
>   subtitle_pts is equal to the frame.pts value.
>   But CC events do not come regularly, while downstream filters might
>   expect exactly that in order to proceed. That's why the splitcc filter
>   can emit subtitle frames at a certain framerate. It does so by re-
>   sending the most recent subtitle frame - basically not much different
>   from the main heartbeat mechanism (which cannot be used here because
>   the stream has its origin inside the filtergraph, as the secondary
>   output of splitcc).
>   Now, when splitcc sends such a repeated frame, it needs to adjust the
>   frame's pts in order to match the configured output framerate.
>   But that's just for keeping the graph running; the actual display
>   start time (subtitle_pts) must not be changed.
>   It is then up to the downstream filters how to handle those frames.
>   For example, a filter could detect repeated subtitle events (based on
>   subtitle_pts), skip processing, and reuse a previously cached result
>   (some of the filters actually do that).
> 
> - I already mentioned the subtitle heartbeat mechanism that I have kept.
>   The same applies here: it needs to resend frames to retain a certain
>   frame frequency, and for this purpose it needs to send duplicate
>   frames. Those frames need a new pts, but subtitle_pts must remain
>   at its original value.
> 
> - There are extreme cases, like when the complete set of ASS subtitles
>   is muxed at the beginning of the stream with pkt-pts values like 1, 2, 3, ...
>   The pkt-pts is what subtitle frames get as AVFrame.pts, while the actual
>   display start time is in subtitle_pts.
>   The desired way to handle such situations surely varies from case to case.
>   Keeping those values separate is an important key to keeping a wide range
>   of options open for handling this - for current and future subtitle filters.
> 
> My patchset offers a really wide range of possibilities for subtitle
> filtering. Those are real-world production scenarios that are working.
> This is not based on just a few command line experiments.
> 
> Not every scenario that would be desirable is working yet. With subtitles,
> there are a lot of different cases, of which I have given just a few
> examples above, and certain scenarios will need some work to get them
> going - some tweaking, adding a new filter, or adding additional
> options to existing filters (as you can see in the version history).
> My experience from doing just that has made me realize one important
> point: with the proposed architecture, it's possible to get almost
> any desired scenario working.

Maybe this doesn't sound very substantial; let me illustrate what
I mean by looking at an example.
Last week, I needed to convert closed captions during a live transcoding
and deliver them as regular text subtitles (ass, srt, vtt).

As explained above, the problem with CCs is that they are not muxed
ahead of time, as is usually the case with other subtitles - they
appear in the stream right at the moment when they are meant to be
presented. And even worse: there is no duration included. The duration
is solely determined by the occurrence of the following update (which
can be an addition to the current text, a completely new text, or
nothing).

The splitcc filter uses the existing cc_dec decoder, which can work in
two different modes (splitcc exposes the decoder's options):

- Default
  On each CC update, it saves the data and waits (for a certain time)
  until the next update happens. Then it emits the saved event, 
  setting the duration from the time difference.

- Realtime
  Emits subtitle events immediately with the duration set to infinite.
  A minimum interval between two subsequent outputs can be specified 
  for this mode.

None of those modes is really useful for the desired conversion:

In the default mode, subtitles are shown too late, with the delay not
even being constant but varying, as the decoder doesn't delay by a fixed
timespan but by the interval between two events, which can range from
milliseconds to several seconds. The display start times are correct,
though (not delayed), and that can lead to situations where, for example,
an event with a 3-second duration arrives at the client at a time
when 2.5 s of the display time have already elapsed and the remaining
0.5 s of display duration appears like a flicker.
So that mode is certainly suitable for offline conversions, but not
for live transcoding.

When using the real_time mode, on the other hand, we don't have any
durations, and when we convert that to a text subtitle format, the
lines will just fill up the screen to the max. Setting a fixed duration
doesn't help either: the display start times vary arbitrarily,
and when we don't specify the accurate duration, the events will overlap
in time and the texts will be shown stacked while they overlap, which
looks like jumping up and down.

I've been at the same point before in the context of burning-in CCs,
where I discussed the issue with the libass developers:
https://github.com/libass/libass/issues/562

The solution there was the 'render_latest_only' option that
I ended up adding to the overlaytextsubs and textsub2video filters.
Unfortunately, this can only work when implemented at the very end
of the chain (in the former case: the libass renderer), and that option
isn't available when converting to a subtitle output stream which
will be processed by a client we don't have control over.

Eventually I came to a new idea for which I added the scatter_realtime_output
option to splitcc. It basically works like this:

- The splitcc output is no longer driven by the cc_dec decoder's output
- Instead, splitcc emits subtitle events at a fixed "frame"-rate
  based on the configured real_time_latency_msec value (e.g. 250 ms)
- all subtitle frames/events have a fixed duration set to this value
- subsequent events have display start times (subtitle_pts)
  increasing by this (fixed) value
- if there has been an update from the cc_dec decoder between two
  of those events, that update is taken and emitted with the next
  output. The decoded event's start time is replaced by the output
  event's time
  (meaning: start times are quantized to 250 ms intervals)
- if there hasn't been a change, the previous content is re-sent
  (but with the timings changed to the next 250 ms interval)

Essentially, closed caption events are quantized and split ("scattered")
to match that fixed time raster and the filter output is decoupled from
the CC decoder's output timings.


But that's not all. There's another level of decoupling
in the case of the splitcc filter:

This filter has one input (video) and two outputs (0: video, 1: text subs),
where the second output works in a special way.
The first output is just a passthrough of the video frames, from which
we take the CC side data without further manipulation.
It's very simple behavior: you push a frame to the input, and the same
frame is synchronously output at output pin 0.

The second output uses the request_frame API instead, which works
as a kind of pull model. Frames are requested according to the configured
output framerate, and we need to increase the frame.pts value each
time by the reciprocal of that output framerate.

For splitcc, this is set to 5 fps by default. In our example, we have
a 250 ms interval, which makes 4 fps. That means we need to duplicate
every 4th of our output frames. The duplicated frame needs to have
its subtitle_pts unchanged, so it can be identified as a duplicate and
removed/ignored downstream.
Meanwhile, frame.pts needs to follow the graph timing to keep everything
going.

Effectively, these are all different kinds of heartbeat cases. Those beats
are required to drive subtitle filtering, because the subtitles' own timings
alone cannot sustain a constant filtering flow.
Incidentally, all earlier discussions about subtitle filtering came to the
same conclusion: a heartbeat mechanism is required for subtitle filtering.

The purpose of that heartbeat mechanism is to ensure the filtering flow,
and this is driven by AVFrame.pts. That's why it's exactly this field
that needs to be used for it and no other - even when that might not
exactly match the field's doc comment.

It can only work like that. 

Kind regards,
softworkz
