[FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame for subtitle handling

Soft Works softworkz at hotmail.com
Sun Dec 5 19:58:16 EET 2021



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
> Sent: Sunday, December 5, 2021 5:40 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> 5 Dec 2021, 17:23 by softworkz at hotmail.com:
> 
> > @@ -491,6 +499,39 @@ typedef struct AVFrame {
> >  */
> >  uint64_t channel_layout;
> >
> > +    /**
> > +     * Display start time, relative to packet pts, in ms.
> > +     */
> > +    uint32_t subtitle_start_time;
> > +
> > +    /**
> > +     * Display end time, relative to packet pts, in ms.
> > +     */
> > +    uint32_t subtitle_end_time;
> >
> 
> Milliseconds? Our entire API's based around timestamps
> with time bases. Plus, we all know what happened when
> Matroska settled onto milliseconds and ruined a perfectly
> complex but good container.
> Make this relative to the PTS field, with the same timebase
> as the PTS field.
> There's even a new AVFrame->time_base field for you to
> set so you wouldn't forget it.

The internal format for text subtitles is ASS, which uses a
timebase of milliseconds.

All existing subtitle decoders and encoders use this, and I'm
afraid I won't go and change them all.
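
Just as an illustrative sketch (not part of the patch): this is how a
consumer could translate the ms-based offsets into absolute timestamps,
assuming the frame's time_base is set; av_make_q() and av_rescale_q()
are existing avutil API:

    #include "libavutil/frame.h"
    #include "libavutil/mathematics.h"

    /* subtitle_start_time/subtitle_end_time are in ms, relative to pts */
    static int64_t subtitle_display_end(const AVFrame *frame)
    {
        AVRational ms = av_make_q(1, 1000); /* millisecond timebase */
        return frame->pts +
               av_rescale_q(frame->subtitle_end_time, ms, frame->time_base);
    }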

> > +    /**
> > +     * Number of items in the @ref subtitle_areas array.
> > +     */
> > +    unsigned num_subtitle_areas;
> > +
> > +    /**
> > +     * Array of subtitle areas, may be empty.
> > +     */
> > +    AVSubtitleArea **subtitle_areas;
> >
> 
> There's no reason why this cannot be handled using the buffer
> and data fields. If you need more space there, you're free to bump
> the maximum number of pointers, plus this removes the horrid
> malloc(1) hack. We've discussed this, and I couldn't follow why
> this was bad in the email discussion you linked.

There are reasons; I mentioned some of them in an earlier
discussion with Hendrik.
The effort alone to relate the buffers to the subtitle areas (which
area 'owns' which buffer) is not reasonable. There are too many cases
to consider: what happens when there are three areas and the second
one doesn't have any buffer? The convention is that the buf array
should be filled contiguously. Managing those relations is error-prone
and would require a lot of code.
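
For illustration, here is a minimal sketch (not taken from the patchset)
of how a consumer walks the proposed array; process_area() is a
hypothetical helper, the fields are the ones quoted above:

    /* Each area carries its own buffer references, so there is no need
       to map AVFrame->buf[] indices back to individual areas. */
    for (unsigned i = 0; i < frame->num_subtitle_areas; i++) {
        AVSubtitleArea *area = frame->subtitle_areas[i];
        if (!area)
            continue;
        process_area(area); /* hypothetical per-area processing */
    }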

> > +    /**
> > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > +     *
> > +     * @deprecated This is kept for compatibility reasons and corresponds
> to
> > +     * AVSubtitle->pts. Might be removed in the future.
> > +     */
> > +    int64_t subtitle_pts;
> >
> 
> I'm not going to accept a field which is instantly deprecated.
> As we've discussed multiple times, please merge this into
> the regular frame PTS field. We already have _2_ necessary
> stard/end fields.

--

> I agree with this entirely. Even ignoring the fact that adding a new
> field thats deprecated is instantly a disqualification, AVSubtitle had
> one pts field, AVFrame already has one pts field - both are even
> documented to have the same semantic. They should just contain the
> exact same data, thats how you achieve compatibility, not by claiming
> you need a new field for compatibility reasons.
> 
> - Hendrik

I think the mistake was to declare subtitle_pts as deprecated. I
added the deprecation at a very early point, when I still thought
it could be eliminated.

Even though we are driving subtitle data through the graph attached
to AVFrame, the behavior involved is very different from audio and
video frames. There is not just one but many different ways in which
subtitle data can appear in a source and travel through a filtergraph:

- Sometimes, subtitle events are muxed into a stream many seconds
  ahead of display time. In this case, AVFrame.pts is the mux position
  and AVFrame.subtitle_pts is the actual presentation time.
  When filtering subtitles to modify something, it is still desirable
  to retain the offset between mux time and display start.

- Sometimes, subtitle events occur in the mux "live" - right in
  the moment when they are meant to be shown. An example of this are
  closed captions, and when extracting those via the new splitcc filter,
  the subtitle_pts is equal to the frame.pts value.
  But CC events do not come regularly, while downstream filters might
  expect exactly that in order to proceed. That's why the splitcc filter
  can emit subtitle frames at a certain framerate. It does so by re-
  sending the most recent subtitle frame - basically not much different
  from the main heartbeat mechanism (which cannot be used here, because
  the stream has its origin inside the filtergraph, as the secondary
  output of splitcc).
  Now when splitcc sends such a repeated frame, it needs to adjust the
  frame's pts in order to match the configured output framerate.
  But that's only for keeping the graph running; the actual display start
  time (subtitle_pts) must not be changed.
  It then depends on the downstream filters how to handle those frames.
  For example, a filter could detect repeated subtitle events based on
  subtitle_pts, skip processing and reuse a previously cached result
  (some of the filters actually do that - see the sketch below this list).

- I already mentioned the subtitle_heartbeat mechanism that I have kept.
  The same applies here: it needs to resend frames to retain a certain
  frame frequency, and for this purpose it needs to send duplicate
  frames. Those frames need a new pts, but subtitle_pts must remain
  at its original value.

- There exist extreme cases, like where the complete set of ASS subtitles
  is muxed at the beginning of the stream with pkt-pts values like 1, 2, 3, ...
  The pkt-pts is what subtitle frames get as AVFrame.pts, while the actual
  display start time is in subtitle_pts.
  The desired way to handle such situations surely varies from case to case.
  Having those values separate is an important key to keeping a wide range
  of options open to handle this - for current and future subtitle filters.
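
To illustrate the point about repeated frames, here is a rough sketch
(not taken from the patchset) of what a downstream filter's input
callback could do; MySubContext and its members are hypothetical filter
state, ff_filter_frame() is the usual lavfi helper:

    static int filter_frame(AVFilterLink *inlink, AVFrame *frame)
    {
        AVFilterContext *ctx = inlink->dst;
        MySubContext    *s   = ctx->priv;   /* hypothetical filter state */

        if (s->have_last && frame->subtitle_pts == s->last_subtitle_pts) {
            /* Same event re-sent by splitcc/heartbeat just to keep the
               graph running: skip the expensive processing and pass the
               frame on (or substitute a previously cached result). */
            return ff_filter_frame(ctx->outputs[0], frame);
        }

        s->last_subtitle_pts = frame->subtitle_pts;
        s->have_last         = 1;

        /* ... full processing of the new subtitle event ... */
        return ff_filter_frame(ctx->outputs[0], frame);
    }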

My patchset offers a really wide range of possibilities for subtitle
filtering. These are real-world production scenarios that actually work;
this is not based on just a few command-line experiments.

Not every scenario that would be desirable works yet. With subtitles,
there are a lot of different cases, of which I have given just a few
examples above, and certain scenarios will need some work to get them
working - by tweaking, adding a new filter or adding additional
options to existing filters (as you can see in the version history).
My experience from doing just that has made me realize one important
point: with the proposed architecture, it is possible to get almost
any desired scenario working, even those that don't work already.

It can all be done with filters, and the next change I'd want to make
after the merge is to move the heartbeat functionality into a filter.
That allows much better control over its behavior (and the possibility
to choose at which point in the graph the heartbeat should originate).
It will also make sense to have a counterpart filter to this, which
eliminates duplicated subtitle frames.

Having both pts and subtitle_pts is an important element for all these
things to work.
If they were merged into a single value, probably half of the
example scenarios of my patchset would no longer work.

Kind regards,
softworkz