[FFmpeg-devel] [PATCH] lavf/mov: Add support for edit list parsing.

Derek Buitenhuis derek.buitenhuis at gmail.com
Fri Oct 21 18:32:44 EEST 2016


> At YouTube we have long been receiving MOV files from users, which have non-trivial edit lists (those edit lists which are not just used to offset video start from audio start) and multiple edit lists. Recently the uploads of such files have increased with the introduction of apps that allow video editing and upload, like Vine etc. mov.c in ffmpeg does not have edit list parsing support, i.e. the support for deciding what video/audio packets should be output with what timestamps, based on edit lists. For this reason, we had built basic support for edit list parsing in our version of ffmpeg, which fixes the AVIndexEntries built by mov_build_index by looking at edit lists, and introduces new DISCARD flags in AVPacket and AVFrame to signal to the decoder to discard certain packets.
> 
> For a while our edit list support was broken, as in it didn't properly work for multiple edit lists and it had problems with edit lists ending on B-frames. But we've fixed most of these issues in recent times, and we think that it is now in a good enough condition to be submitted to HEAD. This will not only help the vast userbase of ffmpeg, but will also help us stay up-to-date with ffmpeg and add the power of the ffmpeg developer community to our MOV support. So here's a go at it.
> What is supported:
>  - multiple edit lists
>  - edit lists ending on B-frames
>  - zero segment duration edit lists
> 
> What is not supported:
>  - Changing the rate of playing with edit lists. We basically ignore MediaRate field of edit.

Sorry for being late to the party, but this recently broke some code I work on.
I actually re-subscribed to the mailing list just so I could reply to this.

I see two problems with this implementation, both of which break things:
one is a code-breaking bug, and one is a design problem.

The design one first: This approach is fundamentally wrong. Edit lists are
meant to be applied at presentation level, not packet level. The current
implementation will cause problems in a number of cases:

    * Audio packets, especially audio packets with a large number of samples.
      It's extremely likely that edits will not fall on packet boundaries, and
      depending on the number of samples per packet, audio sync issues can and
      will occur; even with smaller packets of e.g. 1024 samples, a large
      number of entries in the edit list will cause gradual loss of sync.
    * Edit list entries that are out of order, or repeat. This one is obvious;
      simply dropping packets is not sufficient to create a virtual timeline
      like edit lists can be used for by various NLEs. I need to look up
      if these are actually allowed in the ISOBMFF or QT specs, but I know
      Matroska allows it in its virtual timeline feature.
    * Returning timestamps that differ from the codec timestamps. I very
      much disagree with this approach. It's the responsibility of the
      presentation/render/transcode/app/whatever layer to adjust timestamps
      based on edits. I absolutely disagree with libavformat changing
      timestamps from what is coded in the actual file. avformat provides
      demuxers.
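To make the presentation-level alternative concrete, here is a minimal sketch of how an application layer could map coded timestamps onto the virtual timeline an edit list describes. All names and structures here are hypothetical illustrations of the idea, not FFmpeg API; a real implementation would also have to handle MediaRate, empty edits, and timescale conversion.

```python
from dataclasses import dataclass

@dataclass
class Edit:
    media_time: int   # segment start in media timescale units
    duration: int     # segment length in the same timescale

def presentation_pts(packet_pts, edits):
    """Map a coded (media) pts onto the virtual presentation timeline.

    Returns None if the pts falls outside every edit segment
    (i.e. the frame is decoded but not presented). Out-of-order or
    repeated segments fall out naturally, because each segment
    contributes its own output offset -- something whole-packet
    dropping in the demuxer cannot express.
    """
    out_offset = 0
    for e in edits:
        if e.media_time <= packet_pts < e.media_time + e.duration:
            return out_offset + (packet_pts - e.media_time)
        out_offset += e.duration
    return None

# An edit starting at media_time=1500 cuts a 1024-sample audio packet
# (pts 1024..2047) mid-way: a demuxer can only keep or drop the whole
# packet, so each such edit contributes up to one packet of error --
# the gradual desync described above. A renderer can trim samples.
edits = [Edit(media_time=1500, duration=4096)]
print(presentation_pts(2048, edits))  # 548
print(presentation_pts(1000, edits))  # None: decode but don't present
```

The point is that the offset arithmetic and the discard decision live above the demuxer, which keeps libavformat returning the timestamps actually coded in the file.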

Now onto the actual code-breaking bug: You are losing packets entirely.

For example, here is part of a diff of pre-edit list ffprobe and post-edit
list ffprobe:

    --- old.json    2016-10-21 15:35:00.449619260 +0100
    +++ new.json    2016-10-21 15:35:09.392976730 +0100
    @@ -3,28 +3,15 @@
             {
                 "codec_type": "audio",
                 "stream_index": 1,
    -            "pts": -2048,
    -            "pts_time": "-0.046440",
    -            "dts": -2048,
    -            "dts_time": "-0.046440",
    -            "duration": 1024,
    -            "duration_time": "0.023220",
    -            "size": "480",
    -            "pos": "958620",
    -            "flags": "K"
    -        },
    -        {
    -            "codec_type": "audio",
    -            "stream_index": 1,
                 "pts": -1024,
                 "pts_time": "-0.023220",
                 "dts": -1024,
                 "dts_time": "-0.023220",
                 "duration": 1024,
                 "duration_time": "0.023220",
    -            "size": "457",
    +            "size": "480",
                 "pos": "959077",
    -            "flags": "K"
    +            "flags": "KD"
             },

As you can see, the packet, sized 457, is entirely missing. This was generated via
'ffprobe -show_packets -of json -select_streams 1 sample.mp4'. I can provide the
sample privately to you, if you wish. I did also confirm that the packet was not
added elsewhere:

    $ jq '.packets | length' new.json
    57050
    $ jq '.packets | length' old.json
    57051

The hashes I used were today's git HEAD (03ec6b780cfae85b8bf0f32b2eda201063ad061b)
for new.json and 590f025b3dc078a6be58b36c67d87499f62b521c for old.json.

P.S. I do understand the desire to not refactor YouTube's internal codebase to
properly do this at the presentation level, and that YouTube has had this patchset
in its internal fork of FFmpeg for many years, and also that there is a push to
update FFmpeg Google-wide / use more vanilla stuff, but I don't think that's a good
enough reason to have pushed something very specific like this to FFmpeg. For example,
if it had been done at the presentation level, that code could also have been the
foundation of virtual timeline support for Matroska. I realize I'm too late to the
party, and it's already in, though, and I'm not going to advocate for removal. It's
just my 2 cents.

P.P.S. I do apologize if this comes off as hostile, I don't mean it as such. I
appreciate that YouTube has moved towards more community involvement, a lot.

Cheers,
- Derek
