[FFmpeg-devel] [PATCH 1/2] decode: add ff_decode_skip_samples function
Derek Buitenhuis
derek.buitenhuis at gmail.com
Sat Nov 4 18:22:11 EET 2023
Hi,
I'm going to opine a bit here, and also comment on the mov/MP4 patch[0] that accompanies
this set.
This is for both historical purposes, and to distill IRC logs into something more
digestible for others on the list to gain context on the issue, so apologies for
re-treading ground.
On 10/30/2023 5:09 AM, Lynne wrote:
> This is a convenience function, which is required to be called by decoders
> needing to skip samples every time.
> It automatically creates and increments side data.
>
> The idea is to get rid of skip_samples eventually and replace it with this
> function.
So there is a lot to cover here, and lot of nuance to how things should be handled,
that I want to list out, before discussing the specific changes of these two sets of
patches. A bit of of a 'state of the world'.
The goal of this patchset seems to be to aid in gapless playback and correct seeking
in AAC streams (or later, more generally MDCT-styl audio codecs in general), but properly
skipping initial priming samples (which include pre-roll), pre-roll (both normal, and extra
incurred due to SBR) when seeking, and and, though not covered in these sets, I'll mention,
end padding.
First a note on terminology: 'Algorithmic delay' as it is being used here is not quite
correct, for two reasons:
* Latency and pre-roll (or roll distance) are separate things. Opus, for example,
can have a latency as low as 2.5ms, but pre-roll is always at least 80ms - they
are different things which serve different purposes, and I confirmed this with
people who definitely know more about audio than me[1]. Pre-roll is often larger
than latency, and the values stored in file metadata reflect this.
* Pre-roll, or roll distance, are the industry standard terms. Making up out own
terms because we disagree is silly and stubborn, and makes it harder on API
users trying to use the API correctly, or understnd our code.
Next, a quick breakdown of the AAC situation, in terms of both how this it is stored,
what we support, and the state of the ecosystem and types of files that exist:
* 'Raw' ADTS streams have no way to store any of this. The best we can do is guess
the pre-roll. We should not guess priming or end padding, as no matter what we do,
it'll be wrong, and any value will be a cargo culted hack value.
* MP4 - there are two places to store this metadata - one standard, and one proprietary
Apple way. There are, separately, two ways to signal priming length when SBR is present.
* MP4s may contain a user data box with 'iTunSMPB' which contains priming, pre-roll,
and end padding data. We support reading only priming data from this at the moment,
and we set skip samples based on this. This is 'iTunes style' metadata.
* The standards compliant (read: non-iTunes) way is to use an edit list to trim the
priming samples, and, opionally end padding. End padding may also be trimmed by reducing
the sample duration of the last packet in the stts box. Pre-roll is store in the sgpb
box with the 'roll', type, which signals the roll distance as a number of packets;
for example, -1 indicates you should decode an discard the samples of 1 packet before
beginning plaback. Notably, this allows the sgpd box to also be use for video like
periodic intra refresh H.264. libavformat does not current parse or export this info,
but even if we did, converting number of packets to audio samples can get hairy.
* Notably, since in MP4, the edit list represents the exact presentation-level info,
when no edit list, or an edit list startiing at 0 is present, no samples, not even
pre-roll should be trimmed - all players in the wild handle this properly, and it
has been standard practice among streaming services for >10 years to not output
the AAC frames representing priming samples at all (even if there is a small hit
quality). This is what the patch at [0] is addressing.
* My personal opinion is that since priming samples include any inherent delay already,
that if we do not know how many priming samples there are, we should not trim anything
from the start of the file, regardless of format. I am keen on hearing others Opinions(TM)
here, particularily Anton and Martin (sorry for name dropping :)).
* Further complicating matters is the fact that, again thanks to Apple, there are a lot
of broken files around, since iTunes expects files to *not* include addition delay incurred
by SBR in their edit list / priming info, even though, by spec, you are suppose to. This
leads to the unfortunate case where you have tons of files in the wild that both do, and
do not include SBR delay in their edit lists, and there is no way of detecting when this is
the case. I do not have a good solution to this other than proposing a flag somewhere to
switch between behaviors.
Aside, for Opus, we incorrectly read this info from the codec specific box instead of the sgpd box...
but we only ever put it in a write-only field.
Now, on to the patches / implementation (opinions warning):
* As noted above, I don't think we should apply any guessed priming to initial samples (pre-roll,
or 'algorithmic delay, included). No other decoders or players do this, in the world, to my
knowledge, and violating the principal of least surpise because we think we're slightly more
correct isn't great. I also think trying to 'fix' raw ADTS is destined to always be a hack,
an we shouldn't. YMMV. I'd like to hear views from others here. This would make the patch in
[0] redundant.
* I don't think hardcoding a value of 1024 (or 960) is wise, or necessarily even correct. For example,
all HE-AAC MP4s I have signal a seek pre-roll of 2048 samples. Ideally we would pass through seek
pre-roll somehow - but I am not sure of a good way in our existing setup - the write-only codecpar
field for seek pre-roll is in samples, which is kind of incompatible with the way the sgpd box stores
pre-roll as number of packets. We could set it based one the duration of the first packet(s), assuming
all audio codecs will have only 1 packet duration (frame size), or we could add a new field. Possibly,
we could leave the hardcoded values in place, only for seeking, inside e.g. ADTS.
* This kind of brings me to: I don't really think using the same side data for both priming and pre-roll
is a great idea. Clearly this set of problems is already confusing enough to many, and conflating the
two in the API is a sure way to only make it worse.
* If we are going to discard at seek, we need to take this into account in the demuxer's seek code, or
we will always be $preroll number of samples off where the user actually wanted to seek and decode.
I am almost certain I missed even more nuance, and hopefully Martin or Anton can chime in, or I remember more.
But also, given all of this, I think we need to deeply consider how we approach this, so we don't end up with
something that only covers certain cases (and I am sure I forgot more cases). To that end, I do not think rushing
to get a patchset that can change sync on all AAC files in existence into 6.1 is wise. Even when this does go in,
it should be able to sit in master for a good long time before being in a release. As I understand it, FATE is
already unhappy, and it shouldn't be treated as being a problem with FATE vs the set.
Lastly, some time this weekend/week, I will labour to create a more extensive set of AAC files we can use to
test, and use in FATE.
Hope that all made sense and I didn't forget any details in my Covid-induced haze.
Cheers,
- Derek
[0] http://ffmpeg.org/pipermail/ffmpeg-devel/2023-October/316389.html
[1] https://gist.github.com/dwbuiten/18745d6cb253a2304f776581c9f68b30
[2] https://github.com/nu774/fdkaac/blob/master/README#L165-L177
More information about the ffmpeg-devel
mailing list