[Libav-user] questions about decoding outline and decoder state

Sat May 23 05:27:07 EEST 2020

Hello readers,

I am curious about some algorithmic / numerical aspects of specifically
decoding (not encoding) an AC3 or AAC stream. Let's assume that all sample
rates are 48000 and that all audio is mono.

*Main section 1*. Is my outline of the decoding process correct? Which
points are wrong? Some of the points are from
https://libav.org/documentation/doxygen/master/group__lavc__encdec.html.

1. When decoding an audio stream, we segment the stream into an ordered
list of packets, e.g. P[0], P[1], ..., P[999]. (Assume 1000 packets.)
2. This segmentation involves syncwords in order to guard against total
data corruption in case 1 byte is lost. If 1 byte is lost, then usually
only 1 or 2 packets are affected.
3. For AAC, the packets have different numbers of bytes. AC3 files usually
have a constant packet size.
4. In the C process, a decoding object D is initialized.
5. We pass packet P[0] to the D.avcodec_send_packet() method, returning
output Y[0]. This effectively passes a small binary string, on the order
of 500 bytes.
6. Since I'm assuming everything is mono audio, this method returns a 1-d
array of floats. This method may change the internal state of D. We then pass
packet P[1], then P[2], ..., P[999]. These successive calls return Y[1],
Y[2], ..., Y[999], respectively. Because of the possible state change, it
is important to pass the packets in a specific order.
7. This array has length (always?) 1024 for AAC and 1536 for AC3.
8. This page claims that frames stand alone. Does that mean that packets
are decoded independently, or just that 1024-sample frames are encoded
independently, or am I misunderstanding?
 https://wiki.multimedia.cx/index.php/Understanding_AAC
9. (less important for me) If the packet timestamps of the stream are very
uniform, then we will simply concatenate all of the returned arrays Y[0],
..., Y[999] into the full array, and this is the decoded array. If the
packets have nonuniform timestamps, then we still might concatenate all of
the arrays, or maybe insert zero samples, depending on the other parameters
of the FFmpeg call.
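
For reference, the outline above could be checked empirically with a short
PyAV script. This is only a sketch under assumptions: the file path is
whatever the caller supplies, PyAV is assumed to be installed, and the helper
name total_samples is made up for illustration. It records, for every packet,
the packet size in bytes and the sample count of each frame that comes out,
so points 3, 5 and 7 can be inspected rather than assumed. Note that in the
FFmpeg C API the output is retrieved with avcodec_receive_frame() after
avcodec_send_packet(), and a packet is not guaranteed to yield exactly one
frame.

```python
def decode_per_packet(path):
    """Return one (packet_size_bytes, [samples_per_frame, ...]) entry per packet."""
    import av  # PyAV, assumed installed; imported here so total_samples() needs nothing

    results = []
    with av.open(path) as container:
        stream = container.streams.audio[0]
        # demux() yields the packets in stream order (P[0], P[1], ...) and
        # finishes with an empty packet that flushes the decoder.
        for packet in container.demux(stream):
            frames = packet.decode()  # a list: may be empty or hold several frames
            results.append((packet.size, [frame.samples for frame in frames]))
    return results


def total_samples(results):
    """Length of the array obtained by simply concatenating every frame (point 9)."""
    return sum(count for _, counts in results for count in counts)
```

Printing the result of decode_per_packet() for an AAC file and an AC3 file
would show directly whether the packet sizes vary and whether every frame
really has 1024 or 1536 samples.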

--

*Main section 2.* Let's suppose that my outline in section 1 is accurate.
If not, then the rest of my message might be moot.

Let's suppose we have initial decoder object D and either the AAC or AC3
codec and packets P[0], P[1], ..., P[999]. Assuming that the decoder state
matters a lot, I'd like to consider 3 orders of passing the packets to D.

*Order 1*: The same order as the packets. P[0], P[1], ..., P[999]
*Order 2*: we remove P[0] completely.  P[1], P[2], ..., P[999]
*Order 3*: We replace P[0] with an arbitrary packet, P_new. (e.g. P_new =
P[1], but P_new could be an arbitrary packet not in the list.) P_new, P[1],
..., P[999]

In order 1, suppose that the output arrays are Y[0], Y[1], ..., Y[999]
In order 2, since the state may matter, we can't say that the first array
output is Y[1]. Instead, we use different symbols  Y2[1], Y2[2], ...,
Y2[999]. (indexing from 1. This output list has 999 elements.)
In order 3, suppose that the output arrays are Y3[0], Y3[1], ..., Y3[999].
(1000 elements).
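
The three orders can be written down as a small harness. This is a sketch,
not a real codec: ToyDecoder is a hypothetical stand-in whose output mixes
the current packet with the previous one, and it says nothing about how AAC
or AC3 actually behave. The point is only the bookkeeping: each order gets a
fresh decoder, and outputs are keyed by the original packet index so that
Y[n], Y2[n] and Y3[n] refer to the same packet P[n].

```python
from typing import Callable, Dict, List, Sequence, Tuple

Packet = bytes
Frame = List[float]


class ToyDecoder:
    """Hypothetical stateful decoder (NOT AAC or AC3); remembers one previous packet."""

    def __init__(self) -> None:
        self.prev: Frame = [0.0, 0.0]

    def decode(self, packet: Packet) -> Frame:
        cur = [b / 255.0 for b in packet]
        out = [0.5 * (p + c) for p, c in zip(self.prev, cur)]
        self.prev = cur  # this assignment is the state change
        return out


def three_orders(packets: Sequence[Packet], p_new: Packet):
    """Build orders 1, 2 and 3 as lists of (original index, packet) pairs."""
    order1 = list(enumerate(packets))
    order2 = order1[1:]                 # P[0] removed
    order3 = [(0, p_new)] + order1[1:]  # P[0] replaced by P_new
    return order1, order2, order3


def run_order(make_decoder: Callable[[], ToyDecoder],
              indexed_packets: Sequence[Tuple[int, Packet]]) -> Dict[int, Frame]:
    """Decode the packets in the given order with a fresh decoder."""
    decoder = make_decoder()
    return {i: decoder.decode(p) for i, p in indexed_packets}
```

With a real decoder in place of ToyDecoder (for example a PyAV codec context
created anew for each order), the same three dictionaries would be the Y, Y2
and Y3 discussed here.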

My main questions are: Is the state of D flushed fairly quickly or is the
state very persistent such that any sequence 'mutation' will significantly
change state, or somewhere in between? Although the lists Y, Y2, and Y3
are clearly similar waveforms perceptually, are they completely different
at a low level or do they converge?

If hypothetically the state of D is flushed after 50 packets, then would
Y[n], Y2[n], Y3[n] be approximately equal length-1024 float arrays for n >=
51? Is there any such value of n? Or maybe the state of D depends on how
many packets are decoded and is otherwise flushed after 50 packets? If so,
is Y[n] ~ Y3[n] for n >= 51 but Y[n] != Y2[n] for any large n because the
decoder processed n packets before outputting Y[n] but only n-1 packets
before Y2[n]?
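
The "is there any such value of n" question can be made concrete with a
small comparison function. This is a sketch under assumptions: outputs are
taken to be dictionaries mapping the packet index to a list of floats, the
function name and the tolerances are made up for illustration, and "~" is
interpreted as element-wise closeness within those tolerances.

```python
import math
from typing import Dict, List, Optional


def first_agreeing_index(a: Dict[int, List[float]],
                         b: Dict[int, List[float]],
                         rel_tol: float = 1e-6,
                         abs_tol: float = 1e-9) -> Optional[int]:
    """Smallest n such that a[m] ~ b[m] for every shared index m >= n.

    Returns None if even the last shared output differs, i.e. the two
    decodes never converge within the data available.
    """
    shared = sorted(set(a) & set(b))
    n = None
    # Walk backwards from the last packet and stop at the first mismatch.
    for m in reversed(shared):
        same = len(a[m]) == len(b[m]) and all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(a[m], b[m])
        )
        if not same:
            break
        n = m
    return n
```

A result of None for (Y, Y2) together with a small integer for (Y, Y3) would
correspond to the scenario in which the state depends on how many packets
have been decoded; a small integer for both would mean the state is flushed
quickly.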

Note that I have experimented with PyAV and I suspect that for the AC3
codec and a deletion mutation, there is no such value of n. The decoder
states will always be different. I do not know about a substitution
mutation or the AAC codec or if I am doing my PyAV analysis correctly. I
don't know for sure and I would be obliged if a reader knows. I have only
done experimenting with PyAV since I am not used to using C.

Sincerely,
Bobby