[FFmpeg-devel] [PATCH 1/2] avformat/dv: allow returning damaged audio

Sun Sep 6 18:32:05 EEST 2020

> On Aug 3, 2020, at 5:16 PM, Michael Niedermayer <michael at niedermayer.cc> wrote:
> 
> On Mon, Aug 03, 2020 at 10:38:21PM +0200, Marton Balint wrote:
>> 
>> 
>> On Sun, 2 Aug 2020, Dave Rice wrote:
>> 
>>> 
>>> 
>>>> On Aug 1, 2020, at 5:26 PM, Marton Balint <cus at passwd.hu> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Sat, 1 Aug 2020, Michael Niedermayer wrote:
>>>> 
>>>>> On Sat, Aug 01, 2020 at 07:28:53PM +0200, Marton Balint wrote:
>>>>>> 
>>>>>> 
>>>>>> On Sat, 1 Aug 2020, Michael Niedermayer wrote:
>>>>>> 
>>>>>>> Fixes: Ticket8762
>>>>>>> Signed-off-by: Michael Niedermayer <michael at niedermayer.cc>
>>>>>>> ---
>>>>>>> libavformat/dv.c | 49 +++++++++++++++++++++++++++++++++++++++++-------
>>>>>>> 1 file changed, 42 insertions(+), 7 deletions(-)
>>>>>> 
>>>>>> If "dv remux loses sync", then the timestamps should be fixed, not
>>>>>> additional packets should be generated based on previously read packet data
>>>>>> (which is a fragile approach to begin with, e.g. what if the first frame is
>>>>>> the corrupt one?).
>>>>> 
>>>>> Ticket8762 is about stream copy, so if no packets are returned for audio
>>>>> but are for video and just timestamps are updated this would at least on
>>>>> its own probably not work that well.
>>>> 
>>>> If the timestamps are good, a good player should be able to play it
>>>> correctly, even if audio stream is sparse.
>>>> 
>>>> None of the demuxers generate packets because the timestamps are not
>>>> continuous, I just don't think it would be consistent if DV suddenly
>>>> started to do this. E.g. what if the user wants to drop video with
>>>> no audio?
>>> 
>>> In practice, when dv frames with video and no audio are interleaved
>>> within a dv stream that otherwise has both, it is because the playback
>>> videotape player of the dv tape is in pause mode or the tape is damaged.
>>> These frames are most commonly filled with only video DIF blocks that note
>>> concealment (so the image is a copy of a prior image) and the audio
>>> source pack metadata is missing, but the payload of the audio DIF blocks
>>> is filled with error code, so they would decode as silence.
>> 
>> But if the audio source pack metadata is missing, then how can you determine
>> the audio settings?

I tested with QuickTime Player 7: when it reads frames whose audio source pack metadata is missing, it applies the first audio source pack. So these frames produce silent output, decoded with the parameters of that earlier audio source pack. The disadvantage is that a mid-stream change, such as 32 kHz to 48 kHz, causes QuickTime Player 7 to mangle the audio by applying the wrong characteristics.
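
To illustrate the fallback idea, here is a rough sketch in C. It is not taken from the patch or from libavformat; the AudioParams and AudioParamTracker types and the resolve_audio_params() function are hypothetical names made up for this example. It reuses the most recent valid audio source pack rather than the first one, which is an assumption on my part about how to avoid the 32 kHz to 48 kHz problem, and it reports failure when no pack has been seen yet (the "first frame is the corrupt one" case).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of what an audio source pack tells us. */
typedef struct AudioParams {
    int sample_rate;   /* e.g. 32000 or 48000 */
    int channels;
} AudioParams;

/* Remembers the most recent valid parameters seen in the stream. */
typedef struct AudioParamTracker {
    bool        have_last;
    AudioParams last;
} AudioParamTracker;

/*
 * parsed is NULL when the current frame has no audio source pack.
 * Returns true and fills *out when parameters are known, either from
 * this frame or from the most recent earlier frame that had a pack.
 * Returns false when nothing has been seen yet.
 */
static bool resolve_audio_params(AudioParamTracker *t,
                                 const AudioParams *parsed,
                                 AudioParams *out)
{
    assert(t && out);
    if (parsed) {
        t->last      = *parsed;
        t->have_last = true;
    }
    if (!t->have_last)
        return false;
    *out = t->last;
    return true;
}
```

With a tracker initialized to zero, a frame without a pack fails until the first valid pack arrives; after that, frames without a pack inherit whatever the latest valid pack said.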

>> Or the number of samples the erroneous frame contains
>> (e.g. 1600 vs. 1602)?
> 
> some testcase would be useful here where this is done clearly wrong currently

I put two additional samples at https://archive.org/download/001.dv.audiogap/001.dv.audiogap.dv and https://archive.org/download/001.dv.audiogap/DVC00036_001.dv.audiogap.dv. Each contains a series of frames in the middle in which all video blocks are marked as concealed and all audio blocks contain only error code, with no audio source pack.

For each example, both "ffmpeg -i file -c copy out" and "ffmpeg -i file out" produce a result with a loss of sync and an audio track shorter than the video.

But true, a frame with no audio source pack does not communicate if it should be 1600 or 1602 samples.

In the SMPTE specification for DV at http://web.archive.org/web/20060927044735/http://www.smpte.org/smpte_store/standards/pdf/s314m.pdf, it says on page 18 that for NTSC systems, the five-frame pattern should be: 1600, 1602, 1602, 1602, 1602. So if a frame has no audio source pack, the pattern of the prior frames could be continued, or, upon finding a sequence of such frames, this pattern could simply be applied starting at 1600. Or possibly the relationship between the starting time of the audio data and the starting time of the video data could be used to guess whether 1600 or 1602 maintains the alignment more closely.
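
The two guessing strategies above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patch: the function names are hypothetical, it assumes 48 kHz NTSC audio (30000/1001 frames per second, so 8008 samples per five frames), and it assumes the caller keeps count of the frames and samples already output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Five-frame pattern quoted above for NTSC (assumed here to be 48 kHz). */
static const int ntsc_48k_pattern[5] = { 1600, 1602, 1602, 1602, 1602 };

/*
 * Strategy 1 (hypothetical helper): restart the pattern at the first
 * frame of a gap. gap_frame_index is 0 for the first frame without an
 * audio source pack, 1 for the next one, and so on.
 */
static int guess_samples_by_pattern(int64_t gap_frame_index)
{
    assert(gap_frame_index >= 0);
    return ntsc_48k_pattern[gap_frame_index % 5];
}

/*
 * Strategy 2 (hypothetical helper): pick 1600 or 1602, whichever leaves
 * the running audio total closer to where the video frame ends.
 * After n frames the ideal total is n * 8008 / 5 samples; the comparison
 * is done in units of 1/5 sample so it stays in integer arithmetic.
 */
static int guess_samples_by_alignment(int64_t frames_done, int64_t samples_done)
{
    int64_t ideal5  = (frames_done + 1) * 8008;
    int64_t err1600 = llabs(5 * (samples_done + 1600) - ideal5);
    int64_t err1602 = llabs(5 * (samples_done + 1602) - ideal5);
    return err1600 < err1602 ? 1600 : 1602;
}
```

Both strategies add up to 8008 samples over any five consecutive frames when started from an aligned position, but they may place the 1600-sample frame at a different position within the group, so neither is guaranteed to match what the tape originally held.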

>> Also maybe setting the CORRUPT packet flag should be done in this case?
> 
> yes was thinking that too, that should be in the next revision

In the reference specification, table 26 shows how the STA value is interpreted to note if the frame contains concealed video DIF blocks or not. This doesn’t necessarily mean that the frame is corrupt, but that it is the product of data concealment caused by a misreading of the DV videotape.

[…]
Dave