[FFmpeg-user] audio artefacts after segment and transcode
philipp.list at pentachoron.net
Wed May 8 07:48:49 EEST 2019
>> I am transcoding larger videos on a set of computers in parallel. I do this by segmenting an input file at key-frames (ffmpeg -i ... -f segment), then transcode parts using GNU parallel, then recombine parts into one output file using ffmpeg -f concat -i ...). This works well, but I had issues with audio being not in sync with videos or having audio "artefacts". I solved that by transcoding audio separately, but I would prefer the more direct solution to transcode both audio and video in one step.
> Probably transcoding video and audio (that’s been segmented while stream copying) in one step is more or less causing this…
After some tests yesterday applying your suggestions, and after tests of my own conducted before my initial post, here are my thoughts:
I think that the segmentation (ffmpeg -i ... -f segment ...) itself does not change anything. The segmentation is just splitting up the input file (keeping timestamps, copying data). The problem arises afterwards when processing the parts/segments. Somehow the timestamps are getting out of sync and I have a feeling that this is because of the segmentation (to be precise: muxer and encoder do not have the whole input file, but only a part of it).
Each part is demuxed, decoded, encoded and muxed again, and somewhere in this process the timestamps get modified (by the (de)muxer or by the de-/encoder). This might be because the container or stream codec needs a different timebase (tbn), or because the encode changes from a constant to a variable frame rate, or because the output container has different timebase requirements than the input container specified.
If we use the whole file as input, like ffmpeg -i input.avi -c:v libx265 -c:a aac output.mov, ffmpeg/libav takes good care of this and the muxing/encoding works like a charm. But when the input is first segmented into parts that are encoded/muxed separately, ffmpeg/libav does not have the full picture and tries to fill in gaps. One sign that this is happening are warnings like "[mov @ 0x561bc46a3d80] Non-monotonous DTS in output stream 0:1; previous: 121611520, current: 121611024; changing to 121611521. This may result in incorrect timestamps in the output file.". Stream 0:1 is audio, and looking at the timestamps, the audio stream seems to be "behind". ffmpeg then does the only thing it can (lacking the whole picture because of the segmentation): it corrects the timestamp to the best known value. Unfortunately this results in an audio gap, creating the audio artefacts.
The question is: why is this happening? The libx265 output should have a constant frame rate of 25, and the source video has a constant frame rate of 25, so why does the audio lag behind (why don't we have enough audio samples)? I can currently only explain this by the libx265 encoder, or maybe the mov muxer, somehow changing the frame rate to 24.542 (as mediainfo/ffprobe tell me).
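To narrow down where the rate drifts, the timebases and per-packet timestamps of an individual segment can be inspected with ffprobe. A sketch (the segment path is a placeholder from my step 2 naming scheme):

```shell
# Show the video stream's timebase and nominal vs. effective frame rate:
ffprobe -v error -select_streams v:0 \
  -show_entries stream=time_base,r_frame_rate,avg_frame_rate \
  -of default=noprint_wrappers=1 /tmp/input_part_000000.mp4

# Dump per-packet audio timestamps to spot gaps or non-monotonic DTS:
ffprobe -v error -select_streams a:0 \
  -show_entries packet=pts_time,dts_time,duration_time \
  -of csv=p=0 /tmp/input_part_000000.mp4 | head
```

Comparing the first and last packet timestamps of adjacent segments should show whether the gap is already present after segmentation or only appears after the per-segment encode.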
Or, another question that might solve the issue: how can I tell ffmpeg/libav to keep the timestamps as long as possible ("timestamp passthrough"), so that the final ffmpeg -f concat -i XYZ call still sees the original timestamps and thus the whole picture of the original video again?
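For the passthrough idea, ffmpeg does have options that keep input timestamps instead of rewriting them; I have not verified them in this exact pipeline, and -copyts interacts with the muxer's negative-timestamp handling, so treat this as a sketch:

```shell
# Hypothetical per-segment encode that tries to preserve input timestamps:
#   -copyts                       do not re-base input timestamps to zero
#   -vsync passthrough            do not duplicate/drop frames to hit a rate
#   -avoid_negative_ts disabled   stop the muxer from shifting timestamps again
ffmpeg -y -hide_banner -i /tmp/input_part_000000.mp4 \
  -copyts -vsync passthrough -avoid_negative_ts disabled \
  -c:v libx265 -c:a aac /tmp/output_part_000000.mov
```

Whether mov/mp4 accepts the resulting (possibly non-zero-based) timestamps would need testing; the concat demuxer step afterwards is the real judge.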
> If you can live with just encoding in one step you might get better results?
> Of course then you’ll need to decode the whole file from start to finish, but that’s not as cpu intensive, and not reliable, as you’ve seen.
Thank you for the suggestion! Yes, it makes sense that "pre-encoding" (into yuv4, rawvideo or similar) in the segmentation phase might improve the situation, and I tried it (using yuv4 and pcm in the pre-segmentation). The result is better, but I can still hear some artefacts (less pronounced, but still there). The reason I would prefer to avoid pre-segmentation into a raw format is IO boundness: many videos, such as timelapses or raw camera captures, are 2k+. Pre-transcoding them into yuv4 or rawvideo produces enormous amounts of data, so the pipeline easily becomes IO bound, which would annihilate the performance gain of a multi-computer setup for transcoding quickly. Still, I agree that it is only a matter of decoding (and converting to a raw stream), which is far less CPU intensive than encoding. So this would be a workable scenario for small-resolution videos.
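To put a rough number on the IO concern: raw yuv420p needs width × height × 1.5 bytes per frame. A back-of-the-envelope calculation for a hypothetical 2560x1440 source at 25 fps:

```shell
# Raw yuv420p data rate: width * height * 1.5 bytes per frame, times fps.
# 2560x1440 @ 25 fps is just a hypothetical 2k+ example.
W=2560; H=1440; FPS=25
BYTES_PER_SEC=$((W * H * 3 / 2 * FPS))
echo "$((BYTES_PER_SEC / 1000000)) MB/s"
```

That is about 138 MB/s, i.e. roughly half a terabyte per hour of footage, which quickly saturates disks and network when several machines read and write the raw segments in parallel.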
>> # step 2: create segments
>> ffmpeg -y -hide_banner -i /tmp/input.avi -f segment -segment_time 0.5 -reset_timestamps 1 -segment_list /tmp/input_part.list -segment_list_type ffconcat -r 25 -c:v copy -c:a copy -strict experimental -c:s copy -map v? -map a? -map s? /tmp/input_part_%06d.mp4
> try changing it to
> ffmpeg -y -hide_banner -i /tmp/input.avi -f segment -segment_time 0.5 -segment_list /tmp/input_part.list -segment_list_type ffconcat -map 0? -c copy -c:v yuv4 -c:a pcm_f32le /tmp/input_part_%06d.mov
> Segment sizes should be longer though, at 0.5 seconds the overhead would not be insignificant. I’m guessing it was just for the demo?
Segment sizes: yes, exactly, I was using a short segment_time of 0.5 just for the demo, so that the audio artefacts become more pronounced. A value I normally choose is between 10 and 30 seconds (depending on the GOP / key-frame interval).
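Since segmentation with -c copy can only cut at key frames, the source's key-frame spacing bounds the usable segment_time. A quick way to list the key-frame timestamps (a common ffprobe idiom; -skip_frame nokey makes the decoder emit only key frames, and the path is a placeholder):

```shell
# Print the timestamp of every key frame in the source video stream:
ffprobe -v error -skip_frame nokey -select_streams v:0 \
  -show_entries frame=pts_time -of csv=p=0 /tmp/input.avi
```

The gaps between consecutive timestamps show the effective GOP length, which segment_time should be a multiple of (or at least not much smaller than).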
You used -c:a pcm_f32le. In the test setup I presented I forgot to specify an audio codec, sorry for that. I normally include one as well.
>> # step 4: create a ffconcat file for the output file
>> for f in /tmp/output_part_*.mp4; do echo "file '$f'" >>/tmp/output_part.list; done
> The first line in the ffconcat being ffconcat version 1.0 seems to help, you should probably just use the generated ffconcat segment list as the template,
> sed 's/input/output/g' /tmp/input_part.list > /tmp/output_part.list
Right, that's the better solution.
>> Do you have an explanation or do you know how this audio artefacts can be solved? Can it be that it's just an issue with codec timebases or because libx265 is using a variable frame rate (ffprobe of output.mov has an effective fps of 23.94 while input.avi has a constant frame rate of 25 fps)? I would very much appreciate some help.
> The timebase thing could make sense, something something rounding issues when segmenting, timestamps being unaligned, type of thing? But I don’t think x265 does variable frame rates (not sure), regardless in an mp4 it’s most definitely constant. Set the framerate during the encoding step if that’s important, the “normal” ones you can use abbreviations for (ntsc, pal, film, ntsc-film, etc) to pass the right rate instead of rounding the decimals.
Right, the -r must be in the encoding step. It doesn't make much sense in combination with -c:v copy, of course...
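For reference, the per-segment encoding step with the rate pinned there instead of in the segmentation. This is a sketch with placeholder paths; -r 25 as an output option together with -vsync cfr forces a constant frame rate, duplicating or dropping frames if needed:

```shell
# Hypothetical per-segment encode forcing 25 fps CFR on the output:
ffmpeg -y -hide_banner -i /tmp/input_part_000000.mp4 \
  -r 25 -vsync cfr \
  -c:v libx265 -c:a aac /tmp/output_part_000000.mov
```

This should at least stop mediainfo from reporting 24.542 fps on the parts, though it does not by itself explain why the audio falls behind.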
Regarding variable frame rate for x265:
mediainfo ./output.mov # and ./output.mp4
Frame rate mode : Variable
Frame rate : 24.542 FPS
Minimum frame rate : 8.333 FPS
Maximum frame rate : 25.000 FPS
Original frame rate : 25.000 FPS
Both .mp4 and .mov show a frame rate of 24.542 (and min/max values that differ), which is why I was referring to variable frame rate.
I appreciate your reply.