[FFmpeg-user] Nvidia Transcoding: Failing Using xstack (When Running Under systemd)
Shane Warren
shanew at innovsys.com
Tue Nov 26 21:35:13 EET 2024
On Tue, 19 Nov 2024, 01:20 Shane Warren, <shanew at innovsys.com> wrote:
>> On Mon, 18 Nov 2024, 11:33 pm Shane Warren, <shanew at innovsys.com> wrote:
>>
>> >> I have been trying to track down why I get strange decoding issues
>> >> in ffmpeg when transcoding with xstack using nvidia decoding and
>> >> encoding.
>> >>
>> >> Note: I use two 1-minute-long .ts files for this example. If you
>> >> want my inputs, they are available here (as input1.ts and input2.ts):
>> >>
>> >>
>> >> https://drive.google.com/drive/folders/1mZ8xiNvz5ez1ULlNsy5a3KhnhaqQ2Hgo?usp=drive_link
>> >>
>> >> I got the latest ffmpeg and tried this command (xstacking 2 videos
>> >> into 1 output):
>> >>
>> >> ffmpeg -y -threads 2 -nostats -loglevel verbose -probesize 5M \
>> >>   -filter_threads 4 -threads 2 -re -fflags +genpts -fflags discardcorrupt \
>> >>   -extra_hw_frames 2 -hwaccel cuda -hwaccel_output_format cuda -threads 2 \
>> >>   -thread_queue_size 4096 -heavy_compr 1 -thread_queue_size 4096 -re -i input1.ts \
>> >>   -extra_hw_frames 2 -hwaccel cuda -hwaccel_output_format cuda -threads 2 \
>> >>   -thread_queue_size 4096 -heavy_compr 1 -thread_queue_size 4096 -re -i input2.ts \
>> >>   -filter_complex "\
>> >> [0:v:0]yadif_cuda=deint=interlaced,scale_cuda=768:432,hwdownload,format=nv12,fps=60000/1001[v0];\
>> >> [1:v:0]yadif_cuda=deint=interlaced,scale_cuda=768:432,hwdownload,format=nv12,fps=60000/1001[v1];\
>> >> [v0][v1] xstack=inputs=2:layout=0_0|0_h0[mosaic];\
>> >> [mosaic]hwupload_cuda,scale_cuda=w=1280:h=720:format=yuv420p:force_original_aspect_ratio=decrease,hwdownload,format=yuv420p,pad=1280:720:(ow-iw)/2:(oh-ih)/2,hwupload_cuda[out0]" \
>> >>   -filter:a:0 "aresample=async=10000,volume=1.00" -c:a:0 ac3 -threads 2 -ac:a:0 6 -ar:a:0 48000 -b:a:0 384k \
>> >>   -filter:a:1 "aresample=async=10000,volume=1.00" -c:a:1 ac3 -threads 2 -ac:a:1 6 -ar:a:1 48000 -b:a:1 384k \
>> >>   -map "[out0]" -map "0:a:0" -map "1:a:0" \
>> >>   -c:v h264_nvenc -b:v 6000k -minrate:v 6000k -maxrate:v 6000k -bufsize:v 12000k -a53cc 1 -tune ll -zerolatency 1 -cbr 1 -forced-idr 1 -strict_gop 1 -threads 2 -profile:v high -level:v 4.2 -bf:v 0 -g:v 30 \
>> >>   -f mpegts -muxrate 8238520 -pes_payload_size 1528 "udp://@225.105.0.37:10102?pkt_size=1316&bitrate=8238520&burst_bits=10528&ttl=64"
>> >>
>> >> If you run that command in Ubuntu 22.04 it works 100% fine and
>> >> transcodes till the end of the input file(s).
>> >>
>> >> What doesn't work is if you start that process under systemd
>> >> non-interactively like so:
>> >>
>> >> systemd-run -S
>> >>
>> >> If you then run that same command, it fails in a strange way.
>> >>
>> >> Note: It's important that you output to multicast. If I try the
>> >> same command outputting to a file, it works fine (my guess is that
>> >> any network-based output exhibits this behavior).
>> >>
>> >> You will see logs like this:
>> >>
>> >> [Parsed_scale_cuda_1 @ 0x55da86a03340] w:1920 h:1080 fmt:nv12 -> w:768 h:432 fmt:nv12
>> >>
>> >> And then about 1-2 seconds pass before another log line comes out.
>> >>
>> >> Eventually (after many stalls and logs) this log comes out and the
>> >> transcode stops:
>> >>
>> >> [vost#0:0/h264_nvenc @ 0x55da86a3f780] Error submitting a packet to the muxer: Cannot allocate memory
>> >>
>> >> I attached GDB to ffmpeg while it was stalled, and it is inside a
>> >> call that is trying to compile CUDA code.
>> >>
>> >> If I'm not doing xstack (I'm pretty sure this has to do with
>> >> multiple
>> >> inputs) nvidia does not stall.
>> >>
>> >> Does anyone have any idea what is happening here? I launch ffmpeg
>> >> from a C++ wrapper daemon. If that daemon is started via systemd,
>> >> transcodes with multiple nvidia inputs fail. However, if I launch
>> >> my daemon by hand at a terminal, it works fine.
>> >>
>> >> Thanks
>> >>
>>
>> > Paste the content of the systemd unit file here.
>> > Logs from the same (systemctl status unit-name.service) will also assist.
>> > That might help in understanding how and why the systemd unit is failing.
>>
>> systemd service file:
>>
>> [Unit]
>> Description=Transcoder Service
>> After=default.target
>> StartLimitInterval=0
>>
>> [Service]
>> Type=forking
>> ExecStart=/opt/bin/videotranscoder
>> Restart=always
>> RestartSec=15
>> TasksMax=infinity
>> LimitCORE=infinity
>>
>> [Install]
>> WantedBy=default.target
>>
>> Logs:
>>
>> * videotranscoder.service - Innovative Video Transcoder
>>      Loaded: loaded (/lib/systemd/system/videotranscoder.service; disabled; vendor preset: enabled)
>>      Active: active (running) since Mon 2024-11-18 16:11:09 CST; 3min 19s ago
>>     Process: 50296 ExecStart=/opt/bin/videotranscoder (code=exited, status=0/SUCCESS)
>>    Main PID: 50298 (videotranscoder)
>>       Tasks: 81
>>      Memory: 915.1M
>>         CPU: 1min 1.870s
>>      CGroup: /system.slice/videotranscoder.service
>>              |-50298 /opt/bin/videotranscoder
>>              |-50320 /bin/sh -c "/opt/bin/ffmpeg -y -threads 2 -nostats -nostdin -loglevel verbose -progress pipe:1 -probesize 5M -filter_threads 4 -threads 2 -re -fflags +genpts -fflags discardcorrupt -hwaccel_device 3 -extra_hw_frames 2 -hwaccel cuda -h>
>>              `-50322 /opt/bin/ffmpeg -y -threads 2 -nostats -nostdin -loglevel verbose -progress pipe:1 -probesize 5M -filter_threads 4 -threads 2 -re -fflags +genpts -fflags discardcorrupt -hwaccel_device 3 -extra_hw_frames 2 -hwaccel cuda -hwaccel_outpu>
>>
>> Nov 18 16:14:26 encoder10029unit4 videotranscoder:50296[50298]: FileTranscoder: [u:4,t:1,f:9f201704-a501-4e94-bce7-f3ac8e83a519.ts] Adding audio output: ac3, 6 channels, 384 kbps.
>> Nov 18 16:14:26 encoder10029unit4 videotranscoder:50296[50298]: FileTranscoder: [u:4,t:1,f:9f201704-a501-4e94-bce7-f3ac8e83a519.ts] Audio bitrate is 0, defaulting audio bitrate to 128k for aac.
>> Nov 18 16:14:26 encoder10029unit4 videotranscoder:50296[50298]: FileTranscoder: [u:4,t:1,f:9f201704-a501-4e94-bce7-f3ac8e83a519.ts] Adding audio output: aac, 2 channels, 128 kbps.
>> Nov 18 16:14:26 encoder10029unit4 videotranscoder:50296[50298]: FileTranscoder: transcode ffmpeg cmd (starting): ffmpeg -hide_banner -y -nostats -hwaccel_device 1 -hwaccel cuvid -i /video/vod/in/9f201704-a501-4e94-bce7-f3ac8e83a519.ts -filter_complex "hw>
>> Nov 18 16:14:28 encoder10029unit4 videotranscoder:50296[50298]: VideoTranscodeApp: [u:4,t:3,p:1: 225.105.0.56:10102] [fifo @ 0x55c5facc4840] Recovery attempt #1
>> Nov 18 16:14:28 encoder10029unit4 videotranscoder:50296[50298]: VideoTranscodeApp: [u:4,t:3,p:1: 225.105.0.56:10102] [mpegts @ 0x55c5f6bd2900] service 1 using PCR in pid=256, pcr_period=20ms
>> [mpegts @ 0x55c5f6bd2900] muxrate 8238520,
>> Nov 18 16:14:28 encoder10029unit4 videotranscoder:50296[50298]: VideoTranscodeApp: [u:4,t:3,p:1: 225.105.0.56:10102] sdt every 500 ms, pat/pmt every 100 ms
>> Nov 18 16:14:28 encoder10029unit4 videotranscoder:50296[50298]: VideoTranscodeApp: [u:4,t:3,p:1: 225.105.0.56:10102] [fifo @ 0x55c5facc4840] Recovery successful
>> Nov 18 16:14:28 encoder10029unit4 videotranscoder:50296[50298]: VideoTranscodeApp: [u:4,t:3,p:1: 225.105.0.56:10102] [fifo @ 0x55c5facc4840] FIFO queue flushed
>> Nov 18 16:14:28 encoder10029unit4 videotranscoder:50296[50298]: VideoTranscodeApp: [u:4,t:3,p:1: 225.105.0.56:10102] [AVIOContext @ 0x7fa8b4014300] Statistics: 5395788 bytes written, 0 seeks, 4657 writeouts
>> Nov 18 16:14:29 encoder10029unit4 videotranscoder:50296[50298]: VideoTranscodeApp: [u:4,t:3,p:1: 225.105.0.56:10102] [fifo @ 0x55c5facc4840] FIFO queue full
>>
>> I see the problem.
>>
>> Your output is emulating CBR over mpegts, but it's overshooting.
>> Lower your buffer size to about 5*(bitrate/fps). Assuming a frame rate of 30 fps and the 6000k video bitrate above, use -bufsize:v 1000k or thereabouts.
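As a quick check of the arithmetic in that suggestion, here is a small shell sketch. The function name bufsize_kbit is made up for illustration. The 6000 kbit/s figure is the -b:v value from the command above, and 30 fps is the reply's assumption (the filter graph in the command actually runs at 60000/1001 fps, roughly 60).

```shell
#!/bin/sh
# Hypothetical helper: buffer size in kbit for about five frames' worth of video.
# $1 = video bitrate in kbit/s, $2 = frame rate in fps (integers only).
bufsize_kbit() {
    echo $(( 5 * $1 / $2 ))
}

bufsize_kbit 6000 30   # -b:v 6000k at the assumed 30 fps
bufsize_kbit 6000 60   # same bitrate at roughly 60 fps
```

The two calls print 1000 and 500, which would correspond to -bufsize:v 1000k and -bufsize:v 500k respectively.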
> First, thanks for that info; I was never quite sure what buffer size was correct. However, after changing to that buffer size I get the same behavior.
>
> The key thing is that I ran this under systemd-run for a reason: I was trying to show the simplest way to make this happen. I'm running a stock Ubuntu 22.04 with CUDA 12.4 and the latest stable nvidia driver. If anyone has an nvidia-enabled ffmpeg build and Ubuntu 22.04 with an nvidia card, this should fail for them too.
>
> I'm baffled why it works fine when I start it from an interactive terminal (ssh or a directly connected keyboard/monitor), but exhibits this behavior when I start it from systemd-run or when it's started by a systemd unit (like on a reboot or package install).
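One way to narrow down an "interactive terminal versus systemd" difference is to compare the environment each one hands to the process. The sketch below is only a diagnostic aid, not something tried in this thread. The helper name env_diff and the /tmp file names are hypothetical, and the systemd-run invocation in the comments assumes root privileges and a systemd new enough to support --pipe and --wait.

```shell
#!/bin/sh
# Hypothetical helper: print the variables that differ between two "env" dumps.
# Lines only in the first file are prefixed "<", lines only in the second ">".
env_diff() {
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    # grep exits non-zero when nothing differs; that is not an error here.
    diff "$1.sorted" "$2.sorted" | grep '^[<>]' || true
    rm -f "$1.sorted" "$2.sorted"
}

# Usage (not run here; needs systemd and usually root):
#   env > /tmp/env.shell
#   systemd-run --pipe --wait --quiet env > /tmp/env.systemd
#   env_diff /tmp/env.shell /tmp/env.systemd
```

Variables such as HOME, USER, or PATH that show up on only one side are the first candidates to set explicitly in the unit file and retest.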
I have some more details on this. When it is stalled calling scale_cuda, I see it stuck in this call stack:
#0 0x00007f348129b4e0 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#1 0x00007f34812444b8 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#2 0x00007f3481047ce5 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#3 0x00007f3480495ecb in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#4 0x00007f348049600b in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#5 0x00007f348045ff87 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#6 0x00007f3480ff7d14 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#7 0x00007f3480ff7daf in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#8 0x00007f34803331bc in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#9 0x00007f348033bcc1 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#10 0x00007f3480340cd3 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#11 0x00007f3480343c65 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#12 0x00007f3480334789 in __cuda_CallJitEntryPoint () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#13 0x00007f359d166780 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#14 0x00007f359d15a507 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#15 0x00007f359cea6dc4 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#16 0x00007f359cec87e3 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#17 0x00007f359cdde904 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#18 0x00007f359cf13d4b in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#19 0x000055ab2bf90dd2 in ff_cuda_load_module (avctx=avctx@entry=0x55ab42468d00, hwctx=<optimized out>, cu_module=cu_module@entry=0x55ab42dfa310, data=<optimized out>, length=<optimized out>) at libavfilter/cuda/load_helper.c:90
#20 0x000055ab2bc3d80e in cudascale_load_functions (ctx=0x55ab42468d00) at libavfilter/vf_scale_cuda.c:323
#21 cudascale_config_props (outlink=<optimized out>) at libavfilter/vf_scale_cuda.c:393
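The frames above are inside the driver's PTX JIT compiler, so one thing worth checking is whether the JIT result can be cached between runs. The following is a sketch under assumptions, not a confirmed diagnosis: it relies on the documented CUDA environment variable CUDA_CACHE_PATH and the documented Linux default of ~/.nv/ComputeCache, and the function name cuda_cache_dir is hypothetical. What the driver does when neither CUDA_CACHE_PATH nor HOME is set has not been verified here.

```shell
#!/bin/sh
# Hypothetical helper: report the directory that the CUDA_CACHE_PATH / HOME
# variables point to for the JIT compute cache on Linux.
# Prints "(none)" when neither variable is set in this environment.
cuda_cache_dir() {
    if [ -n "${CUDA_CACHE_PATH:-}" ]; then
        echo "$CUDA_CACHE_PATH"
    elif [ -n "${HOME:-}" ]; then
        echo "$HOME/.nv/ComputeCache"
    else
        echo "(none)"
    fi
}

cuda_cache_dir
```

Running this once from the interactive shell and once inside the systemd environment shows whether the two cases resolve to the same, writable cache location.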
If I watch htop, I see a single core (I have 40 cores) go to 100% CPU for maybe 5-10 seconds while the other cores are idle. My theory is that when scale_cuda is set up in parallel across N inputs under systemd, the parallel instances all do their JIT compile on the same core and fight over it.
With a single input stream, a scale_cuda call never takes 5-10 seconds, and if I run this command directly (not started from systemd), a scale_cuda call also doesn't take 5-10 seconds, even with 6 inputs.
I've tried changing systemd unit options for my process (CPUAffinity was one of them), but nothing seems to stop this behavior.
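To test the single-core theory directly, the allowed CPU set and cgroup of the running process can be read from /proc. This is a minimal sketch for Linux only, and the function name show_cpu_limits is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show which CPUs a process may run on and its cgroup.
# $1 = PID (defaults to the current shell). Linux-only: reads /proc.
show_cpu_limits() {
    pid="${1:-$$}"
    grep '^Cpus_allowed_list' "/proc/$pid/status"
    cat "/proc/$pid/cgroup" 2>/dev/null || echo "(cgroup info unavailable)"
}

show_cpu_limits
```

Passing the PID of the stalled ffmpeg (50322 in the status output above) and comparing against the same call for an ffmpeg started from a terminal would show whether systemd is actually restricting the CPU set.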
Any ideas here? I'm running out of things to try.