[FFmpeg-user] nvenc burn subtitles while transcoding

Tue Oct 24 22:10:04 EEST 2017

>
> Shouldn't I be using hwupload_cuda, to upload frames to the CUDA engine,
> then apply overlay filter and after that download it back? At least I
> understood it that way. Or you are suggesting to download it from CUDA
> run on CPU and after that upload it back.
>

No, the overlay filter is software-only, so it runs on the CPU in main memory.
The hw-decoded frames have to be downloaded from GPU memory to main memory,
then the CPU applies the overlay filter there, and finally the frames are
uploaded from main memory back to GPU memory for hw-encoding.
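
To make the order of those three steps explicit, here is a minimal sketch of a shell helper that assembles the filter graph. The function name build_overlay_graph and the labels [base], [ovl] and [vout] are made up for illustration; the stream specifiers are the ones from your command.

```shell
# Hypothetical helper: print a -filter_complex graph that
#   1) downloads the decoded frames from GPU memory (hwdownload,format=nv12),
#   2) overlays the subtitle stream on the CPU (overlay),
#   3) uploads the result back to GPU memory for nvenc (hwupload_cuda).
# $1 = video stream specifier, $2 = subtitle stream specifier.
build_overlay_graph() {
    printf '%s;%s;%s\n' \
        "[$1]hwdownload,format=nv12[base]" \
        "[base][$2]overlay[ovl]" \
        "[ovl]hwupload_cuda[vout]"
}

# Example: print the graph for the two streams used in this thread.
build_overlay_graph 'i:0x4b1' 'i:0x4ba'
```

The printed string would be passed to -filter_complex, with -map "[vout]" selecting the uploaded video for h264_nvenc.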

> I am trying something like this, but I don't exactly know where to
> upload it:
> ffmpeg -hwaccel cuvid -c:v h264_cuvid  -i udp://239.255.6.2:1234
> -filter_complex "[i:0x4b1]scale_npp=w=1920:h=1080,hwdownload,format=nv12
> [base]; [base][i:0x4ba]overlay[v]" -map [v] -map i:0x4b2 -c:a libfdk_aac
> -profile:a aac_he -ac 2 -b:a 64k -ar 48000 -c:v h264_nvenc -preset llhq
> -rc vbr_hq -qmin:v 19 -qmax:v 21 -b:v 2500k -maxrate:v 5000k -profile:v
> high -f flv rtmp://127.0.0.1/live/test
>
> It works, but I don't see any acceleration probably because it is not in
> CUDA anymore and the manual is not really helpful here. Anybody with
> more experience?
>

I'm not sure what you mean by "I don't see any acceleration". The syntax
looks okay for hw-decoding/encoding, so that should be happening on your
GPU. You can monitor your GPU using "nvidia-smi dmon" during transcoding.
It will show how heavily the GPU's decoder and encoder are being used (to
verify that your GPU is "doing work").
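
A minimal sketch of that check, assuming nvidia-smi is on your PATH. The wrapper name watch_gpu_video_engines is made up; -s u limits the output to the utilization columns (sm, mem, enc, dec), -d is the sampling interval in seconds and -c the number of samples.

```shell
# Hypothetical wrapper around "nvidia-smi dmon".
# Prints GPU utilization once per second; the enc and dec columns
# show the load on the video encoder and decoder.
# $1 = number of samples to take (defaults to 10).
watch_gpu_video_engines() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found" >&2
        return 1
    fi
    nvidia-smi dmon -s u -d 1 -c "${1:-10}"
}

# Usage, in a second terminal while ffmpeg is running:
#   watch_gpu_video_engines 30
```

If enc and dec stay at 0 while ffmpeg is running, the hardware codecs are not being used.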

Do you mean that transcoding is slow? That is to be expected. A similar
overlay runs at ~4X on my Phenom X2/GTX1050Ti, whereas pure
hw-transcoding (no overlay) can do ~23X (with resizing to 1080p). At 480p
without resizing, the figures are ~10X and ~110X respectively. As above,
the overlay filter is a software-based filter. I don't know its exact
internals, but based on performance I'm guessing it's single-threaded, which
could cause a major slow-down: the encoder has to wait for each frame to
come out of the overlay filter running on your "slow" single CPU core.
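
If you want to compare speeds yourself, ffmpeg prints a speed= value in its progress line on stderr. Here is a small sketch for pulling out the last one. The function name last_speed is made up, and it assumes the usual "speed=4.02x" style of output. Note that a live UDP input will sit near 1x regardless, so a comparison like mine only makes sense with a file as input.

```shell
# Hypothetical helper: read ffmpeg's stderr on stdin and print the last
# reported speed value, e.g. "4.02x".
# ffmpeg separates progress updates with carriage returns, so turn them
# into newlines before searching.
last_speed() {
    tr '\r' '\n' | grep -o 'speed= *[0-9.]*x' | tail -n 1 | sed 's/speed= *//'
}

# Usage (input file name is a placeholder):
#   ffmpeg -i input.ts ... -f null - 2>&1 | last_speed
```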

Also just a side-note, it looks like you're scaling your input video before
overlaying using scale_npp. Not sure if you're aware, but the h264_cuvid
decoder has resizing and cropping built in. So you don't need to use
scale_npp to do the resizing. You could do something like:

ffmpeg -hwaccel cuvid -c:v h264_cuvid -resize 1920x1080
-i udp://239.255.6.2:1234
-filter_complex "[i:0x4b1]hwdownload,format=nv12[base];
[base][i:0x4ba]overlay[v]; [v]hwupload_cuda[v]" -map "[v]"  .......

Interestingly, if you wanted to overlay first and then resize, you could use
scale_npp for GPU/CUDA resizing after the overlay. The hw-decoder can't be
used to resize at that point. That would be something like:

ffmpeg -hwaccel cuvid -c:v h264_cuvid  -i udp://239.255.6.2:1234
-filter_complex "[i:0x4b1]hwdownload,format=nv12[base];
[base][i:0x4ba]overlay[v];
[v]hwupload_cuda,scale_npp=w=1920:h=1080:format=nv12[v]"
-map "[v]" ......

Or maybe you could resize the subtitles before overlay (you could try
something similar with scale_npp instead of scale):

ffmpeg -hwaccel cuvid -c:v h264_cuvid -resize 1920x1080
-i udp://239.255.6.2:1234
-filter_complex "[i:0x4b1]hwdownload,format=nv12[base];
[i:0x4ba]scale=1920:1080[subtitle]; [base][subtitle]overlay[v];
[v]hwupload_cuda[v]" -map "[v]" .....

If you're still reading: scale_npp uses CUDA whereas h264_cuvid resizing
uses the GPU Video Engine. You (or anyone really) might take that into
consideration when planning to do resizing as those options would put
different kinds of loads on your GPU.

Hope This Helps (or at least points you in the right direction)!
-J