[FFmpeg-user] ffmpeg nvenc without cuda

Dennis Mungai dmngaie at gmail.com
Sat Sep 21 12:31:00 EEST 2019


On Sat, 21 Sep 2019 at 11:39, Johanna Nilson <jnils75 at gmail.com> wrote:
>
> Sorry, but I don't think the problem is the M10-1B profile. This article (
> https://support.citrix.com/article/CTX217781) says that NVENC requires a
> profile with 1 GB or more, and M10-1B includes 1 GB, so that should be
> fine.
>
> When I use command:
> ffmpeg -f gdigrab -i desktop -framerate 30 -tune zerolatency -r 30 -c:v
> hevc_nvenc -f mpegts udp://...
>
> FFMPEG log clearly says that:
> [hevc_nvenc @ 00000000005a6840] dl_fn->cuda_dl->cuInit(0) failed ->
> CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
>
> There are no CUDA cores available in the M10-1B profile, but they are not
> needed for encoding, so FFmpeg's requirement for them is redundant. I'm
> 100% sure that NVENC is available in the M10-1B configuration. I've built a
> solution similar to this:
> https://github.com/bloodelves88/CloudyNvCapture/blob/master/samples/NvFBC/NvFBCDX9NvEnc/NvFBCDX9NvEnc.cpp
> It works fine in the M10-1B configuration. It uses NVIDIA capture and
> NVENC, without CUDA, to produce an H.264 frame sequence. The problem in
> FFmpeg is that it requires CUDA even when it is not used. So, please, can
> we do anything about this unnecessary CUDA requirement when we try to use
> hardware encoding on NVIDIA cards?
>
> I'm not the only one with this issue. This person has a similar problem (
> https://superuser.com/questions/1482726/is-there-a-way-to-use-nvenc-for-ffmpeg-without-cuda).
> The FFmpeg log is not the same, but the question is the same. Changing the
> M10-1B profile is not an option. It is suitable for NVENC; the only problem
> is that FFmpeg requires CUDA when it is not needed.
>
> On Fri, 20 Sep 2019 at 18:26, Dennis Mungai <dmngaie at gmail.com> wrote:
>
> > On Fri, 20 Sep 2019 at 17:55, Johanna Nilson <jnils75 at gmail.com> wrote:
> > >
> > > nvidia-smi -L
> > > GPU 0: GRID M10-1B
> >
> > Seems like a known issue.
> > See https://support.citrix.com/article/CTX217781 and this post in
> > particular
> > https://gridforums.nvidia.com/default/topic/983/xendesktop/m60-nvenc-xd-vda-7-11-only-1gb-vgpu-profiles-and-above-/post/3478/#3478
> > Please switch to a different vGPU with at least 2 GB of VRAM, such as
> > M10-4A.
> > See this for available profiles:
> > https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html

Ahh, I get it now.
In the NVIDIA SDK, an NVENC session can be initialized via either a
shared CUDA context or DirectX.
Which reminds me: a while back, I ran into something similar on
VMware ESXi with NVIDIA's vGPU solution (GRID) on a Tesla T4.
Here is what I encountered and the workarounds I tried at the time:

I enabled TCC mode (forcing the driver model change) and rebooted:

nvidia-smi -g 0 -fdm 1

Then ran the command:

ffmpeg.exe -y -thread_queue_size 5120 -use_wallclock_as_timestamps 1
-fflags +genpts -loglevel debug -vsync 1 ^
-f gdigrab -draw_mouse 0 -framerate 60 -i desktop ^
-c:v h264_nvenc -profile:v high -rc:v cbr_ld_hq -r:v 60 -g:v 120 -b:v
8000k -minrate:v 8000k -maxrate:v 8000k -bufsize:v 8000k ^
-an -flush_packets 0 -bsf:v h264_mp4toannexb ^
-muxrate 16000k -pcr_period 20 -mpegts_flags +resend_headers
-mpegts_start_pid 0x15 -t 240 -f mpegts "lol.m2ts"

log:

C:\bin>ffmpeg.exe -y -thread_queue_size 5120
-use_wallclock_as_timestamps 1 -fflags +genpts -loglevel debug -vsync
1 ^
More? -f gdigrab -draw_mouse 0 -framerate 60 -i desktop ^
More? -c:v h264_nvenc -profile:v high -rc:v cbr_ld_hq -r:v 60 -g:v 120
-b:v 8000k -minrate:v 8000k -maxrate:v 8000k -bufsize:v 8000k ^
More? -an -flush_packets 0 -bsf:v h264_mp4toannexb ^
More? -muxrate 16000k -pcr_period 20 -mpegts_flags +resend_headers
-mpegts_start_pid 0x15 -t 240 -f mpegts "lol.m2ts"
ffmpeg version N-94000-g78e1d7f421-ffmpeg-windows-build-helpers
Copyright (c) 2000-2019 the FFmpeg developers
  built with gcc 8.2.0 (GCC)
  configuration: --pkg-config=pkg-config --pkg-config-flags=--static
--extra-version=ffmpeg-windows-build-helpers --enable-version3
--disable-debug --disable-w32threads --arch=x86_64 --target-os=mingw32
--cross-prefix=/home/brainiarc7/source.build/ffmpeg-windows-build-helpers/sandbox/cross_compilers/mingw-w64-x86_64/bin/x86_64-w64-mingw32-
--enable-libcaca --enable-gray --enable-libtesseract
--enable-fontconfig --enable-gmp --enable-gnutls --enable-libass
--enable-libbluray --enable-libbs2b --enable-libflite
--enable-libfreetype --enable-libfribidi --enable-libgme
--enable-libgsm --enable-libilbc --enable-libmodplug
--enable-libmp3lame --enable-libopencore-amrnb
--enable-libopencore-amrwb --enable-libopus --enable-libsnappy
--enable-libsoxr --enable-libspeex --enable-libtheora
--enable-libtwolame --enable-libvo-amrwbenc --enable-libvorbis
--enable-libvpx --enable-libwebp --enable-libzimg --enable-libzvbi
--enable-libmysofa --enable-libaom --enable-libopenjpeg
--enable-libopenh264 --enable-liblensfun --enable-libvmaf
--enable-libsrt --enable-demuxer=dash --enable-libxml2 --enable-nvenc
--enable-nvdec --extra-libs=-lharfbuzz --extra-libs=-lm
--extra-libs=-lpthread --extra-cflags=-DLIBTWOLAME_STATIC
--extra-cflags=-DMODPLUG_STATIC --extra-cflags=-DCACA_STATIC
--enable-amf --enable-libmfx --enable-gpl --enable-avisynth
--enable-frei0r --enable-filter=frei0r --enable-librubberband
--enable-libvidstab --enable-libx264 --enable-libx265 --enable-libxvid
--enable-libxavs --enable-avresample --extra-cflags='-mtune=generic'
--extra-cflags=-O3 --enable-static --disable-shared
--prefix=/home/brainiarc7/source.build/ffmpeg-windows-build-helpers/sandbox/cross_compilers/mingw-w64-x86_64/x86_64-w64-mingw32
  libavutil      56. 28.100 / 56. 28.100
  libavcodec     58. 52.102 / 58. 52.102
  libavformat    58. 27.103 / 58. 27.103
  libavdevice    58.  7.100 / 58.  7.100
  libavfilter     7. 55.100 /  7. 55.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  4.101 /  5.  4.101
  libswresample   3.  4.100 /  3.  4.100
  libpostproc    55.  4.100 / 55.  4.100
Splitting the commandline.
Reading option '-y' ... matched as option 'y' (overwrite output files)
with argument '1'.
Reading option '-thread_queue_size' ... matched as option
'thread_queue_size' (set the maximum number of queued packets from the
demuxer) with argument '5120'.
Reading option '-use_wallclock_as_timestamps' ... matched as AVOption
'use_wallclock_as_timestamps' with argument '1'.
Reading option '-fflags' ... matched as AVOption 'fflags' with
argument '+genpts'.
Reading option '-loglevel' ... matched as option 'loglevel' (set
logging level) with argument 'debug'.
Reading option '-vsync' ... matched as option 'vsync' (video sync
method) with argument '1'.
Reading option '-f' ... matched as option 'f' (force format) with
argument 'gdigrab'.
Reading option '-draw_mouse' ... matched as AVOption 'draw_mouse' with
argument '0'.
Reading option '-framerate' ... matched as AVOption 'framerate' with
argument '60'.
Reading option '-i' ... matched as input url with argument 'desktop'.
Reading option '-c:v' ... matched as option 'c' (codec name) with
argument 'h264_nvenc'.
Reading option '-profile:v' ... matched as option 'profile' (set
profile) with argument 'high'.
Reading option '-rc:v' ... matched as AVOption 'rc:v' with argument 'cbr_ld_hq'.
Reading option '-r:v' ... matched as option 'r' (set frame rate (Hz
value, fraction or abbreviation)) with argument '60'.
Reading option '-g:v' ... matched as AVOption 'g:v' with argument '120'.
Reading option '-b:v' ... matched as option 'b' (video bitrate (please
use -b:v)) with argument '8000k'.
Reading option '-minrate:v' ... matched as AVOption 'minrate:v' with
argument '8000k'.
Reading option '-maxrate:v' ... matched as AVOption 'maxrate:v' with
argument '8000k'.
Reading option '-bufsize:v' ... matched as AVOption 'bufsize:v' with
argument '8000k'.
Reading option '-an' ... matched as option 'an' (disable audio) with
argument '1'.
Reading option '-flush_packets' ... matched as AVOption
'flush_packets' with argument '0'.
Reading option '-bsf:v' ... matched as option 'bsf' (A comma-separated
list of bitstream filters) with argument 'h264_mp4toannexb'.
Reading option '-muxrate' ... matched as AVOption 'muxrate' with
argument '16000k'.
Reading option '-pcr_period' ... matched as AVOption 'pcr_period' with
argument '20'.
Reading option '-mpegts_flags' ... matched as AVOption 'mpegts_flags'
with argument '+resend_headers'.
Reading option '-mpegts_start_pid' ... matched as AVOption
'mpegts_start_pid' with argument '0x15'.
Reading option '-t' ... matched as option 't' (record or transcode
"duration" seconds of audio/video) with argument '240'.
Reading option '-f' ... matched as option 'f' (force format) with
argument 'mpegts'.
Reading option 'lol.m2ts' ... matched as output url.
Finished splitting the commandline.
Parsing a group of options: global .
Applying option y (overwrite output files) with argument 1.
Applying option loglevel (set logging level) with argument debug.
Applying option vsync (video sync method) with argument 1.
Successfully parsed a group of options.
Parsing a group of options: input url desktop.
Applying option thread_queue_size (set the maximum number of queued
packets from the demuxer) with argument 5120.
Applying option f (force format) with argument gdigrab.
Successfully parsed a group of options.
Opening an input file: desktop.
[gdigrab @ 000001a5e9277bc0] Capturing whole desktop as 1920x1080x32 at (0,0)
[gdigrab @ 000001a5e9277bc0] Probe buffer size limit of 5000000 bytes reached
[gdigrab @ 000001a5e9277bc0] Stream #0: not enough frames to estimate
rate; consider increasing probesize
Input #0, gdigrab, from 'desktop':
  Duration: N/A, start: 1560869500.651357, bitrate: 3981337 kb/s
    Stream #0:0, 1, 1/1000000: Video: bmp, 1 reference frame, bgra,
1920x1080, 0/1, 3981337 kb/s, 60 fps, 1000k tbr, 1000k tbn, 1000k tbc
Successfully opened the file.
Parsing a group of options: output url lol.m2ts.
Applying option c:v (codec name) with argument h264_nvenc.
Applying option profile:v (set profile) with argument high.
Applying option r:v (set frame rate (Hz value, fraction or
abbreviation)) with argument 60.
Applying option b:v (video bitrate (please use -b:v)) with argument 8000k.
Applying option an (disable audio) with argument 1.
Applying option bsf:v (A comma-separated list of bitstream filters)
with argument h264_mp4toannexb.
Applying option t (record or transcode "duration" seconds of
audio/video) with argument 240.
Applying option f (force format) with argument mpegts.
Successfully parsed a group of options.
Opening an output file: lol.m2ts.
[file @ 000001a5e927edc0] Setting default whitelist 'file,crypto'
Successfully opened the file.
Stream mapping:
  Stream #0:0 -> #0:0 (bmp (native) -> h264 (h264_nvenc))
Press [q] to stop, [?] for help
cur_dts is invalid st:0 (0) [init:0 i_done:0 finish:0] (this is
harmless if it occurs once at the start per stream)
detected 16 logical cores
[graph 0 input from stream 0:0 @ 000001a5e981f640] Setting
'video_size' to value '1920x1080'
[graph 0 input from stream 0:0 @ 000001a5e981f640] Setting 'pix_fmt'
to value '28'
[graph 0 input from stream 0:0 @ 000001a5e981f640] Setting 'time_base'
to value '1/1000000'
[graph 0 input from stream 0:0 @ 000001a5e981f640] Setting
'pixel_aspect' to value '0/1'
[graph 0 input from stream 0:0 @ 000001a5e981f640] Setting 'sws_param'
to value 'flags=2'
[graph 0 input from stream 0:0 @ 000001a5e981f640] Setting
'frame_rate' to value '60/1'
[graph 0 input from stream 0:0 @ 000001a5e981f640] w:1920 h:1080
pixfmt:bgra tb:1/1000000 fr:60/1 sar:0/1 sws_param:flags=2
[format @ 000001a5e981c3c0] Setting 'pix_fmts' to value
'yuv420p|nv12|p010le|yuv444p|p016le|yuv444p16le|bgr0|rgb0|cuda|d3d11'
[auto_scaler_0 @ 000001a5e981a940] Setting 'flags' to value 'bicubic'
[auto_scaler_0 @ 000001a5e981a940] w:iw h:ih flags:'bicubic' interl:0
[format @ 000001a5e981c3c0] auto-inserting filter 'auto_scaler_0'
between the filter 'Parsed_null_0' and the filter 'format'
[AVFilterGraph @ 000001a5e927dec0] query_formats: 5 queried, 3 merged,
1 already done, 0 delayed
[auto_scaler_0 @ 000001a5e981a940] picking rgb0 out of 8 ref:bgra alpha:1
[swscaler @ 000001a5eb0100c0] Forcing full internal H chroma due to
input having non subsampled chroma
[auto_scaler_0 @ 000001a5e981a940] w:1920 h:1080 fmt:bgra sar:0/1 ->
w:1920 h:1080 fmt:rgb0 sar:0/1 flags:0x4
[h264_nvenc @ 000001a5e927c1c0] Loaded lib: nvcuda.dll
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuInit
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDeviceGetCount
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDeviceGet
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDeviceGetAttribute
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDeviceGetName
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDeviceComputeCapability
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuCtxCreate_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuCtxSetLimit
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuCtxPushCurrent_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuCtxPopCurrent_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuCtxDestroy_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMemAlloc_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMemAllocPitch_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMemsetD8Async
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMemFree_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMemcpy2D_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMemcpy2DAsync_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGetErrorName
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGetErrorString
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuStreamCreate
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuStreamQuery
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuStreamSynchronize
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuStreamDestroy_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuStreamAddCallback
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuEventCreate
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuEventDestroy_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuEventSynchronize
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuEventQuery
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuEventRecord
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuLaunchKernel
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuModuleLoadData
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuModuleUnload
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuModuleGetFunction
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuTexObjectCreate
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuTexObjectDestroy
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGLGetDevices_v2
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGraphicsGLRegisterImage
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGraphicsUnregisterResource
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGraphicsMapResources
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGraphicsUnmapResources
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuGraphicsSubResourceGetMappedArray
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDeviceGetUuid
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuImportExternalMemory
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDestroyExternalMemory
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuExternalMemoryGetMappedBuffer
[h264_nvenc @ 000001a5e927c1c0] Loaded sym:
cuExternalMemoryGetMappedMipmappedArray
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMipmappedArrayGetLevel
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuMipmappedArrayDestroy
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuImportExternalSemaphore
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuDestroyExternalSemaphore
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuSignalExternalSemaphoresAsync
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: cuWaitExternalSemaphoresAsync
[h264_nvenc @ 000001a5e927c1c0] Loaded lib: nvEncodeAPI64.dll
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: NvEncodeAPICreateInstance
[h264_nvenc @ 000001a5e927c1c0] Loaded sym: NvEncodeAPIGetMaxSupportedVersion
[h264_nvenc @ 000001a5e927c1c0] Loaded Nvenc version 9.0
[h264_nvenc @ 000001a5e927c1c0] Nvenc initialized successfully
[h264_nvenc @ 000001a5e927c1c0] 1 CUDA capable devices found
[h264_nvenc @ 000001a5e927c1c0] [ GPU #0 - < GRID T4-2B > has Compute SM 7.5 ]
[h264_nvenc @ 000001a5e927c1c0]
dl_fn->cuda_dl->cuCtxCreate(&ctx->cu_context_internal, 0, cu_device)
failed -> CUDA_ERROR_UNKNOWN: unknown error
[h264_nvenc @ 000001a5e927c1c0] No NVENC capable devices found
[h264_nvenc @ 000001a5e927c1c0] Nvenc unloaded
Error initializing output stream 0:0 -- Error while opening encoder
for output stream #0:0 - maybe incorrect parameters such as bit_rate,
rate, width or height
[AVIOContext @ 000001a5e9819840] Statistics: 0 seeks, 0 writeouts
Conversion failed!
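
To pull the relevant failure out of a long debug log like the one above, a
small script can help. This is only a minimal sketch, not part of FFmpeg: the
names CUDA_FAILURE, join_wrapped_records and find_cuda_failures are made up
for illustration, and it assumes the "dl_fn->cuda_dl->... failed -> ..."
wording seen in the two logs in this thread. It also assumes that every log
record starts with "[", so lines wrapped by the mail client can be glued back
onto the record they belong to.

```python
import re

# Pattern for the failure lines quoted in this thread, e.g.
#   dl_fn->cuda_dl->cuInit(0) failed -> CUDA_ERROR_NO_DEVICE: no CUDA-capable ...
CUDA_FAILURE = re.compile(
    r"dl_fn->cuda_dl->(?P<call>cu\w+)\(.*?\)\s+failed\s+->\s+"
    r"(?P<code>CUDA_ERROR_\w+):\s*(?P<message>.*)"
)


def join_wrapped_records(log_text):
    """Re-join wrapped log lines (assumption: a record starts with '[')."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") or not records:
            records.append(line)          # start of a new log record
        else:
            records[-1] += " " + line     # continuation of the previous one
    return records


def find_cuda_failures(log_text):
    """Return (driver call, error code, message) for each failed CUDA call."""
    failures = []
    for record in join_wrapped_records(log_text):
        match = CUDA_FAILURE.search(record)
        if match:
            failures.append(
                (match["call"], match["code"], match["message"].strip())
            )
    return failures
```

Run against the two logs in this thread, this should report cuInit for the
M10-1B case and cuCtxCreate for the T4-2B case, which makes it easier to see
that the two failures happen at different points of CUDA initialization.
Unbracketed lines that follow a failure record are appended to its message, so
treat the message text as approximate.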

I then disabled TCC mode (switching the driver model back to WDDM) and rebooted:

nvidia-smi -g 0 -dm 0

Then ran the command:

ffmpeg.exe -y -thread_queue_size 5120 -use_wallclock_as_timestamps 1
-fflags +genpts -loglevel debug -vsync 1 ^
-f gdigrab -draw_mouse 0 -framerate 60 -i desktop ^
-c:v h264_nvenc -profile:v high -rc:v cbr_ld_hq -r:v 60 -g:v 120 -b:v
8000k -minrate:v 8000k -maxrate:v 8000k -bufsize:v 8000k ^
-an -flush_packets 0 -bsf:v h264_mp4toannexb ^
-muxrate 16000k -pcr_period 20 -mpegts_flags +resend_headers
-mpegts_start_pid 0x15 -t 240 -f mpegts "lol.m2ts"

I ran into the same issue.

I worked around the limitation by switching to a larger vGPU profile, so
I basically gave up on the case and never looked into it again.
With that in mind, could you try running the same command with the
dxva2 hwaccel instead? Perhaps (and I could be wrong) selecting dxva2
as the hwaccel will switch the NVENC device type from CUDA to DirectX.
That interop seems to be implemented; see
https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/nvenc.c#L52

Here is an example of such a command, adapted to your parameters:

ffmpeg.exe -y -thread_queue_size 5120 ^
-fflags +genpts -loglevel debug -vsync 1 -hwaccel dxva2 -hwaccel_device 0 ^
-f gdigrab -draw_mouse 0 -framerate 60 -i desktop ^
-c:v h264_nvenc -profile:v high -preset:v llhq -rc:v cbr_ld_hq -r:v 60
-g:v 120 -b:v 8000k -minrate:v 8000k -maxrate:v 8000k -bufsize:v 8000k
-gpu:v 0 ^
-an -flush_packets 0 -bsf:v h264_mp4toannexb ^
-muxrate 16000k -pcr_period 20 -mpegts_flags +resend_headers -f mpegts
"udp://..."

Let me know how that goes.
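
If it helps with the comparison, here is a minimal sketch that builds the
same argument list with and without the dxva2 hwaccel options, so both
variants can be run back to back and their logs compared. The function name
build_capture_command and its parameters are hypothetical; every ffmpeg
option is copied from the command above. Whether dxva2 actually changes the
NVENC device type is untested, so treat this as a test harness rather than a
fix.

```python
def build_capture_command(output_url, hwaccel=None):
    """Return the ffmpeg argument list for the desktop-capture test.

    output_url: destination passed to the mpegts muxer (file or udp:// URL).
    hwaccel:    None for the plain variant, or e.g. "dxva2" to add the
                -hwaccel/-hwaccel_device options before the input.
    """
    cmd = ["ffmpeg.exe", "-y", "-thread_queue_size", "5120",
           "-fflags", "+genpts", "-loglevel", "debug", "-vsync", "1"]
    if hwaccel is not None:
        # Input-side options, placed before -i as in the command above.
        cmd += ["-hwaccel", hwaccel, "-hwaccel_device", "0"]
    cmd += ["-f", "gdigrab", "-draw_mouse", "0", "-framerate", "60",
            "-i", "desktop",
            "-c:v", "h264_nvenc", "-profile:v", "high", "-preset:v", "llhq",
            "-rc:v", "cbr_ld_hq", "-r:v", "60", "-g:v", "120",
            "-b:v", "8000k", "-minrate:v", "8000k", "-maxrate:v", "8000k",
            "-bufsize:v", "8000k", "-gpu:v", "0",
            "-an", "-flush_packets", "0", "-bsf:v", "h264_mp4toannexb",
            "-muxrate", "16000k", "-pcr_period", "20",
            "-mpegts_flags", "+resend_headers",
            "-f", "mpegts", output_url]
    return cmd
```

The returned list can be passed to subprocess.run(cmd) as-is, once with
hwaccel=None and once with hwaccel="dxva2", using your own output URL.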

