[FFmpeg-devel] Development of a CUDA accelerated variant of the libav vf_tonemap

Tue Jan 12 00:27:59 EET 2021

Hi guys and gals, first post on this mailing list, so apologies for any
formatting/stylistic snafus.

TL;DR: we currently have tone mapping filters (typically used to map
content from a 10-bit HDR source to an 8-bit SDR output) that run on
the CPU with the zscale (zimg) and tonemap filters, or in hardware
using VAAPI or OpenCL. Having a version implemented in CUDA would round
out the main hwaccel types.

Context:
	I'm a computer engineering student up in Canada with an interest in
high efficiency distributed processing. As a personal project I'm
trying to build a cluster of Nvidia Jetson Nanos that can handle a few
dozen streams (a mix of SD, HD, FHD, UHD and 4K HDR) at once while
drawing south of 100W at peak. These little devices can each handle
anywhere from 1 to 9 streams at a time in hardware, depending on
resolution/framerate, in any mix of HEVC and H.264, so 3 of them should
get me most of the way to where I want to go (this would be a 30W
package capable of ~12 streams of 2160p30 10-bit -> 1080p30 8-bit).

The issue is that 4 little arm64 cores are just not going to be able
to tonemap with zscale in real time, even with the encoder and decoders
sharing memory with the CPU (so no PCIe memcopy penalty). On the other
hand, the built-in GPU should make quick work of this, given the
relative simplicity of most tone mapping algorithms (Hable, say).
Unfortunately (or fortunately for me to learn with?) there isn't a CUDA
version of the filter.
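
For reference, the per-sample math I have in mind for Hable is small.
Below is a plain C sketch of it (not CUDA, so it compiles anywhere).
The curve constants are the commonly published Uncharted 2 values;
tonemap_hable() and to_u8() are names I made up for illustration. It
assumes linear-light input and ignores transfer functions,
desaturation and limited-range output.

```c
#include <assert.h>

/* Hable ("Uncharted 2") filmic curve, commonly published constants. */
static float hable(float in)
{
    const float a = 0.15f, b = 0.50f, c = 0.10f;
    const float d = 0.20f, e = 0.02f, f = 0.30f;
    return (in * (in * a + b * c) + d * e) /
           (in * (in * a + b) + d * f) - e / f;
}

/* Hypothetical helper: map one linear-light sample so that `peak`
 * lands exactly on 1.0. */
static float tonemap_hable(float sig, float peak)
{
    assert(peak > 0.0f);
    return hable(sig) / hable(peak);
}

/* Hypothetical helper: clamp to [0,1] and quantise to full-range 8 bit. */
static unsigned char to_u8(float v)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (unsigned char)(v * 255.0f + 0.5f);
}
```

On the GPU, each thread would presumably run tonemap_hable() on one
sample, with peak supplied by the caller.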

Question/guidance:
I've read through the doc on how to write filters and looked at the
other CUDA filters currently in the source, so I have a general idea of
where I'm going. What I haven't been able to fully nail down is how
vf_tonemap_cuda.c should access the frames it receives from
hwupload_cuda and then hand them to the kernel in vf_tonemap_cuda.cu
for processing. I have a repo with everything I've been pulling
together for my project, but the piece of interest is under
*/cuda_filter/ in the source tree.
<https://github.com/Camofelix/Jetson_ffmpeg_trancode_cluster/>
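
To make the question concrete, here is my current mental model of the
hand-off, which may well be wrong. As far as I understand it, a CUDA
frame carries device pointers in data[] and a row pitch in bytes in
linesize[], and the .c side passes those plus width/height to the
kernel. The plain C mock below uses a hypothetical PlaneDesc struct
and a CPU loop standing in for the grid launch; the 10-bit to 8-bit
shift is only a placeholder for the real tone mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for what the .c side hands to the kernel.
 * In a real CUDA frame `data` would be a device pointer and `pitch`
 * the matching linesize[] entry. */
typedef struct PlaneDesc {
    uint8_t *data;
    int pitch;   /* bytes per row, may exceed the visible width */
    int width;   /* in samples */
    int height;  /* in rows */
} PlaneDesc;

/* Body of one "thread": handles the sample at (x, y). */
static void process_sample(const PlaneDesc *src, const PlaneDesc *dst,
                           int x, int y)
{
    /* The grid is rounded up, so some threads fall outside the frame. */
    if (x >= src->width || y >= src->height)
        return;
    /* Rows are addressed via the pitch, never via width. */
    const uint16_t *in_row =
        (const uint16_t *)(src->data + (size_t)y * src->pitch);
    uint8_t *out_row = dst->data + (size_t)y * dst->pitch;
    /* Placeholder: 10-bit sample in a 16-bit container -> 8 bit. */
    out_row[x] = (uint8_t)(in_row[x] >> 2);
}

/* CPU loop standing in for a launch with block x block threads. */
static void launch_mock(const PlaneDesc *src, const PlaneDesc *dst,
                        int block)
{
    assert(block > 0);
    assert(src->width == dst->width && src->height == dst->height);
    int grid_w = (src->width + block - 1) / block * block;
    int grid_h = (src->height + block - 1) / block * block;
    for (int y = 0; y < grid_h; y++)
        for (int x = 0; x < grid_w; x++)
            process_sample(src, dst, x, y);
}
```

If that model is right, the part I'm missing is the real FFmpeg-side
plumbing that fills in the equivalent of PlaneDesc.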

Would anyone mind helping me out with how to architect this?

Thanks!

FelixCLC
(Aliases: FCLC, camofelix)