[FFmpeg-devel] [PATCH] avfilter: add CUDA stack filters (hstack_cuda, vstack_cuda, xstack_cuda)
faeez kadiri
f1k2faeez at gmail.com
Mon May 26 14:35:54 EEST 2025
Hi all,
Friendly ping on the patch below (sent 23 May, link in Patchwork:
https://patchwork.ffmpeg.org/project/ffmpeg/patch/20250523215814.365246-1-f1k2faeez@gmail.com/
).
Patch summary
-------------
* Adds a CUDA implementation of the existing stack_* filter family
(parallels stack_qsv / stack_vaapi).
* Supports up to 16 inputs and both horizontal/vertical layouts.
If anything needs adjustment (coding-style, fate naming, etc.) please let
me know and I’ll resend an updated v2.
Many thanks for your time!
Best regards,
Faeez Kadiri
On Sat, May 24, 2025 at 3:28 AM Faeez Kadiri <f1k2faeez at gmail.com> wrote:
> Add hardware-accelerated stack filters for CUDA that provide equivalent
> functionality to the software stack filters but with GPU acceleration.
>
> Features:
> - Support for hstack, vstack, and xstack operations
> - Compatible pixel formats such as:
> yuv420p, nv12, yuv444p, p010le, p016le, yuv444p16le, rgb0, bgr0, rgba,
> bgra
> - Fill color support with automatic RGB to YUV conversion for YUV formats
> - Proper chroma subsampling handling for all supported formats
> - Integration with existing stack filter infrastructure via
> stack_internal.h
>
> The implementation follows the established CUDA filter pattern from
> vf_scale_cuda.c, using PTX modules for kernel execution and proper
> CUDA context management. Copy operations handle frame placement while
> color operations fill background areas when using fill colors.
>
> This enables efficient video composition workflows entirely on GPU
> without CPU-GPU memory transfers, significantly improving performance
> for multi-input video processing pipelines.
>
> Examples:
> $ ffmpeg -hwaccel cuda -i input.h265 -filter_complex
> "[0:v][0:v]hstack_cuda" -c:v hevc_nvenc out.h265
>
> $ ffmpeg \
> -hwaccel cuda -i input1.mp4 \
> -hwaccel cuda -i input2.mp4 \
> -hwaccel cuda -i input3.mp4 \
> -hwaccel cuda -i input4.mp4 \
> -filter_complex
> "[0:v]hwupload_cuda[0v];[1:v]hwupload_cuda[1v];[2:v]hwupload_cuda[2v];[3:v]hwupload_cuda[3v];[0v][1v][2v][3v]xstack_cuda=inputs=4:fill=black:layout=0_0|w0_0|0_h0|w0_h0"
> \
> -c:v hevc_nvenc out.mp4
>
> Signed-off-by: Faeez Kadiri <f1k2faeez at gmail.com>
> ---
> Changelog | 1 +
> configure | 6 +
> doc/filters.texi | 78 +++++
> libavfilter/Makefile | 3 +
> libavfilter/allfilters.c | 3 +
> libavfilter/vf_stack_cuda.c | 589 +++++++++++++++++++++++++++++++++++
> libavfilter/vf_stack_cuda.cu | 389 +++++++++++++++++++++++
> 7 files changed, 1069 insertions(+)
> create mode 100644 libavfilter/vf_stack_cuda.c
> create mode 100644 libavfilter/vf_stack_cuda.cu
>
> diff --git a/Changelog b/Changelog
> index 4217449438..0dec3443d4 100644
> --- a/Changelog
> +++ b/Changelog
> @@ -18,6 +18,7 @@ version <next>:
> - APV encoding support through a libopenapv wrapper
> - VVC decoder supports all content of SCC (Screen Content Coding):
> IBC (Inter Block Copy), Palette Mode and ACT (Adaptive Color Transform
> +- hstack_cuda, vstack_cuda and xstack_cuda filters
>
>
> version 7.1:
> diff --git a/configure b/configure
> index 3730b0524c..5c2d6e132d 100755
> --- a/configure
> +++ b/configure
> @@ -4033,6 +4033,12 @@ xfade_vulkan_filter_deps="vulkan spirv_compiler"
> yadif_cuda_filter_deps="ffnvcodec"
> yadif_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
> yadif_videotoolbox_filter_deps="metal corevideo videotoolbox"
> +hstack_cuda_filter_deps="ffnvcodec"
> +hstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
> +vstack_cuda_filter_deps="ffnvcodec"
> +vstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
> +xstack_cuda_filter_deps="ffnvcodec"
> +xstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
> hstack_vaapi_filter_deps="vaapi_1"
> vstack_vaapi_filter_deps="vaapi_1"
> xstack_vaapi_filter_deps="vaapi_1"
> diff --git a/doc/filters.texi b/doc/filters.texi
> index 6d2df07508..1c9afac9eb 100644
> --- a/doc/filters.texi
> +++ b/doc/filters.texi
> @@ -26850,6 +26850,84 @@ Only deinterlace frames marked as interlaced.
> The default value is @code{all}.
> @end table
>
> + at section hstack_cuda
> +Stack input videos horizontally.
> +
> +This is the CUDA variant of the @ref{vstack} filter, each input stream may
> +have different width, this filter will scale down/up each input stream
> while
> +keeping the orignal aspect.
> +
> +It accepts the following options:
> +
> + at table @option
> + at item inputs
> +See @ref{hstack}.
> +
> + at item shortest
> +See @ref{hstack}.
> +
> + at item height
> +Set height of output. If set to 0, this filter will set height of output
> to
> +height of the first input stream. Default value is 0.
> + at end table
> +
> + at section vstack_cuda
> +Stack input videos vertically.
> +
> +This is the CUDA variant of the @ref{vstack} filter, each input stream may
> +have different width, this filter will scale down/up each input stream
> while
> +keeping the orignal aspect.
> +
> +It accepts the following options:
> +
> + at table @option
> + at item inputs
> +See @ref{vstack}.
> +
> + at item shortest
> +See @ref{vstack}.
> +
> + at item width
> +Set width of output. If set to 0, this filter will set width of output to
> +width of the first input stream. Default value is 0.
> + at end table
> +
> + at section xstack_cuda
> +Stack video inputs into custom layout.
> +
> +This is the CUDA variant of the @ref{xstack} filter, each input stream
> may
> +have different size, this filter will scale down/up each input stream to
> the
> +given output size, or the size of the first input stream.
> +
> +It accepts the following options:
> +
> + at table @option
> + at item inputs
> +See @ref{xstack}.
> +
> + at item shortest
> +See @ref{xstack}.
> +
> + at item layout
> +See @ref{xstack}.
> +Moreover, this permits the user to supply output size for each input
> stream.
> + at example
>
> +xstack_cuda=inputs=4:layout=0_0_1920x1080|0_h0_1920x1080|w0_0_1920x1080|w0_h0_1920x1080
> + at end example
> +
> + at item grid
> +See @ref{xstack}.
> +
> + at item grid_tile_size
> +Set output size for each input stream when @option{grid} is set. If this
> option
> +is not set, this filter will set output size by default to the size of the
> +first input stream. For the syntax of this option, check the
> + at ref{video size syntax,,"Video size" section in the ffmpeg-utils
> manual,ffmpeg-utils}.
> +
> + at item fill
> +See @ref{xstack}.
> + at end table
> +
> @anchor{CUDA NPP}
> @section CUDA NPP
> Below is a description of the currently available NVIDIA Performance
> Primitives (libnpp) video filters.
> diff --git a/libavfilter/Makefile b/libavfilter/Makefile
> index 0effe4127f..ad876ccd53 100644
> --- a/libavfilter/Makefile
> +++ b/libavfilter/Makefile
> @@ -583,6 +583,9 @@ OBJS-$(CONFIG_YAEPBLUR_FILTER) +=
> vf_yaepblur.o
> OBJS-$(CONFIG_ZMQ_FILTER) += f_zmq.o
> OBJS-$(CONFIG_ZOOMPAN_FILTER) += vf_zoompan.o
> OBJS-$(CONFIG_ZSCALE_FILTER) += vf_zscale.o
> +OBJS-$(CONFIG_HSTACK_CUDA_FILTER) += vf_stack_cuda.o
> framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o
> +OBJS-$(CONFIG_VSTACK_CUDA_FILTER) += vf_stack_cuda.o
> framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o
> +OBJS-$(CONFIG_XSTACK_CUDA_FILTER) += vf_stack_cuda.o
> framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o
> OBJS-$(CONFIG_HSTACK_VAAPI_FILTER) += vf_stack_vaapi.o
> framesync.o vaapi_vpp.o
> OBJS-$(CONFIG_VSTACK_VAAPI_FILTER) += vf_stack_vaapi.o
> framesync.o vaapi_vpp.o
> OBJS-$(CONFIG_XSTACK_VAAPI_FILTER) += vf_stack_vaapi.o
> framesync.o vaapi_vpp.o
> diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
> index 5ea33cdf01..89a7fb9277 100644
> --- a/libavfilter/allfilters.c
> +++ b/libavfilter/allfilters.c
> @@ -545,6 +545,9 @@ extern const FFFilter ff_vf_yaepblur;
> extern const FFFilter ff_vf_zmq;
> extern const FFFilter ff_vf_zoompan;
> extern const FFFilter ff_vf_zscale;
> +extern const FFFilter ff_vf_hstack_cuda;
> +extern const FFFilter ff_vf_vstack_cuda;
> +extern const FFFilter ff_vf_xstack_cuda;
> extern const FFFilter ff_vf_hstack_vaapi;
> extern const FFFilter ff_vf_vstack_vaapi;
> extern const FFFilter ff_vf_xstack_vaapi;
> diff --git a/libavfilter/vf_stack_cuda.c b/libavfilter/vf_stack_cuda.c
> new file mode 100644
> index 0000000000..002602b2bf
> --- /dev/null
> +++ b/libavfilter/vf_stack_cuda.c
> @@ -0,0 +1,589 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the
> "Software"),
> + * to deal in the Software without restriction, including without
> limitation
> + * the rights to use, copy, modify, merge, publish, distribute,
> sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> + * DEALINGS IN THE SOFTWARE.
> + */
> +
> +/**
> + * @file
> + * Hardware accelerated hstack, vstack and xstack filters based on CUDA
> + */
> +
> +#include "config_components.h"
> +
> +#include "libavutil/opt.h"
> +#include "libavutil/common.h"
> +#include "libavutil/pixdesc.h"
> +#include "libavutil/eval.h"
> +#include "libavutil/hwcontext.h"
> +#include "libavutil/hwcontext_cuda_internal.h"
> +#include "libavutil/cuda_check.h"
> +#include "libavutil/avstring.h"
> +#include "libavutil/avassert.h"
> +#include "libavutil/imgutils.h"
> +#include "libavutil/mathematics.h"
> +#include "libavutil/parseutils.h"
> +#include "libavutil/colorspace.h"
> +#include "libavutil/mem.h"
> +
> +#include "filters.h"
> +#include "formats.h"
> +#include "video.h"
> +
> +#include "framesync.h"
> +#include "cuda/load_helper.h"
> +
> +static const enum AVPixelFormat supported_formats[] = {
> + AV_PIX_FMT_YUV420P,
> + AV_PIX_FMT_NV12,
> + AV_PIX_FMT_YUV444P,
> + AV_PIX_FMT_P010,
> + AV_PIX_FMT_P016,
> + AV_PIX_FMT_YUV444P16,
> + AV_PIX_FMT_0RGB32,
> + AV_PIX_FMT_0BGR32,
> + AV_PIX_FMT_RGB32,
> + AV_PIX_FMT_BGR32,
> +};
> +
> +#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
> +#define BLOCKX 32
> +#define BLOCKY 16
> +
> +#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, s->hwctx->internal->cuda_dl, x)
> +
> +typedef struct CUDAStackContext {
> + AVCUDADeviceContext *hwctx;
> + CudaFunctions *cuda_dl;
> +
> + CUcontext cu_ctx;
> + CUmodule cu_module;
> + CUstream cu_stream;
> +
> + // For copy operations
> + CUfunction cu_func_copy;
> + CUfunction cu_func_copy_uv;
> +
> + // For color operations
> + CUfunction cu_func_color;
> + CUfunction cu_func_color_uv;
> +
> + enum AVPixelFormat in_fmt;
> + const AVPixFmtDescriptor *in_desc;
> + int in_planes;
> + int in_plane_depths[4];
> + int in_plane_channels[4];
> +
> + uint8_t fillcolor_rgba[4];
> + uint8_t fillcolor_yuv[4];
> +} CUDAStackContext;
> +
> +#define HSTACK_NAME "hstack_cuda"
> +#define VSTACK_NAME "vstack_cuda"
> +#define XSTACK_NAME "xstack_cuda"
> +#define HWContext CUDAStackContext
> +#define StackHWContext StackCudaContext
> +#include "stack_internal.h"
> +
> +typedef struct StackCudaContext {
> + StackBaseContext base;
> + CUDAStackContext cuda;
> +} StackCudaContext;
> +
> +static int format_is_supported(enum AVPixelFormat fmt)
> +{
> + int i;
> +
> + for (i = 0; i < FF_ARRAY_ELEMS(supported_formats); i++)
> + if (supported_formats[i] == fmt)
> + return 1;
> + return 0;
> +}
> +
> +static void rgb2yuv(float r, float g, float b, int *y, int *u, int *v,
> int depth)
> +{
> + *y = ((0.21260*219.0/255.0) * r + (0.71520*219.0/255.0) * g +
> + (0.07220*219.0/255.0) * b) * ((1 << depth) - 1);
> + *u = (-(0.11457*224.0/255.0) * r - (0.38543*224.0/255.0) * g +
> + (0.50000*224.0/255.0) * b + 0.5) * ((1 << depth) - 1);
> + *v = ((0.50000*224.0/255.0) * r - (0.45415*224.0/255.0) * g -
> + (0.04585*224.0/255.0) * b + 0.5) * ((1 << depth) - 1);
> +}
> +
> +static av_cold int cuda_stack_load_functions(AVFilterContext *ctx, enum
> AVPixelFormat format)
> +{
> + StackCudaContext *sctx = ctx->priv;
> + CUDAStackContext *s = &sctx->cuda;
> + CUcontext dummy, cuda_ctx = s->hwctx->cuda_ctx;
> + CudaFunctions *cu = s->cuda_dl;
> + int ret;
> + char buf[128];
> +
> + const char *fmt_name = av_get_pix_fmt_name(format);
> +
> + extern const unsigned char ff_vf_stack_cuda_ptx_data[];
> + extern const unsigned int ff_vf_stack_cuda_ptx_len;
> +
> + ret = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
> + if (ret < 0)
> + return ret;
> +
> + ret = ff_cuda_load_module(ctx, s->hwctx, &s->cu_module,
> + ff_vf_stack_cuda_ptx_data,
> ff_vf_stack_cuda_ptx_len);
> + if (ret < 0)
> + goto fail;
> +
> + // Load copy functions
> + snprintf(buf, sizeof(buf), "StackCopy_%s_%s", fmt_name, fmt_name);
> + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy,
> s->cu_module, buf));
> + if (ret < 0) {
> + av_log(ctx, AV_LOG_FATAL, "Unsupported format for copy: %s\n",
> fmt_name);
> + ret = AVERROR(ENOSYS);
> + goto fail;
> + }
> +
> + snprintf(buf, sizeof(buf), "StackCopy_%s_%s_uv", fmt_name, fmt_name);
> + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy_uv,
> s->cu_module, buf));
> + if (ret < 0)
> + goto fail;
> +
> + // Load color functions
> + snprintf(buf, sizeof(buf), "SetColor_%s", fmt_name);
> + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color,
> s->cu_module, buf));
> + if (ret < 0) {
> + av_log(ctx, AV_LOG_FATAL, "Unsupported format for color: %s\n",
> fmt_name);
> + ret = AVERROR(ENOSYS);
> + goto fail;
> + }
> +
> + snprintf(buf, sizeof(buf), "SetColor_%s_uv", fmt_name);
> + ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color_uv,
> s->cu_module, buf));
> + if (ret < 0)
> + goto fail;
> +
> +fail:
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + return ret;
> +}
> +
> +static av_cold int cuda_stack_color_kernel(AVFilterContext *ctx,
> CUfunction func,
> + AVFrame *out_frame, const uint8_t *color,
> + int width, int height,
> + int dst_x, int dst_y,
> + int dst_width, int dst_height, int dst_pitch)
> +{
> + StackCudaContext *sctx = ctx->priv;
> + CUDAStackContext *s = &sctx->cuda;
> + CudaFunctions *cu = s->cuda_dl;
> +
> + CUdeviceptr dst_devptr[4] = {
> + (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1],
> + (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3]
> + };
> +
> + void *args[] = {
> + &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3],
> + &width, &height, &dst_pitch,
> + &dst_x, &dst_y,
> + (void *)&color[0], (void *)&color[1], (void *)&color[2], (void
> *)&color[3],
> + &dst_width, &dst_height,
> + };
> +
> + return CHECK_CU(cu->cuLaunchKernel(func,
> + DIV_UP(width, BLOCKX),
> DIV_UP(height, BLOCKY), 1,
> + BLOCKX, BLOCKY, 1,
> + 0, s->cu_stream, args, NULL));
> +}
> +
> +static av_cold int cuda_stack_copy_kernel(AVFilterContext *ctx,
> CUfunction func,
> + CUtexObject src_tex[4],
> + AVFrame *out_frame,
> + int width, int height,
> + int dst_x, int dst_y, int dst_pitch,
> + int src_width, int src_height)
> +{
> + StackCudaContext *sctx = ctx->priv;
> + CUDAStackContext *s = &sctx->cuda;
> + CudaFunctions *cu = s->cuda_dl;
> +
> + CUdeviceptr dst_devptr[4] = {
> + (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1],
> + (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3]
> + };
> +
> + void *args[] = {
> + &src_tex[0], &src_tex[1], &src_tex[2], &src_tex[3],
> + &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3],
> + &width, &height, &dst_pitch,
> + &dst_x, &dst_y,
> + &src_width, &src_height,
> + &out_frame->width, &out_frame->height
> + };
> +
> + return CHECK_CU(cu->cuLaunchKernel(func,
> + DIV_UP(width, BLOCKX),
> DIV_UP(height, BLOCKY), 1,
> + BLOCKX, BLOCKY, 1,
> + 0, s->cu_stream, args, NULL));
> +}
> +
> +static int cuda_stack_color_op(AVFilterContext *ctx, StackItemRegion
> *region, AVFrame *out, const uint8_t *color) {
> + StackCudaContext *sctx = ctx->priv;
> + CUDAStackContext *s = &sctx->cuda;
> + CudaFunctions *cu = s->cuda_dl;
> + int ret = 0;
> + CUcontext dummy;
> +
> + // Push CUDA context
> + ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> + if (ret < 0)
> + return ret;
> +
> + ret = cuda_stack_color_kernel(ctx, s->cu_func_color,
> + out, color, region->width, region->height,
> + region->x, region->y,
> + out->width, out->height,
> + out->linesize[0]);
> + if (ret < 0) {
> + av_log(ctx, AV_LOG_ERROR, "Error during color operation: %d\n",
> ret);
> + goto fail;
> + }
> +
> + if (s->in_planes > 1) {
> + ret = cuda_stack_color_kernel(ctx, s->cu_func_color_uv,
> + out, color,
> + AV_CEIL_RSHIFT(region->width,
> s->in_desc->log2_chroma_w),
> + AV_CEIL_RSHIFT(region->height,
> s->in_desc->log2_chroma_h),
> + AV_CEIL_RSHIFT(region->x,
> s->in_desc->log2_chroma_w),
> + AV_CEIL_RSHIFT(region->y,
> s->in_desc->log2_chroma_h),
> + out->width, out->height,
> + out->linesize[1]);
> + if (ret < 0)
> + av_log(ctx, AV_LOG_ERROR, "Error during color UV operation:
> %d\n", ret);
> + }
> +
> +fail:
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + return ret;
> +}
> +
> +static int cuda_stack_copy_op(AVFilterContext *ctx, StackItemRegion
> *region, AVFrame *in, AVFrame *out) {
> + StackCudaContext *sctx = ctx->priv;
> + CUDAStackContext *s = &sctx->cuda;
> + CudaFunctions *cu = s->cuda_dl;
> + CUtexObject tex[4] = { 0, 0, 0, 0 };
> + int ret = 0;
> + int i;
> + CUcontext dummy;
> +
> + // Push CUDA context
> + ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> + if (ret < 0)
> + return ret;
> +
> + for (i = 0; i < s->in_planes; i++) {
> + CUDA_TEXTURE_DESC tex_desc = {
> + .filterMode = CU_TR_FILTER_MODE_POINT,
> + .flags = CU_TRSF_READ_AS_INTEGER,
> + };
> +
> + CUDA_RESOURCE_DESC res_desc = {
> + .resType = CU_RESOURCE_TYPE_PITCH2D,
> + .res.pitch2D.format = s->in_plane_depths[i] <= 8 ?
> + CU_AD_FORMAT_UNSIGNED_INT8 :
> + CU_AD_FORMAT_UNSIGNED_INT16,
> + .res.pitch2D.numChannels = s->in_plane_channels[i],
> + .res.pitch2D.pitchInBytes = in->linesize[i],
> + .res.pitch2D.devPtr = (CUdeviceptr)in->data[i],
> + };
> +
> + if (i == 1 || i == 2) {
> + res_desc.res.pitch2D.width = AV_CEIL_RSHIFT(in->width,
> s->in_desc->log2_chroma_w);
> + res_desc.res.pitch2D.height = AV_CEIL_RSHIFT(in->height,
> s->in_desc->log2_chroma_h);
> + } else {
> + res_desc.res.pitch2D.width = in->width;
> + res_desc.res.pitch2D.height = in->height;
> + }
> +
> + ret = CHECK_CU(cu->cuTexObjectCreate(&tex[i], &res_desc,
> &tex_desc, NULL));
> + if (ret < 0)
> + goto fail;
> + }
> +
> + ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy,
> + tex, out, region->width, region->height,
> + region->x, region->y, out->linesize[0],
> + in->width, in->height);
> + if (ret < 0) {
> + av_log(ctx, AV_LOG_ERROR, "Error during copy operation: %d\n",
> ret);
> + goto fail;
> + }
> +
> + if (s->in_planes > 1) {
> + ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy_uv, tex, out,
> + AV_CEIL_RSHIFT(region->width,
> s->in_desc->log2_chroma_w),
> + AV_CEIL_RSHIFT(region->height,
> s->in_desc->log2_chroma_h),
> + AV_CEIL_RSHIFT(region->x,
> s->in_desc->log2_chroma_w),
> + AV_CEIL_RSHIFT(region->y,
> s->in_desc->log2_chroma_h),
> + out->linesize[1],
> + AV_CEIL_RSHIFT(in->width,
> s->in_desc->log2_chroma_w),
> + AV_CEIL_RSHIFT(in->height,
> s->in_desc->log2_chroma_h));
> + if (ret < 0)
> + av_log(ctx, AV_LOG_ERROR, "Error during copy UV operation:
> %d\n", ret);
> + }
> +
> +fail:
> + for (i = 0; i < FF_ARRAY_ELEMS(tex); i++)
> + if (tex[i])
> + CHECK_CU(cu->cuTexObjectDestroy(tex[i]));
> +
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + return ret;
> +}
> +
> +static int process_frame(FFFrameSync *fs)
> +{
> + AVFilterContext *ctx = fs->parent;
> + StackCudaContext *sctx = fs->opaque;
> + CUDAStackContext *s = &sctx->cuda;
> + AVFilterLink *outlink = ctx->outputs[0];
> + AVFrame *out_frame = NULL;
> + AVFrame *in_frame = NULL;
> + int ret = 0;
> +
> + out_frame = ff_get_video_buffer(outlink, outlink->w, outlink->h);
> + if (!out_frame)
> + return AVERROR(ENOMEM);
> +
> + // Fill the entire output frame with fill color if enabled
> + if (sctx->base.fillcolor_enable) {
> + StackItemRegion full_region = {
> + .x = 0,
> + .y = 0,
> + .width = outlink->w,
> + .height = outlink->h
> + };
> +
> + ret = cuda_stack_color_op(ctx, &full_region, out_frame,
> s->fillcolor_yuv);
> + if (ret < 0) {
> + av_log(ctx, AV_LOG_ERROR, "Failed to fill background
> color\n");
> + goto fail;
> + }
> + }
> +
> + for (int i = 0; i < ctx->nb_inputs; i++) {
> + ret = ff_framesync_get_frame(fs, i, &in_frame, 0);
> + if (ret)
> + goto fail;
> +
> + if (i == 0) {
> + ret = av_frame_copy_props(out_frame, in_frame);
> + if (ret < 0)
> + goto fail;
> + }
> +
> + ret = cuda_stack_copy_op(ctx, &sctx->base.regions[i], in_frame,
> out_frame);
> + if (ret < 0)
> + goto fail;
> + }
> +
> + out_frame->pts = av_rescale_q(sctx->base.fs.pts,
> sctx->base.fs.time_base, outlink->time_base);
> + out_frame->sample_aspect_ratio = outlink->sample_aspect_ratio;
> +
> + return ff_filter_frame(outlink, out_frame);
> +
> +fail:
> + av_frame_free(&out_frame);
> + return ret;
> +}
> +
> +static int config_output(AVFilterLink *outlink)
> +{
> + AVFilterContext *ctx = outlink->src;
> + StackCudaContext *sctx = ctx->priv;
> + CUDAStackContext *s = &sctx->cuda;
> + AVFilterLink *inlink0 = ctx->inputs[0];
> + FilterLink *inl0 = ff_filter_link(inlink0);
> + FilterLink *outl = ff_filter_link(outlink);
> + enum AVPixelFormat in_format;
> + int depth = 8, ret;
> + AVHWFramesContext *in_frames_ctx;
> + AVBufferRef *hw_frames_ctx;
> + AVHWFramesContext *out_frames_ctx;
> +
> + if (inlink0->format != AV_PIX_FMT_CUDA || !inl0->hw_frames_ctx ||
> !inl0->hw_frames_ctx->data) {
> + av_log(ctx, AV_LOG_ERROR, "Software pixel format is not
> supported.\n");
> + return AVERROR(EINVAL);
> + }
> +
> + in_frames_ctx = (AVHWFramesContext*)inl0->hw_frames_ctx->data;
> + in_format = in_frames_ctx->sw_format;
> +
> + if (!format_is_supported(in_format)) {
> + av_log(ctx, AV_LOG_ERROR, "Unsupported input format: %s\n",
> + av_get_pix_fmt_name(in_format));
> + return AVERROR(ENOSYS);
> + }
> +
> + s->in_fmt = in_format;
> + s->in_desc = av_pix_fmt_desc_get(s->in_fmt);
> + s->in_planes = av_pix_fmt_count_planes(s->in_fmt);
> +
> + // Set up plane information
> + for (int i = 0; i < s->in_desc->nb_components; i++) {
> + int d = (s->in_desc->comp[i].depth + 7) / 8;
> + int p = s->in_desc->comp[i].plane;
> + s->in_plane_channels[p] = FFMAX(s->in_plane_channels[p],
> s->in_desc->comp[i].step / d);
> + s->in_plane_depths[p] = s->in_desc->comp[i].depth;
> + }
> +
> + s->hwctx = in_frames_ctx->device_ctx->hwctx;
> + s->cuda_dl = s->hwctx->internal->cuda_dl;
> + s->cu_stream = s->hwctx->stream;
> +
> + for (int i = 1; i < sctx->base.nb_inputs; i++) {
> + AVFilterLink *inlink = ctx->inputs[i];
> + FilterLink *inl = ff_filter_link(inlink);
> + AVHWFramesContext *hwfc = NULL;
> +
> + if (inlink->format != AV_PIX_FMT_CUDA || !inl->hw_frames_ctx ||
> !inl->hw_frames_ctx->data) {
> + av_log(ctx, AV_LOG_ERROR, "Software pixel format is not
> supported.\n");
> + return AVERROR(EINVAL);
> + }
> +
> + hwfc = (AVHWFramesContext *)inl->hw_frames_ctx->data;
> +
> + if (in_frames_ctx->sw_format != hwfc->sw_format) {
> + av_log(ctx, AV_LOG_ERROR, "All inputs should have the same
> underlying software pixel format.\n");
> + return AVERROR(EINVAL);
> + }
> + }
> +
> + if (in_format == AV_PIX_FMT_P010)
> + depth = 10;
> +
> + if (sctx->base.fillcolor_enable) {
> + // Check if this is an RGB format
> + if (s->in_desc->flags & AV_PIX_FMT_FLAG_RGB) {
> + // For RGB formats, use RGB values directly
> + s->fillcolor_yuv[0] = sctx->base.fillcolor[0]; // R
> + s->fillcolor_yuv[1] = sctx->base.fillcolor[1]; // G
> + s->fillcolor_yuv[2] = sctx->base.fillcolor[2]; // B
> + s->fillcolor_yuv[3] = sctx->base.fillcolor[3]; // A
> + } else {
> + // For YUV formats, convert RGB to YUV
> + int Y, U, V;
> +
> + rgb2yuv(sctx->base.fillcolor[0] / 255.0,
> sctx->base.fillcolor[1] / 255.0,
> + sctx->base.fillcolor[2] / 255.0, &Y, &U, &V, depth);
> + s->fillcolor_yuv[0] = Y;
> + s->fillcolor_yuv[1] = U;
> + s->fillcolor_yuv[2] = V;
> + s->fillcolor_yuv[3] = sctx->base.fillcolor[3];
> + }
> + }
> +
> + ret = config_comm_output(outlink);
> + if (ret < 0)
> + return ret;
> +
> + ret = cuda_stack_load_functions(ctx, in_format);
> + if (ret < 0)
> + return ret;
> +
> + // Initialize hardware frames context for output
> + hw_frames_ctx = av_hwframe_ctx_alloc(in_frames_ctx->device_ref);
> + if (!hw_frames_ctx)
> + return AVERROR(ENOMEM);
> +
> + out_frames_ctx = (AVHWFramesContext*)hw_frames_ctx->data;
> + out_frames_ctx->format = AV_PIX_FMT_CUDA;
> + out_frames_ctx->sw_format = in_format;
> + out_frames_ctx->width = outlink->w;
> + out_frames_ctx->height = outlink->h;
> +
> + ret = av_hwframe_ctx_init(hw_frames_ctx);
> + if (ret < 0) {
> + av_buffer_unref(&hw_frames_ctx);
> + return ret;
> + }
> +
> + av_buffer_unref(&outl->hw_frames_ctx);
> + outl->hw_frames_ctx = hw_frames_ctx;
> +
> + return 0;
> +}
> +
> +static int cuda_stack_init(AVFilterContext *ctx)
> +{
> + int ret;
> +
> + ret = stack_init(ctx);
> + if (ret)
> + return ret;
> +
> + return 0;
> +}
> +
> +static av_cold void cuda_stack_uninit(AVFilterContext *ctx)
> +{
> + StackCudaContext *sctx = ctx->priv;
> + CUDAStackContext *s = &sctx->cuda;
> +
> + if (s->hwctx && s->cu_module) {
> + CudaFunctions *cu = s->cuda_dl;
> + CUcontext dummy;
> +
> + CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> + CHECK_CU(cu->cuModuleUnload(s->cu_module));
> + s->cu_module = NULL;
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + }
> +
> + stack_uninit(ctx);
> +}
> +
> +static const enum AVPixelFormat cuda_stack_pix_fmts[] = {
> + AV_PIX_FMT_CUDA,
> + AV_PIX_FMT_NONE,
> +};
> +
> +#include "stack_internal.c"
> +
> +#if CONFIG_HSTACK_CUDA_FILTER
> +
> +DEFINE_HSTACK_OPTIONS(cuda);
> +DEFINE_STACK_FILTER(hstack, cuda, "CUDA", 0);
> +
> +#endif
> +
> +#if CONFIG_VSTACK_CUDA_FILTER
> +
> +DEFINE_VSTACK_OPTIONS(cuda);
> +DEFINE_STACK_FILTER(vstack, cuda, "CUDA", 0);
> +
> +#endif
> +
> +#if CONFIG_XSTACK_CUDA_FILTER
> +
> +DEFINE_XSTACK_OPTIONS(cuda);
> +DEFINE_STACK_FILTER(xstack, cuda, "CUDA", 0);
> +
> +#endif
> \ No newline at end of file
> diff --git a/libavfilter/vf_stack_cuda.cu b/libavfilter/vf_stack_cuda.cu
> new file mode 100644
> index 0000000000..c19595e0a6
> --- /dev/null
> +++ b/libavfilter/vf_stack_cuda.cu
> @@ -0,0 +1,389 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the
> "Software"),
> + * to deal in the Software without restriction, including without
> limitation
> + * the rights to use, copy, modify, merge, publish, distribute,
> sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> + * DEALINGS IN THE SOFTWARE.
> + */
> +
> +#include "cuda/vector_helpers.cuh"
> +
> +// --- CONVERSION LOGIC ---
> +
> +static const ushort mask_10bit = 0xFFC0;
> +static const ushort mask_16bit = 0xFFFF;
> +
> +static inline __device__ ushort conv_8to16(uchar in, ushort mask)
> +{
> + return ((ushort)in | ((ushort)in << 8)) & mask;
> +}
> +
> +// FFmpeg passes pitch in bytes, CUDA uses potentially larger types
> +#define FIXED_PITCH \
> + (dst_pitch/sizeof(*dst[0]))
> +
> +#define DEFAULT_DST(n) \
> + dst[n][yo*FIXED_PITCH+xo]
> +
> +#define OFFSET_DST(n) \
> + dst[n][(yo+dst_y)*FIXED_PITCH+(xo+dst_x)]
> +
> +// --- COMMON BOUNDS CHECKING ---
> +
> +#define BOUNDS_CHECK() \
> + if (xo >= width || yo >= height) \
> + return; \
> + int target_x = xo + dst_x; \
> + int target_y = yo + dst_y; \
> + if (target_x < 0 || target_y < 0 || target_x >= frame_width ||
> target_y >= frame_height) \
> + return;
> +
> +#define BOUNDS_CHECK_UV(chroma_shift) \
> + if (xo >= width || yo >= height) \
> + return; \
> + int target_x = xo + dst_x; \
> + int target_y = yo + dst_y; \
> + int frame_uv_height = frame_height >> chroma_shift; \
> + if (target_x < 0 || target_y < 0 || target_x >= frame_width ||
> target_y >= frame_uv_height) \
> + return;
> +
> +#define COPY_BOUNDS_CHECK() \
> + int target_x = xo + dst_x; \
> + int target_y = yo + dst_y; \
> + if (target_x < 0 || target_y < 0 || target_x >= frame_width ||
> target_y >= frame_height) \
> + return;
> +
> +#define COPY_BOUNDS_CHECK_UV(chroma_shift) \
> + int target_x = xo + dst_x; \
> + int target_y = yo + dst_y; \
> + int frame_uv_width = frame_width >> chroma_shift; \
> + int frame_uv_height = frame_height >> chroma_shift; \
> + if (target_x < 0 || target_y < 0 || target_x >= frame_uv_width ||
> target_y >= frame_uv_height) \
> + return;
> +
> +// --- COLOR OPERATIONS ---
> +
> +#define COLOR_DEF_F(N, T) \
> + __device__ static inline void N(T *dst[4], int xo, int yo, \
> + int width, int height, int
> dst_pitch, \
> + int dst_x, int dst_y, \
> + unsigned char y_color, unsigned char
> u_color, \
> + unsigned char v_color, unsigned char
> a_color, \
> + int frame_width, int frame_height)
> +
> +// Macro for YUV planar formats (420p, 444p, etc.)
> +#define DEFINE_SETCOLOR_YUV_PLANAR(name, out_type, out_type_uv, y_assign,
> uv_assign) \
> +struct SetColor_##name \
> +{ \
> + typedef out_type out_T; \
> + typedef out_type_uv out_T_uv; \
> + \
> + COLOR_DEF_F(SetColor, out_T) \
> + { \
> + BOUNDS_CHECK(); \
> + OFFSET_DST(0) = y_assign; \
> + } \
> + \
> + COLOR_DEF_F(SetColor_uv, out_T_uv) \
> + { \
> + BOUNDS_CHECK(); \
> + uv_assign; \
> + } \
> +};
> +
> +// Macro for NV12-style formats (interleaved UV)
> +#define DEFINE_SETCOLOR_NV(name, out_type, out_type_uv, y_assign,
> uv_assign) \
> +struct SetColor_##name \
> +{ \
> + typedef out_type out_T; \
> + typedef out_type_uv out_T_uv; \
> + \
> + COLOR_DEF_F(SetColor, out_T) \
> + { \
> + BOUNDS_CHECK(); \
> + OFFSET_DST(0) = y_assign; \
> + } \
> + \
> + COLOR_DEF_F(SetColor_uv, out_T_uv) \
> + { \
> + BOUNDS_CHECK_UV(1); \
> + OFFSET_DST(1) = uv_assign; \
> + } \
> +};
> +
> +// Macro for RGB formats
> +#define DEFINE_SETCOLOR_RGB(name, out_type, color_assign) \
> +struct SetColor_##name \
> +{ \
> + typedef out_type out_T; \
> + typedef uchar out_T_uv; \
> + \
> + COLOR_DEF_F(SetColor, out_T) \
> + { \
> + BOUNDS_CHECK(); \
> + OFFSET_DST(0) = color_assign; \
> + } \
> + \
> + COLOR_DEF_F(SetColor_uv, out_T_uv) \
> + { \
> + /* No UV plane for RGB formats */ \
> + } \
> +};
> +
> +// Define all SetColor structs using macros
> +DEFINE_SETCOLOR_YUV_PLANAR(yuv420p, uchar, uchar, y_color,
> + OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color)
> +
> +DEFINE_SETCOLOR_NV(nv12, uchar, uchar2, y_color,
> + make_uchar2(u_color, v_color))
> +
> +DEFINE_SETCOLOR_YUV_PLANAR(yuv444p, uchar, uchar, y_color,
> + OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color)
> +
> +DEFINE_SETCOLOR_NV(p010le, ushort, ushort2, conv_8to16(y_color,
> mask_10bit),
> + make_ushort2(conv_8to16(u_color, mask_10bit), conv_8to16(v_color,
> mask_10bit)))
> +
> +DEFINE_SETCOLOR_NV(p016le, ushort, ushort2, conv_8to16(y_color,
> mask_16bit),
> + make_ushort2(conv_8to16(u_color, mask_16bit), conv_8to16(v_color,
> mask_16bit)))
> +
> +DEFINE_SETCOLOR_YUV_PLANAR(yuv444p16le, ushort, ushort,
> conv_8to16(y_color, mask_16bit),
> + OFFSET_DST(1) = conv_8to16(u_color, mask_16bit); OFFSET_DST(2) =
> conv_8to16(v_color, mask_16bit))
> +
> +DEFINE_SETCOLOR_RGB(rgb0, uchar4, make_uchar4(y_color, u_color, v_color,
> 0))
> +DEFINE_SETCOLOR_RGB(bgr0, uchar4, make_uchar4(v_color, u_color, y_color,
> 0))
> +DEFINE_SETCOLOR_RGB(rgba, uchar4, make_uchar4(y_color, u_color, v_color,
> a_color))
> +DEFINE_SETCOLOR_RGB(bgra, uchar4, make_uchar4(v_color, u_color, y_color,
> a_color))
> +
> +// --- COPY OPERATIONS ---
> +
> +template<typename T>
> +using copy_function_t = T (*)(cudaTextureObject_t tex, int xo, int yo,
> + int dst_width, int dst_height,
> + int src_width, int src_height,
> + int bit_depth);
> +
> +#define COPY_DEF_F(N, T) \
> + template<copy_function_t<in_T> copy_func_y,
> \
> + copy_function_t<in_T_uv> copy_func_uv>
> \
> + __device__ static inline void N(cudaTextureObject_t src_tex[4], T
> *dst[4], int xo, int yo, \
> + int dst_width, int dst_height,
> int dst_pitch, \
> + int dst_x, int dst_y,
> \
> + int src_width, int src_height,
> \
> + int frame_width, int frame_height)
> +
> +#define COPY_SUB_F(m, plane) \
> + copy_func_##m(src_tex[plane], xo, yo, \
> + dst_width, dst_height, \
> + src_width, src_height, \
> + in_bit_depth)
> +
> +// Macro for YUV planar copy operations
> +#define DEFINE_STACKCOPY_YUV_PLANAR(name, bit_depth, in_type, in_type_uv,
> out_type, out_type_uv, chroma_shift) \
> +struct StackCopy_##name \
> +{ \
> + static const int in_bit_depth = bit_depth; \
> + typedef in_type in_T; \
> + typedef in_type_uv in_T_uv; \
> + typedef out_type out_T; \
> + typedef out_type_uv out_T_uv; \
> + \
> + COPY_DEF_F(StackCopy, out_T) \
> + { \
> + COPY_BOUNDS_CHECK(); \
> + OFFSET_DST(0) = COPY_SUB_F(y, 0); \
> + } \
> + \
> + COPY_DEF_F(StackCopy_uv, out_T_uv) \
> + { \
> + COPY_BOUNDS_CHECK_UV(chroma_shift); \
> + OFFSET_DST(1) = COPY_SUB_F(uv, 1); \
> + OFFSET_DST(2) = COPY_SUB_F(uv, 2); \
> + } \
> +};
> +
> +// Macro for NV12-style copy operations
> +#define DEFINE_STACKCOPY_NV(name, bit_depth, in_type, in_type_uv,
> out_type, out_type_uv) \
> +struct StackCopy_##name \
> +{ \
> + static const int in_bit_depth = bit_depth; \
> + typedef in_type in_T; \
> + typedef in_type_uv in_T_uv; \
> + typedef out_type out_T; \
> + typedef out_type_uv out_T_uv; \
> + \
> + COPY_DEF_F(StackCopy, out_T) \
> + { \
> + COPY_BOUNDS_CHECK(); \
> + OFFSET_DST(0) = COPY_SUB_F(y, 0); \
> + } \
> + \
> + COPY_DEF_F(StackCopy_uv, out_T_uv) \
> + { \
> + COPY_BOUNDS_CHECK_UV(1); \
> + OFFSET_DST(1) = COPY_SUB_F(uv, 1); \
> + } \
> +};
> +
> +// Macro for RGB copy operations
> +#define DEFINE_STACKCOPY_RGB(name, bit_depth, in_type, out_type) \
> +struct StackCopy_##name \
> +{ \
> + static const int in_bit_depth = bit_depth; \
> + typedef in_type in_T; \
> + typedef uchar in_T_uv; \
> + typedef out_type out_T; \
> + typedef uchar out_T_uv; \
> + \
> + COPY_DEF_F(StackCopy, out_T) \
> + { \
> + COPY_BOUNDS_CHECK(); \
> + OFFSET_DST(0) = COPY_SUB_F(y, 0); \
> + } \
> + \
> + COPY_DEF_F(StackCopy_uv, out_T_uv) \
> + { \
> + /* No UV plane for RGB formats */ \
> + } \
> +};
> +
> +// Define all StackCopy structs using macros
> +DEFINE_STACKCOPY_YUV_PLANAR(yuv420p_yuv420p, 8, uchar, uchar, uchar,
> uchar, 1)
> +DEFINE_STACKCOPY_NV(nv12_nv12, 8, uchar, uchar2, uchar, uchar2)
> +DEFINE_STACKCOPY_YUV_PLANAR(yuv444p_yuv444p, 8, uchar, uchar, uchar,
> uchar, 0)
> +DEFINE_STACKCOPY_NV(p010le_p010le, 10, ushort, ushort2, ushort, ushort2)
> +DEFINE_STACKCOPY_NV(p016le_p016le, 16, ushort, ushort2, ushort, ushort2)
> +DEFINE_STACKCOPY_YUV_PLANAR(yuv444p16le_yuv444p16le, 16, ushort, ushort,
> ushort, ushort, 0)
> +DEFINE_STACKCOPY_RGB(rgb0_rgb0, 8, uchar4, uchar4)
> +DEFINE_STACKCOPY_RGB(bgr0_bgr0, 8, uchar4, uchar4)
> +DEFINE_STACKCOPY_RGB(rgba_rgba, 8, uchar4, uchar4)
> +DEFINE_STACKCOPY_RGB(bgra_bgra, 8, uchar4, uchar4)
> +
> +// --- COPY LOGIC ---
> +
> +template<typename T>
> +__device__ static inline T StackCopyPixel(cudaTextureObject_t tex,
> + int xo, int yo,
> + int dst_width, int dst_height,
> + int src_width, int src_height,
> + int bit_depth)
> +{
> + float hscale = (float)src_width / (float)dst_width;
> + float vscale = (float)src_height / (float)dst_height;
> + float xi = (xo + 0.5f) * hscale;
> + float yi = (yo + 0.5f) * vscale;
> +
> + return tex2D<T>(tex, xi, yi);
> +}
> +
> +/// --- FUNCTION EXPORTS ---
> +
> +#define COLOR_KERNEL_ARGS(T) \
> + T *dst_0, T *dst_1, T *dst_2, T *dst_3, \
> + int width, int height, int dst_pitch, \
> + int dst_x, int dst_y, \
> + unsigned char y_color, unsigned char u_color, \
> + unsigned char v_color, unsigned char a_color, \
> + int frame_width, int frame_height
> +
> +#define COLOR_FUNC(SetColorFunc, T) \
> + T *dst[4] = { dst_0, dst_1, dst_2, dst_3 }; \
> + int xo = blockIdx.x * blockDim.x + threadIdx.x; \
> + int yo = blockIdx.y * blockDim.y + threadIdx.y; \
> + if (yo >= height || xo >= width) return; \
> + SetColorFunc( \
> + dst, xo, yo, \
> + width, height, dst_pitch, \
> + dst_x, dst_y, y_color, u_color, v_color, a_color, \
> + frame_width, frame_height);
> +
> +#define COPY_KERNEL_ARGS(T) \
> + cudaTextureObject_t src_tex_0, cudaTextureObject_t src_tex_1, \
> + cudaTextureObject_t src_tex_2, cudaTextureObject_t src_tex_3, \
> + T *dst_0, T *dst_1, T *dst_2, T *dst_3, \
> + int dst_width, int dst_height, int dst_pitch, \
> + int dst_x, int dst_y, \
> + int src_width, int src_height, \
> + int frame_width, int frame_height
> +
> +#define COPY_FUNC(StackCopyFunc, T) \
> + cudaTextureObject_t src_tex[4] = \
> + { src_tex_0, src_tex_1, src_tex_2, src_tex_3 }; \
> + T *dst[4] = { dst_0, dst_1, dst_2, dst_3 }; \
> + int xo = blockIdx.x * blockDim.x + threadIdx.x; \
> + int yo = blockIdx.y * blockDim.y + threadIdx.y; \
> + if (yo >= dst_height || xo >= dst_width) return; \
> + StackCopyFunc( \
> + src_tex, dst, xo, yo, \
> + dst_width, dst_height, dst_pitch, \
> + dst_x, dst_y, \
> + src_width, src_height, \
> + frame_width, frame_height);
> +
> +extern "C" {
> +
> +#define COLOR_KERNEL(C, S) \
> + __global__ void SetColor_##C##S( \
> + COLOR_KERNEL_ARGS(SetColor_##C::out_T##S)) \
> + { \
> + COLOR_FUNC(SetColor_##C::SetColor##S, SetColor_##C::out_T##S) \
> + }
> +
> +#define COLOR_KERNEL_RAW(C) \
> + COLOR_KERNEL(C,) \
> + COLOR_KERNEL(C,_uv)
> +
> +// Define color kernels for all supported formats
> +COLOR_KERNEL_RAW(yuv420p)
> +COLOR_KERNEL_RAW(nv12)
> +COLOR_KERNEL_RAW(yuv444p)
> +COLOR_KERNEL_RAW(p010le)
> +COLOR_KERNEL_RAW(p016le)
> +COLOR_KERNEL_RAW(yuv444p16le)
> +COLOR_KERNEL_RAW(rgb0)
> +COLOR_KERNEL_RAW(bgr0)
> +COLOR_KERNEL_RAW(rgba)
> +COLOR_KERNEL_RAW(bgra)
> +
> +#define COPY_KERNEL(C, S) \
> + __global__ void StackCopy_##C##S( \
> + COPY_KERNEL_ARGS(StackCopy_##C::out_T##S)) \
> + { \
> + COPY_FUNC((StackCopy_##C::StackCopy##S< \
> + StackCopyPixel<StackCopy_##C::in_T>, \
> + StackCopyPixel<StackCopy_##C::in_T_uv> >), \
> + StackCopy_##C::out_T##S) \
> + }
> +
> +#define COPY_KERNEL_RAW(C) \
> + COPY_KERNEL(C,) \
> + COPY_KERNEL(C,_uv)
> +
> +// Define copy kernels for all supported formats
> +COPY_KERNEL_RAW(yuv420p_yuv420p)
> +COPY_KERNEL_RAW(nv12_nv12)
> +COPY_KERNEL_RAW(yuv444p_yuv444p)
> +COPY_KERNEL_RAW(p010le_p010le)
> +COPY_KERNEL_RAW(p016le_p016le)
> +COPY_KERNEL_RAW(yuv444p16le_yuv444p16le)
> +COPY_KERNEL_RAW(rgb0_rgb0)
> +COPY_KERNEL_RAW(bgr0_bgr0)
> +COPY_KERNEL_RAW(rgba_rgba)
> +COPY_KERNEL_RAW(bgra_bgra)
> +
> +}
> \ No newline at end of file
> --
> 2.34.1
>
>
More information about the ffmpeg-devel
mailing list