[FFmpeg-devel] [PATCH] avfilter: add CUDA stack filters (hstack_cuda, vstack_cuda, xstack_cuda)

Mon May 26 14:35:54 EEST 2025

Hi all,

Friendly ping on the patch below (sent 23 May, link in Patchwork:
https://patchwork.ffmpeg.org/project/ffmpeg/patch/20250523215814.365246-1-f1k2faeez@gmail.com/
).

Patch summary
-------------
* Adds a CUDA implementation of the existing stack_* filter family
(parallels stack_qsv / stack_vaapi).
* Supports up to 16 inputs and both horizontal/vertical layouts.

If anything needs adjustment (coding-style, fate naming, etc.) please let
me know and I’ll resend an updated v2.

Many thanks for your time!

Best regards,
Faeez Kadiri

On Sat, May 24, 2025 at 3:28 AM Faeez Kadiri <f1k2faeez at gmail.com> wrote:

> Add hardware-accelerated stack filters for CUDA that provide equivalent
> functionality to the software stack filters but with GPU acceleration.
>
> Features:
> - Support for hstack, vstack, and xstack operations
> - Compatible pixel formats such as:
>   yuv420p, nv12, yuv444p, p010le, p016le, yuv444p16le, rgb0, bgr0, rgba,
> bgra
> - Fill color support with automatic RGB to YUV conversion for YUV formats
> - Proper chroma subsampling handling for all supported formats
> - Integration with existing stack filter infrastructure via
> stack_internal.h
>
> The implementation follows the established CUDA filter pattern from
> vf_scale_cuda.c, using PTX modules for kernel execution and proper
> CUDA context management. Copy operations handle frame placement while
> color operations fill background areas when using fill colors.
>
> This enables efficient video composition workflows entirely on GPU
> without CPU-GPU memory transfers, significantly improving performance
> for multi-input video processing pipelines.
>
> Examples:
> $ ffmpeg -hwaccel cuda -i input.h265 -filter_complex
> "[0:v][0:v]hstack_cuda" -c:v hevc_nvenc out.h265
>
> $ ffmpeg \
>   -hwaccel cuda -i input1.mp4 \
>   -hwaccel cuda -i input2.mp4 \
>   -hwaccel cuda -i input3.mp4 \
>   -hwaccel cuda -i input4.mp4 \
>   -filter_complex
> "[0:v]hwupload_cuda[0v];[1:v]hwupload_cuda[1v];[2:v]hwupload_cuda[2v];[3:v]hwupload_cuda[3v];[0v][1v][2v][3v]xstack_cuda=inputs=4:fill=black:layout=0_0|w0_0|0_h0|w0_h0"
> \
>   -c:v hevc_nvenc out.mp4
>
> Signed-off-by: Faeez Kadiri <f1k2faeez at gmail.com>
> ---
>  Changelog                    |   1 +
>  configure                    |   6 +
>  doc/filters.texi             |  78 +++++
>  libavfilter/Makefile         |   3 +
>  libavfilter/allfilters.c     |   3 +
>  libavfilter/vf_stack_cuda.c  | 589 +++++++++++++++++++++++++++++++++++
>  libavfilter/vf_stack_cuda.cu | 389 +++++++++++++++++++++++
>  7 files changed, 1069 insertions(+)
>  create mode 100644 libavfilter/vf_stack_cuda.c
>  create mode 100644 libavfilter/vf_stack_cuda.cu
>
> diff --git a/Changelog b/Changelog
> index 4217449438..0dec3443d4 100644
> --- a/Changelog
> +++ b/Changelog
> @@ -18,6 +18,7 @@ version <next>:
>  - APV encoding support through a libopenapv wrapper
>  - VVC decoder supports all content of SCC (Screen Content Coding):
>    IBC (Inter Block Copy), Palette Mode and ACT (Adaptive Color Transform
> +- hstack_cuda, vstack_cuda and xstack_cuda filters
>
>
>  version 7.1:
> diff --git a/configure b/configure
> index 3730b0524c..5c2d6e132d 100755
> --- a/configure
> +++ b/configure
> @@ -4033,6 +4033,12 @@ xfade_vulkan_filter_deps="vulkan spirv_compiler"
>  yadif_cuda_filter_deps="ffnvcodec"
>  yadif_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
>  yadif_videotoolbox_filter_deps="metal corevideo videotoolbox"
> +hstack_cuda_filter_deps="ffnvcodec"
> +hstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
> +vstack_cuda_filter_deps="ffnvcodec"
> +vstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
> +xstack_cuda_filter_deps="ffnvcodec"
> +xstack_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
>  hstack_vaapi_filter_deps="vaapi_1"
>  vstack_vaapi_filter_deps="vaapi_1"
>  xstack_vaapi_filter_deps="vaapi_1"
> diff --git a/doc/filters.texi b/doc/filters.texi
> index 6d2df07508..1c9afac9eb 100644
> --- a/doc/filters.texi
> +++ b/doc/filters.texi
> @@ -26850,6 +26850,84 @@ Only deinterlace frames marked as interlaced.
>  The default value is @code{all}.
>  @end table
>
> + at section hstack_cuda
> +Stack input videos horizontally.
> +
> +This is the CUDA variant of the @ref{vstack} filter, each input stream may
> +have different width, this filter will scale down/up each input stream
> while
> +keeping the orignal aspect.
> +
> +It accepts the following options:
> +
> + at table @option
> + at item inputs
> +See @ref{hstack}.
> +
> + at item shortest
> +See @ref{hstack}.
> +
> + at item height
> +Set height of output. If set to 0, this filter will set height of output
> to
> +height of the first input stream. Default value is 0.
> + at end table
> +
> + at section vstack_cuda
> +Stack input videos vertically.
> +
> +This is the CUDA variant of the @ref{vstack} filter, each input stream may
> +have different width, this filter will scale down/up each input stream
> while
> +keeping the orignal aspect.
> +
> +It accepts the following options:
> +
> + at table @option
> + at item inputs
> +See @ref{vstack}.
> +
> + at item shortest
> +See @ref{vstack}.
> +
> + at item width
> +Set width of output. If set to 0, this filter will set width of output to
> +width of the first input stream. Default value is 0.
> + at end table
> +
> + at section xstack_cuda
> +Stack video inputs into custom layout.
> +
> +This is the CUDA variant of the @ref{xstack} filter,  each input stream
> may
> +have different size, this filter will scale down/up each input stream to
> the
> +given output size, or the size of the first input stream.
> +
> +It accepts the following options:
> +
> + at table @option
> + at item inputs
> +See @ref{xstack}.
> +
> + at item shortest
> +See @ref{xstack}.
> +
> + at item layout
> +See @ref{xstack}.
> +Moreover, this permits the user to supply output size for each input
> stream.
> + at example
>
> +xstack_cuda=inputs=4:layout=0_0_1920x1080|0_h0_1920x1080|w0_0_1920x1080|w0_h0_1920x1080
> + at end example
> +
> + at item grid
> +See @ref{xstack}.
> +
> + at item grid_tile_size
> +Set output size for each input stream when @option{grid} is set. If this
> option
> +is not set, this filter will set output size by default to the size of the
> +first input stream. For the syntax of this option, check the
> + at ref{video size syntax,,"Video size" section in the ffmpeg-utils
> manual,ffmpeg-utils}.
> +
> + at item fill
> +See @ref{xstack}.
> + at end table
> +
>  @anchor{CUDA NPP}
>  @section CUDA NPP
>  Below is a description of the currently available NVIDIA Performance
> Primitives (libnpp) video filters.
> diff --git a/libavfilter/Makefile b/libavfilter/Makefile
> index 0effe4127f..ad876ccd53 100644
> --- a/libavfilter/Makefile
> +++ b/libavfilter/Makefile
> @@ -583,6 +583,9 @@ OBJS-$(CONFIG_YAEPBLUR_FILTER)               +=
> vf_yaepblur.o
>  OBJS-$(CONFIG_ZMQ_FILTER)                    += f_zmq.o
>  OBJS-$(CONFIG_ZOOMPAN_FILTER)                += vf_zoompan.o
>  OBJS-$(CONFIG_ZSCALE_FILTER)                 += vf_zscale.o
> +OBJS-$(CONFIG_HSTACK_CUDA_FILTER)            += vf_stack_cuda.o
> framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o
> +OBJS-$(CONFIG_VSTACK_CUDA_FILTER)            += vf_stack_cuda.o
> framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o
> +OBJS-$(CONFIG_XSTACK_CUDA_FILTER)            += vf_stack_cuda.o
> framesync.o vf_stack_cuda.ptx.o cuda/load_helper.o
>  OBJS-$(CONFIG_HSTACK_VAAPI_FILTER)           += vf_stack_vaapi.o
> framesync.o vaapi_vpp.o
>  OBJS-$(CONFIG_VSTACK_VAAPI_FILTER)           += vf_stack_vaapi.o
> framesync.o vaapi_vpp.o
>  OBJS-$(CONFIG_XSTACK_VAAPI_FILTER)           += vf_stack_vaapi.o
> framesync.o vaapi_vpp.o
> diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
> index 5ea33cdf01..89a7fb9277 100644
> --- a/libavfilter/allfilters.c
> +++ b/libavfilter/allfilters.c
> @@ -545,6 +545,9 @@ extern const FFFilter ff_vf_yaepblur;
>  extern const FFFilter ff_vf_zmq;
>  extern const FFFilter ff_vf_zoompan;
>  extern const FFFilter ff_vf_zscale;
> +extern const FFFilter ff_vf_hstack_cuda;
> +extern const FFFilter ff_vf_vstack_cuda;
> +extern const FFFilter ff_vf_xstack_cuda;
>  extern const FFFilter ff_vf_hstack_vaapi;
>  extern const FFFilter ff_vf_vstack_vaapi;
>  extern const FFFilter ff_vf_xstack_vaapi;
> diff --git a/libavfilter/vf_stack_cuda.c b/libavfilter/vf_stack_cuda.c
> new file mode 100644
> index 0000000000..002602b2bf
> --- /dev/null
> +++ b/libavfilter/vf_stack_cuda.c
> @@ -0,0 +1,589 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the
> "Software"),
> + * to deal in the Software without restriction, including without
> limitation
> + * the rights to use, copy, modify, merge, publish, distribute,
> sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> + * DEALINGS IN THE SOFTWARE.
> + */
> +
> +/**
> + * @file
> + * Hardware accelerated hstack, vstack and xstack filters based on CUDA
> + */
> +
> +#include "config_components.h"
> +
> +#include "libavutil/opt.h"
> +#include "libavutil/common.h"
> +#include "libavutil/pixdesc.h"
> +#include "libavutil/eval.h"
> +#include "libavutil/hwcontext.h"
> +#include "libavutil/hwcontext_cuda_internal.h"
> +#include "libavutil/cuda_check.h"
> +#include "libavutil/avstring.h"
> +#include "libavutil/avassert.h"
> +#include "libavutil/imgutils.h"
> +#include "libavutil/mathematics.h"
> +#include "libavutil/parseutils.h"
> +#include "libavutil/colorspace.h"
> +#include "libavutil/mem.h"
> +
> +#include "filters.h"
> +#include "formats.h"
> +#include "video.h"
> +
> +#include "framesync.h"
> +#include "cuda/load_helper.h"
> +
> +static const enum AVPixelFormat supported_formats[] = {
> +    AV_PIX_FMT_YUV420P,
> +    AV_PIX_FMT_NV12,
> +    AV_PIX_FMT_YUV444P,
> +    AV_PIX_FMT_P010,
> +    AV_PIX_FMT_P016,
> +    AV_PIX_FMT_YUV444P16,
> +    AV_PIX_FMT_0RGB32,
> +    AV_PIX_FMT_0BGR32,
> +    AV_PIX_FMT_RGB32,
> +    AV_PIX_FMT_BGR32,
> +};
> +
> +#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
> +#define BLOCKX 32
> +#define BLOCKY 16
> +
> +#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, s->hwctx->internal->cuda_dl, x)
> +
> +typedef struct CUDAStackContext {
> +    AVCUDADeviceContext *hwctx;
> +    CudaFunctions *cuda_dl;
> +
> +    CUcontext   cu_ctx;
> +    CUmodule    cu_module;
> +    CUstream    cu_stream;
> +
> +    // For copy operations
> +    CUfunction  cu_func_copy;
> +    CUfunction  cu_func_copy_uv;
> +
> +    // For color operations
> +    CUfunction  cu_func_color;
> +    CUfunction  cu_func_color_uv;
> +
> +    enum AVPixelFormat in_fmt;
> +    const AVPixFmtDescriptor *in_desc;
> +    int in_planes;
> +    int in_plane_depths[4];
> +    int in_plane_channels[4];
> +
> +    uint8_t fillcolor_rgba[4];
> +    uint8_t fillcolor_yuv[4];
> +} CUDAStackContext;
> +
> +#define HSTACK_NAME             "hstack_cuda"
> +#define VSTACK_NAME             "vstack_cuda"
> +#define XSTACK_NAME             "xstack_cuda"
> +#define HWContext               CUDAStackContext
> +#define StackHWContext          StackCudaContext
> +#include "stack_internal.h"
> +
> +typedef struct StackCudaContext {
> +    StackBaseContext base;
> +    CUDAStackContext cuda;
> +} StackCudaContext;
> +
> +static int format_is_supported(enum AVPixelFormat fmt)
> +{
> +    int i;
> +
> +    for (i = 0; i < FF_ARRAY_ELEMS(supported_formats); i++)
> +        if (supported_formats[i] == fmt)
> +            return 1;
> +    return 0;
> +}
> +
> +static void rgb2yuv(float r, float g, float b, int *y, int *u, int *v,
> int depth)
> +{
> +    *y = ((0.21260*219.0/255.0) * r + (0.71520*219.0/255.0) * g +
> +         (0.07220*219.0/255.0) * b) * ((1 << depth) - 1);
> +    *u = (-(0.11457*224.0/255.0) * r - (0.38543*224.0/255.0) * g +
> +         (0.50000*224.0/255.0) * b + 0.5) * ((1 << depth) - 1);
> +    *v = ((0.50000*224.0/255.0) * r - (0.45415*224.0/255.0) * g -
> +         (0.04585*224.0/255.0) * b + 0.5) * ((1 << depth) - 1);
> +}
> +
> +static av_cold int cuda_stack_load_functions(AVFilterContext *ctx, enum
> AVPixelFormat format)
> +{
> +    StackCudaContext *sctx = ctx->priv;
> +    CUDAStackContext *s = &sctx->cuda;
> +    CUcontext dummy, cuda_ctx = s->hwctx->cuda_ctx;
> +    CudaFunctions *cu = s->cuda_dl;
> +    int ret;
> +    char buf[128];
> +
> +    const char *fmt_name = av_get_pix_fmt_name(format);
> +
> +    extern const unsigned char ff_vf_stack_cuda_ptx_data[];
> +    extern const unsigned int ff_vf_stack_cuda_ptx_len;
> +
> +    ret = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
> +    if (ret < 0)
> +        return ret;
> +
> +    ret = ff_cuda_load_module(ctx, s->hwctx, &s->cu_module,
> +                              ff_vf_stack_cuda_ptx_data,
> ff_vf_stack_cuda_ptx_len);
> +    if (ret < 0)
> +        goto fail;
> +
> +    // Load copy functions
> +    snprintf(buf, sizeof(buf), "StackCopy_%s_%s", fmt_name, fmt_name);
> +    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy,
> s->cu_module, buf));
> +    if (ret < 0) {
> +        av_log(ctx, AV_LOG_FATAL, "Unsupported format for copy: %s\n",
> fmt_name);
> +        ret = AVERROR(ENOSYS);
> +        goto fail;
> +    }
> +
> +    snprintf(buf, sizeof(buf), "StackCopy_%s_%s_uv", fmt_name, fmt_name);
> +    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_copy_uv,
> s->cu_module, buf));
> +    if (ret < 0)
> +        goto fail;
> +
> +    // Load color functions
> +    snprintf(buf, sizeof(buf), "SetColor_%s", fmt_name);
> +    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color,
> s->cu_module, buf));
> +    if (ret < 0) {
> +        av_log(ctx, AV_LOG_FATAL, "Unsupported format for color: %s\n",
> fmt_name);
> +        ret = AVERROR(ENOSYS);
> +        goto fail;
> +    }
> +
> +    snprintf(buf, sizeof(buf), "SetColor_%s_uv", fmt_name);
> +    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_color_uv,
> s->cu_module, buf));
> +    if (ret < 0)
> +        goto fail;
> +
> +fail:
> +    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> +    return ret;
> +}
> +
> +static av_cold int cuda_stack_color_kernel(AVFilterContext *ctx,
> CUfunction func,
> +                            AVFrame *out_frame, const uint8_t *color,
> +                            int width, int height,
> +                            int dst_x, int dst_y,
> +                            int dst_width, int dst_height, int dst_pitch)
> +{
> +    StackCudaContext *sctx = ctx->priv;
> +    CUDAStackContext *s = &sctx->cuda;
> +    CudaFunctions *cu = s->cuda_dl;
> +
> +    CUdeviceptr dst_devptr[4] = {
> +        (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1],
> +        (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3]
> +    };
> +
> +    void *args[] = {
> +        &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3],
> +        &width, &height, &dst_pitch,
> +        &dst_x, &dst_y,
> +        (void *)&color[0], (void *)&color[1], (void *)&color[2], (void
> *)&color[3],
> +        &dst_width, &dst_height,
> +    };
> +
> +    return CHECK_CU(cu->cuLaunchKernel(func,
> +                                     DIV_UP(width, BLOCKX),
> DIV_UP(height, BLOCKY), 1,
> +                                     BLOCKX, BLOCKY, 1,
> +                                     0, s->cu_stream, args, NULL));
> +}
> +
> +static av_cold int cuda_stack_copy_kernel(AVFilterContext *ctx,
> CUfunction func,
> +                            CUtexObject src_tex[4],
> +                            AVFrame *out_frame,
> +                            int width, int height,
> +                            int dst_x, int dst_y, int dst_pitch,
> +                            int src_width, int src_height)
> +{
> +    StackCudaContext *sctx = ctx->priv;
> +    CUDAStackContext *s = &sctx->cuda;
> +    CudaFunctions *cu = s->cuda_dl;
> +
> +    CUdeviceptr dst_devptr[4] = {
> +        (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1],
> +        (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3]
> +    };
> +
> +    void *args[] = {
> +        &src_tex[0], &src_tex[1], &src_tex[2], &src_tex[3],
> +        &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3],
> +        &width, &height, &dst_pitch,
> +        &dst_x, &dst_y,
> +        &src_width, &src_height,
> +        &out_frame->width, &out_frame->height
> +    };
> +
> +    return CHECK_CU(cu->cuLaunchKernel(func,
> +                                     DIV_UP(width, BLOCKX),
> DIV_UP(height, BLOCKY), 1,
> +                                     BLOCKX, BLOCKY, 1,
> +                                     0, s->cu_stream, args, NULL));
> +}
> +
> +static int cuda_stack_color_op(AVFilterContext *ctx, StackItemRegion
> *region, AVFrame *out, const uint8_t *color) {
> +    StackCudaContext *sctx = ctx->priv;
> +    CUDAStackContext *s = &sctx->cuda;
> +    CudaFunctions *cu = s->cuda_dl;
> +    int ret = 0;
> +    CUcontext dummy;
> +
> +    // Push CUDA context
> +    ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> +    if (ret < 0)
> +        return ret;
> +
> +    ret = cuda_stack_color_kernel(ctx, s->cu_func_color,
> +                                out, color, region->width, region->height,
> +                                region->x, region->y,
> +                                out->width, out->height,
> +                                out->linesize[0]);
> +    if (ret < 0) {
> +        av_log(ctx, AV_LOG_ERROR, "Error during color operation: %d\n",
> ret);
> +        goto fail;
> +    }
> +
> +    if (s->in_planes > 1) {
> +        ret = cuda_stack_color_kernel(ctx, s->cu_func_color_uv,
> +                                    out, color,
> +                                    AV_CEIL_RSHIFT(region->width,
> s->in_desc->log2_chroma_w),
> +                                    AV_CEIL_RSHIFT(region->height,
> s->in_desc->log2_chroma_h),
> +                                    AV_CEIL_RSHIFT(region->x,
> s->in_desc->log2_chroma_w),
> +                                    AV_CEIL_RSHIFT(region->y,
> s->in_desc->log2_chroma_h),
> +                                    out->width, out->height,
> +                                    out->linesize[1]);
> +        if (ret < 0)
> +            av_log(ctx, AV_LOG_ERROR, "Error during color UV operation:
> %d\n", ret);
> +    }
> +
> +fail:
> +    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> +    return ret;
> +}
> +
> +static int cuda_stack_copy_op(AVFilterContext *ctx, StackItemRegion
> *region, AVFrame *in, AVFrame *out) {
> +    StackCudaContext *sctx = ctx->priv;
> +    CUDAStackContext *s = &sctx->cuda;
> +    CudaFunctions *cu = s->cuda_dl;
> +    CUtexObject tex[4] = { 0, 0, 0, 0 };
> +    int ret = 0;
> +    int i;
> +    CUcontext dummy;
> +
> +    // Push CUDA context
> +    ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> +    if (ret < 0)
> +        return ret;
> +
> +    for (i = 0; i < s->in_planes; i++) {
> +        CUDA_TEXTURE_DESC tex_desc = {
> +            .filterMode = CU_TR_FILTER_MODE_POINT,
> +            .flags = CU_TRSF_READ_AS_INTEGER,
> +        };
> +
> +        CUDA_RESOURCE_DESC res_desc = {
> +            .resType = CU_RESOURCE_TYPE_PITCH2D,
> +            .res.pitch2D.format = s->in_plane_depths[i] <= 8 ?
> +                                  CU_AD_FORMAT_UNSIGNED_INT8 :
> +                                  CU_AD_FORMAT_UNSIGNED_INT16,
> +            .res.pitch2D.numChannels = s->in_plane_channels[i],
> +            .res.pitch2D.pitchInBytes = in->linesize[i],
> +            .res.pitch2D.devPtr = (CUdeviceptr)in->data[i],
> +        };
> +
> +        if (i == 1 || i == 2) {
> +            res_desc.res.pitch2D.width = AV_CEIL_RSHIFT(in->width,
> s->in_desc->log2_chroma_w);
> +            res_desc.res.pitch2D.height = AV_CEIL_RSHIFT(in->height,
> s->in_desc->log2_chroma_h);
> +        } else {
> +            res_desc.res.pitch2D.width = in->width;
> +            res_desc.res.pitch2D.height = in->height;
> +        }
> +
> +        ret = CHECK_CU(cu->cuTexObjectCreate(&tex[i], &res_desc,
> &tex_desc, NULL));
> +        if (ret < 0)
> +            goto fail;
> +    }
> +
> +    ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy,
> +                             tex, out, region->width, region->height,
> +                                region->x, region->y, out->linesize[0],
> +                                in->width, in->height);
> +    if (ret < 0) {
> +        av_log(ctx, AV_LOG_ERROR, "Error during copy operation: %d\n",
> ret);
> +        goto fail;
> +    }
> +
> +    if (s->in_planes > 1) {
> +        ret = cuda_stack_copy_kernel(ctx, s->cu_func_copy_uv, tex, out,
> +                                    AV_CEIL_RSHIFT(region->width,
> s->in_desc->log2_chroma_w),
> +                                    AV_CEIL_RSHIFT(region->height,
> s->in_desc->log2_chroma_h),
> +                                    AV_CEIL_RSHIFT(region->x,
> s->in_desc->log2_chroma_w),
> +                                    AV_CEIL_RSHIFT(region->y,
> s->in_desc->log2_chroma_h),
> +                                    out->linesize[1],
> +                                    AV_CEIL_RSHIFT(in->width,
> s->in_desc->log2_chroma_w),
> +                                    AV_CEIL_RSHIFT(in->height,
> s->in_desc->log2_chroma_h));
> +        if (ret < 0)
> +            av_log(ctx, AV_LOG_ERROR, "Error during copy UV operation:
> %d\n", ret);
> +    }
> +
> +fail:
> +    for (i = 0; i < FF_ARRAY_ELEMS(tex); i++)
> +        if (tex[i])
> +            CHECK_CU(cu->cuTexObjectDestroy(tex[i]));
> +
> +    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> +    return ret;
> +}
> +
> +static int process_frame(FFFrameSync *fs)
> +{
> +    AVFilterContext *ctx = fs->parent;
> +    StackCudaContext *sctx = fs->opaque;
> +    CUDAStackContext *s = &sctx->cuda;
> +    AVFilterLink *outlink = ctx->outputs[0];
> +    AVFrame *out_frame = NULL;
> +    AVFrame *in_frame = NULL;
> +    int ret = 0;
> +
> +    out_frame = ff_get_video_buffer(outlink, outlink->w, outlink->h);
> +    if (!out_frame)
> +        return AVERROR(ENOMEM);
> +
> +    // Fill the entire output frame with fill color if enabled
> +    if (sctx->base.fillcolor_enable) {
> +        StackItemRegion full_region = {
> +            .x = 0,
> +            .y = 0,
> +            .width = outlink->w,
> +            .height = outlink->h
> +        };
> +
> +        ret = cuda_stack_color_op(ctx, &full_region, out_frame,
> s->fillcolor_yuv);
> +        if (ret < 0) {
> +            av_log(ctx, AV_LOG_ERROR, "Failed to fill background
> color\n");
> +            goto fail;
> +        }
> +    }
> +
> +    for (int i = 0; i < ctx->nb_inputs; i++) {
> +        ret = ff_framesync_get_frame(fs, i, &in_frame, 0);
> +        if (ret)
> +            goto fail;
> +
> +        if (i == 0) {
> +            ret = av_frame_copy_props(out_frame, in_frame);
> +            if (ret < 0)
> +                goto fail;
> +        }
> +
> +        ret = cuda_stack_copy_op(ctx, &sctx->base.regions[i], in_frame,
> out_frame);
> +        if (ret < 0)
> +            goto fail;
> +    }
> +
> +    out_frame->pts = av_rescale_q(sctx->base.fs.pts,
> sctx->base.fs.time_base, outlink->time_base);
> +    out_frame->sample_aspect_ratio = outlink->sample_aspect_ratio;
> +
> +    return ff_filter_frame(outlink, out_frame);
> +
> +fail:
> +    av_frame_free(&out_frame);
> +    return ret;
> +}
> +
> +static int config_output(AVFilterLink *outlink)
> +{
> +    AVFilterContext *ctx = outlink->src;
> +    StackCudaContext *sctx = ctx->priv;
> +    CUDAStackContext *s = &sctx->cuda;
> +    AVFilterLink *inlink0 = ctx->inputs[0];
> +    FilterLink      *inl0 = ff_filter_link(inlink0);
> +    FilterLink      *outl = ff_filter_link(outlink);
> +    enum AVPixelFormat in_format;
> +    int depth = 8, ret;
> +    AVHWFramesContext *in_frames_ctx;
> +    AVBufferRef *hw_frames_ctx;
> +    AVHWFramesContext *out_frames_ctx;
> +
> +    if (inlink0->format != AV_PIX_FMT_CUDA || !inl0->hw_frames_ctx ||
> !inl0->hw_frames_ctx->data) {
> +        av_log(ctx, AV_LOG_ERROR, "Software pixel format is not
> supported.\n");
> +        return AVERROR(EINVAL);
> +    }
> +
> +    in_frames_ctx = (AVHWFramesContext*)inl0->hw_frames_ctx->data;
> +    in_format = in_frames_ctx->sw_format;
> +
> +    if (!format_is_supported(in_format)) {
> +        av_log(ctx, AV_LOG_ERROR, "Unsupported input format: %s\n",
> +               av_get_pix_fmt_name(in_format));
> +        return AVERROR(ENOSYS);
> +    }
> +
> +    s->in_fmt = in_format;
> +    s->in_desc = av_pix_fmt_desc_get(s->in_fmt);
> +    s->in_planes = av_pix_fmt_count_planes(s->in_fmt);
> +
> +    // Set up plane information
> +    for (int i = 0; i < s->in_desc->nb_components; i++) {
> +        int d = (s->in_desc->comp[i].depth + 7) / 8;
> +        int p = s->in_desc->comp[i].plane;
> +        s->in_plane_channels[p] = FFMAX(s->in_plane_channels[p],
> s->in_desc->comp[i].step / d);
> +        s->in_plane_depths[p] = s->in_desc->comp[i].depth;
> +    }
> +
> +    s->hwctx = in_frames_ctx->device_ctx->hwctx;
> +    s->cuda_dl = s->hwctx->internal->cuda_dl;
> +    s->cu_stream = s->hwctx->stream;
> +
> +    for (int i = 1; i < sctx->base.nb_inputs; i++) {
> +        AVFilterLink *inlink = ctx->inputs[i];
> +        FilterLink      *inl = ff_filter_link(inlink);
> +        AVHWFramesContext *hwfc = NULL;
> +
> +        if (inlink->format != AV_PIX_FMT_CUDA || !inl->hw_frames_ctx ||
> !inl->hw_frames_ctx->data) {
> +            av_log(ctx, AV_LOG_ERROR, "Software pixel format is not
> supported.\n");
> +            return AVERROR(EINVAL);
> +        }
> +
> +        hwfc = (AVHWFramesContext *)inl->hw_frames_ctx->data;
> +
> +        if (in_frames_ctx->sw_format != hwfc->sw_format) {
> +            av_log(ctx, AV_LOG_ERROR, "All inputs should have the same
> underlying software pixel format.\n");
> +            return AVERROR(EINVAL);
> +        }
> +    }
> +
> +    if (in_format == AV_PIX_FMT_P010)
> +        depth = 10;
> +
> +    if (sctx->base.fillcolor_enable) {
> +        // Check if this is an RGB format
> +        if (s->in_desc->flags & AV_PIX_FMT_FLAG_RGB) {
> +            // For RGB formats, use RGB values directly
> +            s->fillcolor_yuv[0] = sctx->base.fillcolor[0]; // R
> +            s->fillcolor_yuv[1] = sctx->base.fillcolor[1]; // G
> +            s->fillcolor_yuv[2] = sctx->base.fillcolor[2]; // B
> +            s->fillcolor_yuv[3] = sctx->base.fillcolor[3]; // A
> +        } else {
> +            // For YUV formats, convert RGB to YUV
> +            int Y, U, V;
> +
> +            rgb2yuv(sctx->base.fillcolor[0] / 255.0,
> sctx->base.fillcolor[1] / 255.0,
> +                    sctx->base.fillcolor[2] / 255.0, &Y, &U, &V, depth);
> +            s->fillcolor_yuv[0] = Y;
> +            s->fillcolor_yuv[1] = U;
> +            s->fillcolor_yuv[2] = V;
> +            s->fillcolor_yuv[3] = sctx->base.fillcolor[3];
> +        }
> +    }
> +
> +    ret = config_comm_output(outlink);
> +    if (ret < 0)
> +        return ret;
> +
> +    ret = cuda_stack_load_functions(ctx, in_format);
> +    if (ret < 0)
> +        return ret;
> +
> +    // Initialize hardware frames context for output
> +    hw_frames_ctx = av_hwframe_ctx_alloc(in_frames_ctx->device_ref);
> +    if (!hw_frames_ctx)
> +        return AVERROR(ENOMEM);
> +
> +    out_frames_ctx = (AVHWFramesContext*)hw_frames_ctx->data;
> +    out_frames_ctx->format = AV_PIX_FMT_CUDA;
> +    out_frames_ctx->sw_format = in_format;
> +    out_frames_ctx->width = outlink->w;
> +    out_frames_ctx->height = outlink->h;
> +
> +    ret = av_hwframe_ctx_init(hw_frames_ctx);
> +    if (ret < 0) {
> +        av_buffer_unref(&hw_frames_ctx);
> +        return ret;
> +    }
> +
> +    av_buffer_unref(&outl->hw_frames_ctx);
> +    outl->hw_frames_ctx = hw_frames_ctx;
> +
> +    return 0;
> +}
> +
> +static int cuda_stack_init(AVFilterContext *ctx)
> +{
> +    int ret;
> +
> +    ret = stack_init(ctx);
> +    if (ret)
> +        return ret;
> +
> +    return 0;
> +}
> +
> +static av_cold void cuda_stack_uninit(AVFilterContext *ctx)
> +{
> +    StackCudaContext *sctx = ctx->priv;
> +    CUDAStackContext *s = &sctx->cuda;
> +
> +    if (s->hwctx && s->cu_module) {
> +        CudaFunctions *cu = s->cuda_dl;
> +        CUcontext dummy;
> +
> +        CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> +        CHECK_CU(cu->cuModuleUnload(s->cu_module));
> +        s->cu_module = NULL;
> +        CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> +    }
> +
> +    stack_uninit(ctx);
> +}
> +
> +static const enum AVPixelFormat cuda_stack_pix_fmts[] = {
> +    AV_PIX_FMT_CUDA,
> +    AV_PIX_FMT_NONE,
> +};
> +
> +#include "stack_internal.c"
> +
> +#if CONFIG_HSTACK_CUDA_FILTER
> +
> +DEFINE_HSTACK_OPTIONS(cuda);
> +DEFINE_STACK_FILTER(hstack, cuda, "CUDA", 0);
> +
> +#endif
> +
> +#if CONFIG_VSTACK_CUDA_FILTER
> +
> +DEFINE_VSTACK_OPTIONS(cuda);
> +DEFINE_STACK_FILTER(vstack, cuda, "CUDA", 0);
> +
> +#endif
> +
> +#if CONFIG_XSTACK_CUDA_FILTER
> +
> +DEFINE_XSTACK_OPTIONS(cuda);
> +DEFINE_STACK_FILTER(xstack, cuda, "CUDA", 0);
> +
> +#endif
> \ No newline at end of file
> diff --git a/libavfilter/vf_stack_cuda.cu b/libavfilter/vf_stack_cuda.cu
> new file mode 100644
> index 0000000000..c19595e0a6
> --- /dev/null
> +++ b/libavfilter/vf_stack_cuda.cu
> @@ -0,0 +1,389 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the
> "Software"),
> + * to deal in the Software without restriction, including without
> limitation
> + * the rights to use, copy, modify, merge, publish, distribute,
> sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> + * DEALINGS IN THE SOFTWARE.
> + */
> +
> +#include "cuda/vector_helpers.cuh"
> +
> +// --- CONVERSION LOGIC ---
> +
> +static const ushort mask_10bit = 0xFFC0;
> +static const ushort mask_16bit = 0xFFFF;
> +
> +static inline __device__ ushort conv_8to16(uchar in, ushort mask)
> +{
> +    return ((ushort)in | ((ushort)in << 8)) & mask;
> +}
> +
> +// FFmpeg passes pitch in bytes, CUDA uses potentially larger types
> +#define FIXED_PITCH \
> +    (dst_pitch/sizeof(*dst[0]))
> +
> +#define DEFAULT_DST(n) \
> +    dst[n][yo*FIXED_PITCH+xo]
> +
> +#define OFFSET_DST(n) \
> +    dst[n][(yo+dst_y)*FIXED_PITCH+(xo+dst_x)]
> +
> +// --- COMMON BOUNDS CHECKING ---
> +
> +#define BOUNDS_CHECK() \
> +    if (xo >= width || yo >= height) \
> +        return; \
> +    int target_x = xo + dst_x; \
> +    int target_y = yo + dst_y; \
> +    if (target_x < 0 || target_y < 0 || target_x >= frame_width ||
> target_y >= frame_height) \
> +        return;
> +
> +#define BOUNDS_CHECK_UV(chroma_shift) \
> +    if (xo >= width || yo >= height) \
> +        return; \
> +    int target_x = xo + dst_x; \
> +    int target_y = yo + dst_y; \
> +    int frame_uv_height = frame_height >> chroma_shift; \
> +    if (target_x < 0 || target_y < 0 || target_x >= frame_width ||
> target_y >= frame_uv_height) \
> +        return;
> +
> +#define COPY_BOUNDS_CHECK() \
> +    int target_x = xo + dst_x; \
> +    int target_y = yo + dst_y; \
> +    if (target_x < 0 || target_y < 0 || target_x >= frame_width ||
> target_y >= frame_height) \
> +        return;
> +
> +#define COPY_BOUNDS_CHECK_UV(chroma_shift) \
> +    int target_x = xo + dst_x; \
> +    int target_y = yo + dst_y; \
> +    int frame_uv_width = frame_width >> chroma_shift; \
> +    int frame_uv_height = frame_height >> chroma_shift; \
> +    if (target_x < 0 || target_y < 0 || target_x >= frame_uv_width ||
> target_y >= frame_uv_height) \
> +        return;
> +
> +// --- COLOR OPERATIONS ---
> +
> +#define COLOR_DEF_F(N, T) \
> +    __device__ static inline void N(T *dst[4], int xo, int yo, \
> +                                     int width, int height, int
> dst_pitch, \
> +                                     int dst_x, int dst_y, \
> +                                     unsigned char y_color, unsigned char
> u_color, \
> +                                     unsigned char v_color, unsigned char
> a_color, \
> +                                     int frame_width, int frame_height)
> +
> +// Macro for YUV planar formats (420p, 444p, etc.)
> +#define DEFINE_SETCOLOR_YUV_PLANAR(name, out_type, out_type_uv, y_assign,
> uv_assign) \
> +struct SetColor_##name \
> +{ \
> +    typedef out_type out_T; \
> +    typedef out_type_uv out_T_uv; \
> +    \
> +    COLOR_DEF_F(SetColor, out_T) \
> +    { \
> +        BOUNDS_CHECK(); \
> +        OFFSET_DST(0) = y_assign; \
> +    } \
> +    \
> +    COLOR_DEF_F(SetColor_uv, out_T_uv) \
> +    { \
> +        BOUNDS_CHECK(); \
> +        uv_assign; \
> +    } \
> +};
> +
> +// Macro for NV12-style formats (interleaved UV)
> +#define DEFINE_SETCOLOR_NV(name, out_type, out_type_uv, y_assign,
> uv_assign) \
> +struct SetColor_##name \
> +{ \
> +    typedef out_type out_T; \
> +    typedef out_type_uv out_T_uv; \
> +    \
> +    COLOR_DEF_F(SetColor, out_T) \
> +    { \
> +        BOUNDS_CHECK(); \
> +        OFFSET_DST(0) = y_assign; \
> +    } \
> +    \
> +    COLOR_DEF_F(SetColor_uv, out_T_uv) \
> +    { \
> +        BOUNDS_CHECK_UV(1); \
> +        OFFSET_DST(1) = uv_assign; \
> +    } \
> +};
> +
> +// Macro for RGB formats
> +#define DEFINE_SETCOLOR_RGB(name, out_type, color_assign) \
> +struct SetColor_##name \
> +{ \
> +    typedef out_type out_T; \
> +    typedef uchar out_T_uv; \
> +    \
> +    COLOR_DEF_F(SetColor, out_T) \
> +    { \
> +        BOUNDS_CHECK(); \
> +        OFFSET_DST(0) = color_assign; \
> +    } \
> +    \
> +    COLOR_DEF_F(SetColor_uv, out_T_uv) \
> +    { \
> +        /* No UV plane for RGB formats */ \
> +    } \
> +};
> +
> +// Define all SetColor structs using macros
> +DEFINE_SETCOLOR_YUV_PLANAR(yuv420p, uchar, uchar, y_color,
> +    OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color)
> +
> +DEFINE_SETCOLOR_NV(nv12, uchar, uchar2, y_color,
> +    make_uchar2(u_color, v_color))
> +
> +DEFINE_SETCOLOR_YUV_PLANAR(yuv444p, uchar, uchar, y_color,
> +    OFFSET_DST(1) = u_color; OFFSET_DST(2) = v_color)
> +
> +DEFINE_SETCOLOR_NV(p010le, ushort, ushort2, conv_8to16(y_color,
> mask_10bit),
> +    make_ushort2(conv_8to16(u_color, mask_10bit), conv_8to16(v_color,
> mask_10bit)))
> +
> +DEFINE_SETCOLOR_NV(p016le, ushort, ushort2, conv_8to16(y_color,
> mask_16bit),
> +    make_ushort2(conv_8to16(u_color, mask_16bit), conv_8to16(v_color,
> mask_16bit)))
> +
> +DEFINE_SETCOLOR_YUV_PLANAR(yuv444p16le, ushort, ushort,
> conv_8to16(y_color, mask_16bit),
> +    OFFSET_DST(1) = conv_8to16(u_color, mask_16bit); OFFSET_DST(2) =
> conv_8to16(v_color, mask_16bit))
> +
> +DEFINE_SETCOLOR_RGB(rgb0, uchar4, make_uchar4(y_color, u_color, v_color,
> 0))
> +DEFINE_SETCOLOR_RGB(bgr0, uchar4, make_uchar4(v_color, u_color, y_color,
> 0))
> +DEFINE_SETCOLOR_RGB(rgba, uchar4, make_uchar4(y_color, u_color, v_color,
> a_color))
> +DEFINE_SETCOLOR_RGB(bgra, uchar4, make_uchar4(v_color, u_color, y_color,
> a_color))
> +
> +// --- COPY OPERATIONS ---
> +
> +template<typename T>
> +using copy_function_t = T (*)(cudaTextureObject_t tex, int xo, int yo,
> +                               int dst_width, int dst_height,
> +                               int src_width, int src_height,
> +                               int bit_depth);
> +
> +#define COPY_DEF_F(N, T) \
> +    template<copy_function_t<in_T> copy_func_y,
>                  \
> +             copy_function_t<in_T_uv> copy_func_uv>
>                  \
> +    __device__ static inline void N(cudaTextureObject_t src_tex[4], T
> *dst[4], int xo, int yo, \
> +                                        int dst_width, int dst_height,
> int dst_pitch,          \
> +                                        int dst_x, int dst_y,
>                       \
> +                                        int src_width, int src_height,
>                     \
> +                                        int frame_width, int frame_height)
> +
> +#define COPY_SUB_F(m, plane) \
> +    copy_func_##m(src_tex[plane], xo, yo, \
> +                  dst_width, dst_height,  \
> +                  src_width, src_height,  \
> +                  in_bit_depth)
> +
> +// Macro for YUV planar copy operations
> +#define DEFINE_STACKCOPY_YUV_PLANAR(name, bit_depth, in_type, in_type_uv,
> out_type, out_type_uv, chroma_shift) \
> +struct StackCopy_##name \
> +{ \
> +    static const int in_bit_depth = bit_depth; \
> +    typedef in_type in_T; \
> +    typedef in_type_uv in_T_uv; \
> +    typedef out_type out_T; \
> +    typedef out_type_uv out_T_uv; \
> +    \
> +    COPY_DEF_F(StackCopy, out_T) \
> +    { \
> +        COPY_BOUNDS_CHECK(); \
> +        OFFSET_DST(0) = COPY_SUB_F(y, 0); \
> +    } \
> +    \
> +    COPY_DEF_F(StackCopy_uv, out_T_uv) \
> +    { \
> +        COPY_BOUNDS_CHECK_UV(chroma_shift); \
> +        OFFSET_DST(1) = COPY_SUB_F(uv, 1); \
> +        OFFSET_DST(2) = COPY_SUB_F(uv, 2); \
> +    } \
> +};
> +
> +// Macro for NV12-style copy operations
> +#define DEFINE_STACKCOPY_NV(name, bit_depth, in_type, in_type_uv,
> out_type, out_type_uv) \
> +struct StackCopy_##name \
> +{ \
> +    static const int in_bit_depth = bit_depth; \
> +    typedef in_type in_T; \
> +    typedef in_type_uv in_T_uv; \
> +    typedef out_type out_T; \
> +    typedef out_type_uv out_T_uv; \
> +    \
> +    COPY_DEF_F(StackCopy, out_T) \
> +    { \
> +        COPY_BOUNDS_CHECK(); \
> +        OFFSET_DST(0) = COPY_SUB_F(y, 0); \
> +    } \
> +    \
> +    COPY_DEF_F(StackCopy_uv, out_T_uv) \
> +    { \
> +        COPY_BOUNDS_CHECK_UV(1); \
> +        OFFSET_DST(1) = COPY_SUB_F(uv, 1); \
> +    } \
> +};
> +
> +// Macro for RGB copy operations
> +#define DEFINE_STACKCOPY_RGB(name, bit_depth, in_type, out_type) \
> +struct StackCopy_##name \
> +{ \
> +    static const int in_bit_depth = bit_depth; \
> +    typedef in_type in_T; \
> +    typedef uchar in_T_uv; \
> +    typedef out_type out_T; \
> +    typedef uchar out_T_uv; \
> +    \
> +    COPY_DEF_F(StackCopy, out_T) \
> +    { \
> +        COPY_BOUNDS_CHECK(); \
> +        OFFSET_DST(0) = COPY_SUB_F(y, 0); \
> +    } \
> +    \
> +    COPY_DEF_F(StackCopy_uv, out_T_uv) \
> +    { \
> +        /* No UV plane for RGB formats */ \
> +    } \
> +};
> +
> +// Define all StackCopy structs using macros
> +DEFINE_STACKCOPY_YUV_PLANAR(yuv420p_yuv420p, 8, uchar, uchar, uchar,
> uchar, 1)
> +DEFINE_STACKCOPY_NV(nv12_nv12, 8, uchar, uchar2, uchar, uchar2)
> +DEFINE_STACKCOPY_YUV_PLANAR(yuv444p_yuv444p, 8, uchar, uchar, uchar,
> uchar, 0)
> +DEFINE_STACKCOPY_NV(p010le_p010le, 10, ushort, ushort2, ushort, ushort2)
> +DEFINE_STACKCOPY_NV(p016le_p016le, 16, ushort, ushort2, ushort, ushort2)
> +DEFINE_STACKCOPY_YUV_PLANAR(yuv444p16le_yuv444p16le, 16, ushort, ushort,
> ushort, ushort, 0)
> +DEFINE_STACKCOPY_RGB(rgb0_rgb0, 8, uchar4, uchar4)
> +DEFINE_STACKCOPY_RGB(bgr0_bgr0, 8, uchar4, uchar4)
> +DEFINE_STACKCOPY_RGB(rgba_rgba, 8, uchar4, uchar4)
> +DEFINE_STACKCOPY_RGB(bgra_bgra, 8, uchar4, uchar4)
> +
> +// --- COPY LOGIC ---
> +
> +template<typename T>
> +__device__ static inline T StackCopyPixel(cudaTextureObject_t tex,
> +                                        int xo, int yo,
> +                                        int dst_width, int dst_height,
> +                                        int src_width, int src_height,
> +                                        int bit_depth)
> +{
> +    float hscale = (float)src_width / (float)dst_width;
> +    float vscale = (float)src_height / (float)dst_height;
> +    float xi = (xo + 0.5f) * hscale;
> +    float yi = (yo + 0.5f) * vscale;
> +
> +    return tex2D<T>(tex, xi, yi);
> +}
> +
> +/// --- FUNCTION EXPORTS ---
> +
> +#define COLOR_KERNEL_ARGS(T) \
> +    T *dst_0, T *dst_1, T *dst_2, T *dst_3,             \
> +    int width, int height, int dst_pitch,               \
> +    int dst_x, int dst_y,                               \
> +    unsigned char y_color, unsigned char u_color,       \
> +    unsigned char v_color, unsigned char a_color,       \
> +    int frame_width, int frame_height
> +
> +#define COLOR_FUNC(SetColorFunc, T) \
> +    T *dst[4] = { dst_0, dst_1, dst_2, dst_3 };           \
> +    int xo = blockIdx.x * blockDim.x + threadIdx.x;       \
> +    int yo = blockIdx.y * blockDim.y + threadIdx.y;       \
> +    if (yo >= height || xo >= width) return;              \
> +    SetColorFunc(                                         \
> +        dst, xo, yo,                                      \
> +        width, height, dst_pitch,                         \
> +        dst_x, dst_y, y_color, u_color, v_color, a_color, \
> +        frame_width, frame_height);
> +
> +#define COPY_KERNEL_ARGS(T) \
> +    cudaTextureObject_t src_tex_0, cudaTextureObject_t src_tex_1, \
> +    cudaTextureObject_t src_tex_2, cudaTextureObject_t src_tex_3, \
> +    T *dst_0, T *dst_1, T *dst_2, T *dst_3,                       \
> +    int dst_width, int dst_height, int dst_pitch,                 \
> +    int dst_x, int dst_y,                                         \
> +    int src_width, int src_height,                                \
> +    int frame_width, int frame_height
> +
> +#define COPY_FUNC(StackCopyFunc, T) \
> +    cudaTextureObject_t src_tex[4] =                    \
> +        { src_tex_0, src_tex_1, src_tex_2, src_tex_3 }; \
> +    T *dst[4] = { dst_0, dst_1, dst_2, dst_3 };         \
> +    int xo = blockIdx.x * blockDim.x + threadIdx.x;     \
> +    int yo = blockIdx.y * blockDim.y + threadIdx.y;     \
> +    if (yo >= dst_height || xo >= dst_width) return;   \
> +    StackCopyFunc(                                      \
> +        src_tex, dst, xo, yo,                           \
> +        dst_width, dst_height, dst_pitch,               \
> +        dst_x, dst_y,                                   \
> +        src_width, src_height,                          \
> +        frame_width, frame_height);
> +
> +extern "C" {
> +
> +#define COLOR_KERNEL(C, S) \
> +    __global__ void SetColor_##C##S(                                  \
> +        COLOR_KERNEL_ARGS(SetColor_##C::out_T##S))                    \
> +    {                                                                 \
> +        COLOR_FUNC(SetColor_##C::SetColor##S, SetColor_##C::out_T##S) \
> +    }
> +
> +#define COLOR_KERNEL_RAW(C) \
> +    COLOR_KERNEL(C,)        \
> +    COLOR_KERNEL(C,_uv)
> +
> +// Define color kernels for all supported formats
> +COLOR_KERNEL_RAW(yuv420p)
> +COLOR_KERNEL_RAW(nv12)
> +COLOR_KERNEL_RAW(yuv444p)
> +COLOR_KERNEL_RAW(p010le)
> +COLOR_KERNEL_RAW(p016le)
> +COLOR_KERNEL_RAW(yuv444p16le)
> +COLOR_KERNEL_RAW(rgb0)
> +COLOR_KERNEL_RAW(bgr0)
> +COLOR_KERNEL_RAW(rgba)
> +COLOR_KERNEL_RAW(bgra)
> +
> +#define COPY_KERNEL(C, S) \
> +    __global__ void StackCopy_##C##S(                      \
> +        COPY_KERNEL_ARGS(StackCopy_##C::out_T##S))                 \
> +    {                                                        \
> +        COPY_FUNC((StackCopy_##C::StackCopy##S<                \
> +                  StackCopyPixel<StackCopy_##C::in_T>,           \
> +                  StackCopyPixel<StackCopy_##C::in_T_uv> >),     \
> +                  StackCopy_##C::out_T##S)                    \
> +    }
> +
> +#define COPY_KERNEL_RAW(C) \
> +    COPY_KERNEL(C,)   \
> +    COPY_KERNEL(C,_uv)
> +
> +// Define copy kernels for all supported formats
> +COPY_KERNEL_RAW(yuv420p_yuv420p)
> +COPY_KERNEL_RAW(nv12_nv12)
> +COPY_KERNEL_RAW(yuv444p_yuv444p)
> +COPY_KERNEL_RAW(p010le_p010le)
> +COPY_KERNEL_RAW(p016le_p016le)
> +COPY_KERNEL_RAW(yuv444p16le_yuv444p16le)
> +COPY_KERNEL_RAW(rgb0_rgb0)
> +COPY_KERNEL_RAW(bgr0_bgr0)
> +COPY_KERNEL_RAW(rgba_rgba)
> +COPY_KERNEL_RAW(bgra_bgra)
> +
> +}
> \ No newline at end of file
> --
> 2.34.1
>
>