[FFmpeg-devel] [FFmpeg-devel, RFC] lavfi: add opencl tonemap filter.

Niklas Haas ffmpeg at haasn.xyz
Sat May 5 19:45:02 EEST 2018


Another thing that came to mind:

On Fri,  4 May 2018 15:32:58 +0800, Ruiling Song <ruiling.song at intel.com> wrote:
> +static int tonemap_opencl_filter_frame(AVFilterLink *inlink, AVFrame *input)
> +{
> +    AVFilterContext    *avctx = inlink->dst;
> +    AVFilterLink     *outlink = avctx->outputs[0];
> +    TonemapOpenCLContext *ctx = avctx->priv;
> +    AVFrame *output = NULL;
> +    cl_int cle;
> +    int err;
> +    double peak = ctx->peak;
> +
> +    AVHWFramesContext *input_frames_ctx;
> +
> +    av_log(ctx, AV_LOG_DEBUG, "Filter input: %s, %ux%u (%"PRId64").\n",
> +           av_get_pix_fmt_name(input->format),
> +           input->width, input->height, input->pts);
> +
> +    /* Check before dereferencing: hw_frames_ctx may be NULL. */
> +    if (!input->hw_frames_ctx)
> +        return AVERROR(EINVAL);
> +    input_frames_ctx = (AVHWFramesContext*)input->hw_frames_ctx->data;
> +
> +    output = ff_get_video_buffer(outlink, outlink->w, outlink->h);
> +    if (!output) {
> +        err = AVERROR(ENOMEM);
> +        goto fail;
> +    }
> +
> +    err = av_frame_copy_props(output, input);
> +    if (err < 0)
> +        goto fail;
> +
> +    if (!peak)
> +        peak = determine_signal_peak(input);
> +
> +    if (ctx->trc != -1)
> +        output->color_trc = ctx->trc;
> +    if (ctx->primaries != -1)
> +        output->color_primaries = ctx->primaries;
> +    if (ctx->colorspace != -1)
> +        output->colorspace = ctx->colorspace;
> +
> +    ctx->trc_in = input->color_trc;
> +    ctx->trc_out = output->color_trc;
> +    ctx->colorspace_in = input->colorspace;
> +    ctx->colorspace_out = output->colorspace;
> +    ctx->primaries_in = input->color_primaries;
> +    ctx->primaries_out = output->color_primaries;
> +
> +    av_assert0(((AVHWFramesContext*)output->hw_frames_ctx->data)->sw_format ==
> +               AV_PIX_FMT_NV12);
> +
> +    if (!ctx->initialised) {
> +        err = tonemap_opencl_init(avctx);
> +        if (err < 0)
> +            goto fail;
> +    }
> +
> +    switch (input_frames_ctx->sw_format) {
> +    case AV_PIX_FMT_P010:
> +        err = launch_kernel(avctx, ctx->kernel, output, input, peak);
> +        if (err < 0) goto fail;
> +        break;
> +    default:
> +        av_log(ctx, AV_LOG_ERROR, "unsupported format in tonemap_opencl.\n");
> +        err = AVERROR(ENOSYS);
> +        goto fail;
> +    }
> +
> +    cle = clFinish(ctx->command_queue);
> +    if (cle != CL_SUCCESS) {
> +        av_log(avctx, AV_LOG_ERROR, "Failed to finish command queue: %d.\n",
> +               cle);
> +        err = AVERROR(EIO);
> +        goto fail;
> +    }
> +
> +    av_frame_free(&input);
> +
> +    av_log(ctx, AV_LOG_DEBUG, "Filter output: %s, %ux%u (%"PRId64").\n",
> +           av_get_pix_fmt_name(output->format),
> +           output->width, output->height, output->pts);
> +
> +    return ff_filter_frame(outlink, output);
> +
> +fail:
> +    clFinish(ctx->command_queue);
> +    av_frame_free(&input);
> +    av_frame_free(&output);
> +    return err;
> +}

Reading this logic, it seems like each frame is processed individually
by calling clEnqueueNDRangeKernel followed by clFinish, after which the
function returns. I'm not very familiar with OpenCL, but if it behaves
anything like OpenGL and Vulkan, doesn't this design force the host to
block until the frame has finished processing and is fully resident in
host RAM again, before work on the next frame can even begin?

Generally, when batch-processing large numbers of images, it's better
to process them asynchronously to avoid pipeline stalls. That way the
GPU is free to start work on the next image while the previous one is
still being flushed from caches and DMA'd back into host RAM, so
neither direction of the transfer stalls the pipeline. Ideally, PCIe
bandwidth permitting, you want either the compute units or the DMA
engine saturated at all times. A blocking design permits neither.
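
For illustration (going off the OpenCL spec rather than hands-on
experience): clEnqueueNDRangeKernel can hand back a cl_event, and
clFlush() submits the queued work without waiting for it, so the
per-frame clFinish could in principle become a non-blocking poll of
that event later on. A rough sketch - the helper names here are made
up, not taken from the patch:

    #include <CL/cl.h>

    /* Enqueue one frame's kernel without blocking the host. "*done"
     * receives an event that is signalled once the device finishes. */
    static int enqueue_frame_async(cl_command_queue queue, cl_kernel kernel,
                                   const size_t global_work[2], cl_event *done)
    {
        cl_int cle = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                            global_work, NULL, 0, NULL, done);
        if (cle != CL_SUCCESS)
            return -1;
        clFlush(queue); /* submit to the device, but do not wait */
        return 0;
    }

    /* Non-blocking check whether an enqueued frame has completed. */
    static int frame_is_done(cl_event done)
    {
        cl_int status;
        clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        return status == CL_COMPLETE;
    }

(clSetEventCallback() from OpenCL 1.1 could avoid the polling entirely,
but polling from the filter's normal invocations is probably simpler.)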

I'm not entirely sure how that's done in OpenCL, if it's possible at
all, but normally it would revolve around some kind of fence or signal
object (presumably the cl_event above) used to regularly poll in-flight
frames, emitting them from the filter only once they're ready. In
general, this requires a filter API that allows ingesting and emitting
frames at different times - I'm sure this is possible with lavfi somehow?
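
For what it's worth, the bookkeeping I have in mind would look roughly
like the following - purely illustrative, every name is made up, and how
exactly lavfi re-invokes the filter to drain the queue is precisely the
part I'm unsure about:

    #include <CL/cl.h>
    #include "avfilter.h"
    #include "internal.h" /* for ff_filter_frame() */

    #define MAX_IN_FLIGHT 4

    /* One frame whose kernel is enqueued but not yet confirmed done. */
    typedef struct InFlight {
        AVFrame *frame; /* output frame backed by device memory */
        cl_event done;  /* signalled when its kernel completes */
    } InFlight;

    typedef struct AsyncQueue {
        InFlight q[MAX_IN_FLIGHT];
        int head, len;
    } AsyncQueue;

    /* Emit completed frames from the head of the FIFO, stopping at the
     * first unfinished one so frames leave the filter in order. */
    static int drain_completed(AsyncQueue *s, AVFilterLink *outlink)
    {
        while (s->len > 0) {
            InFlight *f = &s->q[s->head];
            cl_int status;
            int err;

            clGetEventInfo(f->done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                           sizeof(status), &status, NULL);
            if (status != CL_COMPLETE)
                break; /* head still in flight; retry on the next call */

            clReleaseEvent(f->done);
            err = ff_filter_frame(outlink, f->frame);
            s->head = (s->head + 1) % MAX_IN_FLIGHT;
            s->len--;
            if (err < 0)
                return err;
        }
        return 0;
    }

On EOF the filter would do one final clFinish() and drain whatever is
left, so at most MAX_IN_FLIGHT frames are ever buffered.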

Correct me if I'm wrong,
Niklas

