[FFmpeg-devel] [PATCH] Add NVENC encoder

Thu Nov 27 12:36:07 CET 2014

> Is it necessary to split the _api part in a separate file? The whole code is
> a bit large, but still manageable, and merging the files would avoid some
> headers overhead.

Done that primarily to keep things cleaned up and easier to read. Can as 
well put it all in one huge file.

> I think moving the frei0r rules is supposed to belong in a separate patch.

Propably, will split that out when i get to it.

> The usual coding style for structure members and variables in ffmpeg is
> names_separated_with_underscodes, not uglyCamelCase. (But I believe the
> person who will end up maintaining the file should have last word on this.)

Most of this code is ported from a C++ Project where everything had to 
be camel case, so those and the C++ malloc casts are some of the leftovers.

>> +        (*head)->surface = surface;
>> +        return;
>> +    }
>> +
>> +    while ((*head)->next)
>> +        head = &((*head)->next);
>
> This looks inefficient. Do you have an estimate of the usual size of the
> queue?

The maximum is the number of adjacent B Frames plus one. So 3 in the 
case of NVENC, unless they change the supported gop patterns.

> I suggest you have a look at the dynarray (in libavutil/mem.h and
> dynarray.h) API.
>
> If you really need linked lists, you could probably keep the final pointer
> to head in the structure to avoid walking the list every time.

I basically need a fifo like structure, where i can queue output 
surfaces until NVENC is done filling them.
An array doesn't realy reflect that usage, as new elements are added to 
the front, and taken from the back.

>> +
>
>> +    (*head)->next = av_malloc(sizeof(NvencOutputSurfaceList));
>
> av_malloc() return value needs to be checked. Other similar cases below.
>

Is a simple assert on the return value enough? Can't continue in a sane 
way anyway if it ever fails.

>> +    (*head)->next->next = 0;
>> +    (*head)->next->surface = surface;
>> +}
>> +
>
>> +static NvencOutputSurface *out_surf_queue_pop(NvencOutputSurfaceList** head)
>
> If you call this one pop instead of shift, people used to Perl will be very
> confused.

push/pop is propably not ideal naming, as it reminds too much of a 
stack, which it isn't.
I renamed it to enqueue/dequeue.

>
>> +static void timestamp_list_insert_sorted(NvencTimestampList** head, int64_t timestamp)
>
> Same as before: maybe dynarray would be more efficient, avoiding malloc()
> with its huge overhead for every insertion.
>
> Also, if the list is expected to be large, you may consider using a heap
> instead of a sorted list.

This definitely can't be replaced, its purpose isn't just a plain list, 
but sorting of the input timestamps, so the dts is still monotonic after 
re-ordering for B frames.

>> +        av_log(avctx, AV_LOG_FATAL, "Failed creating CUDA context for NVENC\n");
>
> Is there a chance of getting a more detailed error reason?

Only adding the CUDA error code, which then has to be looked up manualy.

>
>> +            av_log(avctx, AV_LOG_ERROR, "Preset \"%s\" is unknown!\n", ctx->preset);
>
> Should return an error. And if you use a table with the list of presets, you
> can dump the list.

What do you mean by that? Printing which presets are available in the 
error message?

>> +    ctx->initEncodeParams.darHeight = avctx->height;
>> +    ctx->initEncodeParams.darWidth = avctx->width;
>
> Was this tested with anamorphic videos?
>

At least i didn't test that. So i don't think anyone did.

>> +
>> +    if (!ctx->profile) {
>> +        switch (avctx->profile) {
>
>> +            case FF_PROFILE_H264_BASELINE:
>> +            ctx->profile = av_strdup("baseline");
>
> Need to check the return value.
>
> But it seems you have the private option "profile" conflicting with the
> global option "profile", which is confusing, and possibly problematic, for
> users.

Not entirely sure why i did that this way. Copied it straight from the 
libx264 encoder, without thinking too much about it. I can just set the 
profileGUID straight from that switch and can remove the second profile 
variable(which the libx264 encoder has in exactly the same conflicting 
way) entirely.

>> +    switch (lockParams.pictureType) {
>> +        case NV_ENC_PIC_TYPE_IDR:
>> +        pkt->flags |= AV_PKT_FLAG_KEY;
>> +        case NV_ENC_PIC_TYPE_I:
>> +        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_I;
>> +        break;
>> +
>> +        case NV_ENC_PIC_TYPE_P:
>> +        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_P;
>> +        break;
>> +
>> +        case NV_ENC_PIC_TYPE_B:
>> +        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_B;
>> +        break;
>> +
>> +        case NV_ENC_PIC_TYPE_BI:
>> +        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_BI;
>> +        break;
>> +
>
>> +        default:
>> +        avctx->coded_frame->pict_type = AV_PICTURE_TYPE_NONE;
>
> Does this happen normally?

Not that I'm aware of. But i don't know what else to assume in this case.

>> +        break;
>> +    }

>> +        for (i = 0; i < ctx->maxSurfaceCount; ++i)
>> +            if (!ctx->inputSurfaces[i].lockCount)
>> +                inSurf = &ctx->inputSurfaces[i];
>
> Maybe a break here.

Yes

>> +        av_assert0(inSurf);
>
> Are you positively sure that an input surface will always be available?

The maximum supported number of surfaces is allocated, if it'd ever run 
out, there'd be a bug in the code managing the surfaces.

>> +            uint8_t *buf = lockBufferParams.bufferDataPtr;
>> +
>> +            av_image_copy_plane(buf, lockBufferParams.pitch,
>> +                frame->data[0], frame->linesize[0],
>> +                avctx->width, avctx->height);
>> +
>> +            buf += inSurf->height * lockBufferParams.pitch;
>
> Could be factored out, unless I am missing something.

I could do the absolute calculation in each copy_plane call, but it 
would be way harder to read then. The compiler should take care of 
optimizing this out.

>> +        if (i == ctx->maxSurfaceCount) {
>> +            inSurf->lockCount = 0;
>
>> +            av_log(avctx, AV_LOG_ERROR, "No free output surface found!\n");
>> +            return 0;
>
> Proper error code?

Not intended to be a fatal error, the frame would just be dropped.
This case should never happen anyway, and if it does, something is very 
wrong. So probably better a fatal error here.

>> +        }
>> +
>> +        ctx->outputSurfaces[i].inputSurface = inSurf;
>> +
>> +        picParams.inputBuffer = inSurf->inputSurface;
>> +        picParams.bufferFmt = inSurf->format;
>> +        picParams.inputWidth = avctx->width;
>> +        picParams.inputHeight = avctx->height;
>> +        picParams.outputBitstream = ctx->outputSurfaces[i].outputSurface;
>> +        picParams.completionEvent = 0;
>> +
>> +        if (avctx->flags & CODEC_FLAG_INTERLACED_DCT) {
>> +            if (frame->top_field_first) {
>> +                picParams.pictureStruct = NV_ENC_PIC_STRUCT_FIELD_TOP_BOTTOM;
>> +            } else {
>> +                picParams.pictureStruct = NV_ENC_PIC_STRUCT_FIELD_BOTTOM_TOP;
>> +            }
>> +        } else {
>> +            picParams.pictureStruct = NV_ENC_PIC_STRUCT_FRAME;
>> +        }
>> +
>> +        picParams.encodePicFlags = 0;
>> +        picParams.inputTimeStamp = frame->pts;
>> +        picParams.inputDuration = 0;
>> +        picParams.codecPicParams.h264PicParams.sliceMode = ctx->encodeConfig.encodeCodecConfig.h264Config.sliceMode;
>> +        picParams.codecPicParams.h264PicParams.sliceModeData = ctx->encodeConfig.encodeCodecConfig.h264Config.sliceModeData;
>> +        memcpy(&picParams.rcParams, &ctx->encodeConfig.rcParams, sizeof(NV_ENC_RC_PARAMS));
>> +
>> +        timestamp_list_insert_sorted(&ctx->timestampList, frame->pts);
>> +    } else {
>> +        picParams.encodePicFlags = NV_ENC_PIC_FLAG_EOS;
>> +    }
>> +
>> +    nvStatus = ff_pNvEnc->nvEncEncodePicture(ctx->nvencoder, &picParams);
>> +
>> +    if (frame && nvStatus == NV_ENC_ERR_NEED_MORE_INPUT) {
>> +        out_surf_queue_push(&ctx->outputSurfaceQueue, &ctx->outputSurfaces[i]);
>> +        ctx->outputSurfaces[i].busy = 1;
>> +    }
>> +
>> +    if (nvStatus != NV_ENC_SUCCESS && nvStatus != NV_ENC_ERR_NEED_MORE_INPUT) {
>> +        av_log(avctx, AV_LOG_ERROR, "EncodePicture failed!\n");
>> +        return AVERROR_EXTERNAL;
>> +    }
>> +
>> +    if (nvStatus != NV_ENC_ERR_NEED_MORE_INPUT) {
>> +        while (ctx->outputSurfaceQueue) {
>> +            tmpoutsurf = out_surf_queue_pop(&ctx->outputSurfaceQueue);
>> +            out_surf_queue_push(&ctx->outputSurfaceReadyQueue, tmpoutsurf);
>> +        }
>> +
>> +        if (frame) {
>> +            out_surf_queue_push(&ctx->outputSurfaceReadyQueue, &ctx->outputSurfaces[i]);
>> +            ctx->outputSurfaces[i].busy = 1;
>> +        }
>> +    }
>> +
>> +    if (ctx->outputSurfaceReadyQueue) {
>> +        tmpoutsurf = out_surf_queue_pop(&ctx->outputSurfaceReadyQueue);
>> +
>> +        *got_packet = process_output_surface(avctx, pkt, avctx->coded_frame, tmpoutsurf);
>> +
>> +        tmpoutsurf->busy = 0;
>> +        av_assert0(tmpoutsurf->inputSurface->lockCount);
>> +        tmpoutsurf->inputSurface->lockCount--;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int pix_fmts_nvenc_initialized;
>> +
>> +static enum AVPixelFormat pix_fmts_nvenc[] = {
>> +    AV_PIX_FMT_NV12,
>> +    AV_PIX_FMT_NONE,
>> +    AV_PIX_FMT_NONE,
>> +    AV_PIX_FMT_NONE
>> +};
>> +
>> +static av_cold void nvenc_init_static(AVCodec *codec)
>> +{
>> +    NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS stEncodeSessionParams = { 0 };
>> +    CUcontext cuctxcur = 0, cuctx = 0;
>> +    NVENCSTATUS nvStatus;
>> +    void *nvencoder = 0;
>> +    GUID encodeGuid = NV_ENC_CODEC_H264_GUID;
>> +    GUID license = dummy_license;
>> +    int i = 0, pos = 0;
>> +    int gotnv12 = 0, got420 = 0, got444 = 0;
>> +    uint32_t inputFmtCount = 32;
>> +    NV_ENC_BUFFER_FORMAT inputFmts[32];
>> +
>> +    for (i = 0; i < 32; ++i)
>> +        inputFmts[i] = (NV_ENC_BUFFER_FORMAT)0;
>> +    i = 0;
>> +
>> +    if (pix_fmts_nvenc_initialized) {
>> +        codec->pix_fmts = pix_fmts_nvenc;
>> +        return;
>> +    }
>> +
>> +    if (!ff_nvenc_dyload_nvenc(0)) {
>> +        pix_fmts_nvenc_initialized = 1;
>> +        return;
>> +    }
>> +
>> +    stEncodeSessionParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
>> +    stEncodeSessionParams.apiVersion = NVENCAPI_VERSION;
>> +    stEncodeSessionParams.clientKeyPtr = &license;
>> +
>> +    cuctx = 0;
>
>> +    if (ff_cuCtxCreate(&cuctx, 0, ff_pNvencDevices[ff_iNvencUseDeviceID]) != CUDA_SUCCESS) {
>
> It would probably be better to get ff_cuCtxCreate() return an AVERROR code
> instead of a CUDA error code. Same for all ff_ helper functions.

ff_cuCtxCreate is a library function loaded from the CUDA dll/so.
Same for all the other ff_cu* functions, there is no way to change what 
it returns, as it's not my function.

> Some of these options are redundant with global ones; "profile" already
> cited, "2pass" = -flags +pass1/+pass2; "cqp" = "global_quality".

Profile is entirely obsolete, just copied it from x264, which also has 
that redundant option.

The twopass for nvenc means something entirely diffrent. Using the 
global option would be bad. It does not support the normal twopass 
encoding, like for example libx264 does. TWO_PASS is just another CBR 
rate control mode, which i honestly have no idea of what it does 
internaly, but it boosts the quality quite a bit.

Didn't know about that global_quality option, will use that instead.

>
>> +#define ifav_log(...) if (avctx) { av_log(__VA_ARGS__); }
>
> Looks strange: why no error message when there is no context?

Is it possible to call av_log without a context?
Only added that because of the static init function, which doesn't have 
a avctx yet.

>> +
>> +static int nvenc_dyload_cuda(AVCodecContext *avctx)
>> +{
>
>> +    if (cudaLib)
>> +        return 1;
>
> Thread safe?

No, does it need to be? Can multiple threads create the coded at the 
same time?
If so, the nvenc init functions need a mutex.

>> +        ifav_log(avctx, AV_LOG_FATAL, ">> %s - failed with error code 0x%x\n", func, err);
>
> The library does not provide error code -> string utility?

Unfortunately it doesn't. Same for CUDA and NVENC.

>> +    if (nvenc_init_count <= 0)
>> +        return;
>> +
>> +    nvenc_init_count--;
>
> This looks not thread safe.

No, it isn't. The entire load/unlock function needs to be locked if it 
has to be.

Fixed and refactored this quite a bit now. Not all issues are addressed 
yet(Also didn't test if this still works yet, so don't use this patch 
for anything).

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ffmpeg-nvenc.patch
Type: text/x-patch
Size: 44212 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20141127/369b841e/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: OpenPGP digital signature
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20141127/369b841e/attachment.asc>