[FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Oscar Amoros Huguet oamoros at mediapro.tv
Mon May 7 18:05:11 EEST 2018


To clarify what I said in my last email: when I said CUDA non-blocking streams, I meant non-default streams. All non-blocking streams are non-default streams, but non-default streams can be blocking or non-blocking with respect to the default stream. https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
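
The distinction can be written down as a tiny model. This is only an illustrative sketch that compiles without CUDA: the two flag constants mirror the numeric values of CU_STREAM_DEFAULT and CU_STREAM_NON_BLOCKING from cuda.h, while StreamInfo and the two helper functions are hypothetical names invented for this example.

```c
#include <assert.h>

/* Same numeric values as CU_STREAM_DEFAULT and CU_STREAM_NON_BLOCKING in cuda.h. */
enum { MODEL_STREAM_DEFAULT = 0x0, MODEL_STREAM_NON_BLOCKING = 0x1 };

/* Hypothetical description of a stream, for illustration only. */
typedef struct StreamInfo {
    int      is_default_stream; /* 1 for the stream selected by handle 0 */
    unsigned create_flags;      /* creation flags, ignored for the default stream */
} StreamInfo;

/* A non-default stream is any stream other than the one selected by handle 0. */
static int model_is_non_default(const StreamInfo *s)
{
    return !s->is_default_stream;
}

/* Returns 1 if work in this stream is implicitly ordered with the
 * (legacy) default stream, 0 if it can run independently of it. */
static int model_syncs_with_default(const StreamInfo *s)
{
    if (s->is_default_stream)
        return 1; /* it is the default stream itself */
    return (s->create_flags & MODEL_STREAM_NON_BLOCKING) == 0;
}
```

In this model a stream created with the non-blocking flag is always non-default, but a non-default stream created with the default flag still synchronizes with the default stream.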

So using cuMemcpyAsync would allow the memory copies to overlap with any other copy or kernel execution enqueued in any other non-default stream. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

If cuStreamSynchronize has to be called right after the last cuMemcpyAsync call, I see several ways of implementing this, but you will most likely prefer the following:

Add cuMemcpyAsync to the list of CUDA functions.
Add a field of type CUstream to AVCUDADeviceContext, and set it to 0 (zero) by default. Let's name it "CUstream cuda_stream"?
Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as the last parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream: cuStreamSynchronize(cuda_stream);
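
The steps above could look roughly like the following. This is a sketch under stated assumptions, not FFmpeg code: the CUDA driver types and calls are replaced by local stand-ins (prefixed sketch_) that only record what was called, so the control flow compiles without a GPU. SketchCUDADeviceContext, sketch_retrieve_planes and the call log are hypothetical; only the field name cuda_stream comes from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUstream. In the real driver API this is an opaque pointer
 * and the value 0 selects the default stream. */
typedef struct SketchStreamRec *SketchStream;

/* Hypothetical, cut-down device context carrying only the proposed field. */
typedef struct SketchCUDADeviceContext {
    SketchStream cuda_stream; /* 0 by default, may be replaced by the caller */
} SketchCUDADeviceContext;

/* Call log so the ordering can be inspected without a GPU. */
enum { SKETCH_COPY = 1, SKETCH_SYNC = 2 };
#define SKETCH_MAX_CALLS 16
static int          sketch_call_kind[SKETCH_MAX_CALLS];
static SketchStream sketch_call_stream[SKETCH_MAX_CALLS];
static int          sketch_call_count;

static void sketch_record(int kind, SketchStream s)
{
    assert(sketch_call_count < SKETCH_MAX_CALLS);
    sketch_call_kind[sketch_call_count]   = kind;
    sketch_call_stream[sketch_call_count] = s;
    sketch_call_count++;
}

/* Stand-in for an asynchronous copy call: it only records the request. */
static int sketch_memcpy_async(int plane, SketchStream s)
{
    (void)plane;
    sketch_record(SKETCH_COPY, s);
    return 0;
}

/* Stand-in for cuStreamSynchronize. */
static int sketch_stream_synchronize(SketchStream s)
{
    sketch_record(SKETCH_SYNC, s);
    return 0;
}

/* Enqueue one copy per plane on the context's stream, then wait once. */
static int sketch_retrieve_planes(const SketchCUDADeviceContext *ctx, int nb_planes)
{
    for (int i = 0; i < nb_planes; i++) {
        int ret = sketch_memcpy_async(i, ctx->cuda_stream);
        if (ret != 0)
            return ret;
    }
    return sketch_stream_synchronize(ctx->cuda_stream);
}
```

With a zero-initialized context, sketch_retrieve_planes(&ctx, 3) records three copies followed by one synchronize, all on stream 0, which corresponds to the default case.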

If the user does not change the context and the stream, the behavior will be exactly the same as it is now, with no synchronization hazards, because passing "0" as the CUDA stream makes the calls blocking, as if they weren't asynchronous calls.

But if users want the copies to overlap with the rest of their application, they can set their own CUDA context and their own non-default stream.

In either case, ffmpeg does not have to handle CUDA stream creation and destruction, which keeps it simpler.
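
The ownership split could be sketched like this. Everything here is hypothetical and GPU-free: AppStream stands in for CUstream, app_stream_create and app_stream_destroy stand in for whatever the application does with cuStreamCreate and cuStreamDestroy, and the lib_ functions stand in for the library side, which only borrows the handle.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUstream, with a flag so its lifetime can be observed. */
typedef struct AppStreamRec { int alive; } *AppStream;

/* Hypothetical, cut-down device context with the proposed field. */
typedef struct LibDeviceContext {
    AppStream cuda_stream; /* borrowed from the application, never owned */
} LibDeviceContext;

static struct AppStreamRec app_stream_storage;
static int app_streams_created, app_streams_destroyed;

/* Application side: stand-in for creating a non-default stream. */
static AppStream app_stream_create(void)
{
    app_stream_storage.alive = 1;
    app_streams_created++;
    return &app_stream_storage;
}

/* Application side: stand-in for destroying that stream. */
static void app_stream_destroy(AppStream s)
{
    s->alive = 0;
    app_streams_destroyed++;
}

/* Library side: the default is stream 0, nothing is created. */
static void lib_ctx_init(LibDeviceContext *ctx)
{
    ctx->cuda_stream = NULL;
}

/* Library side: forget the handle, but do not destroy the stream. */
static void lib_ctx_uninit(LibDeviceContext *ctx)
{
    ctx->cuda_stream = NULL;
}

/* Library side: the stream a copy would be enqueued on. */
static AppStream lib_copy_stream(const LibDeviceContext *ctx)
{
    return ctx->cuda_stream;
}
```

The application creates the stream, assigns it to cuda_stream, and destroys it after the library context is gone. The library side never calls a create or destroy function.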

Hope you like it!

Oscar

-----Original Message-----
From: Oscar Amoros Huguet 
Sent: Monday, May 7, 2018 2:05 PM
To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Hi!

Even if there is a need for a synchronization before leaving the ffmpeg call, calling cuMemcpyAsync will allow the copies to overlap with any other task on the GPU that was enqueued using any other non-blocking CUDA stream. That's exactly what we want to achieve.

This would automatically benefit any other app that uses non-blocking CUDA streams as independent CUDA workflows.

Oscar

Sent from my iPhone

On 7 May 2018, at 13:54, Timo Rothenpieler <timo at rothenpieler.org> wrote:

>>> Additionally, could you give your opinion on the feature we also may
>>> want to add in the future, that we mentioned in the previous email?
>>> Basically, we may want to add one more CUDA function, specifically
>>> cuMemcpy2DAsync, and the possibility to set a CUStream in
>>> AVCUDADeviceContext, so it is used with cuMemcpy2DAsync instead of
>>> cuMemcpy2D in "nvdec_retrieve_data" in file libavcodec/nvdec.c. In our
>>> use case this would save up to 0.72 ms (GPU time) per frame, in case
>>> of decoding 8 fullhd frames, and up to 0.5 ms (GPU time) per frame, in
>>> case of decoding two 4k frames. This may sound too little, but for us
>>> it is significant. Our software needs to do many things in a maximum of
>>> 33ms with CUDA on the GPU per frame, and we have little GPU time left.
>> 
>> This is interesting and I'm considering making that the default, as 
>> it would fit well with the current infrastructure, delaying the sync 
>> call to the moment the frame leaves avcodec, which with the internal 
>> re-ordering and delay should give plenty of time for the copy to finish.
> 
> I'm not sure if/how well this works with the mapped cuvid frames though.
> The frame would already be unmapped and potentially re-used again 
> before the async copy completes. So it would need an immediate call
> to Sync right after the 3 async copy calls, making the entire effort pointless.