[FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

Wed Jan 31 02:25:00 EET 2018

On Fri, Jan 26, 2018 at 3:10 PM, Mark Thompson <sw at jkqxz.net> wrote:

> On 26/01/18 20:51, Ben Chang wrote:
> > On Fri, Jan 26, 2018 at 3:32 AM, Mark Thompson <sw at jkqxz.net> wrote:
> >
> >> On 26/01/18 09:06, Ben Chang wrote:
> >>> Thanks for the review Mark.
> >>>
>
> To clarify, since it is less clear now with the trimmed context: my two
> comments about this change are completely independent.  (Given that, maybe
> it should be split into two parts - one for hwcontext and one for nvenc?)

Sorry for the delay in reply Mark; been caught up by something else.

>

> This part is about the change to NVENC:
>
> >>>  There are some cuda kernels in the driver that may be invoked
> depending
> >> on
> >>> the nvenc operations specified in the commandline. My observation from
> >>> looking at the nvcc statistics is that most stack frame size for these
> >> cuda
> >>> kernels are 0 (highest observed was 120 bytes).
> >>
> >> Right, that makes sense.  If Nvidia is happy that this will always work
> in
> >> drivers compatible with this API version (including any future ones)
> then
> >> sure.
> >>
> > I am not saying this should be the "permanent" value for stack frame size
> > per GPU thread. However, at this moment (looking at existing cuda kernels
> > that devs have control over), I do not see this reduction as an issue.
>
> I think you should be confident that the chosen value here will last well
> into the future for NVENC use.  Consider that this will end up in releases
> - if a future Nvidia driver update happens to need a larger stack then all
> previous releases and binaries will stop working for all users.
>
>
> This part is about the change to the hwcontext device creation:
>
> >>>> This is technically a user-visible change, since it will apply to all
> >> user
> >>>> programs run on the CUDA context created here as well as those inside
> >>>> ffmpeg.  I'm not sure how many people actually use that, though, so
> >> maybe
> >>>> it won't affect anyone.
> >>>>
> >>> In ffmpeg, I see vf_thumbnail_cuda and vf_scale_cuda available (not
> sure
> >> if
> >>> there is more, but these two should not be affected by this reduction).
> >>> User can always raise the stack limit size if their own custom kernel
> >>> require higher stack frame size.
> >>
> >> I don't mean filters inside ffmpeg, I mean a user program which probably
> >> uses NVDEC and/or NVENC (and possibly other things) from libavcodec but
> >> then does its own CUDA processing with the same context.  This is
> silently
> >> changing the setup underneath it, and 128 feels like a very small
> number.
> >>
> > Yes, this is really a trade off between reducing memory usage (since
> there
> > are numerous complaints of high memory usage preventing having more
> ffmpeg
> > instances) and user convenience (custom cuda implementation may be
> > impacted). My thought (which can be wrong) is that users who implement
> > their own cuda kernel may have better knowledge about cuda (eg. how much
> > stack frame size their kernel needs or use cuda debugger to find out what
> > issue they may have). The size of the kernels are really implementation
> > dependent (eg, allocating arrays in stacks or heap, recursions, how much
> > register spills, etc) so stack frame sizes may vary widely. The default,
> > 1024 bytes, may not be enough at times and user will need to adjust the
> > stack limit accordingly anyway.
>
> Note that since you are changing a library the users in this context are
> all ffmpeg library users.  So, it means any program or library which uses
> ffmpeg, and transitively anyone who uses them.  The end-user need not be
> informed about CUDA at all.
>
> (What you've said also makes it sound like it can change by compiler
> version, but I guess such changes should be small.)
>
> >>>>
> >>>> If the stack limit is violated, what happens?  Will that be undefined
> >>>> behaviour with random effects (crash / incorrect results), or is it
> >> likely
> >>>> to be caught at program compile/load-time?
> >>>>
> >>> Stack will likely overflow and kernel will terminate (though I have yet
> >>> encounter this before).
> >>
> >> As long as the user gets a clear message that a stack overflow has
> >> occurred so that they can realise that they need to raise the value
> then it
> >> should be fine.
> >
> > I believe you will see stack overflow if attached to cuda debugger. But
> the
> > default error may just be kernel launch error/failure. This goes back to
> my
> > opinion that cuda developer should figure this out relatively easy if
> they
> > want to customize the cuda part of their program.
>
> Ok, sure.  It's possibly unfortunate if a user who knows nothing about
> CUDA rebuilds a program from source with newer ffmpeg libraries including
> this change because something could stop working in an opaque way, but if
> it errors out then it won't silently give the wrong answer and should be
> diagnosable by the original developer.

I have asked my colleagues about the effects of cuda kernel's stack frame
exceeding the default stack limit size. Cuda driver should automatically
increase the stack limit / allocate more memory. It will fail if there is
no more memory available.

I have done some experiments to validate this (by creating various sizes of
arrays on stack in a cuda kernel). This is based on setting cuCtxSetLimit
to 128 bytes in stack limit in ffmpeg. Having about 40000 bytes of stack
frame size in kernel will not result in failure (extra memory allocated by
driver observed in GPU-Z). Its only when I have something ridiculous like
300,000 stack frame size does the encode fail with kernel launch failure.

Therefore, I do not think this patch of reducing default stack limit will
cause regression for users (both in nvenc & hwcontext device creation).

In short, I am trying to say there is a fallback if the initial stack limit
set if not enough. This fallback only fails there is no more memory
available on gpu (in this cause, user will stumble upon this even with the
default 1024 bytes).

Thanks,
Ben