[FFmpeg-devel] [PATCH][FFmpeg-devel v2] Add GPU accelerated video crop filter

Song, Ruiling ruiling.song at intel.com
Tue Mar 26 06:19:13 EET 2019



> -----Original Message-----
> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of
> Timo Rothenpieler
> Sent: Monday, March 25, 2019 6:31 PM
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH][FFmpeg-devel v2] Add GPU accelerated
> video crop filter
> 
> On 25/03/2019 09:27, Tao Zhang wrote:
> >>> Hi,
> >>>
> >>> Timo and Mark and I have been discussing this, and we think the right
> >>> thing to do is add support to vf_scale_cuda to respect the crop
> >>> properties on an input AVFrame. Mark posted a patch to vf_crop to
> >>> ensure that the properties are set, and then the scale filter should
> >>> respect those properties if they are set. You can look at
> >>> vf_scale_vaapi for how the properties are read, but they will require
> >>> explicit handling to adjust the src dimensions passed to the scale
> >>> filter.
> > That may be a bit unintuitive for users.
> >>>
> >>> This will be a more efficient way of handling crops, in terms of total
> >>> lines of code and also allowing crop/scale with one less copy.
> >>>
> >>> I know this is quite different from the approach you've taken here, and
> >>> we appreciate the work you've done, but it should be better overall to
> >>> implement this integrated method.
> >> Hi Philip,
> >>
> >> Glad to hear you guys had a discussion on this. As I am also considering the
> problem, I have some questions about your idea.
> >> So, what if the user does not insert a scale_cuda after the crop filter? Do you plan to
> automatically insert scale_cuda or just ignore the crop?
> >> What if the user wants to do crop,transpose_cuda,scale_cuda? Do we then also need
> to handle crop inside the transpose_cuda filter?
> >
> > I have the same question.
> Ideally, scale_cuda should be auto-inserted at the required places once
> it works that way.
> Otherwise it seems pointless to me if the user still has to manually
> insert it after the generic filters setting metadata.
Agree.

> 
> For that reason it should also still support getting its parameters
> passed directly as a fallback, and potentially even expose multiple
> filter names, so crop_cuda and transpose_cuda are still visible, but
> ultimately point to the same filter code.
> 
> We have a transpose_npp, right now, but with libnpp slowly being on its
> way out, transpose_cuda is needed, and ultimately even a format_cuda
> filter, since right now scale_npp is the only filter that can convert
> pixel formats on the hardware.
> I'd also like to see scale_cuda to support a few more interpolation
> algorithms, but that's not very important for now.
> 
> All this functionality can be in the same filter, which is scale_cuda.
> The point of that is that it avoids needless expensive frame copies as
> much as possible.

Crop and transpose are just copy-like kernels, so it may be a good idea to merge them with other kernels.
But I am not sure how much overall performance gain we would get for a transcoding pipeline, and merging everything together may make the code very complex.
For example, crop+scale or crop+transpose may be easy to merge, but crop+transpose+scale or crop+transpose+scale+format would be more complex.

I want to share some of my experience developing the OpenCL scale filter (https://patchwork.ffmpeg.org/patch/11910/).
I tried to merge scaling and format conversion into one single OpenCL kernel.
But I failed to keep the code clean after supporting interpolation methods like bicubic, so I now plan to separate them into two kernels.

My experiments on scale_opencl also show that merging scaling with format conversion is not always a win.
For example, for a 1080p scale-down, merging the two operations is about 10% faster (for decode+scale), but for 4K input, merging the two kernels makes it slower.
My guess is that the different planes compete for the limited GPU cache: a scale-only kernel can process plane by plane, but for format conversion you have to read all input planes and write all output planes at the same time.
This is just a guess; I have not root-caused the real reason. But keeping scaling and format conversion in separate kernel functions seems better.

I am also thinking about this issue the other way around: could we simply do the needed copy in crop/transpose, and then optimize away one of the filters when they are neighbors, passing its options to the other while configuring the filter pipeline?
I am definitely interested in seeing the work you described happen in FFmpeg.

Thanks!
Ruiling

