[FFmpeg-devel] [PATCH 20/24] sws: add a function for scaling dst slices

Anton Khirnov anton at khirnov.net
Fri Jun 11 20:16:17 EEST 2021


Quoting Michael Niedermayer (2021-06-11 17:01:20)
> On Thu, Jun 10, 2021 at 05:49:48PM +0200, Anton Khirnov wrote:
> > Quoting Michael Niedermayer (2021-06-01 15:02:27)
> > > On Mon, May 31, 2021 at 09:55:11AM +0200, Anton Khirnov wrote:
> > > > Currently existing sws_scale() accepts as input a user-determined slice
> > > > of input data and produces an indeterminate number of output lines.
> > > 
> > > swscale() should return the number of lines output
> > > it does "return dstY - lastDstY;"
> > 
> > But you do not know the number of lines beforehand.
> > I suppose one could assume that the line counts will always be the same
> > for any run with the same parameters (strictly speaking this is not
> > guaranteed) and store them after the first frame, but then the first
> > scale call is not parallel. And it would be quite ugly.
> > 
> 
> > > 
> > > 
> > > > Since the calling code does not know the amount of output, it cannot
> > > > easily parallelize scaling by calling sws_scale() simultaneously on
> > > > different parts of the frame.
> > > > 
> > > > Add a new function - sws_scale_dst_slice() - that accepts as input the
> > > > entire input frame and produces a specified slice of the output. This
> > > > function can be called simultaneously on different slices of the output
> > > > frame (using different sws contexts) to implement slice threading.
> > > 
> > > an API that would allow starting before the whole frame is available
> > > would have reduced latency and better cache locality. Maybe that can
> > > be added later too, but I wanted to mention it because the documentation
> > > explicitly says "entire input"
> > 
> > That would require some way of querying how much input is required for
> > each line. I do not feel sufficiently familiar with the sws architecture to
> > see an obvious way of implementing this. And then making use of this
> > information would require a significantly more sophisticated way of
> > dispatching work to threads.
> 
> hmm, isn't the filter calculated by initFilter() (for the vertical stuff)
> basically listing the input/output relation?
> (with some special cases like cascaded_context, maybe)
> it's been a while since I worked on swscale, so maybe I am forgetting something
> 
> Maybe that can be (easily) used ?

The logic in the loop over lines in swscale() is not exactly clear, but
I guess I could figure that out by staring at it a bit longer. But the
bigger question is still what to do with this information.

Submitting all the slices at once to execute() is simple and we already
have infrastructure for that. Submitting slices dynamically as they
become available would require significantly more work and I am not sure
that the gains are worth it.

> 
> > 
> > Or are you proposing some specific alternative way of implementing this?
> > 
> > > 
> > > Also, there are a few tables in the multiple SwsContexts which are
> > > identical; it would be ideal if they could be shared between threads.
> > > I guess such sharing would need to be implemented before the API is
> > > stable, otherwise adding it later would require applications to be changed
> > 
> > In my tests, the differences are rather small. E.g. scaling
> > 2500x3000->3000x3000 with 32 threads uses only ~15% more memory than
> > with 1 thread.
> > 
> > And I do not see an obvious way to implement this that would be worth
> > the extra complexity. Do you?
> 
> Well, don't we, for every case of threading in the codebase,
> cleanly split the context into one thread-local part and one shared part?

Certainly not for every case. E.g. frame threading in libavcodec spawns
several (almost) independent decoders internally.

> I certainly will not dispute that it's work to do that. But we
> did it in every case because it's the "right thing" to do for a
> clean implementation. So I think we should aim toward that here too.
> But maybe I am missing something?

Depends on how you define "clean" in this case. And a related question
is whether the threading should be inside swscale itself or not.

This patchset takes the route of adapting sws to allow external
slice threading. This way callers can integrate it into their existing
threading solutions, as I'm doing for vf_scale in lavfi.
One could claim that this solution is cleaner in that the individual
contexts are completely independent, so the callers are free to thread
them in any way they like.

But you could also take the position that swscale should implement slice
threading internally as a just-works black box. That would be
- significantly more work
- easier to use for people calling sws directly
- more cumbersome to integrate into lavfi

Beyond that, are you aware of any specific large constant objects that
should be shared? I suppose it should be simple enough to make them
refcounted and add a new SwsContext constructor that would take
references to these objects.

-- 
Anton Khirnov

