[FFmpeg-devel] [PATCH 20/24] sws: add a function for scaling dst slices

Anton Khirnov anton at khirnov.net
Thu Jun 10 18:49:48 EEST 2021


Quoting Michael Niedermayer (2021-06-01 15:02:27)
> On Mon, May 31, 2021 at 09:55:11AM +0200, Anton Khirnov wrote:
> > The existing sws_scale() accepts as input a user-determined slice of
> > the input data and produces an indeterminate number of output lines.
> 
> swscale() should return the number of lines output
> it does "return dstY - lastDstY;"

But the caller does not know the number of output lines beforehand.
One could assume that the line counts will always be the same for any
run with the same parameters (strictly speaking this is not
guaranteed) and record them after the first frame, but then the first
scale call cannot run in parallel. And it would be quite ugly.
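
To spell that workaround out, a rough sketch (learn_slice_outputs and
its parameter names are invented here; only the sws_scale() call is the
actual existing API):

#include <libswscale/swscale.h>

/* Sketch of the workaround described above: run the first frame
 * sequentially, recording how many output lines each input slice
 * produces. Only from the second frame on could the slices be
 * dispatched to threads, and only under the unguaranteed assumption
 * that the counts stay the same for identical parameters. */
static void learn_slice_outputs(struct SwsContext *sws,
                                const uint8_t *const src[],
                                const int src_stride[],
                                uint8_t *const dst[],
                                const int dst_stride[],
                                int src_h, int nb_slices, int out_lines[])
{
    for (int i = 0; i < nb_slices; i++) {
        int slice_y = src_h *  i      / nb_slices;
        int slice_h = src_h * (i + 1) / nb_slices - slice_y;
        /* The output line count is only known after the call returns,
         * which is exactly why this pass cannot run in parallel. */
        out_lines[i] = sws_scale(sws, src, src_stride,
                                 slice_y, slice_h, dst, dst_stride);
    }
}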

> 
> 
> > Since the calling code does not know the amount of output, it cannot
> > easily parallelize scaling by calling sws_scale() simultaneously on
> > different parts of the frame.
> > 
> > Add a new function - sws_scale_dst_slice() - that accepts as input the
> > entire input frame and produces a specified slice of the output. This
> > function can be called simultaneously on different slices of the output
> > frame (using different sws contexts) to implement slice threading.
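
For context, the intended usage looks roughly like this (the actual
prototype of sws_scale_dst_slice() is in this patch; the signature,
SliceJob and worker() below are only indicative):

#include <libswscale/swscale.h>

/* Indicative sketch of slice threading with the new call. Each worker
 * owns its own SwsContext and fills one caller-chosen slice of the
 * output frame. */
typedef struct SliceJob {
    struct SwsContext    *sws;         /* per-thread context        */
    const uint8_t *const *src;         /* the entire input frame    */
    const int            *src_stride;
    uint8_t       *const *dst;
    const int            *dst_stride;
    int dst_slice_y, dst_slice_h;      /* output slice to produce   */
} SliceJob;

static void *worker(void *arg)
{
    SliceJob *job = arg;
    /* Entire input in, one specified output slice out, so all
     * workers can run simultaneously on the same frame. */
    sws_scale_dst_slice(job->sws, job->src, job->src_stride,
                        job->dst, job->dst_stride,
                        job->dst_slice_y, job->dst_slice_h);
    return NULL;
}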
> 
> an API that would allow starting before the whole frame is available
> would have reduced latency and better cache locality. Maybe that can
> be added later too, but I wanted to mention it because the documentation
> explicitly says "entire input"

That would require some way of querying how much input is required for
each output line. I do not feel sufficiently familiar with the sws
architecture to see an obvious way of implementing this. And then
making use of this information would require a significantly more
sophisticated way of dispatching work to threads.
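
To illustrate what such a query would have to do (an entirely
hypothetical sketch; nothing like this exists in sws), it would map an
output slice back to the input lines the vertical filter touches:

#include <stdint.h>

/* Hypothetical: maps a dst slice to the src line range a vertical
 * filter with filter_size taps would read, assuming a plain linear
 * scale. Real swscale (chroma subsampling, per-line filter positions)
 * would be considerably messier. */
static void src_range_for_dst_slice(int src_h, int dst_h, int filter_size,
                                    int dst_slice_y, int dst_slice_h,
                                    int *src_y, int *src_lines)
{
    int first = (int)((int64_t)dst_slice_y * src_h / dst_h)
                - filter_size / 2;
    int last  = (int)((int64_t)(dst_slice_y + dst_slice_h) * src_h / dst_h)
                + (filter_size + 1) / 2;
    if (first < 0)     first = 0;
    if (last  > src_h) last  = src_h;
    *src_y     = first;
    *src_lines = last - first;
}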

Or are you proposing some specific alternative way of implementing this?

> 
> Also there are a few tables in the multiple SwsContexts which are
> identical, it would be ideal if they could be shared between threads.
> I guess such sharing would need to be implemented before the API is
> stable, otherwise adding it later would require applications to be changed

In my tests, the memory overhead is rather small. E.g. scaling
2500x3000 -> 3000x3000 with 32 threads uses only ~15% more memory than
with 1 thread.

And I do not see an obvious way to implement this that would be worth
the extra complexity. Do you?
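
To be concrete about where I see the complexity: the refcounting
mechanics themselves are not the problem, AVBufferRef already gives us
those (sketch below; SwsSharedTables and its fields are invented for
illustration). The hard part is untangling which pieces of SwsContext
are actually safe to share.

#include <errno.h>
#include <libavutil/buffer.h>
#include <libavutil/error.h>

/* Invented for illustration; no such struct exists in swscale today.
 * The idea would be to move the parameter-dependent coefficient tables
 * behind one refcounted allocation shared by all per-thread contexts. */
typedef struct SwsSharedTables {
    int16_t *h_lum_filter;   /* horizontal luma coefficients */
    int16_t *v_lum_filter;   /* vertical luma coefficients   */
    /* ... chroma filters, filter positions, etc. */
} SwsSharedTables;

static int share_tables_sketch(void)
{
    /* The first context allocates the tables once... */
    AVBufferRef *tables = av_buffer_allocz(sizeof(SwsSharedTables));
    if (!tables)
        return AVERROR(ENOMEM);

    /* ...and every further per-thread context takes a reference
     * instead of recomputing its own copy. */
    AVBufferRef *ref = av_buffer_ref(tables);
    if (!ref) {
        av_buffer_unref(&tables);
        return AVERROR(ENOMEM);
    }

    /* On teardown each context just drops its reference. */
    av_buffer_unref(&ref);
    av_buffer_unref(&tables);
    return 0;
}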

-- 
Anton Khirnov

