[FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading

Rogozhkin, Dmitry V dmitry.v.rogozhkin at intel.com
Fri Oct 26 21:07:25 EEST 2018


On Fri, 2018-10-26 at 09:34 +0000, Li, Zhong wrote:
> > From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of Rogozhkin, Dmitry V
> > Sent: Friday, October 26, 2018 8:37 AM
> > To: ffmpeg-devel at ffmpeg.org
> > Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading
> > 
> > On Wed, 2018-10-24 at 23:47 +0100, Mark Thompson wrote:
> > > On 24/10/18 21:45, Rogozhkin, Dmitry V wrote:
> > > > On Tue, 2018-10-23 at 23:46 +0100, Mark Thompson wrote:
> > > > > On 23/10/18 08:49, Li, Zhong wrote:
> > > > > > > From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of Mark Thompson
> > > > > > > Sent: Sunday, October 14, 2018 12:36 AM
> > > > > > > To: ffmpeg-devel at ffmpeg.org
> > > > > > > Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading
> > > > > > > 
> > > > > > > On 11/10/18 06:38, Zhong Li wrote:
> > > > > > > > ffmpeg | branch: master | Zhong Li <zhong.li at intel.com> | Mon Sep 17 19:16:44 2018 +0800 | [681aa7d14f97fd98181ca6d61e11be48fe65692d]
> > > > > > > > committer: Zhong Li
> > > > > > > > 
> > > > > > > > lavu/qsv: make a copy as libmfx alignment requirement for uploading
> > > > > > > > 
> > > > > > > > Libmfx requires 16-byte-aligned input/output for uploading.
> > > > > > > > Currently only the output is 16-byte aligned; the input is
> > > > > > > > assigned the same width/height but actually has a smaller
> > > > > > > > buffer, so this will definitely cause a segfault.
> > > > > > > > 
> > > > > > > > Can reproduce with any 1080p nv12 rawvideo input:
> > > > > > > > ffmpeg -init_hw_device qsv=qsv:hw -hwaccel qsv -filter_hw_device qsv -f rawvideo -pix_fmt nv12 -s:v 1920x1080 -i 1080p_nv12.yuv -vf 'format=nv12,hwupload=extra_hw_frames=16,hwdownload,format=nv12' -an -y out_nv12.yuv
> > > > > > > > 
> > > > > > > > This fixes #7418.
> > > > > > > > 
> > > > > > > > Signed-off-by: Zhong Li <zhong.li at intel.com>
> > > > > > > > 
> > > > > > > > > http://git.videolan.org/gitweb.cgi/ffmpeg.git/?a=commit;h=681aa7d14f97fd98181ca6d61e11be48fe65692d
> > > > > > > > 
> > > > > > > > ---
> > > > > > > > 
> > > > > > > >  libavutil/hwcontext_qsv.c | 31 +++++++++++++++++++++++++++++--
> > > > > > > >  1 file changed, 29 insertions(+), 2 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/libavutil/hwcontext_qsv.c b/libavutil/hwcontext_qsv.c
> > > > > > > > index 33e121b416..814ce215ce 100644
> > > > > > > > --- a/libavutil/hwcontext_qsv.c
> > > > > > > > +++ b/libavutil/hwcontext_qsv.c
> > > > > > > > @@ -862,6 +862,10 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > > > >      mfxSyncPoint sync = NULL;
> > > > > > > >      mfxStatus err;
> > > > > > > >      int ret = 0;
> > > > > > > > +    /* make a copy if the input is not padded as libmfx requires */
> > > > > > > > +    AVFrame tmp_frame, *src_frame;
> > > > > > > > +    int realigned = 0;
> > > > > > > > +
> > > > > > > > 
> > > > > > > >      while (!s->session_upload_init && !s->session_upload && !ret) {
> > > > > > > >  #if HAVE_PTHREADS
> > > > > > > > @@ -887,16 +891,36 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > > > >      if (ret < 0)
> > > > > > > >          return ret;
> > > > > > > > 
> > > > > > > > +
> > > > > > > > +    if (src->height & 15 || src->linesize[0] & 15) {
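
The rest of the hunk is trimmed in the quote above. For readers following along, it roughly amounts to the sketch below, reconstructed from the commit message rather than quoted verbatim from the committed code (a `& 15` mask is what tests 16-byte alignment):

    /* Sketch: if the source frame is not 16-aligned, copy it into a
     * freshly allocated frame whose padded dimensions satisfy libmfx. */
    if (src->height & 15 || src->linesize[0] & 15) {
        realigned = 1;
        memset(&tmp_frame, 0, sizeof(tmp_frame));
        tmp_frame.format = src->format;
        tmp_frame.width  = FFALIGN(src->width,  16);
        tmp_frame.height = FFALIGN(src->height, 16);
        ret = av_frame_get_buffer(&tmp_frame, 0);    /* padded allocation */
        if (ret < 0)
            return ret;
        ret = av_frame_copy(&tmp_frame, src);        /* per-plane copy */
        if (ret < 0) {
            av_frame_unref(&tmp_frame);
            return ret;
        }
    }
    src_frame = realigned ? &tmp_frame : src;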
> > > > > > > 
> > > > > > > Doesn't this still need to check that the layout is compatible
> > > > > > > with the other limitations that libmfx has before trying to do
> > > > > > > the upload?  In particular, that the pitches for the different
> > > > > > > planes are actually all the same - I certainly wouldn't expect
> > > > > > > them to be for YUV420P.
> > > > > > 
> > > > > > Do you mean our previous discussion on
> > > > > > https://lists.libav.org/pipermail/libav-devel/2018-April/086070.html ?
> > > > > > If yes, I will try to reproduce it with a hwupload pipeline
> > > > > > instead of a transcoding pipeline, and then send an updated
> > > > > > patch.
> > > > > 
> > > > > Yes.  You might need to write a small test program to generate
> > > > > an arbitrary allocation pattern, since libavutil only has a fixed
> > > > > layout.
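
Something along these lines could serve as such a test program - a minimal sketch that builds an NV12 AVFrame whose planes come from two unrelated allocations with differing pitches; the helper name and the particular scatter pattern are made up for illustration:

    #include <libavutil/frame.h>
    #include <libavutil/mem.h>

    /* Build an NV12 AVFrame whose planes are deliberately "scattered":
     * two unrelated heap blocks, with the Y pitch padded by an arbitrary
     * amount so the two pitches differ.  Legal by FFmpeg's rules, but
     * unlikely to match what a HW consumer expects. */
    static AVFrame *alloc_scattered_nv12(int w, int h, int pad)
    {
        AVFrame *f = av_frame_alloc();
        if (!f)
            return NULL;
        f->format      = AV_PIX_FMT_NV12;
        f->width       = w;
        f->height      = h;
        f->linesize[0] = FFALIGN(w, 32) + pad;  /* Y pitch                  */
        f->linesize[1] = FFALIGN(w, 32);        /* UV pitch: differs from Y */
        f->data[0]     = av_malloc(f->linesize[0] * h);       /* Y plane  */
        f->data[1]     = av_malloc(f->linesize[1] * (h / 2)); /* UV plane */
        if (!f->data[0] || !f->data[1]) {
            av_freep(&f->data[0]);
            av_freep(&f->data[1]);
            av_frame_free(&f);
            return NULL;
        }
        return f;  /* note: caller must free the planes, then the frame */
    }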
> > > > > 
> > > > > > > 
> > > > > > > (IIRC there was a requirement 0 < frame->data[N] - frame->data[0] < 2^24
> > > > > > > as well (or something like that, I might be wrong), which also
> > > > > > > need not be true.)
> > > > > > 
> > > > > > I don't know of such a requirement; could it be found in the
> > > > > > MSDK guide?
> > > > > 
> > > > > That came from discussion of hardware capabilities around
> > > > > VAAPI, I
> > > > > think.
> > > > 
> > > > There are upper limits on the frame sizes which HW can support,
> > > > and these limits can differ between operations: for example, the
> > > > AVC encoder supports 4K as a maximum, while JPEG can support 16K.
> > > > This depends on the component and may depend on the generation of
> > > > the hardware. In any case I don't think any of these kinds of
> > > > limitations are something to be checked within ffmpeg during
> > > > allocation. It is the responsibility of the underlying
> > > > implementation through which you try to allocate to return an
> > > > error to ffmpeg that the allocation can not be done, or the
> > > > responsibility of the component (msdk or vaapi) to return an error
> > > > that processing of the given surface can not be done (due to
> > > > per-component limitations).
> > > 
> > > I meant real hardware capabilities around layout, which are going
> > > to
> > > be a superset of anything which can actually be reported in any
> > > particular API (see below).
> > > 
> > > > > 
> > > > > Won't something like this be the reason for wanting a
> > > > > virtually
> > > > > contiguous buffer?  If there isn't a restriction along these
> > > > > lines
> > > > > then there is no need to make the buffer virtually
> > > > > contiguous.
> > > > > (After all, the hardware can't know about C undefined
> > > > > behaviour in
> > > > > pointer comparisons...)
> > > > 
> > > > "Virtually contiguous" sounds dangerous to me. There is a
> > > > discrepancy in treating layouts of color formats in ffmpeg and
> > > > other
> > > > libraries (and drivers). First of all I did not see clear
> > > > document
> > > > describing what ffmpeg thinks layout should be. What it seems
> > > > to me
> > > > is that ffmpeg treats color formats per-plane meaning that in
> > > > NV12 Y
> > > > and UV could be allocated absolutely separately.
> > > 
> > > Correct.  FFmpeg deliberately places little constraint on the
> > > real
> > > layout of an NV12 frame - the planes need not be in any
> > > particular
> > > order or place in memory, nor need the pitch have any particular
> > > value
> > > relative to the width.  The one constraint we do have is that
> > > each
> > > line start at an address which is a multiple of some defined
> > > platform-specific alignment (and therefore plane pitches are
> > > multiples
> > > of such a value as well).
> > > 
> > > > While you work with SW components this might be ok, but HW
> > > > defines pretty specifically what it expects, and typically the
> > > > memory layout is supposed to be contiguous - I mean _physically_
> > > > contiguous.
> > > 
> > > While this is true in a lot of implementations, it isn't in the
> > > ones
> > > being considered here.  Modern Intel GPUs, including their video
> > > hardware, have an MMU with the same level of page access as the
> > > CPU
> > > does, so anything which is virtually contiguous from the CPU
> > > point of
> > > view certainly can be from the GPU as well.  Objects which are
> > > physically contiguous might be desirable in some marginal cases
> > > for
> > > performance (desirable ordering properties for RAM chips?), but
> > > certainly aren't necessary.
> > > 
> > > > Secondly, HW may have other implications for the layout,
> > > > requiring particular alignments. These could be width alignments,
> > > > i.e. pitches, or height padding alignments. As of now ffmpeg does
> > > > not seem to provide an architecture to satisfy all these needs.
> > > > This leads to overheads in terms of memory consumption, since
> > > > additional "rightly allocated" memory is required to copy frames,
> > > > and eventually this implies processing overheads as well.
> > > 
> > > These real constraints are mostly unsatisfiable for reasonable use-
> > > cases of Intel GPUs.  For example, H.264 encoding on Skylake requires
> > > Y-tiled NV12 surfaces with a number of additional constraints, such
> > > as the chroma plane having the same pitch as the luma plane and being
> > > located at a line offset of at most 2^15-1.  This particular
> > > constraint is the one I was thinking of above: given a pitch of (say)
> > > 2048 = 2^11, the chroma plane can't be offset from the luma plane by
> > > more than 2^11 * (2^15-1), i.e. just under 2^26 bytes, and in the
> > > positive direction only - that pretty much means you need to allocate
> > > them virtually-contiguously, since planes allocated normally have no
> > > reason to satisfy that (especially with ASLR enabled).  (This comes
> > > from <https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf>
> > > - MFX_SURFACE_STATE beginning page 818 is relevant.)
> > > 
> > > The Y-tiling in particular makes it very difficult for any upload
> > > case
> > > - there is no reasonable software implementation of the address
> > > swizzle needed to write Y-tiled planes.  The one possible getout
> > > here
> > > is to use the MMU to get a linear map of the Y-tiled surface, but
> > > in
> > > general this causes problems because such a mapping is uncached
> > > and
> > > therefore has very low performance for reads.  (You can still use
> > > it
> > > in FFmpeg with VAAPI if you want - the av_hwframe_map() function
> > > and
> > > the hwmap filter allow it, though it's only really usable for
> > > write-
> > > only surfaces.)
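
(For reference, a minimal sketch of that mapping path, assuming the hardware frame comes from a frames context whose sw_format matches the software frame's format:)

    #include <libavutil/frame.h>
    #include <libavutil/hwcontext.h>

    /* Upload by mapping the HW surface and writing into it with the CPU.
     * Keep it write-only: reading the mapping back is uncached and slow. */
    static int upload_via_map(AVFrame *hw, const AVFrame *sw)
    {
        AVFrame *map = av_frame_alloc();
        int ret;

        if (!map)
            return AVERROR(ENOMEM);
        ret = av_hwframe_map(map, hw, AV_HWFRAME_MAP_WRITE |
                                      AV_HWFRAME_MAP_OVERWRITE);
        if (ret < 0)
            goto end;
        ret = av_frame_copy(map, sw);   /* plain per-plane CPU copy */
    end:
        av_frame_free(&map);            /* unref also unmaps the surface */
        return ret;
    }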
> > > 
> > > Given that, we have to make a copy somewhere in almost all
> > > cases.  I
> > > believe the change made here for libmfx is done so that the GPU
> > > can
> > > perform the copy itself, lifting the input data directly from CPU
> > > memory and therefore saving on the CPU time which would be used
> > > to
> > > perform the same copy.  I can't find this particular case in the
> > > relevant source code, but I assume it will have some similar
> > > constraints to the above video case - in particular, the chroma
> > > plane
> > > location relative to the luma plane is probably constrained in
> > > the
> > > same way (I admit I've extrapolated that from the virtual-
> > > contiguity
> > > requirement, but I don't see where else it would come from).
> > > 
> > > > Here is an example from gstreamer where they describe what they
> > > > mean by color formats:
> > > > https://gstreamer.freedesktop.org/documentation/design/mediatype-video-raw.html
> > > > It would be wonderful if ffmpeg could define something similar.
> > > > Note that in gstreamer you get a clear understanding of what the
> > > > "default" NV12 memory layout is: contiguous memory where the UV
> > > > plane is located right after the end of the Y plane, with no
> > > > additional padding for width and height.
> > > 
> > > It's unclear what you are gaining from this being defined as the
> > > "default".  If that's defined in the API then it constrains the
> > > allocator to only return frames of that form (so you can't do
> > > things
> > > like optimising pitches to avoid cache way collisions), but
> > > offers
> > > little benefit to the user unless they happen to have input in
> > > exactly
> > > that form somehow (say because they read a raw file with
> > > identical
> > > layout).  (And it's irrelevant for any component, since they have
> > > to
> > > deal with the general case anyway.)
> > 
> > I did not try to make the point "let's do the allocation in this way
> > and call it default". I was trying to say something else. Let me try
> > to explain.
> > 
> > ffmpeg components may not directly implement codecs, but use some 3rd
> > party libraries underneath (which may or may not use hardware). I.e.
> > a software/hardware stack below is possible, and this stack may imply
> > certain limitations and requirements over memory layouts which may
> > not be matched by ffmpeg. Thus, I believe it is desirable for ffmpeg
> > to do 2 things:
> > 
> > 1. Document and provide a way to programmatically check what the
> > layout of the allocated memory is (what the plane sizes, relative
> > offsets, etc. are).
> > 
> > 2. Provide a way to allocate memory in the way wanted by the
> > underlying stack of one of the pipeline's components, to avoid
> > additional copy operations. For example, if you build a pipeline of
> > 2 components A->B and B requires a specific layout L for some color
> > format, but A does not care, then it is better to allocate in the B
> > fashion.
> 
> Ideally, it should be. But in fact it is hard (or impossible),
> because:
> 1. It would require all 3rd party libraries to have an API exposing
> their memory requirements. This is out of ffmpeg's control.
> 2. An FFmpeg pipeline is very flexible. The next step after A may be
> B, C or D, and B/C/D may have different memory requirements.
>   From the standpoint of A, how can it know what the next step is and
> give a perfect memory allocation?
> 

I disagree. In the first place, supporting all this flexibility is a
requirement on the framework. And there is an example - gstreamer. It
manages all that somehow. Why can't ffmpeg?

> However, the NV12/NV21 cases should be much simpler.
> Contiguous memory for NV12/NV21 is not required just by MSDK or Intel
> HW. It is a requirement of fourcc.org
> (https://www.fourcc.org/pixel-format/yuv-nv12/):
> "A format in which all Y samples are found first in memory as an
> array of unsigned char with an even number of lines (possibly with a
> larger stride for memory alignment),
> _followed immediately_ by an array of unsigned char containing
> interleaved Cb and Cr samples (such that if addressed as a little-
> endian WORD type, Cb would be in the LSBs and Cr would be in the
> MSBs) with the same total stride as the Y samples. This is the
> preferred 4:2:0 pixel format."
> 
> The key words "_followed immediately_" mean that all NV12 memory
> should be contiguous from the start of the allocation - not only for
> gstreamer, but for every application, including ffmpeg.
> 

Right. And actually that's one of the things which simplifies support
of all that by the framework. If you step out of the usual definitions
you get into trouble. It does not mean that you can't step out, but
you should carefully weigh the pros and cons and correctly select
defaults. This will greatly simplify everything.
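
To make the "usual definition" concrete, the plane offsets of a contiguous NV12 allocation fall out of the stride and height alone. A small sketch (the struct and function are mine for illustration, not an existing API):

    #include <stddef.h>

    /* Offsets implied by the fourcc.org NV12 definition: one contiguous
     * block, with the interleaved CbCr plane starting exactly where the
     * Y plane ends. */
    struct nv12_layout {
        size_t stride;      /* same stride for Y and CbCr     */
        size_t uv_offset;   /* "followed immediately": no gap */
        size_t total_size;  /* stride * height * 3 / 2        */
    };

    static struct nv12_layout nv12_contiguous(size_t stride, size_t height)
    {
        struct nv12_layout l;
        l.stride     = stride;
        l.uv_offset  = stride * height;                 /* end of Y */
        l.total_size = l.uv_offset + stride * (height / 2);
        return l;
    }

    /* e.g. 1920x1080 with a 2048-byte stride:
     * uv_offset = 2048 * 1080 = 2211840, total_size = 3317760 bytes */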

I understand that ffmpeg is huge. Still, I believe that efforts to
solidify the architecture are possible and are becoming required; at
some point they may become inevitable.

> > Yes, incompatibilities are possible and copy operations, generally
> > speaking, are inevitable. But the framework's responsibility is to
> > try to build an effective pipeline taking into account each
> > component's capabilities and preferences. Memory layout is one such
> > capability; different layouts, generally speaking, mean different
> > color formats. The number of surfaces to allocate is another example.
> > 
> > > 
> > > > I have tried to find something similar for ffmpeg, but failed
> > > > (except the ffmpeg source code of course).
> > > 
> > > The pixdesc descriptors (libavutil/pixdesc.[ch]) contain all the
> > > relevant information about pixel formats in a programmatic form, so
> > > some information about that could probably be auto-generated.
> > > (Relatedly, that gets used to test which formats are representable
> > > on an OpenCL device with a given set of capabilities (e.g. CL_R and
> > > CL_RG in CL_UNORM_INT8 -> NV12 support).  Unfortunately very few
> > > APIs present this information in a machine-readable form, so
> > > generally the formats are hard-coded from fourccs or something like
> > > that.)
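
(For example, a minimal program pulling NV12's plane structure out of pixdesc at runtime might look like this:)

    #include <stdio.h>
    #include <libavutil/pixdesc.h>

    int main(void)
    {
        const AVPixFmtDescriptor *d = av_pix_fmt_desc_get(AV_PIX_FMT_NV12);

        printf("%s: %d components, chroma subsampled %dx%d\n",
               d->name, d->nb_components,
               1 << d->log2_chroma_w, 1 << d->log2_chroma_h);
        for (int i = 0; i < d->nb_components; i++)
            printf("  comp %d: plane %d, step %d, offset %d\n",
                   i, d->comp[i].plane, d->comp[i].step, d->comp[i].offset);
        /* NV12 reports: Y on plane 0 (step 1); Cb and Cr both on plane 1
         * with step 2 and offsets 0 and 1 - i.e. interleaved. */
        return 0;
    }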
> > > 
> > > Thanks,
> > > 
> > > - Mark
> 