[FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading

Rogozhkin, Dmitry V dmitry.v.rogozhkin at intel.com
Fri Oct 26 03:36:30 EEST 2018


On Wed, 2018-10-24 at 23:47 +0100, Mark Thompson wrote:
> On 24/10/18 21:45, Rogozhkin, Dmitry V wrote:
> > On Tue, 2018-10-23 at 23:46 +0100, Mark Thompson wrote:
> > > On 23/10/18 08:49, Li, Zhong wrote:
> > > > > From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of Mark Thompson
> > > > > Sent: Sunday, October 14, 2018 12:36 AM
> > > > > To: ffmpeg-devel at ffmpeg.org
> > > > > Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading
> > > > > 
> > > > > On 11/10/18 06:38, Zhong Li wrote:
> > > > > > ffmpeg | branch: master | Zhong Li <zhong.li at intel.com> | Mon Sep 17 19:16:44 2018 +0800| [681aa7d14f97fd98181ca6d61e11be48fe65692d]
> > > > > > 
> > > > > > committer: Zhong Li
> > > > > > 
> > > > > > lavu/qsv: make a copy as libmfx alignment requirement for
> > > > > > uploading
> > > > > > 
> > > > > > Libmfx requires 16-byte-aligned input/output for uploading.
> > > > > > Currently only the output is 16-byte aligned, while the
> > > > > > input is assigned the same width/height but actually has a
> > > > > > smaller buffer, which will definitely cause a segmentation
> > > > > > fault.
> > > > > > 
> > > > > > Can reproduce with any 1080p nv12 rawvideo input:
> > > > > > ffmpeg -init_hw_device qsv=qsv:hw -hwaccel qsv -filter_hw_device qsv
> > > > > > -f rawvideo -pix_fmt nv12 -s:v 1920x1080 -i 1080p_nv12.yuv
> > > > > > -vf 'format=nv12,hwupload=extra_hw_frames=16,hwdownload,format=nv12'
> > > > > > -an -y out_nv12.yuv
> > > > > > 
> > > > > > It can fix #7418
> > > > > > 
> > > > > > Signed-off-by: Zhong Li <zhong.li at intel.com>
> > > > > > 
> > > > > > > 
> > > > > 
> > > > > > http://git.videolan.org/gitweb.cgi/ffmpeg.git/?a=commit;h=681aa7d14f97fd98181ca6d61e11be48fe65692d
> > > > > > 
> > > > > > ---
> > > > > > 
> > > > > >  libavutil/hwcontext_qsv.c | 31 +++++++++++++++++++++++++++++--
> > > > > >  1 file changed, 29 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/libavutil/hwcontext_qsv.c
> > > > > > b/libavutil/hwcontext_qsv.c
> > > > > > index 33e121b416..814ce215ce 100644
> > > > > > --- a/libavutil/hwcontext_qsv.c
> > > > > > +++ b/libavutil/hwcontext_qsv.c
> > > > > > @@ -862,6 +862,10 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > >      mfxSyncPoint sync = NULL;
> > > > > >      mfxStatus err;
> > > > > >      int ret = 0;
> > > > > > +    /* make a copy if the input is not padded as libmfx requires */
> > > > > > +    AVFrame tmp_frame, *src_frame;
> > > > > > +    int realigned = 0;
> > > > > > +
> > > > > > 
> > > > > >      while (!s->session_upload_init && !s->session_upload && !ret) {
> > > > > >  #if HAVE_PTHREADS
> > > > > > @@ -887,16 +891,36 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > >      if (ret < 0)
> > > > > >          return ret;
> > > > > > 
> > > > > > +
> > > > > > +    if (src->height & 16 || src->linesize[0] & 16) {
> > > > > 
> > > > > Doesn't this still need to check that the layout is compatible
> > > > > with the other limitations that libmfx has before trying to do
> > > > > the upload?  In particular, that the pitches for the different
> > > > > planes are actually all the same - I certainly wouldn't expect
> > > > > them to be for YUV420P.
> > > > 
> > > > Do you mean our previous discussion at
> > > > https://lists.libav.org/pipermail/libav-devel/2018-April/086070.html ?
> > > > If yes, I will try to reproduce it with an hwupload pipeline instead
> > > > of a transcoding pipeline, and then give an updated patch.
> > > 
> > > Yes.  You might need to write a small test program to generate an
> > > arbitrary allocation pattern, since libavutil only has a fixed
> > > layout.
> > > 
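Something along these lines might do as such a test; the function is
hypothetical and only shows the idea of building an NV12 AVFrame out of
separately allocated planes with unrelated pitches, rather than the
fixed layout av_frame_get_buffer() produces:

    #include "libavutil/frame.h"
    #include "libavutil/mem.h"
    #include "libavutil/pixfmt.h"

    /* Hypothetical test helper: the frame is not refcounted, so the two
     * plane buffers have to be freed by the caller, not av_frame_free().
     * Error handling omitted for brevity. */
    static AVFrame *make_odd_nv12(int w, int h)
    {
        AVFrame *f = av_frame_alloc();

        f->format      = AV_PIX_FMT_NV12;
        f->width       = w;
        f->height      = h;
        f->linesize[0] = w + 64;    /* deliberately unrelated pitches */
        f->linesize[1] = w + 128;
        f->data[0]     = av_malloc((size_t)f->linesize[0] * h);
        f->data[1]     = av_malloc((size_t)f->linesize[1] * (h / 2));
        return f;
    }

Feeding such frames into av_hwframe_transfer_data() should exercise the
cases where the layout cannot be handed to libmfx directly.
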
> > > > > 
> > > > > (IIRC there was a requirement 0 < frame->data[N] - frame->data[0]
> > > > > < 2^24 as well (or something like that, I might be wrong), which
> > > > > also need not be true.)
> > > > 
> > > > I don't know of such a requirement; could it be found in the MSDK
> > > > guide?
> > > 
> > > That came from discussion of hardware capabilities around VAAPI,
> > > I
> > > think.
> > 
> > There are upper limits on the frame sizes which HW can support, and
> > these limits can differ between operations: for example, the AVC
> > encoder supports 4K as a maximum while JPEG can support 16K. This
> > depends on the component and may depend on the generation of the
> > hardware. In any case, I don't think any of these kinds of
> > limitations are something to be checked within ffmpeg during
> > allocation. It is the responsibility of the underlying implementation
> > through which you try to allocate to return an error to ffmpeg that
> > the allocation cannot be done, or the responsibility of the component
> > (msdk or vaapi) to return an error that processing of the given
> > surface cannot be done (due to per-component limitations).
> 
> I meant real hardware capabilities around layout, which are going to
> be a superset of anything which can actually be reported in any
> particular API (see below).
> 
> > > 
> > > Won't something like this be the reason for wanting a virtually
> > > contiguous buffer?  If there isn't a restriction along these
> > > lines
> > > then there is no need to make the buffer virtually
> > > contiguous.  (After all, the hardware can't know about C
> > > undefined
> > > behaviour in pointer comparisons...)
> > 
> > "Virtually contiguous" sounds dangerous to me. There is a
> > discrepancy in how ffmpeg and other libraries (and drivers) treat
> > the layouts of color formats. First of all, I did not see a clear
> > document describing what ffmpeg thinks the layout should be. It
> > seems to me that ffmpeg treats color formats per-plane, meaning that
> > in NV12 the Y and UV planes could be allocated completely
> > separately.
> 
> Correct.  FFmpeg deliberately places little constraint on the real
> layout of an NV12 frame - the planes need not be in any particular
> order or place in memory, nor need the pitch have any particular
> value relative to the width.  The one constraint we do have is that
> each line start at an address which is a multiple of some defined
> platform-specific alignment (and therefore plane pitches are
> multiples of such a value as well).
> 
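To make that contract concrete: a caller allocating through libavutil
only gets to ask for an alignment (or let the library pick one), and must
not assume anything further about pitches or plane placement. A minimal
sketch, with error handling omitted:

    #include "libavutil/frame.h"
    #include "libavutil/pixfmt.h"

    AVFrame *f = av_frame_alloc();
    f->format = AV_PIX_FMT_NV12;
    f->width  = 1920;
    f->height = 1080;
    /* align == 0 lets libavutil pick the platform alignment; beyond the
     * per-line alignment there is no guarantee about pitch values or
     * where the planes end up relative to each other. */
    av_frame_get_buffer(f, 0);
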
> >                                  While you work with the SW
> > components
> > this might be ok. But HW defines pretty specifically what it
> > expects.
> > And typically memory layout is supposed to be contiguous, I mean
> > _physically_ contiguous.
> 
> While this is true in a lot of implementations, it isn't in the ones
> being considered here.  Modern Intel GPUs, including their video
> hardware, have an MMU with the same level of page access as the CPU
> does, so anything which is virtually contiguous from the CPU point of
> view certainly can be from the GPU as well.  Objects which are
> physically contiguous might be desirable in some marginal cases for
> performance (desirable ordering properties for RAM chips?), but
> certainly aren't necessary.
> 
> >                          Secondly, HW may have some other
> > implications
> > over layout requiring particular alignments. These could be width
> > alignments, i.e. pitches. Or height padding alignments. As of now
> > it
> > seems that ffmpeg don't provide an architecture to satisfy all
> > these
> > needs. This leads to overheads in terms of memory consumption since
> > additional "rightly allocated" memory is required to copy frames
> > and
> > eventually this implies processing overheads as well.
> 
> These real constraints are mostly unsatisfiable for reasonable use-
> cases of Intel GPUs.  For example, H.264 encoding on Skylake requires
> Y-tiled NV12 surfaces with a number of additional constraints such as
> the chroma plane having the same pitch as the luma plane and being
> located at a line offset of at most 2^15-1.  This particular
> constraint is the one I was thinking of above: given a pitch of (say)
> 2048 = 2^11, the chroma plane can't be offset from the luma plane by
> more than 2^26, and in the positive direction only - that pretty much
> means you need to allocate them virtually-contiguously, since planes
> allocated normally have no reason to satisfy that (especially with
> ASLR enabled).  (This comes from
> <https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf>
> - MFX_SURFACE_STATE beginning page 818 is relevant.)
> 
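Spelling out the arithmetic in that example (the numbers are the
illustrative ones from the paragraph above, not quoted from the PRM):

    /* a common pitch shared by the luma and chroma planes, in bytes */
    int     pitch      = 2048;              /* 2^11                     */
    /* chroma start expressed as a line offset from the luma start      */
    int     max_lines  = (1 << 15) - 1;     /* at most 2^15 - 1 lines   */
    /* maximum byte distance between the planes: just under 2^26        */
    int64_t max_offset = (int64_t)pitch * max_lines;
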
> The Y-tiling in particular makes it very difficult for any upload
> case - there is no reasonable software implementation of the address
> swizzle needed to write Y-tiled planes.  The one possible getout here
> is to use the MMU to get a linear map of the Y-tiled surface, but in
> general this causes problems because such a mapping is uncached and
> therefore has very low performance for reads.  (You can still use it
> in FFmpeg with VAAPI if you want - the av_hwframe_map() function and
> the hwmap filter allow it, though it's only really usable for write-
> only surfaces.)
> 
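For reference, that mapping path looks roughly like this from the
libavutil API side; hw_frame here stands for an already-allocated VAAPI
frame, and error handling is omitted:

    #include "libavutil/hwcontext.h"

    AVFrame *map = av_frame_alloc();

    /* Ask for a write-only mapping; OVERWRITE tells the implementation
     * that the previous surface contents do not need to be preserved.
     * Reading back through such a mapping is allowed but, as noted,
     * usually very slow because the mapping is uncached. */
    if (av_hwframe_map(map, hw_frame,
                       AV_HWFRAME_MAP_WRITE | AV_HWFRAME_MAP_OVERWRITE) >= 0) {
        /* ... fill map->data[] / map->linesize[] ... */
        av_frame_free(&map);    /* unreffing the mapped frame unmaps it */
    }
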
> Given that, we have to make a copy somewhere in almost all cases.  I
> believe the change made here for libmfx is done so that the GPU can
> perform the copy itself, lifting the input data directly from CPU
> memory and therefore saving on the CPU time which would be used to
> perform the same copy.  I can't find this particular case in the
> relevant source code, but I assume it will have some similar
> constraints to the above video case - in particular, the chroma plane
> location relative to the luma plane is probably constrained in the
> same way (I admit I've extrapolated that from the virtual-contiguity
> requirement, but I don't see where else it would come from).
> 
> > Here is an example from gstreamer where they describe what they
> > mean by color formats:
> > https://gstreamer.freedesktop.org/documentation/design/mediatype-video-raw.html.
> > It would be wonderful if ffmpeg could define something similar. Note
> > that in gstreamer it is clear what the "default" NV12 memory layout
> > is: it is contiguous memory where the UV plane is located right
> > after the end of the Y plane, with no additional padding for width
> > and height.
> 
> It's unclear what you are gaining from this being defined as the
> "default".  If that's defined in the API then it constrains the
> allocator to only return frames of that form (so you can't do things
> like optimising pitches to avoid cache way collisions), but offers
> little benefit to the user unless they happen to have input in
> exactly that form somehow (say because they read a raw file with
> identical layout).  (And it's irrelevant for any component, since
> they have to deal with the general case anyway.)

I did not try to make the point "let's do allocation in this way and
call it the default". I was trying to say something else; let me try to
explain.

ffmpeg components may not implement codecs directly, but use some 3rd
party libraries underneath (which may or may not use hardware). I.e. a
software/hardware stack below ffmpeg is possible, and this stack may
imply certain limitations and requirements on memory layouts which may
not be matched by ffmpeg. Thus, I believe it is desirable for ffmpeg to
do 2 things:

1. Document, and provide a way to programmatically check, what the
layout of the allocated memory is (plane sizes, relative offsets,
etc.); see the sketch after this list.

2. Provide a way to allocate memory in the way wanted by the underlying
stack of one of the pipeline's components, to avoid additional copy
operations. For example, if you build a pipeline of 2 components A->B
and B requires a specific layout L for some color format, but A does
not care, then it is better to allocate in B's fashion.
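
As a rough sketch of point 1 (names and values are only illustrative):
libavutil can already be asked what layout it would pick, and the result
could then be compared against what the underlying stack accepts:

    #include <stddef.h>
    #include "libavutil/imgutils.h"
    #include "libavutil/mem.h"

    uint8_t *data[4];
    int      linesize[4];
    int      size = av_image_alloc(data, linesize,
                                   1920, 1080, AV_PIX_FMT_NV12, 32);

    if (size >= 0) {
        /* linesize[] holds the pitches, and the plane pointers give the
         * relative offsets of the layout libavutil chose */
        ptrdiff_t uv_offset = data[1] - data[0];
        /* ... check uv_offset and linesize[] against the requirements of
         * the underlying component (libmfx, VAAPI, ...) ... */
        av_freep(&data[0]);
    }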

Yes, incompatibilities are possible and copy operations are, generally
speaking, inevitable. But it is the framework's responsibility to try
to build an efficient pipeline, taking care of each component's
capabilities and preferences. Memory layout is one such capability;
different layouts, generally speaking, mean different color formats.
The number of surfaces to allocate is another example.

> 
> > I have tried to find something similar for ffmpeg, but failed
> > (except
> > the ffmpeg source code of course).
> 
> The pixdesc descriptors (libavutil/pixdesc.[ch]) contain all the
> relevant information about pixel formats in a programmatic form, so
> some information about that could probably be auto-
> generated.  (Relatedly, that gets used to test which formats are
> representable on an OpenCL device with a given set of capabilities
> (e.g. CL_R and CL_RG in CL_UNORM_INT8 -> NV12
> support).  Unfortunately very few APIs present this information in a
> machine-readable form, so generally the formats are hard-coded from
> fourccs or something like that.)
> 
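For example, everything needed to describe NV12 programmatically is
already in the descriptor; a small sketch printing the fields mentioned
above:

    #include <stdio.h>
    #include "libavutil/pixdesc.h"

    const AVPixFmtDescriptor *d = av_pix_fmt_desc_get(AV_PIX_FMT_NV12);
    int i;

    printf("%s: subsampling %dx%d\n",
           d->name, 1 << d->log2_chroma_w, 1 << d->log2_chroma_h);
    for (i = 0; i < d->nb_components; i++)
        printf("  component %d: plane %d, step %d, depth %d\n",
               i, d->comp[i].plane, d->comp[i].step, d->comp[i].depth);
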
> Thanks,
> 
> - Mark
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

