[FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading

Li, Zhong zhong.li at intel.com
Fri Oct 26 12:34:09 EEST 2018


> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf
> Of Rogozhkin, Dmitry V
> Sent: Friday, October 26, 2018 8:37 AM
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as
> libmfx alignment requirement for uploading
> 
> On Wed, 2018-10-24 at 23:47 +0100, Mark Thompson wrote:
> > On 24/10/18 21:45, Rogozhkin, Dmitry V wrote:
> > > On Tue, 2018-10-23 at 23:46 +0100, Mark Thompson wrote:
> > > > On 23/10/18 08:49, Li, Zhong wrote:
> > > > > > From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org]
> > > > > > On
> > > > > > Behalf
> > > > > > Of Mark Thompson
> > > > > > Sent: Sunday, October 14, 2018 12:36 AM
> > > > > > To: ffmpeg-devel at ffmpeg.org
> > > > > > Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a
> > > > > > copy as libmfx alignment requirement for uploading
> > > > > >
> > > > > > On 11/10/18 06:38, Zhong Li wrote:
> > > > > > > ffmpeg | branch: master | Zhong Li <zhong.li at intel.com> |
> > > > > > > Mon Sep 17 19:16:44 2018 +0800|
> > > > > > > [681aa7d14f97fd98181ca6d61e11be48fe65692d]
> > > > > > >
> > > > > > > committer: Zhong Li
> > > > > > >
> > > > > > > lavu/qsv: make a copy as libmfx alignment requirement for
> > > > > > > uploading
> > > > > > >
> > > > > > > Libmfx requires 16-byte-aligned input/output for uploading.
> > > > > > > Currently only the output is 16-byte aligned, while the input
> > > > > > > is assigned the same width/height despite actually having a
> > > > > > > smaller buffer, which will definitely cause a segmentation
> > > > > > > fault.
> > > > > > >
> > > > > > > Can reproduce with any 1080p nv12 rawvideo input:
> > > > > > > ffmpeg -init_hw_device qsv=qsv:hw -hwaccel qsv
> > > > > > > -filter_hw_device qsv -f rawvideo -pix_fmt nv12 -s:v
> > > > > > > 1920x1080 -i 1080p_nv12.yuv -vf
> > > > > > > 'format=nv12,hwupload=extra_hw_frames=16,hwdownload,format=nv12'
> > > > > > > -an -y out_nv12.yuv
> > > > > > >
> > > > > > > It can fix #7418
> > > > > > >
> > > > > > > Signed-off-by: Zhong Li <zhong.li at intel.com>
> > > > > > >
> > > > > > > > http://git.videolan.org/gitweb.cgi/ffmpeg.git/?a=commit;h=681aa7d14f97fd98181ca6d61e11be48fe65692d
> > > > > > >
> > > > > > > ---
> > > > > > >
> > > > > > >  libavutil/hwcontext_qsv.c | 31 +++++++++++++++++++++++++++++--
> > > > > > >  1 file changed, 29 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/libavutil/hwcontext_qsv.c b/libavutil/hwcontext_qsv.c
> > > > > > > index 33e121b416..814ce215ce 100644
> > > > > > > --- a/libavutil/hwcontext_qsv.c
> > > > > > > +++ b/libavutil/hwcontext_qsv.c
> > > > > > > @@ -862,6 +862,10 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > > >      mfxSyncPoint sync = NULL;
> > > > > > >      mfxStatus err;
> > > > > > >      int ret = 0;
> > > > > > > +    /* make a copy if the input is not padded as libmfx requires */
> > > > > > > +    AVFrame tmp_frame, *src_frame;
> > > > > > > +    int realigned = 0;
> > > > > > > +
> > > > > > >
> > > > > > >      while (!s->session_upload_init && !s->session_upload && !ret) {
> > > > > > >  #if HAVE_PTHREADS
> > > > > > > @@ -887,16 +891,36 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
> > > > > > >      if (ret < 0)
> > > > > > >          return ret;
> > > > > > >
> > > > > > > +
> > > > > > > +    if (src->height & 16 || src->linesize[0] & 16) {
> > > > > >
> > > > > > Doesn't this still need to check that the layout is compatible
> > > > > > with the other limitations that libmfx has before trying to do
> > > > > > the upload?  In particular, that the pitches for the different
> > > > > > planes are actually all the same - I certainly wouldn't expect
> > > > > > them to be for YUV420P.
> > > > >
> > > > > Do you mean our previous discussion on
> > > > > https://lists.libav.org/pipermail/libav-devel/2018-April/086070.html ?
> > > > > If yes, I will try to reproduce it with a hwupload pipeline
> > > > > instead of a transcoding pipeline, and then give an updated patch.
> > > >
> > > > Yes.  You might need to write a small test program to generate
> > > > an arbitrary allocation pattern, since libavutil only has a fixed
> > > > layout.
> > > >
> > > > > >
> > > > > > (IIRC there was a requirement
> > > > > > 0 < frame->data[N] - frame->data[0] < 2^24 as well
> > > > > > (or something like that, I might be wrong), which also
> > > > > > need not be true.)
> > > > >
> > > > > I don't know of such a requirement; can it be found in the
> > > > > MSDK guide?
> > > >
> > > > That came from discussion of hardware capabilities around VAAPI, I
> > > > think.
> > >
> > > There are upper frame-size limits on what the HW can support, and
> > > these limits can differ between operations; for example, the AVC
> > > encoder supports 4K as a maximum while JPEG can support 16K. This
> > > depends on the component and may depend on the hardware generation.
> > > In any case I don't think these kinds of limitations are something
> > > to be checked within ffmpeg during allocation. It is the
> > > responsibility of the underlying implementation through which you
> > > allocate to return an error to ffmpeg that the allocation cannot
> > > be done, or the responsibility of the component (msdk or vaapi) to
> > > return an error that the given surface cannot be processed (due
> > > to per-component limitations).
> >
> > I meant real hardware capabilities around layout, which are going to
> > be a superset of anything which can actually be reported in any
> > particular API (see below).
> >
> > > >
> > > > Won't something like this be the reason for wanting a virtually
> > > > contiguous buffer?  If there isn't a restriction along these lines
> > > > then there is no need to make the buffer virtually contiguous.
> > > > (After all, the hardware can't know about C undefined behaviour in
> > > > pointer comparisons...)
> > >
> > > "Virtually contiguous" sounds dangerous to me. There is a
> > > discrepancy in treating layouts of color formats in ffmpeg and other
> > > libraries (and drivers). First of all I did not see clear document
> > > describing what ffmpeg thinks layout should be. What it seems to me
> > > is that ffmpeg treats color formats per-plane meaning that in NV12 Y
> > > and UV could be allocated absolutely separately.
> >
> > Correct.  FFmpeg deliberately places little constraint on the real
> > layout of an NV12 frame - the planes need not be in any particular
> > order or place in memory, nor need the pitch have any particular value
> > relative to the width.  The one constraint we do have is that each
> > line start at an address which is a multiple of some defined
> > platform-specific alignment (and therefore plane pitches are multiples
> > of such a value as well).
> >
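As an aside, here is a minimal sketch of how that per-line alignment model
looks from the allocator side, using the real av_frame_get_buffer() API
(the 32-byte value is just an example; passing 0 selects the platform
default):

    #include <libavutil/frame.h>

    static AVFrame *alloc_nv12(int w, int h)
    {
        AVFrame *frame = av_frame_alloc();
        if (!frame)
            return NULL;
        frame->format = AV_PIX_FMT_NV12;
        frame->width  = w;
        frame->height = h;
        /* av_frame_get_buffer() only promises that each line starts at a
         * multiple of the requested alignment; it says nothing about
         * where the UV plane sits relative to Y. */
        if (av_frame_get_buffer(frame, 32) < 0)
            av_frame_free(&frame);
        return frame;
    }
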
> > > While you work with the SW components this might be ok. But HW
> > > defines pretty specifically what it expects. And typically memory
> > > layout is supposed to be contiguous, I mean _physically_
> > > contiguous.
> >
> > While this is true in a lot of implementations, it isn't in the ones
> > being considered here.  Modern Intel GPUs, including their video
> > hardware, have an MMU with the same level of page access as the CPU
> > does, so anything which is virtually contiguous from the CPU point of
> > view certainly can be from the GPU as well.  Objects which are
> > physically contiguous might be desirable in some marginal cases for
> > performance (desirable ordering properties for RAM chips?), but
> > certainly aren't necessary.
> >
> > > Secondly, HW may have some other implications for the layout,
> > > requiring particular alignments. These could be width alignments,
> > > i.e. pitches, or height padding alignments. As of now it seems
> > > that ffmpeg doesn't provide an architecture to satisfy all these
> > > needs. This leads to overheads in terms of memory consumption,
> > > since additional "rightly allocated" memory is required to copy
> > > frames into, and eventually this implies processing overheads as
> > > well.
> >
> > These real constraints are mostly unsatisfiable for reasonable
> > use-cases of Intel GPUs.  For example, H.264 encoding on Skylake requires
> > Y-tiled NV12 surfaces with a number of additional constraints such as
> > the chroma plane having the same pitch as the luma plane and being
> > located at a line offset of at most 2^15-1.  This particular
> > constraint is the one I was thinking of above: given a pitch of (say)
> > 2048 = 2^11, the chroma plane can't be offset from the luma plane by
> > more than 2^26, and in the positive direction only - that pretty much
> > means you need to allocate them virtually-contiguously, since planes
> > allocated normally have no reason to satisfy that (especially with
> > ASLR enabled).  (This comes from
> > <https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf>
> > - MFX_SURFACE_STATE beginning page 818 is relevant.)
> >
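To make the arithmetic above concrete: with a pitch of 2^11 and a maximum
chroma line offset of 2^15 - 1, the byte offset is bounded by
2^11 * (2^15 - 1) < 2^26. A rough sketch of the corresponding layout check
for an NV12 AVFrame (with the caveat already noted above that the pointer
subtraction is only well-defined in C when both planes live in a single
allocation):

    #include <stddef.h>
    #include <libavutil/frame.h>

    /* Check the constraints described above for an NV12 frame: equal
     * luma/chroma pitches and a positive chroma offset of at most
     * 2^15 - 1 lines. */
    static int nv12_layout_ok(const AVFrame *f)
    {
        ptrdiff_t off = f->data[1] - f->data[0];
        if (f->linesize[0] != f->linesize[1])
            return 0;                 /* pitches must match */
        if (off <= 0)
            return 0;                 /* UV must follow Y   */
        return off / f->linesize[0] <= (1 << 15) - 1;
    }
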
> > The Y-tiling in particular makes it very difficult for any upload case
> > - there is no reasonable software implementation of the address
> > swizzle needed to write Y-tiled planes.  The one possible getout here
> > is to use the MMU to get a linear map of the Y-tiled surface, but in
> > general this causes problems because such a mapping is uncached and
> > therefore has very low performance for reads.  (You can still use it
> > in FFmpeg with VAAPI if you want - the av_hwframe_map() function and
> > the hwmap filter allow it, though it's only really usable for
> > write-only surfaces.)
> >
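For reference, a minimal sketch of the mapping route mentioned above, using
av_hwframe_map() with a write-only mapping (keeping in mind the caveat that
reads through such a mapping are uncached and very slow):

    #include <libavutil/error.h>
    #include <libavutil/frame.h>
    #include <libavutil/hwcontext.h>

    /* Map a hardware (e.g. VAAPI) frame so the CPU can write into it. */
    static int fill_via_map(AVFrame *hw_frame)
    {
        AVFrame *mapped = av_frame_alloc();
        int ret;
        if (!mapped)
            return AVERROR(ENOMEM);
        ret = av_hwframe_map(mapped, hw_frame,
                             AV_HWFRAME_MAP_WRITE | AV_HWFRAME_MAP_OVERWRITE);
        if (ret < 0) {
            av_frame_free(&mapped);
            return ret;
        }
        /* ... write into mapped->data[] / mapped->linesize[] here,
         * avoiding reads from the mapping ... */
        av_frame_free(&mapped);       /* unreferencing unmaps the frame */
        return 0;
    }
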
> > Given that, we have to make a copy somewhere in almost all cases.  I
> > believe the change made here for libmfx is done so that the GPU can
> > perform the copy itself, lifting the input data directly from CPU
> > memory and therefore saving on the CPU time which would be used to
> > perform the same copy.  I can't find this particular case in the
> > relevant source code, but I assume it will have some similar
> > constraints to the above video case - in particular, the chroma plane
> > location relative to the luma plane is probably constrained in the
> > same way (I admit I've extrapolated that from the virtual-contiguity
> > requirement, but I don't see where else it would come from).
> >
> > > Here is an example from gstreamer where they describe what they
> > > mean by color formats:
> > > https://gstreamer.freedesktop.org/documentation/design/mediatype-video-raw.html.
> > > It would be wonderful if ffmpeg could define something similar.
> > > Pay attention that in gstreamer you get a clear understanding of
> > > what the "default" NV12 memory layout is: it is contiguous memory
> > > where the UV plane is located right after the end of the Y plane,
> > > with no additional padding for width and height.
> >
> > It's unclear what you are gaining from this being defined as the
> > "default".  If that's defined in the API then it constrains the
> > allocator to only return frames of that form (so you can't do things
> > like optimising pitches to avoid cache way collisions), but offers
> > little benefit to the user unless they happen to have input in exactly
> > that form somehow (say because they read a raw file with identical
> > layout).  (And it's irrelevant for any component, since they have to
> > deal with the general case anyway.)
> 
> I did not try to make the point: "let's do allocation in this way and call it
> default". I was trying to say something else. Let me try to explain.
> 
> ffmpeg components may not directly implement codecs, but use some 3rd
> party libraries underneath (which may or may not use hardware). I.e. a
> software/hardware stack below ffmpeg is possible, and this stack may imply
> certain limitations and requirements on memory layouts which may not be
> matched by ffmpeg. Thus, I believe it is desirable for ffmpeg to do
> 2 things:
> 
> 1. Document and provide a way to programmatically check what is the
> layout of the allocated memory (what are planes sizes, relative offsets, etc.).
> 
> 2. Provide a way to allocate memory in a way wanted by the underlying
> stack of one of the pipeline's components to avoid additional copy
> operations. For example, if you build a pipeline of 2 components A->B and B
> requires specific layout L for some color format, but A does not care, then it
> is better to allocate in a B fashion.

Ideally, yes, it should. But in practice this is hard (or impossible), because:
1. It would require all 3rd party libraries to have an API exposing their memory requirements. This is out of ffmpeg's control.
2. An FFmpeg pipeline is very flexible. The next step after A may be B, C or D, and B/C/D may have different memory requirements.
   From the point of view of A, how can it know what the next step will be and provide a perfect memory allocation?

However, the NV12/NV21 case should be much simpler. Contiguous memory for NV12/NV21 is not just required by MSDK or Intel HW.
It is a requirement of fourcc.org (https://www.fourcc.org/pixel-format/yuv-nv12/):
"A format in which all Y samples are found first in memory as an array of unsigned char with an even number of lines (possibly with a larger stride for memory alignment),
_followed immediately_ by an array of unsigned char containing interleaved Cb and Cr samples (such that if addressed as a little-endian WORD type, Cb would be in the LSBs and Cr would be in the MSBs)
with the same total stride as the Y samples. This is the preferred 4:2:0 pixel format."

The key words "_followed immediately_" mean that every NV12 allocation should be contiguous from its start - in every application including ffmpeg, not only gstreamer.
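In code, the layout from that description is simply the following (a sketch
assuming an even number of lines and no padding between the planes beyond
the shared stride):

    #include <stdint.h>
    #include <stdlib.h>

    /* One contiguous NV12 allocation: an array of Y samples followed
     * immediately by interleaved Cb/Cr samples with the same stride. */
    static uint8_t *alloc_contiguous_nv12(int stride, int height,
                                          uint8_t **y, uint8_t **uv)
    {
        uint8_t *buf = malloc((size_t)stride * height * 3 / 2);
        if (!buf)
            return NULL;
        *y  = buf;
        *uv = buf + (size_t)stride * height;  /* "followed immediately" */
        return buf;
    }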

> Yes, incompatibilities are possible and copy operations generally speaking
> are inevitable. But the framework's responsibility is to try to build an
> effective pipeline taking care of each component's capabilities and
> preferences. Memory layout is one such capability; different layouts
> generally speaking mean different color formats. The number of surfaces
> to allocate is another example.
> 
> >
> > > I have tried to find something similar for ffmpeg, but failed
> > > (except the ffmpeg source code of course).
> >
> > The pixdesc descriptors (libavutil/pixdesc.[ch]) contain all the
> > relevant information about pixel formats in a programmatic form, so
> > some information about that could probably be auto-generated.
> > (Relatedly, that gets used to test which formats are representable on
> > an OpenCL device with a given set of capabilities (e.g. CL_R and CL_RG
> > in CL_UNORM_INT8 -> NV12 support).  Unfortunately very few APIs
> > present this information in a machine-readable form, so generally the
> > formats are hard-coded from fourccs or something like that.)
> >
> > Thanks,
> >
> > - Mark
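
As a footnote to the pixdesc point above, a minimal sketch of reading that
programmatic description for NV12 (it only prints what pixdesc already
exposes):

    #include <stdio.h>
    #include <libavutil/pixdesc.h>

    int main(void)
    {
        const AVPixFmtDescriptor *d = av_pix_fmt_desc_get(AV_PIX_FMT_NV12);
        printf("%s: %d components, chroma shift %d/%d\n",
               d->name, d->nb_components,
               d->log2_chroma_w, d->log2_chroma_h);
        for (int i = 0; i < d->nb_components; i++)
            printf("  component %d -> plane %d, step %d, offset %d\n",
                   i, d->comp[i].plane, d->comp[i].step, d->comp[i].offset);
        return 0;
    }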


