[FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Soft Works softworkz at hotmail.com
Thu Aug 13 02:01:02 EEST 2020


> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Wednesday, August 12, 2020 2:05 PM
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 
> On 2020-08-11 12:43, Steve Lhomme wrote:
> >>> Sorry if you seem to know all the answers already, but I don't and
> >>> so I have to investigate.
> >>
> >> Last year, I had literally worked this down to death. I followed
> >> every slightest hint from countless searches, read through hundreds
> >> of discussions, driven because I was unwilling to believe that
> >> up-/downloading of video textures with
> >> D3D11 can't be done equally fast as with D3D9.
> >> (the big picture was the implementation of D3D11 support for
> >> QuickSync where the slowdown played a much bigger role than with
> >> D3D11VA decoders only).
> >> Eventually I landed at some internal Nvidia presentation, some talks
> >> with MS guys and some source code discussion deep inside a 3D game
> >> engine (not a no-name). It really bugs me that I didn't properly note
> >> the references, but from somewhere in between I was able to gather
> >> solid evidence about what is legal to do and what is not. Based on
> >> that, followed several iterations to find the optimal way for doing
> >> the texture transfer. As I had implemented
> >> D3D11 support for QuickSync, this got pretty complicated because with
> >> a full transcoding pipeline, all parts (decoder, encoder and filters)
> >> can (and usually will) request textures. Only the latest Intel
> >> Drivers can work with array textures everywhere (e.g. VPP), so I also
> >> needed to add support for non-array texture allocation. The patch
> >> you've seen is the result of weeks of intensive work (a small but
> >> crucial part of it) - even when it may not look like that.
> >>
> >>
> >>> Sorry if you seem to know all the answers already
> >>
> >> Obviously, I don't know all the answers, but all the answers I have
> >> given were correct. And when I didn't have an answer I always
> >> respectfully said that your situation might be different.
> >> And I didn't reply by implying that you would have done your work by
> >> trial-and-error or most likely invalid assumptions or deductions.
> >>
> >>
> >> I still don't know how you are actually operating this and thus I
> >> also cannot tell what might or might not work in your case.
> >> All I can tell is that the procedure that I have described (1-2-3-4)
> >> can work rock-solid for multi-threaded DX11 texture transfer when
> >> it's done in the same way as I've shown.
> >> And believe it or not - I would still be happy if it were of
> >> any use to you...
> >
> > Even though the discussion is heated (fitting with the weather here) I
> > don't mind. I learned some stuff and it pushed me to dig deeper. I
> > can't just accept your word for it. I need something solid if I'm
> > going to remove a lock that helped me so far.
> >
> > So I'm currently tooling VLC to be able to bring the decoder to its
> > knees and find out what it can and cannot do safely. So far I can
> > still see decoding artifacts when I don't use a lock, which would mean I
> > still need the mutex, for the reasons given in the previous mail.
> 
> A follow-up on this. Using ID3D10Multithread seems to be enough to have
> a mostly thread-safe ID3D11Device/ID3D11DeviceContext/etc. Even the
> decoding with its odd API seems to know what to do when submitted
> different buffers.
> 
> I did not manage to saturate the GPU, but I got a much higher decoding
> speed/throughput to validate the errors I got before. Many of them were
> due to VLC dropping data because of odd timing.
> 
> Now I still have some threading issues. For example, for deinterlacing we
> create an ID3D11VideoProcessor to handle it. And we create it after
> decoding has started (as deinterlacing can be enabled/disabled
> dynamically). Without the mutex in the decoder it crashes in
> ID3D11VideoDevice::CreateVideoProcessor() and
> ID3D11VideoContext::SubmitDecoderBuffers() as they are being called
> simultaneously. If I add the mutex between the decoder and just this filter
> (not the rendering side) it works fine.
> 
> So I guess I'm stuck with the mutex for the time being.

At an earlier stage I had considered adding those video processors as
ffmpeg hardware filters, but due to the vast number of different use
cases, platforms and hw accelerations we support, I decided that we do
all filtering either on the CPU or in the hw context of the encoder,
but never in the hw context of the decoder. So I don't have any
experience with D3D11 video processors.

Maybe a too obvious idea: how about taking the mutex only for the short
time it takes to create the video processor?
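
Something like this untested sketch is what I have in mind - the struct
and its fields are made up just for illustration, only the
CreateVideoProcessor() call itself is real:

    #define COBJMACROS
    #include <windows.h>
    #include <d3d11.h>

    /* Hypothetical context: d3d_lock is meant to be the same lock the
     * decode thread takes around SubmitDecoderBuffers(). Nothing here
     * is actual VLC or ffmpeg code. */
    typedef struct filter_sys
    {
        CRITICAL_SECTION                 d3d_lock;
        ID3D11VideoDevice               *video_device;
        ID3D11VideoProcessorEnumerator  *enumerator;
        ID3D11VideoProcessor            *processor;
    } filter_sys;

    static HRESULT CreateDeinterlacer(filter_sys *sys, UINT rate_conv_index)
    {
        HRESULT hr;

        /* Take the shared lock only for the creation call, so the
         * steady-state decode loop stays free of it. */
        EnterCriticalSection(&sys->d3d_lock);
        hr = ID3D11VideoDevice_CreateVideoProcessor(sys->video_device,
                                                    sys->enumerator,
                                                    rate_conv_index,
                                                    &sys->processor);
        LeaveCriticalSection(&sys->d3d_lock);
        return hr;
    }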

----

Regarding the subject in general, I'm now able to fill in some additional
blanks (I'm doing too many different things, and my RAM is limited ;-) -
specifically on the question "Why isn't this information more clearly
documented?"

Short answer: Because "our" use case is too unusual.

While searching for new or old evidence for my statement, I had found
a book about DX11 that talks about using
ID3D10Multithread::SetMultithreadProtected. It said something along
the lines that this will work, but that it only locks a small part,
and that this would not be sufficient because it does not guarantee
ordered execution; therefore a D3D11 application will always need to
do its own locking on the application side.

Obviously, that wouldn't have been a good reference to use in the 
discussion :-)

Here comes the twist: All those books and documentation are about
writing 3D applications. And 3D applications are very different from
"our" kind of applications: Those applications may execute dozens
or hundreds of commands between two frames. And many of those
commands are sequences that need to be executed in order.

And that explains why:
- The MSDN docs suggest application-side locking
- The SetMultithreadProtected method initially didn't have a DX11 version
- The later added ID3D11Multithread interface doesn't only have
  the SetMultithreadProtected method, but also the Enter and
  Leave methods (see the sketch below)
- The DX11 book said that SetMultithreadProtected alone would not be
  sufficient
- The documentation of SetMultithreadProtected in the new
  interface states that it may add unwanted overhead
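
For illustration, this is roughly how a 3D-style application would keep
such a sequence atomic with the Enter/Leave pair - a sketch only,
assuming a runtime that exposes ID3D11Multithread (d3d11_4.h) on the
immediate context:

    #define COBJMACROS
    #include <initguid.h>
    #include <d3d11_4.h>

    /* Keep a multi-call sequence atomic against other threads that use
     * the same device/context. Only needed for the "hundreds of calls
     * between two frames" style of application described above. */
    static void run_ordered_sequence(ID3D11DeviceContext *imm_ctx)
    {
        ID3D11Multithread *mt = NULL;

        if (SUCCEEDED(ID3D11DeviceContext_QueryInterface(imm_ctx,
                                                         &IID_ID3D11Multithread,
                                                         (void **)&mt)))
        {
            ID3D11Multithread_Enter(mt);   /* take the device's internal lock */
            /* ...several dependent calls that must not interleave with
             * another thread's work on the same device/context... */
            ID3D11Multithread_Leave(mt);   /* and release it again */
            ID3D11Multithread_Release(mt);
        }
    }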

In our case, we're doing nothing more than one or two calls
per frame (instead of hundreds). That's why we don't need to care
about any of the bullet points above.
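
For our simple case, a one-time setup is enough - if I remember
correctly this is essentially what hwcontext_d3d11va.c already does
when creating the device (a sketch from memory, not a verbatim quote of
the ffmpeg source):

    #define COBJMACROS
    #include <initguid.h>
    #include <d3d11.h>
    #include <d3d10misc.h>  /* ID3D10Multithread; exact header may vary by SDK */

    /* Turn on the runtime's built-in per-call protection once, right
     * after device creation. With only one or two calls per frame, the
     * overhead mentioned in the docs doesn't matter. */
    static void enable_multithread_protection(ID3D11Device *device)
    {
        ID3D10Multithread *mt = NULL;

        if (SUCCEEDED(ID3D11Device_QueryInterface(device,
                                                  &IID_ID3D10Multithread,
                                                  (void **)&mt)))
        {
            ID3D10Multithread_SetMultithreadProtected(mt, TRUE);
            ID3D10Multithread_Release(mt);
        }
    }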

Unfortunately, nobody cared about us when writing the docs.

Kind regards,
softworkz




