[FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Soft Works softworkz at hotmail.com
Mon Aug 10 13:04:45 EEST 2020



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Monday, August 10, 2020 8:19 AM
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 
> On 2020-08-08 8:24, Soft Works wrote:
> >
> >
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> >> Steve Lhomme
> >> Sent: Saturday, August 8, 2020 7:10 AM
> >> To: ffmpeg-devel at ffmpeg.org
> >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11
> >> buffer copies are done before submitting them
> >
[...]


> >> the very least the locks are needed inside libavcodec to avoid
> >> setting DXVA buffers concurrently from different threads. It will
> >> most likely result in very bad distortions if not crashes. Maybe
> >> you're only using 1 decoding thread with DXVA (which a lot of people
> >> do) so you don't have this issue, but this is not my case.
> >
> > I see no point in employing multiple threads for hw accelerated decoding.
> > To be honest I never looked into or tried whether ffmpeg even supports
> > multiple threads with dxva2 or d3d11va hw acceleration.
> 
> Maybe you're in an ideal situation where all the files you play through
> libavcodec are hardware accelerated (so also with matching hardware). In
> this case you don't need to care about the case where it will fallback to
> software decoding. Using a single thread in that case would have terrible
> performance.

I think we need to clarify the use cases we're talking about.

There is no "my case". All I'm talking about is D3D11VA hardware
acceleration as used through the ffmpeg.exe CLI.

You seem to have a rather special case where you are using parts of
the ffmpeg DXVA/D3D11VA code from another application (VLC?)

Did I understand that correctly?

> >> The documentation for ID3D11DeviceContext is very clear about that [1]:
> >> "Because each ID3D11DeviceContext is single threaded, only one thread
> >> can call a ID3D11DeviceContext at a time. If multiple threads must
> >> access a single ID3D11DeviceContext, they must use some
> >> synchronization mechanism, such as critical sections, to synchronize
> >> access to that ID3D11DeviceContext."
> >
> > Yes, but this doesn't apply to accessing staging textures IIRC.
> 
> It does. To copy to a staging texture you need to use
> ID3D11DeviceContext::CopySubresourceRegion().

Correct. But with DX11, once SetMultithreadProtected is enabled, it is
legal to call this from multiple threads without external
synchronization.

> You probably don't have any synchronization issues in your pipeline because
> it seems you copy from GPU to CPU. In that case it forces the
> ID3D11DeviceContext::GetData() internally to make sure all the commands
> to produce your source texture on that video context are finished
> processing. You may not see it, but there's a wait happening there. 

I've looked back through my work history and, fortunately, most of
it came back to me.

Yes, that's correct, there's a "wait happening". From your wording I
would assume that you've already realized that I was right in stating
that there's no need for external locking:

- Not for uploading 
- Not for downloading (at least not for the regular ffmpeg use case)

There is still some locking applied: Internally inside the DX11 runtime
(because we are using SetMultithreadProtected). And there's also
the "wait happening".

Let's go through an example: Downloading of a texture

1. Context_CopySubresourceRegion: Copy GPU texture to staging texture

CopySubresourceRegion is asynchronous anyway. It just puts the copy 
request into the DX11 processing queue. Using SetMultithreadProtected
avoids any race conditions, but this call always returns immediately.

2. Context_Map: Make the staging texture accessible for the CPU

When called without MapFlags, this call blocks until the texture is
mapped (and we can be sure that CopySubresourceRegion is executed
by then).

=> This is the 'wait' you've been talking about

3. av_image_copy: Copy the image from the staging texture

This obviously takes some time, since it copies the whole image.

4. Context_Unmap: Release the texture mapping

Returns immediately
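
To make the ordering of the four steps concrete, here is a toy model
in C. It is not D3D11 code: ToyContext, toy_copy, toy_map, toy_unmap
and toy_download are made-up names. toy_copy stands in for
CopySubresourceRegion and only queues the copy, toy_map stands in for
Map without MapFlags and completes the queued copy before returning,
and a plain memcpy stands in for av_image_copy. Treat it as a sketch
of the call sequence under those assumptions, nothing more.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FRAME_SIZE 16

/* Toy stand-in for a device context holding at most one queued copy. */
typedef struct {
    const unsigned char *pending_src; /* queued copy source, NULL if none */
    unsigned char *pending_dst;       /* queued copy destination */
} ToyContext;

/* Step 1 (models CopySubresourceRegion): queue the copy, return at once. */
static void toy_copy(ToyContext *ctx, unsigned char *staging,
                     const unsigned char *gpu_tex)
{
    ctx->pending_src = gpu_tex;
    ctx->pending_dst = staging;
}

/* Step 2 (models Map without flags): finish queued work for this staging
 * buffer first, then hand out the CPU pointer. This is the "wait". */
static unsigned char *toy_map(ToyContext *ctx, unsigned char *staging)
{
    assert(ctx != NULL && staging != NULL);
    if (ctx->pending_src && ctx->pending_dst == staging) {
        memcpy(staging, ctx->pending_src, FRAME_SIZE);
        ctx->pending_src = NULL;
        ctx->pending_dst = NULL;
    }
    return staging;
}

/* Step 4 (models Unmap): nothing to wait for in this model. */
static void toy_unmap(ToyContext *ctx, unsigned char *staging)
{
    (void)ctx;
    (void)staging;
}

/* Steps 1 to 4 in order; the memcpy in step 3 stands in for av_image_copy. */
static void toy_download(ToyContext *ctx, unsigned char *staging,
                         const unsigned char *gpu_tex, unsigned char *out)
{
    unsigned char *mapped;

    toy_copy(ctx, staging, gpu_tex);    /* 1: returns immediately      */
    mapped = toy_map(ctx, staging);     /* 2: blocks until copy is done */
    memcpy(out, mapped, FRAME_SIZE);    /* 3: the slow CPU-side copy    */
    toy_unmap(ctx, staging);            /* 4: returns immediately      */
}
```

In this model the staging buffer still holds its old contents right
after toy_copy returns; the data only becomes visible once toy_map has
returned.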

-----------------

We've seen that there is no locking required with regard to DX11,
but there's still one thing left: the staging texture. With a single
staging texture, concurrent downloads would overwrite each other's
data. To resolve this I'm using multiple staging textures (it's
system memory, not GPU memory).

When we look at the sequence 1 - 2 - 3 - 4, it's obvious that
it can run much faster when just the individual steps are synchronized
(by DX11) than if we put one big lock around steps 1 to 4 from
our side.
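
A rough way to picture this is the pthreads model below. Again, it is
not D3D11 code and all names (ToyContext, Worker, toy_copy_and_map,
run_downloads) are invented for illustration. The mutex inside
ToyContext stands in for the runtime's internal protection and is held
only for the duration of one call; each worker has its own staging
buffer, so the slow step 3 runs with no lock held at all.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define FRAME_SIZE  64
#define NUM_WORKERS 4

/* Toy context: "internal" models the per-call protection that the
 * runtime applies itself; callers never take it around steps 1 to 4. */
typedef struct {
    pthread_mutex_t internal;
} ToyContext;

typedef struct {
    ToyContext *ctx;
    const unsigned char *gpu_tex;       /* source "GPU" frame            */
    unsigned char staging[FRAME_SIZE];  /* this worker's own staging buf */
    unsigned char out[FRAME_SIZE];      /* final CPU-side frame          */
} Worker;

/* Steps 1 and 2 collapsed into one device call, serialized only while
 * the call itself is running. */
static void toy_copy_and_map(ToyContext *ctx, unsigned char *staging,
                             const unsigned char *gpu_tex)
{
    pthread_mutex_lock(&ctx->internal);
    memcpy(staging, gpu_tex, FRAME_SIZE);
    pthread_mutex_unlock(&ctx->internal);
}

static void *worker_main(void *arg)
{
    Worker *w = arg;

    toy_copy_and_map(w->ctx, w->staging, w->gpu_tex);
    /* Step 3: the expensive copy overlaps freely with other workers. */
    memcpy(w->out, w->staging, FRAME_SIZE);
    return NULL;
}

/* Runs NUM_WORKERS downloads concurrently and returns how many of them
 * produced the expected frame. */
static int run_downloads(void)
{
    ToyContext ctx;
    unsigned char frames[NUM_WORKERS][FRAME_SIZE];
    Worker workers[NUM_WORKERS];
    pthread_t threads[NUM_WORKERS];
    int ok = 0;

    pthread_mutex_init(&ctx.internal, NULL);
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc;

        memset(frames[i], i + 1, FRAME_SIZE);
        workers[i].ctx = &ctx;
        workers[i].gpu_tex = frames[i];
        rc = pthread_create(&threads[i], NULL, worker_main, &workers[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
        ok += memcmp(workers[i].out, frames[i], FRAME_SIZE) == 0;
    }
    pthread_mutex_destroy(&ctx.internal);
    return ok;
}
```

With one big application-side lock around the whole worker body, the
step 3 copies would run strictly one after another; here only the
short device call is serialized.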

I struggled with this for a long time, because DXVA2 was
often much faster than D3D11VA, and this kind of parallelism
was finally the way to get it working equally fast.

-----------------

It is not really obvious from the documentation that it is legal
to use CopySubresourceRegion, Map and Unmap in (pseudo-)parallel,
even on multiple indices of the same array texture.
IIRC I got one hint at this from an internal (yet public) Nvidia
presentation about DX11 and another one from the source code
of a game engine, but I haven't saved those links.

-----------------

@Steve - I don't know anything about your specific way of
using the ffmpeg code. Perhaps the above information is
useful for you in some way, but maybe those locks are unavoidable
in your case.

My only concern is that your changes do not slow down normal
ffmpeg operation, as the locks you added earlier did.
Maybe those could be made conditional?
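
One possible shape for such a condition, purely as a sketch: make the
lock callbacks optional and skip them when the application did not
supply any. ToyDevice, device_lock, device_unlock, submit_buffers and
the counting callbacks are hypothetical names for illustration and are
not taken from the patch or from the ffmpeg API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device wrapper: both callbacks may be NULL, in which
 * case no application-side locking happens at all. */
typedef struct {
    void (*lock)(void *lock_ctx);
    void (*unlock)(void *lock_ctx);
    void *lock_ctx;
} ToyDevice;

static void device_lock(const ToyDevice *dev)
{
    if (dev->lock)
        dev->lock(dev->lock_ctx);
}

static void device_unlock(const ToyDevice *dev)
{
    if (dev->unlock)
        dev->unlock(dev->lock_ctx);
}

/* Example callbacks that only track the current lock depth. */
static void count_lock(void *lock_ctx)   { ++*(int *)lock_ctx; }
static void count_unlock(void *lock_ctx) { --*(int *)lock_ctx; }

/* Hypothetical submit step: takes the lock only if one was supplied.
 * Returns the lock depth observed inside the critical section. */
static int submit_buffers(const ToyDevice *dev, const int *depth)
{
    int seen;

    assert(dev != NULL && depth != NULL);
    device_lock(dev);
    seen = *depth;   /* the real buffer submission would go here */
    device_unlock(dev);
    return seen;
}
```

An application that shares the device across threads would fill in the
callbacks, while a caller that does not need them would leave them
NULL and pay no locking cost.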

Kind regards,
softworkz

