[FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Steve Lhomme robux4 at ycbcr.xyz
Mon Aug 10 15:15:07 EEST 2020


On 2020-08-10 12:04, Soft Works wrote:
>>>> the very least the locks are needed inside libavcodec to avoid
>>>> setting DXVA buffers concurrently from different threads. It will
>>>> most likely result in very bad distortions if not crashes. Maybe
>>>> you're only using 1 decoding thread with DXVA (which a lot of people
>>>> do) so you don't have this issue, but this is not my case.
>>>
>>> I see no point in employing multiple threads for hw accelerated decoding.
>>> To be honest I never looked into or tried whether ffmpeg even supports
>>> multiple threads with dxva2 or d3d11va hw acceleration.
>>
>> Maybe you're in an ideal situation where all the files you play through
>> libavcodec are hardware accelerated (so also with matching hardware). In
>> this case you don't need to care about the case where it will fall back to
>> software decoding. Using a single thread in that case would have terrible
>> performance.
> 
> I think we need to clarify the use cases we're talking about.
> 
> There is no "my case". All I'm talking about is using D3D11VA hardware
> acceleration using the ffmpeg.exe CLI.
> 
> You seem to have a rather special case where you are using parts of
> the ffmpeg DXVA/D3D11VA code from another application (VLC?)

I don't think there is anything special about using libavcodec in 
another application. That's where the code we're discussing is, not the 
ffmpeg CLI or ffplay. The API has to be designed to work in all host 
apps, not just these simpler use cases.

> Did I understand that correctly?
> 
>>>> The documentation for ID3D11DeviceContext is very clear about that [2]:
>>>> "Because each ID3D11DeviceContext is single threaded, only one thread
>>>> can call a ID3D11DeviceContext at a time. If multiple threads must
>>>> access a single ID3D11DeviceContext, they must use some
>>>> synchronization mechanism, such as critical sections, to synchronize
>>>> access to that ID3D11DeviceContext."
>>>
>>> Yes, but this doesn't apply to accessing staging textures IIRC.
>>
>> It does. To copy to a staging texture you need to use
>> ID3D11DeviceContext::CopySubresourceRegion().
> 
> Correct. And with DX11 and using SetMultithreadProtected it is legal
> to call this from multiple threads without synchronization.

No. I already explained it and pointed to the Microsoft documentation 
[1]. SetMultithreadProtected relates to ID3D11Device. 
ID3D11DeviceContext needs to be managed as a non-thread-safe resource.
If you want, you can even create one ID3D11DeviceContext per thread [2]. 
I'd be curious to see the effect on multithreaded decoding.

Also it seems SetMultithreadProtected() is not even needed by default. 
It enables the "thread-safe layer" [3]. But in d3d11 that's the default 
behavior. See [4] "Use this flag if your application will only call 
methods of Direct3D 11 interfaces from a single thread. By default, the 
ID3D11Device object is thread-safe."
SetMultithreadProtected() only made sense for D3D10:
"Direct3D 11 has been designed from the ground up to support 
multithreading. Direct3D 10 implements limited support for 
multithreading using the thread-safe layer."

>> You probably don't have any synchronization issues in your pipeline because
>> it seems you copy from GPU to CPU. In that case it forces the
>> ID3D11DeviceContext::GetData() internally to make sure all the commands
>> to produce your source texture on that video context are finished
>> processing. You may not see it, but there's a wait happening there.
> 
> I've looked back into my work history and, fortunately, most of it
> came back.
> 
> Yes, it's correct, there's a "wait happening". From your wording I
> would assume that you've already realized that I was right in stating
> that there's no need for an external locking:
> 
> - Not for uploading
> - Not for downloading (at least not for the regular ffmpeg use case)
> 
> There is still some locking applied: Internally inside the DX11 runtime
> (because we are using SetMultithreadProtected). And there's also
> the "wait happening".

As the doc says, you have to use some synchronization. It may work in 
your case (FFmpeg CLI I suppose). As you mentioned you only use one 
thread. There's less chance that it can fail. But copying memory to/from 
CPU/GPU is probably the slowest part of the whole decoding (hence we 
don't do any in VLC in normal playback). So if you have one decoding 
thread doing that copy and another thread reading on the same 
ID3D11DeviceContext you're likely going to hit race-condition issues. I 
don't know what FFmpeg CLI does, so I can't tell.
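
To make the "some synchronization mechanism" from the documentation concrete, here is a minimal sketch of the pattern. It is not libavcodec or D3D11 code: FakeContext is a hypothetical stand-in for a single-threaded object such as ID3D11DeviceContext, and all names (SharedContext, shared_submit, run_workers) are made up for illustration. POSIX threads are used only so the sketch runs anywhere.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

/* Hypothetical stand-in for a single-threaded object such as
 * ID3D11DeviceContext: nothing in it is atomic or thread safe. */
typedef struct FakeContext {
    long commands_submitted;
} FakeContext;

/* The context and the lock that guards it travel together. */
typedef struct SharedContext {
    FakeContext     ctx;
    pthread_mutex_t lock;
} SharedContext;

/* Plain read-modify-write: racy if two threads call it at once. */
static void fake_submit(FakeContext *ctx)
{
    ctx->commands_submitted++;
}

/* Every access to the context goes through the critical section. */
static void shared_submit(SharedContext *s)
{
    assert(s != NULL);
    pthread_mutex_lock(&s->lock);
    fake_submit(&s->ctx);
    pthread_mutex_unlock(&s->lock);
}

typedef struct WorkerArgs {
    SharedContext *shared;
    int            iterations;
} WorkerArgs;

/* One "decoding thread": submits commands to the shared context. */
static void *worker(void *opaque)
{
    WorkerArgs *args = opaque;
    for (int i = 0; i < args->iterations; i++)
        shared_submit(args->shared);
    return NULL;
}

/* Runs nb_threads workers on one shared context.
 * Returns the number of submitted commands, or -1 on failure. */
static long run_workers(int nb_threads, int iterations)
{
    SharedContext shared = { .ctx = { 0 } };
    WorkerArgs    args   = { &shared, iterations };
    pthread_t     threads[MAX_THREADS];
    int           started = 0;

    if (nb_threads < 1 || nb_threads > MAX_THREADS)
        return -1;
    if (pthread_mutex_init(&shared.lock, NULL))
        return -1;

    for (; started < nb_threads; started++)
        if (pthread_create(&threads[started], NULL, worker, &args))
            break;
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    pthread_mutex_destroy(&shared.lock);
    return started == nb_threads ? shared.ctx.commands_submitted : -1;
}
```

With the mutex in place, run_workers(4, 10000) counts every submission. Without it the increments can be lost, which is the harmless version of what happens when two threads drive the same device context.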

> Let's go through an example: Downloading of a texture
> 
> 1. Context_CopySubresourceRegion: Copy GPU texture to staging texture
> 
> CopySubresourceRegion is asynchronous anyway. It just puts the copy
> request into the DX11 processing queue. Using SetMultithreadProtected
> avoids any race conditions, but this call always returns immediately.
> 
> 2. Context_Map: Make the staging texture accessible for the CPU
> 
> When called without MapFlags, this call blocks until the texture is
> mapped (and we can be sure that CopySubresourceRegion is executed
> by then).
> 
> => This is the 'wait' you've been talking about

Yes.

> 3. av_image_copy : Copy the image from the staging texture
> 
> Takes its time for copying obviously.
> 
> 4. Context_Unmap: Release the texture mapping
> 
> Returns immediately
> 
> -----------------
> 
> We've seen that there is no locking required with regards to DX11,
> but there's still one thing left: The staging texture. To resolve this
> I'm using multiple staging textures (it's system memory, not GPU
> memory).
> 
> When we look at the sequence 1 - 2 - 3 - 4, it's obvious that
> it can run much faster when just the individual steps are synchronized
> (by DX11) than if we put one big lock around 1234 from
> our side.
> 
> I've been struggling a long time with this, because DXVA2 was
> often much faster than D3D11VA and this kind of parallelism
> was finally the way to get it working equally fast.

That's one very particular case where you do a copy to CPU. There is 
some synchronization happening in the memory mapping. But that only covers 
a small part of the possibilities of D3D11VA in libavcodec. And that's 
certainly not what I use.
You can't deduce from that usage that synchronization (access to 
ID3D11DeviceContext) is not needed. In fact the Microsoft documentation 
and my experience show the exact opposite.

>
> -----------------
> 
> It is not really obvious from the documentation that it is legal
> to use CopySubresourceRegion, Map and Unmap in (pseudo-)
> parallel even on multiple indexes of the same ArrayTexture.
> IIRC I got one hint at this from an internal (yet public) Nvidia
> presentation about DX11 and another one from the source code
> of a game engine, but I haven't saved those links.
> 
> -----------------
> 
> @Steven

My name is Steve.

> I don't know anything about your specific way of
> using the ffmpeg code.  Perhaps, the above information is
> useful for you in some way, but maybe those locks are unavoidable
> in your case.
> 
> My only concern is that your changes do not slow down normal
> ffmpeg operation - like the locks you had added earlier.
> Maybe those could be put into some condition?

My change in e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 is optional. So 
much so that it even requires using the creator helper function, to make 
sure the mutex is properly initialized to an empty value and to retain 
backward compatibility. If you don't want to use many threads you can 
safely ignore this field.

That being said, that's for the old API. I suppose the one you're 
talking about is the new API for which I have done nothing. If the mutex 
is always set, that's not my fault.

If you want the lock to have no effect, you can set the lock/unlock 
callbacks of AVD3D11VADeviceContext to functions that do nothing. If you 
don't set them, the documentation says:
"If unset on init, the hwcontext implementation will set them to use an 
internal mutex."

It's certainly better than commenting out a whole bunch of code [5].
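
A minimal sketch of what I mean by callbacks that do nothing. To keep it compilable without the FFmpeg headers, LockFields is a stand-in that mirrors only the lock, unlock and lock_ctx fields of AVD3D11VADeviceContext. The helper names (noop_lock, noop_unlock, disable_locking) are made up. As far as I understand the hwcontext code, real code would set these fields between av_hwdevice_ctx_alloc() and av_hwdevice_ctx_init(), so treat the placement as an assumption to verify.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in mirroring only the locking fields of AVD3D11VADeviceContext
 * (libavutil/hwcontext_d3d11va.h). Real code would include that header
 * and use the real struct instead. */
typedef struct LockFields {
    void (*lock)(void *lock_ctx);
    void (*unlock)(void *lock_ctx);
    void *lock_ctx;
} LockFields;

/* Callbacks that deliberately do nothing. */
static void noop_lock(void *lock_ctx)   { (void)lock_ctx; }
static void noop_unlock(void *lock_ctx) { (void)lock_ctx; }

/* Installs the no-op callbacks. Assumption: this has to happen before
 * the device context is initialized, because unset callbacks are
 * replaced by an internal mutex on init. */
static void disable_locking(LockFields *hwctx)
{
    assert(hwctx != NULL);
    hwctx->lock     = noop_lock;
    hwctx->unlock   = noop_unlock;
    hwctx->lock_ctx = NULL; /* the no-op callbacks never read it */
}
```

This is only safe if a single thread ever touches the device context. With more than one thread you want the default mutex or your own real lock.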

> Kind regards,
> softworkz

[1] 
https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro
[2] 
https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-render
[3] 
https://docs.microsoft.com/en-us/windows/win32/api/d3d10/nn-d3d10-id3d10multithread
[4] 
https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ne-d3d11-d3d11_create_device_flag
[5] 
https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257

