[FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Steve Lhomme robux4 at ycbcr.xyz
Mon Aug 10 09:18:35 EEST 2020


On 2020-08-08 8:24, Soft Works wrote:
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Saturday, August 8, 2020 7:10 AM
>> To: ffmpeg-devel at ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
> 
> [...]
> 
>>>
>>> Hi Steven,
>>
>> Hi,
>>
>>> A while ago I had extended D3D11VA implementation to support single
>>> (non-array textures) for interoperability with Intel QSV+DX11.
>>
>> Looking at your code, it seems you are copying from an array texture to a
>> single slice texture to achieve this. With double the amount of RAM.
>> It may be a design issue with the new D3D11 API, which forces you to do
> 
> With D3D11, it's mandatory to use a staging texture, which is not only done
> in my code but also in the original implementation (hwcontext_d3d11va.c)
> https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_d3d11va.c
> 
>> that, but I'm not using that API. I'm using the old API.
> 
> I'm not sure whether I understand what you mean by this. To my knowledge
> there are two DX hw context implementations in ffmpeg:
> 
> - DXVA2
> - D3D11VA
> 
> I'm not aware of a variant like "D3D11 with old API". Could you please elaborate?

There is AV_PIX_FMT_D3D11VA_VLD (old) and AV_PIX_FMT_D3D11 (new).

>>
>>> Hence, I don't think that your patch is the best possible way.
>>
>> Removing locks and saying "it works for me" is not a correct solution either.
> 
> How did you come to the conclusion that I might be working like this?

The commented out "hwctx->lock" lines in your code.

>> At the very least, the locks are needed inside libavcodec to avoid setting DXVA
>> buffers concurrently from different threads. It will most likely result in very
>> bad distortions if not crashes. Maybe you're only using 1 decoding thread
>> with DXVA (which a lot of people do) so you don't have this issue, but this is
>> not my case.
> 
> I see no point in employing multiple threads for hw-accelerated decoding.
> To be honest, I never looked into or tried whether ffmpeg even supports
> multiple threads with dxva2 or d3d11va hw acceleration.

Maybe you're in an ideal situation where all the files you play through 
libavcodec are hardware accelerated (i.e. with matching hardware). In 
that case you don't need to care about the fallback to software 
decoding. Using a single thread for that fallback would give terrible 
performance.

Even then, there's still a chance using multiple threads might improve 
performance. All the code that prepares the buffers fed into the 
hardware decoder can run in parallel for multiple frames. With an 
insanely fast hardware decoder, that preparation code would be the 
bottleneck. In a transcoding scenario that could have an impact.

>> Also ID3D10Multithread::SetMultithreadProtected means that the resources
>> can be accessed from multiple threads. It doesn't mean that calls to
>> ID3D11DeviceContext are safe from multithreading. And my experience
>> shows that it is not. In fact if you have the Windows SDK installed and you
>> have concurrent accesses, you'll get a big warning in your debug logs that you
>> are doing something fishy. On WindowsPhone it would even crash. This is
>> how I ended up adding the mutex to the old API
>> (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).
>>
>> The documentation for ID3D11DeviceContext is very clear about that [1]:
>> "Because each ID3D11DeviceContext is single threaded, only one thread can
>> call a ID3D11DeviceContext at a time. If multiple threads must access a single
>> ID3D11DeviceContext, they must use some synchronization mechanism,
>> such as critical sections, to synchronize access to that ID3D11DeviceContext."
> 
> Yes, but this doesn't apply to accessing staging textures IIRC.

It does. To copy to a staging texture you need to use 
ID3D11DeviceContext::CopySubresourceRegion().

You probably don't have any synchronization issues in your pipeline 
because it seems you copy from GPU to CPU. That copy internally forces 
the equivalent of ID3D11DeviceContext::GetData() to make sure all the 
commands that produce your source texture on that video context have 
finished processing. You may not see it, but there's a wait happening 
there. In my case there's nothing between the decoder and the rendering 
of the texture.

> In fact, I had researched this in-depth, but I can't tell much more without
> looking into it again.
> 
> The patch I referenced is working in production on thousands of installations
> and tested with many different hardware and driver versions from Nvidia,
> Intel and AMD.

And I added the lock because, as the specs say, it's necessary. That 
solved some issues on the hundreds of millions of VLC installs running 
on Windows, on all the hardware you can think of.

Decoding 8K 60 fps HEVC was also a good stress test of the code. The 
ID3D11DeviceContext::GetData() on the rendering side ensured that frames 
were displayed at the right time and not whenever the pipeline was done 
processing the device context commands.

Now I realize the same thing should be done on the decoder side, with 
the improvements suggested earlier in this thread.

