[FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Sat Aug 8 08:09:57 EEST 2020

On 2020-08-07 23:59, Soft Works wrote:
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Friday, August 7, 2020 3:05 PM
>> To: ffmpeg-devel at ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
>>
>> I experimented a bit more with this. Here are the 3 scenarios, in order from
>> fewest to most late frames:
>>
>> - GetData waiting for 1/2s and releasing the lock
>> - No use of GetData (current code)
>> - GetData waiting for 1/2s and keeping the lock
>>
>> The last option has horrible performance issues and should not be used.
>>
>> The first option gives about 50% fewer late frames compared to the current
>> code. *But* it requires unlocking the Video Context. There are 2 problems
>> with this:
>>
>> - the same ID3D11Asynchronous is used to wait from multiple concurrent
>> threads. This can confuse D3D11, which emits a warning in the logs.
>> - another thread might Get/Release some buffers and submit them before
>> this thread is finished processing. That can result in distortions, for example if
>> the second thread/frame depends on the first thread/frame which is not
>> submitted yet.
>>
>> The former issue can be solved by using one ID3D11Asynchronous per thread.
>> That requires some TLS storage which FFmpeg doesn't seem to support yet.
>> With this I get virtually no late frames.
>>
>> The latter issue only occurs if the wait is too long. For example waiting by
>> increments of 10ms is too long in my test. Using increments of 1ms or 2ms
>> works fine in the most stressing sample I have (Sony Camping HDR HEVC high
>> bitrate). But this seems hackish. There's still potentially a quick frame (alt
>> frame in VPx/AV1 for example) that might get through to the decoder too
>> early. (I suppose that's the source of the distortions I
>> see)
>>
>> It's also possible to change the order in which the buffers are sent, by starting
>> with the bigger one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But it seems
>> to have little influence, regardless of whether we wait for buffer submission or not.
>>
>> The results are consistent between integrated GPU and dedicated GPU.
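
For reference, the first scenario quoted above (poll for completion,
releasing the lock between polls) can be modelled in portable C roughly as
follows. This is only a sketch under stated assumptions: a pthread mutex
stands in for the video context lock, and the copy_done_fn callback is a
hypothetical stand-in for polling ID3D11DeviceContext::GetData on an event
query. None of these names come from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical completion probe; in the real code this would be a
 * non-blocking GetData() call on an event query. */
typedef bool (*copy_done_fn)(void *opaque);

/* Must be called with ctx_lock held; returns with ctx_lock held.
 * Returns the number of milliseconds slept, or -1 if max_ms elapsed. */
static int wait_for_copy(pthread_mutex_t *ctx_lock, copy_done_fn done,
                         void *opaque, int step_ms, int max_ms)
{
    int waited = 0;

    assert(step_ms > 0);
    while (!done(opaque)) {
        if (waited >= max_ms)
            return -1;
        /* Release the lock so other threads can use the context while
         * we sleep; this is what opens the ordering problem described
         * in the quoted message. */
        pthread_mutex_unlock(ctx_lock);
        struct timespec ts = { .tv_sec  = step_ms / 1000,
                               .tv_nsec = (step_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
        pthread_mutex_lock(ctx_lock);
        waited += step_ms;
    }
    return waited;
}

/* Illustrative probe that reports completion after a fixed number of polls. */
typedef struct { int polls_left; } FakeQuery;

static bool fake_query_done(void *opaque)
{
    FakeQuery *q = opaque;
    if (q->polls_left > 0) {
        q->polls_left--;
        return false;
    }
    return true;
}
```

With step_ms set to 1 or 2 and max_ms set to 500 this mirrors the timings
mentioned in the quoted message. It does nothing to prevent another thread
from submitting its buffers while the lock is released.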
> 
> Hi Steven,

Hi,

> A while ago I extended the D3D11VA implementation to support single
> (non-array) textures for interoperability with Intel QSV+DX11.

Looking at your code, it seems you are copying from an array texture to
a single slice texture to achieve this, which doubles the amount of RAM
used. It may be a design issue with the new D3D11 API, which forces you
to do that, but I'm not using that API. I'm using the old API.

In my case I directly render the texture slices coming out of the 
decoder with no copying (and no extra memory allocation). It is 
happening in a different thread than the decoder thread(s).

In VLC we also support direct D3D11 to QSV encoding. It does
require a copy to "shadow" textures to feed QSV. I never managed to make 
it work without a copy.

> I noticed a few bottlenecks making D3D11VA significantly slower than DXVA2.
> 
> The solution was to use ID3D10Multithread_SetMultithreadProtected and
> remove all the locks which are currently applied.

I am also using that.

> Hence, I don't think that your patch is the best possible way.

Removing locks and saying "it works for me" is not a correct solution
either. At the very least the locks are needed inside libavcodec to
avoid setting DXVA buffers concurrently from different threads. Doing so
will most likely result in very bad distortions, if not crashes. Maybe you're
only using 1 decoding thread with DXVA (which a lot of people do) so you 
don't have this issue, but this is not my case.

Also, ID3D10Multithread::SetMultithreadProtected means that the resources
can be accessed from multiple threads. It doesn't mean that calls to
ID3D11DeviceContext are safe to make from multiple threads, and my
experience shows that they are not. In fact if you have the Windows SDK installed and
you have concurrent accesses, you'll get a big warning in your debug 
logs that you are doing something fishy. On WindowsPhone it would even 
crash. This is how I ended up adding the mutex to the old API 
(e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).

The documentation for ID3D11DeviceContext is very clear about that [1]:
"Because each ID3D11DeviceContext is single threaded, only one thread 
can call a ID3D11DeviceContext at a time. If multiple threads must 
access a single ID3D11DeviceContext, they must use some synchronization 
mechanism, such as critical sections, to synchronize access to that 
ID3D11DeviceContext."

The DXVA documentation is a lot less clear on the subject. But given
that the ID3D11VideoContext is obtained from an ID3D11DeviceContext (but
is not an ID3D11DeviceContext), it seems correct to assume it has the
same restrictions.
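
As an illustration of the kind of synchronization the documentation asks
for, here is a minimal sketch in portable C. It assumes a pthread mutex in
place of the context mutex; FakeContext and submit_buffers() are
hypothetical stand-ins, not D3D11 or libavcodec names. Every thread takes
the lock before touching the single-threaded context, so calls never
overlap.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Hypothetical stand-in for a single-threaded context such as
 * ID3D11DeviceContext or ID3D11VideoContext. */
typedef struct {
    pthread_mutex_t lock; /* plays the role of the context mutex */
    int in_use;           /* non-zero while a thread is inside the context */
    int overlaps;         /* concurrent entries detected (should stay 0) */
    long submitted;       /* number of simulated buffer submissions */
} FakeContext;

static void fake_context_init(FakeContext *ctx)
{
    pthread_mutex_init(&ctx->lock, NULL);
    ctx->in_use    = 0;
    ctx->overlaps  = 0;
    ctx->submitted = 0;
}

/* All access to the context happens between lock and unlock. */
static void submit_buffers(FakeContext *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    if (ctx->in_use)
        ctx->overlaps++;
    ctx->in_use = 1;
    ctx->submitted++;
    ctx->in_use = 0;
    pthread_mutex_unlock(&ctx->lock);
}

static void *decoder_thread(void *arg)
{
    FakeContext *ctx = arg;
    for (int i = 0; i < 1000; i++)
        submit_buffers(ctx);
    return NULL;
}

/* Runs n simulated decoder threads and returns the total submissions. */
static long run_decoder_threads(FakeContext *ctx, int n)
{
    pthread_t threads[MAX_THREADS];

    assert(n > 0 && n <= MAX_THREADS);
    for (int i = 0; i < n; i++)
        pthread_create(&threads[i], NULL, decoder_thread, ctx);
    for (int i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
    return ctx->submitted;
}
```

Running four such threads should give exactly 4000 submissions with no
overlap detected. Without the lock the counter updates race and the result
is undefined.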

[1] 
https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro