[FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Steve Lhomme robux4 at ycbcr.xyz
Fri Aug 14 10:12:18 EEST 2020


On 2020-08-13 1:01, Soft Works wrote:
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Wednesday, August 12, 2020 2:05 PM
>> To: ffmpeg-devel at ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
>>
>> On 2020-08-11 12:43, Steve Lhomme wrote:
>>>>> Sorry if you seem to know all the answers already, but I don't and
>>>>> so I have to investigate.
>>>>
>>>> Last year, I had literally worked this to death. I followed
>>>> every slightest hint from countless searches, read through hundreds
>>>> of discussions, driven by my unwillingness to believe that
>>>> up-/downloading of video textures with
>>>> D3D11 can't be done as fast as with D3D9.
>>>> (the big picture was the implementation of D3D11 support for
>>>> QuickSync where the slowdown played a much bigger role than with
>>>> D3D11VA decoders only).
>>>> Eventually I landed at some internal Nvidia presentation, some talks
>>>> with MS guys and some source code discussion deep inside a 3D game
>>>> engine (not a no-name). It really bugs me that I didn't properly note
>>>> the references, but from somewhere in between I was able to gather
>>>> solid evidence about what is legal to do and what is not. Based on
>>>> that, followed several iterations to find the optimal way for doing
>>>> the texture transfer. As I had implemented
>>>> D3D11 support for QuickSync, this got pretty complicated because with
>>>> a full transcoding pipeline, all parts (decoder, encoder and filters)
>>>> can (and usually will) request textures. Only the latest Intel
>>>> Drivers can work with array textures everywhere (e.g. VPP), so I also
>>>> needed to add support for non-array texture allocation. The patch
>>>> you've seen is the result of weeks of intensive work (a small but
>>>> crucial part of it) - even if it may not look like it.
>>>>
>>>>
>>>>> Sorry if you seem to know all the answers already
>>>>
>>>> Obviously, I don't know all the answers, but all the answers I have
>>>> given were correct. And when I didn't have an answer I always
>>>> respectfully said that your situation might be different.
>>>> And I didn't reply by implying that you had done your work by
>>>> trial-and-error or on assumptions or deductions that were most likely invalid.
>>>>
>>>>
>>>> I still don't know how you are actually operating this and thus I
>>>> also cannot tell what might or might not work in your case.
>>>> All I can tell is that the procedure that I have described (1-2-3-4)
>>>> can work rock-solid for multi-threaded DX11 texture transfer when
>>>> it's done in the same way as I've shown.
>>>> And believe it or not - I would still be happy if it were of
>>>> any use to you...
>>>
>>> Even though the discussion is heated (fitting with the weather here) I
>>> don't mind. I learned some stuff and it pushed me to dig deeper. I
>>> can't just accept your word for it. I need something solid if I'm
>>> going to remove a lock that has helped me so far.
>>>
>>> So I'm currently tooling VLC to be able to bring the decoder to its
>>> knees and find out what it can and cannot do safely. So far I can
>>> still see decoding artifacts when I don't use a lock, which would mean I
>>> still need the mutex, for the reasons given in the previous mail.
>>
>> A follow-up on this. Using ID3D10Multithread seems to be enough to have
>> mostly thread-safe ID3D11Device/ID3D11DeviceContext/etc. Even the
>> decoding with its odd API seems to know what to do when submitted
>> different buffers.
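For reference, this is roughly what enabling that protection looks like.
An untested sketch, error handling trimmed, and the helper name
(enable_multithread_protection) is made up:

#define COBJMACROS
#include <initguid.h>
#include <d3d11.h>
#include <d3d10.h>   /* ID3D10Multithread comes from the D3D10 headers */

static HRESULT enable_multithread_protection(ID3D11Device *device)
{
    ID3D11DeviceContext *ctx = NULL;
    ID3D10Multithread   *mt  = NULL;
    HRESULT hr;

    /* For D3D11 the documented way is to query ID3D10Multithread from
     * the immediate device context. */
    ID3D11Device_GetImmediateContext(device, &ctx);
    hr = ID3D11DeviceContext_QueryInterface(ctx, &IID_ID3D10Multithread,
                                            (void **)&mt);
    if (SUCCEEDED(hr)) {
        /* From here on the runtime serializes access to the device
         * context, which is what makes concurrent submissions from
         * several threads "mostly" safe. */
        ID3D10Multithread_SetMultithreadProtected(mt, TRUE);
        ID3D10Multithread_Release(mt);
    }
    ID3D11DeviceContext_Release(ctx);
    return hr;
}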
>>
>> I did not manage to saturate the GPU, but I got much higher decoding
>> speed/throughput, enough to verify the errors I got before. Many of them were
>> due to VLC dropping data because of odd timing.
>>
>> Now I still have some threading issues. For example, for deinterlacing we
>> create an ID3D11VideoProcessor to handle the deinterlacing, and we create it
>> after decoding has started (as deinterlacing can be enabled/disabled
>> dynamically). Without the mutex in the decoder it crashes on
>> ID3D11VideoDevice::CreateVideoProcessor() and
>> ID3D11VideoContext::SubmitDecoderBuffers() as they are being called
>> simultaneously. If I add the mutex between the decoder and just this filter
>> (not the rendering side) it works fine.
>>
>> So I guess I'm stuck with the mutex for the time being.
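To make the above concrete: the workaround is just that the filter
thread takes the same lock the decoder holds around its submissions. A
hypothetical sketch with made-up names (shared_lock, create_deinterlacer,
decode_submit); only the locking matters here:

#define COBJMACROS
#include <windows.h>
#include <d3d11.h>

static SRWLOCK shared_lock = SRWLOCK_INIT;

/* Filter thread: the video processor is created while decoding is
 * already running. */
static HRESULT create_deinterlacer(ID3D11VideoDevice *vdev,
                                   ID3D11VideoProcessorEnumerator *vpe,
                                   ID3D11VideoProcessor **out)
{
    HRESULT hr;
    AcquireSRWLockExclusive(&shared_lock);
    hr = ID3D11VideoDevice_CreateVideoProcessor(vdev, vpe, 0, out);
    ReleaseSRWLockExclusive(&shared_lock);
    return hr;
}

/* Decoder thread: every submission takes the same lock, so the two
 * calls can never run concurrently on the same device. */
static HRESULT decode_submit(ID3D11VideoContext *vctx,
                             ID3D11VideoDecoder *decoder, UINT count,
                             const D3D11_VIDEO_DECODER_BUFFER_DESC *buffers)
{
    HRESULT hr;
    AcquireSRWLockExclusive(&shared_lock);
    hr = ID3D11VideoContext_SubmitDecoderBuffers(vctx, decoder, count,
                                                 buffers);
    ReleaseSRWLockExclusive(&shared_lock);
    return hr;
}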
> 
> At an earlier stage I had considered the idea of adding those video
> processors as ffmpeg hardware filters, but due to the vast number of
> different use cases, platforms and hw accelerations we support,
> I made the decision that we do all filtering either on the CPU or in the
> hw context of the encoder, but never in the hw context of the decoder,
> so I don't have any experience with DX11 video processors.
> 
> Maybe too obvious an idea: how about activating the mutex only for
> a short time while the video processor is being added?

This doesn't seem feasible, even with a callback system. You don't know 
when it's safe to enable/disable it.

By the way, the origin of the mutex was on Windows Phones. It's probably 
related to the fact that some phones only decode to 
DXGI_FORMAT_420_OPAQUE, which cannot be used for rendering. The only way 
to use the decoded surface is to convert it (to NV12) via a 
VideoProcessor. So in that case the VideoProcessor was always used, even 
for basic decoding.
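For completeness, the conversion itself is a single VideoProcessorBlt
from a view on the decoded slice to a view on an NV12 texture. Untested
sketch; the function name is made up and it assumes the processor and
enumerator were already created for the right frame size:

#define COBJMACROS
#include <d3d11.h>

static HRESULT blit_to_nv12(ID3D11VideoDevice *vdev,
                            ID3D11VideoContext *vctx,
                            ID3D11VideoProcessor *vp,
                            ID3D11VideoProcessorEnumerator *vpe,
                            ID3D11Texture2D *decoded, UINT array_slice,
                            ID3D11Texture2D *nv12_out)
{
    D3D11_VIDEO_PROCESSOR_INPUT_VIEW_DESC  in_desc  = { 0 };
    D3D11_VIDEO_PROCESSOR_OUTPUT_VIEW_DESC out_desc = { 0 };
    D3D11_VIDEO_PROCESSOR_STREAM stream = { 0 };
    ID3D11VideoProcessorInputView  *in_view  = NULL;
    ID3D11VideoProcessorOutputView *out_view = NULL;
    HRESULT hr;

    /* The decoder output is one slice of an array texture. */
    in_desc.ViewDimension        = D3D11_VPIV_DIMENSION_TEXTURE2D;
    in_desc.Texture2D.ArraySlice = array_slice;
    out_desc.ViewDimension       = D3D11_VPOV_DIMENSION_TEXTURE2D;

    hr = ID3D11VideoDevice_CreateVideoProcessorInputView(vdev,
            (ID3D11Resource *)decoded, vpe, &in_desc, &in_view);
    if (FAILED(hr))
        return hr;
    hr = ID3D11VideoDevice_CreateVideoProcessorOutputView(vdev,
            (ID3D11Resource *)nv12_out, vpe, &out_desc, &out_view);
    if (FAILED(hr)) {
        ID3D11VideoProcessorInputView_Release(in_view);
        return hr;
    }

    stream.Enable        = TRUE;
    stream.pInputSurface = in_view;
    hr = ID3D11VideoContext_VideoProcessorBlt(vctx, vp, out_view, 0, 1,
                                              &stream);

    ID3D11VideoProcessorInputView_Release(in_view);
    ID3D11VideoProcessorOutputView_Release(out_view);
    return hr;
}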

