[FFmpeg-devel] [PATCH 6/6] ffv1enc_vulkan: switch to receive_packet

Sun Nov 24 18:41:37 EET 2024

On 11/24/24 16:51, Jerome Martinez wrote:

> On 24/11/2024 at 04:41, Lynne via ffmpeg-devel wrote:
>> On 11/23/24 23:10, Jerome Martinez wrote:
>>
>>> On 23/11/2024 at 20:58, Lynne via ffmpeg-devel wrote:
>>>> This allows the encoder to fully saturate all queues the GPU
>>>> has, giving a good 10% in certain cases and resolutions.
>>>
>>>
>>> Using an RTX 4070:
>>> +50% (!!!) with 2K 10-bit content.
>>> +17% with 4K 16-bit content.
>>> Also, the speed with 2K content is now 4x the speed of 4K content,
>>> which is similar to the SW encoder (with a similar slice count)
>>> and is the expected result; it seems that a bottleneck with
>>> smaller resolutions has been removed.
>>>
>>>
>>> Unfortunately, it has a drawback: 6K5K content, which was handled
>>> well without this patch, now fails immediately with an error:
>>> [vost#0:0/ffv1_vulkan @ 0x10467840] [enc:ffv1_vulkan @ 0x12c011c0] 
>>> Error submitting video frame to the encoder
>>> [vost#0:0/ffv1_vulkan @ 0x10467840] [enc:ffv1_vulkan @ 0x12c011c0] 
>>> Error encoding a frame: Cannot allocate memory
>>> [vost#0:0/ffv1_vulkan @ 0x10467840] Task finished with error code: 
>>> -12 (Cannot allocate memory)
>>> [vost#0:0/ffv1_vulkan @ 0x10467840] Terminating thread with return 
>>> code -12 (Cannot allocate memory)
>>>
>>> This is a problem, since 6K5K was handled well on the RTX 4070
>>> (3x faster than a CPU at the same price) before this patch.
>>> Is it possible to keep support for bigger resolutions on such a
>>> card while keeping the performance boost of this patch?
>>
>>
>> To an extent. At high resolutions, -async_depth 0 (maximum) harms
>> performance. I get the best results with it set to 2 or 3 for 6k
>> content, on my odd setup.
>> Increasing async_depth increases the amount of VRAM used, so that's 
>> the tradeoff.
>> Automatically detecting it is difficult, as Vulkan doesn't give you
>> metrics on how much free VRAM there is, so there's nothing we can
>> do other
>
>
> I am torn between a default with as much performance as possible and
> a default that is sure to work (a default value of 1 is OK for the
> 6K5K content on the RTX 4070, but 2 is not).
> Surprisingly, the default async_depth works on 4K (51 MiB) but
> async_depth 2 does not work on 6K5K (183 MiB), though I don't know
> what the value of nb_queues is.
> Maybe the real use case is a user handling 6K5K with the biggest GPU
> available, so it does not hurt much to have a default crashing with
> such big content.
>
> The encoder catches the allocation error and sends a nice message.
> Wouldn't it be possible to automatically reduce async_depth and retry
> instead of immediately reporting the error when async_depth is not
> provided, and to error out only if -async_depth 1 does not work?
>
>> than to document it and hope users follow the instructions in case 
>> they run out of memory.
>
>
> If it is not possible to automatically try smaller values, is it
> possible to add "use -async_depth with a value smaller than (here the
> current value)" to the error message?
>
>
>
>> The good news is that -async_depth 1 uses less VRAM than before this
>> patch.
>> Most of the VRAM used is from somewhere within Nvidia's black-box
>> driver, as RADV uses 1/3rd of the VRAM with the same content and
>> async_depth settings. There is nothing we can do about this either.
>>
>>
>>>> This also improves error resilience if an allocation fails,
>>>> and properly cleans up after itself if it does.
>>>
>>> It looks like this part does not work: there is still a freeze if
>>> an allocation fails.
>>
>>
>> This is due to Nvidia's drivers. If you switch to using their GSP
>> firmware, recovery is instant, pretty much.
>
> Beyond my knowledge, and it does not make things worse so not blocking.

I've added VRAM checking to the patch. It should work in most cases. It
autodetects the async_depth value based on both the VRAM and the size
needed for one frame.

The exception is when VRAM is already full. We cannot currently detect
how much VRAM is in use, but it's a good thing no one will want to run
anything else anyway.
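
For illustration only, the autodetection described above and the
retry-on-failure idea from the quoted discussion could be sketched as
below. This is not the code from the patch: pick_async_depth,
alloc_with_fallback, try_alloc_fn and fake_alloc are hypothetical names,
the size needed for one frame is assumed to be known up front, and the
total VRAM is assumed to have been obtained elsewhere (for example by
summing the device-local heaps reported by
vkGetPhysicalDeviceMemoryProperties). Error codes follow the negative
errno convention seen in the log above (-12 for ENOMEM).

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical callback: tries to allocate everything needed for `depth`
 * in-flight frames. Returns 0 on success, a negative errno value on failure. */
typedef int (*try_alloc_fn)(void *opaque, int depth);

/* Largest depth in [1, max_depth] whose estimated footprint
 * (depth * frame_bytes) fits into vram_bytes. Falls back to 1 when
 * nothing fits or the inputs are unusable. */
static int pick_async_depth(uint64_t vram_bytes, uint64_t frame_bytes,
                            int max_depth)
{
    uint64_t fit;

    if (max_depth < 1 || frame_bytes == 0)
        return 1;

    fit = vram_bytes / frame_bytes;
    if (fit < 1)
        return 1;

    return fit > (uint64_t)max_depth ? max_depth : (int)fit;
}

/* Starts at start_depth and lowers the depth by one each time the
 * allocation fails with -ENOMEM. Returns the depth that worked, or the
 * last negative error code if even a depth of 1 failed. */
static int alloc_with_fallback(try_alloc_fn try_alloc, void *opaque,
                               int start_depth)
{
    int depth, ret = -EINVAL;

    for (depth = start_depth; depth >= 1; depth--) {
        ret = try_alloc(opaque, depth);
        if (ret == 0)
            return depth;
        if (ret != -ENOMEM)
            break; /* only out-of-memory is worth retrying */
    }

    return ret;
}

/* Stand-in allocator for demonstration: `opaque` points to an int holding
 * the largest depth that "fits"; anything above it reports -ENOMEM. */
static int fake_alloc(void *opaque, int depth)
{
    int limit = *(int *)opaque;
    return depth <= limit ? 0 : -ENOMEM;
}
```

With a limit of one in-flight frame, alloc_with_fallback(fake_alloc,
&limit, 3) settles on a depth of 1 after two failed attempts, which
mirrors the 6K5K case reported above where only -async_depth 1 worked.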