[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

Hendrik Leppkes h.leppkes at gmail.com
Mon May 18 22:48:41 CEST 2015

On Mon, May 18, 2015 at 9:41 PM, Reimar Döffinger
<Reimar.Doeffinger at gmx.de> wrote:
> On 18.05.2015, at 12:37, Stefano Sabatini <stefasab at gmail.com> wrote:
>> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini <stefasab at gmail.com>
>> wrote:
>>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>>> [...]
>>>>> One limitation, as the manual says, is that the decoded data needs
>>>>> to be copied from the GPU to system memory. ffmpeg_dxva2.c does not
>>>>> implement an optimized copy function for this; it uses plain old
>>>>> memcpy. Intel introduced a new instruction for this in SSE4.1,
>>>>> MOVNTDQA, which is optimized for copying from USWC memory
>>>>> (Uncacheable Speculative Write Combining) to system memory. Using
>>>>> this may help speed up the process significantly, and VLC probably
>>>>> uses it.
>>>> Now the question is, how would it be possible to optimize the GPU to
>>>> CPU copy to get an overall performance gain? At least VLC seems able
>>>> to get better performance when using HW decoding, but I'm not sure it
>>>> is copying decoded data back to the CPU (indeed it may perform direct
>>>> rendering).
>>> Self-reply:
>>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>>> Author: Laurent Aimar <fenrir at videolan.org>
>>> Date:   Tue Nov 17 01:09:43 2009 +0100
>>>    Improved performance when copying video surface in dxva2.
>>> That is, VLC is using an optimized GPU->CPU copy when the relevant
>>> SSE2 instructions are available.
>> I have a first hackish patch. I ran some tests and got significant
>> performance gains: on my Core i5 with Intel HD Graphics 4000, DXVA2
>> decoding of an H.264 1920x1080 video now performs the same as the
>> software decoder, while using only a single thread. The patch as is
>> is a hack, since I had to modify the compilation flags to enable
>> assembly compilation in the ffmpeg_dxva2.c file. I should probably
>> create an optimized copy function in libavutil; comments are welcome.
> What exactly is SSE4 needed for?

MOVNTDQA; it's specifically designed for just this task.
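
As a rough illustration only (this is neither the patch under discussion
nor VLC's code), a streaming-load copy could look something like the
sketch below. The names copy_from_uswc and copy_from_uswc_sse41 are
hypothetical, the target attribute and __builtin_cpu_supports assume GCC
or Clang, and a tuned version would likely also stage the data through a
small cached buffer, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 maps to MOVNTDQA */

/* Hypothetical helper, not an FFmpeg function.
 * Copies size bytes using MOVNTDQA loads where the source allows it. */
__attribute__((target("sse4.1")))
static void copy_from_uswc_sse41(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t i = 0;

    /* MOVNTDQA requires a 16-byte aligned source address. */
    if (((uintptr_t)src & 15) == 0) {
        for (; i + 16 <= size; i += 16) {
            __m128i v = _mm_stream_load_si128((__m128i *)(src + i));
            /* Unaligned store, so dst needs no particular alignment. */
            _mm_storeu_si128((__m128i *)(dst + i), v);
        }
    }

    /* Remaining tail bytes (or everything, if src was unaligned). */
    if (i < size)
        memcpy(dst + i, src + i, size - i);
}
#endif

/* Hypothetical entry point: picks the SSE4.1 path at runtime and
 * falls back to plain memcpy otherwise. */
static void copy_from_uswc(uint8_t *dst, const uint8_t *src, size_t size)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse4.1")) {
        copy_from_uswc_sse41(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

The streaming hint only pays off when the source really is USWC memory,
such as a locked surface; on ordinary cached memory this should not be
expected to beat memcpy.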

> Both non-temporal movs and prefetches existed before it, so if that is
> critical for performance the fallback implementation is bad.

An SSE2 implementation may or may not be faster than plain memcpy; that
depends on the memcpy implementation. In my tests on Windows, an SSE2
implementation was usually not worth it.

> However possibly more important: why is a memcpy needed at all?

For any further processing, you need the frame data. And trying to use
the frame data directly from the locked surfaces, e.g. for an encoder,
is very inefficient (possibly random access pattern), so it needs to be
copied into normal memory first.
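
To make the "copy first" step concrete, here is a minimal sketch of
reading a locked NV12 surface strictly sequentially, one row at a time,
into ordinary buffers. All names are hypothetical; locked_bits and pitch
stand in for the pointer and pitch returned by locking the surface, and
the layout assumed (chroma plane starting pitch * surface_height bytes
after the luma plane, even width) is an assumption of the sketch, not
something taken from ffmpeg_dxva2.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy one plane row by row, so the source is
 * read sequentially and any row padding (pitch > row size) is skipped. */
static void copy_plane(uint8_t *dst, size_t dst_stride,
                       const uint8_t *src, size_t src_pitch,
                       size_t bytes_per_row, size_t rows)
{
    for (size_t y = 0; y < rows; y++)
        memcpy(dst + y * dst_stride, src + y * src_pitch, bytes_per_row);
}

/* Hypothetical helper: copy an NV12 surface into separate Y and
 * interleaved UV buffers in normal memory.
 * Assumptions: width is even, and the UV plane starts at
 * pitch * surface_height, where surface_height may be larger than the
 * visible height because of padding. */
static void copy_nv12_surface(uint8_t *dst_y, uint8_t *dst_uv,
                              size_t dst_stride,
                              const uint8_t *locked_bits, size_t pitch,
                              size_t surface_height,
                              size_t width, size_t height)
{
    /* Luma: width bytes per row, height rows. */
    copy_plane(dst_y, dst_stride, locked_bits, pitch, width, height);

    /* Chroma: interleaved U/V, width bytes per row, half the rows. */
    copy_plane(dst_uv, dst_stride,
               locked_bits + pitch * surface_height, pitch,
               width, (height + 1) / 2);
}
```

The per-row memcpy is where an optimized copy routine would be swapped
in; after this step an encoder or filter only ever touches cached memory.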

- Hendrik
