[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

Thu May 14 18:15:13 CEST 2015

On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini <stefasab at gmail.com> wrote:
> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
>> > One limitation is that, as the manual says, the decoded frame needs
>> > to be copied from the GPU to system memory. ffmpeg_dxva2.c does not
>> > implement an optimized copy function for this; it uses plain old
>> > memcpy.
>> > Intel introduced a new instruction for this in SSE4.1, MOVNTDQA,
>> > which is optimized for copying from USWC memory (Uncacheable
>> > Speculative Write Combining) to system memory. Using this may help
>> > speed up the process significantly, and VLC probably uses it.
>>
>> Now the question is, how would it be possible to optimize the GPU to
>> CPU copy to get an overall performance gain? At least VLC seems able
>> to get better performance when using HW decoding, but I'm not sure it
>> is copying decoded data back to the CPU (indeed it may perform direct
>> rendering).
>
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar <fenrir at videolan.org>
> Date:   Tue Nov 17 01:09:43 2009 +0100
>
>     Improved performance when copying video surface in dxva2.
>
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
> instructions are available.

Actually, the instruction that really matters here (MOVNTDQA) is
SSE4.1; using only SSE2 would give just a small advantage over memcpy.
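
To illustrate, here is a rough sketch of what a MOVNTDQA-based copy
could look like in C. It is not the code from VLC or ffmpeg_dxva2.c;
the names copy_from_uswc and copy_uswc_sse41 are made up for this
example, and it assumes GCC or Clang (target attribute and
__builtin_cpu_supports). The streaming load needs a 16-byte aligned
source, so anything else falls back to memcpy:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

/* Hypothetical helper: copies 64 bytes per iteration with streaming
 * loads. src must be 16-byte aligned; dst may be unaligned. */
__attribute__((target("sse4.1")))
static void copy_uswc_sse41(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t i = 0;

    for (; i + 64 <= size; i += 64) {
        /* Issue all four streaming loads of one 64-byte chunk first. */
        __m128i a = _mm_stream_load_si128((__m128i *)(src + i));
        __m128i b = _mm_stream_load_si128((__m128i *)(src + i + 16));
        __m128i c = _mm_stream_load_si128((__m128i *)(src + i + 32));
        __m128i d = _mm_stream_load_si128((__m128i *)(src + i + 48));

        _mm_storeu_si128((__m128i *)(dst + i),      a);
        _mm_storeu_si128((__m128i *)(dst + i + 16), b);
        _mm_storeu_si128((__m128i *)(dst + i + 32), c);
        _mm_storeu_si128((__m128i *)(dst + i + 48), d);
    }

    /* Copy the remaining tail (fewer than 64 bytes) the plain way. */
    memcpy(dst + i, src + i, size - i);
}
#endif

/* Hypothetical entry point: use the SSE4.1 path when the CPU supports
 * it and src is suitably aligned, otherwise fall back to memcpy. */
static void copy_from_uswc(void *dst, const void *src, size_t size)
{
    assert(dst != NULL && src != NULL);

#if defined(__x86_64__) || defined(__i386__)
    if (((uintptr_t)src & 15) == 0 && __builtin_cpu_supports("sse4.1")) {
        copy_uswc_sse41((uint8_t *)dst, (const uint8_t *)src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

On ordinary cacheable memory the streaming load behaves like a normal
load, so any gain is only to be expected when src really is USWC
memory, such as a locked DXVA2 surface.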

- Hendrik