[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
h.leppkes at gmail.com
Mon May 18 13:17:17 CEST 2015
On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini <stefasab at gmail.com> wrote:
> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini <stefasab at gmail.com>
>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>> > On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>> > > One limitation, as the manual says, is that the decoded frame needs to
>> > > be copied from the GPU to system memory. ffmpeg_dxva2.c does not
>> > > implement an optimized copy function for this; it uses plain old memcpy.
>> > > Intel introduced a new instruction for this in SSE4.1, MOVNTDQA, which
>> > > is optimized for copying from USWC memory (Uncacheable Speculative
>> > > Write Combining) to system memory. Using this may help speed up the
>> > > process significantly, and VLC probably uses it.
>> > Now the question is, how would it be possible to optimize the GPU to CPU
>> > copy to get an overall performance gain? At least VLC seems able to get
>> > better performance when using HW decoding, but I'm not sure it is
>> > copying decoded data back to the CPU (indeed it may perform direct
>> > rendering).
>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>> Author: Laurent Aimar <fenrir at videolan.org>
>> Date: Tue Nov 17 01:09:43 2009 +0100
>> Improved performance when copying video surface in dxva2.
>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
>> instructions are available.
> I have a first hackish patch. I ran some tests and got significant
> performance gains: on my Core i5 with Intel HD Graphics 4000, DXVA2
> decoding of an H.264 1920x1080 video now matches the performance of the
> software decoder while using only a single thread. The patch as it stands
> is a hack, since I had to modify the compilation flags to enable
> assembly compilation in the ffmpeg_dxva2.c file. I should probably create
> an optimized copy function in libavutil; comments are welcome.
FWIW, I never saw any benefit from using a small cache over simply
copying directly to the destination memory. Dropping the cache could
potentially simplify this a bit.
And yeah, it's a huge hack; we don't want new inline assembly.
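
To illustrate the direct-copy idea without inline assembly, here is a
minimal sketch using compiler intrinsics. It is not the patch discussed
above and not an existing FFmpeg or VLC function: the name copy_from_uswc
is hypothetical, it assumes GCC or Clang (for the target attribute), a
16-byte-aligned source, and a CPU that is already known to support SSE4.1.
A real implementation would need a runtime CPU check before taking this
path. On other architectures or compilers the sketch falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 maps to MOVNTDQA */
#define USWC_SSE41_PATH 1
#endif

/* Hypothetical helper: copy size bytes from a (possibly USWC-mapped)
 * surface straight into dst, with no intermediate cache buffer. */
#ifdef USWC_SSE41_PATH
__attribute__((target("sse4.1")))
#endif
static void copy_from_uswc(uint8_t *dst, const uint8_t *src, size_t size)
{
#ifdef USWC_SSE41_PATH
    /* MOVNTDQA requires a 16-byte-aligned source address. */
    assert(((uintptr_t)src & 15) == 0);

    /* Fence before the streaming loads so earlier memory operations
     * have completed before the surface is read. */
    _mm_mfence();

    /* Move 64 bytes per iteration: four streaming loads followed by
     * four unaligned stores, so dst needs no particular alignment. */
    while (size >= 64) {
        __m128i *s = (__m128i *)src;
        __m128i *d = (__m128i *)dst;
        __m128i a = _mm_stream_load_si128(s + 0);
        __m128i b = _mm_stream_load_si128(s + 1);
        __m128i c = _mm_stream_load_si128(s + 2);
        __m128i e = _mm_stream_load_si128(s + 3);
        _mm_storeu_si128(d + 0, a);
        _mm_storeu_si128(d + 1, b);
        _mm_storeu_si128(d + 2, c);
        _mm_storeu_si128(d + 3, e);
        src  += 64;
        dst  += 64;
        size -= 64;
    }
#endif
    /* Tail bytes, or the whole buffer when the SSE4.1 path is absent. */
    memcpy(dst, src, size);
}
```

For a video surface this would be called once per row (or once per plane
when the pitch equals the row width), with src pointing into the locked
surface. Whether it beats plain memcpy depends on the memory actually
being mapped as USWC, so it would need to be measured.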