[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

Hendrik Leppkes h.leppkes at gmail.com
Tue May 12 15:54:17 CEST 2015

On Tue, May 12, 2015 at 3:33 PM, Stefano Sabatini <stefasab at gmail.com> wrote:
> Hi guys,
> I'm playing with DXVA2 hardware decoding on Windows, and these are my
> findings.
> DXVA2 decoding was enabled in avconv/ffmpeg through the commit:
> commit 35177ba77ff60a8b8839783f57e44bcc4214507a
> Author: Hendrik Leppkes <h.leppkes at gmail.com>
> Date:   Tue Apr 22 15:22:53 2014 +0200
>     avconv: add support for DXVA2 decoding
>     Signed-off-by: Anton Khirnov <anton at khirnov.net>
> DXVA2 decoding is enabled when a dxva2api.h header is found in the
> path. From my understanding the header is provided by VLC:
> http://download.videolan.org/pub/contrib/dxva2api.h
> (I suppose the header was created in order to make compilation work
> with MinGW). When compiling with MinGW from mingw.org I had to change
> the GetShellWindow call in the line:
>     hr = IDirect3D9_CreateDevice(ctx->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
>                                  &d3dpp, &ctx->d3d9device);
> to GetDesktopWindow in the ffmpeg_dxva2.c file. I applied the fix
> suggested here:
> http://ffmpeg.org/pipermail/libav-user/2014-December/007673.html

You should use mingw-w64; it provides a dxva2api.h and can compile
the code without any modifications.
Using the "original" mingw32 is not recommended, and barely supported.

> Then I performed some tests with the command:
> ffmpeg -hwaccel dxva2 INPUT -threads 1 -f null -
> The -threads 1 option seems required or ffmpeg will fail with decoding
> errors.

Indeed, multi-threading with hwaccel is not something that should be
used, as it will break, although the API allows it for BS reasons.
There wouldn't be a performance improvement either way.

> In the ffmpeg(1) manual I can read this big warning:
>  Note that most acceleration methods are intended for playback and
>  will not be faster than software decoding on modern
>  CPUs. Additionally, ffmpeg will usually need to copy the decoded
>  frames from the GPU memory into the system memory, resulting in
>  further performance loss. This option is thus mainly useful for
>  testing.
> I tested with several HW combinations, and I always find that pure
> software decoding is several times faster than DXVA2
> decoding. In some cases I got invalid output (same with VLC) which may
> be related to a problem in the graphics card or driver (a VIA VX900).

I don't think I've ever tested on such a chip. I didn't even know VIA
still made PC hardware.
Therefore, I have no idea how fast/slow or compatible it is.

> On the other hand, when testing with VLC I noticed better performance
> (in general, significantly reduced CPU usage, usually by a factor
> of about 3), so I have to conclude that at least VLC is able to make
> good use of DXVA2 hardware acceleration.
> I'm aware that the need to copy GPU data back to the CPU memory as
> required by ffmpeg defeats the advantage (if any) of hardware
> decoding, especially given that multithreaded decoding cannot be
> used with DXVA2.
> My questions are:
> Are there cases where DXVA2 (or HW decoding in general) can be
> used effectively in ffmpeg? Can you tell if there is something which
> could be improved in the current ffmpeg_dxva2.c implementation? (My
> guess is that this code is somehow based on the VLC code).

It's not based on the VLC code, it's roughly based on code from my own
project that uses ffmpeg for DXVA2, but really, the workflow is going
to be pretty similar in any implementation either way, since the MS
API dictates that, more or less.

DXVA2 decoding can be faster than software decoding, depending on your hardware.

If you use a low-end Intel CPU, say a Pentium or i3 (Ivy Bridge or
Haswell), or a recent NVIDIA GPU (Kepler or Maxwell), then DXVA2
decoding on the GPU can potentially give you ~400 fps for 1080p, while
the CPU will likely not manage that.
On a high-end CPU, the software decoder can potentially exceed that, however.

One limitation is, as the manual says, that the decoded frames need to
be copied from the GPU to system memory. ffmpeg_dxva2.c does not
implement an optimized copy function for this, it uses plain old memcpy.
Intel introduced a new instruction for this in SSE4.1, MOVNTDQA, which
is optimized for loading from USWC memory (Uncacheable Speculative
Write Combining). Using it for the copy to system memory may help speed
up the process significantly, and VLC probably uses it.
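
To illustrate the idea, here is a minimal sketch of a row-by-row plane
copy that uses MOVNTDQA through the _mm_stream_load_si128 intrinsic.
It is not the code from ffmpeg_dxva2.c or VLC. The names
copy_plane_from_uswc, copy_row_stream and can_stream are hypothetical,
and the sketch assumes GCC or Clang on x86. It falls back to memcpy
when the source row is not 16-byte aligned or the CPU lacks SSE4.1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <smmintrin.h> /* SSE4.1 intrinsics, including _mm_stream_load_si128 */

/* Copy one row using MOVNTDQA streaming loads.
 * The caller must guarantee that src is 16-byte aligned. */
__attribute__((target("sse4.1")))
static void copy_row_stream(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        /* The cast drops const because some compilers declare the
         * intrinsic with a non-const pointer parameter. */
        __m128i v = _mm_stream_load_si128((__m128i *)(uintptr_t)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
    /* Copy the tail (fewer than 16 bytes) the ordinary way. */
    memcpy(dst + i, src + i, len - i);
}

/* Streaming loads need an aligned source and a CPU with SSE4.1. */
static int can_stream(const uint8_t *src)
{
    return ((uintptr_t)src & 15) == 0 && __builtin_cpu_supports("sse4.1");
}
#else
/* Other compilers or architectures: always take the memcpy path. */
static void copy_row_stream(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
}

static int can_stream(const uint8_t *src)
{
    (void)src;
    return 0;
}
#endif

/* Copy `rows` rows of `row_bytes` bytes each from a (possibly USWC)
 * source with pitch src_stride to a destination with pitch dst_stride. */
static void copy_plane_from_uswc(uint8_t *dst, size_t dst_stride,
                                 const uint8_t *src, size_t src_stride,
                                 size_t row_bytes, size_t rows)
{
    assert(dst_stride >= row_bytes && src_stride >= row_bytes);
    for (size_t y = 0; y < rows; y++) {
        const uint8_t *s = src + y * src_stride;
        uint8_t *d = dst + y * dst_stride;
        if (can_stream(s))
            copy_row_stream(d, s, row_bytes);
        else
            memcpy(d, s, row_bytes);
    }
}
```

In a DXVA2 readback, src and src_stride would presumably be the pBits
and Pitch members of the D3DLOCKED_RECT returned by locking the
surface. On ordinary cacheable memory the streaming load behaves like a
normal load, so any gain should only show up when the source really is
USWC memory.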

The original primary goal of this code, however, was to make testing
and debugging the hwaccels much easier, not to directly provide a
playback/transcoding feature, so such optimizations were left out
for brevity.

- Hendrik
