[FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

Mark Thompson sw at jkqxz.net
Wed Feb 20 23:43:10 EET 2019


On 20/02/2019 10:17, Wang, Shaofei wrote:
>> -----Original Message-----
>> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of
>> Mark Thompson
>> Sent: Saturday, February 16, 2019 8:12 PM
>> To: ffmpeg-devel at ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1
>> decode + N filter graphs and adaptive bitrate.
>> On 15/02/2019 21:54, Shaofei Wang wrote:
>>> It enabled multiple filter graph concurrency, which bring above about
>>> 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
>>>
>>> Below are some test cases and comparison as reference.
>>> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
>>> (Software: Intel iHD driver - 16.9.00100, CentOS 7)
>>>
>>> For 1:N transcode by GPU acceleration with vaapi:
>>> ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
>>>     -hwaccel_output_format vaapi \
>>>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
>>>     -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null
>>>
>>>     test results:
>>>                 2 encoders 5 encoders 10 encoders
>>>     Improved       6.1%    6.9%       5.5%
>>>
>>> For 1:N transcode by GPU acceleration with QSV:
>>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>>>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null
>> \
>>>     -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null
>>> /dev/null
>>>
>>>     test results:
>>>                 2 encoders  5 encoders 10 encoders
>>>     Improved       6%       4%         15%
>>>
>>> For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
>>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>>>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f
>> null /dev/null \
>>>     -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f
>>> null /dev/null
>>>
>>>     test results:
>>>                 2 scale  5 scale   10 scale
>>>     Improved       12%     21%        21%
>>>
>>> For CPU only 1 decode to N scaling:
>>> ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
>>>     -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null
>>>
>>>     test results:
>>>                 2 scale  5 scale   10 scale
>>>     Improved       25%    107%       148%
>>>
>>> Signed-off-by: Wang, Shaofei <shaofei.wang at intel.com>
>>> Reviewed-by: Zhao, Jun <jun.zhao at intel.com>
>>> ---
>>>  fftools/ffmpeg.c        | 121
>> ++++++++++++++++++++++++++++++++++++++++++++----
>>>  fftools/ffmpeg.h        |  14 ++++++
>>>  fftools/ffmpeg_filter.c |   1 +
>>>  3 files changed, 128 insertions(+), 8 deletions(-)
>>
>> On a bit more review, I don't think this patch works at all.
>>
> It has been tested and verified by a lot of cases. More fate cases need to be covered now.
> 
>> The existing code is all written to be run serially.  This simplistic approach to
>> parallelising it falls down because many of those functions use variables
>> written in what were previously other functions called at different times but
>> have now become other threads, introducing undefined behaviour due to
>> data races.
>>
> Actually, this is not a patch to parallel every thing in the ffmpeg. It just thread the input filter
> of the filter graph(tend for simple filter graph), which is a simple way to improve N filter graph
> performance and also without introduce huge modification. So that there is still a lot of serial function call, differences
> are that each filter graph need to init its output stream instead of init all together and each
> filter graph will reap filters for its filter chain.

Indeed the existing encapsulation tries to keep things mostly separate, but in various places it accesses shared state which works fine in the serial case but fails when those parts are run in parallel.

Data races are undefined behaviour in C; introducing them is not acceptable.

>> To consider a single example (not the only one), the function
>> check_init_output_file() does not work at all after this change.  The test for
>> OutputStream initialisation (so that you run exactly once after all of the
>> output streams are ready) races with other threads setting those variables.
>> Since that's undefined behaviour you may get lucky sometimes and have the
>> output file initialisation run exactly once, but in general it will fail in unknown
>> ways.
>>
> 
> The check_init_output_file() should be responsible for the output file related with
> specified output stream which managed by each thread chain, that means even it
> called by different thread, the data setting are different. Let me double check.

Each output file can contain multiple streams - try a transcode with both audio and video filters.

(Incidentally, the patch as-is also crashes in this case if the transcode completes without any data from one of the streams.)

- Mark


More information about the ffmpeg-devel mailing list