[FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon
Rafal Dabrowa
fatwildcat at gmail.com
Sun Nov 19 16:43:14 EET 2017
On 11/18/2017 07:41 PM, James Almer wrote:
> On 11/18/2017 3:31 PM, Rostislav Pehlivanov wrote:
>>>
>>>
>>> On 18 November 2017 at 17:35, Rafal Dabrowa <fatwildcat at gmail.com> wrote:
>>>
>>> This patch proposes performance optimizations for 8-bit
>>> hevc video decoding on the aarch64 platform with the neon (simd) extension.
>>>
>>> I'm testing my optimizations on a NanoPi M3 device, mainly with
>>> the "Big Buck Bunny" video file in 1280x720 format.
>>> The video file was pulled from the libde265.org page, see
>>> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
>>> The movie duration is 00:10:34.53.
>>>
>>> The overall performance gain is about 2x. Without the optimizations,
>>> playback practically stalls after a few seconds. With the
>>> optimizations, the file plays smoothly 99% of the time.
>>>
>>> For performance testing the following command was used:
>>>
>>> time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
>>>
>>> The video file was pre-read before test to minimize disk reads during
>>> testing.
>>> Program execution time without optimization was as follows:
>>>
>>> real 11m48.576s
>>> user 43m8.111s
>>> sys 0m12.469s
>>>
>>> Execution time with optimizations:
>>>
>>> real 6m17.046s
>>> user 21m19.792s
>>> sys 0m14.724s
>>>
>>>
>>> The patch contains optimizations for the most heavily used qpel, epel, sao
>>> and idct functions. Among the functions provided for optimization there are
>>> two intensively used ones that are not optimized in this patch:
>>> hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8. I have no idea how
>>> they could be optimized, hence I left them without optimizations.
>>>
>>>
>>>
>>> Signed-off-by: Rafal Dabrowa <fatwildcat at gmail.com>
>>> ---
>>> libavcodec/aarch64/Makefile | 5 +
>>> libavcodec/aarch64/hevcdsp_epel_8.S | 3949 ++++++++++++++++++++
>>> libavcodec/aarch64/hevcdsp_idct_8.S | 1980 ++++++++++
>>> libavcodec/aarch64/hevcdsp_init_aarch64.c | 170 +
>>> libavcodec/aarch64/hevcdsp_qpel_8.S | 5666 +++++++++++++++++++++++++++++
>>> libavcodec/aarch64/hevcdsp_sao_8.S | 166 +
>>> libavcodec/hevcdsp.c | 2 +
>>> libavcodec/hevcdsp.h | 1 +
>>> 8 files changed, 11939 insertions(+)
>>> create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
>>> create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
>>> create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
>>> create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
>>> create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S
>>
>>
>> Very nice.
>> The way we test SIMD is to put START_TIMER("function_name"); and
>> STOP_TIMER; (they're located in libavutil/timer.h) around where the
>> function gets called in the C code, then we do a run with the C code (no
>> SIMD) and a separate run with whatever SIMD optimizations we're
>> implementing. We take the last printed value of both runs and that's what's
>> used to measure speedup.
>>
>> I don't think there's a need yet to split the patch into multiple patches
>> for each individual version though; that's usually only done if some
>> function's C implementation is faster than the SIMD code.
> It would be nice however to at least split it into two patches, one for
> MC and one for SAO.
Could you explain which functions are MC?
I can split the patch into several, but a dependency between the patches
is unavoidable because the non-optimized function pointers are
replaced with the optimized ones all together, in one function body.
One of the patches must add that function and the call to it.
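
To illustrate the dependency: the decoder keeps a table of function
pointers that starts out with the C versions, and one per-architecture
init function overwrites the entries. The sketch below is a simplified
stand-in, not the code from the patch; the struct DspTable, its fields
and all function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function-pointer table; a real DSP context has many more entries. */
typedef void (*idct_fn)(int16_t *coeffs);
typedef void (*sao_fn)(uint8_t *dst, const uint8_t *src, int width);

typedef struct DspTable {
    idct_fn idct;
    sao_fn  sao;
} DspTable;

/* Plain C reference versions (bodies reduced to trivial stand-ins). */
static void idct_c(int16_t *coeffs) { coeffs[0] = 1; }
static void sao_c(uint8_t *dst, const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = src[i];
}

/* Stand-ins for the hand-written NEON versions. */
static void idct_neon(int16_t *coeffs) { coeffs[0] = 2; }
static void sao_neon(uint8_t *dst, const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = src[i];
}

/* Generic init: every entry points at the C version. */
static void dsp_init_c(DspTable *t)
{
    t->idct = idct_c;
    t->sao  = sao_c;
}

/* The single per-architecture init. In a split series, one patch has to
 * introduce this function and the call to it; the other patches can then
 * only add assignments here, so they all depend on that first patch. */
static void dsp_init_aarch64(DspTable *t, int have_neon)
{
    if (!have_neon)
        return;
    t->idct = idct_neon;
    t->sao  = sao_neon;
}

static void dsp_init(DspTable *t, int have_neon)
{
    assert(t != NULL);
    dsp_init_c(t);
    dsp_init_aarch64(t, have_neon);
}
```

Calling dsp_init(&table, 1) leaves both entries pointing at the NEON
stand-ins, while dsp_init(&table, 0) keeps the C versions.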
>
> Also, no way to use macros in aarch64 asm files? ~11k lines of code is a
> lot to add, and I'm sure a sizable portion is duplicated with only some
> small differences between functions.
I used macros sparingly because code without macros is
easier to understand and to improve. Sometimes even the order
of assembly instructions matters. But, of course, I can reduce
the code size using macros if the patch is going to be accepted. I didn't
know whether you were interested in the patch at all.
Regarding performance testing: I wrapped every function with another
one, which calls START_TIMER and STOP_TIMER. It looks like these macros
aren't reentrant, so I needed to force the program to run in a single
thread. Without this I got strange results that differed widely between
runs, for example:
22190 UNITS in put_hevc_qpel_uni_h12_8, 16232 runs, 152 skips
1126 UNITS in put_hevc_qpel_uni_h12_8, 12001 runs, 4383 skips
Forcing single-threaded mode was not easy; the -filter_threads
option didn't help.
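
The wrapper approach can be sketched roughly as follows. This is not
the code from libavutil/timer.h; it is a simplified imitation built on
the standard clock() call, and the names TimerStats, stats, work and
timed_call are hypothetical. It assumes the counters live in shared
static storage, which would explain the inconsistent numbers: they are
updated without any locking, so calls from several decoder threads can
overwrite each other's updates.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical accumulated statistics, shared by every caller of the wrapper. */
typedef struct TimerStats {
    uint64_t total_ticks;
    uint64_t runs;
} TimerStats;

static TimerStats stats; /* shared and unsynchronized: not safe across threads */

/* Stand-in for the function being measured. */
static int work(int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Wrapper: time one call and accumulate the result into the shared state. */
static int timed_call(int n)
{
    assert(n >= 0);

    clock_t start = clock();
    int result = work(n);
    clock_t stop = clock();

    /* Read-modify-write on shared data; two threads executing these lines
     * at the same time can lose updates, so the totals become unreliable. */
    stats.total_ticks += (uint64_t)(stop - start);
    stats.runs++;
    return result;
}
```

In a single thread the counters stay consistent, which matches the
observation that the results only became stable after forcing
single-threaded decoding.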
Below are the results. Meaning of the columns:
FUNCTION    - the function being optimized
UNITS_NOOPT - last UNITS result in the run without optimization
OPT         - last UNITS result in the run with optimization
CALLS       - sum of runs and skips
NSKIPS      - number of skips in the non-optimized version
OSKIPS      - number of skips in the optimized version
FUNCTION UNITS_NOOPT OPT CALLS NSKIPS OSKIPS
-------------------------------------------------------------------------
idct_16x16_8 113074 24079 2097152 0 0
idct_32x32_8 587447 100434 524288 0 0
put_hevc_epel_bi_h4_8 7651 3654 524288 177 1857
put_hevc_epel_bi_h6_8 18377 6668 32768 0 0
put_hevc_epel_bi_h8_8 20644 6698 1048576 34 1298
put_hevc_epel_bi_h12_8 62927 18968 16384 0 0
put_hevc_epel_bi_h16_8 78601 21254 524288 0 4
put_hevc_epel_bi_h24_8 231004 53800 4096 0 0
put_hevc_epel_bi_h32_8 294058 63302 524288 0 0
put_hevc_epel_bi_hv4_8 13183 6264 2097152 67 3057
put_hevc_epel_bi_hv6_8 27672 12706 131072 0 0
put_hevc_epel_bi_hv8_8 31908 11184 2097152 4 1688
put_hevc_epel_bi_hv12_8 86370 29497 65536 0 0
put_hevc_epel_bi_hv16_8 104623 30717 1048576 0 3
put_hevc_epel_bi_hv24_8 302361 80610 8192 0 0
put_hevc_epel_bi_hv32_8 376614 92475 1048576 0 0
put_hevc_epel_bi_v4_8 7290 3368 2097152 338 4444
put_hevc_epel_bi_v6_8 19306 8423 65536 0 0
put_hevc_epel_bi_v8_8 20431 5795 2097152 12 2252
put_hevc_epel_bi_v12_8 61368 21050 16384 0 0
put_hevc_epel_bi_v16_8 74351 17655 1048576 0 9
put_hevc_epel_bi_v24_8 226914 51601 4096 0 0
put_hevc_epel_bi_v32_8 285476 55184 1048576 0 0
put_hevc_epel_h4_8 5826 3362 524288 667 2619
put_hevc_epel_h6_8 12852 5912 32768 0 0
put_hevc_epel_h8_8 13847 6009 1048576 237 1504
put_hevc_epel_h12_8 44210 17185 16384 0 0
put_hevc_epel_h16_8 53502 18642 524288 0 5
put_hevc_epel_h24_8 157030 48086 4096 0 0
put_hevc_epel_h32_8 193877 54837 524288 0 0
put_hevc_epel_hv4_8 11031 6379 2097152 316 1886
put_hevc_epel_hv6_8 23233 12730 131072 0 0
put_hevc_epel_hv8_8 25406 10989 2097152 21 1471
put_hevc_epel_hv12_8 70139 28821 65536 0 0
put_hevc_epel_hv16_8 81318 30190 1048576 0 4
put_hevc_epel_hv24_8 230829 75079 16384 0 0
put_hevc_epel_hv32_8 285945 92143 1048576 0 0
put_hevc_epel_uni_hv4_8 13255 7571 2097152 142 582
put_hevc_epel_uni_hv6_8 29279 14637 131072 0 0
put_hevc_epel_uni_hv8_8 31783 14114 1048576 0 26
put_hevc_epel_uni_hv12_8 85576 31757 32768 0 0
put_hevc_epel_uni_hv16_8 90346 29886 524288 0 0
put_hevc_epel_uni_hv24_8 281864 76862 1024 0 0
put_hevc_epel_uni_hv32_8 322135 91541 65536 0 0
put_hevc_epel_uni_v4_8 6826 3785 2097152 494 3496
put_hevc_epel_uni_v6_8 20113 10093 32768 0 0
put_hevc_epel_uni_v8_8 18883 6444 1048576 7 448
put_hevc_epel_uni_v12_8 59989 23523 8192 0 0
put_hevc_epel_uni_v16_8 63740 18096 262144 0 0
put_hevc_epel_uni_v24_8 208109 48880 512 0 0
put_hevc_epel_uni_v32_8 249717 50660 262144 0 0
put_hevc_epel_v4_8 5834 3056 2097152 970 5422
put_hevc_epel_v6_8 15541 8900 65536 0 0
put_hevc_epel_v8_8 14549 5476 2097152 296 3129
put_hevc_epel_v12_8 48518 22362 32768 0 0
put_hevc_epel_v16_8 53909 16483 1048576 0 23
put_hevc_epel_v24_8 166783 43662 4096 0 0
put_hevc_epel_v32_8 210650 47112 1048576 0 0
put_hevc_pel_bi_pixels4_8 4751 2923 2097152 7381 9232
put_hevc_pel_bi_pixels6_8 11774 5689 65536 0 0
put_hevc_pel_bi_pixels8_8 12269 4165 4194304 2298 12731
put_hevc_pel_bi_pixels12_8 36260 14031 65536 0 0
put_hevc_pel_bi_pixels16_8 42718 10421 4194304 21 3881
put_hevc_pel_bi_pixels24_8 137480 38423 32768 0 0
put_hevc_pel_bi_pixels32_8 172166 43996 8388608 0 3
put_hevc_pel_bi_pixels48_8 520118 133238 4096 0 0
put_hevc_pel_bi_pixels64_8 671892 173615 4194304 0 0
put_hevc_pel_pixels4_8 3859 3139 1048576 8926 9478
put_hevc_pel_pixels6_8 8453 6566 32768 0 0
put_hevc_pel_pixels8_8 7144 3093 4194304 4802 30239
put_hevc_pel_pixels12_8 25096 16648 65536 0 0
put_hevc_pel_pixels16_8 25472 9538 2097152 790 3094
put_hevc_pel_pixels24_8 93108 42948 32768 0 0
put_hevc_pel_pixels32_8 100331 37550 8388608 0 2
put_hevc_pel_pixels48_8 321258 137835 4096 0 0
put_hevc_pel_pixels64_8 387236 152538 4194304 0 0
put_hevc_qpel_bi_h4_8 34054 20498 16384 0 0
put_hevc_qpel_bi_h8_8 34264 10873 524288 0 801
put_hevc_qpel_bi_h12_8 85199 22938 16384 0 0
put_hevc_qpel_bi_h16_8 107035 20526 524288 0 488
put_hevc_qpel_bi_h24_8 323233 66440 16384 0 0
put_hevc_qpel_bi_h32_8 415699 76073 262144 0 0
put_hevc_qpel_bi_h48_8 1282990 246145 2048 0 0
put_hevc_qpel_bi_h64_8 1664853 260382 262144 0 0
put_hevc_qpel_bi_hv4_8 56239 31221 32768 0 0
put_hevc_qpel_bi_hv8_8 63859 21595 1048576 0 63
put_hevc_qpel_bi_hv12_8 143173 58139 65536 0 0
put_hevc_qpel_bi_hv16_8 184410 40468 1048576 0 15
put_hevc_qpel_bi_hv24_8 509364 134833 32768 0 0
put_hevc_qpel_bi_hv32_8 647015 125581 524288 0 0
put_hevc_qpel_bi_hv48_8 1929283 385204 4096 0 0
put_hevc_qpel_bi_hv64_8 2416442 430161 524288 0 0
put_hevc_qpel_bi_v4_8 37454 22461 32768 0 0
put_hevc_qpel_bi_v8_8 34500 9218 1048576 0 1291
put_hevc_qpel_bi_v12_8 87403 31659 32768 0 0
put_hevc_qpel_bi_v16_8 106589 19326 1048576 0 971
put_hevc_qpel_bi_v24_8 332644 78044 16384 0 0
put_hevc_qpel_bi_v32_8 405835 73886 524288 0 0
put_hevc_qpel_bi_v48_8 1266494 217496 2048 0 0
put_hevc_qpel_bi_v64_8 1677771 259481 524288 0 0
put_hevc_qpel_h4_8 29542 16982 16384 0 0
put_hevc_qpel_h8_8 26710 10452 524288 5 558
put_hevc_qpel_h12_8 67708 22021 16384 0 0
put_hevc_qpel_h16_8 81849 18637 524288 0 560
put_hevc_qpel_h24_8 258384 62392 16384 0 0
put_hevc_qpel_h32_8 321281 68451 262144 0 0
put_hevc_qpel_h48_8 984759 219657 2048 0 0
put_hevc_qpel_h64_8 1224717 227914 262144 0 0
put_hevc_qpel_hv4_8 51764 32150 32768 0 0
put_hevc_qpel_hv8_8 56369 21627 1048576 0 73
put_hevc_qpel_hv12_8 125191 48671 65536 0 0
put_hevc_qpel_hv16_8 159288 40749 1048576 0 10
put_hevc_qpel_hv24_8 438656 131331 32768 0 0
put_hevc_qpel_hv32_8 551607 121954 524288 0 0
put_hevc_qpel_hv48_8 1627266 397656 4096 0 0
put_hevc_qpel_hv64_8 2016176 414765 524288 0 0
put_hevc_qpel_uni_h4_8 21301 13384 131072 0 0
put_hevc_qpel_uni_h8_8 30057 11010 524288 7 486
put_hevc_qpel_uni_h12_8 84804 25790 16384 0 0
put_hevc_qpel_uni_h16_8 95333 24267 262144 0 17
put_hevc_qpel_uni_h24_8 318029 76951 4096 0 0
put_hevc_qpel_uni_h32_8 356799 72279 65536 0 0
put_hevc_qpel_uni_h48_8 1181308 237731 128 0 0
put_hevc_qpel_uni_h64_8 1401262 231221 16384 0 0
put_hevc_qpel_uni_hv4_8 39439 22837 262144 0 1
put_hevc_qpel_uni_hv8_8 60380 23283 1048576 0 77
put_hevc_qpel_uni_hv12_8 146759 56280 32768 0 0
put_hevc_qpel_uni_hv16_8 173329 45131 524288 0 2
put_hevc_qpel_uni_hv24_8 505434 139999 16384 0 0
put_hevc_qpel_uni_hv32_8 561402 120361 131072 0 0
put_hevc_qpel_uni_hv48_8 1854753 361780 256 0 0
put_hevc_qpel_uni_hv64_8 2142627 404073 32768 0 0
put_hevc_qpel_uni_v4_8 23081 12550 262144 0 0
put_hevc_qpel_uni_v8_8 30075 9971 1048576 5 511
put_hevc_qpel_uni_v12_8 89427 38025 16384 0 0
put_hevc_qpel_uni_v16_8 96131 21727 524288 0 23
put_hevc_qpel_uni_v24_8 328019 90689 8192 0 0
put_hevc_qpel_uni_v32_8 358340 71396 131072 0 0
put_hevc_qpel_uni_v48_8 1164812 176367 256 0 0
put_hevc_qpel_uni_v64_8 1464856 232866 32768 0 0
put_hevc_qpel_v4_8 31732 19999 32768 0 0
put_hevc_qpel_v8_8 25311 8967 1048576 10 1142
put_hevc_qpel_v12_8 67764 29917 32768 0 0
put_hevc_qpel_v16_8 78023 18260 1048576 0 819
put_hevc_qpel_v24_8 254724 75185 16384 0 0
put_hevc_qpel_v32_8 305639 69130 524288 0 0
put_hevc_qpel_v48_8 892900 240703 2048 0 0
put_hevc_qpel_v64_8 1149597 221632 524288 0 0
sao_edge_filter_8 600074 91811 524288 0 0
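
The speedup per function is UNITS_NOOPT divided by OPT. A small helper
for that arithmetic is sketched below; the Row type and the function
names are made up for this example, and only a few rows from the table
above are copied in as sample data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One table row: last UNITS value without and with the optimization. */
typedef struct Row {
    const char *function;
    double units_noopt;
    double units_opt;
} Row;

/* Speedup factor = unoptimized cost / optimized cost. */
static double speedup(const Row *row)
{
    assert(row->units_opt > 0.0);
    return row->units_noopt / row->units_opt;
}

/* A few rows copied from the table above. */
static const Row sample_rows[] = {
    { "idct_16x16_8",             113074.0,  24079.0 },
    { "put_hevc_pel_pixels4_8",     3859.0,   3139.0 },
    { "put_hevc_qpel_uni_v48_8", 1164812.0, 176367.0 },
    { "sao_edge_filter_8",        600074.0,  91811.0 },
};

/* Print one "name  N.NNx" line per sample row. */
static void print_speedups(void)
{
    size_t count = sizeof(sample_rows) / sizeof(sample_rows[0]);
    for (size_t i = 0; i < count; i++)
        printf("%-26s %.2fx\n", sample_rows[i].function,
               speedup(&sample_rows[i]));
}
```

For these four sample rows the factor ranges from roughly 1.2x for
put_hevc_pel_pixels4_8 to roughly 6.6x for put_hevc_qpel_uni_v48_8.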