[Libav-user] gcc auto-vectorisation
"René J.V. Bertin"
rjvbertin at gmail.com
Wed Feb 27 17:31:37 CET 2013
I've added a benchmarking option to ffmpeg that breaks down the time spent in the various stages without dumping everything immediately, as -benchmark_all does. The results are quite instructive. It turns out that running auto-vectorised code actually carries a slight penalty, roughly 3 to 6% in the overall timings below (probably too small to be explained by non-optimal vectorisation caused by alignment assumptions not being met). There's also a huge chunk of work that isn't being benchmarked at all, and it is multithreaded. I haven't gone so far as to delve into ffmpeg.c to figure out what it corresponds to, though. The format conversion(s), maybe?
For me this settles the question: better to stick with not using auto-vectorisation, especially since it also causes a few tests to fail.
I have yet to test my modifications on MS Windows, but I'd be willing to post a patch for this option (though I admit it would annoy me to have to adapt my cross-platform high-resolution (HR) timing routines to ffmpeg naming conventions :( )
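For readers who want to reproduce this kind of per-stage breakdown outside ffmpeg, here is a minimal Python sketch of an accumulator that records samples, user time, kernel time and real time per stage. It is not the patch discussed here: the names `StageStats`, `measure` and `cpu_percent` are hypothetical, and the CPU % formula, (user + kernel) / real, is an assumption that happens to agree with the rows below (for example (27.0846 + 2.48361) / 13.5333 is about 218.48%).

```python
import os
import time
from contextlib import contextmanager
from dataclasses import dataclass


def cpu_percent(user_s, kernel_s, real_s):
    """CPU seconds consumed per wall-clock second, as a percentage.

    Assumed formula; values above 100% indicate more than one busy thread.
    """
    if real_s <= 0:
        return 0.0
    return 100.0 * (user_s + kernel_s) / real_s


@dataclass
class StageStats:
    """Accumulated timings for one stage (hypothetical helper)."""
    samples: int = 0
    user: float = 0.0
    kernel: float = 0.0
    real: float = 0.0

    @contextmanager
    def measure(self):
        """Time one pass through the stage and add it to the totals."""
        cpu0 = os.times()            # process-wide user/system CPU time
        real0 = time.perf_counter()  # high-resolution wall clock
        try:
            yield
        finally:
            cpu1 = os.times()
            self.real += time.perf_counter() - real0
            self.user += cpu1.user - cpu0.user
            self.kernel += cpu1.system - cpu0.system
            self.samples += 1


if __name__ == "__main__":
    stats = StageStats()
    for _ in range(5):
        with stats.measure():
            sum(i * i for i in range(200_000))  # stand-in for a decode call
    print(f"samples={stats.samples} user={stats.user:.4f}s "
          f"kernel={stats.kernel:.4f}s real={stats.real:.4f}s "
          f"CPU={cpu_percent(stats.user, stats.kernel, stats.real):.1f}%")
```

Because `os.times()` reports CPU time for the whole process, work done by other threads during a measured span is included, which is how a single stage can show more than 100% CPU.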
Benchmark results (intended to focus on decoding for playback; I'm surprised encoding to rawvideo is so expensive):
> time /usr/local/FFmpeg/trunk/bin/ffmpeg-rjvb -benchmark_most -y -v 0 -i ~/Desktop/Downloads/SOA4ep11.flv -pix_fmt argb -vcodec rawvideo -acodec pcm_f32le -f mov /dev/null; \
time /usr/local/FFmpeg/trunk.vect/bin/ffmpeg-rjvb -benchmark_most -y -v 0 -i ~/Desktop/Downloads/SOA4ep11.flv -pix_fmt argb -vcodec rawvideo -acodec pcm_f32le -f mov /dev/null ; \
time /usr/local/FFmpeg/trunk.O0/bin/ffmpeg-rjvb -benchmark_most -y -v 0 -i ~/Desktop/Downloads/SOA4ep11.flv -pix_fmt argb -vcodec rawvideo -acodec pcm_f32le -f mov /dev/null ; \
time /usr/local/FFmpeg/trunk.O0vect/bin/ffmpeg-rjvb -benchmark_most -y -v 0 -i ~/Desktop/Downloads/SOA4ep11.flv -pix_fmt argb -vcodec rawvideo -acodec pcm_f32le -f mov /dev/null
Detailed benchmark results: (32 bit, MMX/SSE code, -fno-tree-vectorize)
Stage           samples   user t    kernel t   real t        CPU %
Video decode  : 85166     27.0846s  2.48361s   13.5333s      218.484%
Audio decode  : 152971    10.5851s  0.189161s  4.71418s      228.55%
Video encode  : 85164     38.3081s  0.304017s  19.4358s      198.665%
Audio encode  : 152969    1.12343s  0.141641s  0.581738s     217.465%
Failed loops  : 1         0s        1e-06s     8.64995e-07s  115.608%
Weighed totals: 476271/5  15.4539s  0.604725s  7.59638s      211.398%
Overall execution timing:
              : 1         233.666s  6.46592s   108.363s      221.6%
233.673 user_cpu 6.472 kernel_cpu 1:48.37 total_time 221.5%CPU {0W 0X 0D 0K 21553152M 37F 12625R 0I 0O 0r 0s 0k 0w 213203c}
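As a rough check on how much of this run the per-stage rows account for, the sketch below sums the real-time column of the 32 bit, -fno-tree-vectorize results above and compares it with the overall real time. It assumes the stages do not overlap in wall-clock time, which the post does not confirm; the function name `coverage` is made up for illustration.

```python
# Real (wall-clock) seconds per benchmarked stage, copied from the
# 32 bit, -fno-tree-vectorize results above.
STAGE_REAL_S = {
    "Video decode": 13.5333,
    "Audio decode": 4.71418,
    "Video encode": 19.4358,
    "Audio encode": 0.581738,
    "Failed loops": 8.64995e-07,
}
OVERALL_REAL_S = 108.363  # "Overall execution timing" row, real t


def coverage(stage_real_s, overall_real_s):
    """Return (benchmarked s, unaccounted s, unaccounted fraction).

    Assumes the stages run one after another, so their real times add up.
    """
    benchmarked = sum(stage_real_s.values())
    unaccounted = overall_real_s - benchmarked
    return benchmarked, unaccounted, unaccounted / overall_real_s


if __name__ == "__main__":
    b, u, f = coverage(STAGE_REAL_S, OVERALL_REAL_S)
    print(f"benchmarked: {b:.2f}s  unaccounted: {u:.2f}s ({f:.1%})")
```

Under that assumption the four stages cover only about 38 s of the 108 s run, which matches the remark that a large chunk of the work is not benchmarked.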
Detailed benchmark results: (32 bit, MMX/SSE code, -ftree-vectorize)
Stage           samples   user t    kernel t   real t        CPU %
Video decode  : 85166     27.9066s  2.62058s   13.9246s      219.232%
Audio decode  : 152971    11.0481s  0.201142s  4.9342s       227.985%
Video encode  : 85164     40.3674s  0.33645s   20.4187s      199.346%
Audio encode  : 152969    1.23643s  0.150545s  0.602971s     230.023%
Failed loops  : 1         0s        0s         1.012e-06s    0%
Weighed totals: 476271/5  16.1541s  0.641726s  7.91958s      212.079%
Overall execution timing:
              : 1         246.41s   6.8878s    114.681s      220.872%
246.418 user_cpu 6.894 kernel_cpu 1:54.69 total_time 220.8%CPU {0W 0X 0D 0K 21592064M 0F 12679R 0I 0O 0r 0s 0k 0w 216849c}
Detailed benchmark results: (64 bit, no MMX/SSE code, -fno-tree-vectorize)
Stage           samples   user t    kernel t   real t        CPU %
Video decode  : 85166     199.297s  3.36215s   49.5899s      408.67%
Audio decode  : 152971    29.3016s  0.32553s   8.54242s      346.823%
Video encode  : 85164     73.5307s  0.530001s  23.1734s      319.594%
Audio encode  : 152969    2.49674s  0.203737s  0.718678s     375.756%
Failed loops  : 1         1e-06s    1e-06s     1.07e-06s     186.915%
Weighed totals: 476271/5  58.9994s  0.865978s  15.9858s      374.49%
Overall execution timing:
              : 1         535.404s  9.23317s   155.785s      349.607%
535.408 user_cpu 9.239 kernel_cpu 2:35.79 total_time 349.5%CPU {0W 0X 0D 0K 22816768M 220F 13492R 0I 0O 0r 0s 0k 0w 470931c}
Detailed benchmark results: (64 bit, no MMX/SSE code, -ftree-vectorize)
Stage           samples   user t    kernel t   real t        CPU %
Video decode  : 85166     213.686s  3.43406s   53.0596s      409.201%
Audio decode  : 152971    30.5476s  0.328917s  8.79987s      350.874%
Video encode  : 85164     74.08s    0.51521s   23.2998s      320.153%
Audio encode  : 152969    2.47226s  0.202298s  0.745479s     358.771%
Failed loops  : 1         0s        1e-06s     1.24599e-06s  80.2573%
Weighed totals: 476271/5  62.0631s  0.876818s  16.7202s      376.43%
Overall execution timing:
              : 1         558.639s  9.3095s    160.509s      353.842%
558.643 user_cpu 9.318 kernel_cpu 2:40.51 total_time 353.8%CPU {0W 0X 0D 0K 22867968M 195F 13538R 0I 0O 0r 0s 0k 0w 480915c}
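To put a number on the vectorisation penalty, the following sketch computes the relative slowdown of each -ftree-vectorize run against its -fno-tree-vectorize counterpart, using the user and real times from the "Overall execution timing" rows above. The helper names are illustrative only, and each configuration was timed once here, so run-to-run variation is not captured.

```python
# (user seconds, real seconds) from the "Overall execution timing" rows.
OVERALL = {
    "32 bit, MMX/SSE code": {
        "-fno-tree-vectorize": (233.666, 108.363),
        "-ftree-vectorize": (246.41, 114.681),
    },
    "64 bit, no MMX/SSE code": {
        "-fno-tree-vectorize": (535.404, 155.785),
        "-ftree-vectorize": (558.639, 160.509),
    },
}


def slowdown_percent(baseline, candidate):
    """Relative increase of candidate over baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


def penalties(overall):
    """Map each build to (user-time penalty %, real-time penalty %)."""
    result = {}
    for build, runs in overall.items():
        base_user, base_real = runs["-fno-tree-vectorize"]
        vect_user, vect_real = runs["-ftree-vectorize"]
        result[build] = (
            slowdown_percent(base_user, vect_user),
            slowdown_percent(base_real, vect_real),
        )
    return result


if __name__ == "__main__":
    for build, (user_pct, real_pct) in penalties(OVERALL).items():
        print(f"{build}: user +{user_pct:.1f}%, real +{real_pct:.1f}%")
```

With these figures the auto-vectorised builds come out a few percent slower in both user and real time for both configurations.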
The test video:
> /usr/local/FFmpeg/trunk/bin/ffprobe ~/Desktop/Downloads/SOA4ep11.flv
ffprobe version N-50309-gaf0e814 Copyright (c) 2007-2013 the FFmpeg developers
built on Feb 25 2013 19:48:25 with gcc 4.7.2 (MacPorts gcc47 4.7.2_2+universal)
configuration: --prefix=/usr/local/FFmpeg/trunk --target-os=darwin --enable-shared --enable-static --enable-gpl --enable-nonfree --enable-libfreetype --enable-pthreads --enable-yasm --disable-doc --cpu=core2 --enable-debug=1 --disable-stripping --enable-ffmpeg --enable-ffprobe --disable-ffplay --enable-hwaccels --enable-libx264 --cc=gcc-mp-4.7 --disable-outdev=sdl
libavutil 52. 17.103 / 52. 17.103
libavcodec 54. 92.100 / 54. 92.100
libavformat 54. 63.100 / 54. 63.100
libavdevice 54. 3.103 / 54. 3.103
libavfilter 3. 41.100 / 3. 41.100
libswscale 2. 2.100 / 2. 2.100
libswresample 0. 17.102 / 0. 17.102
libpostproc 52. 2.100 / 52. 2.100
Input #0, flv, from '/Users/bertin/Desktop/Downloads/SOA4ep11.flv':
Metadata:
canSeekToEnd : false
hasCuePoints : false
hasVideo : true
videosize : 101806465
lasttimestamp : 3552
hasMetadata : true
hasKeyframes : true
metadatacreator : inlet media FLVTool2 v1.0.6 - http://www.inlet-media.de/flvtool2
hasAudio : true
audiodelay : 0
lastkeyframetimestamp: 3539
datasize : 139300867
audiosize : 37480392
Duration: 00:59:11.99, start: 0.042000, bitrate: 315 kb/s
Stream #0:0: Video: h264 (High), yuv420p, 624x352 [SAR 1:1 DAR 39:22], 232 kb/s, 23.98 tbr, 1k tbn, 47.95 tbc
Stream #0:1: Audio: aac, 44100 Hz, stereo, fltp, 82 kb/s