[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations
Siarhei Siamashka
siarhei.siamashka
Sat Aug 30 22:42:31 CEST 2008
On Saturday 30 August 2008, Loren Merritt wrote:
> On Sat, 30 Aug 2008, Siarhei Siamashka wrote:
> > This trivial patch improves overall vorbis decoding performance by ~3% on
> > Pentium-M with gcc 4.2.3
>
> vorbis_residue_decode_type# are superfluous. Just inline
> vorbis_residue_decode_internal into vorbis_residue_decode.
Theoretically they are superfluous (inlining vorbis_residue_decode_internal
into vorbis_residue_decode was the first thing I tried). But in practice the
code is consistently faster this way. It is probably easier for gcc to
optimize 3 independent functions than everything bundled into one huge
function. Let me know if you get different results.
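The patch itself is not reproduced in this message, so the following is only a minimal sketch of the layout being discussed: one always-inlined body that takes the residue type as a parameter, three thin per-type wrappers that pass it as a constant, and a dispatcher. All names and the loop body are hypothetical placeholders, not FFmpeg's actual code, and the always_inline attribute assumes gcc or a compatible compiler.

```c
#include <stddef.h>

/* Hypothetical shared body (placeholder arithmetic, not the real decoder).
 * Every caller passes 'type' as a constant, so once the body is inlined
 * the compiler can discard the branches that cannot be taken. */
static inline __attribute__((always_inline))
int residue_decode_internal(float *vec, size_t len, int type)
{
    for (size_t i = 0; i < len; i++) {
        if (type == 0)
            vec[i] += 1.0f;
        else if (type == 1)
            vec[i] += 2.0f;
        else
            vec[i] += 3.0f;
    }
    return 0;
}

/* One small wrapper per type: each gets its own specialized copy. */
static int residue_decode_type0(float *vec, size_t len)
{
    return residue_decode_internal(vec, len, 0);
}

static int residue_decode_type1(float *vec, size_t len)
{
    return residue_decode_internal(vec, len, 1);
}

static int residue_decode_type2(float *vec, size_t len)
{
    return residue_decode_internal(vec, len, 2);
}

/* Dispatcher: the type is tested once per call, outside the hot loop. */
int residue_decode(float *vec, size_t len, int type)
{
    switch (type) {
    case 0:  return residue_decode_type0(vec, len);
    case 1:  return residue_decode_type1(vec, len);
    case 2:  return residue_decode_type2(vec, len);
    default: return -1; /* unknown residue type */
    }
}
```

The alternative suggested in the quoted reply would inline the body directly into the dispatcher. Whether the compiler keeps the three wrappers as separate functions or merges them back is its own decision, which is one reason the measured difference can vary between compiler versions.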
original unpatched ffmpeg:
178520 dezicycles in vorbis_residue_decode, 1 runs, 0 skips
145815 dezicycles in vorbis_residue_decode, 2 runs, 0 skips
117025 dezicycles in vorbis_residue_decode, 4 runs, 0 skips
96461 dezicycles in vorbis_residue_decode, 8 runs, 0 skips
169166 dezicycles in vorbis_residue_decode, 16 runs, 0 skips
287517 dezicycles in vorbis_residue_decode, 32 runs, 0 skips
348828 dezicycles in vorbis_residue_decode, 64 runs, 0 skips
377407 dezicycles in vorbis_residue_decode, 128 runs, 0 skips
396503 dezicycles in vorbis_residue_decode, 255 runs, 1 skips
409856 dezicycles in vorbis_residue_decode, 510 runs, 2 skips
413123 dezicycles in vorbis_residue_decode, 1021 runs, 3 skips
421823 dezicycles in vorbis_residue_decode, 2043 runs, 5 skips
431205 dezicycles in vorbis_residue_decode, 4090 runs, 6 skips
438082 dezicycles in vorbis_residue_decode, 8172 runs, 20 skips
461191 dezicycles in vorbis_residue_decode, 16320 runs, 64 skips
473954 dezicycles in vorbis_residue_decode, 32635 runs, 133 skips
vorbis_residue_decode_internal inlined into vorbis_residue_decode:
155960 dezicycles in vorbis_residue_decode, 1 runs, 0 skips
127240 dezicycles in vorbis_residue_decode, 2 runs, 0 skips
101385 dezicycles in vorbis_residue_decode, 4 runs, 0 skips
85620 dezicycles in vorbis_residue_decode, 8 runs, 0 skips
155000 dezicycles in vorbis_residue_decode, 16 runs, 0 skips
262903 dezicycles in vorbis_residue_decode, 32 runs, 0 skips
317825 dezicycles in vorbis_residue_decode, 64 runs, 0 skips
353917 dezicycles in vorbis_residue_decode, 128 runs, 0 skips
370371 dezicycles in vorbis_residue_decode, 255 runs, 1 skips
377430 dezicycles in vorbis_residue_decode, 509 runs, 3 skips
382719 dezicycles in vorbis_residue_decode, 1020 runs, 4 skips
396982 dezicycles in vorbis_residue_decode, 2040 runs, 8 skips
401483 dezicycles in vorbis_residue_decode, 4084 runs, 12 skips
406397 dezicycles in vorbis_residue_decode, 8174 runs, 18 skips
426972 dezicycles in vorbis_residue_decode, 16341 runs, 43 skips
438214 dezicycles in vorbis_residue_decode, 32681 runs, 87 skips
patch from my previous e-mail:
167730 dezicycles in vorbis_residue_decode, 1 runs, 0 skips
138015 dezicycles in vorbis_residue_decode, 2 runs, 0 skips
106990 dezicycles in vorbis_residue_decode, 4 runs, 0 skips
86990 dezicycles in vorbis_residue_decode, 8 runs, 0 skips
140043 dezicycles in vorbis_residue_decode, 15 runs, 1 skips
255483 dezicycles in vorbis_residue_decode, 31 runs, 1 skips
315177 dezicycles in vorbis_residue_decode, 63 runs, 1 skips
340941 dezicycles in vorbis_residue_decode, 127 runs, 1 skips
354520 dezicycles in vorbis_residue_decode, 255 runs, 1 skips
363585 dezicycles in vorbis_residue_decode, 511 runs, 1 skips
368976 dezicycles in vorbis_residue_decode, 1017 runs, 7 skips
382299 dezicycles in vorbis_residue_decode, 2037 runs, 11 skips
390577 dezicycles in vorbis_residue_decode, 4076 runs, 20 skips
395680 dezicycles in vorbis_residue_decode, 8157 runs, 35 skips
415305 dezicycles in vorbis_residue_decode, 16334 runs, 50 skips
425434 dezicycles in vorbis_residue_decode, 32691 runs, 77 skips
This minor difference in inlining strategy results in a 0.5%-1% better
overall decoding time.
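To put the three logs side by side, here is a small sketch that compares the last line of each run (the one with the most samples). It assumes each line is a running average per call, as printed by FFmpeg's STOP_TIMER macro; the helper and constant names are made up for illustration.

```c
/* Final averages (dezicycles) copied from the last line of each log. */
static const double DEZI_ORIGINAL   = 473954.0; /* unpatched ffmpeg       */
static const double DEZI_INLINED    = 438214.0; /* internal fully inlined */
static const double DEZI_THREE_FUNC = 425434.0; /* patch, three wrappers  */

/* Percentage by which 'after' is smaller than 'before'. */
static double percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

With these inputs, percent_reduction(DEZI_ORIGINAL, DEZI_INLINED) is about 7.5, percent_reduction(DEZI_ORIGINAL, DEZI_THREE_FUNC) is about 10.2, and percent_reduction(DEZI_INLINED, DEZI_THREE_FUNC) is about 2.9. A gain of roughly 3% inside this one function is consistent with a smaller gain in overall decoding time, since the function is only part of the decoder's work.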
Also here is the result of a valgrind (callgrind) simulation (it is quite
precise and does not suffer from random deviation):
--------------------
callgrind simulation for './ffmpeg_g.1huge' (L1 data cache is 32K):
I refs:        85,817,091
D refs:        43,457,905  (28,888,575 rd + 14,569,330 wr)
D1 misses:        785,564  (   583,645 rd +    201,919 wr)
D1 miss rate:        1.8%  (      2.0%    +       1.3%   )
callgrind simulation for './ffmpeg_g.3func' (L1 data cache is 32K):
I refs:        85,085,997
D refs:        42,653,212  (28,454,961 rd + 14,198,251 wr)
D1 misses:        782,978  (   581,685 rd +    201,293 wr)
D1 miss rate:        1.8%  (      2.0%    +       1.4%   )
The difference is visible both for the total number of instructions and for
the number of memory accesses.
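The size of that difference can be worked out directly from the two summaries. The sketch below does the arithmetic; the struct and function names are illustrative only, and the totals are copied from the callgrind output above.

```c
/* Totals from the two callgrind summaries. */
struct sim_counts {
    long i_refs;    /* instructions executed  */
    long d_refs;    /* data reads plus writes */
    long d1_misses; /* L1 data cache misses   */
};

static const struct sim_counts one_huge   = { 85817091L, 43457905L, 785564L };
static const struct sim_counts three_func = { 85085997L, 42653212L, 782978L };

/* Events saved by the split build, as a percentage of the single-function build. */
static double saved_percent(long huge, long split)
{
    return 100.0 * (double)(huge - split) / (double)huge;
}
```

This gives 731,094 fewer instructions (about 0.85%), 804,693 fewer data references (about 1.85%), and 2,586 fewer D1 misses (about 0.33%) for the three-function build.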
I understand that all of this is rather shaky and may depend heavily on the
compiler version and optimization options. I think you are in a better
position to decide what to do with the 'vorbis_residue_decode' function and
to commit something that is good for vorbis decoding performance. Most
likely my job ends here.
--
Best regards,
Siarhei Siamashka