[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations

Siarhei Siamashka siarhei.siamashka
Sat Aug 30 22:42:31 CEST 2008


On Saturday 30 August 2008, Loren Merritt wrote:
> On Sat, 30 Aug 2008, Siarhei Siamashka wrote:
> > This trivial patch improves overall vorbis decoding performance by ~3% on
> > Pentium-M with gcc 4.2.3
>
> vorbis_residue_decode_type# are superfluous. Just inline
> vorbis_residue_decode_internal into vorbis_residue_decode.

Theoretically they are superfluous (inlining vorbis_residue_decode_internal
into vorbis_residue_decode was the first thing that I tried). But in practice
the code is consistently faster this way. It is probably easier for gcc to
optimize 3 independent functions than everything bundled into one huge
function. Let me know if you get different results.
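
For anyone not following the patch itself, the shape of the "3 independent
functions" variant can be sketched roughly as below. This is not the actual
FFmpeg code: the names residue_decode_internal, residue_decode_type0/1/2 and
residue_decode, the simplified signature, and the toy loop body are all
hypothetical, and __attribute__((always_inline)) is a GCC/Clang extension
assumed here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared decoder body. `type` is a constant
 * at every call site below, so after forced inlining the compiler can drop
 * the branches that do not apply in each specialized copy. */
static inline __attribute__((always_inline))
void residue_decode_internal(float *vec, size_t n, int type)
{
    for (size_t i = 0; i < n; i++) {
        if (type == 0)
            vec[i] += 1.0f;   /* toy work, not real Vorbis logic */
        else if (type == 1)
            vec[i] += 2.0f;
        else
            vec[i] += 3.0f;
    }
}

/* One small wrapper per residue type, each with `type` fixed. */
static void residue_decode_type0(float *vec, size_t n)
{
    residue_decode_internal(vec, n, 0);
}

static void residue_decode_type1(float *vec, size_t n)
{
    residue_decode_internal(vec, n, 1);
}

static void residue_decode_type2(float *vec, size_t n)
{
    residue_decode_internal(vec, n, 2);
}

/* Dispatcher: picks the specialized copy at run time. */
void residue_decode(float *vec, size_t n, int type)
{
    assert(type >= 0 && type <= 2);
    switch (type) {
    case 0:  residue_decode_type0(vec, n); break;
    case 1:  residue_decode_type1(vec, n); break;
    default: residue_decode_type2(vec, n); break;
    }
}
```

The alternative discussed above would call residue_decode_internal directly
from residue_decode with a run-time `type`, leaving one large function for the
compiler to optimize. Whether either form is faster is up to the compiler,
which is free to inline or not inline the small wrappers in this sketch.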

original unpatched ffmpeg:

178520 dezicycles in vorbis_residue_decode, 1 runs, 0 skips
145815 dezicycles in vorbis_residue_decode, 2 runs, 0 skips
117025 dezicycles in vorbis_residue_decode, 4 runs, 0 skips
96461 dezicycles in vorbis_residue_decode, 8 runs, 0 skips
169166 dezicycles in vorbis_residue_decode, 16 runs, 0 skips
287517 dezicycles in vorbis_residue_decode, 32 runs, 0 skips
348828 dezicycles in vorbis_residue_decode, 64 runs, 0 skips
377407 dezicycles in vorbis_residue_decode, 128 runs, 0 skips
396503 dezicycles in vorbis_residue_decode, 255 runs, 1 skips
409856 dezicycles in vorbis_residue_decode, 510 runs, 2 skips
413123 dezicycles in vorbis_residue_decode, 1021 runs, 3 skips
421823 dezicycles in vorbis_residue_decode, 2043 runs, 5 skips
431205 dezicycles in vorbis_residue_decode, 4090 runs, 6 skips
438082 dezicycles in vorbis_residue_decode, 8172 runs, 20 skips
461191 dezicycles in vorbis_residue_decode, 16320 runs, 64 skips
473954 dezicycles in vorbis_residue_decode, 32635 runs, 133 skips

vorbis_residue_decode_internal inlined into vorbis_residue_decode:

155960 dezicycles in vorbis_residue_decode, 1 runs, 0 skips
127240 dezicycles in vorbis_residue_decode, 2 runs, 0 skips
101385 dezicycles in vorbis_residue_decode, 4 runs, 0 skips
85620 dezicycles in vorbis_residue_decode, 8 runs, 0 skips
155000 dezicycles in vorbis_residue_decode, 16 runs, 0 skips
262903 dezicycles in vorbis_residue_decode, 32 runs, 0 skips
317825 dezicycles in vorbis_residue_decode, 64 runs, 0 skips
353917 dezicycles in vorbis_residue_decode, 128 runs, 0 skips
370371 dezicycles in vorbis_residue_decode, 255 runs, 1 skips
377430 dezicycles in vorbis_residue_decode, 509 runs, 3 skips
382719 dezicycles in vorbis_residue_decode, 1020 runs, 4 skips
396982 dezicycles in vorbis_residue_decode, 2040 runs, 8 skips
401483 dezicycles in vorbis_residue_decode, 4084 runs, 12 skips
406397 dezicycles in vorbis_residue_decode, 8174 runs, 18 skips
426972 dezicycles in vorbis_residue_decode, 16341 runs, 43 skips
438214 dezicycles in vorbis_residue_decode, 32681 runs, 87 skips

patch from my previous e-mail:

167730 dezicycles in vorbis_residue_decode, 1 runs, 0 skips
138015 dezicycles in vorbis_residue_decode, 2 runs, 0 skips
106990 dezicycles in vorbis_residue_decode, 4 runs, 0 skips
86990 dezicycles in vorbis_residue_decode, 8 runs, 0 skips
140043 dezicycles in vorbis_residue_decode, 15 runs, 1 skips
255483 dezicycles in vorbis_residue_decode, 31 runs, 1 skips
315177 dezicycles in vorbis_residue_decode, 63 runs, 1 skips
340941 dezicycles in vorbis_residue_decode, 127 runs, 1 skips
354520 dezicycles in vorbis_residue_decode, 255 runs, 1 skips
363585 dezicycles in vorbis_residue_decode, 511 runs, 1 skips
368976 dezicycles in vorbis_residue_decode, 1017 runs, 7 skips
382299 dezicycles in vorbis_residue_decode, 2037 runs, 11 skips
390577 dezicycles in vorbis_residue_decode, 4076 runs, 20 skips
395680 dezicycles in vorbis_residue_decode, 8157 runs, 35 skips
415305 dezicycles in vorbis_residue_decode, 16334 runs, 50 skips
425434 dezicycles in vorbis_residue_decode, 32691 runs, 77 skips

This minor difference in inlining strategy improves overall decoding time
by 0.5%-1%.
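
To put the three timing runs side by side, a small helper like the one below
can turn two dezicycle averages into a relative saving. It is only a sketch:
the helper name percent_saved and the enum names are made up for illustration,
and the three constants are simply the last line of each run above.

```c
#include <assert.h>

/* Last reported average of each run above, in dezicycles. */
enum {
    UNPATCHED = 473954,  /* original unpatched ffmpeg       */
    INLINED   = 438214,  /* internal function fully inlined */
    THREEFUNC = 425434   /* patch from the previous e-mail  */
};

/* Percentage by which `candidate` is cheaper than `baseline`.
 * A positive result means the candidate needs fewer dezicycles. */
double percent_saved(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - candidate) / baseline;
}
```

With these inputs, percent_saved(UNPATCHED, THREEFUNC) comes to roughly 10%
and percent_saved(INLINED, THREEFUNC) to roughly 3%. Those figures apply to
vorbis_residue_decode alone, which is consistent with a smaller effect on
overall decoding time.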

Also, here are the results of a valgrind simulation (it is quite precise and
doesn't suffer from random run-to-run variation):

--------------------
callgrind simulation for './ffmpeg_g.1huge' (L1 data cache is 32K):
I   refs:      85,817,091
D   refs:      43,457,905  (28,888,575 rd + 14,569,330 wr)
D1  misses:       785,564  (   583,645 rd +    201,919 wr)
D1  miss rate:        1.8% (       2.0%   +        1.3%  )
callgrind simulation for './ffmpeg_g.3func' (L1 data cache is 32K):
I   refs:      85,085,997
D   refs:      42,653,212  (28,454,961 rd + 14,198,251 wr)
D1  misses:       782,978  (   581,685 rd +    201,293 wr)
D1  miss rate:        1.8% (       2.0%   +        1.4%  )

The difference is visible both in the total number of instructions and in
the number of memory accesses.
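
The size of that difference can be checked from the two summaries with a few
lines of C. This is a sketch only: the struct and function names are
hypothetical, the counters are copied from the output above, and reading
'1huge' as the fully inlined build and '3func' as the three-function build is
an assumption based on the file names.

```c
#include <assert.h>
#include <stdint.h>

/* Counters taken from one callgrind summary. */
struct cg_summary {
    uint64_t i_refs;     /* instruction references         */
    uint64_t d_refs;     /* data references (reads+writes) */
    uint64_t d1_misses;  /* L1 data cache misses           */
};

const struct cg_summary cg_1huge = { 85817091, 43457905, 785564 };
const struct cg_summary cg_3func = { 85085997, 42653212, 782978 };

/* D1 miss rate in percent, as on the "D1 miss rate" line. */
double d1_miss_rate(const struct cg_summary *s)
{
    assert(s->d_refs > 0);
    return 100.0 * (double)s->d1_misses / (double)s->d_refs;
}

/* Per-counter difference a - b; expects a >= b for every counter. */
struct cg_summary cg_saved(const struct cg_summary *a,
                           const struct cg_summary *b)
{
    assert(a->i_refs >= b->i_refs);
    assert(a->d_refs >= b->d_refs);
    assert(a->d1_misses >= b->d1_misses);
    struct cg_summary d = {
        a->i_refs - b->i_refs,
        a->d_refs - b->d_refs,
        a->d1_misses - b->d1_misses
    };
    return d;
}
```

cg_saved(&cg_1huge, &cg_3func) gives 731,094 fewer instruction references
(about 0.85%) and 804,693 fewer data references (about 1.85%) for the 3func
build, while d1_miss_rate stays at about 1.8% for both.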

I understand that all of this is rather shaky and may depend heavily on the
compiler version and optimization options. I think you are in a better
position to decide what to do with the 'vorbis_residue_decode' function and
to commit something that is good for vorbis decoding performance. Most likely
my job ends here.

-- 
Best regards,
Siarhei Siamashka



