[FFmpeg-devel] [PATCH] move h264 loopfilter strength code to yasm
Ronald S. Bultje
Fri Sep 24 00:13:30 CEST 2010
$subj. This could probably be done in inline asm as well, but I still
can't write that. The advantage of writing it fully in asm is that it
gets rid of gcc doing a pretty bad job at optimizing. For example,
b_idx and edge are essentially the same thing, and d_idx, mask_mv and
mask_dir (and the pand statement that follows) can all be inlined (or,
in the case of pand with -1, removed) if you unroll the top-level loop.
gcc does unroll it, but it still keeps a separate register for each of
the above. The asm version needs no stack space and is much faster,
although it does exactly what the original code was (I think?) meant to
get gcc to do.
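The loop shape described above can be modelled with a small scalar C sketch. This is NOT the code from the patch or from FFmpeg: the 40-entry cache, the BASE/STRIDE offsets, the "abs(mv difference) >= 4" test, the carry flag standing in for "pand mask_dir", and every function name are hypothetical, and there is no SIMD. strength_looped picks d_idx, mask_mv and mask_dir per direction at run time and advances edge and b_idx separately. strength_unrolled calls an inline helper once per direction with literal constants and derives b_idx from edge, which is the shape that lets a compiler fold the constants and drop the AND with -1 (whether it actually does so is exactly what the mail is complaining about).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified model; NOT FFmpeg's h264 code. The cache layout,
 * the mv test and all names are illustrative assumptions. */
enum { STRIDE = 8, BASE = 12 };  /* assumed row stride / offset of first block */

/* Shared per-edge work over 4 positions. carry[] is the mv flag left over
 * from the previous edge; AND-ing it with mask_dir keeps it (-1) or clears
 * it (0), modelling the "pand mask_dir" step. */
static inline void edge_strength(int16_t out[4], const uint8_t *nnz,
                                 const int16_t *mv, int b_idx, int d_idx,
                                 int check_mv, int mask_dir, int carry[4])
{
    for (int i = 0; i < 4; i++) {
        int idx = BASE + b_idx + i;
        carry[i] &= mask_dir;
        if (check_mv)
            carry[i] = abs(mv[idx] - mv[idx + d_idx]) >= 4;
        out[i] = (int16_t)((nnz[idx] | nnz[idx + d_idx]) ? 2 : carry[i]);
    }
}

/* Rolled shape: per-direction values chosen at run time, and two induction
 * variables that always satisfy b_idx == STRIDE * edge. Assumes edges <= 4. */
static void strength_looped(int16_t bs[2][4][4], const uint8_t nnz[40],
                            const int16_t mv[40], int edges, int step,
                            int mask_mv0, int mask_mv1)
{
    for (int dir = 1; dir >= 0; dir--) {
        const int d_idx    = dir ? -STRIDE : -1;
        const int mask_mv  = dir ? mask_mv1 : mask_mv0;
        const int mask_dir = dir ? 0 : -1;
        const int n_edges  = dir ? edges : 4;
        const int n_step   = dir ? step : 1;
        int carry[4] = {0, 0, 0, 0};
        for (int edge = 0, b_idx = 0; edge < n_edges;
             edge += n_step, b_idx += STRIDE * n_step)
            edge_strength(bs[dir][edge], nnz, mv, b_idx, d_idx,
                          !(mask_mv & edge), mask_dir, carry);
    }
}

/* Unrolled shape: a single induction variable (b_idx derived from edge). */
static inline void strength_dir(int16_t bs[4][4], const uint8_t *nnz,
                                const int16_t *mv, int edges, int step,
                                int mask_mv, int d_idx, int mask_dir)
{
    int carry[4] = {0, 0, 0, 0};
    for (int edge = 0; edge < edges; edge += step)
        edge_strength(bs[edge], nnz, mv, STRIDE * edge, d_idx,
                      !(mask_mv & edge), mask_dir, carry);
}

/* One call per direction with literal d_idx and mask_dir: after inlining, a
 * compiler can fold them, and "carry &= -1" becomes a no-op for dir 0. */
static void strength_unrolled(int16_t bs[2][4][4], const uint8_t nnz[40],
                              const int16_t mv[40], int edges, int step,
                              int mask_mv0, int mask_mv1)
{
    strength_dir(bs[1], nnz, mv, edges, step, mask_mv1, -STRIDE, 0);
    strength_dir(bs[0], nnz, mv, 4, 1, mask_mv0, -1, -1);
}
```

Both functions compute the same result; only the loop structure the compiler sees differs.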
Code goes from 116 to 86 cycles (about 26% fewer cycles), leading to a
0.075% speedup overall (measured on OS X 10.6.4, x86-64 Core i7,
cathedral sample). I had exact numbers somewhere but lost them when I
closed the text editor I'd saved them in. :-( Also, it's not much, but
every cycle counts...
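The cycle figures can be checked with a little arithmetic. The last line is an inference, not a number from the measurements: it assumes the 30-cycle saving applies to every call and that the 0.075% is a reduction in total decoding time, and it estimates what share of that time the strength calculation takes.

```latex
\begin{aligned}
\Delta &= 116 - 86 = 30 \text{ cycles per call} \\
\frac{\Delta}{116} &= \frac{30}{116} \approx 0.259 \quad (\text{about 26\% fewer cycles}) \\
\frac{116}{86} &\approx 1.35 \quad (\text{about } 1.35\times \text{ the throughput of this function}) \\
f &\approx \frac{0.075\%}{25.9\%} \approx 0.29\% \quad (\text{implied share of total decode time})
\end{aligned}
```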
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 14197 bytes