[FFmpeg-devel] [PATCH] unscaled float 2 int conversion

Thu May 15 23:17:40 CEST 2008

On Thu, May 15, 2008 at 09:14:15PM +0200, Benjamin Larsson wrote:
> Michael Niedermayer wrote:
> >> Well when I tried the last time I didn't get it to work, there was some
> >> overlap issue that wasn't trivial to sort out. 
> > 
> > You just add 384 or what it was after the windowing/overlap.
> > 
> 
> Just to be clear, this bias scale thing is about not having to use the
> fistp FPU instruction, or whatever it is called on other CPUs. To perform
> it you first scale your samples down to the range -1 to 1. This scaling
> is most often performed for free by scaling a suitable table somewhere.
> Then you add 384 so you can reinterpret the float's bits directly as an
> integer. So you trade a float add against fistp, and the add must have
> been faster on some CPU (or else they wouldn't have used it).
> 
> In FFmpeg we also have 3dnow, sse and altivec code that can do float to
> int16 conversion. I think we can agree that the simd code is faster than
> the bias trick on all processors that support the simd code. Then we
> are left with Intel cpus before P3, the Motorola G3 and various other
> cpus with only fpus and no simd unit. I'm pretty sure that this trick is
> the best when we are dealing with P2 cpus and lower but I'm not sure it
> is for the G3.
> 
> So then we come to the matter of performance, you want benchmarks to
> justify changing or adding a new scaling method. As I don't have access
> to any machines that don't have a simd unit I can't do any usable
> benchmarks. But I'm quite sure that if I had access it would show that
> doing the bias trick would be faster. So one could argue that well ok
> then we keep the code as it is. But my opinion is that we should scrap
> this anyway, it makes the code complex, it slows down the simd code
> (very little though) for no good reason, it complicates the development
> of a proper audio api and filter system. Cpus with slow fpus should use
> fixed point code instead.
> 
> So I propose that we start cleaning out this.
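
For readers following along, the bias trick described in the quoted text can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from FFmpeg or from the attached test: the name conv_bias_one is hypothetical, the input is assumed to be already scaled to about [-1, 1], and float is assumed to be IEEE 754 single precision. With a bias of 384 (1.5 * 2^8) the float's unit in the last place is 2^-15, so the float add itself rounds the sample to a 16-bit step.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper illustrating the bias trick for one sample.
 * Assumes IEEE 754 single precision and x roughly in [-1.0, 1.0]. */
static int16_t conv_bias_one(float x)
{
    /* 384.0f has the bit pattern 0x43C00000. For 256 <= biased < 512
     * one mantissa step equals 2^-15, so the add rounds x to that grid. */
    float biased = x + 384.0f;

    /* Read the bit pattern without violating aliasing rules. */
    uint32_t bits;
    memcpy(&bits, &biased, sizeof bits);

    /* Distance from the bias pattern equals round(x * 32768). */
    int64_t v = (int64_t)bits - 0x43C00000;

    /* x == 1.0 maps to 32768, which does not fit in int16, so clip. */
    if (v >  32767) v =  32767;
    if (v < -32768) v = -32768;
    return (int16_t)v;
}
```

Under these assumptions conv_bias_one(0.5f) yields 16384 and conv_bias_one(-1.0f) yields -32768; inputs far outside the expected range are not handled meaningfully by this sketch.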

Oh well, why do I always have to do the work? You could have saved me
some time by just saying that you won't do the benchmarks.

PS: yes, I don't give a damn what you or anyone else thinks; either
I see benchmarks or people can go talk to the nearest wall.
It would have taken you less time to disable MMX*/SSE* and write
a benchmark than to explain why it's better not to.

P3
gcc-4.4 -O3 -fno-math-errno
145658 dezicycles in conv_cast, 16368 runs, 16 skips
52531 dezicycles in conv_lrint, 16377 runs, 7 skips
57978 dezicycles in conv_bias, 16380 runs, 4 skips

gcc-4.4 -O2 -fno-math-errno
137440 dezicycles in conv_cast, 16362 runs, 22 skips
44664 dezicycles in conv_lrint, 16377 runs, 7 skips
57940 dezicycles in conv_bias, 16379 runs, 5 skips

gcc-4.3 -O2 -fno-math-errno
137546 dezicycles in conv_cast, 16372 runs, 12 skips
44358 dezicycles in conv_lrint, 16379 runs, 5 skips
58084 dezicycles in conv_bias, 16378 runs, 6 skips

gcc-4.2 -O2 -fno-math-errno -lm (yes, 4.2 and earlier call lrintf() literally)
135311 dezicycles in conv_cast, 16367 runs, 17 skips
78760 dezicycles in conv_lrint, 16377 runs, 7 skips
44998 dezicycles in conv_bias, 16379 runs, 5 skips

gcc-4.1 -O2 -fno-math-errno -lm
135832 dezicycles in conv_cast, 16365 runs, 19 skips
78740 dezicycles in conv_lrint, 16375 runs, 9 skips
48120 dezicycles in conv_bias, 16383 runs, 1 skips

gcc-4.0 -O2 -fno-math-errno -lm
135604 dezicycles in conv_cast, 16363 runs, 21 skips
78595 dezicycles in conv_lrint, 16370 runs, 14 skips
45599 dezicycles in conv_bias, 16383 runs, 1 skips

gcc-3.4 -O2 -fno-math-errno -lm
135826 dezicycles in conv_cast, 16372 runs, 12 skips
68563 dezicycles in conv_lrint, 16379 runs, 5 skips
42373 dezicycles in conv_bias, 16381 runs, 3 skips

gcc-3.3 -O2 -fno-math-errno -lm
miscompiled, fails

If we pick the halfway well compiled ones we come up with
44358 dezicycles in conv_lrint, 16379 runs, 5 skips
vs.
42373 dezicycles in conv_bias, 16381 runs, 3 skips
and
31979 dezicycles in conv_lrint, 16380 runs, 4 skips
with instructions slightly reordered by hand

So I would say lrintf() is faster than the bias trick as long as
the compiler doesn't mess up (which it does ... but oh well).

Anyone who cares about their favorite architecture should benchmark it;
if not, I'll OK Ben removing the bias code.

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Republics decline into democracies and democracies degenerate into
despotisms. -- Aristotle
-------------- next part --------------
A non-text attachment was scrubbed...
Name: float2int_test.c
Type: text/x-csrc
Size: 2170 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080515/7c87baf7/attachment.c>