[FFmpeg-devel] gcc 2.95.3 support plan

Wed Feb 11 21:43:35 CET 2009

On Wed, 11 Feb 2009, Ivan Kalvachev wrote:

> So on AMD the plain asm cmov function is faster than mmxext.
> Can you show benchmarks?

Benchmarking add_hfyu_median_prediction with width=1280, over 2**20 runs.
The stddev is about 4 cycles.

Intel Core2 e6600, 64bit, gcc-4.2.3
24929 cycles in plain c
19086 cycles in c with HAVE_CMOV (i.e. asm mid_pred())
16489 cycles in cmov asm
 8869 cycles in mmx

AMD K8 3400+, 64bit, gcc-4.2.3
21165 cycles in plain c
14361 cycles in c with HAVE_CMOV
 9398 cycles in cmov asm
14048 cycles in mmx

The numbers are easily explained:
My mmx version doesn't use any SIMD; it just applies mmx ops with one value
per reg. (SIMD decoding is impossible in huffyuv. I'm writing a new format to
remedy that, among other improvements. It's not ready yet.)
On P3, PM, and Core2, cmov has latency 2 and pmaxub has latency 1.
On K7, K8, and K10, cmov has latency 1 and pmaxub has latency 2.
The critical path is made almost entirely of those instructions, so
whichever of the two has the lower latency on a given CPU wins there.
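
To make the dependency chain concrete, here is a rough C sketch of a
huffyuv-style median-prediction decode loop. It is an illustration under
stated assumptions, not the benchmarked code: the names median3 and
median_pred_decode are hypothetical, and the median is written with plain
min/max selects, which is the shape a cmov or pminub/pmaxub sequence would
compute. Each decoded pixel feeds the prediction of the next one, so the
loop cannot be spread across SIMD lanes and its speed is bound by the
latency of the select instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Median of three values built from min/max selects only.
 * (Hypothetical helper, named for this sketch.) */
static int median3(int a, int b, int c)
{
    int lo = a < b ? a : b;     /* min(a, b) */
    int hi = a < b ? b : a;     /* max(a, b) */
    int m  = hi < c ? hi : c;   /* min(max(a, b), c) */
    return lo > m ? lo : m;     /* max(min(a, b), min(max(a, b), c)) */
}

/* Hypothetical stand-in for a median-prediction decoder.
 *   top:  previous, already decoded row
 *   diff: residuals for the current row
 *   dst:  decoded current row
 *   left, left_top: predictor state carried across calls */
static void median_pred_decode(uint8_t *dst, const uint8_t *top,
                               const uint8_t *diff, int w,
                               int *left, int *left_top)
{
    assert(w >= 0);
    uint8_t l  = (uint8_t)*left;
    uint8_t lt = (uint8_t)*left_top;

    for (int i = 0; i < w; i++) {
        /* Gradient predictor: left + top - top-left, wrapped to 8 bits. */
        int grad = (l + top[i] - lt) & 0xFF;
        /* The new l depends on the previous l: a serial chain. */
        l  = (uint8_t)(median3(l, top[i], grad) + diff[i]);
        lt = top[i];
        dst[i] = l;
    }

    *left     = l;
    *left_top = lt;
}
```

With top = {10, 20, 30}, diff = {1, 2, 3}, and both state values starting
at 0, this decodes to {11, 22, 33}.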

--Loren Merritt