[FFmpeg-devel] [PATCH 3/3] Use DSPContext.vector_fmul() and DSPContext.vector_fmul_reverse() in floating-point version of apply_window(). 46% faster in function apply_window().
Loren Merritt
lorenm
Wed Jan 5 22:06:07 CET 2011
On Tue, 4 Jan 2011, Justin Ruggles wrote:
> Currently we have vector_fmul() for: C, neon, vfp, altivec, 3dnow, sse
>
> I implemented vector_fmul_copy() for C, altivec, 3dnow, and sse to use 2
> src and 1 dst. The Altivec version of vector_fmul_copy() has not been
> tested, but I implemented it in the hope that someone else will test and
> review it. Here are some benchmarks on my Athlon64. Benchmark numbers
> are in dezicycles.
>
> I also tried to rewrite the current C version in SSE. It was faster
> than the fmul_copy+fmul_reverse since it basically merges the 2 loops,
> but it was slower than vector_fmul_copy(512). I left that out of the
> patch. If anyone is interested I can send it...
I predict that all of the vector_fmul_* mentioned here are memory-bound on
Intel and arithmetic-bound on AMD.
Is there any reason to keep both the 2-arg and 3-arg version of
vector_fmul?
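
To make the question concrete, here is a rough C sketch of the two forms.
The names vector_fmul_2arg(), vector_fmul_3arg() and fmul_forms_agree() are
made up for illustration and are not the DSPContext code. The sketch assumes
the 2-arg form multiplies dst in place and the 3-arg form (vector_fmul_copy()
in the quoted mail) reads 2 sources and writes 1 destination.

```c
#include <string.h>

/* Assumed 2-arg form: dst is both an input and the output. */
void vector_fmul_2arg(float *dst, const float *src, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] *= src[i];
}

/* Assumed 3-arg form: 2 sources, 1 destination. */
void vector_fmul_3arg(float *dst, const float *src0, const float *src1, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src0[i] * src1[i];
}

/* Hypothetical check: calling the 3-arg form with dst == src0 gives
 * the same result as the 2-arg form. Returns 1 when they match. */
int fmul_forms_agree(void)
{
    const float win[4] = { 0.5f, 0.25f, 2.0f, 1.0f };
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

    vector_fmul_2arg(a, win, 4);
    vector_fmul_3arg(b, b, win, 4);   /* in place: dst aliases src0 */
    return memcmp(a, b, sizeof(a)) == 0;
}
```

In plain C the 3-arg form called in place covers the 2-arg case, which is
why keeping both looks redundant. Whether a given asm version tolerates
dst aliasing src0 would have to be checked per implementation.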
--Loren Merritt