[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc
Mon May 4 05:39:05 CEST 2009
On Sat, 25 Apr 2009 03:03:30 +0000 (UTC)
Loren Merritt <lorenm at u.washington.edu> wrote:
> On Fri, 24 Apr 2009, Bobby Bingham wrote:
> > Attached are patches to move flac_encode_residual_lpc to dsputils,
> > and to add SSE3 and SSE4 implementations. I wrote the SSE3 first,
> > but since it doesn't have signed 32x32 multiplication AFAICT, I
> > ended up using double precision floats for it, and the result is
> > code that's slower than the C version. Unless somebody has a
> > suggestion of how to fix this, I'll drop the SSE3 version.
> > I tried an SSE4 version because it does have signed 32x32->32
> > multiplication, like the C version uses. Unfortunately, I don't
> > have an SSE4-capable processor to test it with, so I can't check
> > its speed or even its correctness. Benchmarks welcome.
> Fails the regression test on my Penryn.
> > +// TODO: look into palignr?
> Yea, do that. It should be possible to load each sample just once
> (aligned), and do all other manipulation in registers.
> There are no cpus with both lddqu and sse4, so you're paying the full
> cost of unaligned loads.
I've changed the code to use palignr, and hopefully fixed it to work
correctly now. I've also removed the SSE3 code from this patch as I
haven't managed to get it any faster by using integer arithmetic yet.
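
For readers without the attachment, here is a rough, portable C sketch of the two ideas discussed above. It is not the attached patch. residual_lpc_ref is a scalar baseline using 32x32->32 products, the kind of output a SIMD version would have to match in the regression test. residual_lpc_windowed loads each block of four samples once and builds every shifted window from the previous and current blocks, which is the job palignr does in registers. The function names, the four-lane block size, and the restrictions (n a multiple of 4, order between 1 and 4) are assumptions made for illustration only. align_lanes emulates palignr at 32-bit lane granularity, so lane offset k corresponds to a byte shift of 4*k.

```c
#include <stdint.h>
#include <string.h>

/* Scalar baseline (hypothetical name):
 * res[i] = smp[i] - ((sum_j coefs[j] * smp[i-j-1]) >> shift).
 * Assumes products and sums fit in 32 bits and that >> on a negative
 * value is an arithmetic shift, as on common compilers. */
static void residual_lpc_ref(int32_t *res, const int32_t *smp, int n,
                             int order, const int32_t *coefs, int shift)
{
    for (int i = 0; i < order && i < n; i++)
        res[i] = smp[i];                        /* warm-up samples, verbatim */
    for (int i = order; i < n; i++) {
        int32_t pred = 0;
        for (int j = 0; j < order; j++)
            pred += coefs[j] * smp[i - j - 1];  /* 32x32->32 product */
        res[i] = smp[i] - (pred >> shift);
    }
}

/* Lane-granular stand-in for palignr: view hi:lo as eight lanes with lo in
 * lanes 0..3 and hi in lanes 4..7, then return lanes k..k+3 (0 <= k <= 4). */
static void align_lanes(int32_t out[4], const int32_t hi[4],
                        const int32_t lo[4], int k)
{
    int32_t cat[8];
    memcpy(cat, lo, 4 * sizeof(int32_t));
    memcpy(cat + 4, hi, 4 * sizeof(int32_t));
    memcpy(out, cat + k, 4 * sizeof(int32_t));
}

/* Block version: every sample is loaded once into cur; shifted windows are
 * rebuilt from prev and cur. Illustrative limits: n % 4 == 0, 1 <= order <= 4. */
static void residual_lpc_windowed(int32_t *res, const int32_t *smp, int n,
                                  int order, const int32_t *coefs, int shift)
{
    int32_t prev[4] = {0, 0, 0, 0}, cur[4], win[4], pred[4];
    for (int i = 0; i < n; i += 4) {
        memcpy(cur, smp + i, sizeof(cur));      /* the only load of this block */
        memset(pred, 0, sizeof(pred));
        for (int j = 0; j < order; j++) {
            align_lanes(win, cur, prev, 3 - j); /* win[l] = smp[i + l - j - 1] */
            for (int l = 0; l < 4; l++)
                pred[l] += coefs[j] * win[l];
        }
        for (int l = 0; l < 4; l++)             /* first `order` outputs are raw */
            res[i + l] = (i + l < order) ? cur[l] : cur[l] - (pred[l] >> shift);
        memcpy(prev, cur, sizeof(prev));
    }
}
```

Running both functions on the same input and comparing the residual arrays element by element gives a quick correctness check before any benchmarking. A real SIMD version would keep prev and cur in XMM registers and, for higher orders, would need more than one previous block.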
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 7993 bytes
Desc: not available