[RFC] SSE3/4 implementation of flac_encode_residual_lpc

Jason Garrett-Glaser
Mon May 4 06:21:19 CEST 2009

On Sun, May 3, 2009 at 8:39 PM, Bobby Bingham <uhmmmm at gmail.com> wrote:
> On Sat, 25 Apr 2009 03:03:30 +0000 (UTC)
> Loren Merritt <lorenm at u.washington.edu> wrote:
>> On Fri, 24 Apr 2009, Bobby Bingham wrote:
>> > Attached are patches to move flac_encode_residual_lpc to dsputils,
>> > and to add SSE3 and SSE4 implementations.  I wrote the SSE3 first,
>> > but since it doesn't have signed 32x32 multiplication AFAICT, I
>> > ended up using double precision floats for it, and the result is
>> > code that's slower than the C version.  Unless somebody has a
>> > suggestion of how to fix this, I'll drop the SSE3 version.
>> >
>> > I tried an SSE4 version because it does have signed 32x32->32
>> > multiplication, like the C version uses.  Unfortunately, I don't
>> > have an SSE4-capable processor to test it with, so I can't check
>> > its speed or even its correctness.  Benchmarks welcome.
>> fails regression test on my Penryn.
>> > +// TODO: look into palignr?
>> Yea, do that. It should be possible to load each sample just once
>> (aligned), and do all other manipulation in registers.
>> There are no cpus with both lddqu and sse4, so you're paying the full
>> cost of unaligned loads.
> I've changed the code to use palignr, and hopefully fixed it to work
> correctly now.  I've also removed the SSE3 code from this patch as I
> haven't managed to get it any faster by using integer arithmetic yet.

> "movdqu  -16(%3,%0), %%xmm4         \n\t"   // xmm4 = smp  [i-4 .. i-1]
> "movdqu  -12(%3,%0), %%xmm6         \n\t"   // xmm6 = smp  [i-3 .. i  ]

Any reason you didn't use palignr here?
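
To make the question concrete, here is a rough scalar C model of how
palignr could produce the second window from aligned data instead of a
second unaligned load.  It is only a sketch: palignr_model and
window_i_minus_3 are made-up names, and it assumes i is a multiple of 4
and that smp is 16-byte aligned, which the patch may or may not
guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of "palignr $imm, lo, hi" (AT&T order, hi is the
 * destination): concatenate hi:lo into 32 bytes with lo in the low
 * half, shift right by imm bytes, keep the low 16 bytes.
 * Only imm <= 16 is modelled here. */
void palignr_model(int32_t out[4], const int32_t hi[4],
                   const int32_t lo[4], unsigned imm)
{
    uint8_t cat[32];
    assert(imm <= 16);
    memcpy(cat, lo, 16);
    memcpy(cat + 16, hi, 16);
    memcpy(out, cat + imm, 16);
}

/* Hypothetical helper: build smp[i-3 .. i] from the two aligned
 * 4-sample blocks smp[i-4 .. i-1] and smp[i .. i+3].
 * Assumes i is a multiple of 4 (an assumption, not from the patch). */
void window_i_minus_3(int32_t out[4], const int32_t *smp, int i)
{
    assert(i >= 4 && i % 4 == 0);
    /* A shift of 4 bytes drops smp[i-4] and pulls in smp[i]. */
    palignr_model(out, smp + i, smp + i - 4, 4);
}
```

In the asm this would correspond to one aligned load of smp[i .. i+3]
followed by something like "palignr $4, %%xmm4, %%xmm6", with the
aligned block carried over in a register for the next iteration.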

> "movdqu     %%xmm5, %2              \n\t"

Is there a good reason why this store has to be unaligned?

> "phaddd     %%xmm1, %%xmm0          \n\t"
> "phaddd     %%xmm3, %%xmm2          \n\t"
> "phaddd     %%xmm2, %%xmm0          \n\t"   // xmm0 = [p0, p1, p2, p3]

Did you not find a better way of doing this without PHADD, given how slow it is?
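
For reference, the three phaddd above reduce four registers to their
four horizontal sums.  One possible alternative (only a guess at what
could be tried, and it would need benchmarking) is a 4x4 transpose with
punpck* followed by three vertical paddd.  The scalar C sketch below
models both and uses hypothetical function names; it assumes the sums
do not overflow 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of "phaddd src, dst" (AT&T order):
 * dst = [dst0+dst1, dst2+dst3, src0+src1, src2+src3]. */
void phaddd_model(int32_t dst[4], const int32_t src[4])
{
    int32_t r0 = dst[0] + dst[1], r1 = dst[2] + dst[3];
    int32_t r2 = src[0] + src[1], r3 = src[2] + src[3];
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}

/* The reduction as written in the patch: three phaddd. */
void reduce_phaddd(int32_t p[4], int32_t x[4][4])
{
    int32_t a[4], b[4];
    for (int k = 0; k < 4; k++) {
        a[k] = x[0][k];
        b[k] = x[2][k];
    }
    phaddd_model(a, x[1]);   /* a = pair sums of x0, then of x1 */
    phaddd_model(b, x[3]);   /* b = pair sums of x2, then of x3 */
    phaddd_model(a, b);      /* a = [sum x0, sum x1, sum x2, sum x3] */
    for (int k = 0; k < 4; k++)
        p[k] = a[k];
}

/* Alternative sketch: transpose, then add the four rows vertically.
 * In SIMD the transpose would be unpack instructions and each row
 * addition a single paddd. */
void reduce_transpose(int32_t p[4], int32_t x[4][4])
{
    int32_t t[4][4];
    assert(p != 0);
    for (int j = 0; j < 4; j++)
        for (int k = 0; k < 4; k++)
            t[k][j] = x[j][k];
    for (int j = 0; j < 4; j++)
        p[j] = t[0][j] + t[1][j] + t[2][j] + t[3][j];
}
```

Both functions return the same [p0, p1, p2, p3]; which instruction
sequence is actually cheaper depends on the CPU.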


pmulld is really really slow (6 clocks on Nehalem!).  If you make
certain assumptions about the nature of the input data (say, restrict
your code to only 16-bit samples), you might be able to use a faster
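
As an illustration of the 16-bit idea: if both samples and coefficients
fit in signed 16 bits, a 16x16->32 multiply-add such as pmaddwd gives
the same dot product as the 32x32->32 multiplies.  Using pmaddwd here
is an assumption (the message does not name an instruction), and the
function names below are made up for this scalar C sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the per-tap 32x32->32 multiply that pmulld performs,
 * accumulated into one prediction. */
int32_t dot_mul32(const int32_t *smp, const int32_t *coef, int order)
{
    int32_t sum = 0;
    assert(order >= 0);
    for (int k = 0; k < order; k++)
        sum += coef[k] * smp[k];
    return sum;
}

/* Scalar model of one pmaddwd: multiply eight signed 16-bit pairs and
 * add adjacent products, giving four 32-bit lanes. */
void pmaddwd_model(int32_t out[4], const int16_t a[8], const int16_t b[8])
{
    for (int j = 0; j < 4; j++)
        out[j] = (int32_t)a[2 * j] * b[2 * j]
               + (int32_t)a[2 * j + 1] * b[2 * j + 1];
}

/* Hypothetical 8-tap dot product built on the pmaddwd model.
 * Assumes samples and coefficients both fit in int16_t and that the
 * final sum fits in 32 bits. */
int32_t dot8_madd16(const int16_t smp[8], const int16_t coef[8])
{
    int32_t lanes[4];
    pmaddwd_model(lanes, smp, coef);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

One pmaddwd covers eight taps at once, versus four per pmulld, at the
cost of packing the inputs down to 16 bits first.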

> "movdqa     %%xmm5, %%xmm9          \n\t"

Does this asm really need to be x86_64-only?  If so, how about an
x86_32 version?

Dark Shikari

