[FFmpeg-devel] [PATCH] NEON code for basic scalar ops
Kostya
kostya.shishkov
Thu Aug 13 06:44:56 CEST 2009
On Thu, Aug 13, 2009 at 12:33:07AM +0100, Måns Rullgård wrote:
> Kostya <kostya.shishkov at gmail.com> writes:
>
> > On Tue, Jul 21, 2009 at 03:23:58PM +0100, Måns Rullgård wrote:
> >> Kostya <kostya.shishkov at gmail.com> writes:
> >>
> >> > While waiting for RTMP patch review, here's a bit of NEON code to speed
> >> > up int16 array addition/subtraction and scalar product calculation.
> >> >
> >> > This about halves decoding time for APE compressed at insane level
> >> > (so it's only 7 times slower than realtime on my BeagleBoard).
> >>
> >> These functions are far from optimal.
> >
> > Since I won't be able to work on it for some time, I'm posting here a
> > version that is a few cycles closer to optimal (but still far away).
> >
> > +function ff_scalarproduct_int16_neon, export=1
> > + vmov.i16 q0, #0
> > + vmov.i16 q1, #0
> > + vmov.i16 q2, #0
> > + vmov.i16 q3, #0
> > +1: vld1.16 {d16-d17}, [r0]!
> > + vld1.16 {d20-d21}, [r1,:128]!
> > + vmlal.s16 q0, d16, d20
> > + vld1.16 {d18-d19}, [r0]!
> > + vmlal.s16 q1, d17, d21
> > + vld1.16 {d22-d23}, [r1,:128]!
> > + vmlal.s16 q2, d18, d22
> > + vmlal.s16 q3, d19, d23
> > + subs r2, r2, #16
> > + bne 1b
> > + vpadd.s32 d8, d0, d1
> > + vpadd.s32 d9, d2, d3
> > + vpadd.s32 d10, d4, d5
> > + vpadd.s32 d11, d6, d7
> > + vpadd.s32 d0, d8, d9
> > + vpadd.s32 d1, d10, d11
> > + vpadd.s32 d2, d0, d1
> > + vpaddl.s32 d3, d2
> > + vmov.32 r0, d3[0]
> > + asr r0, r3
> > + bx lr
> > + .endfunc
>
> This doesn't do exactly the same thing as the C version, which shifts
> immediately after the multiplication, before accumulating. However,
> all calls to DSPContext.scalarproduct_int16 have a zero shift.
>
> Since shifting at the end is both more accurate and faster, maybe we
> should change it. Someone would have to update the sse and altivec
> versions of course.
The intent was to speed up scalar product calculation for Monkey's
Audio, but with CELP filters in mind too. Since those use fixed-point
values, shifting right immediately after each multiplication is logical
there (and will prevent overflows).
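
To make the difference between the two shift placements concrete, here is a
minimal C sketch. The function names are hypothetical and are not the actual
DSPContext implementations; the first follows the behaviour described above
for the C version (shift each product, then accumulate), the second follows
the NEON patch (accumulate, then shift once). Right-shifting a negative int
is implementation-defined in C, so the sketch assumes an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical name: shift every product before adding it to the sum.
 * Each term loses its low bits, but the running sum stays smaller,
 * which is what protects fixed-point filters from overflow. */
static int32_t scalarproduct_shift_each(const int16_t *v1, const int16_t *v2,
                                        int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += (v1[i] * v2[i]) >> shift;   /* arithmetic shift assumed */
    return res;
}

/* Hypothetical name: accumulate full-precision products and shift once
 * at the end. More accurate, but the 32-bit sum can overflow sooner. */
static int32_t scalarproduct_shift_last(const int16_t *v1, const int16_t *v2,
                                        int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += v1[i] * v2[i];
    return res >> shift;                   /* arithmetic shift assumed */
}
```

With a zero shift both variants return the same value, which is why the
current callers would not notice the change; with a non-zero shift they can
differ because the per-term variant truncates every product separately.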
As for this version, I seem unable to find an instruction for a vector
right shift by a register value, only by an immediate (which looks like
discrimination against right shifts).
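
One possible route, worth checking against the ARM reference before relying
on it: the register form of VSHL takes a signed per-lane shift count, and a
negative count shifts right. So a right shift by a register value could be
obtained by negating the count and duplicating it across a vector. The
portable C sketch below only models one signed 32-bit lane of that
behaviour; the function name is hypothetical, it is not a NEON intrinsic,
and it assumes an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of a signed 32-bit
 * shift-by-register: count > 0 shifts left, count < 0 shifts right. */
static int32_t shift_by_signed_count(int32_t value, int8_t count)
{
    if (count >= 32)
        return 0;                          /* everything shifted out */
    if (count <= -32)
        return value < 0 ? -1 : 0;         /* only sign bits remain */
    if (count >= 0)
        return (int32_t)((uint32_t)value << count);
    return value >> -count;                /* arithmetic shift assumed */
}
```

In this model, passing the negated shift amount gives the right shift that
the per-product variant needs.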
> --
> Måns Rullgård