[FFmpeg-devel] [RFC][PATCH] DSPUtilize some functions from APE decoder
Kostya
kostya.shishkov
Thu Jul 3 05:57:46 CEST 2008
On Thu, Jul 03, 2008 at 03:29:07AM +0300, Ivan Kalvachev wrote:
> On 7/3/08, Loren Merritt <lorenm at u.washington.edu> wrote:
> > On Wed, 2 Jul 2008, Kostya wrote:
> >
> >> I'm not satisfied with the decoding speed of APE decoder,
> >> so I've decided to finally dsputilize functions marked as such.
> >
> >> +static void vector_int16_add_sse(int16_t * v1, int16_t * v2, int order)
> >
> > sse2
oops
Michael, could you comment on moving the C functions to dsputil?
I'll polish the SSE2 and AltiVec versions later.
> >> + "movdqa (%0), %%xmm0 \n\t"
> >> + "movdqu (%1), %%xmm1 \n\t"
> >> + "paddw %%xmm1, %%xmm0 \n\t"
> >
> > movdqu (%1), %%xmm0
> > paddw (%0), %%xmm0
> >
> >> +static int32_t vector_int16_scalarproduct_sse(int16_t * v1, int16_t * v2,
> >> int order)
> >> +{
> >> + int i;
> >> + int res = 0, *resp=&res;
> >> +
> >> + asm volatile("pxor %xmm7, %xmm7 \n\t");
> >> +
> >> + for(i = 0; i < order; i += 8){
> >> + asm volatile(
> >> + "movdqu (%0), %%xmm0 \n\t"
> >> + "movdqa (%1), %%xmm1 \n\t"
> >> + "pmaddwd %%xmm1, %%xmm0 \n\t"
> >> + "movhlps %%xmm0, %%xmm2 \n\t"
> >> +
> >> + "paddd %%xmm2, %%xmm0 \n\t"
> >> + "pshufd $0x01, %%xmm0,%%xmm2 \n\t"
> >> + "paddd %%xmm2, %%xmm0 \n\t"
> >> + "paddd %%xmm0, %%xmm7 \n\t"
> >> + : "+r"(v1), "+r"(v2)
> >> + );
> >> + v1 += 8;
> >> + v2 += 8;
> >> + }
> >> + asm volatile("movd %%xmm7, (%0)\n\t" : "+r"(resp));
> >> + return res;
> >> +}
> >
> > horizontal sum should be outside the loop
> > pshuflw is faster than pshufd
>
>
> Few more things.
>
> What guarantees that these functions are called at 8-byte-aligned
> addresses and that they always process the data in batches of 8 (i.e.
> order % 8 == 0)?
> (I actually have no idea if the exact instructions you used require 8B
> alignment, I just assume they do. If they don't, they are slow ;)
In the APE decoder the orders are 16, 32, 64, 256 and 1280.
Also, all vector operations are invoked on an av_malloc()ed array at some
offset, so one of the arguments is perfectly aligned while the other's
offset changes in increments of 2.
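
To make the above concrete, here is a rough sketch of what the plain C versions could look like, plus a check that the listed orders are all multiples of 8. The `_c` names and the helper `ape_orders_fit_simd_width` are made up for illustration and are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical plain-C reference: v1[i] += v2[i] for i in [0, order). */
static void vector_int16_add_c(int16_t *v1, const int16_t *v2, int order)
{
    while (order--)
        *v1++ += *v2++;
}

/* Hypothetical plain-C reference: sum of v1[i] * v2[i] for i in [0, order). */
static int32_t vector_int16_scalarproduct_c(const int16_t *v1,
                                            const int16_t *v2, int order)
{
    int32_t res = 0;

    while (order--)
        res += *v1++ * *v2++;
    return res;
}

/* Returns 1 if every order mentioned above is a multiple of 8, i.e. a loop
 * that consumes eight int16_t values per iteration never needs a scalar tail. */
static int ape_orders_fit_simd_width(void)
{
    static const int orders[] = { 16, 32, 64, 256, 1280 };
    unsigned i;

    for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++)
        if (orders[i] % 8)
            return 0;
    return 1;
}
```

Since only one argument is aligned, a SIMD version would presumably need an unaligned load for the other one.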
> I think somewhere in the docs there is a requirement not to break up
> asm blocks just to do the loop in C; following it would definitely make
> you use one variable/register for the loop instead of 2.
>
> I'm not sure why you use a pointer to a local variable;
> there must be a way to give the return variable directly
> to the asm block, so that if the compiler chooses to assign that
> variable to the eax register, "movd" would put the value
> in eax directly and return it that way.
I don't speak assembler well; I can only read it.
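
For reference, the review comments on the scalar product (a single loop, the horizontal sum moved outside it, pshuflw for the last step, and the result returned directly) might be combined roughly as below. This is only a sketch using SSE2 intrinsics instead of inline asm, so the compiler handles register allocation. The name `vector_int16_scalarproduct_sse2` is assumed, both loads are unaligned for safety, and `order` is assumed to be a multiple of 8.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Sketch only: scalar product of two int16_t vectors.
 * Assumes order is a multiple of 8; neither pointer needs to be aligned. */
static int32_t vector_int16_scalarproduct_sse2(const int16_t *v1,
                                               const int16_t *v2, int order)
{
    __m128i acc = _mm_setzero_si128(); /* four 32-bit partial sums */
    int i;

    for (i = 0; i < order; i += 8) {
        __m128i a = _mm_loadu_si128((const __m128i *)(v1 + i)); /* movdqu */
        __m128i b = _mm_loadu_si128((const __m128i *)(v2 + i)); /* movdqu */

        /* pmaddwd multiplies eight 16-bit pairs and adds neighbouring
         * products, giving four 32-bit sums; paddd accumulates them. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(a, b));
    }

    /* Horizontal sum, done once after the loop. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));         /* high half + low half */
    acc = _mm_add_epi32(acc, _mm_shufflelo_epi16(acc, 0x0E)); /* pshuflw: dword 1 -> dword 0 */

    return _mm_cvtsi128_si32(acc); /* movd to a general-purpose register */
}
```

If the aligned argument is known at the call site, its load could be switched to `_mm_load_si128` (movdqa); whether that matters for speed would need measuring.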