[MPlayer-dev-eng] Improved remove-logo filter

Fri Sep 15 05:38:34 CEST 2006

On Thu, 14 Sep 2006, Trent Piepho wrote:
>
> Of course, wouldn't it be even faster to put accumulator into an MMX
> register, and then just use paddd %[sum], %[accumulator]?  That avoids the
> movd/movl entirely.  paddd isn't much slower than addl, is it?
>
> I tried that, telling gcc to input/output accumulator in an mmx register,
> and doing the add myself with paddd.  For some reason, gcc thinks it needs
> to save the mmx register to memory and then load it again.  So, it ends up
> being slower.
>
> So I tried writing the inner loop (over one line) in asm so accumulator
> would be kept in mm1 for the whole loop.  Gcc still spills and loads
> accumulator for no reason on each outer loop (for each line).  This ended
> up being about the same speed.

Not that there's anything wrong with writing the loops in asm, but you 
don't have to do that just to keep the accumulator in an mmreg. "y" 
constraints are not needed, unless you _want_ gcc to load/spill values.

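/* "rm" rather than "g": "g" also allows an immediate, which movd rejects. */
asm volatile("movd %0, %%mm1" ::"rm"(accumulator));
for(i=0; i<n; i++)
     asm volatile(/* your computation ->mm0 here */
         "paddd %%mm0, %%mm1"
         :/*...*/);
/* volatile so gcc neither drops this read-back nor hoists it above the loop */
asm volatile("movd %%mm1, %0" :"=rm"(accumulator));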

Or with both loops as one asm block, you can bring back "+y"(accumulator) 
instead of the explicit movd.
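
For reference, here is a self-contained sketch of the explicit-movd variant. The function name sum_u32 and the per-element "computation" (it only loads v[i] into mm0) are placeholders, not code from the filter. The trailing emms and the "rm" constraints (so gcc cannot choose an immediate for movd) are assumptions added for the sketch. It also assumes gcc leaves mm0/mm1 alone between the asm statements, which holds in practice when the function contains no other MMX code, but nothing formally guarantees it. On non-x86 targets it falls back to plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sums n 32-bit values (wrapping modulo 2^32),
 * keeping the running total in mm1 for the whole loop. */
static uint32_t sum_u32(const uint32_t *v, int n)
{
    uint32_t accumulator = 0;
    int i;

    assert(n >= 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* Load the accumulator into mm1 once; no "y" operand is involved. */
    __asm__ volatile("movd %0, %%mm1" : : "rm"(accumulator));
    for (i = 0; i < n; i++)
        /* Stand-in for the real computation: v[i] -> mm0, then accumulate. */
        __asm__ volatile("movd %0, %%mm0\n\t"
                         "paddd %%mm0, %%mm1"
                         : : "rm"(v[i]));
    /* Read the total back once, then leave MMX state with emms. */
    __asm__ volatile("movd %%mm1, %0\n\t"
                     "emms"
                     : "=rm"(accumulator));
#else
    /* Portable fallback so the sketch still builds elsewhere. */
    for (i = 0; i < n; i++)
        accumulator += v[i];
#endif
    return accumulator;
}
```

Calling sum_u32 on an array holding 1, 2, 3, 4 should give 10. The point of the layout is that gcc never sees the accumulator as an MMX operand, so it has nothing to spill or reload inside the loop.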

> : "=m" (accumulator), "=r" (i), "=g" (j), "=r" (mask), "=r" (image)
> : "m" (accumulator), "1" (i), "2" (j), "3" (mask), "4" (image),
>   "g" (logo_mask->width), "g" (stride)

   : "+m" (accumulator), "+r" (i), "+g" (j), "+r" (mask), "+r" (image)
   : "g" (logo_mask->width), "g" (stride)
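
To illustrate why the shorter list can stand in for the longer one: a "+" operand is a single read-write operand, which does the job of an "=" output paired with a matching-digit input. Below is a minimal sketch with hypothetical functions (add_matched, add_readwrite, check_same) that have nothing to do with the filter; addl is only a stand-in instruction, and non-x86 targets get a plain C fallback.

```c
#include <assert.h>

/* Old style: output operand 0 plus an input tied to it with "0". */
static unsigned add_matched(unsigned total, unsigned x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("addl %2, %0" : "=r"(total) : "0"(total), "r"(x) : "cc");
#else
    total += x;
#endif
    return total;
}

/* Same operation with one read-write "+r" operand. */
static unsigned add_readwrite(unsigned total, unsigned x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* x is now operand 1, not 2: the tied input no longer has a number. */
    __asm__("addl %1, %0" : "+r"(total) : "r"(x) : "cc");
#else
    total += x;
#endif
    return total;
}

/* Both spellings should compute the same result. */
static void check_same(unsigned a, unsigned b)
{
    assert(add_matched(a, b) == add_readwrite(a, b));
}
```

One thing to watch when converting: if the asm template refers to operands by number, the inputs that follow shift down. In the lists above, width and stride would move from %10 and %11 to %5 and %6.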

--Loren Merritt