[MPlayer-dev-eng] Improved remove-logo filter

Thu Sep 14 10:35:39 CEST 2006

Hi,

2006/9/13, Trent Piepho <xyzzy at speakeasy.org>:
> The reason I had _-prefixed variable names no longer applies, so I will
> change those.
>
> The constraints work fine.  The 'y' constraint is for an MMX register, so
> gcc will place the 64 bit data into an MMX register it chooses.  This way
> gcc has more freedom in where to place the movq instruction relative to
> the rest of the loop code.  The actual code generated:
>
> .L44:
>        addl    $8, %edx        #, i                    i+=8
>        movq    (%ebx), %mm0    #* mask, tmp217         mm0 = *mask
>        addl    $8, %ebx        #, mask                 mask+=8
> #APP
>        pandn (%ecx), %mm0      #* image.164, tmp217
>        psadbw (%ecx), %mm0     #* image.164, tmp217
>        movd %mm0, %eax # tmp217, _sum
> #NO_APP
>        addl    $8, %ecx        #, image.164            image+=8
>        addl    %eax, %edi      # _sum, accumulator     accumulator += _sum
>        cmpl    %edx, %esi      # i, D.4595             i < logo_mask->width
>        jg      .L44    #,
>
> You can see in the second line that gcc has generated a movq to put *mask
> into mm0 for the asm block.  gcc has decided to leave *image in memory, since
> I allowed that with the constraint "ym", so the asm block uses (%ecx).  I
> could change the *image constraint to just "y", in which case gcc would
> put a 'movq (%ecx), %mm1' somewhere and the asm block would use mm1.  I
> tried this, but it benchmarked slower.
>
> If I had used an "m" constraint and moved the data into MMX registers
> myself, then it could not be mixed in with the loop counter instructions
> for better scheduling.
>
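For reference, here is a plain-C sketch of what the loop above appears to compute. It is worked out from the instruction semantics, not taken from the filter source. pandn leaves ~mask & image in mm0, and psadbw against image then sums |(image & ~mask) - image| over the eight bytes. Since image & ~mask keeps only a subset of image's bits, it can never exceed image, so each difference is simply image & mask. The name masked_sum_ref is made up for illustration, and the sketch assumes the mask and image rows are plain byte arrays of the given width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Scalar equivalent of the MMX loop:
 * pandn yields (~mask & image) per byte; psadbw against image then adds
 * |(image & ~mask) - image|, which equals (image & mask) for every byte. */
static unsigned masked_sum_ref(const uint8_t *image, const uint8_t *mask,
                               size_t width)
{
    unsigned accumulator = 0;
    for (size_t i = 0; i < width; i++)
        accumulator += image[i] & mask[i];
    return accumulator;
}
```

For a mask whose bytes are only 0x00 or 0xFF, this is just the sum of the image bytes where the mask is set.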

MOVD mmreg, r32 (the movd %mm0, %eax above) is slow on AMD CPUs. Maybe
using "m" instead of "r" for _sum would be faster?
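A minimal sketch of that suggestion, assuming GCC on x86-64, where MMX and psadbw are always available. The per-block sum is bound with "=m", so gcc emits a MOVD from the MMX register straight into a stack slot and reads the value back from memory, instead of a MOVD into a general register. The name masked_sum, the memcpy loads, the multiple-of-8 width and the plain-C fallback for other compilers and targets are all assumptions of this sketch, not code from the patch. Whether it is actually faster on AMD would need the same kind of benchmarking Trent did for "y" vs "ym".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; width must be a multiple of 8. */
static unsigned masked_sum(const uint8_t *image, const uint8_t *mask,
                           size_t width)
{
    unsigned accumulator = 0;
    assert(width % 8 == 0);
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
    for (size_t i = 0; i < width; i += 8) {
        uint64_t m, img;
        unsigned block_sum;
        memcpy(&m, mask + i, 8);    /* avoids aliasing/alignment problems */
        memcpy(&img, image + i, 8);
        __asm__ volatile(
            "pandn  %2, %1\n\t"     /* %1 = ~mask & image */
            "psadbw %2, %1\n\t"     /* %1 = sum |%1 - image| = sum (image & mask) */
            "movd   %1, %0"         /* low 32 bits go to memory, not a GPR */
            : "=m"(block_sum), "+y"(m)
            : "ym"(img));
        accumulator += block_sum;
    }
    __asm__ volatile("emms");       /* clear MMX state before any x87 code */
#else
    /* Portable fallback with the same result. */
    for (size_t i = 0; i < width; i++)
        accumulator += image[i] & mask[i];
#endif
    return accumulator;
}
```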

-- 
Zuxy
Beauty is truth,
While truth is beauty.
PGP KeyID: E8555ED6