[MPlayer-dev-eng] possible bugs in vf_decimate filter

Michael Niedermayer michaelni at gmx.at
Mon Oct 16 14:08:21 CEST 2006


Hi

On Sun, Oct 15, 2006 at 02:51:00PM -0600, Loren Merritt wrote:
> On Sun, 15 Oct 2006, Rich Felker wrote:
> 
> >I'm the author so I'll comment.
> >
> >On Sat, Oct 14, 2006 at 07:09:32AM -0700, Trent Piepho wrote:
> >>The decimate filter calculates 8x8 SADs over the image.  The loop that
> >>calls the SAD function increments x and y by 4 each time, rather than 8.
> >>This means all the pixels, except for the outer four, are included in four
> >>SAD calculations instead of one.
> >
> >This is intentional. Ideally it would increment by 1 each time, but
> >that would be much slower and not much more accurate. The idea is to
> >look for maximal change over _any_ 8x8 block, not just
> >aligned-to-8-pixels 8x8 blocks.
> 
> OK, but it would be faster to calculate non-overlapping 4x4 blocks, and 
> then add 4 adjacent block sums.

yes, and for 8x8 blocks at every shift a simple vf_boxblur.c-like algorithm
(running sums that are updated as the window slides) could be used
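
A minimal C sketch of Loren's suggestion, for illustration only: compute the
SAD of each non-overlapping 4x4 block once, then form every 8x8 SAD on the
4-pixel grid by adding 4 adjacent block sums. The function names, the
temporary buffer and the restriction to dimensions that are multiples of 4
are assumptions made for this example, not code from vf_decimate.c.

```c
#include <assert.h>
#include <stdlib.h>

/* SAD of one 4x4 block; a and b point at the block's top-left pixel. */
static unsigned sad4x4(const unsigned char *a, const unsigned char *b, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            sum += (unsigned)abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

/* Largest 8x8 SAD over all block positions on a 4-pixel grid.
 * Hypothetical helper: assumes w and h are multiples of 4 and >= 8.
 * Returns -1 if the temporary buffer cannot be allocated. */
static long max_sad8x8_step4(const unsigned char *a, const unsigned char *b,
                             int w, int h, int stride)
{
    assert(w >= 8 && h >= 8 && w % 4 == 0 && h % 4 == 0 && stride >= w);
    int bw = w / 4, bh = h / 4;
    unsigned *s = malloc(sizeof(*s) * (size_t)bw * (size_t)bh);
    long best = 0;

    if (!s)
        return -1;

    /* Pass 1: every pixel is read exactly once. */
    for (int by = 0; by < bh; by++)
        for (int bx = 0; bx < bw; bx++)
            s[by * bw + bx] = sad4x4(a + 4 * by * stride + 4 * bx,
                                     b + 4 * by * stride + 4 * bx, stride);

    /* Pass 2: an 8x8 SAD is the sum of a 2x2 group of 4x4 sums. */
    for (int by = 0; by + 1 < bh; by++)
        for (int bx = 0; bx + 1 < bw; bx++) {
            long d = (long)s[by * bw + bx] + s[by * bw + bx + 1]
                   + s[(by + 1) * bw + bx] + s[(by + 1) * bw + bx + 1];
            if (d > best)
                best = d;
        }

    free(s);
    return best;
}
```

Compared with calling an 8x8 SAD at every 4-pixel step, this touches each
pixel once instead of up to four times, at the cost of a small buffer of
block sums.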

and the current asm can be improved somewhat:
                "1: \n\t"
                
                "movq (%%"REG_S"), %%mm0 \n\t"
                "movq (%%"REG_S"), %%mm2 \n\t"
                "add %%"REG_a", %%"REG_S" \n\t"
                "movq (%%"REG_D"), %%mm1 \n\t"
                "add %%"REG_b", %%"REG_D" \n\t"
                "psubusb %%mm1, %%mm2 \n\t"
                "psubusb %%mm0, %%mm1 \n\t"

                "movq %%mm2, %%mm0 \n\t"
                "movq %%mm1, %%mm3 \n\t"
                "punpcklbw %%mm7, %%mm0 \n\t"
                "punpcklbw %%mm7, %%mm1 \n\t"
                "punpckhbw %%mm7, %%mm2 \n\t"
                "punpckhbw %%mm7, %%mm3 \n\t"
                "paddw %%mm0, %%mm4 \n\t"
                "paddw %%mm1, %%mm4 \n\t"
                "paddw %%mm2, %%mm4 \n\t"
                "paddw %%mm3, %%mm4 \n\t"

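the part after the two psubusb can be done faster; per byte, one of the two
saturated differences is always 0, so por yields the absolute difference: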
"por    %%mm2, %%mm1\n\t"
"movq %%mm1, %%mm3 \n\t"
"punpcklbw %%mm7, %%mm1 \n\t"
"punpckhbw %%mm7, %%mm3 \n\t"
"paddw %%mm1, %%mm4 \n\t"
"paddw %%mm3, %%mm5 \n\t"

the last two instructions also add the left 4 and right 4 pixels into 2
different registers (mm4 and mm5) so that 4x4 blocks are calculated
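
For readers less familiar with MMX, here is a scalar C sketch of what the
shorter sequence computes for one row of 8 pixels. The helper names are
hypothetical and the code only models the arithmetic. The real asm keeps four
16-bit column sums in each of mm4 and mm5 and would still need a horizontal
add at the end, which this sketch folds into two plain totals.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of psubusb on one byte: unsigned subtract, clamped at 0. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Adds the absolute differences of one 8-pixel row to two accumulators:
 * *left gets pixels 0..3 (the punpcklbw half, mm4 in the asm),
 * *right gets pixels 4..7 (the punpckhbw half, mm5 in the asm). */
static void row_absdiff_split(const uint8_t *p, const uint8_t *q,
                              unsigned *left, unsigned *right)
{
    assert(p && q && left && right);
    for (int i = 0; i < 8; i++) {
        /* One of the two saturated differences is always 0,
         * so OR-ing them (por) gives |p[i] - q[i]|. */
        uint8_t d = sat_sub_u8(p[i], q[i]) | sat_sub_u8(q[i], p[i]);
        if (i < 4)
            *left += d;
        else
            *right += d;
    }
}
```

Calling this for 4 consecutive rows leaves the SADs of the left and right 4x4
blocks in the two accumulators.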


                
                "decl %%ecx \n\t"

I'd use a "cmp %%"REG_S", ..." here; some CPUs dislike inc/dec, because
inc/dec change only part of the flags, which creates a dependency on the
previous flag value.
it's also possible to count toward zero and use (base, index) style addressing
to read stuff; that would be 1 instruction less
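
A rough C sketch of the count-toward-zero idea, assuming both images use the
same stride so that a single index can address both (the asm above advances
the two pointers by separate strides, so that is an extra assumption here).
The function name is made up for the example, and a compiler is free to emit
a different loop; the point is only the shape of the addressing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* SAD over `rows` rows of 8 pixels. The base pointers are moved to the end
 * of the data and a negative index runs up to zero, so the index update
 * doubles as the loop test and no separate counter is needed. */
static unsigned sad8_rows(const uint8_t *a, const uint8_t *b,
                          ptrdiff_t stride, int rows)
{
    assert(rows >= 0 && stride >= 8);
    const uint8_t *a_end = a + rows * stride;
    const uint8_t *b_end = b + rows * stride;
    unsigned sum = 0;

    for (ptrdiff_t i = -(ptrdiff_t)rows * stride; i != 0; i += stride)
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs(a_end[i + x] - b_end[i + x]);

    return sum;
}
```

In asm terms, the two pointer adds and the decl collapse into one add on the
index register, whose zero flag feeds the jnz directly.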



                "jnz 1b \n\t"
                "movq %%mm4, (%%"REG_d") \n\t"
                "emms \n\t"

emms should be farther outside, so that it runs once after all blocks have
been processed rather than once per SAD call


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In the past you could go to a library and read, borrow or copy any book
Today you'd get arrested for mere telling someone where the library is


