[FFmpeg-devel] [PATCH] VP8 MMX optimizations (MC and IDCT dc_add)

Wed Jun 23 09:16:40 CEST 2010

Hi,

Apart from the patch being full of hunks that I would describe as
unrelated changes,
I wanted to comment on this:

2010/6/23 Jason Garrett-Glaser <darkshikari at gmail.com>:
> +INIT_MMX
> +cglobal put_vp8_epel4_h4_mmxext, 5,5
> +    shl       r4, 4
> +    sub       r0, r1
> +    mova      m4, [fourtap_filter_hw+r4-16] ; set up 4tap filter in words
> +    mova      m5, [fourtap_filter_hw+r4]
> +    mova      m7, [pw_64]
> +    pxor      m6, m6
> +.nextrow
> +    movu      m1, [r1-1]    ; (ABCDEFGH) load 8 horizontal pixels
> +
> +    ; first set of 2 pixels
> +    mova      m2, m1        ; byte ABCD..
> +    punpcklbw m1, m6        ; byte->word ABCD
> +    pshufw    m0, m2, 9     ; byte CDEF..
> +    punpcklbw m0, m6        ; byte->word CDEF
> +    pshufw    m3, m1, 0x94  ; word ABBC
> +    pshufw    m1, m0, 0x94  ; word CDDE
> +    pmaddwd   m3, m4        ; multiply 2px with F0/F1
> +    mova      m0, m1        ; backup for second set of pixels
> +    pmaddwd   m1, m5        ; multiply 2px with F2/F3
> +    paddd     m3, m1        ; finish 1st 2px

The vc1 mc code uses non-saturating (wrapping) word arithmetic, and thus
avoids intermediate results in dwords.
I may try to benchmark what that alternative implementation would bring to
this part of the vp8 mc patch.
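
To make the word-only idea concrete, here is a rough scalar C model. It is
neither the vc1 code nor anything from the patch: the function names, the
BIAS constant and the bias trick itself are my own assumptions for
illustration. It uses the two sixtap rows quoted from the patch and checks
that a wrapping 16-bit accumulator gives the same pixel as a 32-bit one,
because the biased sum stays within 0..65535 for these taps.

```c
#include <assert.h>
#include <stdint.h>

/* The two tap rows quoted from sixtap_filter in the patch. */
static const int16_t taps[2][6] = {
    { 2, -11, 108, 36,  -8, 1 },
    { 3, -16,  77, 77, -16, 3 },
};

static int clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Reference: 32-bit accumulation, as a pmaddwd/paddd sequence would give. */
static int filter6_dword(const uint8_t *p, const int16_t *f)
{
    int32_t sum = 64;                          /* rounding term (pw_64) */
    for (int i = 0; i < 6; i++)
        sum += f[i] * p[i];
    /* floor(sum / 128) without right-shifting a negative value */
    sum = sum >= 0 ? sum >> 7 : -((-sum + 127) >> 7);
    return clip_u8(sum);
}

/* Hypothetical bias: 8192 keeps the biased sum inside 0..65535
 * for the tap rows above (true sum range is about -8160..40800). */
#define BIAS (64 << 7)

/* Word-only variant: every step keeps just the low 16 bits. */
static int filter6_word(const uint8_t *p, const int16_t *f)
{
    uint16_t acc = 64 + BIAS;                  /* rounding term plus bias */
    for (int i = 0; i < 6; i++) {
        /* low 16 bits of the product, like pmullw */
        uint16_t prod = (uint16_t)((uint16_t)f[i] * p[i]);
        /* wrapping add, like paddw */
        acc = (uint16_t)(acc + prod);
    }
    /* logical shift, then remove BIAS >> 7 */
    return clip_u8((acc >> 7) - 64);
}

/* Compare both variants on pseudo-random pixels (simple LCG). */
static void self_check(void)
{
    uint32_t seed = 1;
    for (int t = 0; t < 2; t++) {
        for (int n = 0; n < 10000; n++) {
            uint8_t p[6];
            for (int i = 0; i < 6; i++) {
                seed = seed * 1664525u + 1013904223u;
                p[i] = (uint8_t)(seed >> 24);
            }
            assert(filter6_word(p, taps[t]) == filter6_dword(p, taps[t]));
        }
    }
}
```

Intermediate adds may wrap, but only the final value matters, and that one
fits in an unsigned word. In SIMD terms the final shift would be a logical
word shift (psrlw). Whether this ends up faster than the pmaddwd version is
exactly what would need benchmarking.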

Also, this avoids a code size increase, but consider this:
> +sixtap_filter:  dw  2, -11, 108,  36,  -8, 1, \
> +                    3, -16,  77,  77, -16, 3, \

There seem to be twice as many pmullw/... operations as necessary.

Christophe