[FFmpeg-devel] [PATCH] faster vp6 decoding
Aurelien Jacobs
aurel
Fri Feb 13 00:59:24 CET 2009
Zuxy Meng wrote:
> 2009/2/12 Aurelien Jacobs <aurel at gnuage.org>:
> > Zuxy Meng wrote:
> >
> >> Hi,
> >>
> >> 2009/2/9 Jason Garrett-Glaser <darkshikari at gmail.com>:
> >> > + "punpcklbw %%mm7, %%mm0\n\t" \
> >> > + "punpcklbw %%mm7, %%mm1\n\t" \
> >> > + "punpckhbw %%mm7, %%mm3\n\t" \
> >> > + "punpckhbw %%mm7, %%mm4\n\t" \
> >> > + "pmullw 0(%2), %%mm0\n\t" /* src[x-8 ] * biweight [0] */ \
> >> > + "pmullw 8(%2), %%mm1\n\t" /* src[x ] * biweight [1] */ \
> >> > + "pmullw 0(%2), %%mm3\n\t" /* src[x-8 ] * biweight [0] */ \
> >> > + "pmullw 8(%2), %%mm4\n\t" /* src[x ] * biweight [1] */ \
> >> > + "paddw %%mm1, %%mm0\n\t" \
> >> > + "paddw %%mm4, %%mm3\n\t" \
> >> >
> >> > This can be done faster with pmaddubsw (SSSE3-only, but surely worth
> >> > a separate version).
> >>
> >> Sure but that would require weights to be stored as arrays of int8_t
> >> instead of int16_t?
> >>
> >> > Worthwhile if you make an SSE version.
> >>
> >> SSE2?
> >>
> >> > Works by interleaving the weights, allowing you to avoid the unpacks,
> >> > use only two multiplies, and avoid the adds, too, I think. If I'm
> >> > right, that makes the entire thing quite a bit less than half the
> >> > instructions.
> >>
> >> I tried something like the code below and it's about 15% faster on my
> >> Pentium M. The speedup should be more pronounced on modern CPUs with
> >> 128-bit FADD units:
> >>
> >> [...some asm code...]
> >
> > Nice. I fixed it so that it works on x86_64 and cleaned it up.
> > It works but has some small visible artifacts.
> > It would be great if you could fix the attached patch so that it gives
> > bitexact results with:
> > ffmpeg -i sample.flv -f framecrc out.crc
>
> It can be fixed by expanding ff_pw_64 from uint64_t to xmm_reg.
Done. Thanks for the hint (and for the code :-).
Aurel