[FFmpeg-devel] [PATCH 1/4] avcodec/x86: add put_pixels16_x2_sse2

Sun Feb 3 21:54:39 CET 2013

On Sun, Feb 03, 2013 at 11:30:56AM -0800, Ronald S. Bultje wrote:
> Hi,
> 
> On Sun, Feb 3, 2013 at 7:31 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > about 1% faster P frame motion compensation for matrixbench on i7
> >
> > Signed-off-by: Michael Niedermayer <michaelni at gmx.at>
> > ---
> >  libavcodec/x86/dsputil_mmx.c |    4 ++++
> >  libavcodec/x86/hpeldsp.asm   |   31 ++++++++++++++++++++++++++++++-
> >  2 files changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/libavcodec/x86/dsputil_mmx.c b/libavcodec/x86/dsputil_mmx.c
> > index 2e8300a..29d87a1 100644
> > --- a/libavcodec/x86/dsputil_mmx.c
> > +++ b/libavcodec/x86/dsputil_mmx.c
> > @@ -1523,6 +1523,8 @@ static void gmc_mmx(uint8_t *dst, uint8_t *src,
> >
> >  void ff_put_pixels16_sse2(uint8_t *block, const uint8_t *pixels,
> >                            int line_size, int h);
> > +void ff_put_pixels16_x2_sse2(uint8_t *block, const uint8_t *pixels,
> > +                              int line_size, int h);
> >  void ff_avg_pixels16_sse2(uint8_t *block, const uint8_t *pixels,
> >                            int line_size, int h);
> >
> > @@ -2034,6 +2036,8 @@ static void dsputil_init_sse2(DSPContext *c, AVCodecContext *avctx,
> >          // these functions are slower than mmx on AMD, but faster on Intel
> >          if (!high_bit_depth) {
> >              c->put_pixels_tab[0][0]        = ff_put_pixels16_sse2;
> > +            c->put_pixels_tab[0][1]        = ff_put_pixels16_x2_sse2;
> > +
> >              c->put_no_rnd_pixels_tab[0][0] = ff_put_pixels16_sse2;
> >              c->avg_pixels_tab[0][0]        = ff_avg_pixels16_sse2;
> >          }
> > diff --git a/libavcodec/x86/hpeldsp.asm b/libavcodec/x86/hpeldsp.asm
> > index 7f0c285..81b6901 100644
> > --- a/libavcodec/x86/hpeldsp.asm
> > +++ b/libavcodec/x86/hpeldsp.asm
> > @@ -2,7 +2,7 @@
> >  ;*
> >  ;* Copyright (c) 2000-2001 Fabrice Bellard <fabrice at bellard.org>
> >  ;* Copyright (c)      Nick Kurshev <nickols_k at mail.ru>
> > -;* Copyright (c) 2002 Michael Niedermayer <michaelni at gmx.at>
> > +;* Copyright (c) 2002-2013 Michael Niedermayer <michaelni at gmx.at>
> >  ;* Copyright (c) 2002 Zdenek Kabelac <kabi at informatics.muni.cz>
> >  ;* Copyright (c) 2013 Daniel Kang
> >  ;*
> > @@ -513,3 +513,32 @@ cglobal avg_pixels16, 4,5,4
> >      lea          r0, [r0+r2*4]
> >      jnz       .loop
> >      REP_RET
> > +
> > +; put_pixels16_x2(uint8_t *block, const uint8_t *pixels, int line_size, int h)
> > +cglobal put_pixels16_x2, 4, 5, 4
> > +    movsxdifnidn r2, r2d
> > +    lea          r4, [r2*2]
> > +.loop:
> > +    movu         m0, [r1]
> > +    movu         m1, [r1+r2]
> > +    movu         m2, [r1+1]
> > +    movu         m3, [r1+r2+1]
> > +    pavgb        m0, m2
> > +    pavgb        m1, m3
> > +    mova       [r0], m0
> > +    mova    [r0+r2], m1
> > +    add          r1, r4
> > +    add          r0, r4
> > +    movu         m0, [r1]
> > +    movu         m1, [r1+r2]
> > +    movu         m2, [r1+1]
> > +    movu         m3, [r1+r2+1]
> > +    pavgb        m0, m2
> > +    pavgb        m1, m3
> > +    add          r1, r4
> > +    mova       [r0], m0
> > +    mova    [r0+r2], m1
> > +    add          r0, r4
> > +    sub         r3d, 4
> > +    jne .loop
> > +    REP_RET
> 
> I bet that this code is identical to the 8-pixel mmx/mmx2 version. How
> about you extend that version to use %+ mmsize in the cglobal line,
> and then instantiate it twice (INIT_MMX mmx, callmacro; INIT_XMM sse2,
> callmacro), so the same macro serves both versions? That would give
> smaller and more maintainable source code.

It differs in how pavgb is used: the mmx2 version uses pavgb with a
memory operand, so pavgb itself does the unaligned reads. mmx2/3dnow
supports that, but sse2 requires aligned addresses for pavgb memory
operands, so it requires different code (separate movu loads).

Combining the 2 functions would require either a bunch of if/else
or a macro for an unaligned pavgb; the latter would restrict the
instruction order, though.
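
For readers less familiar with the asm, here is a rough C sketch of what
the sse2 loop computes and why the loads are split out. It is not the
code from the patch: the function names put_pixels16_x2_ref and
put_pixels16_x2_sketch are made up for illustration, it handles one row
per iteration instead of four, and it assumes dst and the stride are
16-byte aligned (as the mova stores in the patch imply). Without SSE2
the sketch simply falls back to the scalar reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference (hypothetical name): each output byte is the rounded-up
 * average of a pixel and its right-hand neighbour, i.e. what pavgb computes
 * per byte. */
static void put_pixels16_x2_ref(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            dst[x] = (uint8_t)((src[x] + src[x + 1] + 1) >> 1);
        dst += stride;
        src += stride;
    }
}

/* Intrinsics sketch (hypothetical name). src and src + 1 cannot both be
 * 16-byte aligned, so both rows are fetched with explicit unaligned loads
 * (movu) before averaging; pavgb then only sees register operands. */
static void put_pixels16_x2_sketch(uint8_t *dst, const uint8_t *src,
                                   ptrdiff_t stride, int h)
{
#if defined(__SSE2__)
    /* Assumption: dst and stride are 16-byte aligned, so an aligned
     * store (mova) is safe. */
    assert(((uintptr_t)dst & 15) == 0 && (stride & 15) == 0);
    for (int y = 0; y < h; y++) {
        __m128i a = _mm_loadu_si128((const __m128i *)src);       /* movu  */
        __m128i b = _mm_loadu_si128((const __m128i *)(src + 1)); /* movu  */
        _mm_store_si128((__m128i *)dst, _mm_avg_epu8(a, b));     /* pavgb, mova */
        dst += stride;
        src += stride;
    }
#else
    put_pixels16_x2_ref(dst, src, stride, h);
#endif
}
```

The 8-pixel mmx2 code can fold the second load into pavgb itself, which
is the part that does not carry over to sse2 unchanged.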

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Answering whether a program halts or runs forever is
On a Turing machine, in general impossible (Turing's halting problem).
On any real computer, always possible as a real computer has a finite number
of states N, and will either halt in less than N cycles or never halt.