[FFmpeg-devel] [PATCH] Split H.264 luma dc idct, implement MMX/SSE2 versions

Fri Jan 14 18:22:50 CET 2011

On Thu, Jan 13, 2011 at 06:29:35PM -0800, Jason Garrett-Glaser wrote:
> On Wed, Jan 12, 2011 at 8:06 PM, Jason Garrett-Glaser <jason at x264.com> wrote:
> > This patch splits the H.264 i16x16 luma-dc idct and implements it in
> > asm on x86. ?It does this by storing the DC coefficients in a separate
> > location initially, then scattering them at the end of the asm
> > function. ?This lets us use SIMD on the inverse transform and dequant.
> >
> > The result is 1043 -> 413 dezicycles spend in the inverse transform.
> >
> > TODOs:
> >
> > 1. ?Don't do the idct_dc/dequant if there are no coefficients. ?In the
> > current architecture we don't know this; we'd need to add an entry to
> > scan8 (x264 does this) or move the idct-dc call into cabac/cavlc (I'm
> > fine with this too). ?You'd still have to modify them in the latter
> > case to, for example, return the number of coefficients.
> >
> > 2. ?THIS PROBABLY BREAKS ARM/PPC/SIMILAR because of an extremely
> > stupid architectural problem in ffh264. ?That is, the scantables are
> > transposed in the case of asm, but not in the case of C. ?So this
> > means that if my idct_dc function isn't implemented in asm for all
> > architectures that have idct implemented in asm, they'll probably
> > break. ?The best solution would be to just throw out the
> > non-transposed scan table: there is zero benefit to having it at all
> > and it just adds complexity and binary size.
> >
> > Jason
> >
> 
> Here's an updated version with three patches.  It passes FATE with and
> without asm.
> 
> First patch: eliminate the non-transposed scantable mode.  There's no
> reason for this to even exist and other devs on IRC agree.  It just
> adds complexity.  NB: This change allows some simplification, like
> removing some entries in H264Context, but to keep this patch simple I
> didn't do all possible simplifications.
> 
> Second patch: Mostly rewritten with new asm and stuff to handle an
> obnoxious corner case.
> 
> Third patch: Use x264-style DC NNZ tracking to avoid calling
> dequant_idct when there's no coefficients to deal with.  This could
> also be used in the future to add dc_dequant_idct, which would handle
> the very common case of there being only a DC luma/chroma DC
> coefficient.  In short: in addition to what this does, there's extra
> optimization opportunities added by this patch.
> 
> Jason

[...]
> From a76ba804fe82157a4b5349877e826be206837ece Mon Sep 17 00:00:00 2001
> From: Jason Garrett-Glaser <jason at x264.com>
> Date: Thu, 13 Jan 2011 12:38:01 -0800
> Subject: [PATCH 1/3] H.264: eliminate non-transposed scantable support
>  It was an ugly hack to begin with despite not actually helping performance.
> 
> ---
>  libavcodec/h264.c     |   34 +++++++----------------
>  libavcodec/h264idct.c |   72 ++++++++++++++++++++++++------------------------
>  2 files changed, 46 insertions(+), 60 deletions(-)

lgtm if fate passes


>
> From 6f52b11d6365fe30dfead51d2d086d436259ee94 Mon Sep 17 00:00:00 2001
> From: Jason Garrett-Glaser <jason at x264.com>
> Date: Wed, 12 Jan 2011 19:16:23 -0800
> Subject: [PATCH 2/3] Split H.264 luma dc idct, implement MMX/SSE2 versions
> 
> ---
>  libavcodec/dsputil.h         |    4 +
>  libavcodec/h264.c            |   49 ++------------
>  libavcodec/h264.h            |    5 +-
>  libavcodec/h264_cabac.c      |    8 +-
>  libavcodec/h264_cavlc.c      |    8 +-
>  libavcodec/h264dsp.c         |    1 +
>  libavcodec/h264dsp.h         |    2 +
>  libavcodec/h264idct.c        |   35 ++++++++++
>  libavcodec/svq3.c            |   20 +++---
>  libavcodec/x86/dsputil_mmx.c |    1 +
>  libavcodec/x86/h264_idct.asm |  145 ++++++++++++++++++++++++++++++++++++++++++
>  libavcodec/x86/h264dsp_mmx.c |    4 +
>  12 files changed, 217 insertions(+), 65 deletions(-)
> 
[...]
> @@ -1245,9 +1205,14 @@ static av_always_inline void hl_decode_mb_internal(H264Context *h, int simple){
>                  h->hpc.pred16x16[ h->intra16x16_pred_mode ](dest_y , linesize);
>                  if(is_h264){
>                      if(!transform_bypass)
> -                        h264_luma_dc_dequant_idct_c(h->mb, s->qscale, h->dequant4_coeff[0][s->qscale][0]);
> +                        h->h264dsp.h264_luma_dc_dequant_idct(h->mb, h->mb_luma_dc, h->dequant4_coeff[0][s->qscale][0]);
> +                    else{
> +                        static const uint8_t dc_mapping[16] = {0,1,4,5,2,3,6,7,8,9,12,13,10,11,14,15};
> +                        for(i = 0; i < 16; i++)
> +                            h->mb[dc_mapping[i]*16] = h->mb_luma_dc[i];

the *16 can be merged into the table, i dont know if it makes a difference or
if this conflicts with planed optims
either way patch lgtm if it passes fate


[...]

> From 9b38a22e9e2e17ee5dc4d95d9cdeeac5f2fee186 Mon Sep 17 00:00:00 2001
> From: Jason Garrett-Glaser <jason at x264.com>
> Date: Thu, 13 Jan 2011 13:36:04 -0800
> Subject: [PATCH 3/3] H.264: switch to x264-style tracking of luma/chroma DC NNZ
>  Useful so that we don't have to run the hierarchical DC iDCT if there aren't any coefficients.
> 
> ---
>  libavcodec/h264.c       |   28 ++++++++++++++++------------
>  libavcodec/h264.h       |   19 ++++++++++++++++---
>  libavcodec/h264_cabac.c |   11 ++++++-----
>  libavcodec/h264_cavlc.c |   10 +++++-----
>  4 files changed, 43 insertions(+), 25 deletions(-)
> 
> diff --git a/libavcodec/h264.c b/libavcodec/h264.c
> index 56e0c4e..ec012de 100644
> --- a/libavcodec/h264.c
> +++ b/libavcodec/h264.c
> @@ -1203,16 +1203,18 @@ static av_always_inline void hl_decode_mb_internal(H264Context *h, int simple){
>                  }
>              }else{
>                  h->hpc.pred16x16[ h->intra16x16_pred_mode ](dest_y , linesize);
> +                if(h->non_zero_count_cache[ scan8[LUMA_DC_BLOCK_INDEX] ]){
> -                if(is_h264){
> +                    if(is_h264){
> -                    if(!transform_bypass)
> +                        if(!transform_bypass)
> -                        h->h264dsp.h264_luma_dc_dequant_idct(h->mb, h->mb_luma_dc, h->dequant4_coeff[0][s->qscale][0]);
> +                            h->h264dsp.h264_luma_dc_dequant_idct(h->mb, h->mb_luma_dc, h->dequant4_coeff[0][s->qscale][0]);
> -                    else{
> +                        else{
> -                        static const uint8_t dc_mapping[16] = {0,1,4,5,2,3,6,7,8,9,12,13,10,11,14,15};
> +                            static const uint8_t dc_mapping[16] = {0,1,4,5,2,3,6,7,8,9,12,13,10,11,14,15};
> -                        for(i = 0; i < 16; i++)
> +                            for(i = 0; i < 16; i++)
> -                            h->mb[dc_mapping[i]*16] = h->mb_luma_dc[i];
> +                                h->mb[dc_mapping[i]*16] = h->mb_luma_dc[i];
> -                    }
> +                        }
> -                }else
> +                    }else
> -                    ff_svq3_luma_dc_dequant_idct_c(h->mb, h->mb_luma_dc, s->qscale);
> +                        ff_svq3_luma_dc_dequant_idct_c(h->mb, h->mb_luma_dc, s->qscale);
> +                }
>              }
>              if(h->deblocking_filter)
>                  xchg_mb_border(h, dest_y, dest_cb, dest_cr, linesize, uvlinesize, 0, simple);

you loose one karma point for mixing addition of if() and reindent

except this patch lgtm if it passes fate and is not slower

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Democracy is the form of government in which you can choose your dictator
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110114/c0db406b/attachment.pgp>