[FFmpeg-devel] [PATCH] Split H.264 luma dc idct, implement MMX/SSE2 versions

Jason Garrett-Glaser jason
Fri Jan 14 22:37:48 CET 2011


On Fri, Jan 14, 2011 at 9:22 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Thu, Jan 13, 2011 at 06:29:35PM -0800, Jason Garrett-Glaser wrote:
>> On Wed, Jan 12, 2011 at 8:06 PM, Jason Garrett-Glaser <jason at x264.com> wrote:
>> > This patch splits the H.264 i16x16 luma-dc idct and implements it in
>> > asm on x86. ?It does this by storing the DC coefficients in a separate
>> > location initially, then scattering them at the end of the asm
>> > function. ?This lets us use SIMD on the inverse transform and dequant.
>> >
>> > The result is 1043 -> 413 dezicycles spend in the inverse transform.
>> >
>> > TODOs:
>> >
>> > 1. ?Don't do the idct_dc/dequant if there are no coefficients. ?In the
>> > current architecture we don't know this; we'd need to add an entry to
>> > scan8 (x264 does this) or move the idct-dc call into cabac/cavlc (I'm
>> > fine with this too). ?You'd still have to modify them in the latter
>> > case to, for example, return the number of coefficients.
>> >
>> > 2. ?THIS PROBABLY BREAKS ARM/PPC/SIMILAR because of an extremely
>> > stupid architectural problem in ffh264. ?That is, the scantables are
>> > transposed in the case of asm, but not in the case of C. ?So this
>> > means that if my idct_dc function isn't implemented in asm for all
>> > architectures that have idct implemented in asm, they'll probably
>> > break. ?The best solution would be to just throw out the
>> > non-transposed scan table: there is zero benefit to having it at all
>> > and it just adds complexity and binary size.
>> >
>> > Jason
>> >
>>
>> Here's an updated version with three patches. ?It passes FATE with and
>> without asm.
>>
>> First patch: eliminate the non-transposed scantable mode. ?There's no
>> reason for this to even exist and other devs on IRC agree. ?It just
>> adds complexity. ?NB: This change allows some simplification, like
>> removing some entries in H264Context, but to keep this patch simple I
>> didn't do all possible simplifications.
>>
>> Second patch: Mostly rewritten with new asm and stuff to handle an
>> obnoxious corner case.
>>
>> Third patch: Use x264-style DC NNZ tracking to avoid calling
>> dequant_idct when there's no coefficients to deal with. ?This could
>> also be used in the future to add dc_dequant_idct, which would handle
>> the very common case of there being only a DC luma/chroma DC
>> coefficient. ?In short: in addition to what this does, there's extra
>> optimization opportunities added by this patch.
>>
>> Jason
>
> [...]
>> From a76ba804fe82157a4b5349877e826be206837ece Mon Sep 17 00:00:00 2001
>> From: Jason Garrett-Glaser <jason at x264.com>
>> Date: Thu, 13 Jan 2011 12:38:01 -0800
>> Subject: [PATCH 1/3] H.264: eliminate non-transposed scantable support
>> ?It was an ugly hack to begin with despite not actually helping performance.
>>
>> ---
>> ?libavcodec/h264.c ? ? | ? 34 +++++++----------------
>> ?libavcodec/h264idct.c | ? 72 ++++++++++++++++++++++++------------------------
>> ?2 files changed, 46 insertions(+), 60 deletions(-)
>
> lgtm if fate passes
>
>
>>
>> From 6f52b11d6365fe30dfead51d2d086d436259ee94 Mon Sep 17 00:00:00 2001
>> From: Jason Garrett-Glaser <jason at x264.com>
>> Date: Wed, 12 Jan 2011 19:16:23 -0800
>> Subject: [PATCH 2/3] Split H.264 luma dc idct, implement MMX/SSE2 versions
>>
>> ---
>> ?libavcodec/dsputil.h ? ? ? ? | ? ?4 +
>> ?libavcodec/h264.c ? ? ? ? ? ?| ? 49 ++------------
>> ?libavcodec/h264.h ? ? ? ? ? ?| ? ?5 +-
>> ?libavcodec/h264_cabac.c ? ? ?| ? ?8 +-
>> ?libavcodec/h264_cavlc.c ? ? ?| ? ?8 +-
>> ?libavcodec/h264dsp.c ? ? ? ? | ? ?1 +
>> ?libavcodec/h264dsp.h ? ? ? ? | ? ?2 +
>> ?libavcodec/h264idct.c ? ? ? ?| ? 35 ++++++++++
>> ?libavcodec/svq3.c ? ? ? ? ? ?| ? 20 +++---
>> ?libavcodec/x86/dsputil_mmx.c | ? ?1 +
>> ?libavcodec/x86/h264_idct.asm | ?145 ++++++++++++++++++++++++++++++++++++++++++
>> ?libavcodec/x86/h264dsp_mmx.c | ? ?4 +
>> ?12 files changed, 217 insertions(+), 65 deletions(-)
>>
> [...]
>> @@ -1245,9 +1205,14 @@ static av_always_inline void hl_decode_mb_internal(H264Context *h, int simple){
>> ? ? ? ? ? ? ? ? ?h->hpc.pred16x16[ h->intra16x16_pred_mode ](dest_y , linesize);
>> ? ? ? ? ? ? ? ? ?if(is_h264){
>> ? ? ? ? ? ? ? ? ? ? ?if(!transform_bypass)
>> - ? ? ? ? ? ? ? ? ? ? ? ?h264_luma_dc_dequant_idct_c(h->mb, s->qscale, h->dequant4_coeff[0][s->qscale][0]);
>> + ? ? ? ? ? ? ? ? ? ? ? ?h->h264dsp.h264_luma_dc_dequant_idct(h->mb, h->mb_luma_dc, h->dequant4_coeff[0][s->qscale][0]);
>> + ? ? ? ? ? ? ? ? ? ?else{
>> + ? ? ? ? ? ? ? ? ? ? ? ?static const uint8_t dc_mapping[16] = {0,1,4,5,2,3,6,7,8,9,12,13,10,11,14,15};
>> + ? ? ? ? ? ? ? ? ? ? ? ?for(i = 0; i < 16; i++)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?h->mb[dc_mapping[i]*16] = h->mb_luma_dc[i];
>
> the *16 can be merged into the table, i dont know if it makes a difference or
> if this conflicts with planed optims
> either way patch lgtm if it passes fate
>
>
> [...]
>
>> From 9b38a22e9e2e17ee5dc4d95d9cdeeac5f2fee186 Mon Sep 17 00:00:00 2001
>> From: Jason Garrett-Glaser <jason at x264.com>
>> Date: Thu, 13 Jan 2011 13:36:04 -0800
>> Subject: [PATCH 3/3] H.264: switch to x264-style tracking of luma/chroma DC NNZ
>> ?Useful so that we don't have to run the hierarchical DC iDCT if there aren't any coefficients.
>>
>> ---
>> ?libavcodec/h264.c ? ? ? | ? 28 ++++++++++++++++------------
>> ?libavcodec/h264.h ? ? ? | ? 19 ++++++++++++++++---
>> ?libavcodec/h264_cabac.c | ? 11 ++++++-----
>> ?libavcodec/h264_cavlc.c | ? 10 +++++-----
>> ?4 files changed, 43 insertions(+), 25 deletions(-)
>>
>> diff --git a/libavcodec/h264.c b/libavcodec/h264.c
>> index 56e0c4e..ec012de 100644
>> --- a/libavcodec/h264.c
>> +++ b/libavcodec/h264.c
>> @@ -1203,16 +1203,18 @@ static av_always_inline void hl_decode_mb_internal(H264Context *h, int simple){
>> ? ? ? ? ? ? ? ? ?}
>> ? ? ? ? ? ? ?}else{
>> ? ? ? ? ? ? ? ? ?h->hpc.pred16x16[ h->intra16x16_pred_mode ](dest_y , linesize);
>> + ? ? ? ? ? ? ? ?if(h->non_zero_count_cache[ scan8[LUMA_DC_BLOCK_INDEX] ]){
>> - ? ? ? ? ? ? ? ?if(is_h264){
>> + ? ? ? ? ? ? ? ? ? ?if(is_h264){
>> - ? ? ? ? ? ? ? ? ? ?if(!transform_bypass)
>> + ? ? ? ? ? ? ? ? ? ? ? ?if(!transform_bypass)
>> - ? ? ? ? ? ? ? ? ? ? ? ?h->h264dsp.h264_luma_dc_dequant_idct(h->mb, h->mb_luma_dc, h->dequant4_coeff[0][s->qscale][0]);
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?h->h264dsp.h264_luma_dc_dequant_idct(h->mb, h->mb_luma_dc, h->dequant4_coeff[0][s->qscale][0]);
>> - ? ? ? ? ? ? ? ? ? ?else{
>> + ? ? ? ? ? ? ? ? ? ? ? ?else{
>> - ? ? ? ? ? ? ? ? ? ? ? ?static const uint8_t dc_mapping[16] = {0,1,4,5,2,3,6,7,8,9,12,13,10,11,14,15};
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?static const uint8_t dc_mapping[16] = {0,1,4,5,2,3,6,7,8,9,12,13,10,11,14,15};
>> - ? ? ? ? ? ? ? ? ? ? ? ?for(i = 0; i < 16; i++)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?for(i = 0; i < 16; i++)
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ?h->mb[dc_mapping[i]*16] = h->mb_luma_dc[i];
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?h->mb[dc_mapping[i]*16] = h->mb_luma_dc[i];
>> - ? ? ? ? ? ? ? ? ? ?}
>> + ? ? ? ? ? ? ? ? ? ? ? ?}
>> - ? ? ? ? ? ? ? ?}else
>> + ? ? ? ? ? ? ? ? ? ?}else
>> - ? ? ? ? ? ? ? ? ? ?ff_svq3_luma_dc_dequant_idct_c(h->mb, h->mb_luma_dc, s->qscale);
>> + ? ? ? ? ? ? ? ? ? ? ? ?ff_svq3_luma_dc_dequant_idct_c(h->mb, h->mb_luma_dc, s->qscale);
>> + ? ? ? ? ? ? ? ?}
>> ? ? ? ? ? ? ?}
>> ? ? ? ? ? ? ?if(h->deblocking_filter)
>> ? ? ? ? ? ? ? ? ?xchg_mb_border(h, dest_y, dest_cb, dest_cr, linesize, uvlinesize, 0, simple);
>
> you loose one karma point for mixing addition of if() and reindent
>
> except this patch lgtm if it passes fate and is not slower
>
> [...]
> --
> Michael ? ? GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Democracy is the form of government in which you can choose your dictator
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
>
> iEYEARECAAYFAk0whmoACgkQYR7HhwQLD6vBMQCfRtNANzrJSG9GY4bQmmygBYzL
> ypUAnjYOnB5PHnOffR5BoKLdjryU/2yH
> =A0Ul
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at mplayerhq.hu
> https://lists.mplayerhq.hu/mailman/listinfo/ffmpeg-devel
>

Applied with cosmetics separated and massive bugs in patch 2) (I
forgot to implement win64/linux64 support) fixed.  All three asm
targets tested with FATE and passed.

Jason



More information about the ffmpeg-devel mailing list