[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)

Siarhei Siamashka siarhei.siamashka
Fri Sep 14 18:37:01 CEST 2007


On 14 September 2007, Michael Niedermayer wrote:
> Hi
>
> On Thu, Sep 13, 2007 at 07:34:06PM +0300, Siarhei Siamashka wrote:

[...]

> > Patch attached. In order not to break badly on possible MAX_NEG_CROP
> > changes in the future, header file 'dsputil.h' was split into two
> > parts: 'dsputil.h' and 'dsputil_asm.h'. The latter is supposed to contain
>
> just add a comment to the #define saying that the ARM code needs to be
> updated if the value is changed

Thanks, that's a nice and simple solution, I wonder how I missed it :)

>
> [...]
>
> > Index: libavcodec/armv4l/simple_idct_armv5te.S
> > ===================================================================
> > --- libavcodec/armv4l/simple_idct_armv5te.S	(revision 10483)
> > +++ libavcodec/armv4l/simple_idct_armv5te.S	(working copy)
>
> if the files have nothing in common besides the license headers then please
> dont diff them against each other

OK

> > +/*
> > + * a local pool with 64-bit constants for 'idct_rows_armv5te' function,
> > + * we align it at 16 byte boundary in order to ensure that it does not cross
> > + * cache line boundary (occupies only a single cache line)
> > + */
> > +        .balign 16
>
> ASMALIGN()

Does it mean I should define it as a macro at the top of simple_idct_armv5te.S
and use it in the rest of the source file?

>
> > +w2266idct_rows_armv5te:
> > +        .long W22
> > +        .long W66
> > +w1357idct_rows_armv5te:
> > +        .long W13
> > +        .long W57
> > +
> > +/*
> > + * A rows processing function. Benchmarks on a few video files show that
> > + * about 80-90% of uses of this function have all rows empty except for
> > + * the row[0].
> > + *
> > + * On entry:
> > + * a1              - row address
> > + * lr              - return address
> > + *
> > + * On exit:
> > + * a1              - row address
> > + *
> > + * Registers usage within this function:
> > + *  a1             - row address
> > + *  a2             - temporary register
> > + *  v5, v6, v7, v8 - row data
> > + *  v1, v2, v3, v4 - A0, A1, A2 and A3 variables
> > + *  a3, a4         - used for loading constants
> > + *  ip             - temporary register
> > + *  lr             - temporary register, also holds initial row address value
> > + *                   to check end of loop condition
> > + */
> > +        .balign 32
>
> why 32, didnt you just say cache lines are 16 ?

Cache line size is 32 bytes. The chunk of 16 bytes (four 32-bit variables) was
aligned at a 16-byte boundary above because that alignment is enough to ensure
it is stored in a single cache line, without crossing a cache line boundary.

Such .balign directives are used at the boundaries between code and data
chunks in order to ensure the best cache utilization. Before you ask why there
are three chunks of data inserted between various functions instead of being
stored in one place: the immediate offset is only 8-bit for LDRD instructions
(as opposed to 12-bit for normal LDR), so such chunks of data need to be stored
really close to the code that uses them via PC-relative addressing.
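
To make this a bit more concrete, the pattern used in the patch looks roughly
like the following (the label and register names here are only illustrative):

        .balign 16                     /* 16 bytes of data aligned at 16 bytes, */
local_pool:                            /* so it never straddles a 32-byte line  */
        .long  W22
        .long  W66

        .balign 32                     /* code resumes on a fresh cache line */
some_function:
        ...
        ldrd   a3, local_pool          /* a3/a4 loaded with one PC-relative LDRD; */
        ...                            /* its 8-bit offset keeps the pool nearby  */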

> [...]
>
> > +        smlabb v4, a3, v6, v1          /* v4 = v1 - W2*row[2] */
> > +        smlabb v3, a4, v6, v1          /* v3 = v1 - W6*row[2] */
> > +        smlatb v2, a4, v6, v1          /* v2 = v1 + W6*row[2] */
> > +        smlatb v1, a3, v6, v1          /* v1 = v1 + W2*row[2] */
>
> [---]
>
> > +        smlabb v4, a4, v8, v4          /* v4 -= W6*row[6] */
> > +        smlatb v3, a3, v8, v3          /* v3 += W2*row[6] */
> > +        smlabb v2, a3, v8, v2          /* v2 -= W2*row[6] */
> > +        smlatb v1, a4, v8, v1          /* v1 += W6*row[6] */
> > +        ldrd   a3, w1357idct_rows_armv5te /* a3 = W1 | (W3 << 16) */
> > +                                       /* a4 = W5 | (W7 << 16) */
>
> [---]
>
> > +        smlatb v4, a2, v7, v4          /* v4 += W4*row[4] */
> > +        smlabb v3, a2, v7, v3          /* v3 -= W4*row[4] */
> > +        smlabb v2, a2, v7, v2          /* v2 -= W4*row[4] */
> > +        smlatb v1, a2, v7, v1          /* v1 += W4*row[4] */
>
> i think this can be implemented in fewer instructions, something based on:
>
> v2 = v1 - W4*row[4]
> v1 = v1 + W4*row[4]
>
> v3 = v2 - W6*row[2]
> v4 = v1 - W2*row[2]
>
> v3 += W2*row[6]
> v4 -= W6*row[6]
>
> v2 = 2*v2 - v3
> v1 = 2*v1 - v4

Thanks for a good tip, I'll try it.
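
If I map it onto the packed constants the current patch already uses (a2
holding the -W4/W4 pair in its two halfwords, a3 and a4 holding the W2 and W6
pairs, row data in v6/v7/v8 as before), your variant could look roughly like
this untested sketch:

        smlabb v2, a2, v7, v1          /* v2 = v1 - W4*row[4] */
        smlatb v1, a2, v7, v1          /* v1 = v1 + W4*row[4] */
        smlabb v3, a4, v6, v2          /* v3 = v2 - W6*row[2] */
        smlabb v4, a3, v6, v1          /* v4 = v1 - W2*row[2] */
        smlatb v3, a3, v8, v3          /* v3 += W2*row[6] */
        smlabb v4, a4, v8, v4          /* v4 -= W6*row[6] */
        rsb    v2, v3, v2, lsl #1      /* v2 = 2*v2 - v3 */
        rsb    v1, v4, v1, lsl #1      /* v1 = 2*v1 - v4 */

That would be 8 instructions instead of the current 12, so it looks worthwhile
if the register allocation works out.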

> or you could check that the rows are non zero before multiplying and adding
> them
> same applies to the column code

Do you mean something like what I tried in one of the older revisions
(the 'idct_row_full' and 'idct_row_partial' labels)?:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&rev=82&view=markup

That's a somewhat gray area and the decision is not clear-cut. As most
videos contain a lot of empty rows (only row[0] is nonzero), the code at
'idct_row_partial' would be executed in only about 10% of cases. That branch
saves us 16 cycles in total. On the other hand, the conditional branch to
'idct_row_partial' is always executed and takes at least 1 cycle. So while a
16-cycle saving in 10% of cases still outweighs a 1-cycle cost paid every
time, there is one more concern: code size. As you could see in my previous
post with the -O2 vs. -O3 benchmarks, the mpeg4 decoder may actually be short
on instruction cache, and increasing code size without a really good reason
might not be a very good idea.
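
Just to be clear about what is being weighed here, the check itself boils down
to something like this (purely illustrative, assuming row[4]-row[7] are still
sitting in v7/v8 as in the current row loop):

        orrs   a2, v7, v8              /* are row[4..7] all zero? */
        beq    idct_row_partial        /* then skip the multiplies involving them */

so the always-paid cost is roughly the ORRS plus the not-taken branch.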

By the way, simple_idct_* can be split into three parts: rows processing,
columns processing, and storing the results to memory. Rows processing is the
least interesting of them from the performance optimization point of view,
as most of the rows are empty anyway. The empty-rows case is handled in a
tight and efficient loop now. Also, I already tried to submit an IDCT
optimization patch (affecting only rows processing) almost a year ago:
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-September/016376.html
While it showed some improvement on synthetic benchmarks, the improvement
(if any) on real videos could not be confirmed (it was probably below the
measurement error) :)

I'm not saying that rows processing should not be optimized. It just does not
deserve minor optimizations that add a noticeable code size increase, IMHO.

For the purpose of collecting statistics about the distribution of IDCT
coefficients in real videos, I hooked some code into simple_idct.c and added
an 'atexit'-registered function to print the final summary. I would be really
interested to know if my calculations are wrong and we cannot rely on them
when optimizing the code.

> btw do you have a reference which lists the number of cpu cycles per
> instruction?

ffmpeg doc/optimization.txt, 'ARM-specific' section

>
>
> [...]
>
> > +        smulbt a2, a3, v5              /* b0 = W1*row[1] */
> > +        smultt ip, a3, v5              /* tmp = W3*row[1] */
> > +        smultt lr, a4, v6              /* -b1 = W7*row[3] */
> > +        smlatt a2, a3, v6, a2          /* b0 += W3*row[3] */
> > +        smlabt lr, a3, v7, lr          /* -b1 += W1*row[5] */
> > +        smlabt a2, a4, v7, a2          /* b0 += W5*row[5] */
> > +        smlabt lr, a4, v8, lr          /* -b1 += W5*row[7] */
> > +        smlatt a2, a4, v8, a2          /* b0 += W7*row[7] */
> > +        sub    lr, ip, lr              /* b1 = -b1 - tmp */
> >
> > -        ldr    pc, [sp], #4
> > +        /* B0 is now calculated (a2), B1 is now calculated (lr) */
>
> [---]
>
> > +        add    ip, v1, a2              /* ip = (A0 + B0) */
> > +        sub    a2, v1, a2              /* a2 = (A0 - B0) */
> > +        mov    ip, ip, asr #ROW_SHIFT
> > +        mov    a2, a2, asr #ROW_SHIFT
> > +        strh   ip, [a1, #0]            /* row[0] = (A0 + B0) >> ROW_SHIFT */
> > +        strh   a2, [a1, #14]           /* row[7] = (A0 - B0) >> ROW_SHIFT */
> >
> > -        ldr    pc, [sp], #4
> > +        ldr    v1, m51                 /* v1 = ((-W5 & 0xFFFF) | ((-W1 & 0xFFFF) << 16)) */
> > +
> > +        add    ip, v2, lr              /* ip = (A1 + B1) */
> > +        sub    a2, v2, lr              /* ip = (A1 - B1) */
> > +        mov    ip, ip, asr #ROW_SHIFT
> > +        mov    a2, a2, asr #ROW_SHIFT
> > +        strh   ip, [a1, #2]            /* row[1] = (A1 + B1) >> ROW_SHIFT */
> > +        strh   a2, [a1, #12]           /* row[6] = (A1 - B1) >> ROW_SHIFT */
> > +
> > +        smulbt a2, a4, v5              /* b2 = W5*row[1] */
> > +        smultt v2, a4, v5              /* b3 = W7*row[1] */
> > +        smlatt a2, v1, v6, a2          /* b2 -= W1*row[3] */
> > +        smlatt v2, a3, v7, v2          /* b3 += W3*row[5] */
> > +        smlatt a2, a4, v7, a2          /* b2 += W7*row[5] */
> > +        smlatt v2, v1, v8, v2          /* b3 -= W1*row[7] */
> > +        smlatt a2, a3, v8, a2          /* b2 += W3*row[7] */
> > +        smlabt v2, v1, v6, v2          /* b3 -= W5*row[3] */
>
> somehow i suspect that some speed could be gained by checking  row 3/5/7
> for being zero?

Checking values for zero and branching seems to be quite expensive when
inserted in the middle of the code, while multiplying is very fast. A zero
check and a conditional branch (50%/50% probability) would only be useful on
ARM9E if it allowed skipping 7 or more multiply instructions.

By the way, before you point to another optimization opportunity in the
'idct_two_col_armv5te' macro - making more use of 64-bit loads via LDRD
instead of some of the LDR instructions - I would like to say that I tried
really hard to do it but could not come up with a better solution. The problem
is that even when the two LDR instructions at the beginning of this macro are
combined into one LDRD, it introduces a pipeline stall on ARM11 (because of
the load-use delay). On XScale everything is even worse: in order for LDRD to
execute in a single cycle, the next instruction must not be a load/store.
ARM9E always executes LDRD in 2 cycles, so there it is only useful for
improving code density. I also tried to reverse the A and B coefficient
calculations or merge this macro with the rest of the code in the loop. The
best I could get was a small performance increase on ARM11 (though XScale and
ARM9E are actually the targets this code is meant for), at the cost of some
slowdown on ARM9E. I'm not saying it's impossible, but unless somebody
suggests something bright that I haven't tried yet, I would leave further
improvements of this code to somebody else :)
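
Just to illustrate the problem with a simplified (and assumed) instruction
pair:

        ldrd   v5, [a1]                /* the two LDR loads merged into one LDRD */
        smulbb a2, a3, v5              /* ...but v5 is consumed immediately:     */
                                       /* load-use interlock on ARM11, and on    */
                                       /* ARM9E the LDRD takes 2 cycles anyway   */

and there is not much useful work that can be scheduled in between without
shuffling the rest of the macro around.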

Also, I have given up and plan to buy some XScale hardware soon (most likely a
Motorola ROKR E2/E6 or A1200 phone) to do real benchmarks and final tweaks on
it. Maybe I will even try to implement some IWMMXT SIMD optimizations :)

> [...]
>
> > +/*
> > + * Enforce 8 byte stack alignment if it is not provided by ABI. Used at the beginning
> > + * of global functions. If stack is not properly aligned, real return address is
> > + * pushed to stack (thus fixing stack alignment) and lr register is set to a thunk
> > + * function 'unaligned_return_thunk_armv5te' which is responsible for providing
> > + * correct return from the function in this case.
> > + */
> > +        .macro idct_stackalign_armv5te
> > +#ifndef DWORD_ALIGNED_STACK
> > +        tst    sp, #4
> > +        strne  lr, [sp, #-4]!
> > +        adrne  lr, unaligned_return_thunk_armv5te
> >  #endif
> > +        .endm
>
> the compiler has to maintain a properly aligned stack and if needed has
> to align it on entry to libavcodec (avcodec_decode_video() and such)
> it's not acceptable to realign the stack in the inner loop calling the
> idct

The compiler has to maintain the ABI-specified stack alignment (either OABI or
EABI for ARM). If the ABI specifies 4-byte alignment (the OABI case), we have
to either insert some code to align the stack manually, or not use this
function at all. Stack alignment to an 8-byte boundary can be performed really
fast: 3 instructions (3 cycles) on entry, plus one instruction at exit
(5 cycles) in the case when realignment was actually needed. The average
overhead of manual stack alignment for OABI is therefore 3 + 5/2 = 5.5 cycles
(ARM9E). That's quite cheap.
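
The exit side is handled by the thunk, which only needs to pop the real return
address back into the program counter, something like:

unaligned_return_thunk_armv5te:
        ldr    pc, [sp], #4            /* return via the saved address; this load */
                                       /* to pc is where the 5 cycles come from   */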

More information about OABI vs. EABI can be found here:
http://wiki.debian.org/ArmEabiPort

EABI was designed to fix design flaws in OABI and to better support modern ARM
hardware (btw, 64-bit stack alignment for use with the LDRD/STRD instructions
is nothing compared to the floating point woes: modern ARM cores that have
floating point support in hardware still have to emulate floating point math
in the kernel via an unhandled instruction trap). Many ARM systems are still
using OABI at the moment; Debian Linux does not have an official port to
EABI yet.

While I have an EABI system myself, I verified with some extra tests that this
stack realignment code actually works.

By the way, there is one more ABI-specific problem with both the old and the
new simple_idct_armv5te code. Register r9 can be either reserved by the system
or used as an additional general purpose register (aliased as v6), depending
on the target system. Linux EABI specifies that r9 can be used as an
additional general purpose register, but I'm not sure about Windows Mobile
or Palm.

> > Index: libavcodec/armv4l/dsputil_arm.c
> > ===================================================================
> > --- libavcodec/armv4l/dsputil_arm.c	(revision 10483)
> > +++ libavcodec/armv4l/dsputil_arm.c	(working copy)
> > @@ -29,11 +29,11 @@
> >  extern void j_rev_dct_ARM(DCTELEM *data);
> >  extern void simple_idct_ARM(DCTELEM *data);
> >
> > -extern void simple_idct_armv5te(DCTELEM *data);
> > -extern void simple_idct_put_armv5te(uint8_t *dest, int line_size,
> > -                                    DCTELEM *data);
> > -extern void simple_idct_add_armv5te(uint8_t *dest, int line_size,
> > -                                    DCTELEM *data);
> > +extern void ff_simple_idct_armv5te(DCTELEM *data);
> > +extern void ff_simple_idct_put_armv5te(uint8_t *dest, int line_size,
> > +                                       DCTELEM *data);
> > +extern void ff_simple_idct_add_armv5te(uint8_t *dest, int line_size,
> > +                                       DCTELEM *data);
>
> this renaming belongs to a separate patch
>
> [...]

OK



