[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)

Michael Niedermayer michaelni
Sat Sep 15 00:47:56 CEST 2007


Hi

On Fri, Sep 14, 2007 at 07:37:01PM +0300, Siarhei Siamashka wrote:
[...]
> > > +/*
> > > + * a local pool with 64-bit constants for 'idct_rows_armv5te' function,
> > > + * we align it at 16 byte boundary in order to ensure that it does not
> > > + * cross cache line boundary (occupies only a single cache line)
> > > + */
> > > +        .balign 16
> >
> > ASMALIGN()
> 
> Does it mean I should define it as a macro at the top of
> simple_idct_armv5te.S and use it in the rest of the source file?

No, I mean you should use the ASMALIGN macro, as *align is not portable:
different assemblers interpret it differently or don't support some forms
at all. I don't know if that is true for assemblers supporting ARM too,
but ...
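
A minimal sketch of what a portable variant could look like in a .S file
(the ALIGN_16 name and the ASMALIGN_POT test are hypothetical here;
ASMALIGN as defined in config.h is a quoted string aimed at inline asm
in C, so a .S file would need an unquoted variant along these lines):

    /* hypothetical configure-selected alignment macro for .S files */
    #ifdef ASMALIGN_POT
    #   define ALIGN_16 .align 4     /* assembler wants log2: 2^4 = 16 bytes */
    #else
    #   define ALIGN_16 .align 16    /* assembler wants a byte count */
    #endif

            ALIGN_16                 /* keep the pool within one cache line */
    w13:    .short 22725, 19266      /* W1, W3 (illustrative pool entries) */
    w57:    .short 12873,  4520      /* W5, W7 */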


[...]
> 
> > Or you could check that the rows are non-zero before multiplying and
> > adding them. The same applies to the column code.
> 
> Do you suggest something like what I tried in one of the older revisions
> (the 'idct_row_full' and 'idct_row_partial' labels)?:
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&rev=82&view=markup
> 
> That's a somewhat gray area and the decision is not clear. As most
> videos contain a lot of empty rows (only row[0] is nonzero), the code
> from 'idct_row_partial' would be executed in only about 10% of cases.
> That branch saves us 16 cycles in total. On the other hand, the
> conditional branch to 'idct_row_partial' is always executed and takes
> at least 1 cycle. So while the 16-cycle saving still outweighs the
> 1-cycle branch cost by more than a factor of 10, one more concern is
> code size. As you could see in my previous post with the -O2 vs. -O3
> benchmarks, the mpeg4 decoder may actually be short on instruction
> cache, and increasing code size without a really good reason might not
> be a good idea.

I don't think a single branch instruction to skip over that code will
significantly increase the code size.
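
For illustration, a sketch of how such a single-branch check over a row
could look, assuming the row's eight halfwords have just been loaded as
four words (register use and the fast-path behaviour are illustrative;
the C simple_idct handles an all-zero-AC row by replicating row[0] << 3):

            ldmia  a1, {a2, a3, a4, v1}    /* row[0..7] as four packed words */
            orr    ip, a3, a4              /* row[2..3] | row[4..5]          */
            orr    ip, ip, v1              /* ... | row[6..7]                */
            orrs   ip, ip, a2, lsr #16     /* ... | row[1] (little-endian)   */
            bne    idct_row_full           /* some AC coefficient is nonzero */
            /* fall through: only row[0] is set, so store (row[0] << 3) into
               all eight outputs and return */

On ARM9E the untaken conditional branch costs the single cycle being
weighed against the 16-cycle saving above.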


[...]
> >
> >
> > [...]
> >
> > > +        smulbt a2, a3, v5              /* b0 = W1*row[1] */
> > > +        smultt ip, a3, v5              /* tmp = W3*row[1] */
> > > +        smultt lr, a4, v6              /* -b1 = W7*row[3] */
> > > +        smlatt a2, a3, v6, a2          /* b0 += W3*row[3] */
> > > +        smlabt lr, a3, v7, lr          /* -b1 += W1*row[5] */
> > > +        smlabt a2, a4, v7, a2          /* b0 += W5*row[5] */
> > > +        smlabt lr, a4, v8, lr          /* -b1 += W5*row[7] */
> > > +        smlatt a2, a4, v8, a2          /* b0 += W7*row[7] */
> > > +        sub    lr, ip, lr              /* b1 = tmp - (-b1) */
> > >
> > > -        ldr    pc, [sp], #4
> > > +        /* B0 is now calculated (a2), B1 is now calculated (lr) */
> >
> > [---]
> >
> > > +        add    ip, v1, a2              /* ip = (A0 + B0) */
> > > +        sub    a2, v1, a2              /* a2 = (A0 - B0) */
> > > +        mov    ip, ip, asr #ROW_SHIFT
> > > +        mov    a2, a2, asr #ROW_SHIFT
> > > +        strh   ip, [a1, #0]            /* row[0] = (A0 + B0) >> ROW_SHIFT */
> > > +        strh   a2, [a1, #14]           /* row[7] = (A0 - B0) >> ROW_SHIFT */
> > >
> > > -        ldr    pc, [sp], #4
> > > +        ldr    v1, m51                 /* v1 = ((-W5 & 0xFFFF) | ((-W1 & 0xFFFF) << 16)) */
> > > +
> > > +        add    ip, v2, lr              /* ip = (A1 + B1) */
> > > +        sub    a2, v2, lr              /* a2 = (A1 - B1) */
> > > +        mov    ip, ip, asr #ROW_SHIFT
> > > +        mov    a2, a2, asr #ROW_SHIFT
> > > +        strh   ip, [a1, #2]            /* row[1] = (A1 + B1) >> ROW_SHIFT */
> > > +        strh   a2, [a1, #12]           /* row[6] = (A1 - B1) >> ROW_SHIFT */
> > > +
> > > +        smulbt a2, a4, v5              /* b2 = W5*row[1] */
> > > +        smultt v2, a4, v5              /* b3 = W7*row[1] */
> > > +        smlatt a2, v1, v6, a2          /* b2 -= W1*row[3] */
> > > +        smlatt v2, a3, v7, v2          /* b3 += W3*row[5] */
> > > +        smlatt a2, a4, v7, a2          /* b2 += W7*row[5] */
> > > +        smlatt v2, v1, v8, v2          /* b3 -= W1*row[7] */
> > > +        smlatt a2, a3, v8, a2          /* b2 += W3*row[7] */
> > > +        smlabt v2, v1, v6, v2          /* b3 -= W5*row[3] */
> >
> > Somehow I suspect that some speed could be gained by checking rows
> > 3/5/7 for being zero?
> 
> Checking values for zero and branching seems to be quite expensive when
> inserted in the middle of code; multiplying is very fast. A zero check
> and a conditional branch (50%/50% probability) would only be useful on
> ARM9E if it allowed skipping 7 or more multiply instructions.

Columns have all coefficients 0 except the first in ~90% of the cases.
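
A sketch of what that ~90% shortcut could look like for one column (the
idct_col_dc_only label is hypothetical; the offsets assume the usual
int16_t block[8][8] layout, where vertically adjacent coefficients sit
16 bytes apart, and the loads are interleaved with the orr chain to hide
the ARM9E load-use latency):

            ldrsh  a2, [a1, #16]           /* col[1] */
            ldrsh  a3, [a1, #32]           /* col[2] */
            ldrsh  a4, [a1, #48]           /* col[3] */
            orr    ip, a2, a3
            ldrsh  a2, [a1, #64]           /* col[4] */
            orr    ip, ip, a4
            ldrsh  a3, [a1, #80]           /* col[5] */
            orr    ip, ip, a2
            ldrsh  a4, [a1, #96]           /* col[6] */
            orr    ip, ip, a3
            ldrsh  a2, [a1, #112]          /* col[7] */
            orr    ip, ip, a4
            orrs   ip, ip, a2
            beq    idct_col_dc_only        /* ~90% case: column is DC only */

The seven loads would be needed by the full column pass anyway, so in the
common case the check trades one branch for the whole multiply/accumulate
chain.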


[...]
> > [...]
> >
> > > +/*
> > > + * Enforce 8 byte stack alignment if it is not provided by ABI. Used at
> > > + * the beginning of global functions. If stack is not properly aligned,
> > > + * real return address is pushed to stack (thus fixing stack alignment)
> > > + * and lr register is set to a thunk function
> > > + * 'unaligned_return_thunk_armv5te' which is responsible for providing
> > > + * correct return from the function in this case.
> > > + */
> > > +        .macro idct_stackalign_armv5te
> > > +#ifndef DWORD_ALIGNED_STACK
> > > +        tst    sp, #4
> > > +        strne  lr, [sp, #-4]!
> > > +        adrne  lr, unaligned_return_thunk_armv5te
> > >  #endif
> > > +        .endm
> >
> > The compiler has to maintain a properly aligned stack, and if needed it
> > has to align it on entry to libavcodec (avcodec_decode_video() and such).
> > It's not acceptable to realign the stack in the inner loop calling the
> > IDCT.
> 
> The compiler has to maintain the ABI-specified stack alignment (either
> OABI or EABI for ARM). If the ABI specifies 4-byte alignment (the OABI
> case), we have to either insert some code to manually align the stack,
> or not use this function at all. Stack alignment to an 8-byte boundary
> can be performed really fast: 3 instructions (3 cycles) on entry, plus
> one instruction at exit (5 cycles) in the case when realignment was
> actually needed. Since that exit path is taken roughly half the time,
> the average overhead of manual stack alignment for OABI is
> 3 + 5/2 = 5.5 cycles (ARM9E). That's quite cheap.
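
For reference, the 5-cycle exit path counted above is presumably just the
thunk popping the saved return address, i.e. something like this (a
sketch; a load into pc costs about 5 cycles on ARM9E, matching the figure
quoted):

    unaligned_return_thunk_armv5te:
            ldr    pc, [sp], #4            /* pop the real return address
                                              pushed by idct_stackalign_armv5te */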

As I've said, this code does NOT belong in the inner loop; it's the
compiler's job to maintain the user-specified alignment (see
-mpreferred-stack-boundary in the case of gcc). Stack realignment, where
the ABI provides less, is only needed in functions called from the
outside (main(), avcodec_encode_video() ...).

If gcc doesn't maintain the alignment requested by the user, it's buggy.

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

It's not that you shouldn't use gotos but rather that you should write
readable code, and code with gotos often (but not always) is less readable