[FFmpeg-devel] [RFC] An improved implementation of?ARMv5TE?IDCT (simple_idct_armv5te.S)
Siarhei Siamashka
siarhei.siamashka
Sat Sep 15 10:35:43 CEST 2007
On 15 September 2007, Michael Niedermayer wrote:
> Hi
>
> On Fri, Sep 14, 2007 at 07:37:01PM +0300, Siarhei Siamashka wrote:
> [...]
>
> > > > +/*
> > > > + * a local pool with 64-bit constants for 'idct_rows_armv5te'
> > > > function, + * we align it at 16 byte boundary in order to ensure that
> > > > it does not cross + * cache line boundary (occupies only a single
> > > > cache line) + */
> > > > + .balign 16
> > >
> > > ASMALIGN()
> >
> > Does it mean I should define it as a macro at top of
> > simple_idct_armv5te.S and use it in the rest of the source file?
>
> no i mean you should use the ASMALIGN macro as *align is not portable
> different assemblers interpret it differently or dont support some forms
> at all
> i dont know if that is true for assemblers supporting ARM too but ...
I'm sorry, but please answer my original question. This ASMALIGN() macro
should be defined somewhere first, otherwise such statement will just result
in a syntax error. You already rejected a common header file for assembly
sources, so the only option to define this macro is at the top of
simple_idct_armv5te.S source. Is that correct? I just wanted to confirm it.
I'm fine with any solution as long as it compiles and works right.
By the way, you can check 'info as' for the information about '.align'
directive. You will notice that '.balign' is a portable way to specify
alignment across all platforms, as long as gnu assembler is used.
Do you want to be able to compile the same *.S source with a different
assembler? In this case this source still has lots of other portability
issues, namely gnu-specific macro definitions and the use of C-style
preprocessor.
> > > > + smulbt a2, a4, v5 /* b2 = W5*row[1] */
> > > > + smultt v2, a4, v5 /* b3 = W7*row[1] */
> > > > + smlatt a2, v1, v6, a2 /* b2 -= W1*row[3] */
> > > > + smlatt v2, a3, v7, v2 /* b3 += W3*row[5] */
> > > > + smlatt a2, a4, v7, a2 /* b2 += W7*row[5] */
> > > > + smlatt v2, v1, v8, v2 /* b3 -= W1*row[7] */
> > > > + smlatt a2, a3, v8, a2 /* b2 += W3*row[7] */
> > > > + smlabt v2, v1, v6, v2 /* b3 -= W5*row[3] */
> > >
> > > somehow i suspect that some speed could be gained by checking row
> > > 3/5/7 for being zero?
> >
> > Checking values for zero and branching seems to be quite expensive when
> > inserted in the middle of code, multiplying is very fast. A zero check
> > and a conditional branch (50%/50% probability) would be only useful on
> > ARM9E if it lets to skip over 7 or more multiply instructions.
>
> columns have all coefficients 0 except the first in ~90% of the cases
This special case is already handled. Just check the main loop in
'idct_rows_armv5te' function.
Anyway, that 50%/50% estimation was used just as an example, it does not
change the fact that branches are very slow and multiplications are fast
(just checking a value and branching over the some piece of code takes 4
cycles, the same number of cycles could be used to do 4 multiplications
instead). Multiple branches just over small chunks of code here and there
are not an option. And as I already said before, I'll check if splitting
code into 'idct_row_full' (row[0] to row[7] are all nonzero) and
'idct_row_partial' (rows[4] to row[7] are zero) will prove to be
useful. Though alternative paths for these two cases really do
increase code size somewhat.
> > > > +/*
> > > > + * Enforce 8 byte stack alignment if it is not provided by ABI. Used
> > > > at the beginning + * of global functions. If stack is not properly
> > > > aligned, real return address is + * pushed to stack (thus fixing
> > > > stack alignment) and lr register is set to a thunk + * function
> > > > 'unaligned_return_thunk_armv5te' which is responsible for providing +
> > > > * correct return from the function in this case.
> > > > + */
> > > > + .macro idct_stackalign_armv5te
> > > > +#ifndef DWORD_ALIGNED_STACK
> > > > + tst sp, #4
> > > > + strne lr, [sp, #-4]!
> > > > + adrne lr, unaligned_return_thunk_armv5te
> > > > #endif
> > > > + .endm
> > >
> > > the compiler has to maintain an properly aligned stack and if needed
> > > has to align it on entry to libavcodec (avcodec_decode_video() and
> > > such) its not acceptable to realign the stack in the inner loop calling
> > > the idct
> >
> > The compiler has to maintain ABI specified stack alignment (either OABI
> > or EABI for ARM). If ABI specifies 4-byte alignment (OABI case), we have
> > to either insert some code to manually align stack, or not use this
> > function at all. Stack alignment at 8-byte boundary can be performed
> > really fast with 3 instructions (3 cycles) on entry and one instruction
> > at exit (5 cycles) in the case when stack alignment was needed. The
> > overhead of manual stack alignment for OABI is 3+5/2 on average and is
> > equal to 5.5 cycles (ARM9E). That's quite cheap.
>
> as ive said, this code does NOT belong in the inner loop,
It depends on what you consider an inner loop. A 600+ cycles function does
not look like an inner loop to me. And 3 cycles is a minor overhead (if the
stack is already 64-bit aligned), which is only compiled in for legacy ABI,
with no impact on EABI targeted builds at all.
> its the compilers
> job to maintain the user specified alignment (see
> -mpreferred-stack-boundary in the case of gcc)
I know about the options supported by gcc, including this one in particular.
But could you please point me to an exact place where such compilation option
is specified for building ffmpeg? And for other projects using ffmpeg as part
of their source tree (mplayer for example)?
I have tracked this mailing list for a long time already and have seen quite a
number of posts about windows related improper stack alignment breakages. I
must say, I would not like to see something like this happening to ARM as
well.
In addition, you seem to care about portability. A random compiler X does not
have to provide an option to support arbitrary stack alignment specified by
user. There is a reason why such a thing as ABI was invented.
> stack realignment due to the ABI providing less is only needed on functions
> called from the outside (main(), avcodec_encode_video() ...)
>
> if gcc doesn maintain the alignment requested by the user its buggy
Anyway, I'm fine with removing this check as long as a disclaimer is added
that I'm not taking any responsibility for the stack alignment related
breakages, and the removal of the stack alignment fixup was one of the
requirements for this code to be accepted to ffmpeg.
More information about the ffmpeg-devel
mailing list