[FFmpeg-devel] [PATCH] ARM: NEON optimised simple_idct

Mon Aug 25 19:52:21 CEST 2008

On Mon, Aug 25, 2008 at 03:53:29PM +0100, Måns Rullgård wrote:
> Michael Niedermayer <michaelni at gmx.at> writes:
> 
> > On Mon, Aug 25, 2008 at 04:06:33AM +0100, Mans Rullgard wrote:
> >> ---
> >>  libavcodec/Makefile                  |    2 +
> >>  libavcodec/armv4l/dsputil_arm.c      |   15 ++
> >>  libavcodec/armv4l/simple_idct_neon.S |  383 ++++++++++++++++++++++++++++++++++
> >>  libavcodec/avcodec.h                 |    1 +
> >>  libavcodec/utils.c                   |    1 +
> >>  5 files changed, 402 insertions(+), 0 deletions(-)
> >>  create mode 100644 libavcodec/armv4l/simple_idct_neon.S
> >> 
> >
> > is this idct binary identical in output to the C/MMX simple idct?
> 
> Yes.
> 
> >> +#ifdef HAVE_NEON
> >> +        } else if (idct_algo==FF_IDCT_SIMPLENEON){
> >> +            c->idct_put= ff_simple_idct_put_neon;
> >> +            c->idct_add= ff_simple_idct_add_neon;
> >> +            c->idct    = ff_simple_idct_neon;
> >> +            c->idct_permutation_type = FF_NO_IDCT_PERM;
> >> +#endif
> >
> > I do not know NEON at all, but I've never seen a SIMD instruction set for
> > which the identity permutation would have been optimal.
> >
> > Also, I suspect that the MMX simple IDCT is a better basis from which to
> > write other SIMD variants of the simple IDCT than the C one.
> 
> I can't read mmx code.  Could you explain briefly what optimisations
> are possible with permuted input?  NEON has more and wider registers
> than mmx, so it is reasonable to expect the optimal code to be quite
> different.

Sure, but I still think our MMX code (not only the simple IDCT) contains
a few tricks that should be applicable to many SIMD instruction sets.

Let's see what I remember about the simple IDCT:
1. It doesn't need any transposes, due to a tricky way of interleaving
   elements. This trick depends on the pmaddwd instruction:
   void pmaddwd(int32_t out[2], const int16_t in0[4], const int16_t in1[4]){
        for (int i = 0; i < 2; i++)
            out[i] = in0[2*i+0]*in1[2*i+0]
                   + in0[2*i+1]*in1[2*i+1];
   }
   If such an instruction isn't available, then that trick isn't usable as is.

   Still, it's likely better to use a transposed permutation instead of the
   identity one, as this means one transpose less in a SIMD IDCT.
2. Depending on the pattern of nonzero / all-zero rows, one of 8 optimized
   column transforms is used.
   This may be a bad idea, though, for a CPU with a small code cache ...
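
To make the two points above concrete, here is a minimal scalar sketch in C.
It is an illustration only and is not taken from the FFmpeg MMX code: the
names pmaddwd_model, transpose_perm and nonzero_row_mask are hypothetical
helpers. The first models pmaddwd on one 64-bit MMX register, the second
shows what a transposed coefficient permutation does to an 8x8 block index,
and the third computes a bitmask of nonzero rows. How the real code maps
such a pattern onto its 8 column-transform variants is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pmaddwd on one 64-bit register: four signed 16-bit
 * lanes per input, two signed 32-bit lanes out. (The single overflow
 * case, all four inputs equal to -32768, is not handled in this model.) */
static void pmaddwd_model(int32_t out[2],
                          const int16_t in0[4], const int16_t in1[4])
{
    for (int i = 0; i < 2; i++)
        out[i] = (int32_t)in0[2*i+0] * in1[2*i+0]
               + (int32_t)in0[2*i+1] * in1[2*i+1];
}

/* Transposed permutation of an index into an 8x8 block stored row by
 * row: element (row, col) moves to (col, row). If the decoder stores
 * coefficients through such a permutation, the IDCT receives its input
 * already transposed and can skip one transpose. */
static int transpose_perm(int i)
{
    assert(i >= 0 && i < 64);
    return ((i & 7) << 3) | (i >> 3);
}

/* Bit r of the result is set if row r of the block has any nonzero
 * coefficient. A dispatcher could use such a pattern to pick a cheaper
 * column transform (assumption: the selection logic itself is omitted). */
static unsigned nonzero_row_mask(const int16_t block[64])
{
    unsigned mask = 0;
    for (int r = 0; r < 8; r++) {
        int any = 0;
        for (int c = 0; c < 8; c++)
            any |= block[8*r + c];
        if (any)
            mask |= 1u << r;
    }
    return mask;
}
```

For example, calling pmaddwd_model with in0 = {1, 2, 3, 4} and
in1 = {10, 20, 30, 40} gives out = {50, 250}: each 32-bit output is the sum
of two adjacent 16-bit products, which is why interleaving the coefficients
suitably lets one instruction produce partial sums for two outputs at once.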

Also, maybe it would make sense to look at i386/idct_sse2_xvid.c,
which uses SSE2 (128-bit registers). It uses only 16-bit operations
for the column transform, so it may be faster when the tricks of the simple
IDCT aren't applicable.

Also, the

    Intel 64 and IA-32 Architectures Software Developer's Manual,
    Volume 2A (and B): Instruction Set Reference

contains very readable and unambiguous explanations of what all the
MMX and SSE* instructions do, if you ever want to decipher MMX or SSE code.

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Complexity theory is the science of finding the exact solution to an
approximation. Benchmarking OTOH is finding an approximation of the exact