[FFmpeg-devel] [PATCH] ARM: NEON optimised simple_idct
Måns Rullgård
mans
Mon Aug 25 20:47:16 CEST 2008
Michael Niedermayer <michaelni at gmx.at> writes:
> On Mon, Aug 25, 2008 at 03:53:29PM +0100, Måns Rullgård wrote:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>>
>> > On Mon, Aug 25, 2008 at 04:06:33AM +0100, Mans Rullgard wrote:
>> >> ---
>> >> libavcodec/Makefile | 2 +
>> >> libavcodec/armv4l/dsputil_arm.c | 15 ++
>> >> libavcodec/armv4l/simple_idct_neon.S | 383 ++++++++++++++++++++++++++++++++++
>> >> libavcodec/avcodec.h | 1 +
>> >> libavcodec/utils.c | 1 +
>> >> 5 files changed, 402 insertions(+), 0 deletions(-)
>> >> create mode 100644 libavcodec/armv4l/simple_idct_neon.S
>> >>
>> >
>> > is this idct binary identical in output to the C/MMX simple idct?
>>
>> Yes.
>>
>> >> +#ifdef HAVE_NEON
>> >> + } else if (idct_algo==FF_IDCT_SIMPLENEON){
>> >> + c->idct_put= ff_simple_idct_put_neon;
>> >> + c->idct_add= ff_simple_idct_add_neon;
>> >> + c->idct = ff_simple_idct_neon;
>> >> + c->idct_permutation_type = FF_NO_IDCT_PERM;
>> >> +#endif
>> >
>> > I do not know NEON at all, but I've never seen a SIMD instruction set for
>> > which the identity permutation would have been optimal.
>> >
>> > Also, I suspect that the MMX simple idct is a better basis from which to
>> > write other SIMD variants of the simple idct than the C one.
>>
>> I can't read mmx code. Could you explain briefly what optimisations
>> are possible with permuted input? NEON has more and wider registers
>> than mmx, so it is reasonable to expect the optimal code to be quite
>> different.
>
> Sure, but still I think our MMX code (not only the simple idct) contains
> a few tricks that should be applicable to many SIMD instruction sets.
>
> Let's see what I remember about the simple idct:
> 1. It doesn't need any transposes, thanks to a tricky way of interleaving
> elements. This trick depends on the pmaddw instruction
> pmaddw(int32_t out[], int16_t in0[], int16_t in1[]){
>     for (int i = 0; i < 2; i++)   /* one 32-bit sum per pair of 16-bit lanes */
>         out[i]= in0[2*i+0]*in1[2*i+0]
>                +in0[2*i+1]*in1[2*i+1];
> }
> If such an instruction isn't available then that trick isn't usable as is.
There is no such instruction. There's normal multiply-accumulate and
pairwise add (with optional accumulate).
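For readers following along, here is a minimal scalar sketch of the point above. It models the multiply-add from the quoted pseudocode (the Intel manuals list it as PMADDWD) for one 64-bit MMX register, and shows that the same sums can be formed from a lane-wise widening multiply followed by a pairwise add. The function names are made up for illustration, and this is plain C, not NEON or MMX code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the MMX multiply-add: four 16-bit lanes in,
 * two 32-bit sums of adjacent products out. */
static void pmadd_model(int32_t out[2], const int16_t in0[4], const int16_t in1[4])
{
    for (int i = 0; i < 2; i++)
        out[i] = (int32_t)in0[2 * i] * in1[2 * i]
               + (int32_t)in0[2 * i + 1] * in1[2 * i + 1];
}

/* The same result in two steps: a lane-wise widening multiply,
 * then a pairwise add of neighbouring products. */
static void mul_then_pairwise_add(int32_t out[2], const int16_t in0[4], const int16_t in1[4])
{
    int32_t prod[4];

    for (int i = 0; i < 4; i++)
        prod[i] = (int32_t)in0[i] * in1[i];
    for (int i = 0; i < 2; i++)
        out[i] = prod[2 * i] + prod[2 * i + 1];
}

/* Hypothetical self-check: both formulations agree on one input. */
static void check_equivalence(const int16_t in0[4], const int16_t in1[4])
{
    int32_t a[2], b[2];

    pmadd_model(a, in0, in1);
    mul_then_pairwise_add(b, in0, in1);
    assert(a[0] == b[0] && a[1] == b[1]);
}
```

The two-step version costs an extra instruction per group of lanes, which is presumably why the interleaving trick does not carry over unchanged.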
> Still, it's likely better to use a transposed permutation instead of
> the identity one, as this means 1 transpose less in a SIMD IDCT.
That idea struck me as well. I'll try it out.
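As a rough illustration of the transposed-permutation idea, assume the decoder writes each coefficient through a permutation table, so the block reaches the IDCT already transposed and one in-register transpose can be dropped. The helper names below are hypothetical and are not taken from the patch or from the existing dsputil code.

```c
#include <stdint.h>

/* Hypothetical helper: build a permutation that maps raster index
 * i = 8*row + col to the transposed position 8*col + row. */
static void build_transpose_perm(uint8_t perm[64])
{
    for (int i = 0; i < 64; i++)
        perm[i] = (uint8_t)(((i & 7) << 3) | (i >> 3));
}

/* Hypothetical helper: store coefficients given in raster order at
 * their permuted positions, so the block is transposed in memory. */
static void store_permuted(int16_t block[64], const int16_t coeffs[64],
                           const uint8_t perm[64])
{
    for (int i = 0; i < 64; i++)
        block[perm[i]] = coeffs[i];
}
```

With the identity permutation the table would simply be perm[i] = i, and the IDCT itself would have to do any transposing it needs.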
> 2. Depending on the pattern of nonzero / all-zero rows, one of 8
> optimized column transforms is used. This may be a bad idea, though,
> for a CPU with a small code cache ...
>
> Also, maybe it would make sense to look at i386/idct_sse2_xvid.c,
> which uses SSE2 (128-bit registers). It uses only 16-bit operations
> for the column transform, so it may be faster when the tricks of the simple
> idct aren't applicable.
Do you expect any sane person to be able to read that? That's also
not bitexact, right?
> also
>
> Intel 64 and IA-32 Architectures
> Software Developer's Manual
> Volume 2A (and B)
> Instruction Set Reference
>
> contains very readable and unambiguous explanations of what all the
> MMX and SSE* instructions do, if you ever want to decipher MMX or SSE code
I have those documents, and reading Chinese is easier.
--
Måns Rullgård
mans at mansr.com