[FFmpeg-devel] ARM NEON optimisations

Siarhei Siamashka siarhei.siamashka
Fri Aug 29 02:00:12 CEST 2008

On Monday 25 August 2008, Mans Rullgard wrote:
> NEON is the name of the SIMD unit present in ARMv7 processors such as
> the Cortex-A8.  Its register file is viewed as 16 128-bit registers or
> 32 64-bit registers, each a vector of 8/16/32-bit integers or 32-bit
> floats.
> This patch series adds NEON optimisations for the most heavily used
> FFmpeg dsputil functions.
> I will commit these soon unless there are objections.

Thanks, you did an impressive job and it is a good start. Your NEON code is
definitely not perfect, though. The main issue is that it takes no advantage
of the Cortex-A8 dual-issue capability, which wastes precious CPU cycles.

Please try to benchmark the attached code against your 'float_to_int16'
implementation. It is a lot faster than your code because it executes
load/store operations simultaneously with other instructions and because of
pipelining. If you can improve it further, that would be even better of
course; I did not spend much time fine-tuning it.

Your 'vector_fmul' implementation is also suboptimal:

+extern ff_vector_fmul_neon
+        mov           r3, r0
+        vld1.64       {d0-d3}, [r0,:128]!
+        vld1.64       {d4-d7}, [r1,:128]!
+        dmb
+1:      subs          r2, r2, #8
+        vmul.f32      q8, q0, q2
+        vmul.f32      q9, q1, q3
+        beq           2f
+        vld1.64       {d0-d3},   [r0,:128]!
+        vld1.64       {d4-d7},   [r1,:128]!
+        vst1.64       {d16-d19}, [r3,:128]!
+        b             1b
+2:      vst1.64       {d16-d19}, [r3,:128]!
+        bx            lr
+        .endfunc

Try to convert it to something more efficient as an exercise (split
multi-cycle instructions into single-cycle 128-bit loads/stores and 64-bit
multiplications, schedule instructions so that load/store and multiplication
operations run simultaneously, make this function use only a single conditional
jump, unroll it if needed to compensate for multiplication latency).
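
As a rough illustration of the loop shape meant here (this is not the attached
code and not NEON), below is a scalar C model. The name vector_fmul_pipelined
and the helper functions are hypothetical; it assumes in-place semantics
(dst[i] *= src[i]) and a length that is a positive multiple of 8, as the quoted
loop does. It only shows the prologue / steady state / epilogue structure, with
the loads for the next block placed between the multiply and the store of the
current block and one conditional branch per iteration. The actual dual-issue
pairing and the split into 64-bit multiplies can only be expressed in assembly.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK = 8 };  /* floats handled per iteration, as in the quoted loop */

/* Stand-in for a pair of 128-bit loads: copy one block into "registers". */
static void load_block(float *regs, const float *p)
{
    for (int i = 0; i < BLOCK; i++)
        regs[i] = p[i];
}

/* Stand-in for the multiplies: prod = a * b, element by element. */
static void mul_block(float *prod, const float *a, const float *b)
{
    for (int i = 0; i < BLOCK; i++)
        prod[i] = a[i] * b[i];
}

/* Stand-in for a pair of 128-bit stores. */
static void store_block(float *p, const float *regs)
{
    for (int i = 0; i < BLOCK; i++)
        p[i] = regs[i];
}

/* Hypothetical model: dst[i] *= src[i] for i in [0, len). */
void vector_fmul_pipelined(float *dst, const float *src, size_t len)
{
    float a[BLOCK], b[BLOCK], prod[BLOCK];

    assert(len > 0 && len % BLOCK == 0);

    /* Prologue: fetch the first block before entering the loop. */
    load_block(a, dst);
    load_block(b, src);

    /* Steady state: a single loop test per iteration. */
    while ((len -= BLOCK) != 0) {
        mul_block(prod, a, b);          /* multiply the current block   */
        load_block(a, dst + BLOCK);     /* fetch the next block ...     */
        load_block(b, src + BLOCK);
        store_block(dst, prod);         /* ... then store the current   */
        dst += BLOCK;
        src += BLOCK;
    }

    /* Epilogue: the last block has been loaded but not yet multiplied. */
    mul_block(prod, a, b);
    store_block(dst, prod);
}
```

In a real NEON version the point would be to pair each load or store with a
multiply in the same cycle; the C compiler is free to reorder the model above,
so it says nothing about timing.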

Your IDCT also contains long runs of load and store instructions; they could
be interleaved with the arithmetic instructions to make the code faster. You
also still use function calls quite a lot, just as in the ARMv6 IDCT. I don't
know yet how big the call/return overhead is on Cortex-A8, but ordinary loops
should probably be faster anyway. With a bit of imagination, many extra
tricks can be tried. For example, an interesting experiment would be to
try mixing ordinary ARM instructions into the flow of NEON instructions.
Surely, the NEON unit is faster at doing multiplications than the ARM unit (2x
faster if I am not missing something), but if they work simultaneously (by
offloading some part of the work to ARMv6 SIMD code), there could be some
performance improvement ... maybe :)

NEON looks very similar to MMX/SSE and has equivalents for almost all
operations (maybe even for all of them; I have only started looking into it).
A straightforward conversion of MMX/SSE code to NEON looks rather simple.
Getting optimal scheduling for the Cortex-A8 pipeline may be a bit more
difficult though.

Generally, it looks like the best strategy for Cortex-A8 NEON optimizations
would be to make better use of parallel instruction execution. To achieve
this, it is better to split slow 128-bit arithmetic operations into 64-bit
ones and try to run them in parallel with 128-bit loads/stores. 128-bit
arithmetic operations can still be used to reduce code size where splitting
them into 64-bit operations does not provide any extra benefit in
instruction scheduling. This may not be true for Cortex-A9, but we don't have
much information about it. Maybe ARM representatives here can share some
information? ;)

Overall, it would be nice to have something committed in order to keep moving
forward. Even if the code is not completely perfect, the 80/20 rule applies
here. Further fine-tuning may require a lot of effort, but the effect would
be much less visible than the switch from the older code to even naive,
straightforward NEON optimizations. That is, as long as you don't abandon
this code, keep improving it, and don't refuse to accept patches with
improvements for ARM ;) The first priority is to have bug-free code in
SVN. By the way, while we are at it, please check 'float_to_int16' for
rounding issues.
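
To show the kind of difference worth checking for (I am not claiming this is
what the patch does), here is a small scalar C sketch comparing a conversion
that truncates toward zero with one that rounds to nearest, both saturating to
the int16 range. The function names are hypothetical, the input is assumed to
be already scaled to the int16 range, and NaN input is not handled. As far as
I know the NEON float-to-integer vcvt rounds toward zero, so a direct
conversion would behave like the first function.

```c
#include <stdint.h>

/* Clamp a value to the int16 range before the integer conversion,
 * so the cast below never sees an out-of-range value. */
static double clamp_int16_range(double v)
{
    if (v < -32768.0)
        return -32768.0;
    if (v > 32767.0)
        return 32767.0;
    return v;
}

/* Hypothetical: convert by truncating toward zero, with saturation. */
int16_t float_to_int16_trunc(float x)
{
    return (int16_t)(int)clamp_int16_range((double)x);
}

/* Hypothetical: convert by rounding to nearest (halves away from zero),
 * with saturation. The bias is added in double precision so that adding
 * 0.5 does not itself round up a value just below a half. */
int16_t float_to_int16_nearest(float x)
{
    double d = (double)x;

    d += (d >= 0.0) ? 0.5 : -0.5;
    return (int16_t)(int)clamp_int16_range(d);
}
```

For an input of 1.7 the first function gives 1 and the second gives 2;
comparing the NEON output against the C reference on such values should show
quickly whether the two agree.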

I will not be doing any NEON optimizations in my free time for hobby
projects until I get some Cortex-A8 hardware at home (and I still want to get
some). Judging from the discussion here, it looks like it is better to wait
for at least revision C of the BeagleBoard with better silicon. Revision B
seems like a waste of money if I would still need to buy revision C later.

Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: float_to_int16_neon.c
Type: text/x-csrc
Size: 1945 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080829/cd97564b/attachment.c>
