[FFmpeg-devel] [PATCH 0/9] DCA (DTS) decoder optimisations for ARMv6

Mon Jul 15 19:28:08 CEST 2013

I present here a patch series aimed at making DCA decode practical
on the Raspberry Pi. The Pi uses an ARM1176JZF-S core, which is
ARMv6Z + VFPv2. Since DCA is a floating point codec, the
optimisations mostly rely upon hand-scheduled VFP code and the use
of short vectors. Note that short vectors are deprecated on Cortex-A8
and unsupported on Cortex-A9 and later, so the existing NEON
implementations remain the preferred code for ARMv7 or later.

Note that some of these patches result in floating point operations
being performed in a different order, with corresponding effects upon
rounding, so you might not always see a binary-identical result.
Additional subtle changes may be caused by the fact that I'm
configuring the VFP to RunFast mode, which amongst other things will
flush denormalised numbers to 0.
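
To illustrate the two effects described above, here is a minimal,
portable C sketch. It is not taken from the patches: the function names
are hypothetical, and flush_denormal() is only a software model of
flush-to-zero that always returns +0 (the sign behaviour of the real
VFP hardware is not modelled here).

```c
#include <math.h>

/* Sum three floats as (a + b) + c.
 * The volatile temporaries force each intermediate result to be
 * rounded to single precision before it is used again. */
float sum_left_to_right(float a, float b, float c)
{
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* Same operands, different association: a + (b + c). */
float sum_right_to_left(float a, float b, float c)
{
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}

/* Software model of flush-to-zero: subnormal values become zero,
 * everything else is returned unchanged. */
float flush_denormal(float x)
{
    return fpclassify(x) == FP_SUBNORMAL ? 0.0f : x;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, sum_left_to_right() returns 1.0f
while sum_right_to_left() returns 0.0f, because -1e8f + 1.0f rounds back
to -1e8f in single precision. Reordered assembly can differ from the C
reference in the same way, usually only in the last bit or so.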

I'm afraid I haven't been able to verify this using "make fate", since
I have been unable to find a base revision in git that passes the
tests even without any of my patches applied. This is true even of
supposedly known-good revisions from fate.ffmpeg.org, such as
786b096 (illustrated at http://tinyurl.com/p5hqrue). I haven't
identified whether this is due to toolchain or hardware differences:
I'm using gcc (Debian 4.6.3-14+rpi1) 4.6.3 on ARM1176JZF-S, whereas the
fate.ffmpeg.org machine uses gcc 4.4.7 (Ubuntu/Linaro 4.4.7-1ubuntu2)
on what is presumably a Cortex-A9.

Two of the optimisations rely upon new function pointers. The changes
to the C code to utilise these pointers are platform-independent, and
are given in separate patches from the optimisations themselves.
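
For readers unfamiliar with the pattern, the following is a rough
sketch of function-pointer dispatch of this kind. The struct, function
names and signature are hypothetical and chosen for illustration only;
they are not the ones used in the patches or in libavcodec.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DSP context holding one replaceable method. */
typedef struct DemoDSPContext {
    void (*int32_to_float_scaled)(float *dst, const int32_t *src,
                                  float mul, size_t len);
} DemoDSPContext;

/* Portable C reference implementation: dst[i] = src[i] * mul. */
static void int32_to_float_scaled_c(float *dst, const int32_t *src,
                                    float mul, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (float)src[i] * mul;
}

/* Stand-in for an architecture-specific init hook. A real one would
 * check CPU features and overwrite the pointer with an optimised
 * routine; this placeholder leaves the C version in place. */
static void demo_dsp_init_arch(DemoDSPContext *c)
{
    (void)c;
}

/* Install the C fallback first, then let the platform override it. */
void demo_dsp_init(DemoDSPContext *c)
{
    c->int32_to_float_scaled = int32_to_float_scaled_c;
    demo_dsp_init_arch(c);
}
```

Callers only ever go through the pointer, for example
c.int32_to_float_scaled(dst, src, 0.5f, len), so the calling code stays
platform-independent and an assembly version can be added later without
touching it.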

Benchmarks presented here were gathered using gperftools, a
statistical sampler. The numbers are the number of samples when
decoding a 30 minute test stream, averaged over 4 runs; lower numbers
represent faster operation.

The combined effect of this patch series is a speedup of
approximately 67%.

Ben Avison (9):
  [ARMv6] Add VFP-accelerated version of synth_filter_float
  [ARMv6] Add VFP-accelerated version of int32_to_float_fmul_scalar
  New fmtconvert method, int32_to_float_fmul_scalar_array
  [ARMv6] Add VFP-accelerated version of
    int32_to_float_fmul_scalar_array
  [ARMv6] Add VFP-accelerated version of imdct_half
  [ARMv6] Add VFP-accelerated version of dca_lfe_fir
  [ARMv6] Add VFP-accelerated version of fft16
  New dcadsp method, qmf_32_subbands
  [ARMv6] Add VFP-accelerated version of qmf_32_subbands

 libavcodec/arm/Makefile              |    6 +-
 libavcodec/arm/dcadsp_init_arm.c     |   14 +
 libavcodec/arm/dcadsp_vfp.S          |  491 ++++++++++++++++++++++++++++++++++
 libavcodec/arm/fft_init_arm.c        |   16 ++
 libavcodec/arm/fft_vfp.S             |  299 +++++++++++++++++++++
 libavcodec/arm/fmtconvert_init_arm.c |   16 +-
 libavcodec/arm/fmtconvert_vfp.S      |  200 ++++++++++++++
 libavcodec/arm/mdct_vfp.S            |  205 ++++++++++++++
 libavcodec/arm/synth_filter_vfp.S    |  244 +++++++++++++++++
 libavcodec/dcadec.c                  |   55 ++---
 libavcodec/dcadsp.c                  |   32 +++
 libavcodec/dcadsp.h                  |    6 +
 libavcodec/fmtconvert.c              |    7 +
 libavcodec/fmtconvert.h              |   14 +
 14 files changed, 1570 insertions(+), 35 deletions(-)
 create mode 100644 libavcodec/arm/dcadsp_vfp.S
 create mode 100644 libavcodec/arm/fft_vfp.S
 create mode 100644 libavcodec/arm/mdct_vfp.S
 create mode 100644 libavcodec/arm/synth_filter_vfp.S

-- 
1.7.5.4