[FFmpeg-devel] [PATCH] VP3 DC-only IDCT
Måns Rullgård
mans
Sat Mar 13 14:50:51 CET 2010
David Conrad <lessen42 at gmail.com> writes:
> Hi,
>
> This gives 2-4% faster overall decode for normal files.
>
> Some thoughts:
> I can't think of any shortcuts that could make the IDCT faster with 128-bit SIMD that don't rely on knowing the last non-zero coefficient.
>
> If that is known before calling the IDCT, you could do a slightly faster IDCT that assumes the right and bottom of the block are all 0. This seems to be significantly faster only for MMX; for SSE2 the cost of the added check nearly cancels the time saved.
>
> For an average video, around a third of all idcts are DC-only, a third more could be done with that shortcut (i.e. last_nnz is under 10), and the rest require a full IDCT.
>
> libtheora only does the 10-element shortcut, not DC-only. It also has only an MMX IDCT.
>
> I also haven't really looked at whether a DC-only IDCT is beneficial for mpeg codecs, thus the vp3-specific dsputil function.
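
The three-way split described in the quoted thoughts above can be sketched in C roughly as follows. This is only an illustration: the enum, the function name, and the reading of last_nnz as the scan-order index of the last non-zero coefficient (0 meaning DC only) are assumptions, not part of the patch.

```c
#include <stdint.h>

/* Hypothetical names; neither the patch nor FFmpeg defines these. */
enum idct_kind {
    IDCT_DC_ONLY,   /* only the DC coefficient is non-zero          */
    IDCT_SPARSE10,  /* right and bottom of the block assumed zero   */
    IDCT_FULL       /* no assumption, full 8x8 IDCT                 */
};

/* last_nnz is assumed to be the scan-order index of the last
 * non-zero coefficient, so 0 means that only DC is present. */
static enum idct_kind pick_idct(int last_nnz)
{
    if (last_nnz == 0)
        return IDCT_DC_ONLY;
    if (last_nnz < 10)          /* "last_nnz is under 10" in the mail */
        return IDCT_SPARSE10;
    return IDCT_FULL;
}
```

Whether the extra branch pays off depends on the target, as noted above for MMX versus SSE2.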
>
>
> commit 0c4da1f09d90f7aec230b190195e063d51a2f3d8
> Author: David Conrad <lessen42 at gmail.com>
> Date: Sat Mar 13 01:13:57 2010 -0500
>
> vp3: DC-only IDCT
>
> 2-4% faster overall decode
>
> diff --git a/libavcodec/arm/dsputil_init_neon.c b/libavcodec/arm/dsputil_init_neon.c
> index 4a8de5e..9644748 100644
> --- a/libavcodec/arm/dsputil_init_neon.c
> +++ b/libavcodec/arm/dsputil_init_neon.c
> @@ -32,6 +32,7 @@ void ff_simple_idct_add_neon(uint8_t *dest, int line_size, DCTELEM *data);
> void ff_vp3_idct_neon(DCTELEM *data);
> void ff_vp3_idct_put_neon(uint8_t *dest, int line_size, DCTELEM *data);
> void ff_vp3_idct_add_neon(uint8_t *dest, int line_size, DCTELEM *data);
> +void ff_vp3_idct_dc_add_neon(uint8_t *dest, int line_size, DCTELEM *data);
>
> void ff_put_pixels16_neon(uint8_t *, const uint8_t *, int, int);
> void ff_put_pixels16_x2_neon(uint8_t *, const uint8_t *, int, int);
> @@ -386,6 +387,7 @@ void ff_dsputil_init_neon(DSPContext *c, AVCodecContext *avctx)
> if (CONFIG_VP3_DECODER) {
> c->vp3_v_loop_filter = ff_vp3_v_loop_filter_neon;
> c->vp3_h_loop_filter = ff_vp3_h_loop_filter_neon;
> + c->vp3_idct_dc_add = ff_vp3_idct_dc_add_neon;
> }
>
> c->vector_fmul = ff_vector_fmul_neon;
> diff --git a/libavcodec/arm/vp3dsp_neon.S b/libavcodec/arm/vp3dsp_neon.S
> index 6deae47..ade1998 100644
> --- a/libavcodec/arm/vp3dsp_neon.S
> +++ b/libavcodec/arm/vp3dsp_neon.S
> @@ -374,3 +374,47 @@ function ff_vp3_idct_add_neon, export=1
> vst1.64 {d7}, [r2,:64], r1
> bx lr
> endfunc
> +
> +function ff_vp3_idct_dc_add_neon, export=1
> + ldrsh r2, [r2]
> + movw r3, #46341
> + mul r2, r3, r2
> + smulwt r2, r3, r2
> + mov r3, r0
> + vdup.16 q15, r2
> + vrshr.s16 q15, q15, #4
> +
> + vld1.8 {d0}, [r0,:64], r1
> + vld1.8 {d1}, [r0,:64], r1
> + vld1.8 {d2}, [r0,:64], r1
> + vaddw.u8 q8, q15, d0
> + vld1.8 {d3}, [r0,:64], r1
> + vaddw.u8 q9, q15, d1
> + vld1.8 {d4}, [r0,:64], r1
> + vaddw.u8 q10, q15, d2
> + vld1.8 {d5}, [r0,:64], r1
> + vaddw.u8 q11, q15, d3
> + vld1.8 {d6}, [r0,:64], r1
> + vaddw.u8 q12, q15, d4
> + vld1.8 {d7}, [r0,:64], r1
> + vaddw.u8 q13, q15, d5
> + vqmovun.s16 d0, q8
> + vaddw.u8 q14, q15, d6
> + vqmovun.s16 d1, q9
> + vaddw.u8 q15, q15, d7
> + vqmovun.s16 d2, q10
> + vst1.8 {d0}, [r3,:64], r1
> + vqmovun.s16 d3, q11
> + vst1.8 {d1}, [r3,:64], r1
> + vqmovun.s16 d4, q12
> + vst1.8 {d2}, [r3,:64], r1
> + vqmovun.s16 d5, q13
> + vst1.8 {d3}, [r3,:64], r1
> + vqmovun.s16 d6, q14
> + vst1.8 {d4}, [r3,:64], r1
> + vqmovun.s16 d7, q15
> + vst1.8 {d5}, [r3,:64], r1
> + vst1.8 {d6}, [r3,:64], r1
> + vst1.8 {d7}, [r3,:64], r1
> + bx lr
> +endfunc
Looks good, assuming it works.
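
For readers following the arithmetic, here is a scalar C sketch of what the NEON routine in the patch appears to compute: scale the DC coefficient twice by 46341/65536 (about 1/sqrt(2)), apply a rounding shift by 4, then add the result to every pixel of the 8x8 block with saturation. The function names are hypothetical and int16_t stands in for DCTELEM; this is a reading of the assembly, not the C fallback from the patch.

```c
#include <stdint.h>

/* Clamp to 0..255, mirroring the saturating narrow done by vqmovun.s16. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference; name and signature are illustrative.
 * Assumes >> on a negative int is an arithmetic shift, which holds on
 * common compilers but is implementation-defined in C. */
static void vp3_idct_dc_add_ref(uint8_t *dest, int line_size,
                                const int16_t *block)
{
    int dc = block[0];
    int x, y;

    dc = (dc * 46341) >> 16;   /* mul, then smulwt reads the top half */
    dc = (dc * 46341) >> 16;   /* smulwt: 32x16 multiply, keep high bits */
    dc = (dc + 8) >> 4;        /* vrshr.s16 #4: rounding shift right */

    for (y = 0; y < 8; y++) {
        for (x = 0; x < 8; x++)
            dest[x] = clip_u8(dest[x] + dc);  /* vaddw.u8 + vqmovun.s16 */
        dest += line_size;
    }
}
```

With block[0] = 64 the scaled DC comes out as 2, so a block of pixels at 100 becomes 102.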
--
Måns Rullgård
mans at mansr.com