[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
Martin Storsjö
martin at martin.st
Sun Aug 18 23:43:23 EEST 2024
On Sun, 18 Aug 2024, Ramiro Polla wrote:
> A53 A76
> pix_norm1_c: 519.2 231.5
> pix_norm1_neon: 195.0 ( 2.66x) 44.2 ( 5.24x)
> pix_sum_c: 344.5 242.2
> pix_sum_neon: 119.0 ( 2.89x) 41.7 ( 5.81x)
> ---
Hmm, those speedups on the A53 look quite small. I guess that's because
this isn't unrolled at all, as you mention. Especially for A53, I would
expect unrolling to have a very large effect here. But it sounds weird if
you say perf indicates that it is slower in real world use. Yes, unrolling
does make the code use more space and makes the I-cache less efficient,
but in this case it would only be a difference of like 2 instructions?
> diff --git a/libavcodec/aarch64/mpegvideoencdsp_neon.S b/libavcodec/aarch64/mpegvideoencdsp_neon.S
> new file mode 100644
> index 0000000000..89e50e29b3
> --- /dev/null
> +++ b/libavcodec/aarch64/mpegvideoencdsp_neon.S
> @@ -0,0 +1,67 @@
> +/*
> + * Copyright (c) 2024 Ramiro Polla
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +function ff_pix_sum16_neon, export=1
> +// x0 const uint8_t *pix
> +// x1 int line_size
> +
> + sxtw x1, w1
> + movi v0.16b, #0
> + mov w2, #16
> +
> +1:
> + ld1 { v1.16b }, [x0], x1
Nit; we usually don't have these {} written with spaces inside of the
braces, same below.
> + subs w2, w2, #1
> + uadalp v0.8h, v1.16b
> + b.ne 1b
> +
> + uaddlp v0.4s, v0.8h
> + uaddlv d0, v0.4s
Couldn't this be aggregated with just one instruction, "uaddlv s0, v0.8h"?
There's no need to widen it to 64 bit as we're truncating the returned
value to 32 bit anyway.
// Martin
More information about the ffmpeg-devel
mailing list