[FFmpeg-devel] [PATCH] swresample/arm: add ff_resample_common_apply_filter_{x4, x8}_{float, s16}_neon
Matthieu Bouron
matthieu.bouron at gmail.com
Thu May 12 15:22:04 CEST 2016
On Thu, May 12, 2016 at 10:01 AM, Benoit Fouet <benoit.fouet at free.fr> wrote:
> Hi,
>
> I mostly have nits remarks.
>
> On 11/05/2016 18:39, Matthieu Bouron wrote:
>
>> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>>
>>
> [...]
>
> diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S
>> new file mode 100644
>> index 0000000..13462e3
>> --- /dev/null
>> +++ b/libswresample/arm/resample.S
>> @@ -0,0 +1,77 @@
>>
>> [...]
>>
>> +function ff_resample_common_apply_filter_x4_float_neon, export=1
>> + vmov.f32 q0, #0.0 @
>> accumulator
>> +1: vld1.32 {q1}, [r1]! @
>> src
>> + vld1.32 {q2}, [r2]! @
>> filter
>> + vmla.f32 q0, q1, q2 @
>> src + {0..3} * filter + {0..3}
>>
>
> nit: the comment could be "accu += src[0..3] . filter[0..3]"
> same for the other ones below
>
> [...]
>
> + subs r3, #4 @
>> filter_length -= 4
>> + bgt 1b @
>> loop until filter_length
>> + vpadd.f32 d0, d0, d1 @
>> pair adding of the 4x32-bit accumulated values
>> + vpadd.f32 d0, d0, d0 @
>> pair adding of the 4x32-bit accumulator values
>> + vst1.32 {d0[0]}, [r0] @
>> write accumulator
>> + mov pc, lr
>> +endfunc
>> +
>> +function ff_resample_common_apply_filter_x8_float_neon, export=1
>> + vmov.f32 q0, #0.0 @
>> accumulator
>> +1: vld1.32 {q1}, [r1]! @
>> src1
>> + vld1.32 {q2}, [r2]! @
>> filter1
>> + vld1.32 {q8}, [r1]! @
>> src2
>> + vld1.32 {q9}, [r2]! @
>> filter2
>> + vmla.f32 q0, q1, q2 @
>> src1 + {0..3} * filter1 + {0..3}
>> + vmla.f32 q0, q8, q9 @
>> src2 + {0..3} * filter2 + {0..3}
>>
>
> instead of using src1 and src2, you may want to use src[0..3] and src[4..7]
> so, if I reuse the formulation I proposed above:
> accu += src[0..3] . filter[0..3]
> accu += src[4..7] . filter[4..7]
>
Fixed locally (as well as the other case you mentionned) with:
- vmla.f32 q0, q1, q2 @
src1 + {0..3} * filter1 + {0..3}
- vmla.f32 q0, q8, q9 @
src2 + {0..3} * filter2 + {0..3}
+ vmla.f32 q0, q1, q2 @
accumulator += src1 + {0..3} * filter1 + {0..3}
+ vmla.f32 q0, q8, q9 @
accumulator += src2 + {4..7} * filter2 + {4..7}
I prefer to use + {0..3} instead of [0..3] to make the comments consistent
with what has been done in swscale/arm.
>
> + subs r3, #8 @
>> filter_length -= 4
>>
>
> -= 8
>
Fixed locally.
>
> [...]
>
> diff --git a/libswresample/arm/resample_init.c
>> b/libswresample/arm/resample_init.c
>> new file mode 100644
>> index 0000000..c817d03
>> --- /dev/null
>> +++ b/libswresample/arm/resample_init.c
>>
>> [...]
>>
>> +static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void
>> *dest, const void *source, \
>> + int n, int update_ctx)
>> \
>> +{
>> \
>> + DELEM *dst = dest;
>> \
>> + const DELEM *src = source;
>> \
>> + int dst_index;
>> \
>> + int index= c->index;
>> \
>> + int frac= c->frac;
>> \
>> + int sample_index = index >> c->phase_shift;
>> \
>> + int x4_aligned_filter_length = c->filter_length & ~3;
>> \
>> + int x8_aligned_filter_length = c->filter_length & ~7;
>> \
>> +
>> \
>> + index &= c->phase_mask;
>> \
>> + for (dst_index = 0; dst_index < n; dst_index++) {
>> \
>> + FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc *
>> index; \
>> +
>> \
>> + FELEM2 val=0;
>> \
>> + int i = 0;
>> \
>> + if (x8_aligned_filter_length >= 8) {
>> \
>> + ff_resample_common_apply_filter_x8_##TYPE##_neon(&val,
>> &src[sample_index], \
>> + filter,
>> x8_aligned_filter_length); \
>> + i += x8_aligned_filter_length;
>> \
>> +
>> \
>> + } else if (x4_aligned_filter_length >= 4) {
>> \
>>
>
> do you think there could be a gain processing the remainder of the
> 8-aligned part through the 4-aligned part of the code? e.g. for a filter
> length of 15, that would make:
> - one run of the 8-aligned
> - one run of the 4-aligned
> - 3 C loops
> As you stated filter length seems to commonly be 32, I guess that wouldn't
> be easy to benchmark, though.
>
I'll see if I could trigger a case where the filter_length is 15 and come
up with a benchmark. If there is a performance gain, would you be ok if it
is implemented as a separate patch ?
If there is no objection, i'll push the updated patch in one day.
Thanks for the review,
Matthieu
[...]
More information about the ffmpeg-devel
mailing list