[FFmpeg-devel] [PATCH] iirfilter: Rework FILTER_O2() to improve performance.
Justin Ruggles
justin.ruggles
Mon Jan 31 23:16:16 CET 2011
On 01/31/2011 05:03 PM, M?ns Rullg?rd wrote:
> Justin Ruggles <justin.ruggles at gmail.com> writes:
>
>> On 01/31/2011 04:19 PM, M?ns Rullg?rd wrote:
>>
>>> Justin Ruggles <justin.ruggles at gmail.com> writes:
>>>
>>>> Store coefficient pointers locally.
>>>> Store state locally, then save at end.
>>>> Split up arithmetic.
>>>> ---
>>>> libavcodec/iirfilter.c | 27 +++++++++++++++++++--------
>>>> 1 files changed, 19 insertions(+), 8 deletions(-)
>>>>
>>>> I was able to gcc to do what I wanted, but it's not pretty.
>>>>
>>>> Also, if a simplified version of ff_iir_filter_flt() is added that
>>>> doesn't read/save the state and always has sstep/dstep == 1, it can
>>>> gain about another 3% speedup. I'll save that for later though
>>>> since it would be unnecessarily complicating something that's not
>>>> even used yet.
>>>>
>>>> Athlon64
>>>> current: 217731
>>>> patched: 195939
>>>> simple: 189951
>>>>
>>>> Atom
>>>> current: 569708
>>>> patched: 542916
>>>> simple: 524855
>>>>
>>>> Cheers,
>>>> Justin
>>>>
>>>> diff --git a/libavcodec/iirfilter.c b/libavcodec/iirfilter.c
>>>> index bc63c39..8398e22 100644
>>>> --- a/libavcodec/iirfilter.c
>>>> +++ b/libavcodec/iirfilter.c
>>>> @@ -257,19 +257,30 @@ av_cold struct FFIIRFilterState* ff_iir_filter_init_state(int order)
>>>> }
>>>>
>>>> #define FILTER_O2(type, fmt) { \
>>>> - int i; \
>>>> const type *src0 = src; \
>>>> type *dst0 = dst; \
>>>> - for (i = 0; i < size; i++) { \
>>>> - float in = *src0 * c->gain + \
>>>> - s->x[0] * c->cy[0] + \
>>>> - s->x[1] * c->cy[1]; \
>>>> - CONV_##fmt(*dst0, s->x[0] + in + s->x[1] * c->cx[1]) \
>>>> - s->x[0] = s->x[1]; \
>>>> - s->x[1] = in; \
>>>> + const float *g = &c->gain; \
>>>
>>> Why is this a pointer?
>>>
>>>> + const int *x = c->cx; \
>>>> + const float *y = c->cy; \
>>>> + float s0 = s->x[0]; \
>>>> + float s1 = s->x[1]; \
>>>> + while (size--) { \
>>>> + float in = *src0 * *g; \
>>>> + float tmp = s0 * y[0]; \
>>>> + in += tmp; \
>>>> + tmp = s1 * y[1]; \
>>>> + in += tmp; \
>>>> + tmp = s1 * x[1]; \
>>>> + s0 += tmp; \
>>>> + s0 += in; \
>>>
>>> Does using the 'tmp' variable really make it faster? I have seen such
>>> things with gcc before, but I have to ask.
>>>
>>>> + CONV_##fmt(*dst0, s0) \
>>>> + s0 = s1; \
>>>> + s1 = in; \
>>>> src0 += sstep; \
>>>> dst0 += dstep; \
>>>> } \
>>>> + s->x[0] = s0; \
>>>> + s->x[1] = s1; \
>>>> }
>>>>
>>>> void ff_iir_filter(const struct FFIIRFilterCoeffs *c,
>>>
>>
>> Everything that seems weird about the function is only because the
>> obvious alternative is slower on my Athlon64.
>>
>> I basically started by disassembling the gcc output and converting it to
>> yasm. Then I tweaked and tweaked and tried every possible combination I
>> could think of. Some really random stuff was significantly slower.
>> Once I got it to the fastest I could get it, I tried to get gcc to match
>> the asm version as closely as possible. Merging any of those tmp values
>> makes it slower. Changing any of the local pointers to values makes it
>> slower.
>
> I have a bad feeling about this. Crazy effects like you describe here
> tend to be very volatile, varying wildly between architectures and
> compiler versions. If there's a big improvement to be had, it is
> better to write a proper asm function and leave the C code with
> something sensible-looking. Loading some values into local variables
> can be expected to help due to potential aliasing being avoided.
> Anything beyond that is entirely random, and might end up hurting one
> CPU as much as it helps another.
Ok. I'll just hold on to asm for safekeeping and bring it up again when
the function is actually used.
-Justin
More information about the ffmpeg-devel
mailing list