[FFmpeg-devel] [PATCH] iirfilter: Rework FILTER_O2() to improve performance.

Justin Ruggles justin.ruggles
Mon Jan 31 23:16:16 CET 2011


On 01/31/2011 05:03 PM, M?ns Rullg?rd wrote:

> Justin Ruggles <justin.ruggles at gmail.com> writes:
> 
>> On 01/31/2011 04:19 PM, M?ns Rullg?rd wrote:
>>
>>> Justin Ruggles <justin.ruggles at gmail.com> writes:
>>>
>>>> Store coefficient pointers locally.
>>>> Store state locally, then save at end.
>>>> Split up arithmetic.
>>>> ---
>>>>  libavcodec/iirfilter.c |   27 +++++++++++++++++++--------
>>>>  1 files changed, 19 insertions(+), 8 deletions(-)
>>>>
>>>> I was able to gcc to do what I wanted, but it's not pretty.
>>>>
>>>> Also, if a simplified version of ff_iir_filter_flt() is added that
>>>> doesn't read/save the state and always has sstep/dstep == 1, it can
>>>> gain about another 3% speedup.  I'll save that for later though
>>>> since it would be unnecessarily complicating something that's not
>>>> even used yet.
>>>>
>>>> Athlon64
>>>> current: 217731
>>>> patched: 195939
>>>> simple:  189951
>>>>
>>>> Atom
>>>> current: 569708
>>>> patched: 542916
>>>> simple:  524855
>>>>
>>>> Cheers,
>>>> Justin
>>>>
>>>> diff --git a/libavcodec/iirfilter.c b/libavcodec/iirfilter.c
>>>> index bc63c39..8398e22 100644
>>>> --- a/libavcodec/iirfilter.c
>>>> +++ b/libavcodec/iirfilter.c
>>>> @@ -257,19 +257,30 @@ av_cold struct FFIIRFilterState* ff_iir_filter_init_state(int order)
>>>>  }
>>>>  
>>>>  #define FILTER_O2(type, fmt) {                                          \
>>>> -    int i;                                                              \
>>>>      const type *src0 = src;                                             \
>>>>      type       *dst0 = dst;                                             \
>>>> -    for (i = 0; i < size; i++) {                                        \
>>>> -        float in = *src0   * c->gain  +                                 \
>>>> -                   s->x[0] * c->cy[0] +                                 \
>>>> -                   s->x[1] * c->cy[1];                                  \
>>>> -        CONV_##fmt(*dst0, s->x[0] + in + s->x[1] * c->cx[1])            \
>>>> -        s->x[0] = s->x[1];                                              \
>>>> -        s->x[1] = in;                                                   \
>>>> +    const float *g = &c->gain;                                          \
>>>
>>> Why is this a pointer?
>>>
>>>> +    const int   *x = c->cx;                                             \
>>>> +    const float *y = c->cy;                                             \
>>>> +    float s0 = s->x[0];                                                 \
>>>> +    float s1 = s->x[1];                                                 \
>>>> +    while (size--) {                                                    \
>>>> +        float in  = *src0 * *g;                                         \
>>>> +        float tmp = s0 * y[0];                                          \
>>>> +        in += tmp;                                                      \
>>>> +        tmp = s1 * y[1];                                                \
>>>> +        in += tmp;                                                      \
>>>> +        tmp = s1 * x[1];                                                \
>>>> +        s0 += tmp;                                                      \
>>>> +        s0 += in;                                                       \
>>>
>>> Does using the 'tmp' variable really make it faster?  I have seen such
>>> things with gcc before, but I have to ask.
>>>
>>>> +        CONV_##fmt(*dst0, s0)                                           \
>>>> +        s0 = s1;                                                        \
>>>> +        s1 = in;                                                        \
>>>>          src0 += sstep;                                                  \
>>>>          dst0 += dstep;                                                  \
>>>>      }                                                                   \
>>>> +    s->x[0] = s0;                                                       \
>>>> +    s->x[1] = s1;                                                       \
>>>>  }
>>>>  
>>>>  void ff_iir_filter(const struct FFIIRFilterCoeffs *c,
>>>
>>
>> Everything that seems weird about the function is only because the
>> obvious alternative is slower on my Athlon64.
>>
>> I basically started by disassembling the gcc output and converting it to
>> yasm.  Then I tweaked and tweaked and tried every possible combination I
>> could think of.  Some really random stuff was significantly slower.
>> Once I got it to the fastest I could get it, I tried to get gcc to match
>> the asm version as closely as possible.  Merging any of those tmp values
>> makes it slower.  Changing any of the local pointers to values makes it
>> slower.
> 
> I have a bad feeling about this.  Crazy effects like you describe here
> tend to be very volatile, varying wildly between architectures and
> compiler versions.  If there's a big improvement to be had, it is
> better to write a proper asm function and leave the C code with
> something sensible-looking.  Loading some values into local variables
> can be expected to help due to potential aliasing being avoided.
> Anything beyond that is entirely random, and might end up hurting one
> CPU as much as it helps another.


Ok. I'll just hold on to asm for safekeeping and bring it up again when
the function is actually used.

-Justin



More information about the ffmpeg-devel mailing list