[FFmpeg-devel] [PATCH] Heavy optimization of IFF decoder

Tue Apr 27 20:57:23 CEST 2010

Sebastian Vater <cdgs.basty at googlemail.com> writes:

> Ronald S. Bultje a ?crit :
>> Hi,
>>
>> 2010/4/27 M?ns Rullg?rd <mans at mansr.com>:
>>   
>>> "Ronald S. Bultje" <rsbultje at gmail.com> writes:
>>>     
>>>> On Mon, Apr 26, 2010 at 7:39 PM, Sebastian Vater
>>>> <cdgs.basty at googlemail.com> wrote:
>>>>       
>>>>> +    const uint32_t lut[] = {0x0000000,
>>>>> +                            0x1000000 << plane,
>>>>> +                            0x0010000 << plane,
>>>>> +                            0x1010000 << plane,
>>>>> +                            0x0000100 << plane,
>>>>> +                            0x1000100 << plane,
>>>>> +                            0x0010100 << plane,
>>>>> +                            0x1010100 << plane,
>>>>> +                            0x0000001 << plane,
>>>>> +                            0x1000001 << plane,
>>>>> +                            0x0010001 << plane,
>>>>> +                            0x1010001 << plane,
>>>>> +                            0x0000101 << plane,
>>>>> +                            0x1000101 << plane,
>>>>> +                            0x0010101 << plane,
>>>>> +                            0x1010101 << plane};
>>>>>         
>>>> I really can't imagine that a static const lut[][] isn't faster. which
>>>> file did you use to test this? (Is it on mphq/samples?)
>>>>       
>>> A static table whose values are shifted in the loop is 7% faster on ARM.
>>>     
>>
>> It's a little slower on x86 (~12%). However, a (static) 2D array is
>> faster (3%) over the original patch. Mans just said that's fine on ARM
>> as well, so you should probably implement that (don't forget that
>> plane is const, so do a const *lut = table[plane] before entering the
>> loop, else gcc messes up).
>>   
>
> Please clarify on this, what kind of static 2D array should I use now?
> Just apply this one? Well, this one is 20% slower (increases from 6.2k
> to 8.9k) for me than my original.
>
> -- 
>
> Best regards,
>                    :-) Basty/CDGS (-:
>
> diff --git a/libavcodec/iff.c b/libavcodec/iff.c
> index be13f4e..ca8a790 100644
> --- a/libavcodec/iff.c
> +++ b/libavcodec/iff.c
> @@ -36,6 +36,33 @@ typedef struct {
>      uint8_t * planebuf;
>  } IffContext;
>  
> +#define DECODEPLANE8(plane) {0x0000000, \
> +                             0x1000000 << plane, \
> +                             0x0010000 << plane, \
> +                             0x1010000 << plane, \
> +                             0x0000100 << plane, \
> +                             0x1000100 << plane, \
> +                             0x0010100 << plane, \
> +                             0x1010100 << plane, \
> +                             0x0000001 << plane, \
> +                             0x1000001 << plane, \
> +                             0x0010001 << plane, \
> +                             0x1010001 << plane, \
> +                             0x0000101 << plane, \
> +                             0x1000101 << plane, \
> +                             0x0010101 << plane, \
> +                             0x1010101 << plane} \
> +
> +// 8 planes * 4-bit mask
> +static const uint32_t decodeplane8_tab[8][16] = {DECODEPLANE8(0),
> +                                                 DECODEPLANE8(1),
> +                                                 DECODEPLANE8(2),
> +                                                 DECODEPLANE8(3),
> +                                                 DECODEPLANE8(4),
> +                                                 DECODEPLANE8(5),
> +                                                 DECODEPLANE8(6),
> +                                                 DECODEPLANE8(7)};
> +

Please use a bit more sane formatting, e.g. something like this:

+#define LUT8(plane) {       \
+        0x0000000 << plane, \
+        0x1000000 << plane, \
+        0x0010000 << plane, \
+        0x1010000 << plane, \
+        0x0000100 << plane, \
+        0x1000100 << plane, \
+        0x0010100 << plane, \
+        0x1010100 << plane, \
+        0x0000001 << plane, \
+        0x1000001 << plane, \
+        0x0010001 << plane, \
+        0x1010001 << plane, \
+        0x0000101 << plane, \
+        0x1000101 << plane, \
+        0x0010101 << plane, \
+        0x1010101 << plane, \
+    }
+
+static const uint32_t plane8_lut[8][16] = {
+    LUT8(0), LUT8(1), LUT8(2), LUT8(3),
+    LUT8(4), LUT8(5), LUT8(6), LUT8(7),
+};

>  /**
>   * Convert CMAP buffer (stored in extradata) to lavc palette format
>   */
> @@ -95,13 +122,22 @@ static av_cold int decode_init(AVCodecContext *avctx)
>   * @param bps bits_per_coded_sample (must be <= 8)
>   * @param plane plane number to decode as
>   */
> -static void decodeplane8(uint8_t *dst, const uint8_t *const buf, int buf_size, int bps, int plane)
> +static inline void decodeplane8(uint8_t *dst,
> +                                const uint8_t *const buf,
> +                                const unsigned buf_size,
> +                                const unsigned bps,
> +                                const unsigned plane)

Const on simple variables is mostly pointless.

>  {
>      GetBitContext gb;
>      unsigned int i;
>      const unsigned b = (buf_size * 8) + bps - 1;
> +    const unsigned b32 = b & ~3;
>      init_get_bits(&gb, buf, buf_size * 8);
> -    for(i = 0; i < b; i++) {
> +    for(i = 0; i < b32; i += 4) {
> +        const uint32_t v = decodeplane8_tab[plane][get_bits(&gb, 4)];
> +        AV_WN32A(dst+i, AV_RN32A(dst+i) | v);
> +    }

I suggest using a local variable here, like this:

    const uint32_t *lut = decodeplane8_tab[plane];
    [...]
    uint32_t v = lut[get_bits(...)];

I don't trust gcc to do that on its own.

-- 
M?ns Rullg?rd
mans at mansr.com