[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder
Sebastian Vater
cdgs.basty
Thu Apr 29 15:50:54 CEST 2010
M?ns Rullg?rd a ?crit :
> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>
>
>> M?ns Rullg?rd a ?crit :
>>
>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>
>>>
>>>
>>>> M?ns Rullg?rd a ?crit :
>>>>
>>>>
>>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Just got the idea, we can get rid of the GetBitContext
>>>>>> completely...Instead of reading 4 bits, we simply read a byte:
>>>>>> const uint8_t lut_offsets = *buf++; // instead of get_bits(gb,4);
>>>>>>
>>>>>>
>>>>> That's a separate thing.
>>>>>
>>>>>
>>>> Separate in what way? What did you mean exactly?
>>>>
>>>>
>>> Separate from the LUT byte order.
>>>
>>>
>>>
>>>>>> Then we do loop unrolling by 8 and do two accesses to lut one with >> 4
>>>>>> and one with & 0x0F, or we get even rid of this and create a lut table
>>>>>> with 256 entries using AV_WN64A / AV_RN64A ;-)
>>>>>>
>>>>>> The advance here is that on a 64 bit CPU we get another nice speed
>>>>>> improvement ;-)
>>>>>> If we avoid calculations with AV_RN64A etc.
>>>>>>
>>>>>>
>>>>>>
>>>>> Those macros don't do any calculations. All they do is some magic to
>>>>> avoid type aliasing errors.
>>>>>
>>>>>
>>>> Yes, I know, but I meant stuff like (lut0[...] << 32ULL) | lut1[...];
>>>>
>>>>
>>> Why on earth would you do that?
>>>
>>>
>>>
>>>> But this isn't necessary if we use an 8-bit table storing uint64_t's...
>>>>
>>>>
>>> That would fall apart completely on 32-bit machines. I doubt any
>>> speedup you might see on 64-bit is worth the added complexity of
>>> doing it conditionally. Just leave it as 32-bit.
>>>
>>>
>>>
>>>>>> gcc just should use 2 registers on 32-bit CPU and that's it.
>>>>>>
>>>>>>
>>>>> Should, but doesn't.
>>>>>
>>>>>
>>>> With the way I meant above, it should...I'll test that now, but without
>>>> a completed table and tell you what it does.
>>>>
>>>>
>>> Believe me, it doesn't. GCC is terrible with 64-bit data on 32-bit
>>> machines. Do not tempt it.
>>>
>>>
>> I just tried, but unfortunately, I messed my last mail with the
>> benchmarks completely up, so here again in a clean manner:
>>
>> 32-bit 4 bits per loop:
>>
>
> Still using get_bits?
>
>
>> 64-bit with got rid of GetBitContext, containing the following code:
>> static void decodeplane8(uint8_t *dst,
>> const uint8_t *buf,
>> const unsigned buf_size,
>> const unsigned bps,
>> const unsigned plane)
>> {
>> START_TIMER;
>> const uint8_t *end = dst + (buf_size * 8);
>> const uint64_t *lut = plane8_lut[plane];
>> for(; dst < end; dst += 8) {
>> const uint64_t v = AV_RN64A(dst) | lut[*buf++];
>> AV_WN64A(dst, v);
>> }
>> STOP_TIMER("decodeplane8");
>> }
>>
>> Which outputs the following diassembly (from START/STOP_TIMER):
>> 544: 0f 31 rdtsc
>> 546: 89 54 24 60 mov %edx,0x60(%esp)
>> 54a: 8b 5c 24 60 mov 0x60(%esp),%ebx
>> 54e: 31 d2 xor %edx,%edx
>> 550: c7 44 24 64 00 00 00 movl $0x0,0x64(%esp)
>> 557: 00
>> 558: 8b 74 24 64 mov 0x64(%esp),%esi
>> 55c: 89 de mov %ebx,%esi
>> 55e: bb 00 00 00 00 mov $0x0,%ebx
>> 563: 89 5c 24 60 mov %ebx,0x60(%esp)
>> 567: 01 44 24 60 add %eax,0x60(%esp)
>> 56b: 8b 44 24 44 mov 0x44(%esp),%eax
>> 56f: 89 74 24 64 mov %esi,0x64(%esp)
>> 573: 11 54 24 64 adc %edx,0x64(%esp)
>> 577: 2b 84 24 9c 00 00 00 sub 0x9c(%esp),%eax
>> 57e: 39 c8 cmp %ecx,%eax
>> 580: 76 02 jbe 584 <decode_frame_ilbm+0x2d4>
>> 582: 89 c8 mov %ecx,%eax
>> 584: 8b 74 24 50 mov 0x50(%esp),%esi
>> 588: 8d 04 c6 lea (%esi,%eax,8),%eax
>> 58b: 89 44 24 5c mov %eax,0x5c(%esp)
>> 58f: 8b 44 24 4c mov 0x4c(%esp),%eax
>> 593: c1 e0 07 shl $0x7,%eax
>> 596: 05 00 00 00 00 add $0x0,%eax
>> 59b: 89 44 24 58 mov %eax,0x58(%esp)
>> 59f: 8b 44 24 5c mov 0x5c(%esp),%eax
>> 5a3: 39 c6 cmp %eax,%esi
>> 5a5: 73 30 jae 5d7 <decode_frame_ilbm+0x327>
>> 5a7: 8b ac 24 9c 00 00 00 mov 0x9c(%esp),%ebp
>> 5ae: 89 f7 mov %esi,%edi
>> 5b0: 0f b6 75 00 movzbl 0x0(%ebp),%esi
>> 5b4: 83 c5 01 add $0x1,%ebp
>> 5b7: 8b 4c 24 58 mov 0x58(%esp),%ecx
>> 5bb: 8b 5f 04 mov 0x4(%edi),%ebx
>> 5be: 8b 07 mov (%edi),%eax
>> 5c0: 8b 54 f1 04 mov 0x4(%ecx,%esi,8),%edx
>> 5c4: 0b 04 f1 or (%ecx,%esi,8),%eax
>> 5c7: 09 da or %ebx,%edx
>> 5c9: 89 07 mov %eax,(%edi)
>> 5cb: 89 57 04 mov %edx,0x4(%edi)
>> 5ce: 83 c7 08 add $0x8,%edi
>> 5d1: 39 7c 24 5c cmp %edi,0x5c(%esp)
>> 5d5: 77 d9 ja 5b0 <decode_frame_ilbm+0x300>
>> 5d7: 0f 31 rdtsc
>>
>
> That looks _VERY_ inefficient.
>
No, there's some part of START_TIMER:
The for (; dst < end; dst++) is just:
5b0: 0f b6 75 00 movzbl 0x0(%ebp),%esi
5b4: 83 c5 01 add $0x1,%ebp
5b7: 8b 4c 24 58 mov 0x58(%esp),%ecx
5bb: 8b 5f 04 mov 0x4(%edi),%ebx
5be: 8b 07 mov (%edi),%eax
5c0: 8b 54 f1 04 mov 0x4(%ecx,%esi,8),%edx
5c4: 0b 04 f1 or (%ecx,%esi,8),%eax
5c7: 09 da or %ebx,%edx
5c9: 89 07 mov %eax,(%edi)
5cb: 89 57 04 mov %edx,0x4(%edi)
5ce: 83 c7 08 add $0x8,%edi
5d1: 39 7c 24 5c cmp %edi,0x5c(%esp)
5d5: 77 d9 ja 5b0 <decode_frame_ilbm+0x300>
That looks very well.
--
Best regards,
:-) Basty/CDGS (-:
More information about the ffmpeg-devel
mailing list