[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization
Måns Rullgård
mans
Wed Jul 2 12:28:35 CEST 2008
Siarhei Siamashka wrote:
> On Wed, Jul 2, 2008 at 1:00 AM, Måns Rullgård <mans at mansr.com> wrote:
>> "Siarhei Siamashka" <siarhei.siamashka at gmail.com> writes:
> [...]
>>> Pre-decrement is typically preferred in code optimized for
>>> performance, as it is generally faster. Something like this would be
>>> better (it is also closer to the old code):
>>> while( --coeff_count >= 0 ) {
>>> ...
>>> }
>>>
>>> You can try to compile this sample with the best possible
>>> optimizations, look at the assembly output and check where the
>>> generated code is better and why:
>>>
>>> /**********************/
>>> int q();
>>>
>>> void f1(int n)
>>> {
>>> while (--n >= 0) {
>>> q();
>>> }
>>> }
>>>
>>> void f2(int n)
>>> {
>>> while (n--) {
>>> q();
>>> }
>>> }
>>> /**********************/
>>
>> Any half-decent compiler should generate the same code for those two
>> functions.
>
> That's not true, just because these two functions are not identical.
> Hint: what happens if you pass -1 or any other negative value to these
> functions?
Right... I somehow read the second one as while (n-- > 0). If you
want to compare post- vs. pre-decrement, that is also what you should
be compiling, as otherwise you'll be comparing the speed of doing
different things.
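
To make the point about comparing different things concrete, here is a minimal sketch that counts loop iterations instead of calling q(). The function names count_pre, count_post_gt and count_post are hypothetical and do not appear in the thread or in FFmpeg. It assumes n is never INT_MIN, and count_post is only meant to be called with n >= 0, since a negative n would keep decrementing until signed overflow.

```c
#include <assert.h>

/* Same loop condition as f1: decrement first, then test the sign. */
int count_pre(int n)
{
    int calls = 0;
    while (--n >= 0)
        calls++;
    return calls;
}

/* Post-decrement with an explicit "> 0" test: the fair counterpart of f1.
 * It runs the body exactly as often as count_pre for every n. */
int count_post_gt(int n)
{
    int calls = 0;
    while (n-- > 0)
        calls++;
    return calls;
}

/* Same loop condition as f2: stops only when the old value is exactly 0.
 * Assumption: callers pass n >= 0; a negative n never reaches 0 by
 * counting down without overflowing. */
int count_post(int n)
{
    int calls = 0;
    while (n--)
        calls++;
    return calls;
}
```

For non-negative n all three run the body n times. For negative n, count_pre and count_post_gt run it zero times, which is where the f2-style loop behaves differently.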
>> GCC for ARM generates a slightly different, but equivalent, setup sequence,
>> and the loops are exactly the same.
>
> In my case, gcc 3.4.4 (using '-march=armv6 -O3 -c' options) generated
> the following assembly output, which is definitely better for 'f1' (3
> instructions in the inner loop instead of 4):
>
> 00000000 <f1>:
> 0: e92d4010 stmdb sp!, {r4, lr}
> 4: e2504001 subs r4, r0, #1 ; 0x1
> 8: 48bd8010 ldmmiia sp!, {r4, pc}
> c: ebfffffe bl 0 <q>
> 10: e2544001 subs r4, r4, #1 ; 0x1
> 14: 5afffffc bpl c <f1+0xc>
> 18: e8bd8010 ldmia sp!, {r4, pc}
>
> 0000001c <f2>:
> 1c: e92d4010 stmdb sp!, {r4, lr}
> 20: e2504001 subs r4, r0, #1 ; 0x1
> 24: 38bd8010 ldmccia sp!, {r4, pc}
> 28: e2444001 sub r4, r4, #1 ; 0x1
> 2c: ebfffffe bl 0 <q>
> 30: e3740001 cmn r4, #1 ; 0x1
> 34: 1afffffb bne 28 <q+0x28>
> 38: e8bd8010 ldmia sp!, {r4, pc}
>
> I'm curious, what is the output of your compiler?
I was using CodeSourcery GCC 4.1.2 (the only compiler that works with
NEON) and -O3 -mcpu=cortex-a8. I'm at work now, so I can't post
the exact output, but the loop bodies were identical in both cases;
only the prologue was different, since (as you pointed out) negative
initial values have different effects.
Since I'm at work, I can try it with the commercial ARM compiler
(only an old version, unfortunately):
00000000 <f1>:
0: e92d4010 stmdb sp!, {r4, lr}
4: e1a04000 mov r4, r0
8: ea000000 b 10 <f1+0x10>
c: ebfffffe bl 0 <q>
10: e2544001 subs r4, r4, #1 ; 0x1
14: 5afffffc bpl c <f1+0xc>
18: e8bd8010 ldmia sp!, {r4, pc}
0000001c <f2>:
1c: e92d4010 stmdb sp!, {r4, lr}
20: e1a04000 mov r4, r0
24: ea000000 b 2c <f2+0x10>
28: ebfffffe bl 0 <q>
2c: e2544001 subs r4, r4, #1 ; 0x1
30: 2afffffc bcs 28 <q+0x28>
34: e8bd8010 ldmia sp!, {r4, pc}
This is different from what gcc does, and the two loops are different.
The speed should, however, be exactly the same.
>> I can't be bothered to check x86.
>
> But I can. For this particular case, here is the difference between
> the two variants in 'decode_cabac_residual':
> "while( --coeff_count >= 0 ) { ... }"
>
> ...
> 3022: 66 89 04 4a mov %ax,(%edx,%ecx,2)
> 3026: 83 6c 24 1c 04 subl $0x4,0x1c(%esp)
> 302b: 83 6c 24 0c 01 subl $0x1,0xc(%esp)
> 3030: 0f 89 06 fe ff ff jns 2e3c <decode_cabac_residual+0x42d>
> 3036: e9 d3 01 00 00 jmp 320e <decode_cabac_residual+0x7ff>
> 303b: 8b 54 24 08 mov 0x8(%esp),%edx
> 303f: 81 c2 bc 1d 02 00 add $0x21dbc,%edx
> ...
>
> "while( coeff_count-- ) { ... }"
>
> ...
> 3022: 66 89 04 4a mov %ax,(%edx,%ecx,2)
> 3026: 83 6c 24 1c 04 subl $0x4,0x1c(%esp)
> 302b: 83 6c 24 0c 01 subl $0x1,0xc(%esp)
>> 3030: 83 7c 24 0c ff cmpl $0xffffffff,0xc(%esp)
> 3035: 0f 85 01 fe ff ff jne 2e3c <decode_cabac_residual+0x42d>
> 303b: e9 d3 01 00 00 jmp 3213 <decode_cabac_residual+0x804>
> 3040: 8b 54 24 08 mov 0x8(%esp),%edx
> 3044: 81 c2 bc 1d 02 00 add $0x21dbc,%edx
> ...
>
> The expression 'while( coeff_count-- )' has one extra instruction
> inside of the loop in 'decode_cabac_residual', also increasing the
> size of the function by 5 bytes. The compiler seems to internally
> convert it into 'while( --coeff_count != -1 )', which is less
> efficient.
Stupid compiler.
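
The rewrite described above can be written out in C as a rough sketch. The names iterations_post and iterations_pre_ne are hypothetical, and both functions are assumed to be called only with n >= 0. The second form tests the new value against -1, which matches the separate cmpl in the listing, whereas a ">= 0" test can reuse the sign flag already set by the subl.

```c
#include <assert.h>

/* Source form: post-decrement, loop while the old value is non-zero. */
int iterations_post(int n)
{
    int calls = 0;
    while (n--)
        calls++;
    return calls;
}

/* Form the compiler appears to use internally: decrement first, then
 * compare the new value against -1. Assumption: n >= 0 on entry. */
int iterations_pre_ne(int n)
{
    int calls = 0;
    while (--n != -1)
        calls++;
    return calls;
}
```

Both functions run the body n times for any non-negative n, so the transformation is valid; it just costs an explicit comparison.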
> Compiled FFmpeg on Pentium-M with gcc 4.2.3 using just './configure &&
> make', let me know if you get different results with other versions of
> gcc or other optimization options.
Try adding a suitable --cpu flag to configure. In your case, that would
be --cpu=pentium-m.
> Of course, benchmarking with 'dezicycles' can hardly detect a
> difference of just 1 instruction reliably. gcc may also generate
> different code for other parts of the source as a side effect, but
> those changes are unrelated to the "while( coeff_count-- ) { ... }"
> vs. "while( --coeff_count >= 0 ) { ... }" case.
The difference probably comes not from the use of post- vs. pre-decrement,
but rather from the fact that the logic is different. Your point about
benchmarking is of course valid.
--
Måns Rullgård
mans at mansr.com