[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization
Siarhei Siamashka
siarhei.siamashka
Wed Jul 2 11:45:37 CEST 2008
On Wed, Jul 2, 2008 at 1:00 AM, Måns Rullgård <mans at mansr.com> wrote:
> "Siarhei Siamashka" <siarhei.siamashka at gmail.com> writes:
[...]
>> Typically pre-decrement is always preferred in code optimized for
>> performance as it is generally faster. Something like this would be
>> better (also it is closer to the old code):
>> while( --coeff_count >= 0 ) {
>> ...
>> }
>>
>> You can try to compile this sample with the best possible
>> optimizations, look at the assembly output and check where the
>> generated code is better and why:
>>
>> /**********************/
>> int q();
>>
>> void f1(int n)
>> {
>> while (--n >= 0) {
>> q();
>> }
>> }
>>
>> void f2(int n)
>> {
>> while (n--) {
>> q();
>> }
>> }
>> /**********************/
>
> Any half-decent compiler should generate the same code for those two
> functions.
That's not true, simply because these two functions are not equivalent.
Hint: what happens if you pass -1 or any other negative value to these
functions?
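To make the hint concrete, here is a minimal sketch that counts how many
times each loop form runs its body. The names 'count_pre', 'count_post'
and the 'cap' parameter are made up for this illustration and are not part
of h264.c. The cap (assumed to be positive) is only there so that a
negative argument cannot keep the second loop spinning:

```c
/* Hypothetical helper: iterations of the pre-decrement form.
 * 'cap' must be positive; it bounds the count for illustration only. */
int count_pre(int n, int cap)
{
    int iterations = 0;
    while (--n >= 0) {
        if (++iterations == cap)
            break;
    }
    return iterations;
}

/* Hypothetical helper: iterations of the post-decrement form.
 * For a negative 'n' the loop condition stays true, so only 'cap' stops it. */
int count_post(int n, int cap)
{
    int iterations = 0;
    while (n--) {
        if (++iterations == cap)
            break;
    }
    return iterations;
}
```

For n = 3 both helpers report 3 iterations. For n = -1 the first one
reports 0, while the second one runs until it hits the cap.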
> GCC for ARM generates a slightly different, but equivalent, setup sequence, and the loops are exactly the same.
In my case, gcc 3.4.4 (using '-march=armv6 -O3 -c' options) generated
the following assembly output, which is definitely better for 'f1' (3
instructions in the inner loop instead of 4):
00000000 <f1>:
0: e92d4010 stmdb sp!, {r4, lr}
4: e2504001 subs r4, r0, #1 ; 0x1
8: 48bd8010 ldmmiia sp!, {r4, pc}
c: ebfffffe bl 0 <q>
10: e2544001 subs r4, r4, #1 ; 0x1
14: 5afffffc bpl c <f1+0xc>
18: e8bd8010 ldmia sp!, {r4, pc}
0000001c <f2>:
1c: e92d4010 stmdb sp!, {r4, lr}
20: e2504001 subs r4, r0, #1 ; 0x1
24: 38bd8010 ldmccia sp!, {r4, pc}
28: e2444001 sub r4, r4, #1 ; 0x1
2c: ebfffffe bl 0 <q>
30: e3740001 cmn r4, #1 ; 0x1
34: 1afffffb bne 28 <q+0x28>
38: e8bd8010 ldmia sp!, {r4, pc}
I'm curious, what is the output of your compiler?
> I can't be bothered to check x86.
But I can. In this particular case, the two variants of the loop in
'decode_cabac_residual' compile to the following:
"while( --coeff_count >= 0 ) { ... }"
...
3022: 66 89 04 4a mov %ax,(%edx,%ecx,2)
3026: 83 6c 24 1c 04 subl $0x4,0x1c(%esp)
302b: 83 6c 24 0c 01 subl $0x1,0xc(%esp)
3030: 0f 89 06 fe ff ff jns 2e3c <decode_cabac_residual+0x42d>
3036: e9 d3 01 00 00 jmp 320e <decode_cabac_residual+0x7ff>
303b: 8b 54 24 08 mov 0x8(%esp),%edx
303f: 81 c2 bc 1d 02 00 add $0x21dbc,%edx
...
"while( coeff_count-- ) { ... }"
...
3022: 66 89 04 4a mov %ax,(%edx,%ecx,2)
3026: 83 6c 24 1c 04 subl $0x4,0x1c(%esp)
302b: 83 6c 24 0c 01 subl $0x1,0xc(%esp)
> 3030: 83 7c 24 0c ff cmpl $0xffffffff,0xc(%esp)
3035: 0f 85 01 fe ff ff jne 2e3c <decode_cabac_residual+0x42d>
303b: e9 d3 01 00 00 jmp 3213 <decode_cabac_residual+0x804>
3040: 8b 54 24 08 mov 0x8(%esp),%edx
3044: 81 c2 bc 1d 02 00 add $0x21dbc,%edx
...
The expression 'while( coeff_count-- )' has one extra instruction
inside of the loop in 'decode_cabac_residual', also increasing the
size of the function by 5 bytes. The compiler seems to internally
convert it into 'while( --coeff_count != -1 )', which is less
efficient.
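As a small sketch of the three loop forms discussed here (the function
names are hypothetical, chosen only for this example), all of them run the
body the same number of times as long as the count is non-negative. The
'>= 0' form can branch directly on the sign flag set by the decrement
('jns' on x86, 'bpl' on ARM), while the '!= -1' form needs the explicit
compare seen in the listings above:

```c
/* Hypothetical helpers: each returns how many times the loop body runs.
 * They are only equivalent for n >= 0. */

/* Original post-decrement form. */
int loop_post(int n)
{
    int count = 0;
    while (n--)
        count++;
    return count;
}

/* What the compiler seems to turn the post-decrement form into. */
int loop_ne_minus1(int n)
{
    int count = 0;
    while (--n != -1)
        count++;
    return count;
}

/* Suggested pre-decrement form with a sign test. */
int loop_ge_zero(int n)
{
    int count = 0;
    while (--n >= 0)
        count++;
    return count;
}
```

Whether a given compiler actually emits fewer instructions for the last
form still has to be checked in its assembly output.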
I compiled FFmpeg on a Pentium-M with gcc 4.2.3 using just './configure &&
make'. Let me know if you get different results with other versions of
gcc or other optimization options.
Of course, benchmarking with 'decicycles' can hardly detect a difference
of just 1 instruction reliably. Also, gcc may generate different code for
other parts of the source as a side effect, but such differences are
unrelated to the "while( coeff_count-- ) { ... }" vs. "while(
--coeff_count >= 0 ) { ... }" case.