[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization

Wed Jul 2 12:28:35 CEST 2008

Siarhei Siamashka wrote:
> On Wed, Jul 2, 2008 at 1:00 AM, Måns Rullgård <mans at mansr.com> wrote:
>> "Siarhei Siamashka" <siarhei.siamashka at gmail.com> writes:
> [...]
>>> Pre-decrement is typically preferred in code optimized for
>>> performance, as it is generally faster. Something like this would be
>>> better (it is also closer to the old code):
>>> while( --coeff_count >= 0 ) {
>>> ...
>>> }
>>>
>>> You can try to compile this sample with the best possible
>>> optimizations, look at the assembly output and check where the
>>> generated code is better and why:
>>>
>>> /**********************/
>>> int q();
>>>
>>> void f1(int n)
>>> {
>>>     while (--n >= 0) {
>>>         q();
>>>     }
>>> }
>>>
>>> void f2(int n)
>>> {
>>>     while (n--) {
>>>         q();
>>>     }
>>> }
>>> /**********************/
>>
>> Any half-decent compiler should generate the same code for those two
>> functions.
>
> That's not true, just because these two functions are not identical.
> Hint: what happens if you pass -1 or any other negative value to these
> functions?

Right... I somehow read the second one as while (n-- > 0).  If you
want to compare post- vs. pre-decrement, that is also what you should
be compiling, as otherwise you'll be comparing the speed of doing
different things.
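
To make that concrete, here is a minimal sketch of the like-for-like
comparison. The function names and the counter standing in for q() are
mine, not from the patch. Both loops run max(n, 0) times, so any
difference in the generated code would be down to the decrement form
alone:

```c
/* Hypothetical helpers: each returns how many times the loop body runs.
 * n == INT_MIN is excluded, since the decrement would overflow in
 * either form. */

/* Pre-decrement form, as in f1. */
int count_pre(int n)
{
    int calls = 0;
    while (--n >= 0)
        calls++;            /* stands in for q() */
    return calls;
}

/* Post-decrement form with the same behaviour for negative n. */
int count_post(int n)
{
    int calls = 0;
    while (n-- > 0)
        calls++;            /* stands in for q() */
    return calls;
}
```

With these, count_pre(-1) and count_post(-1) both return 0, which is not
true of the plain while (n--) form.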

>> GCC for ARM generates a slightly different, but equivalent, setup sequence,
>> and the loops are exactly the same.
>
> In my case, gcc 3.4.4 (using '-march=armv6 -O3 -c' options) generated
> the following assembly output, which is definitely better for 'f1' (3
> instructions in the inner loop instead of 4):
>
> 00000000 <f1>:
>    0:   e92d4010        stmdb   sp!, {r4, lr}
>    4:   e2504001        subs    r4, r0, #1      ; 0x1
>    8:   48bd8010        ldmmiia sp!, {r4, pc}
>    c:   ebfffffe        bl      0 <q>
>   10:   e2544001        subs    r4, r4, #1      ; 0x1
>   14:   5afffffc        bpl     c <f1+0xc>
>   18:   e8bd8010        ldmia   sp!, {r4, pc}
>
> 0000001c <f2>:
>   1c:   e92d4010        stmdb   sp!, {r4, lr}
>   20:   e2504001        subs    r4, r0, #1      ; 0x1
>   24:   38bd8010        ldmccia sp!, {r4, pc}
>   28:   e2444001        sub     r4, r4, #1      ; 0x1
>   2c:   ebfffffe        bl      0 <q>
>   30:   e3740001        cmn     r4, #1  ; 0x1
>   34:   1afffffb        bne     28 <q+0x28>
>   38:   e8bd8010        ldmia   sp!, {r4, pc}
>
> I'm curious, what is the output of your compiler?

I was using CodeSourcery GCC 4.1.2 (the only compiler that works with
NEON) and -O3 -mcpu=cortex-a8.  I'm at work now, so I can't post
the exact output, but the loop bodies were identical in both cases;
only the prologue was different, since (as you pointed out) negative
initial values have different effects.

Since I'm at work, I can try it with the commercial ARM compiler
(only an old version, unfortunately):

00000000 <f1>:
   0:   e92d4010        stmdb   sp!, {r4, lr}
   4:   e1a04000        mov     r4, r0
   8:   ea000000        b       10 <f1+0x10>
   c:   ebfffffe        bl      0 <q>
  10:   e2544001        subs    r4, r4, #1      ; 0x1
  14:   5afffffc        bpl     c <f1+0xc>
  18:   e8bd8010        ldmia   sp!, {r4, pc}

0000001c <f2>:
  1c:   e92d4010        stmdb   sp!, {r4, lr}
  20:   e1a04000        mov     r4, r0
  24:   ea000000        b       2c <f2+0x10>
  28:   ebfffffe        bl      0 <q>
  2c:   e2544001        subs    r4, r4, #1      ; 0x1
  30:   2afffffc        bcs     28 <q+0x28>
  34:   e8bd8010        ldmia   sp!, {r4, pc}

This is different from what gcc does, and the two loops are different.
The speed should, however, be exactly the same.
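
Reading the two loops side by side: both decrement with subs and differ
only in the flag they branch on (bpl tests N, bcs tests C). A rough C
model of that, with names of my own, relying on the ARM convention that
C after a subtraction means "no borrow":

```c
#include <stdint.h>

/* Model of "subs r4, r4, #1": returns the new register value and
 * reports the N and C flags.  N is the sign bit of the result; C is
 * set when no borrow occurred, i.e. the old value was >= 1 unsigned. */
static uint32_t subs1(uint32_t r, int *n_flag, int *c_flag)
{
    uint32_t res = r - 1u;
    *n_flag = (int)(res >> 31);
    *c_flag = r >= 1u;
    return res;
}

/* f1: loop back with bpl, i.e. while the result is not negative. */
unsigned f1_model(int32_t n)
{
    uint32_t r4 = (uint32_t)n;
    unsigned calls = 0;
    int nf, cf;
    for (;;) {
        r4 = subs1(r4, &nf, &cf);
        if (nf)
            break;          /* bpl not taken */
        calls++;            /* stands in for bl q */
    }
    return calls;
}

/* f2: loop back with bcs, i.e. while the old value was non-zero.
 * Only sensible to run for n >= 0; a negative n would take close to
 * 2^32 iterations, just like the real while (n--). */
unsigned f2_model(int32_t n)
{
    uint32_t r4 = (uint32_t)n;
    unsigned calls = 0;
    int nf, cf;
    for (;;) {
        r4 = subs1(r4, &nf, &cf);
        if (!cf)
            break;          /* bcs not taken */
        calls++;            /* stands in for bl q */
    }
    return calls;
}
```

For non-negative n both models make the same number of calls, with one
subs and one conditional branch per iteration in each.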

>> I can't be bothered to check x86.
>
> But I can. For this particular case, here is how the two variants of
> the loop in 'decode_cabac_residual' compile:
> "while( --coeff_count >= 0 ) { ... }"
>
> ...
>     3022:   66 89 04 4a             mov    %ax,(%edx,%ecx,2)
>     3026:   83 6c 24 1c 04          subl   $0x4,0x1c(%esp)
>     302b:   83 6c 24 0c 01          subl   $0x1,0xc(%esp)
>     3030:   0f 89 06 fe ff ff       jns    2e3c <decode_cabac_residual+0x42d>
>     3036:   e9 d3 01 00 00          jmp    320e <decode_cabac_residual+0x7ff>
>     303b:   8b 54 24 08             mov    0x8(%esp),%edx
>     303f:   81 c2 bc 1d 02 00       add    $0x21dbc,%edx
> ...
>
> "while( coeff_count-- ) { ... }"
>
> ...
>     3022:   66 89 04 4a             mov    %ax,(%edx,%ecx,2)
>     3026:   83 6c 24 1c 04          subl   $0x4,0x1c(%esp)
>     302b:   83 6c 24 0c 01          subl   $0x1,0xc(%esp)
>>    3030:   83 7c 24 0c ff          cmpl   $0xffffffff,0xc(%esp)
>     3035:   0f 85 01 fe ff ff       jne    2e3c <decode_cabac_residual+0x42d>
>     303b:   e9 d3 01 00 00          jmp    3213 <decode_cabac_residual+0x804>
>     3040:   8b 54 24 08             mov    0x8(%esp),%edx
>     3044:   81 c2 bc 1d 02 00       add    $0x21dbc,%edx
> ...
>
> The expression 'while( coeff_count-- )' has one extra instruction
> inside of the loop in 'decode_cabac_residual', also increasing the
> size of the function by 5 bytes. The compiler seems to internally
> convert it into 'while( --coeff_count != -1 )', which is less
> efficient.

Stupid compiler.
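
To be fair to it, the rewrite itself is legal: the two forms below
behave the same wherever both are defined. This is only a sketch with
made-up names, and the cap parameter exists purely so that a negative n
cannot run away:

```c
/* Hypothetical helpers: each returns how many times the body runs,
 * stopping early once 'cap' iterations have been counted. */

/* The form as written in the source. */
int count_post_nz(int n, int cap)
{
    int calls = 0;
    while (n-- && calls < cap)
        calls++;
    return calls;
}

/* The form gcc appears to turn it into. */
int count_pre_ne(int n, int cap)
{
    int calls = 0;
    while (--n != -1 && calls < cap)
        calls++;
    return calls;
}
```

For n >= 0 these also agree with while (--n >= 0), but presumably the
compiler cannot assume coeff_count is never negative, so it keeps the
explicit compare against -1.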

> I compiled FFmpeg on a Pentium-M with gcc 4.2.3 using just './configure &&
> make'. Let me know if you get different results with other versions of
> gcc or other optimization options.

Try adding a suitable --cpu flag to configure.  In your case, that would
be --cpu=pentium-m.

> Of course, benchmarking with 'dezicycles' can hardly detect a
> difference of just 1 instruction reliably. gcc may also generate
> different code for other parts of the source as a side effect, but
> those changes are unrelated to the "while( coeff_count-- ) { ... }" vs.
> "while( --coeff_count >= 0 ) { ... }" case.

The difference probably comes not from post- vs. pre-decrement being used,
but rather from the fact that the logic is different.  Your point about
benchmarking is of course valid.

-- 
Måns Rullgård
mans at mansr.com