[FFmpeg-devel] [PATCH] lavc/hevc_cabac: Optimise ff_hevc_hls_residual_coding (v2)

John Cox jc at kynesim.co.uk
Tue Feb 2 13:28:59 CET 2016


Hi

On Tue, 2 Feb 2016 12:52:15 +0100, you wrote:

>Hi,
>
>as a modus operandi for this review, I have no time for a proper one,
>or at least not fitting with John's timeframe. I'll try to close as
>many pending discussions, and would prefer if someone else completed
>the review/validation/commit.
Thanks

>2016-01-22 19:33 GMT+01:00 John Cox <jc at kynesim.co.uk>:
>> Fair enough - though given that your slowdowns are almost certainly
>> cache-related the whole may be quite different from the sum of the
>> parts.
>
>True, they don't always translate to anything noticeable, but that's
>the best tool we have to objectively decide.
Yes, but it isn't always a good one. I have spent substantial time in
the past optimising TI DSP based codecs, and it was not uncommon that
some patches would make things slightly slower until enough of them were
applied, at which point the whole thing suddenly gained a jump in speed.

Either way I'm not averse to splitting stuff up and, at least on ARM,
none of the patches caused a slowdown.
 
>>>cosmetics?
>>
>> I renamed the variable because c_idx can have values 0..2 while c_idx_nz
>> is a boolean (0..1), and in the rewrite of the inc var it is important
>> that we are using the _nz variant, so having the var named appropriately
>> seemed sensible.
>
>I don't really mind mixing some form of cosmetics (=supposedly without
>code generation consequences) although other people prefer splitting
>for easier review and regression testing.
>
>This is not a blocking item for me, just that finding the most
>appropriate commit would be nice.

My point was that I changed the inputs to that fn, and so I changed the
var's name to make the point clearer - it should be part of the c_idx_nz
patch.

>>>I suppose branch prediction could help here, but less likely than for
>>>get_cabac_sig_coeff_flag_idxs.
>>>
>>>Also for this and some others: why inline over av_always_inline?
>> No particularly good reason for this one - though for any fn that might
>> be called from multiple places there is a strong argument for just
>> "inline" as it allows the compiler to make a judgment call based on how
>> big the L1 cache is and how bad the call penalty is.
>
>Anyway, those kinds of micro-optimizations I'm suggesting need more
>testing (sequences, platforms), so let's roll with this.
>
>>>AV_WN64 of 0x0101010101010101ULL, or even a memset, as it would be
>>>inlined by the compiler, given its size, and done the best possible
>>>way.
>>
>> levels is int *, not char *
>
>Ok, sorry, then 0x0000000100000001ULL. But you can ignore this, it'll
>probably make no difference outside of a micro-benchmark.

My experience with compilers is that this is the sort of thing that they
can and will do off their own bat. (Certainly MS C has been unrolling
this sort of memset loop for the past two decades, and I'd be stunned if
gcc doesn't too.)
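To illustrate the point (a standalone sketch, not the actual patch code;
fill_ones and fill_ones_w64 are made-up names): a plain int fill loop like
the first function below is exactly the pattern modern compilers unroll or
vectorise on their own, while the second spells out the manual 64-bit-store
variant under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain fill loop: the form compilers routinely unroll/vectorise. */
static void fill_ones(int *levels, int n)
{
    for (int i = 0; i < n; i++)
        levels[i] = 1;
}

/* Manual variant: two int 1s per 64-bit store.  The pattern is
 * endian-safe here because both 32-bit halves are identical. */
static void fill_ones_w64(int *levels, int n)
{
    const uint64_t v = 0x0000000100000001ULL;
    int i;
    for (i = 0; i + 2 <= n; i += 2)
        memcpy(levels + i, &v, 8);  /* unaligned-safe 64-bit store */
    for (; i < n; i++)              /* odd tail element, if any */
        levels[i] = 1;
}
```

Either way both produce the same array contents; the only question is
which the compiler turns into better machine code.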

>>>Saturation, could be a separate patch, with which I would be ok.
>
>btw and iirc, a comment indicated assumptions on what is a "legit"
>(instead of conforming) bitstream/coeffs, making a conscious
>decision.
>
>I know Ronald, ffvp9's author, specifically decided to handle
>equivalent cases in bitstreams (hint) from Argon Designs. I have no
>opinion, but others might.
>
>>>Related to but not strictly bypass ?
>>
>> Not bypass per se, more the general optimisation of abs_level_remaining
>> - it is pulled out because I had a slightly better arm asm version of
>> the fn.  So it could go in that patch, but this allows other asm to
>> override it if they so desire.
>
>What I meant: would better be there than in another commit.
>
>>>Doing:
>>>    if (get_cabac(c, state0 + ctx_map[n]))
>>>        *p++ = n;
>>>    while (--n != 0) {
>>>        if (get_cabac(c, state0 + ctx_map[n]))
>>>            *p++ = n;
>>>    }
>>>is most likely faster, probably also on arm, if the branch prediction is good.
>>
>> Not convinced.  That will increase code size (as get_cabac will inline)
>> which can be pure poison as you have found out with USE_N_END_1.
>
>That's 300B, not 1.5KB. And I *know* it can help, just not on all
>platforms and sequences. The same decision was made for ffh264's
>equivalent, iirc.

I'll have to take your word for it, but it seems very strange to me that

fn(x);
while(cond)
  fn(x);

is faster than

do {
  fn(x);
} while (cond);

I guess that it might be a branch prediction thing, but the second form
uses no more conditions than the first and is shorter. (And the compiler
always has the option of unrolling the 2nd form into the 1st.)

>But this would need more validated results, so let's ignore that discussion.

Yup - despite what I said above, if you can show that the first form is
objectively faster than the second on all platforms then you should use
it...
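For what it's worth, the semantic equivalence of the two shapes is easy to
demonstrate with a stub in place of get_cabac (a toy model, not the real
decoder; stub_get_cabac and the scan_* names are invented for illustration
only):

```c
#include <assert.h>

/* Stub coder: "decodes" a fixed bit pattern indexed by n. */
static int stub_get_cabac(const int *bits, int n)
{
    return bits[n];
}

/* First shape: peeled first iteration, then a while loop. */
static int scan_peeled(const int *bits, int n, int *out)
{
    int *p = out;
    if (stub_get_cabac(bits, n))
        *p++ = n;
    while (--n != 0) {
        if (stub_get_cabac(bits, n))
            *p++ = n;
    }
    return (int)(p - out);
}

/* Second shape: plain do/while, smaller code, same behaviour. */
static int scan_dowhile(const int *bits, int n, int *out)
{
    int *p = out;
    do {
        if (stub_get_cabac(bits, n))
            *p++ = n;
    } while (--n != 0);
    return (int)(p - out);
}
```

Both scan indices n..1 and record the set flags identically, so the choice
is purely a code-size vs branch-prediction trade-off.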

>>>boolean? (not sure)
>> I saw no booleans anywhere in the rest of this code so assumed they were
>> (at best) deprecated.  But if there is an official ffmpeg boolean then
>> that is what it should be.
>
>Sorry for the imprecision, ffmpeg doesn't have a boolean type. I meant
>this to be possibly split as its own kind of modification, not fitting
>with anything else. Or not, this shouldn't be a blocking item.
>
>>>So we decided to have #define USE_N_END_1 ARCH_ARM. As said in another
>>>mail, there's a 10-20% loss for the codeblock.
>>>
>>>USE_BY22_DIV also strangely provides no benefit.
>> Slightly surprised by that, but DIV is going to produce smaller code so
>> should be preferred on general grounds as it puts off cache disasters.
>
>There may be pessimization there, if the compiler doesn't
>manage to generate the idiv + picking edx for the shifted result.
>I haven't checked the disassembled code.

Using div avoids a conditional, a load and a possible shift, so it should
be faster, and it should certainly be a bit shorter (and it expresses
clearly what the multiply form is trying to achieve).
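As a generic illustration of the trade-off (not the actual USE_BY22 code;
both function names and the 2^32 fixed-point scale are assumptions for this
sketch), the same quotient can be computed with a hardware divide or with a
precomputed reciprocal, i.e. a multiply plus a shift:

```c
#include <assert.h>
#include <stdint.h>

/* What the DIV variant boils down to: one hardware divide. */
static uint32_t div_direct(uint32_t x, uint32_t d)
{
    return x / d;
}

/* Multiply form: precomputed fixed-point reciprocal (2^32 scale),
 * rounded up so the truncated product matches exact division for the
 * small operands involved (correct here for x below 2^16). */
static uint32_t div_by_recip(uint32_t x, uint32_t d)
{
    uint64_t recip = (0xFFFFFFFFULL / d) + 1;  /* ~2^32 / d, rounded up */
    return (uint32_t)((x * recip) >> 32);
}
```

The multiply form trades the divide latency for a table/constant lookup,
a multiply and a shift, which is exactly the code-size vs latency choice
being weighed above.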

Many thanks for your comments

Regards

JC

