[FFmpeg-devel] [RFC] support encrypted asf

Måns Rullgård mans
Mon Oct 8 22:40:34 CEST 2007


Michael Niedermayer <michaelni at gmx.at> writes:

> On Mon, Oct 08, 2007 at 08:42:56AM +0200, Reimar D?ffinger wrote:
>> Hello,
>> since it was such an interesting problem I could not stop myself from
>> optimizing the inverse() function.
>> It probably is easier to understand without algebra knowledge, but a
>> bit more obfuscated for those with.
>> On x86 the speed difference is only about 10%, multiplication is just
>> too fast on those ;-).
>> Here it is (tell me if you'd prefer an updated complete patch instead):
>> 
>> static uint32_t inverse(uint32_t v) {
>>     uint32_t factor = 1;
>>     uint32_t product = v;
>>     uint32_t mask = 2;
>>     do {
>>       v <<= 1;
>>       factor |= product & mask;
>>       product += v & -(product & mask);
>>       // should be mask <<= 1; but then gcc misses the
>>       // optimization opportunity to use the Z-flag for the while test
>>       mask += mask;
>>     } while (mask);
>>     return factor;
>> }
>
> interresting, btw a multiply based variant with fewer multiplies is:
>
> static uint32_t inverse(uint32_t v) {
>     uint32_t v3, v6, v12;
> #define POW3(v)  v*=v; v*=v; v*=v
> #define POW6(v)  POW3(v); POW3(v)
> #define POW12(v) POW6(v); POW6(v)
>     v3= v*v*v;
>     v6=v3= v3*v3*v;
>     POW3(v6);
>     v12=v6= v6*v3;
>     POW6(v12);
>     v3=v12= v12*v6;
>     POW12(v3);
>     v3*=v12;
>     POW6(v3);
>     v3 *=v6;
>     return v3*v3*v;
> }

That's a curious one.  GCC compiles this for ARM as 37 multiplies and
push/pop of two registers.  For x86-64 it generates 37 multiplies, 6
moves and a ret.  GCC 4.3 manages with only 4 moves (though I didn't
check it for correctness).

While this is faster than Reimar's code above, it suffers badly from
pipeline stalls on ARM since 32-bit multiplies have a 5-cycle latency
before the result can be used as input to another multiply, and all
they all depend on the previous one.  Does it have to be calculated
serially?  (I'm too lazy to work out why it works in the first place.)

> or a LUT based variant
>
> static uint32_t inverse2(uint32_t v) {
>     uint32_t a,b;
>     a= inv[v&255];
>
>     b= (a*v)>>8;
>     b*= a;
>     a-= b<<8;
>     b= (a*v)>>16;
>     b*= a;
>
>     return a - (b<<16);
> }

This looks like it should beat all the other variants presented so far
if the lookup table can be afforded and has a chance of being cached.

-- 
M?ns Rullg?rd
mans at mansr.com




More information about the ffmpeg-devel mailing list