[FFmpeg-devel] [PATCH] Indeo5 decoder

Maxim max_pole
Fri Apr 17 13:44:30 CEST 2009


Michael Niedermayer wrote:
> On Tue, Apr 07, 2009 at 05:08:34PM +0200, Maxim wrote:
>   
>> Michael Niedermayer wrote:
>>     
>>> On Tue, Apr 07, 2009 at 10:52:34AM +0200, Maxim wrote:
>>>   
>>>       
>>>> Michael Niedermayer wrote:
>>>>     
>>>>         
>>>>> On Mon, Apr 06, 2009 at 08:41:57PM +0200, Maxim wrote:
>>>>>       
>>>>>           
>>> [...]
>>>   
>>>       
>>>>>> +
>>>>>> +
>>>>>> +/**
>>>>>> + *  Build static indeo5 dequantization tables.
>>>>>> + */
>>>>>> +static av_cold void build_dequant_tables(void)
>>>>>> +{
>>>>>> +    int         mat, i, lev;
>>>>>> +    uint32_t    q1, q2, sf1, sf2;
>>>>>> +
>>>>>> +    for (mat = 0; mat < 5; mat++) {
>>>>>> +        /* build 8x8 intra/inter tables for all 24 quant levels */
>>>>>> +        for (lev = 0; lev < 24; lev++) {
>>>>>> +            sf1 = ivi5_scale_quant_8x8_intra[mat][lev];
>>>>>> +            sf2 = ivi5_scale_quant_8x8_inter[mat][lev];
>>>>>> +
>>>>>> +            for (i = 0; i < 64; i++) {
>>>>>> +                q1 = (ivi5_base_quant_8x8_intra[mat][i] * sf1) >> 8;
>>>>>> +                q2 = (ivi5_base_quant_8x8_inter[mat][i] * sf2) >> 8;
>>>>>> +                deq8x8_intra[mat][lev][i] = av_clip(q1, 1, 255);
>>>>>> +                deq8x8_inter[mat][lev][i] = av_clip(q2, 1, 255);
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> 1..255 but they aren't uint8_t
>>>>> av_clip() seems useless and the whole table precalc maybe as well
>>>>>   
>>>>>       
>>>>>           
>>>> They were made uint16_t for compatibility with Indeo4, which uses
>>>> 9-bit dequant tables...
>>>> The table precalculation should help avoid huge static tables...
>>>>     
>>>>         
>>> let me clarify my question, what is gained by merging a multiply and shift
>>> into the table?
>>> is it faster? if so then by how much?
>>>       

I did some research on that! Here are the answers to your questions:

Question: Is it faster? If so, then by how much?

Yes, it's faster. I measured the calculation "time" using the
START/STOP_TIMER macros. I ran two tests on two different videos: one
containing mostly light colors (DPS190indeo.avi) and another containing
mostly dark colors (haegemonia.avi). I chose these because light colors
require higher scale factors and therefore a multiply by a larger
number.
The first test measured the dezicycles consumed by the inverse
quantization using table lookup vs. MUL. It was run on my x86 laptop
equipped with an Intel Core Duo processor at 2 GHz. Here are the raw
numbers:

DPS190indeo.avi
using MUL
---------------------------------------------------------
10050 dezicycles in inverse_quant, 1 runs, 0 skips
6225 dezicycles in inverse_quant, 2 runs, 0 skips
4012 dezicycles in inverse_quant, 4 runs, 0 skips
2831 dezicycles in inverse_quant, 8 runs, 0 skips
2503 dezicycles in inverse_quant, 16 runs, 0 skips
2104 dezicycles in inverse_quant, 32 runs, 0 skips
1912 dezicycles in inverse_quant, 64 runs, 0 skips
1814 dezicycles in inverse_quant, 128 runs, 0 skips
1766 dezicycles in inverse_quant, 256 runs, 0 skips
1730 dezicycles in inverse_quant, 512 runs, 0 skips
1715 dezicycles in inverse_quant, 1024 runs, 0 skips
1718 dezicycles in inverse_quant, 2048 runs, 0 skips
1735 dezicycles in inverse_quant, 4096 runs, 0 skips
1703 dezicycles in inverse_quant, 8192 runs, 0 skips
1679 dezicycles in inverse_quant, 16384 runs, 0 skips
1642 dezicycles in inverse_quant, 32761 runs, 7 skips
1906 dezicycles in inverse_quant, 65524 runs, 12 skips
2165 dezicycles in inverse_quant, 131055 runs, 17 skips
2289 dezicycles in inverse_quant, 262124 runs, 20 skips
2361 dezicycles in inverse_quant, 524263 runs, 25 skips
2461 dezicycles in inverse_quant, 1048530 runs, 46 skips


DPS190indeo.avi
using table lookup
---------------------------------------------------------
5850 dezicycles in inverse_quant, 1 runs, 0 skips
3825 dezicycles in inverse_quant, 2 runs, 0 skips
2737 dezicycles in inverse_quant, 4 runs, 0 skips
2156 dezicycles in inverse_quant, 8 runs, 0 skips
1837 dezicycles in inverse_quant, 16 runs, 0 skips
1678 dezicycles in inverse_quant, 32 runs, 0 skips
1605 dezicycles in inverse_quant, 64 runs, 0 skips
1572 dezicycles in inverse_quant, 128 runs, 0 skips
1562 dezicycles in inverse_quant, 256 runs, 0 skips
1546 dezicycles in inverse_quant, 512 runs, 0 skips
1542 dezicycles in inverse_quant, 1024 runs, 0 skips
1538 dezicycles in inverse_quant, 2048 runs, 0 skips
1533 dezicycles in inverse_quant, 4095 runs, 1 skips
1529 dezicycles in inverse_quant, 8191 runs, 1 skips
2481 dezicycles in inverse_quant, 16377 runs, 7 skips
2276 dezicycles in inverse_quant, 32754 runs, 14 skips
2217 dezicycles in inverse_quant, 65521 runs, 15 skips
2303 dezicycles in inverse_quant, 131056 runs, 16 skips
2380 dezicycles in inverse_quant, 262126 runs, 18 skips
2390 dezicycles in inverse_quant, 524256 runs, 32 skips
2303 dezicycles in inverse_quant, 1048526 runs, 50 skips


haegemonia.avi
using MUL
-------------------------------------------------------------
9150 dezicycles in inverse_quant, 1 runs, 0 skips
8400 dezicycles in inverse_quant, 2 runs, 0 skips
5175 dezicycles in inverse_quant, 4 runs, 0 skips
3375 dezicycles in inverse_quant, 8 runs, 0 skips
2943 dezicycles in inverse_quant, 16 runs, 0 skips
2189 dezicycles in inverse_quant, 32 runs, 0 skips
1804 dezicycles in inverse_quant, 64 runs, 0 skips
1722 dezicycles in inverse_quant, 128 runs, 0 skips
1565 dezicycles in inverse_quant, 256 runs, 0 skips
1502 dezicycles in inverse_quant, 512 runs, 0 skips
1460 dezicycles in inverse_quant, 1024 runs, 0 skips
1455 dezicycles in inverse_quant, 2048 runs, 0 skips
1445 dezicycles in inverse_quant, 4096 runs, 0 skips
1450 dezicycles in inverse_quant, 8189 runs, 3 skips
1456 dezicycles in inverse_quant, 16377 runs, 7 skips
1462 dezicycles in inverse_quant, 32761 runs, 7 skips
1467 dezicycles in inverse_quant, 65529 runs, 7 skips
1471 dezicycles in inverse_quant, 131063 runs, 9 skips
1471 dezicycles in inverse_quant, 262133 runs, 11 skips
1469 dezicycles in inverse_quant, 524273 runs, 15 skips
1482 dezicycles in inverse_quant, 1048539 runs, 37 skips
1539 dezicycles in inverse_quant, 2097087 runs, 65 skips
1569 dezicycles in inverse_quant, 4194192 runs, 112 skips
1587 dezicycles in inverse_quant, 8388415 runs, 193 skips
1576 dezicycles in inverse_quant, 16776867 runs, 349 skips
1589 dezicycles in inverse_quant, 33553772 runs, 660 skips


haegemonia.avi
using table lookup
----------------------------------------------------------------
7650 dezicycles in inverse_quant, 1 runs, 0 skips
9225 dezicycles in inverse_quant, 2 runs, 0 skips
7162 dezicycles in inverse_quant, 4 runs, 0 skips
4312 dezicycles in inverse_quant, 8 runs, 0 skips
3150 dezicycles in inverse_quant, 16 runs, 0 skips
2292 dezicycles in inverse_quant, 32 runs, 0 skips
1851 dezicycles in inverse_quant, 64 runs, 0 skips
1692 dezicycles in inverse_quant, 128 runs, 0 skips
1549 dezicycles in inverse_quant, 256 runs, 0 skips
1479 dezicycles in inverse_quant, 512 runs, 0 skips
1447 dezicycles in inverse_quant, 1024 runs, 0 skips
1435 dezicycles in inverse_quant, 2048 runs, 0 skips
1425 dezicycles in inverse_quant, 4096 runs, 0 skips
1417 dezicycles in inverse_quant, 8191 runs, 1 skips
1414 dezicycles in inverse_quant, 16383 runs, 1 skips
1415 dezicycles in inverse_quant, 32763 runs, 5 skips
1415 dezicycles in inverse_quant, 65526 runs, 10 skips
1412 dezicycles in inverse_quant, 131054 runs, 18 skips
1411 dezicycles in inverse_quant, 262111 runs, 33 skips
1411 dezicycles in inverse_quant, 524235 runs, 53 skips
1427 dezicycles in inverse_quant, 1048472 runs, 104 skips
1462 dezicycles in inverse_quant, 2097001 runs, 151 skips
1478 dezicycles in inverse_quant, 4194124 runs, 180 skips
1487 dezicycles in inverse_quant, 8388350 runs, 258 skips
1480 dezicycles in inverse_quant, 16776811 runs, 405 skips
1483 dezicycles in inverse_quant, 33553731 runs, 701 skips

The second test measures the time consumed by the whole "decode_block"
function. It was run on the same videos to get an idea of how much
overall slowdown the MUL can cause compared to the table lookup.
Note that this test isn't highly precise because of the different block
sizes, block types, motion compensation routines etc. Here are the
results:

DPS190indeo.avi
using table lookup
--------------------------------------------------------------
247572150 dezicycles in decode_block, 1 runs, 0 skips
131260875 dezicycles in decode_block, 2 runs, 0 skips
83439337 dezicycles in decode_block, 4 runs, 0 skips
50838712 dezicycles in decode_block, 8 runs, 0 skips
48420253 dezicycles in decode_block, 16 runs, 0 skips
35089167 dezicycles in decode_block, 32 runs, 0 skips
31400655 dezicycles in decode_block, 64 runs, 0 skips
34119541 dezicycles in decode_block, 128 runs, 0 skips
37256929 dezicycles in decode_block, 256 runs, 0 skips
34629149 dezicycles in decode_block, 512 runs, 0 skips
34388553 dezicycles in decode_block, 1024 runs, 0 skips


DPS190indeo.avi
using MUL
--------------------------------------------------------------
98318100 dezicycles in decode_block, 1 runs, 0 skips
52273125 dezicycles in decode_block, 2 runs, 0 skips
43584300 dezicycles in decode_block, 4 runs, 0 skips
30443175 dezicycles in decode_block, 8 runs, 0 skips
24910940 dezicycles in decode_block, 15 runs, 1 skips
30644970 dezicycles in decode_block, 31 runs, 1 skips
28728507 dezicycles in decode_block, 63 runs, 1 skips
30057405 dezicycles in decode_block, 127 runs, 1 skips
33836048 dezicycles in decode_block, 255 runs, 1 skips
34394564 dezicycles in decode_block, 511 runs, 1 skips
34969909 dezicycles in decode_block, 1023 runs, 1 skips


haegemonia.avi
using table lookup
---------------------------------------------------------------
148390350 dezicycles in decode_block, 1 runs, 0 skips
79289700 dezicycles in decode_block, 2 runs, 0 skips
56912250 dezicycles in decode_block, 4 runs, 0 skips
46417781 dezicycles in decode_block, 8 runs, 0 skips
58794965 dezicycles in decode_block, 16 runs, 0 skips
64963382 dezicycles in decode_block, 32 runs, 0 skips
77440038 dezicycles in decode_block, 64 runs, 0 skips
83901941 dezicycles in decode_block, 128 runs, 0 skips
86332104 dezicycles in decode_block, 256 runs, 0 skips
87132562 dezicycles in decode_block, 512 runs, 0 skips
87044181 dezicycles in decode_block, 1024 runs, 0 skips
85457518 dezicycles in decode_block, 2048 runs, 0 skips
82111635 dezicycles in decode_block, 4096 runs, 0 skips


haegemonia.avi
using MUL
---------------------------------------------------------------
149617350 dezicycles in decode_block, 1 runs, 0 skips
80227575 dezicycles in decode_block, 2 runs, 0 skips
64138537 dezicycles in decode_block, 4 runs, 0 skips
43374487 dezicycles in decode_block, 8 runs, 0 skips
44831160 dezicycles in decode_block, 15 runs, 1 skips
57537459 dezicycles in decode_block, 31 runs, 1 skips
75061088 dezicycles in decode_block, 63 runs, 1 skips
82704550 dezicycles in decode_block, 126 runs, 2 skips
86989034 dezicycles in decode_block, 254 runs, 2 skips
88685182 dezicycles in decode_block, 510 runs, 2 skips
88840202 dezicycles in decode_block, 1022 runs, 2 skips
87397207 dezicycles in decode_block, 2046 runs, 2 skips
83958492 dezicycles in decode_block, 4094 runs, 2 skips


I was not able to run these tests on PPC because I couldn't get the
latest svn to compile on my PPC Mac! The final result can vary depending
on the processor/arch used, e.g. an older processor with a slower MUL...

Finally, the main reason I created this table was to share the
"decode_block" function between the Indeo4 and Indeo5 decoders.
The problem is that their dequant tables are quite different. You
couldn't know that because you haven't seen the Indeo4 code, but doing
the scale factor calculation in "decode_block" for both decoders would
end up looking like this (correctness not ensured):

// base_tab and scale_tab point to the appropriate tables,
// depending on dequant matrix, block type and block size

if (dec_type == indeo4) {
     q = (base_tab[pos] * quant * scale_tab[pos]) / 48;
     q = (q) ? q : 1;
} else { // dec_type == indeo5
     q = (base_tab[pos] * scale_tab[quant]) >> 8;
     q = (q) ? q : 1;
}
// inverse quant using q here...

That looks messier than the table precalc IMHO... Moreover, the DIV in
the Indeo4 branch can cause a significant slowdown (I didn't test it
though)...
On the other hand, I agree with you that the precalculated tables
require lots of memory/cache, especially when they only save a small
number of operations...
It's surely important to find a good trade-off...

Regards
Maxim


