[FFmpeg-devel] pre discussion around Blackfin dct_quantize_bfin routine

Wed Jun 13 02:14:26 CEST 2007

On Jun 12, 2007, at 7:39 PM, Michael Niedermayer wrote:

> Hi
>
> On Tue, Jun 12, 2007 at 09:24:22AM -0400, Marc Hoffman wrote:
>> On 6/12/07, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
>>> On 6/12/07, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>> Hi
>>>>
>>>> On Tue, Jun 12, 2007 at 05:49:19AM -0400, Marc Hoffman wrote:
>>>>> Do these allow me to ignore the DCT permutation?
>>>>
>>>> no it would still break if the user selected the other integer idct
>>>
>>> Is it possible to add a configure option to compile ffmpeg only with
>>> IDCTs that do not need permutation (and not allow the user to select
>>> another IDCT)? At least it would eliminate table lookups in many places
>>> (replace table lookups with a macro which expands either to a table
>>> lookup or to the value itself). The point is that ARM devices are
>>> heavily CPU limited and the ARMv5TE optimized IDCT does not use
>>> permutation. Blackfin powered devices may be CPU limited too (Marc can
>>> probably provide more information about Blackfin performance). I'll
>>> try to do some benchmarks on ARM and post some results later.
>>>
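
A rough sketch of the macro idea above, for illustration only. The switch
CONFIG_NO_IDCT_PERM and the names IDCT_PERM and permute_block are made up
here; they are not existing FFmpeg configure options or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build-time switch, not an existing FFmpeg option. */
#ifdef CONFIG_NO_IDCT_PERM
#define IDCT_PERM(perm, i) (i)            /* identity: no table access at all */
#else
#define IDCT_PERM(perm, i) ((perm)[(i)])  /* generic case: table lookup */
#endif

/* Copy 64 coefficients into the order the selected IDCT expects. */
static void permute_block(int16_t *dst, const int16_t *src, const uint8_t *perm)
{
    assert(dst != src);  /* the copy is not done in place */
    for (int i = 0; i < 64; i++)
        dst[IDCT_PERM(perm, i)] = src[i];
}
```

With the switch defined, the lookup disappears at compile time and the loop
becomes a plain copy; without it, the code behaves as a normal table-driven
permutation.
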
>>
>> On Blackfin you want to eliminate those permutations; they are costly.
>> Basically, something like:
>>
>>     j=scantable[i];
>>     x=data[j];
>>
>> expands into:
>>
>>     p0=[p1++];
>>     /* 3 cycle delay waiting for p0 to become valid.  Thank god it's interlocked. */
>>     r0=[p0];
>>
>> you don't really want to do this very often.  The execution pipeline
>> looks something like this
>>
>>     IF0 IF1 IF2 ID AC M0 M1 M2 EX WB
>>
>> AC is where addresses are computed before they are fed into the memory pipe.
>> Mx are memory access stages; they overlap with other things not needed
>> for this discussion.
>> IFx instruction fetch
>> ID instruction decode
>> WB write back
>> EX execute; actually Blackfin has two stages of execution, the other
>> one overlaps with M2.
>>
>> There are 3 stages in the pipeline for accessing memory on these parts,
>> and the value loaded into register p0 is not fed back until the end of
>> the pipeline, so anything that uses p0 has to wait until then.
>
> Why not read 3 into 3 registers and then write them? Doesn't this avoid
> the delays?
>
Agreed, this is what is typically done and what I plan on doing in the
search-for-last-nonzero code.  I just haven't done so yet.
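
For reference, the batching idea in C, as a sketch only. The function name
gather_scan4 and the batch size of 4 are my own choices for illustration,
and a compiler is free to reorder these loads, so on Blackfin this would
really be written by hand in assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Gather n coefficients in scan order; n must be a multiple of 4. */
static void gather_scan4(int16_t *out, const int16_t *data,
                         const uint8_t *scantable, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        /* Issue the four independent index loads first ... */
        int j0 = scantable[i + 0];
        int j1 = scantable[i + 1];
        int j2 = scantable[i + 2];
        int j3 = scantable[i + 3];
        /* ... so each dependent data load has other work ahead of it. */
        out[i + 0] = data[j0];
        out[i + 1] = data[j1];
        out[i + 2] = data[j2];
        out[i + 3] = data[j3];
    }
}
```

The point is only the ordering: the index loads do not depend on each other,
so the wait for the first index can overlap with fetching the next ones.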

>
>>
>> This is what I/we have to work with on these lighter weight embedded
>> processors.  We are talking about fairly simple microarchitectures in
>> comparison to things like PPC and x86.  Actually, this pipeline layout
>> works very well for numerical calculations that don't require
>> permutations :).
>>
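>> #include <stdio.h>
>>
>> int main (void)
>> {
>>   int clk;
>>   int word = 0;
>>   int *mem[10] = { &word };  /* mem[0] must hold a valid address for the 2nd load */
>>
>>   while (1) {
>>     /* time a load whose address comes from the load just before it */
>>     asm volatile (
>>        "%0=cycles;\n\t"
>>        "p0=[%1];\n\t"
>>        "r0=[p0];\n\t"
>>        "r0=cycles;\n\t"
>>        "%0=r0-%0 (ns);\n\t"
>>        : "=d" (clk) : "a" (mem) : "R0","P0","memory");
>>     printf ("%d\n", clk);
>>   }
>> }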
>>
>> This results in 6. Subtract 1 for the last read of cycles and we get 5;
>> take away the two instructions that execute and you have 3 dead cycles.  What is
>
> Benchmarking 2 instructions with a single iteration is meaningless even on
> a simple pipelined arch IMHO; you should at least do 10 read+write.
>

Yep, I usually run 4 at a time when I need to do the zigzag permute.
Come to think of it, where is the best place to initialize my own scan
pattern, one stored as offsets from the previous element instead of
offsets from the base?  Something like 2, 14, -14, and so on.
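
To make the relative scan pattern concrete, here is a sketch of how such a
table could be derived from an ordinary absolute scan table. The names
build_scan_deltas and gather_by_delta are hypothetical, and the deltas here
are in elements; multiplying by sizeof(int16_t) gives byte offsets of the
kind mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Turn absolute scan positions into steps from the previous position.
 * The first step is taken from element 0 of the block. */
static void build_scan_deltas(int16_t *delta, const uint8_t *scan, int n)
{
    int prev = 0;
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        delta[i] = (int16_t)(scan[i] - prev);
        prev = scan[i];
    }
}

/* Walk the block with a moving pointer instead of base + index. */
static void gather_by_delta(int16_t *out, const int16_t *data,
                            const int16_t *delta, int n)
{
    const int16_t *p = data;
    for (int i = 0; i < n; i++) {
        p += delta[i];   /* step relative to the last element visited */
        out[i] = *p;
    }
}
```

For the start of the usual zigzag order (0, 1, 8, 16, 9, 2) the element
deltas come out as 0, 1, 7, 8, -7, -7.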

> Also, decoding involves a mandatory permutation, so no matter what
> idct_permutation is set to it will be the same speed. Wisely setting the
> IDCT permutation can simplify the IDCT and thus speed it up. This is a
> high level optimization and won't make code slower no matter how
> expensive the permutation is, as there aren't more permutations done.
>
> The extra cost is just on the encoder side, where it's just a single if()
> if it's the no permutation case ...
>
>

This is not that expensive. I completely agree with you, Michael: there is
no cost to doing this at this level of the code. I guess if the
permutation check were done for every pixel we would have a problem.
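
A sketch of what "a single if() per block" looks like, just to show where
the cost sits. The enum perm_kind and the function permute_block_once are
invented for this example and are not the actual encoder code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag for the selected IDCT's coefficient order. */
enum perm_kind { PERM_NONE, PERM_TABLE };

/* Reorder one 8x8 block in place for the selected IDCT. */
static void permute_block_once(int16_t *block, const uint8_t *perm,
                               enum perm_kind kind)
{
    int16_t tmp[64];

    assert(block != NULL);
    if (kind == PERM_NONE)   /* one test per block, not per coefficient */
        return;

    memcpy(tmp, block, sizeof(tmp));
    for (int i = 0; i < 64; i++)
        block[perm[i]] = tmp[i];
}
```

The check runs once for 64 coefficients, so its cost is small next to the
permutation loop it guards.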

Marc