[FFmpeg-devel] a64 encoder 7th round

Wed Feb 4 08:02:02 CET 2009

> is this a fully unrolled loop?
> because if it is i would expect the following to be faster

No, not fully, only partly, to save enough overhead; otherwise memory
usage would explode (though using this method i'd also save a few bytes
in addition) :-) As i write in $100 blocks i only need to increment one
counter per round, so the overhead is not too big. Smaller chunks are
not possible that way, else i spoil linear writing even more and am
forced to do the whole copy outside the visible display. This way i can
at least use the time while the lower and upper parts are displayed,
which i need to.

When i write completely linearly i have more loop overhead, unless i do
a complete unroll as you suggested, but then the memory footprint is
again kind of excessive: around 5 KB for the unrolled loop alone.
Considering that i also want to include a file browser to select the
video to watch, and keep all the decoders i need in memory, that might
get tight. But i'll keep that in mind for further improvement. A
colorram setting routine might also be useful for other modes, like
ecmh, so i could share it and unroll if there's enough space left. For
now i'll do it this way: fast enough to cope with jitter, and saving 30%
memory on the encoded frames.
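
To illustrate what writing in $100 blocks with a single counter does to
the write order, here is a rough model in Python (not the real 6502
code). It assumes the copy covers the 1000 colorram cells of a 40x25
screen; `copy_by_pages` and the test data are made-up names for this
sketch only.

```python
# Model of a partly unrolled copy: one counter x, one store per $100 page.
PAGE = 0x100
SIZE = 1000  # assumption: 40x25 colorram cells


def copy_by_pages(src, dst):
    """Copy src to dst like a page-unrolled loop would; return the write order."""
    order = []
    pages = (SIZE + PAGE - 1) // PAGE      # 4 pages cover 1000 bytes
    for x in range(PAGE):                  # the single loop counter
        for page in range(pages):          # this inner part is what gets unrolled
            addr = page * PAGE + x
            if addr < SIZE:                # the last page is only partly used
                dst[addr] = src[addr]
                order.append(addr)
    return order


src = [i & 0x0F for i in range(SIZE)]      # dummy colour nibbles
dst = [0] * SIZE
order = copy_by_pages(src, dst)
print(order[:8])  # [0, 256, 512, 768, 1, 257, 513, 769]
```

The writes jump between the four pages on every round instead of
running from the first cell to the last, which is the non-linear order
mentioned above. Smaller blocks would mean more interleaved streams.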

>> when finished, restore stack pointer
>>
>> this allows me to save 2 more cycles per 4 byte lookup as i can just pull 
>> data from stack within 3 cycles and even get the stackpointer 
>> incremented for free by that. Too bad that stack area is fixed, else i'd 
>> do it the other way round by pushing bytes on the stack.
>> This is on the one hand dirty, but works fine, the real stackpointer and 
>> data is far away from my LUT i placed in the stack, so no collisions 
>> expected. My lookup consists of 16 entries each 4 bytes. The size does 
>> not hurt, 6502 code grew anyway by 0x500 bytes by the latest 
>> optimizations (twice the size now) ;-)
>>
>> LUT looks like:
>>
>> 8,8,8,8
>> 8,8,8,9
>> 8,8,9,8
>> 8,8,9,9
>> ...
> 
> 8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8
> 
> 19 elements
> why is this better? 64 bytes dont matter ...

Sure, i can use other offsets in the LUT besides 4*n, but this will add 
another lookup in the codec i guess. Or is there any simple math trick 
to get the fitting position within the LUT?
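
One option that needs no math at run time is a 16-entry offset table
computed once. A minimal sketch in Python (not the actual codec code),
assuming the 19-element sequence quoted above and the row order of the
plain LUT (most significant bit first, clear = 8, set = 9). The names
`row`, `build_offsets` and `OFFSETS` are invented for this example.

```python
# Overlapped table from the quoted reply: every 4-byte window is one LUT row.
SEQ = [8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8]
WIDTH = 4


def row(n):
    """Row n of the plain 16x4 LUT: bit clear -> 8, bit set -> 9, MSB first."""
    return [9 if (n >> (WIDTH - 1 - b)) & 1 else 8 for b in range(WIDTH)]


def build_offsets(seq, width):
    """Map each pattern index n to the start of its window inside seq."""
    offsets = {}
    for pos in range(len(seq) - width + 1):
        n = 0
        for value in seq[pos:pos + width]:
            n = (n << 1) | (value - 8)     # 8 -> bit 0, 9 -> bit 1
        offsets.setdefault(n, pos)
    return [offsets[n] for n in range(1 << width)]


OFFSETS = build_offsets(SEQ, WIDTH)

# Every row of the plain LUT can be read back from the short sequence.
for n in range(1 << WIDTH):
    assert SEQ[OFFSETS[n]:OFFSETS[n] + WIDTH] == row(n)

print(OFFSETS)  # [0, 1, 2, 5, 3, 9, 6, 11, 15, 4, 8, 10, 14, 7, 13, 12]
```

If the encoder can emit these offsets directly instead of 4*n, the
extra lookup stays on the encoder side and only the 19 bytes are needed
on the 6502. If the offset table has to live next to the decoder, it is
19 + 16 = 35 bytes against 64 for the plain LUT.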

> why not do it with 5 instead of 4 (36 vs. 160)
> 8, 8, 8, 8, 8, 9, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 9, 9, 9, 8, 8, 8, 9, 9, 8, 9, 9, 9, 8, 9, 8, 9, 8, 8, 8, 8

Eeks, that is not 2^n :-) And i need to end at a 0x100 boundary (or at 
least waste the padding at the end if i need less)
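
For reference, the sizes quoted for the 5-wide variant can be checked
with a few lines of Python. This is only a sanity check of the sequence
as quoted; the page arithmetic at the end assumes the table sits alone
in one $100 page, which is a guess at the layout, and all names are
made up for the sketch.

```python
# The 36-element sequence from the quoted reply, windows are 5 wide.
SEQ5 = [8, 8, 8, 8, 8, 9, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 9, 9, 9, 8,
        8, 8, 9, 9, 8, 9, 9, 9, 8, 9, 8, 9, 8, 8, 8, 8]
WIDTH = 5
PAGE = 0x100

# Collect every distinct 5-byte window in the sequence.
windows = {tuple(SEQ5[i:i + WIDTH]) for i in range(len(SEQ5) - WIDTH + 1)}

plain_size = (1 << WIDTH) * WIDTH   # 32 rows of 5 bytes each
unused = PAGE - len(SEQ5)           # bytes of the page left over (assumed layout)

print(len(windows), len(SEQ5), plain_size, unused)  # 32 36 160 220
```

All 32 patterns are present, so the 36 bytes do replace the 160-byte
table, at the price of 5-bit groups that no longer line up with byte
boundaries in the encoded data.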