[FFmpeg-devel] a64 encoder 7th round
Bitbreaker/METALVOTZE
bitbreaker
Wed Feb 4 08:02:02 CET 2009
> is this a fully unrolled loop?
> because if it is i would expect the following to be faster
No, not fully, only partly, to save enough overhead; otherwise memory
usage would explode (though using this method I'd also save a few bytes
in addition) :-) Since I write in $100-byte blocks, I only need to
increment one counter per round, so the overhead is not too big. Smaller
chunks are not possible that way, though, or I would spoil linear writing
even more and be forced to do the whole copy outside the displayed area.
This way I can at least use the time while the lower and upper parts are
displayed, which I need to do. If I write completely linearly I have more
loop overhead, unless of course I do a complete unroll as you suggested,
but then the memory footprint is again rather excessive: around 5 KB for
the unrolled loop alone. Considering that I also want to include a file
browser to select the video to watch, and to keep all the decoders I need
in memory, that might get tight. But I'll keep it in mind for further
improvement: a color RAM setting routine might also be useful for other
modes, like ecmh, so I could share it and unroll if there's enough space
left. For now I'll do it this way: fast enough to cope with jitter, and
saving 30% memory on the encoded frames.
>> when finished, restore stack pointer
>>
>> this allows me to save 2 more cycles per 4 byte lookup as i can just pull
>> data from stack within 3 cycles and even get the stackpointer
>> incremented for free by that. Too bad that stack area is fixed, else i'd
>> do it the other way round by pushing bytes on the stack.
>> This is on the one hand dirty, but works fine, the real stackpointer and
>> data is far away from my LUT i placed in the stack, so no collisions
>> expected. My lookup consists of 16 entries each 4 bytes. The size does
>> not hurt, 6502 code grew anyway by 0x500 bytes by the latest
>> optimizations (twice the size now) ;-)
>>
>> LUT looks like:
>>
>> 8,8,8,8
>> 8,8,8,9
>> 8,8,9,8
>> 8,8,9,9
>> ...
>
> 8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8
>
> 19 elements
> why is this better? 64 bytes dont matter ...
Sure, I can use offsets in the LUT other than 4*n, but I guess this will
add another lookup in the codec. Or is there any simple math trick to
get the fitting position within the LUT?
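
To make the "another lookup" concrete, here is a minimal sketch in Python
(not the 6502 code) that derives the start offset of each 4-entry pattern
inside the 19-element sequence quoted above. The names SEQ, WINDOW,
build_offsets and lookup are hypothetical, and the bit order (first entry
is the most significant bit, 8 maps to 0, 9 maps to 1) is an assumption
chosen to match the order of the original 16x4 table.

```python
# Overlapped LUT as quoted above: every window of 4 consecutive
# entries is one of the 16 possible 8/9 patterns.
SEQ = [8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8]
WINDOW = 4


def build_offsets(seq, window):
    """Return a list mapping pattern index -> start offset in seq.

    Assumption: the pattern index is read with 8 as bit 0 and 9 as
    bit 1, first entry being the most significant bit.
    """
    offsets = {}
    for start in range(len(seq) - window + 1):
        index = 0
        for value in seq[start:start + window]:
            index = (index << 1) | (value - 8)
        # Keep the first position at which a pattern shows up.
        offsets.setdefault(index, start)
    return [offsets[i] for i in range(1 << window)]


# 16-entry offset table: this is the extra lookup step.
OFFSETS = build_offsets(SEQ, WINDOW)


def lookup(index):
    """Fetch the 4 values for a pattern index via the offset table."""
    start = OFFSETS[index]
    return SEQ[start:start + WINDOW]
```

The sketch does not attempt a closed-form expression for the position; it
simply precomputes a 16-entry offset table, so the total would be 19 + 16
bytes instead of 64, at the cost of one extra indexed read per lookup.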
> why not do it with 5 instead of 4 (36 vs. 160)
> 8, 8, 8, 8, 8, 9, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 9, 9, 9, 8, 8, 8, 9, 9, 8, 9, 9, 9, 8, 9, 8, 9, 8, 8, 8, 8
eeks, that is not 2^n :-) But i need to end at a 0x100 border (or at
least then waste the padding at the end if i need less)
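
The size figures in this exchange (19 vs. 64 and 36 vs. 160) can be
reproduced with a small Python sketch. It assumes one byte per entry and
that the overlapped table holds each of the 2^n patterns exactly once as a
window of n consecutive entries; the helper names are hypothetical.

```python
# 36-element sequence as quoted above (window length 5).
SEQ5 = [8, 8, 8, 8, 8, 9, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 9, 9,
        9, 8, 8, 8, 9, 9, 8, 9, 9, 9, 8, 9, 8, 9, 8, 8, 8, 8]


def plain_size(n):
    """Bytes for a plain table: 2^n entries of n bytes each."""
    return n * 2 ** n


def overlapped_size(n):
    """Bytes for an overlapped table: 2^n window starts plus the
    n - 1 trailing entries that complete the last window."""
    return 2 ** n + n - 1


def covers_all_patterns(seq, n):
    """True if the windows of length n in seq include all 2^n
    patterns (seq is assumed to contain only the values 8 and 9)."""
    windows = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return len(windows) == 2 ** n
```

For n = 4 this gives 19 versus 64 bytes, and for n = 5 it gives 36 versus
160 bytes, matching the numbers quoted above.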