[FFmpeg-devel] a64 encoder 7th round
Wed Feb 4 19:05:26 CET 2009
On Wed, Feb 04, 2009 at 08:02:02AM +0100, Bitbreaker/METALVOTZE wrote:
> > is this a fully unrolled loop?
> > because if it is i would expect the following to be faster
> No, not fully, but partly to save enough overhead, else memory usage
> would explode (though using this method i'd also save a few bytes in
> addition) :-) However as i write in $100 blocks i just need to increment
> one counter per round, so overhead is not too big. However smaller
> chunks are not possible that way, else i spoil linear writing even more
> and am forced to do the whole copy off the display. This way i can at
> least use the time while the lower and upper parts are displayed, which i
> need to. When i write completely linearly i have more loop overhead unless
> i do a complete unroll of course, as you suggested, but then the memory
> footprint is again kind of excessive: around 5kb for the unrolled loop
> alone. Considering that i also want to include a file browser to select
> the video to watch, and to keep all the decoders i need in memory, that
> might get tight. But i'll keep it in mind for further improvement; a
> colorram setting routine might also be useful for other modes, like ecmh,
> so i could share it and unroll if there's enough space left. For now i'll
> do it this way: fast enough to cope with jitter, and saving 30% memory on
> the encoded frames.
> >> when finished, restore stack pointer
> >> this allows me to save 2 more cycles per 4 byte lookup as i can just pull
> >> data from the stack within 3 cycles and even get the stackpointer
> >> incremented for free by that. Too bad that the stack area is fixed, else
> >> i'd do it the other way round by pushing bytes on the stack.
> >> This is dirty on the one hand, but works fine; the real stackpointer and
> >> data are far away from the LUT i placed in the stack, so no collisions
> >> expected. My lookup consists of 16 entries of 4 bytes each. The size does
> >> not hurt, the 6502 code grew by 0x500 bytes anyway with the latest
> >> optimizations (twice the size now) ;-)
> >> LUT looks like:
> >> 8,8,8,8
> >> 8,8,8,9
> >> 8,8,9,8
> >> 8,8,9,9
> >> ...
> > 8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8
> > 19 elements
> > why is this better? 64 bytes don't matter ...
> Sure, i can use other offsets in the LUT besides 4*n, but this will add
> another lookup in the codec i guess, or is there any simple math trick
> to get the fitting position within the LUT?
hmm, i think a table is the simplest way ...
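
A rough sketch of such a table, in Python purely for illustration (the real
lookup would live in the encoder and the 6502 player). It scans the
19-element sequence quoted above and records where each 4-byte pattern
starts. The names PACKED_LUT and build_offset_table are made up here, and
the index convention (8 -> bit 0, 9 -> bit 1, first byte is the most
significant bit) is an assumption chosen to match the order of the original
16x4 LUT listing.

```python
# Overlapping LUT from the thread: every 4-byte pattern of 8s and 9s
# appears as a contiguous window somewhere in these 19 bytes.
PACKED_LUT = [8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8]


def build_offset_table(packed, width=4, low=8):
    """Map each pattern index to its start offset in the packed LUT.

    The pattern index treats each byte as one bit (low -> 0, otherwise 1),
    most significant bit first, so index 1 means (8, 8, 8, 9).
    """
    offsets = {}
    for start in range(len(packed) - width + 1):
        index = 0
        for value in packed[start:start + width]:
            index = (index << 1) | (0 if value == low else 1)
        # Keep the first occurrence if a pattern shows up more than once.
        offsets.setdefault(index, start)
    # Return a flat list; a missing pattern raises KeyError here.
    return [offsets[i] for i in range(1 << width)]


OFFSETS = build_offset_table(PACKED_LUT)
print(OFFSETS)
```

With this, the old offset 4*n is replaced by OFFSETS[n], a 16-byte table on
top of the 19-byte sequence instead of the 64-byte LUT.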
> > why not do it with 5 instead of 4 (36 vs. 160)
> > 8, 8, 8, 8, 8, 9, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 9, 9, 9, 8, 8, 8, 9, 9, 8, 9, 9, 9, 8, 9, 8, 9, 8, 8, 8, 8
> eeks, that is not 2^n :-) But i need to end at a 0x100 border (or at
> least then waste the padding at the end if i need less)
you could write to
just one byte that needs special handling
or with 6 or 7 you end at 251, having 4 elements left to deal with differently
or you could even do
0, 36, 73, 109, 146, 182, 219
1, 37, 74, 110, 147, 183, 220
36, 72,109, 145, 182, 218, 255
very simple on the decoder side, no special cases but 3 bytes of 256 are
written to twice ...
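
The overlap can be checked with a few lines of Python (a sketch only, the
decoder itself is 6502 code). It assumes the table above reads as 7 write
streams starting at the offsets in the first row, each advancing by one per
loop iteration for 37 iterations (+0 to +36). STARTS, STEPS and count_writes
are names invented for this example.

```python
# Start offsets of the 7 write streams, copied from the table in the mail.
STARTS = [0, 36, 73, 109, 146, 182, 219]
STEPS = 37  # each stream advances from +0 to +36


def count_writes(starts=STARTS, steps=STEPS, size=256):
    """Return how often each of the `size` bytes is written."""
    writes = [0] * size
    for step in range(steps):       # one loop counter, no special cases
        for start in starts:        # 7 stores per iteration
            writes[start + step] += 1
    return writes


writes = count_writes()
doubled = [addr for addr, n in enumerate(writes) if n == 2]
print(min(writes), max(writes), doubled)
```

Every byte of the page is written at least once, and the only bytes written
twice are the ones where two neighbouring streams meet.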
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein