[FFmpeg-devel] a64 encoder 7th round

Sat Jan 31 20:26:20 CET 2009

Michael Niedermayer wrote:
> On Sat, Jan 31, 2009 at 01:59:48PM +0100, Bitbreaker/METALVOTZE wrote:
>   
>>>>> now a few questions, i hope i am not too annoying
>>>>> the low nibble is either 15 or 8 if i did RTFS correctly
>>>>> do you have 64 bytes left for a LUT?
>>>>> if so you can do some code equivalent to
>>>>>
>>>>> x= read_net();
>>>>> dst[0]=lut[x  ];
>>>>> dst[1]=lut[x+1];
>>>>> dst[2]=lut[x+2];
>>>>> dst[3]=lut[x+3];
>>>>> x= read_net();
>>>>> dst[4]=lut[x  ];
>>>>> dst[5]=lut[x+1];
>>>>> dst[6]=lut[x+2];
>>>>> dst[7]=lut[x+3];
>>>>> ...
>>>>>
>>>> Gotta see tomorrow if that works...
>> Hmm, i am afraid, a lookup from the table is as expensive as reading a 
>> byte from the network :-) So over all you end up at the same speed again.
>>     
>
> but it's a lower bitrate, so if it's not worse in any other way you at least
> have smaller files, and not by an insignificant amount
> was there a disadvantage in the 5col mode over 4col except filesize?
>   
Files are already rather small compared to a normal video. But for the 
sake of size i might pack 2 nibbles into one byte (only 2, so we still 
keep the option to use the full range of colors at some point, you never 
know). That would save 0x200 bytes at the cost of a LUT and extra code in 
the displayer. Also, the last packet would then end at a 0x100 boundary, 
which would avoid even more extra code on the c64 side. So i might 
implement that, and therefore also interleave the charset so i get a 
constant packet size.
The disadvantage of 5col mode is the size of a frame itself, as i can't 
load it within 2 vsyncs. 4col mode works well within 2 vsyncs. Ecmh mode 
even needs 4 vsyncs, as it loads 0xc00 bytes per frame and forcing 
additional badlines consumes even more time.
As for 5col mode, i am not sure anyway whether it is the nicest thing to 
either lift the darkest area or lower the brightest area if both occur in 
a single block. But that does nothing for the frame size and loading 
times :-)
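
For illustration, packing two nibbles into one byte could look like the 
following C sketch. The function names and the byte layout (first value in 
the high nibble) are assumptions made here, not taken from the encoder. The 
unpacking uses shifts and masks for clarity; on the C64 a LUT would likely 
replace them, as mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n 4-bit values (n must be even) into n/2 bytes.
 * Assumed layout: first value in the high nibble, second in the low nibble.
 * Returns the number of bytes written. */
static size_t pack_nibbles(const uint8_t *src, size_t n, uint8_t *dst)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        assert(src[i] < 16 && src[i + 1] < 16);
        dst[i / 2] = (uint8_t)((src[i] << 4) | src[i + 1]);
    }
    return n / 2;
}

/* Inverse operation: split each of the n packed bytes into two values.
 * Returns the number of values written. */
static size_t unpack_nibbles(const uint8_t *src, size_t n, uint8_t *dst)
{
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = (uint8_t)(src[i] >> 4);
        dst[2 * i + 1] = (uint8_t)(src[i] & 0x0f);
    }
    return 2 * n;
}
```

Packing halves the data, so a saving of 0x200 bytes would correspond to a 
0x400-byte block of nibble values; that block size is an inference, not 
something stated in this thread.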

>>    [setup as always]
>>    ...
>>   
>>    ldx $de00
>>    lda lut+0,x
>>    sta dest,y
>>    iny
>>    lda lut+1,x
>>    sta dest,y
>>    iny
>>    lda lut+2,x
>>    sta dest,y
>>    iny
>>    lda lut+3,x
>>    sta dest,y
>>    iny
>>
>> that is 12 cycles per reconstructed byte in the inner loop.
>>     
>
> i see 4+5+2=11 per byte output + some overhead per each 4 byte group
>   
yes, i counted the ldx in, as it can't be avoided per block.
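
The two figures (11 and 12) can be reconciled with a small C sketch of the 
arithmetic. The cycle counts are the standard 6502 timings for the 
addressing modes used in the loop above; the names are made up for this 
example, and the sketch assumes the indexed load never crosses a page 
boundary (which would cost one extra cycle).

```c
#include <assert.h>

/* 6502 cycle counts; lda abs,x costs one more if the index crosses a page,
 * which this sketch assumes does not happen. */
enum {
    CYC_LDX_ABS   = 4,  /* ldx $de00    */
    CYC_LDA_ABS_X = 4,  /* lda lut+n,x  */
    CYC_STA_ABS_Y = 5,  /* sta dest,y   */
    CYC_INY       = 2   /* iny          */
};

/* Cycles spent on one output byte, ignoring the shared ldx. */
static int cycles_per_byte_body(void)
{
    return CYC_LDA_ABS_X + CYC_STA_ABS_Y + CYC_INY;
}

/* Cycles for a whole group: one ldx plus the per-byte body for each byte. */
static int cycles_per_group(int bytes_per_group)
{
    assert(bytes_per_group > 0);
    return CYC_LDX_ABS + bytes_per_group * cycles_per_byte_body();
}
```

With 4 bytes per group this gives 48 cycles per group, i.e. 12 per byte 
with the ldx spread over the group, or 11 per byte if the ldx is left out.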
> but this can be improved, you dont need to write 4 consecutive bytes
> you can write (0,64,128,192), (1,65,129,193), ...
> code should be:
>
> ldx $de00
> lda lut+0,x
> sta dest,y
> lda lut+1,x
> sta dest+64,y
> lda lut+2,x
> sta dest+128,y
> lda lut+3,x
> sta dest+192,y
> iny
>
> this saves 3 iny per 4 bytes written, thus 2*3/4=1.5 cycles faster per byte
> the same trick might be usable for the generic copy from network as well
> maybe?
>   
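
The interleaved store pattern suggested above, written out as a C sketch 
for a 256-byte destination. The function name, the lut_size parameter, and 
the LUT contents are hypothetical; only the store order (y, y+64, y+128, 
y+192) is taken from the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUPS 64  /* one index step per 4-byte group, 64 * 4 = 256 bytes */

/* Decode GROUPS input bytes into 256 output bytes. Each input byte x
 * selects four consecutive LUT entries; they are stored 64 bytes apart so
 * that a single index y serves all four stores. */
static void decode_interleaved(const uint8_t *in, const uint8_t *lut,
                               size_t lut_size, uint8_t dest[256])
{
    for (int y = 0; y < GROUPS; y++) {
        uint8_t x = in[y];                 /* ldx $de00       */
        assert((size_t)x + 3 < lut_size);  /* stay inside the LUT */
        dest[y]       = lut[x];            /* sta dest,y      */
        dest[y + 64]  = lut[x + 1];        /* sta dest+64,y   */
        dest[y + 128] = lut[x + 2];        /* sta dest+128,y  */
        dest[y + 192] = lut[x + 3];        /* sta dest+192,y  */
    }
}
```

One consequence of this layout is that the sender has to order the data 
accordingly: the k-th byte of group y ends up at offset y + 64*k rather 
than at 4*y + k.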
sure, i can also completely unroll the loop and then save even more, but 
as long as i don't save my 18700 cycles there is no point, as nothing 
improves. So no need to make things more complex :-) See, for storing 
the bytes i'll always need my 5 cycles (or 4 on a complete unroll without 
index), and there is nothing that can be done to avoid that. Depending on 
how much i unroll things, 4-6 cycles are needed for getting the byte, or 
for getting some bytes and doing some kind of decoding. If i saved 6 
cycles per byte in the best case and need to load 10*256 bytes (one 
average frame in 5col), i would save 15360 cycles overall, and that only 
if the bytes were loaded at no cost automagically :-) That is still short 
of the 18700 cycles i need to save :-) So how to solve that? :-)
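
The budget estimate above as a small C sketch, using only the numbers from 
this mail (6 cycles per byte, 10*256 bytes per frame, 18700 cycles to 
save). The function names are made up for the example.

```c
#include <assert.h>

/* Cycles saved per frame if every byte gets cheaper by saved_per_byte. */
static long cycles_saved(long saved_per_byte, long bytes_per_frame)
{
    assert(saved_per_byte >= 0 && bytes_per_frame >= 0);
    return saved_per_byte * bytes_per_frame;
}

/* Cycles still missing to reach the target; 0 if the target is met. */
static long shortfall(long target, long saved)
{
    return saved >= target ? 0 : target - saved;
}
```

For the best case described here, cycles_saved(6, 10 * 256) is 15360, and 
shortfall(18700, 15360) leaves 3340 cycles still to be found.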

Kindest regards,

Toby