[FFmpeg-devel] a64 encoder 7th round

Bitbreaker/METALVOTZE bitbreaker
Tue Jan 27 22:04:47 CET 2009


> so you claim that copying 256 chars is faster than copying none ?
> if not a system that choose per frame if it used the last or used new
> ones would be better, a fixed 4frame pattern hardly is optimal.
>   
Okay, seems like i have to explain in more detail :-)
The multicolor displayer triggers an interrupt every 2 frames and does
nothing more than flip a bit in a single register (a kind of charmap
pointer of the video chip), thus advancing to a different charmap (0x400
bytes in size). After 4 charmaps have been displayed this way, the
interrupt routine first switches to the previously loaded charset and
screens, then unlocks the loader to load the next charset + 4 screens
into the buffer that was just displayed (in 0x400 byte chunks/packets
via network). So it is more or less double buffering with 4 preloaded
frames.
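
The scheme can be modelled in a few lines. This is only a rough sketch in
Python of the logic described above, not the displayer code; the two
buffers, the constants and all names are assumptions made for
illustration:

```python
# Toy model of the double buffering described above (not the real displayer).
# Assumptions: 2 buffers, each holding 1 charset + 4 screens, and one
# interrupt every 2 frames. All names are made up for illustration.

SCREENS_PER_BUFFER = 4
FRAMES_PER_INTERRUPT = 2


def simulate(interrupts):
    """Return a list of (frame, buffer shown, screen shown, buffer loading)."""
    shown, screen = 0, 0          # buffer and screen currently displayed
    loading = None                # buffer the loader may fill, None = locked
    log = []
    for irq in range(interrupts):
        log.append((irq * FRAMES_PER_INTERRUPT, shown, screen, loading))
        screen += 1               # flip the charmap pointer to the next screen
        if screen == SCREENS_PER_BUFFER:
            # All 4 screens shown: switch to the preloaded buffer first,
            # then unlock the loader for the buffer that was just displayed.
            loading = shown
            shown = 1 - shown
            screen = 0
    return log


for entry in simulate(9):
    print(entry)
```

In this model each buffer stays on screen for 4 x 2 = 8 frames, which is
the time the loader has to fill the other one.
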
In your suggested scenario i have to take care of the worst case/
crossover point, know beforehand how many bytes i need to load, and
handle varying framesizes. This all sounds trivial, but:
Odd framesizes bloat the loader, as do varying framesizes, because i
need additional checks. Loading can easily get 50% slower in that case
(5 additional cycles on top of 11 cycles per byte, or on top of just 8
cycles if using generated speedcode), so the crossover point ends up
rather low, as handling a charset delta consumes even more cycles. As
the worst case is as slow as loading plain frames, there is no gain:
framerate and quality do not improve, but it adds a lot of complexity to
the displayer (which is already rather big, as it needs to drive the
network chip and handle packets). So all i'd do is save disk space, but
a 500MB mpeg file shrinks to a ~50MB .a64 file at the moment, so not too
much of a waste :-)
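
The cycle numbers above can be put into a small calculation. A minimal
sketch in Python, assuming the 11 and 8 cycles are per-byte costs and
ignoring any extra cost for handling a delta; the function names are
made up for illustration:

```python
# Per-byte loader cost in cycles, numbers taken from the text above.
LOOP_COST = 11        # plain loop
SPEEDCODE_COST = 8    # generated speedcode
EXTRA_CHECKS = 5      # additional cycles for the size checks


def slowdown_percent(base, extra):
    """Relative slowdown when `extra` cycles are added to `base` per byte."""
    return 100.0 * extra / base


def breakeven_fraction(base, extra):
    """Frame size (relative to a full frame) below which the slower
    per-byte loader still finishes earlier than loading a plain frame."""
    return base / (base + extra)


print(round(slowdown_percent(LOOP_COST, EXTRA_CHECKS), 1))       # 45.5
print(round(slowdown_percent(SPEEDCODE_COST, EXTRA_CHECKS), 1))  # 62.5
print(breakeven_fraction(LOOP_COST, EXTRA_CHECKS))
print(breakeven_fraction(SPEEDCODE_COST, EXTRA_CHECKS))
```

So under these assumptions a variable-size frame has to be well below
about two thirds of a full frame before it loads any faster, and that is
before the cost of applying a delta is counted.
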
The 6502 is just a very scarce platform, offering only 56 different
instructions, not all of them even orthogonal. I have only three 8 bit
registers, an 8 bit data bus and a 16 bit address bus. There is no
multiply or divide instruction. So concepts that work out fine on
today's machines often have to be done in a completely different way on
such machines, often by keeping things plain and simple, or by doing
some fake that appears to do the same ;-) I invested quite some time in
finding the appropriate display methods, i did the first converter
prototypes years ago, and discussed a lot with other c64 scene members
to work out the modes i have implemented so far. As for doing things on
a c64, i can look back to the year 1988, when i made my first attempts
on that machine. So things on the c64 side should already be rather
optimal, but of course the codecs themselves may still have lots of
potential for (speed/quality) improvements. Saving size so far does not
bring any improvement, except when i can reduce the framesize in every
case.

> didnt i read somewhere that there was some kind or interrupt per row/line
> from which various things could be changed?
>   
Sure, you can, and i do so in ecmh mode, or rather, i do it on every
4th line. That is why i need the charmap twice: i force the video chip
to alternate and reload the charmap every 4th line. This is however time
consuming, as i have to trigger 25 interrupts per frame and get the
timing cycle exact by some coding tricks + a hardware timer. Also, when
i force the video chip to reload the current line of the charmap, it
takes over the bus and the cpu has to idle while the 40 bytes are loaded
by the video chip. That is what is called a bad line in that link i
mentioned.
However, this trick does not work with the colorram (which sets the
foreground color of each char), as it sits at a fixed address and there
is no register available to change that. (In hardware, this is even an
additional 1k RAM chip besides the normal 64k, so the videochip can
access that area without disturbing the CPU)
So in case of the multicol charset mode, i would only be able to set a
new charmap every 4th line for example, but that would not increase the
amount of colors. It does, however, when i am using the extended
background color mode, but then the charset size shrinks to 64 chars
only. The result is not satisfying, i gave it a try already, and that is
also how the ecmh mode started to exist, though back then i tried it
with a selfmade handoptimized charset.
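
The charset shrinks to 64 chars because in extended background color
mode the video chip uses the upper 2 bits of each screen code to pick
one of 4 background colors, leaving only 6 bits for the char number. A
small sketch of that split in Python (the function name is made up for
illustration):

```python
def decode_ecm(screen_code):
    """Split an 8 bit screen code the way extended background color mode does."""
    background = screen_code >> 6      # upper 2 bits: one of 4 background colors
    char = screen_code & 0x3F          # lower 6 bits: char number 0..63
    return background, char


print(decode_ecm(0xC1))  # (3, 1)
```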

> so why do you force it always to multicolor?
> if you just copy the stuff anyway, the encoder could choose per attribute
> cell which is better ...
>   
It is because i can't change the mode per cell, only on a per line or
even per frame basis. Also, i don't intend to mix modes within one
video, but rather have a video encoded into a mode of your choice.

> with this limitation a pure multicolor encoder should do the following
>    for each frame try all 3 fixed color triplets out of 16 that are
>    560 full frame encodes, isnt going to be terribly fast but it should
>    be easy to skip some of these triplets.
>    for each block try all of the 8 colors and then from the 4 choosen
>    colors select per pixel colors with error diffusion dither choosing
>    the best block with sum of abs diff in dct domain.
I have that special table color_mixes, that tells my code (not
multicolor) which colors are a good idea to mix (no matter if by
interlacing or dithering) and which colors are definitely a no go. There
are quite a lot of ugly combinations and some colors really clash
terribly in PAL. Also, having one color changed per block (while all
others stay the same) leads to a blocky result. I mean it, as in i know
it, as in i tried it, not only in that case, but also with several
converters for plain graphics for the c64. There are quite some tricks
to avoid that, either by doing certain dithering tricks, or by relying
more on the luminance of a color than on its chrominance.
By the way, in the ecmh mode i am doing something similar to what you
described above, more or less a bruteforce attempt with some exclusions
to speed things up to a reasonable time. I find the best
backgroundcolors by adding them incrementally, then find the best 2
backgroundcolors + colorram for each 8x8 block, as well as the best two
chars for that.
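
To give an idea of what "bruteforce with exclusions" can look like, here
is a minimal sketch in Python that enumerates the 560 triplets mentioned
in the quote and skips those containing a pair of colors that should not
be mixed. The exclusion list is made up for illustration and is not the
content of color_mixes, and this is not the search the encoder actually
does:

```python
from itertools import combinations

# Made-up exclusion list for illustration; the real color_mixes table differs.
FORBIDDEN_PAIRS = {frozenset((2, 5)), frozenset((4, 13))}


def candidate_triplets(forbidden=FORBIDDEN_PAIRS):
    """Yield all color triplets out of 16 that contain no forbidden pair."""
    for triplet in combinations(range(16), 3):
        pairs = (frozenset(pair) for pair in combinations(triplet, 2))
        if not any(pair in forbidden for pair in pairs):
            yield triplet


print(len(list(candidate_triplets(forbidden=set()))))  # 560 without exclusions
print(len(list(candidate_triplets())))                 # fewer with exclusions
```
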
Oh, and as for dithering: Pixels look really big, the 320x200 are
displayed on a 14" monitor with an fbas/s-video input. Error diffusion
is not the choice, except in some very rare cases, like when you display
320x200 with interlaced colors. What you use instead is ordered dither
with certain patterns, and some antialiasing techniques to improve
quality. See here for an example:
http://noname.c64.org/csdb/release/viewpic.php?id=11585&zoom=1
Even doing a kind of dithering by using certain shapes is common, like
the clouds in this pic show:
http://noname.c64.org/csdb/release/viewpic.php?id=25333&zoom=1
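
For reference, plain ordered dither between two colors looks like this.
A minimal sketch in Python using the standard 4x4 Bayer matrix; the
patterns used in hand-pixelled pics and in the encoder are not
necessarily this matrix, and the function name is made up:

```python
# Standard 4x4 Bayer threshold matrix.
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def ordered_dither(levels, color_a, color_b):
    """levels: rows of mix values in 0..1 (0 = all color_a, 1 = all color_b).
    Returns rows of color numbers, chosen per pixel by a fixed pattern."""
    out = []
    for y, row in enumerate(levels):
        out_row = []
        for x, level in enumerate(row):
            threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) / 16.0
            out_row.append(color_b if level > threshold else color_a)
        out.append(out_row)
    return out


# A flat 50% mix of dark grey (11) and grey (12) gives a checkerboard.
for row in ordered_dither([[0.5] * 8 for _ in range(4)], 11, 12):
    print(row)
```

Unlike error diffusion, the pattern only depends on the pixel position,
so flat areas stay regular instead of crawling around.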

>> Making things 
>> colorful gets really hard then and usually the result looks very blocky.
>>     
>
> did you try above? :)
>   
I know that even with fewer restrictions it already looks ugly, that is
how you can easily tell handdrawn/retouched pics from plain converted
pics :-) And having even less color choice won't be very helpful either.
Also, the colors 0..7 are not the colors you need most. E.g. brown,
orange, pink and the gray tones are all in the upper range 8..15. So it
is hard to get skin tones done, or to do a proper gray fade, without
them. If you want to hurt your eyes i can calculate some pics using the
lower range only, the limitation is easily done :-)
So look again at http://www.metalvotze.de/content/videomodes2.php and
look closer at the result of the ecmh mode, you already see some of
those blocky artefacts at the shoulder, which result from a lack of
colors per block.
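
To illustrate the point about the color ranges, here are the 16 colors
by number, and what is left of a gray fade when only 0..7 are allowed. A
small sketch in Python; the color names are the commonly used ones, and
the helper and the fade list are made up for illustration:

```python
# Usual names of the 16 C64 colors, indexed by color number.
C64_COLORS = [
    "black", "white", "red", "cyan", "purple", "green", "blue", "yellow",
    "orange", "brown", "light red", "dark grey", "grey", "light green",
    "light blue", "light grey",
]

# A gray fade from dark to bright, expressed in color numbers.
GRAY_FADE = [0, 11, 12, 15, 1]


def restrict_to_lower_range(colors):
    """Keep only the colors 0..7."""
    return [color for color in colors if color < 8]


print([C64_COLORS[c] for c in GRAY_FADE])
print([C64_COLORS[c] for c in restrict_to_lower_range(GRAY_FADE)])
```

With the lower range only, the fade collapses to black and white.
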
> uint8_t color[3], index;
> for(){
>     count= read_byte;
>     color= read_3bytes;
>     for(count--){
>         *dst++= read_byte;
>         *dst++= color[0];
>         *dst++= color[1];
>         *dst++= color[2];
>     }
> }
>
> this would be too slow?
>   
To read a byte from the network chip packet buffer and store it to the
correct position where it is directly displayable by the videochip, i
need 3 instructions if being lazy (there is of course some overhead for
loop handling and fetching a new packet).
The above code might look as follows in 6502 (just hacked fast, not
tested):

ldx $de00 ;count
lda $de01 ;byte1
sta buf
lda $de00 ;byte2
sta buf+1
lda $de01 ;byte3
sta buf+2
ldy #$00
;27 cycles used till here
more
lda $de00 ;data is offered 16 bit wide from network chip
sta dest,y
iny
lda buf
sta dest,y
iny
lda buf+1
sta dest,y
iny
lda buf+2
sta dest,y
iny
;41 more cycles
dex
beq out ; need to check for odd value of x
;4 more cycles if no branch
lda $de01 ;fetch next byte (network chip offers next byte automatically
          ;when both bytes were read, we are happy to have that feature)
sta dest,y
iny
lda buf
sta dest,y
iny
lda buf+1
sta dest,y
iny
lda buf+2
sta dest,y
iny
;+41 cycles
dex
bne more ; need to check for even value of x
;5 more cycles, as we hopefully branch a few times.
out

= 118 cycles to load 6 bytes (it gets a bit less per byte of course if
the loop runs a few times, and i already assumed the buffer to be in
zeropage, where we save one cycle per lda/sta).
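
The 118 can be re-added from the standard 6502 timings. A small sketch
in Python; the table only covers the instructions and addressing modes
used in the listing above, and the helper names are made up:

```python
# Documented 6502 cycle counts for the addressing modes used above.
# Indexed stores always take 5 cycles; branches take 2 cycles when not
# taken and 3 when taken (without page crossing).
CYCLES = {
    "ldx abs": 4, "lda abs": 4, "lda zp": 3, "sta zp": 3,
    "sta abs,y": 5, "ldy #": 2, "iny": 2, "dex": 2,
    "branch not taken": 2, "branch taken": 3,
}


def cost(ops):
    """Sum the cycles of a list of instructions."""
    return sum(CYCLES[op] for op in ops)


# Read count and the 3 color bytes into the zeropage buffer, clear y.
setup = cost(["ldx abs"] + ["lda abs", "sta zp"] * 3 + ["ldy #"])
# Store 1 data byte from the network chip plus the 3 buffered color bytes.
store_four = cost(["lda abs", "sta abs,y", "iny"]
                  + ["lda zp", "sta abs,y", "iny"] * 3)

total = (setup + store_four + cost(["dex", "branch not taken"])
         + store_four + cost(["dex", "branch taken"]))
print(setup, store_four, total)  # 27 41 118
```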

i can do:
lda $de00
sta dest
lda $de01
sta dest+1
...

that is 48 cycles for 6 bytes.

But more likely i'll do (as it is easier, and still fast enough):

   ldx #>dest
   stx a1+2 ;set highbyte of dest in code
   stx a2+2
   stx a3+2
   stx a4+2
   ldx #$00 ;index is lowbyte of dest
loop
   lda $de00
a1 sta $0000,x
   inx
   lda $de01
a2 sta $0000,x
   inx
   lda $de00
a3 sta $0000,x
   inx
   lda $de01
a4 sta $0000,x
   inx
   bne loop

47 cycles per loop iteration (4 bytes) + 20 cycles for setup
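
What the loop does over one page can be modelled like this. A rough
sketch in Python: the PacketPort class is only a guess at how the 16 bit
wide data port behaves (next word offered after the second byte was
read), dest is assumed to be page aligned, and the setup is counted with
the standard timings (2 + 4*4 + 2 cycles):

```python
class PacketPort:
    """Very rough stand-in for the network chip's 16 bit wide data port."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def read(self, address):
        value = self.data[self.pos + (address & 1)]
        if address & 1:           # second byte was read: offer the next word
            self.pos += 2
        return value


def copy_page(port, memory, dest):
    """Model of the loop above: 4 bytes per iteration until x wraps to 0.
    Returns the number of cycles spent."""
    x = 0
    cycles = 20                   # setup: ldx #, 4 x stx abs, ldx #
    while True:
        for address in (0xDE00, 0xDE01, 0xDE00, 0xDE01):
            memory[dest + x] = port.read(address)
            x = (x + 1) & 0xFF
        if x == 0:
            cycles += 46          # final bne is not taken: 2 instead of 3
            break
        cycles += 47
    return cycles


memory = bytearray(0x10000)
print(copy_page(PacketPort(bytes(range(256))), memory, 0x0400))
```

That is 64 iterations for 256 bytes, so a bit under 12 cycles per byte
including loop overhead.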

So overall, i tried many things, even RLE and such, but it did not bring
any improvement, not compared with the speed i can achieve with loading
that simple.
Convinced now? :-)

Kindest regards,

Toby




