[FFmpeg-devel] a64 encoder 7th round
Michael Niedermayer
michaelni
Tue Feb 3 22:51:31 CET 2009
On Tue, Feb 03, 2009 at 08:28:10PM +0100, Bitbreaker/METALVOTZE wrote:
> Michael Niedermayer wrote:
> > On Tue, Feb 03, 2009 at 03:22:21PM +0100, Bitbreaker/METALVOTZE wrote:
> >
> >>> ldx $de00
> >>> lda lut+0,x
> >>> sta dest,y
> >>> lda lut+1,x
> >>> sta dest+64,y
> >>> lda lut+2,x
> >>> sta dest+128,y
> >>> lda lut+3,x
> >>> sta dest+192,y
> >>> iny
> >>>
> >> Just tested something similar to reconstruct the "compressed" colorram.
> >> However it spoils of course my option of linear writing and thus things
> >> need to happen even faster as i write at the lower and upper end of the
> >> colorram area at the same time. It works out tightly however when i
> >> start 44 lines before screen ends. Writing endures until i enter the
> >> upper area again, but ends luckily fast enough (4 lines) before the last
> >> line of the first 0x100 block of colorram is displayed. So i have to
> >> take care that i cross no 0x100 border codewise and indexwise, as that
> >> would add extra cycles and thus trash display. I could however place the
> >> LUT into zeropage where no extra cycles apply on those conditions. Would
> >> make things more stable,
> >>
> >
> >
> >> but wastes 64 nice favourite places to store
> >> values when running out of registers ;-)
> >>
> >
> > 19 not 64 (see my previous reply for the actual table);
> > with overlapping entries you need just 2^n + n - 1, not 2^n * n
> >
> What i do is stuffing 4 bits of each $0100 block together codec wise
>
> on c64 i copy 64 byte lut to $0100 (this is the stack) coz i was fed up
> by wasting so many cycles for just reading a table. Then i can suddenly do:
>
> ldy #$00
> tsx
> stx $40 ;save stack pointer
> ldx $de00
> txs
> pla
> sta $d800,y
> pla
> sta $d900,y
> pla
> sta $da00,y
> pla
> sta $db00,y
> iny
>
> ldx $de01
> ...
is this a fully unrolled loop?
because if it is i would expect the following to be faster
tsx
stx $40 ;save stack pointer
ldx $de00
txs
pla
sta $d800
pla
sta $d900
pla
sta $da00
pla
sta $db00
ldx $de01
txs
pla
sta $d801
pla
sta $d901
pla
sta $da01
pla
sta $db01
...
or the more obvious:
tsx
stx $40 ;save stack pointer
ldx $de00
txs
pla
sta $d800
pla
sta $d801
pla
sta $d802
pla
sta $d803
ldx $de01
txs
pla
sta $d804
pla
sta $d805
pla
sta $d806
pla
sta $d807
...
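
A rough tally of why dropping the Y index should help, as a sketch only. It assumes the standard NMOS 6502 timing tables (PLA counted as 4 cycles, STA absolute,Y as 5, STA absolute as 4), no page crossings, and no cycles stolen by the VIC-II. The variable names are made up for illustration, and the one-time stack pointer save is left out because it is the same in both variants.

```python
# Documented NMOS 6502 cycle counts for the instructions used above.
# Assumption: VIC-II bad lines and sprite DMA are ignored.
CYCLES = {
    "ldx abs": 4,
    "txs": 2,
    "pla": 4,
    "sta abs": 4,
    "sta abs,y": 5,  # indexed stores always take 5 cycles
    "iny": 2,
}


def group_cycles(instructions):
    """Sum the cycles of one 4-byte group."""
    return sum(CYCLES[op] for op in instructions)


# Quoted variant: stores indexed by Y, one INY per group.
indexed = ["ldx abs", "txs"] + ["pla", "sta abs,y"] * 4 + ["iny"]
# Fully unrolled variant: destination addresses are baked into the code.
unrolled = ["ldx abs", "txs"] + ["pla", "sta abs"] * 4

indexed_total = group_cycles(indexed)
unrolled_total = group_cycles(unrolled)
saved = indexed_total - unrolled_total
print(indexed_total, unrolled_total, saved)
```

Under these assumptions the saving comes from the four cheaper stores plus the dropped INY; the price is code size, since every group then needs its own copy of the instructions.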
>
> when finished, restore stack pointer
>
> this allows me to save 2 more cycles per 4 byte lookup as i can just pull
> data from stack within 3 cycles and even get the stackpointer
> incremented for free by that. Too bad that stack area is fixed, else i'd
> do it the other way round by pushing bytes on the stack.
> This is on the one hand dirty, but works fine, the real stackpointer and
> data is far away from my LUT i placed in the stack, so no collisions
> expected. My lookup consists of 16 entries each 4 bytes. The size does
> not hurt, 6502 code grew anyway by 0x500 bytes by the latest
> optimizations (twice the size now) ;-)
>
> LUT looks like:
>
> 8,8,8,8
> 8,8,8,9
> 8,8,9,8
> 8,8,9,9
> ...
8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8
19 elements
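
A quick check of the 19-entry table above, as a sketch: it confirms that every 4-entry combination of 8 and 9 starts somewhere in the table, and derives a remap from the 4-bit pattern to that start offset. The remap and the bit order (bit set means 9, most significant bit first) are assumptions for illustration; the idea is that the encoder would emit the offset instead of the raw pattern.

```python
# The 19-entry overlapped table quoted above.
TABLE = [8, 8, 8, 8, 9, 8, 8, 9, 9, 8, 9, 8, 9, 9, 9, 9, 8, 8, 8]
N = 4  # entries read per lookup


def window_offsets(table, n):
    """Map each n-entry window to the first offset where it starts."""
    offsets = {}
    for start in range(len(table) - n + 1):
        window = tuple(table[start:start + n])
        offsets.setdefault(window, start)
    return offsets


offsets = window_offsets(TABLE, N)


def pattern_to_offset(pattern):
    """Hypothetical encoder-side remap: 4-bit pattern -> offset in TABLE.

    Assumed bit order: bit set means 9, bit clear means 8, MSB first.
    """
    window = tuple(9 if pattern & (1 << (N - 1 - i)) else 8 for i in range(N))
    return offsets[window]


remap = [pattern_to_offset(p) for p in range(1 << N)]
print(len(offsets), remap)
```

Since each of the 16 windows appears exactly once, the remap is a permutation of 0..15, so the index stream stays the same size.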
why is this better? 64 bytes don't matter ...
why not do it with 5 instead of 4 (36 vs. 160)
8, 8, 8, 8, 8, 9, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 9, 9, 9, 8, 8, 8, 9, 9, 8, 9, 9, 9, 8, 9, 8, 9, 8, 8, 8, 8
or 7 (134 vs. 896). you don't have 896 bytes if i understood you correctly, but
the bigger table should be faster ...
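
The sizes quoted above follow from 2^n + n - 1 versus 2^n * n. Below is a minimal sketch that computes both and builds an overlapped table for any n. The greedy "prefer the high value" rule is a swapped-in construction chosen for brevity, not necessarily how the tables in this mail were produced, so the entries come out in a different but equally valid order. The function names are hypothetical.

```python
def table_sizes(n):
    """Return (overlapped, plain) entry counts for lookups of n entries."""
    return (2 ** n + n - 1, 2 ** n * n)


def overlapped_table(n, low=8, high=9):
    """Build an overlapped table in which every n-entry window is unique.

    Greedy rule: start with n low entries, then keep appending. Try the
    high value first and fall back to the low value if the resulting
    window was already used. Stop when neither value gives a new window.
    """
    bits = [0] * n
    seen = {tuple(bits)}
    extended = True
    while extended:
        extended = False
        for bit in (1, 0):
            window = tuple(bits[len(bits) - n + 1:] + [bit])
            if window not in seen:
                seen.add(window)
                bits.append(bit)
                extended = True
                break
    return [high if b else low for b in bits]


for n in (4, 5, 7):
    table = overlapped_table(n)
    print(n, table_sizes(n), len(table))
```

The trade-off is the one discussed above: a larger n means fewer index fetches per byte written, at the cost of a longer table and a wider index.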
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato