Author Topic: ASM has hit just about the right level of pain - what next? (Read 7237 times)

brucehoult · « **Reply #25 on:** June 30, 2025, 12:02:36 am »

Quote from: tggzzz on June 29, 2025, 10:10:55 pm

Your index register linked list chaining looks about right. Five instructions including surplus moves through the stack. Given everyone was raving about the improvements in the Z80, I was horrified when I first worked that one out.

The earlier 6800 did it in one instruction: LDX loaded both bytes from successive memory locations, and it could use indexed mode with an offset. The mnemonic would have been something like
LDX 2,X
Bit nicer in all respects than the Z80
LD B, (IX+2)
LD C, (IX+3)
PUSH BC
POP IX

And the 6809 was even nicer

Yes, the 6800 is much better with pointer-chasing code. But it falls down badly with anything that needs multiple pointers, especially pointers that need updating, such as a memcpy(). You end up with code like...

Code: [Select]

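loop:
LDX src      ; X = source pointer
LDA 0,X      ; A = byte at source
INX
STX src      ; save the advanced source pointer
LDX dst      ; X = destination pointer
STA 0,X      ; store byte at destination
INX
STX dst      ; save the advanced destination pointer
DEC sz       ; 8-bit count held in memory
BNE loop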

19 bytes, 46 cycles per loop

Sticking to 8080 instructions, the straightforward way on Z80 (there are faster ways) is...

Code: [Select]

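loop:
LD A,(HL)    ; A = byte at source
LD (DE),A    ; store byte at destination
INC HL
INC DE
DEC BC       ; 16-bit DEC leaves the flags alone...
LD A,B
OR C         ; ...so test BC for zero via B OR C
JP NZ,loop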

10 bytes, 50 cycles per loop

The Z80 code is about half the size: 10 bytes vs 19. The 6800 can save 2 bytes for copies of less than 256 bytes by using "DEC B" for the count ... but the Z80 also saves 2 bytes (and 2 instructions) by using DEC B.

6502 more or less splits the difference.

Load link pointer at offset 2:

Code: [Select]

LDY #2
LDA (ptr),Y
TAX
INY
LDA (ptr),Y
STX ptr
STA ptr+1

Quite a few more bytes of code than Z80 but omg the Z80 takes 63 cycles for 4 instructions while the 6502 is 22 cycles for 7 instructions, fitting the usual pattern of Z80 using 3x more cycles than 6502. 6800 of course wins here with 6 cycles.

For 6502 memcpy:

Code: [Select]

loop:
LDA (src),Y  ; src and dst are zero-page pointers
STA (dst),Y
DEY
BNE loop     ; falls through when Y reaches 0

7 bytes, 16 cycles. That's three times faster than either 6800 (46 cycles) or Z80 (50 cycles).

It does however only handle maximum 256 bytes copied. For larger copies you need another loop around this one, which brings the code size to larger than 6800, but doesn't affect the speed (much). In fact usually two loops are used, one for blocks of 256 bytes then another for the final bytes.
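The two-loop structure described above can be sketched in C as a rough model. This is only an illustration of the shape of the routine, not 6502 code; the name `copy_paged` is made up for the example, and the 8-bit index `y` stands in for the Y register.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes the way a 6502 routine typically would: an 8-bit index
   can only span 256 bytes, so whole pages go first, then the tail. */
void copy_paged(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t pages = n >> 8;               /* number of full 256-byte blocks */
    uint8_t tail = (uint8_t)(n & 0xFF);  /* bytes left over at the end */

    while (pages--) {
        uint8_t y = 0;
        do {                             /* 256 iterations: y wraps 255 -> 0 */
            dst[y] = src[y];
        } while (++y != 0);
        src += 256;                      /* like bumping the pointer high bytes */
        dst += 256;
    }
    for (uint8_t y = 0; y < tail; y++)   /* final partial block */
        dst[y] = src[y];
}
```

The inner loop cost is unchanged by the outer loop, which is why the speed is barely affected.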

The 6809 of course solves all problems, with four 16 bit index registers (two of which also support push/pop) and a 16 bit D accumulator.

Code: [Select]

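loop:
LDA ,X+      ; load byte from source, post-increment X
STA ,U+      ; store byte at destination, post-increment U
LEAY -1,Y    ; count in Y: only LEAX/LEAY update the Z flag, LEAU/LEAS do not
BNE loop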

8 bytes and I think 20 cycles per loop. Actually slightly bigger and slower than the 6502 inner loop, but handles bigger than 256 bytes.

You can also do 2 bytes at a time with...

Code: [Select]

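loop:
LDD ,X++     ; load two bytes from source, X += 2
STD ,U++     ; store two bytes at destination, U += 2
LEAY -2,Y    ; count in Y: only LEAX/LEAY update the Z flag, LEAU/LEAS do not
BNE loop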

... if you deal with a possible odd byte first.
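Why the odd byte has to go first: the loop counts down by 2 and exits only on exactly zero, so an odd count would step over zero and keep going. A minimal C sketch of the same structure, with the hypothetical name `copy_words` and a 16-bit count standing in for the count register:

```c
#include <stdint.h>

/* Copy n bytes two at a time, handling an odd count up front. */
void copy_words(uint8_t *dst, const uint8_t *src, uint16_t n)
{
    if (n & 1) {             /* odd count: move a single byte first */
        *dst++ = *src++;
        n--;
    }
    while (n != 0) {         /* n is now even, so stepping by 2 hits 0 exactly */
        dst[0] = src[0];
        dst[1] = src[1];
        src += 2;
        dst += 2;
        n -= 2;
    }
}
```

Unlike the assembly loop, this version tests the count at the top, so a count of zero is also safe.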

brucehoult · « **Reply #26 on:** June 30, 2025, 12:06:33 am »

Quote from: Analog Kid on June 29, 2025, 11:11:24 pm

So is there really no way to know which set of shadow registers is active? No tricky way of reading/writing those registers to determine which? That would certainly be a problem ...

Absolutely no way. I mean ... only wasting one of the six 8 bit registers to hold a magic value e.g. make B=0 and B'=1 and test B and swap if you have the wrong one. But no one ever does that. You just have to keep track mentally, the same way you make sure malloc() and free() are properly paired in C.

Analog Kid · « **Reply #27 on:** June 30, 2025, 12:18:29 am »

Quote from: brucehoult on June 30, 2025, 12:06:33 am

Quote from: Analog Kid on June 29, 2025, 11:11:24 pm
So is there really no way to know which set of shadow registers is active? No tricky way of reading/writing those registers to determine which? That would certainly be a problem ...

Absolutely no way. I mean ... only wasting one of the six 8 bit registers to hold a magic value e.g. make B=0 and B'=1 and test B and swap if you have the wrong one. But no one ever does that. You just have to keep track mentally, the same way you make sure malloc() and free() are properly paired in C.

So a simple 1-byte flag variable (or even just a bit in a byte) would take care of tracking that. Doesn't sound too bad.

brucehoult · « **Reply #28 on:** June 30, 2025, 12:37:09 am »

Quote from: Analog Kid on June 30, 2025, 12:18:29 am

Quote from: brucehoult on June 30, 2025, 12:06:33 am
Quote from: Analog Kid on June 29, 2025, 11:11:24 pm
So is there really no way to know which set of shadow registers is active? No tricky way of reading/writing those registers to determine which? That would certainly be a problem ...

Absolutely no way. I mean ... only wasting one of the six 8 bit registers to hold a magic value e.g. make B=0 and B'=1 and test B and swap if you have the wrong one. But no one ever does that. You just have to keep track mentally, the same way you make sure malloc() and free() are properly paired in C.

So a simple 1-byte flag variable (or even just a bit in a byte) would take care of tracking that. Doesn't sound too bad.

Well, if you're OK with going from 6 bytes of B, C, D, E, H, L registers to 10 instead of 12. Or going from 3 register pairs to 4 instead of 6.

If you really wanted to go that route you'd be better off using a RAM byte as the flag.

Analog Kid · « **Reply #29 on:** June 30, 2025, 12:57:25 am »

Quote from: brucehoult on June 30, 2025, 12:37:09 am

If you really wanted to go that route you'd be better off using a RAM byte as the flag.

That's what I meant. Should have made that clear.

SiliconWizard · « **Reply #30 on:** June 30, 2025, 01:39:30 am »

The Z80 has LDIR for memcpy. A single instruction, but 21 cycles per copied byte. Still usually faster than with a manual loop, unless you optimize the heck out of it (there were ways to copy data faster than LDIR, but that wasn't immediately obvious).

brucehoult · « **Reply #31 on:** June 30, 2025, 03:02:18 am »

Quote from: SiliconWizard on June 30, 2025, 01:39:30 am

The Z80 has LDIR for memcpy.

It does, but it's ONLY for memcpy. Sure, memcpy is important, but there are any number of similar operations that have code very little different to that shown above when done on 8080, 6800, 6502, 6809, but have no LDIR equivalent e.g. the rest of the <string.h> functions. Also things like for( ... ) a = (b & m) | (c & n) which you can take quite complex on Z80 using IX and IY and the alternate BC, DE, HL and can take almost arbitrarily complex on 6502 with 128 ZP pointers available.

6502 can get under 10 cycles per byte on 256*N byte blocks using something like ...

Code: [Select]

outer:
  ldy #63
loop:
  lda src,y
  sta dst,y
  lda src+64,y
  sta dst+64,y
  lda src+128,y
  sta dst+128,y
  lda src+192,y
  sta dst+192,y
  dey
  bpl loop
  dex
  beq done
  inc loop+2 // 5 cycles each in ZP, 6 abs
  inc loop+5
  inc loop+8
  inc loop+11
  inc loop+14
  inc loop+17
  inc loop+20
  inc loop+23
  jmp outer
done:

... plus of course initial setup of the 4 src and 4 dst pointers in the self-modifying code.

In case of misalignment, note that the loads take an extra cycle if a page boundary is crossed by the indexing, but stores don't, so make sure src is at least 64 byte aligned using an initial byte by byte loop.

The outer loop adds less than 0.2 cycles/byte overhead to the 10.25 cycles/byte of the inner loop, total about 10.45. Copying 8 bytes per inner loop lowers that to about 9.98 cycles/byte and 16 bytes per inner loop is about the same.
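The arithmetic behind those figures can be reproduced with a small C sketch. It assumes the standard 6502 timings (lda abs,y = 4, sta abs,y = 5, dey = 2, bpl taken = 3, ldy # = 2, dex = 2, beq not taken = 2, jmp = 3), no page crossings, and one 256-byte block per outer pass. The function names are made up for the example, and `inc_cycles` is 5 if the code runs from zero page or 6 otherwise.

```c
/* Inner loop cost per byte: `pairs` lda/sta pairs per iteration
   (4 in the listing above), plus dey and a taken bpl. */
double inner_cycles_per_byte(int pairs)
{
    return (pairs * (4 + 5) + 2 + 3) / (double)pairs;
}

/* Outer loop cost per byte, spread over one 256-byte block:
   ldy, dex, beq, jmp, one INC per pointer (two pointers per pair),
   minus 1 because the final bpl falls through (2 cycles, not 3). */
double outer_cycles_per_byte(int pairs, int inc_cycles)
{
    int per_block = 2 + 2 + 2 + 3 - 1 + 2 * pairs * inc_cycles;
    return per_block / 256.0;
}
```

Under those assumptions, 4 pairs with 5-cycle INCs gives 10.25 + 0.1875, roughly 10.44 cycles/byte, and both 8 and 16 pairs come out at roughly 9.97, in line with the figures quoted above.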

tggzzz · « **Reply #32 on:** June 30, 2025, 06:59:23 am »

brucehoult brings up a number of useful points, which I'll coalesce here.

Back then different architectures were indeed better for different programming styles. IIRC the GEC4080 family was quite good for "Fortran style" coding based on indexing into arrays, but suboptimal for "C style" pointer chasing. My coding style, even in the mid-late 70s before I knew C existed, seemed to have more pointer chasing and no array copying.

Nowadays architectures are much better at the instruction level, with the pain points moving to other parts of the system such as comms, scheduling and caches.

I've always taken the "number of cycles" metric with a very big pinch of salt, since different processors had a very different number of clock cycles in an instruction. From memory 6800 was a 1MHz clock and the Z80 was a 4MHz clock, so if the Z80 took four times as many clocks then it was an irrelevant difference. (IIRC the 1802 was even more extreme in that respect)
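To put numbers on that, a cycle count only becomes a time once it is divided by the clock rate. A tiny C sketch, using the hypothetical helper name `loop_time_us`, with the clock rates being the from-memory figures above rather than checked values:

```c
/* Wall-clock time for one loop iteration, in microseconds:
   cycles divided by clock frequency in MHz. */
double loop_time_us(int cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}
```

With the memcpy loops quoted earlier in the thread, 46 cycles at 1 MHz is 46 microseconds per byte, while 50 cycles at 4 MHz is 12.5 microseconds per byte.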

The 6502's "zero page is special" and "short addressing modes" always put me off it. I'm told that one attitude is to regard zero page as being registers, but I've never had the reason to test that.

The 6809 is indeed very clean; I've always mildly regretted not having a reason to use it.

The "LDIR for memcpy but only memcpy" fits in with my attitude to the Z80. It looks good on paper, but when you try to use it in your system the special case never seems to quite fit what you need to do. Consequently you are forced back on the core 8080 "RISC" instructions.

paulca · « **Reply #33 on:** June 30, 2025, 08:09:19 am »

Quote from: nctnico on June 29, 2025, 09:33:14 pm

PUSH BC
POP IX

Ah, ha! So when you are looking through the 16bit load group and you can't find a 16bit register swap instruction... just use the stack.

I never thought you can pop registers into other registers. Dumb.

brucehoult · « **Reply #34 on:** June 30, 2025, 08:44:40 am »

Quote from: paulca on June 30, 2025, 08:09:19 am

Quote from: nctnico on June 29, 2025, 09:33:14 pm
PUSH BC
POP IX

Ah, ha! So when you are looking through the 16bit load group and you can't find a 16bit register swap instructions... just use the stack.

It looks convenient, and it's only three bytes of code (PUSH BC is one byte, POP IX is two), but 25 clock cycles.

For 4 bytes of code but only 16 clock cycles you can do:

Code: [Select]

LD IXH,B
LD IXL,C

Pick your poison.

Quote

I never thought you can pop registers into other registers. Dumb.

Bytes are just bytes.

Analog Kid · « **Reply #35 on:** June 30, 2025, 08:49:10 am »

So OP: will it be asm or C?

paulca · « **Reply #36 on:** June 30, 2025, 08:57:39 am »

Quote from: Analog Kid on June 29, 2025, 09:01:12 pm

Just a comment from the peanut gallery here:

I don't think you need to jump to C.
You can do anything you need to do in assembly language: you're just running into its common limitations, primarily the small set of registers.

Yea. I mean there is mileage. I just hit a steep slope when I started to run out of registers. The first "pass" you can track a couple in your head easily enough. When you run out of the ones you can track easily things start to get much more "bug prone". Long 15-20min debug cycles almost entirely down to "OK, who modified what?"

An example:

Code: [Select]

LD A,D
CALL lcd_print_hex8
INC A ;; ERRRRR!!!   Nope!  A is corrupted.
LD D, A

That one took me far too long to spot. I was diligently putting A into D which was safe, but then school boyed it. Even though the function comment header of that routine (which I wrote) clearly says, "Mutates A!"

I ended up locking it down at both ends. I restore A in the sub-routine AND I don't rely on it in the caller. I had to point out to myself this is Z80 ASM, not Enterprise Java. Do one and remember, not both so you can forget.

The thing is, I have a whole new level to just step up to. Variables! Remember those? LOL

I just need to start putting labels down and
LD A,(VAR_ITS_NAME)

do stuff

LD (VAR_ITS_NAME), A

Definitely slower than a register swap, but I don't even think it's longer than a PUSH POP.

paulca · « **Reply #37 on:** June 30, 2025, 09:05:30 am »

Quote from: Analog Kid on June 30, 2025, 08:49:10 am

So OP: will it be asm or C?

Both. What I am going to do is go and look at what the C compiler does or rather look for sub-set of techniques.

Where I expect that to go and just like my prior post, is more verbose code, more diligent code, slower code. Just like what the C compiler would produce.

If I formulate (or borrow from a C compiler) techniques and even though they may not always be the most efficient, if I stick to them, then I know what to expect, at least from my own code.

I will be using sdcc and trying to tame it into my memory model.

Thinking now, you know, I'm probably lucky I didn't go "Harvard" now.

EDIT: On sdcc. I had a late night, nearly asleep epiphany.

All I need to move is BASE and BSS (or whichever symbol is the floor of writable space).

It's the base that has been the issue. It starts at 0000 in ROM page 0. Forbidden.

If I just set that to 0x2000 it will land above my "System ROM" and have "user_land" all the way to the very top of RAM where it will run into the stack long before it ends up in the Interrupt Vector tables.

I just need to convince it to stay above 0x2000.

The next challenge then will be linking symbols between the sjasmplus system ROM and sdcc produced "user_land" code. I could possibly do it manually. It's not like there will be more than 100 ROM routines.

EDIT: I could port the sjasmplus to whatever asm sdcc prefers.... if I have to. It might make things easier, while it's small. Python can help.

tggzzz · « **Reply #38 on:** June 30, 2025, 09:16:25 am »

Quote from: paulca on June 30, 2025, 08:57:39 am

I just hit a steep slope when I started to run out of registers. The first "pass" you can track a couple in your head easily enough. When you run out of the ones you can track easily things start to get much more "bug prone". Long 15-20min debug cycles almost entirely down to "OK, who modified what?"

Largely disappears if you use Forth: all variables are on the top of the Forth stack. You only use the processor registers to execute the Forth program words, which is much simpler.

Of course, every solution has its own downsides

brucehoult · « **Reply #39 on:** June 30, 2025, 10:51:24 am »

Quote from: tggzzz on June 30, 2025, 09:16:25 am

Quote from: paulca on June 30, 2025, 08:57:39 am
I just hit a steep slope when I started to run out of registers. The first "pass" you can track a couple in your head easily enough. When you run out of the ones you can track easily things start to get much more "bug prone". Long 15-20min debug cycles almost entirely down to "OK, who modified what?"

Largely disappears if you use Forth: all variables are on the top of the Forth stack. You only use the processor registers to execute the Forth program words, which is much simpler.

Of course, every solution has its own downsides

You don't actually have to have a Forth on 8080/z80, you can just program in that style by yourself, using the native instructions.

For example, keeping ToS in HL and dealing only with 16 bit values:

<num> PUSH HL; LD HL,num

1+ INC HL

1- DEC HL

+ EX DE,HL; POP HL; ADD HL,DE

- EX DE,HL; POP HL; XOR A; SBC HL,DE

DROP POP HL

DUP PUSH HL

SWAP EX (SP),HL

@ LD A,(HL); INC HL; LD H,(HL); LD L,A

! EX DE,HL; POP HL; LD (HL),E; INC HL; LD (HL),D; POP HL

The only thing, really, is that your return stack is mixed in with your evaluation stack.
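As a rough model of the table above, here is a C sketch in which `tos` plays the role of HL and the `stack` array plays the role of the hardware stack. All names are invented for the illustration, there is no bounds checking, and it covers only a few of the words.

```c
#include <stdint.h>

/* Toy model of a stack machine with the top of stack cached in a register. */
typedef struct {
    uint16_t tos;        /* top of stack, like HL */
    uint16_t stack[32];  /* cells below the top, like the SP stack */
    int sp;              /* number of cells currently in stack[] */
} Forthish;

/* <num>: PUSH HL; LD HL,num */
void f_push(Forthish *f, uint16_t n) { f->stack[f->sp++] = f->tos; f->tos = n; }

/* +: next-on-stack plus top, result becomes the new top */
void f_add(Forthish *f) { f->tos = (uint16_t)(f->stack[--f->sp] + f->tos); }

/* -: next-on-stack minus top */
void f_sub(Forthish *f) { f->tos = (uint16_t)(f->stack[--f->sp] - f->tos); }

/* DROP: POP HL */
void f_drop(Forthish *f) { f->tos = f->stack[--f->sp]; }

/* DUP: PUSH HL */
void f_dup(Forthish *f) { f->stack[f->sp++] = f->tos; }

/* SWAP: EX (SP),HL */
void f_swap(Forthish *f)
{
    uint16_t t = f->stack[f->sp - 1];
    f->stack[f->sp - 1] = f->tos;
    f->tos = t;
}
```

Keeping the top in a register is what makes words like 1+ and DUP a single instruction; the cost is an extra push or pop whenever the depth changes.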

tggzzz · « **Reply #40 on:** June 30, 2025, 11:10:41 am »

Quote from: brucehoult on June 30, 2025, 10:51:24 am

Quote from: tggzzz on June 30, 2025, 09:16:25 am
Quote from: paulca on June 30, 2025, 08:57:39 am
I just hit a steep slope when I started to run out of registers. The first "pass" you can track a couple in your head easily enough. When you run out of the ones you can track easily things start to get much more "bug prone". Long 15-20min debug cycles almost entirely down to "OK, who modified what?"

Largely disappears if you use Forth: all variables are on the top of the Forth stack. You only use the processor registers to execute the Forth program words, which is much simpler.

Of course, every solution has its own downsides

You don't actually have to have a Forth on 8080/z80, you can just program in that style by yourself, using the native instructions.

Very true. The only reason for not mentioning it was that the OP is interested in escaping from ASM into a higher level language.

Quote

For example, keeping ToS in HL and dealing only with 16 bit values:
...
The only thing, really, is that your return stack is mixed in with your evaluation stack.

Traditionally Forth has two independent stacks: the Data stack and the Return stack.

Without bothering to think about it, I would hope that those two stack pointers could be kept in HL and DE or IX and IY. That would leave the traditional 8080 SP stack for traditional uses.

(Aside: one reason I chose not to make an 1802 computer was that too many separate instructions were required for traditional subroutine JSR/RET calls. Algol60 spoiled me, and I didn't appreciate the thought of alternative execution strategies.)

guenthert · « **Reply #41 on:** July 01, 2025, 03:03:04 pm »

So how retro does it need to be? Using tools which were available at that time or more recent ones? Z80 is an awkward architecture for C. sdcc might be your best bet, if 'modern' tools are allowed -- I have no experience with it and fortunately no need to make such. SmallC was available in the 80s, but that one is best to be avoided. I always had a weak spot for FORTH, but arguably the best programming environment then for the Z80 was Turbo Pascal.

If I were to program 8bit computers again, I might be tempted to try Cowgol: https://cowlark.com/cowgol/

Benta · « **Reply #42 on:** July 01, 2025, 08:52:42 pm »

Quote from: guenthert on July 01, 2025, 03:03:04 pm

but arguably the best programming environment then for the Z80 was Turbo Pascal.

Wow! A walk down memory lane.
I'd forgotten about the Hejlsberg compilers, but used them all back then:
ComPas, PolyPascal and TurboPascal. Fastest compilers and executables I've ever worked with on CP/M and early MS-DOS machines.

But IIRC, they only run as native compilers, so you need a running CP/M-80 machine to generate code.

nfmax · « **Reply #43 on:** July 01, 2025, 09:19:24 pm »

You can run CP/M 80 software on a modern computer and operating system using tnylpo - https://gitlab.com/gbrein/tnylpo

gcewing · « **Reply #44 on:** July 05, 2025, 11:45:41 pm »

Quote from: paulca on June 29, 2025, 11:44:34 am

I still can't even decide if sub-routines should be responsible for restoring state or not. To me it seems as though sub-routines storing and restoring state "just in case" the parent was using it is wasteful ... if the parent wasn't.

You're overthinking this. Back when I was doing extensive Z80 programming, I adopted a very simple convention: callee preserves anything it uses. Anything else would have been a threat to my sanity.

Quote

Should a "blocking delay" function which uses the accumulator "PUSH/POP AF" for example? I mean if the parent is using A or F and is relying on it, then sure. But is it not better that the caller assumes the callee will corrupt the register file and take its own measures?

IMO, no. Either the caller assumes that the callee destroys everything, which could be just as wasteful, or implementation details of the callee become part of its contract with the caller, so that if the callee is changed in such a way that it uses more registers, you have to track down all the callers and update them. It could also result in larger code, because saving and restoring is being done at every call site instead of once in the callee.

Better, I think, to settle on the simplest, least error-prone convention for most of your code, and only do that kind of micro-optimisation for the small proportion of code that really needs it.

paulca · « **Reply #45 on:** July 10, 2025, 12:44:47 pm »

So for now I have migrated to a different style.

Just use memory. It might not be a beautiful example of optimisation, but I just want working code with fewer bugs. I can optimise later once I know the concept works.

The code isn't tested but it's the end of quite a long state machine for a "RAM Loader" via the PIO port (for now).

The concept was working in that I could read and display a counter sent by the MCU, but the full state machine is still to test. Part of me is not looking forward to it and part of me is considering revisiting the slow clock and bus monitor apparatus.

Doctorandus_P · « **Reply #46 on:** July 23, 2025, 07:08:47 am »

Quote from: tggzzz on June 29, 2025, 12:16:33 pm

You should look at Forth; it is a halfway house between machine code and C, and is efficient...

That may be so, but OP is trying to move away from ASM.
I do like my HP RPN calculator, but find something written in RPN extremely difficult to comprehend afterward.

C is pretty much the standard for microcontroller programming, and I've always liked the language, but it did take a few weeks to get the hang of the syntax. Up to now I only write software for which GCC is available (CLANG is apparently another good C / C++ compiler, but I just stuck with GCC). If those compilers are not available for your target, then SDCC would also be my next choice. From what I've read it's a well regarded compiler and has grown quite mature over the years. I once used it myself to compile the code for the ubiquitous EUR10 Logic analyzer (8051 based controller, code part of Sigrok / Pulseview) but I did not write code in it myself.

paulca · « **Reply #47 on:** July 23, 2025, 01:02:28 pm »

From further reflection. The problem with the C compiler is that to do its job it has to know a lot about your architecture.

By default it seems to lean towards a flat, RAM-only memory model. Stuff starts at 0x0000 and only goes elsewhere if you explicitly map it there.

The next level of config is not in mapping everything in or out of ROM/RAM areas, it's mapping it all into a "relocatable" or "fixed" offset using:

LOAD ""

(or a more raw variant loader).

ROM boots. User "loads application code to vector, jumps to vector".

This is the next most common architecture of the day.

For these two models sdcc might be far more "tameable" when I get back to it.

Originally I was jumping straight in with both feet to what happens in a modern MCU build chain with split memory segmentation models of SRAM, FlashROM etc. Ran head first into a wall instead of staying crawling and sneaking under it.

paulca · « **Reply #48 on:** July 23, 2025, 01:08:15 pm »

On a different note, without leaving ASM, I just got more confident and more adventurous.

When it came to parsing the command line I didn't let myself think, "Oh shit, string utils!", I just started writing them. Turns out you can get quite "nifty" in ASM with strings... and even some high-level architecture sneaking in from my roots by making my str_util routines chainable. I use the flag register to signal success/failure and so receive in A and return in A. The pointers passed in HL get returned in HL correctly advanced and the length in BC gets returned in BC set to the remaining length.

The most surprising thing is they work.
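For readers more at home in C, the chaining convention described above maps onto a small cursor struct: the pointer and remaining length travel together (like HL and BC), and each routine returns success or failure (like the flag). This is only a sketch of the idea; `Cursor`, `skip_spaces`, `expect_char` and `parse_hex` are hypothetical names, not the actual str_util routines.

```c
#include <stdbool.h>
#include <stddef.h>

/* Cursor = the (HL, BC) pair: current pointer plus remaining length. */
typedef struct {
    const char *p;
    size_t len;
} Cursor;

/* Skip leading spaces. Always succeeds. */
bool skip_spaces(Cursor *c)
{
    while (c->len > 0 && *c->p == ' ') { c->p++; c->len--; }
    return true;
}

/* Consume one expected character; on failure the cursor is untouched. */
bool expect_char(Cursor *c, char ch)
{
    if (c->len == 0 || *c->p != ch) return false;
    c->p++;
    c->len--;
    return true;
}

/* Parse hex digits into *out; fails if no digit is present. */
bool parse_hex(Cursor *c, unsigned *out)
{
    unsigned v = 0;
    size_t n = 0;
    while (c->len > 0) {
        char d = *c->p;
        unsigned x;
        if (d >= '0' && d <= '9') x = (unsigned)(d - '0');
        else if (d >= 'A' && d <= 'F') x = (unsigned)(d - 'A' + 10);
        else if (d >= 'a' && d <= 'f') x = (unsigned)(d - 'a' + 10);
        else break;
        v = v * 16 + x;
        c->p++;
        c->len--;
        n++;
    }
    if (n == 0) return false;
    *out = v;
    return true;
}
```

Because every routine leaves the cursor advanced past what it consumed, calls chain naturally, e.g. `skip_spaces(&c) && expect_char(&c, 'L') && skip_spaces(&c) && parse_hex(&c, &addr)`.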

The more annoying and harder to fix issue is UART overruns. Oh I hear you fancy modern day MCU devs nodding, but seriously, you have DMA and a FIFO, you have NO EXCUSE. I have one byte. If I don't read it before the next frame starts I lose one.

The irony ... if I don't look at the OVRN flag I only drop 1 byte out of 10,000s. If I check the OVRN flag I drop 1 in a 100.

Peabody · « **Reply #49 on:** July 23, 2025, 01:54:36 pm »

Quote from: paulca on July 23, 2025, 01:08:15 pm

The more annoying and harder to fix issue is UART overruns. Oh I hear you fancy modern day MCU devs nodding, but seriously, you have DMA and a FIFO, you have NO EXCUSE. I have one byte. If I don't read it before the next frame starts I lose one.

The irony ... if I don't look at the OVRN flag I only drop 1 byte out of 10,000s. If I check the OVRN flag I drop 1 in a 100.

Can you generate an interrupt when a byte has been received? Then the ISR could read in the byte and store it in a circular buffer.
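The circular buffer idea can be sketched in C like this. It is a generic single-producer, single-consumer ring buffer, not code for any particular UART; the names and the 64-byte size are assumptions for the example, and on the real system the ISR side would be a few instructions of assembly.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_SIZE 64u                 /* power of two, so wrapping is a mask */

static volatile uint8_t rx_buf[RX_SIZE];
static volatile uint8_t rx_head;    /* written only by the ISR */
static volatile uint8_t rx_tail;    /* written only by the main loop */

/* Call from the receive interrupt with the byte just read from the UART.
   Returns false if the buffer was full and the byte had to be dropped. */
bool rx_isr_put(uint8_t byte)
{
    uint8_t next = (uint8_t)((rx_head + 1u) & (RX_SIZE - 1u));
    if (next == rx_tail) return false;
    rx_buf[rx_head] = byte;
    rx_head = next;
    return true;
}

/* Call from the main loop. Returns false when nothing is waiting. */
bool rx_get(uint8_t *out)
{
    if (rx_tail == rx_head) return false;
    *out = rx_buf[rx_tail];
    rx_tail = (uint8_t)((rx_tail + 1u) & (RX_SIZE - 1u));
    return true;
}
```

One slot is always left empty so that "full" and "empty" can be told apart, which means this buffer holds at most 63 bytes. The ISR only has to win the race against the next frame; the main loop can then drain the buffer at its own pace.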

