I've finally got my mind around the first principles of how the 6502 works, but am stuck on one part; the "addressing modes".
It's not surprising.
Today the 6502 is often used as an example of a very simple processor -- and it certainly uses few transistors, and gets remarkably good performance out of them -- but it does so by being quite complex to understand.
The 6502's spiritual ancestor (and actual ancestor if you look at the people, not the companies) the 6800 is far simpler to understand, and easier to write programs for. But it's a lot less efficient, and that's important when you have a 1 MHz clock.
ledtester did a pretty good job of explaining the 6502 addressing modes in terms of assembly language syntax -- syntax that is very similar to that still used in modern assembly language today.
I will give a couple of examples, specifically of how to use the "indirect indexed" (ZP),Y and "indexed indirect" (ZP,X) addressing modes.
First, let's say we want to have a subroutine that copies a block of bytes of memory from one place to another. We'll call it "memcpy". It has a "from" argument, a "to" argument, and a size for the memory block which, for today, we'll limit to between 1 and 128 bytes (don't bother to call it at all if you want to copy 0 bytes!).
Suppose you have two blocks of memory:
foo: .byte 0,0,0,0,0,0,0
bar: .byte 3,1,4,1,5,9,3
To copy the contents of bar to foo we want to be able to do:
lda #>foo
pha
lda #<foo
pha
lda #>bar
pha
lda #<bar
pha
lda #7
pha
jsr memcpy
Here, a ">" means the hi 8 bits of a 16 bit address and "<" means the lo 8 bits.
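To make the two operators concrete, here is a minimal Python sketch of the split they perform. The function names `hi` and `lo` and the address 0x1234 are made up for illustration; they are not part of any assembler.

```python
def hi(addr):
    """High 8 bits of a 16-bit address (what '>' selects)."""
    return (addr >> 8) & 0xFF


def lo(addr):
    """Low 8 bits of a 16-bit address (what '<' selects)."""
    return addr & 0xFF


FOO = 0x1234  # hypothetical address of foo, chosen only for illustration

if __name__ == "__main__":
    print(hex(hi(FOO)), hex(lo(FOO)))
```

Putting the two halves back together as `(hi << 8) | lo` gives the original address, which is what the callee does when it stores them into two consecutive Zero Page bytes.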
There are many other ways we might do this. Instead of pushing everything on to the stack we might store the various values into Zero Page locations that we know the memcpy() function expects to find them in. That would be faster, but it would add five extra bytes of code to every caller of memcpy() and you might have quite a lot of them in the program. Or the memcpy() function might expect to find the length in the X or Y register instead of on the stack. That would both be faster and save a byte of code in the caller (or two bytes compared to storing it in Zero Page)
Another way of doing this that makes the caller a lot smaller -- 8 bytes instead of 18 bytes (!) -- but less flexible is:
jsr memcpy
.byte >foo, <foo, >bar, <bar, 7
This makes memcpy work a lot harder to get the parameters, but it could be worth it if getting the most amount of program features into fixed size memory is important. A real program might have both this version and another that was faster but more code at the caller.
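The extra work is that the callee has to find its parameters through the return address that JSR pushed, then adjust that return address so RTS skips over the data. A rough Python model of that bookkeeping, assuming the usual 6502 behaviour (JSR pushes the address of its own last byte, high byte first, and RTS adds 1 to what it pulls); the helper names and the addresses in the usage note are invented for the sketch:

```python
def jsr(mem, sp, jsr_addr):
    """Push the return address the way JSR does; return the new stack pointer."""
    ret = (jsr_addr + 2) & 0xFFFF                 # address of the JSR's last byte
    mem[0x100 + sp] = ret >> 8                    # high byte is pushed first
    mem[0x100 + ((sp - 1) & 0xFF)] = ret & 0xFF   # then the low byte
    return (sp - 2) & 0xFF


def fetch_inline_params(mem, sp, count):
    """Read `count` bytes that follow the JSR and move the return address past them."""
    lo_slot = 0x100 + ((sp + 1) & 0xFF)           # S points at the next free slot
    hi_slot = 0x100 + ((sp + 2) & 0xFF)
    ret = mem[lo_slot] | (mem[hi_slot] << 8)
    params = [mem[(ret + 1 + i) & 0xFFFF] for i in range(count)]
    new_ret = (ret + count) & 0xFFFF              # RTS adds 1, landing after the data
    mem[lo_slot] = new_ret & 0xFF
    mem[hi_slot] = new_ret >> 8
    return params


def rts(mem, sp):
    """Address at which execution resumes after RTS."""
    ret = mem[0x100 + ((sp + 1) & 0xFF)] | (mem[0x100 + ((sp + 2) & 0xFF)] << 8)
    return (ret + 1) & 0xFFFF
```

With a JSR placed at a hypothetical $2000 and five parameter bytes at $2003..$2007, `fetch_inline_params` returns those five bytes and `rts` then resumes at $2008, the first instruction after the data.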
On modern computers, where you are often interfacing to code from a C compiler or using a standard library, there is a standard convention for how subroutines and their callers interact. On the 6502 and z80 it was pretty much the wild west: whoever wrote a function did whatever they wanted, and users of the function needed to consult its documentation before calling it.
Let's stick with the first version for now.
Here's what the memcpy() subroutine might look like:
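src equ 13 ; completely arbitrary location in first 256 bytes of RAM. We need 2 bytes free here
dst equ 42 ; again, arbitrary
ret equ 7 ; again, arbitrary. 2 bytes to hold the return address
memcpy:
pla ; JSR left the return address on top of the stack, so save it first
sta ret
pla
sta ret+1
pla ; length
tay
pla
sta src
pla
sta src+1
pla
sta dst
pla
sta dst+1
lda ret+1 ; put the return address back for RTS
pha
lda ret
pha
dey ; convert 1..128 into 0..127, one less than how many bytes we will copy
loop:
lda (src),y
sta (dst),y
dey
bpl loop
rts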
We'd really like to test the carry flag here, so that we could copy up to 256 bytes at a time, but unfortunately the DEY instruction only sets N and Z. We could add a CPY #255 and then use BNE loop, but that would slow down the loop quite a bit. OK, 12.5% -- 18 cycles per loop instead of 16. Maybe it's worth it.
Programming weird things such as the 6502 is full of such trade-offs.
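For anyone who finds the (ZP),Y loop easier to follow in a high-level language, here is a small Python model of it. It is only a sketch: `mem` stands in for the 64K address space, and the helper names are invented. The point is that the pointer is read from two Zero Page bytes and Y is added to it on every access.

```python
SRC, DST = 13, 42   # the same arbitrary zero-page locations as the listing


def pointer(mem, zp):
    """16-bit little-endian pointer stored at zero-page address zp."""
    return mem[zp] | (mem[(zp + 1) & 0xFF] << 8)


def set_pointer(mem, zp, addr):
    """Store a 16-bit address at zero-page address zp, low byte first."""
    mem[zp] = addr & 0xFF
    mem[(zp + 1) & 0xFF] = addr >> 8


def memcpy_loop(mem, count):
    """Model of dey / lda (src),y / sta (dst),y / dey / bpl, for count 1..128."""
    y = (count - 1) & 0xFF                          # the dey before the loop
    while True:
        src_ea = (pointer(mem, SRC) + y) & 0xFFFF   # effective address of (src),y
        dst_ea = (pointer(mem, DST) + y) & 0xFFFF   # effective address of (dst),y
        mem[dst_ea] = mem[src_ea]
        y = (y - 1) & 0xFF                          # dey
        if y & 0x80:                                # N is set, so bpl is not taken
            break
```

Note that the bytes are copied from the highest index down to index 0, and that the loop ends when Y wraps from 0 to 255, which is what sets the N flag.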
OK, so when would we use ZP,x and absolute,Y addressing modes?
One example would be if we have a program doing a lot of arithmetic on 16 bit or 32 bit integers. Maybe a subroutine has a number of variables like this, and stores them in Zero Page while working on them.
Let's say we have some 32 bit variables in Zero Page with their least significant 8 bits at addresses ... 13, 42, 69, and 100. And for some reason we want to add them all up and put the result at address 42 (the answer).
You could do it like this:
add32:
clc
lda $0000,y ;sadly, there is no ZP,y for LDA, so use absolute,y
adc $00,x
sta $00,x
lda $0001,y
adc $01,x
sta $01,x
lda $0002,y
adc $02,x
sta $02,x
lda $0003,y
adc $03,x
sta $03,x
rts
... and then using it ...
ldx #42
ldy #13
jsr add32
ldy #69
jsr add32
ldy #100
jsr add32
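As a cross-check of what those three calls compute, here is a rough Python equivalent. It is a sketch under the same assumptions as the listing (little-endian 32 bit values in Zero Page, carry rippling from byte to byte); `store32` and `load32` are helpers invented for the example.

```python
def store32(mem, addr, value):
    """Write a 32-bit value at addr, least significant byte first."""
    mem[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")


def load32(mem, addr):
    """Read a 32-bit little-endian value from addr."""
    return int.from_bytes(mem[addr:addr + 4], "little")


def add32(mem, x, y):
    """mem[x..x+3] += mem[y..y+3], like the four lda/adc/sta groups."""
    carry = 0                                   # clc
    for i in range(4):
        dst = (x + i) & 0xFF                    # ZP,x wraps within zero page
        total = mem[y + i] + mem[dst] + carry   # lda $000i,y / adc $0i,x
        mem[dst] = total & 0xFF                 # sta $0i,x
        carry = total >> 8                      # carry feeds the next adc
```

Calling `add32(mem, 42, 13)`, `add32(mem, 42, 69)` and `add32(mem, 42, 100)` in turn leaves the sum of all four variables, modulo 2^32, at address 42.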
Finally, when would you use (ZP,x) addressing mode?
To be honest, I think I've almost never used it, except when X was 0 and Y was being used for something else. Note that (ZP,x) and (ZP),y are exactly the same thing if X and Y are 0. But (ZP),y is 1 clock cycle faster, except for STA.
I do have an example where you might use it, but it's kind of obscure so I'll leave it for now.
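In the meantime, the difference between the two modes is easy to state as effective-address arithmetic. The following Python sketch (function names invented, `mem` standing in for RAM) shows where the index register gets added in each case, and why the two agree when X and Y are both 0.

```python
def read16_zp(mem, zp):
    """16-bit little-endian pointer at zero-page address zp (wrapping in page 0)."""
    zp &= 0xFF
    return mem[zp] | (mem[(zp + 1) & 0xFF] << 8)


def ea_indexed_indirect(mem, zp, x):
    """(zp,x): add X to the zero-page address first, then fetch the pointer."""
    return read16_zp(mem, zp + x)


def ea_indirect_indexed(mem, zp, y):
    """(zp),y: fetch the pointer first, then add Y to it."""
    return (read16_zp(mem, zp) + y) & 0xFFFF
```

So (ZP,x) selects one pointer out of a table of pointers in Zero Page, while (ZP),y selects one byte out of the block that a single pointer refers to.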
A ±128 byte PC-relative JSR wouldn't be very useful (although 6800 had one), and there is no support at all in any other instructions for the 16 bit add needed for an arbitrary PC-relative call, so it would have been tricky to provide.
There was very little support at all for PIC in any 8 bit ISA except the M6809, which came a bit too late after 16 bit machines were already introduced.
M6800 could BSR ±128 bytes, or JSR to an address held in the X register (plus an unsigned 8 bit offset), so you could calculate a target address into a pair of zero page locations and then load them into X. There's no way to access the current PC contents other than a JSR/BSR, so you'd need a stub function that copied its return address to A/B or X or into memory before returning. Ugh. At least, yeah, you can call a calculated address and easily get the return address pushed.
z80 has no calculated call at all, only absolute. And you can only get the PC with a call, so you need the stub routine. At least the stub can simply load the pushed PC into a 16 bit register before returning. But you need to calculate the return address and manually push it before using JP (HL/IX/IY).
8080 also doesn't have PC-relative call.
M6809 has both BSR with 8 bit offset and LBSR with 16 bit offset. And JSR with absolute address.
Probably with everything else the best approach would be to load the desired offset into registers and then do an absolute call to a utility function that uses the return address in the normal way, as well as to calculate an address to jump to.
e.g. for 6502
// the call site
// NB offset is relative to the last byte of the JSR, *not* to the jsr or the next instruction
ldy #offsetHI
lda #offsetLO
jsr relativeCall
// the utility function
relativeCall:
tsx
clc
adc $101,x // low byte of return address; S points one below the last byte pushed
sta zptmp
tya
adc $102,x // high byte
sta zptmp+1
jmp (zptmp)
So you're looking at 7 bytes instead of 3 at the call site, which is not awful, but 2+2+6+2+2+4+3+2+4+3+5 = 35 cycles instead of 6 for the call. Plus, it nukes ALL the registers, so you can't pass any actual arguments in registers.
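To pin down what the offset has to be, here is a small Python model of the address arithmetic. It is a sketch that assumes the standard 6502 behaviour of JSR pushing the address of its own last byte; the function names and the example addresses are made up.

```python
def relative_call_target(jsr_addr, offset):
    """Address the utility routine jumps to, computed bytewise like the two ADCs."""
    pushed = (jsr_addr + 2) & 0xFFFF                    # last byte of the JSR
    low = (offset & 0xFF) + (pushed & 0xFF)             # clc / adc with the low byte
    high = (offset >> 8) + (pushed >> 8) + (low >> 8)   # tya / adc, plus the carry
    return ((high & 0xFF) << 8) | (low & 0xFF)


def offset_for(jsr_addr, target):
    """16-bit offset the call site must load so that the call lands on target."""
    return (target - (jsr_addr + 2)) & 0xFFFF
```

Because the offset only depends on the distance between the JSR and the target, it stays the same if the whole program is loaded at a different address, which is the property position-independent code needs. Only `relativeCall` itself has to sit at a fixed address.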
z80 a little better, but still not fun:
// call site. Offset is from start of following instruction
ld bc,#offset
call relativeCall
// the utility function
relativeCall:
pop hl
push hl
add hl,bc
jp (hl)
So that's 6 bytes at the call site and 10+17+10+11+11+4 = 63 cycles instead of 17 for an absolute call.
At least the z80 utility function is only 4 bytes vs 16 bytes. But you only need one copy of it, so that hardly matters.
Sure, but who said it could only be relative? The 6502's JMP instruction has absolute and indirect forms, for example; JSR could have leveraged these two modes too.
"there is no support at all in any other instructions for the 16 bit add needed for an arbitrary PC-relative call, so it would have been tricky to provide."
If you're happy with indirect JSR instead of relative JSR then instead of the ideal...
jsr ($nnnn)
... all you need is ...
jsr indirectJsrViaNN
:
:
indirectJsrViaNN:
jmp ($nnnn)
I expect an actual indirect JSR instruction would take 8 cycles -- certainly no fewer, since it reads and writes 8 bytes to/from RAM -- and the substitute takes 11 cycles, so not a huge deal.