I've finally got my mind around the first principles of how the 6502 works, but am stuck on one part; the "addressing modes".
It's not surprising.
Today the 6502 is often used as an example of a very simple processor -- and it certainly uses very few transistors, and gets very good performance out of them -- but it does so by being quite complex to understand.
The 6502's spiritual ancestor (and actual ancestor if you look at the people, not the companies), the 6800, is far simpler to understand, and easier to write programs for. But it's a lot less efficient, and that's important when you have a 1 MHz clock.
ledtester did a pretty good job of explaining the 6502 addressing modes in terms of assembly language syntax -- syntax that is very similar to what is still used in assembly language today.
I will give a couple of examples, specifically of how to use the "indexed indirect" (ZP,x) and "indirect indexed" (ZP),y addressing modes.
First, let's say we want to have a subroutine that copies a block of bytes of memory from one place to another. We'll call it "memcpy". It has a "from" argument and a "to" argument and a size of the memory block which, for today, we'll limit to between 1 and 128 bytes (don't bother to call it at all if you want to copy 0 bytes!)
Suppose you have two blocks of memory:
foo: .byte 0,0,0,0,0,0,0
bar: .byte 3,1,4,1,5,9,3
To copy the contents of bar to foo we want to be able to do:
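lda #>foo ; immediate mode: load the high byte of the address of foo
pha
lda #<foo ; low byte of the address of foo
pha
lda #>bar
pha
lda #<bar
pha
lda #7 ; number of bytes to copy
pha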
jsr memcpy
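Here, a ">" means the high 8 bits of a 16 bit address and "<" means the low 8 bits. The "#" makes the value an immediate operand, so we load the address byte itself rather than whatever is stored at that location.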
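If it helps to see the high/low split outside of assembler syntax, here is a minimal Python sketch of it. The address 0x1234 for foo is made up for the illustration; a real assembler or linker chooses where foo lives.

```python
# Hypothetical address for the label foo; a real assembler picks this.
FOO = 0x1234

def hi(addr):
    """What the '>' operator yields: bits 8..15 of the address."""
    return (addr >> 8) & 0xFF

def lo(addr):
    """What the '<' operator yields: bits 0..7 of the address."""
    return addr & 0xFF

# The two bytes recombine into the original 16-bit address.
rebuilt = (hi(FOO) << 8) | lo(FOO)
print(hex(hi(FOO)), hex(lo(FOO)), hex(rebuilt))  # 0x12 0x34 0x1234
```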
There are many other ways we might do this. Instead of pushing everything on to the stack we might store the various values into Zero Page locations that we know the memcpy() function expects to find them in. That would be faster, but it would add five extra bytes of code to every caller of memcpy() and you might have quite a lot of them in the program. Or the memcpy() function might expect to find the length in the X or Y register instead of on the stack. That would both be faster and save a byte of code in the caller (or two bytes compared to storing it in Zero Page)
Another way of doing this that makes the caller a lot smaller -- 8 bytes instead of 18 bytes (!) -- but less flexible is:
jsr memcpy
.byte >foo, <foo, >bar, <bar, 7
This makes memcpy work a lot harder to get the parameters, but it could be worth it if fitting as many program features as possible into a fixed amount of memory is important. A real program might have both this version and another that was faster but needed more code at the caller.
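To show roughly what "work a lot harder" means, here is a Python model of how a callee could find parameters placed right after the JSR. It is only a sketch: all the addresses are invented for the illustration, and it relies on the fact that JSR pushes the address of its own last byte (RTS adds 1 when it returns).

```python
mem = bytearray(0x10000)          # 64 KiB of 6502 address space

# Hypothetical addresses, chosen only for this illustration.
CALL_SITE, MEMCPY, FOO, BAR = 0x0800, 0x0A00, 0x0900, 0x0910

# jsr memcpy (opcode $20, target low byte, target high byte) ...
mem[CALL_SITE:CALL_SITE + 3] = bytes([0x20, MEMCPY & 0xFF, MEMCPY >> 8])
# ... followed by: .byte >foo, <foo, >bar, <bar, 7
mem[CALL_SITE + 3:CALL_SITE + 8] = bytes(
    [FOO >> 8, FOO & 0xFF, BAR >> 8, BAR & 0xFF, 7])

# JSR pushes the address of its own last byte, not of the next instruction.
pushed = CALL_SITE + 2

def fetch_inline_params(mem, pushed, count):
    """Read `count` bytes following the JSR; return them plus the value
    the callee must push back so that RTS skips over the data."""
    params = list(mem[pushed + 1:pushed + 1 + count])
    return params, pushed + count

params, new_pushed = fetch_inline_params(mem, pushed, 5)
dst = (params[0] << 8) | params[1]
src = (params[2] << 8) | params[3]
length = params[4]
resume_at = new_pushed + 1        # RTS adds 1 to the pulled address
print(hex(dst), hex(src), length, hex(resume_at))  # 0x900 0x910 7 0x808
```

On the real chip the callee would pull the two return-address bytes into a Zero Page pointer, read the parameters through it with (ZP),y, then push the adjusted address back before RTS.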
On modern computers, where you are often interfacing to code from a C compiler or using a standard library, there is a standard convention for how subroutines and their callers interact. On the 6502 and Z80 it was pretty much the wild west: someone who wrote a function did whatever they wanted, and the users of the function needed to consult the documentation for every function before calling it.
Let's stick with the first version for now.
Here's what the memcpy() subroutine might look like:
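src equ 13 ; completely arbitrary location in first 256 bytes of RAM. We need 2 bytes free here
dst equ 42 ; again, arbitrary
ret equ 69 ; arbitrary too; 2 bytes to park the return address
memcpy:
pla ; jsr pushed the return address on top of our arguments,
sta ret ; so pull it off first (low byte, then high byte) and save it
pla
sta ret+1
pla ; length
tay
pla ; source address, low byte
sta src
pla ; source address, high byte
sta src+1
pla ; destination address, low byte
sta dst
pla ; destination address, high byte
sta dst+1
lda ret+1 ; put the return address back for rts, high byte first
pha
lda ret
pha
dey ; convert 1..128 into 0..127, the index of the last byte we will copy
loop:
lda (src),y
sta (dst),y
dey
bpl loop
rts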
We'd really like to test the carry flag here, so that we could copy up to 256 bytes at a time, but unfortunately the DEY instruction only sets N and Z. We could add a CPY #255 and then use BNE loop (CPY #0 and BCS loop won't do, because CPY #0 always sets the carry), but this would slow down the loop quite a bit. OK, 12.5% -- 18 cycles per loop instead of 16. Maybe it's worth it.
Programming weird things such as the 6502 is full of such trade-offs.
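To see why the DEY / BPL loop is limited in length, and what the CPY #255 / BNE variant buys, here is a rough Python model of both loop shapes. It only mimics the Y register and the branch condition; the addresses in the demo are invented.

```python
def as_signed(byte):
    """Interpret an 8-bit value the way the N flag does (bit 7 = sign)."""
    return byte - 256 if byte & 0x80 else byte

def copy_bpl(mem, src, dst, n):
    """Models: dey / loop: lda (src),y / sta (dst),y / dey / bpl loop"""
    y = (n - 1) & 0xFF
    copied = 0
    while True:
        mem[dst + y] = mem[src + y]
        copied += 1
        y = (y - 1) & 0xFF         # DEY wraps from 0 to 255
        if as_signed(y) < 0:       # BPL falls through once N is set
            break
    return copied

def copy_cpy_bne(mem, src, dst, n):
    """Same loop, but ending with: dey / cpy #255 / bne loop"""
    y = (n - 1) & 0xFF
    copied = 0
    while True:
        mem[dst + y] = mem[src + y]
        copied += 1
        y = (y - 1) & 0xFF
        if y == 0xFF:              # BNE falls through once Y has wrapped
            break
    return copied

mem = bytearray(0x10000)
SRC, DST = 0x2000, 0x3000          # hypothetical buffer addresses
mem[SRC:SRC + 256] = bytes(range(256))
print(copy_bpl(mem, SRC, DST, 128),       # 128
      copy_bpl(mem, SRC, DST, 200),       # 1
      copy_cpy_bne(mem, SRC, DST, 256))   # 256
```

With a length of 200, Y starts at 199, which already looks negative to BPL after the first DEY, so the loop quits after a single byte. The CPY #255 version keeps going until Y actually wraps around.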
OK, so when would we use ZP,x and absolute,Y addressing modes?
One example would be if we have a program doing a lot of arithmetic on 16 bit or 32 bit integers. Maybe a subroutine has a number of variables like this, and stores them in Zero Page while working on them.
Let's say we have some 32 bit variables in Zero Page with their least significant 8 bits at addresses ... 13, 42, 69, and 100. And for some reason we want to add them all up and put the result at address 42 (the answer).
You could do it like this:
add32:
clc
lda $0000,y ; sadly, LDA has no ZP,y mode, so we use absolute,y
adc $00,x
sta $00,x
lda $0001,y
adc $01,x
sta $01,x
lda $0002,y
adc $02,x
sta $02,x
lda $0003,y
adc $03,x
sta $03,x
rts
... and then using it ...
ldx #42
ldy #13
jsr add32
ldy #69
jsr add32
ldy #100
jsr add32
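If you want to convince yourself of what those three calls leave at address 42, here is a small Python model of add32. It is a sketch, not an emulator: it only mimics the indexed loads and stores and the carry that ADC passes from one byte to the next. The four starting values are made up.

```python
def add32(mem, x, y):
    """Model of the add32 routine: the 32-bit value at zero-page address x
    gets the 32-bit value at address y added to it, low byte first."""
    carry = 0                                    # clc
    for i in range(4):
        a = mem[(i + y) & 0xFFFF]                # lda $000i,y (absolute,y)
        total = a + mem[(i + x) & 0xFF] + carry  # adc $0i,x (wraps in page zero)
        mem[(i + x) & 0xFF] = total & 0xFF       # sta $0i,x
        carry = total >> 8                       # carry feeds the next byte
    return carry

mem = bytearray(0x10000)

def put32(addr, value):
    """Store a 32-bit value little-endian, as the 6502 code expects."""
    mem[addr:addr + 4] = value.to_bytes(4, "little")

def get32(addr):
    return int.from_bytes(mem[addr:addr + 4], "little")

# Made-up values for the four variables at 13, 42, 69 and 100.
for addr, value in [(13, 1_000_000), (42, 2_000_000), (69, 70_000), (100, 255)]:
    put32(addr, value)

# ldx #42, then ldy #13 / #69 / #100 with a jsr add32 after each.
for y in (13, 69, 100):
    add32(mem, 42, y)
print(get32(42))  # 3070255
```

Note that the single CLC at the top is enough, because LDA and STA leave the carry flag alone between the ADC instructions.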
Finally, when would you use (ZP,x) addressing mode?
To be honest, I think I've almost never used it, except when X was 0 and Y was being used for something else. Note that (ZP,x) and (ZP),y are exactly the same thing if X and Y are 0. But (ZP),y is 1 clock cycle faster, except for STA.
I do have an example where you might use it, but it's kind of obscure so I'll leave it for now.
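In the meantime, the difference between the two indirect modes comes down to when the index register gets added. Here is a minimal Python model of the two effective address calculations; the pointer values stored at 13 and 15 are invented for the demonstration.

```python
def ea_indexed_indirect(mem, zp, x):
    """(zp,x): add X to the zero-page address first (wrapping within
    page zero), then read the 16-bit pointer stored there."""
    p = (zp + x) & 0xFF
    return mem[p] | (mem[(p + 1) & 0xFF] << 8)

def ea_indirect_indexed(mem, zp, y):
    """(zp),y: read the 16-bit pointer first, then add Y to it."""
    base = mem[zp] | (mem[(zp + 1) & 0xFF] << 8)
    return (base + y) & 0xFFFF

mem = bytearray(0x10000)
mem[13], mem[14] = 0x00, 0x20   # hypothetical pointer at 13 -> $2000
mem[15], mem[16] = 0x00, 0x30   # hypothetical pointer at 15 -> $3000

# With both index registers at 0 the two modes agree.
print(hex(ea_indexed_indirect(mem, 13, 0)),
      hex(ea_indirect_indexed(mem, 13, 0)))   # 0x2000 0x2000

# With an index of 2 they diverge: X selects a different pointer,
# while Y offsets the address the pointer refers to.
print(hex(ea_indexed_indirect(mem, 13, 2)),
      hex(ea_indirect_indexed(mem, 13, 2)))   # 0x3000 0x2002
```

So (ZP,x) picks one pointer out of a table of pointers in Zero Page, while (ZP),y walks through the memory that a single pointer refers to.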