Your index register linked list chaining looks about right. Five instructions, including surplus moves through the stack. Given that everyone was raving about the improvements in the Z80, I was horrified when I first worked that one out.
The earlier 6800 did it in one instruction: LDX loaded both bytes from successive memory locations, and it could use indexed mode with an offset. The mnemonic would have been something like
LDX 2,X
A bit nicer in all respects than the Z80:
LD B, (IX+2)
LD C, (IX+3)
PUSH BC
POP IX
And the 6809 was even nicer.
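For anyone who wants to see what both snippets compute, here is a rough C model of the chaining step. It is only a sketch: `mem`, `load16_be`, `follow_link` and `list_length` are names made up for illustration, and it assumes the link is stored high byte first at offset 2, which is what both the LDX and the LD B / LD C pair above load.

```c
#include <stdint.h>

/* Illustrative model of a 64 KiB address space (not from the thread). */
uint8_t mem[65536];

/* Read a 16-bit pointer stored high byte first, as LDX does. */
uint16_t load16_be(uint16_t addr) {
    return (uint16_t)((mem[addr] << 8) | mem[(uint16_t)(addr + 1)]);
}

/* One chaining step, x = *(x + 2): the whole job of LDX 2,X. */
uint16_t follow_link(uint16_t x) {
    return load16_be((uint16_t)(x + 2));
}

/* Walk the list until a null (0) link, counting nodes.
   Assumes address 0 is never a valid node. */
unsigned list_length(uint16_t head) {
    unsigned n = 0;
    while (head != 0) {
        n++;
        head = follow_link(head);
    }
    return n;
}
```

With a node at 0x1000 whose link bytes are 0x20, 0x00 and a node at 0x2000 with a zero link, `follow_link(0x1000)` gives 0x2000 and `list_length(0x1000)` gives 2.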
Yes, the 6800 is much better with pointer-chasing code. But it falls down badly with anything that needs multiple pointers, especially pointers that need updating, such as a memcpy(). You end up with code like...
loop: LDX src
LDA 0,X
INX
STX src
LDX dst
STA 0,X
INX
STX dst
DEC sz
BNE loop
19 bytes, 46 cycles per loop
Sticking to 8080 instructions, the straightforward way on Z80 (there are faster ways) is...
loop: LD A,(HL)
LD (DE),A
INC HL
INC DE
DEC BC
LD A,B
OR C
JP NZ,loop
10 bytes, 50 cycles per loop (JP is 3 bytes; the 2-byte JR would make it 9 bytes, but JR is not an 8080 instruction)
The Z80 code is about half the size: 10 bytes vs 19. The 6800 can save 2 bytes for copies of fewer than 256 bytes by using "DEC B" for the count, but the Z80 also saves 2 bytes (and 2 instructions) by using DEC B.
6502 more or less splits the difference.
Load link pointer at offset 2:
LDY #2
LDA (ptr),Y
TAX
INY
LDA (ptr),Y
STX ptr
STA ptr+1
Quite a few more bytes of code than the Z80, but omg, the Z80 takes 63 cycles for 4 instructions while the 6502 takes 22 cycles for 7 instructions, fitting the usual pattern of the Z80 using about 3x as many cycles as the 6502. The 6800 of course wins here with 6 cycles.
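To show where those totals come from, here is a small C tally of the per-instruction cycle counts. The table names and `total_cycles` are made up for this sketch, and the 6502 figures assume `ptr` lives in zero page and no page boundary is crossed.

```c
#include <stddef.h>

/* Sum a table of per-instruction cycle counts. */
unsigned total_cycles(const unsigned *cycles, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += cycles[i];
    }
    return sum;
}

/* Z80: LD B,(IX+2) / LD C,(IX+3) / PUSH BC / POP IX */
const unsigned z80_chain[] = {19, 19, 11, 14};

/* 6502: LDY #2 / LDA (ptr),Y / TAX / INY / LDA (ptr),Y / STX ptr / STA ptr+1
   (assumes ptr is in zero page and no page crossing) */
const unsigned m6502_chain[] = {2, 5, 2, 2, 5, 3, 3};
```

Summing the first table gives 63 and the second gives 22, so the ratio is a little under 3.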
For 6502 memcpy:
loop: LDA (src),Y
STA (dst),Y
DEY
BNE loop
7 bytes, 16 cycles. That's three times faster than either 6800 (46 cycles) or Z80 (50 cycles).
It does, however, only handle a maximum of 256 bytes. For larger copies you need another loop around this one, which makes the code larger than the 6800 version but doesn't affect the speed (much). In fact, usually two loops are used: one for whole blocks of 256 bytes, then another for the final bytes.
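The two-loop arrangement can be sketched in C like this. It is a model of the structure, not a translation of any particular 6502 routine: `copy_two_loops` is a made-up name, the 8-bit `y` stands in for the Y register, and it counts up rather than down to keep the indexing obvious.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes using an 8-bit index, the way a 6502 routine has to:
   whole 256-byte pages first, then the leftover bytes. */
void copy_two_loops(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t pages = n >> 8;              /* number of full 256-byte blocks */
    uint8_t rest = (uint8_t)(n & 0xFF); /* bytes left over at the end */

    while (pages-- != 0) {
        uint8_t y = 0;
        do {
            dst[y] = src[y];
            y++;
        } while (y != 0);               /* y wraps to 0 after 256 bytes */
        src += 256;                     /* like bumping the pointer high bytes */
        dst += 256;
    }

    for (uint8_t y = 0; y != rest; y++) {
        dst[y] = src[y];
    }
}
```

The inner loop stays as tight as the 256-byte version; the pointer bumps only happen once per page, which is why the outer loop barely affects the speed.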
The 6809 of course solves all problems, with four 16 bit index registers (two of which also support push/pop) and a 16 bit D accumulator.
loop: LDA ,U+
STA ,Y+
LEAX -1,X ; count in X: LEAX/LEAY set Z, LEAU/LEAS do not
BNE loop
8 bytes and, I think, 20 cycles per loop. Actually slightly bigger and slower than the 6502 inner loop, but it handles copies larger than 256 bytes.
You can also do 2 bytes at a time with...
loop: LDD ,U++
STD ,Y++
LEAX -2,X ; count in X, since LEAX sets Z and LEAU would not
BNE loop
... if you deal with a possible odd byte first.
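The odd-byte handling matters because the counter drops by 2 each pass: an odd count would step past zero and the loop would run away. A C sketch of the idea, with `copy_pairs` as a made-up name and a 16-bit count standing in for the count register:

```c
#include <stdint.h>

/* Copy n bytes two at a time, dealing with a possible odd byte first
   so that the remaining count is even and reaches exactly zero. */
void copy_pairs(uint8_t *dst, const uint8_t *src, uint16_t n) {
    if (n & 1) {            /* odd count: move one byte on its own */
        *dst++ = *src++;
        n--;
    }
    while (n != 0) {
        dst[0] = src[0];    /* the 16-bit load/store pair moves 2 bytes */
        dst[1] = src[1];
        src += 2;
        dst += 2;
        n -= 2;             /* count down by 2, loop while non-zero */
    }
}
```

Unlike the assembly loop, this checks the count before the first pass, so a count of zero copies nothing.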