BlockCopy:
beqz a0, BlockCopyEnd
slli a0, a0, 2
add t1, a1, a0
BlockCopyLoop:
addi a1, a1, 4
lw t2, -4(a1)
addi a2, a2, 4
sw t2, -4(a2)
bgtu t1, a1, BlockCopyLoop
BlockCopyEnd:
memcpy a2, a1, a0
[..] because RISC is speed focused, and microcode is subject to the same restrictions, it's not like one can be faster (given the same depth of optimization, obviously). Your only cost seems to be program space. And the jump-return overhead, if using it as a library function rather than inlined (which is just a simple time-space tradeoff).
An example piece of RISC-V (RV32I) assembly code for a block copy of N 'words' looks like (do not hesitate to provide more optimized code if there is any):
[Assume: a0 contains N, a1 the source start address and a2 the destination start address]
BlockCopy:
beqz a0, BlockCopyEnd
slli a0, a0, 2
add t1, a1, a0
BlockCopyLoop:
addi a1, a1, 4
lw t2, -4(a1)
addi a2, a2, 4
sw t2, -4(a2)
bgtu t1, a1, BlockCopyLoop
BlockCopyEnd:
That's 8 instructions, and 5 in the copy loop.
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#define UNALIGNED(X, Y) \
(((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))
/* How many bytes are copied each iteration of the 4X unrolled loop. */
#define BIGBLOCKSIZE (sizeof (long) << 2)
/* How many bytes are copied each iteration of the word copy loop. */
#define LITTLEBLOCKSIZE (sizeof (long))
/* Threshold for punting to the byte copier. */
#define TOO_SMALL(LEN) ((LEN) < BIGBLOCKSIZE)
void *
__inhibit_loop_to_libcall
memcpy (void *__restrict dst0,
        const void *__restrict src0,
        size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  void *save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  char *dst = dst0;
  const char *src = src0;
  long *aligned_dst;
  const long *aligned_src;

  /* If the size is small, or either SRC or DST is unaligned,
     then punt into the byte copy loop.  This should be rare. */
  if (!TOO_SMALL(len0) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible. */
      while (len0 >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len0 -= BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible. */
      while (len0 >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len0 -= LITTLEBLOCKSIZE;
        }

      /* Pick up any residual with a byte copier. */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

  while (len0--)
    *dst++ = *src++;

  return dst0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}
lw t2, -4(a1)
addi a2, a2, 4
sw t2, -4(a2) <- will stall reading register t2 due to a read-after-write (RAW) hazard, if the implementation has more than 6 or 7 stages.
Hmm. Does any implemented block copy instruction copy the data out-of-order to match up with the way memory and/or cache is actually implemented?
Drawback for the "'N' in a register" versions: they would require 3 source registers.
If you use the third register as a counter that is updated for each iteration, that would make the instruction interruptible without too much complication.
That kind of resembles the 68010 loop mode.
// Compare two blocks of memory for equality
lea Source,A0 // A0 points to source block
lea Destination,A1 // A1 points to destination block
move.w #Count-1,D0 // Compare Count words
rpt:
cmpm.w (A0)+,(A1)+ // Compare pair of words
dbne D0,rpt // Repeat until all done
move.w (A0)+,(A1)+ // copy mem(p_src) into mem(p_trg) and auto increment pointers
Just having this thought lately, about potential benefits (or lack thereof) of block copy instructions in a CPU.
Memory block copy is a *very* common operation in typical programs. Efficient block copy seems critical.
The best starting point is a language that takes care of it for you, i.e. not C/C++
Just having this thought lately, about potential benefits (or lack thereof) of block copy instructions in a CPU.
I, too, have wondered and examined these. I came to the conclusion that auto-increment loads, stores, and comparisons (if using a separate comparison status register, unlike e.g. RISC-V where the comparison operation is inherent in the conditional branch instruction) would make the biggest difference. It does not even matter if the increments occur before or after the access, nor even if it differs between loads and stores. If you explore the assembly needed to implement the most commonly used operations – memset(); memcpy_inc() and memcpy_dec() (implements memcpy(), memmove(), memrepeat()); strncmp(), strnlen(), memcmp(), etc.; whichever are most common for your workloads – you'll see that the address increment/decrement is the only part common to all of these, and the one that saves the most cycles per iteration if folded into the load/store/comparison operations.
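To make that concrete, here is a minimal C sketch (my own illustration; memcpy_inc()/memcpy_dec() are the hypothetical names used above, not standard functions). The access-plus-pointer-update pair is the entire loop body, which is exactly what auto-increment addressing folds into one instruction:

#include <stddef.h>

void *memcpy_inc(void *dst, const void *src, size_t n)    /* ascending copy */
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;          /* load with post-increment, store with post-increment */
    return dst;
}

void *memcpy_dec(void *dst, const void *src, size_t n)    /* descending copy */
{
    unsigned char *d = (unsigned char *)dst + n;
    const unsigned char *s = (const unsigned char *)src + n;
    while (n--)
        *--d = *--s;          /* load with pre-decrement, store with pre-decrement */
    return dst;
}

memset(), memcmp(), strncmp() and the rest differ only in what is done with the loaded value; the pointer update is the shared part.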
auto-increment loads, stores, and comparisons [..] would make the biggest difference
load_postinc.size D_reg, (A_reg) // load D_reg from mem[A_reg], and then increment A_reg+=size
store_postinc.size D_reg, (A_reg) // store D_reg into mem[A_reg], and then increment A_reg+=size
size={ sizeof(u8), sizeof(u16), sizeof(u32)} = { 1, 2, 4} bytes (...)
But I wouldn't add anything more than this, otherwise the "RISC" will become a "CRISC" :D
I'd trade 16-bit for autodecrement, and I don't care if the increment is preincrement or postincrement, but yes.
auto-increment loads, stores, and comparisons [..] would make the biggest difference
So basically you want these two, right?
size={1,2,4}, as for {u8, u16, u32} I/O; D_reg: data register; A_reg: address register; in a RISC design both D_reg and A_reg belong to the register file set.
load_postinc.size D_reg, A_reg // load D_reg from mem[A_reg], and then increment A_reg+=size
store_postinc.size D_reg, A_reg // store D_reg into mem[A_reg], and then increment A_reg+=size
single instructions, part of some IS extension
That's what I'd prefer, too. Essentially, there are ten instruction forms that I'd like to see:
static inline const char *skipspaces(const char *src) {
if (src)
while (isspace(*src))
src++;
return src;
}
whereas the correct version is
static inline const char *skipspaces(const char *src) {
if (src)
while (isspace((unsigned char)(*src)))
src++;
return src;
}
The only actual case for sign extension from byte to full register is when you have a variable of type signed char or int8_t, and you load it for integer manipulation. So I do see a need for sign extension for a single byte load to a register; but not as part of memory block functions.
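A trivial C illustration of that point (nothing more than the two obvious cases):

#include <stdint.h>

int widen_signed(const int8_t *p)           /* needs a sign-extending byte load (lb on RISC-V) */
{
    return *p;                              /* -1 must stay -1 after widening to int           */
}

unsigned widen_unsigned(const uint8_t *p)   /* a zero-extending byte load (lbu) suffices       */
{
    return *p;                              /* 0xFF simply becomes 255                         */
}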
Finally, as to how frequent "block copy" operations are, I think many do not realize how much.
They are certainly not just coming from explicit memcpy()-like calls. Anytime data is moved around (variables, objects, ...), a "block copy" is actually issued. I have no "benchmarks" to give figures (and I'd be interested in seeing some), but I'm certain moving memory around is a very significant part of most programs. And even if your particular microarch *and* compiler allow copies as efficient as dedicated instructions/dedicated controllers, that would potentially save significant code memory. Anyway, I recommend reading the above UCB report.
Language semantics have a significant effect. Some languages mandate copying, others don't.
To some extent optimisation can reduce such overhead, if and only if there can be no aliasing (to the specific data) anywhere in the program.
True, although even so, data copy is IMO still significant. What language do you have in mind?
I'd like to find or write a tool that allows identifying data copy at runtime and yield statistics such as min, max, average block size and total amount of copied data, so we get real figures instead of just guessing.
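One cheap way to get part of those figures on a glibc/Linux system is an LD_PRELOAD interposer; a sketch under those assumptions is below (file name and build line are mine, and it only sees explicit memcpy() calls that go through the dynamic linker, not copies the compiler open-codes):

/* memcpy_stats.c: gcc -shared -fPIC -O2 memcpy_stats.c -o memcpy_stats.so -ldl
   Usage: LD_PRELOAD=./memcpy_stats.so ./your_program                          */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

static memcpy_fn real_memcpy;
static unsigned long long calls, bytes, max_len;
static unsigned long long min_len = ~0ULL;

static void report(void)
{
    fprintf(stderr, "memcpy: %llu calls, %llu bytes, min %llu, max %llu, avg %.1f\n",
            calls, bytes, calls ? min_len : 0, max_len,
            calls ? (double)bytes / (double)calls : 0.0);
}

void *memcpy(void *dst, const void *src, size_t n)
{
    if (!real_memcpy) {
        static int initializing;
        if (initializing) {                    /* dlsym() itself may copy memory */
            unsigned char *d = dst;
            const unsigned char *s = src;
            while (n--) *d++ = *s++;
            return dst;
        }
        initializing = 1;
        real_memcpy = (memcpy_fn)dlsym(RTLD_NEXT, "memcpy");
        atexit(report);
        initializing = 0;
    }
    calls++;                                   /* not thread-safe; good enough for a rough measurement */
    bytes += n;
    if (n < min_len) min_len = n;
    if (n > max_len) max_len = n;
    return real_memcpy(dst, src, n);
}

This still misses the open-coded copies discussed further down, which is precisely why whole-program numbers are hard to come by.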
C/C++ mandate copying of structs/objects as parameters to function calls. Many other languages copy a reference to the struct/object, but that can have its own disadvantages.
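A toy example of the difference (names invented for illustration): passing the struct by value copies the whole object on every call, passing a pointer copies only the pointer:

#include <stddef.h>

struct sample { double data[32]; };              /* 256 bytes of payload */

double sum_by_value(struct sample s)             /* call site copies all 256 bytes */
{
    double total = 0.0;
    for (size_t i = 0; i < 32; i++)
        total += s.data[i];
    return total;
}

double sum_by_pointer(const struct sample *s)    /* call site copies one pointer */
{
    double total = 0.0;
    for (size_t i = 0; i < 32; i++)
        total += s->data[i];
    return total;
}

In practice the ABI may pass small structs in registers; the copy becomes real once the struct outgrows them.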
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.
C/C++ mandate copying of structs/objects as parameters to function calls. Many other languages copy a reference to the struct/object,but that can have it's own disadvantages.
Yes, but programmers who care about performance never do that for large objects/structs.
[..] zero copy APIs [..]
The general solution if copying is the performance bottleneck, is to completely eliminate the need to copy. AKA Zero-copy.
Most modern OS's have zero copy APIs for things like files and networking.
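As one concrete example (Linux-specific; file names are placeholders, and on older kernels the destination of sendfile(2) had to be a socket): sendfile() moves the data between two file descriptors inside the kernel, so the bytes never pass through a userspace buffer:

#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int in = open("input.dat", O_RDONLY);                       /* placeholder names */
    int out = open("copy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;

    if (in < 0 || out < 0 || fstat(in, &st) < 0) {
        perror("setup");
        return 1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(out, in, &offset, (size_t)(st.st_size - offset));
        if (sent <= 0) {                                        /* error or nothing left to send */
            perror("sendfile");
            return 1;
        }
    }

    close(in);
    close(out);
    return 0;
}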
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.
On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.
OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.
A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.
The general solution if copying is the performance bottleneck, is to completely eliminate the need to copy. AKA Zero-copy.
Most modern OS's have zero copy APIs for things like files and networking.
That solution can easily bring its own problems, e.g. aliasing (and not being able to prove the absence of aliasing), ownership, memory reclamation.
Language support helps, and its absence hinders optimisation and correctness.
You cannot avoid memory copies completely.
Consider I/O in a general purpose OS with paged virtual memory, where recently used files are cached in otherwise unused ("free") RAM. Standard I/O involves a minimum of one copy from the page cache to the userspace process buffer; typically two, with the second being within the userspace, copying from its own file buffer to the target. Memory-mapped I/O can have significant cost in (kernel) metadata structures and setup and teardown costs, and is not possible when the source is a pipe, socket, character device, or any other non-seekable non-file-like object.
In a multithreaded system, memory copies are essential for efficient operation (RCU (https://en.wikipedia.org/wiki/Read-copy-update)). In particular, if you have a state object protected by a mutex or a rwlock, you always want to copy the object to local storage for general examination and processing, so that the lock duration can be kept to minimum.
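A sketch of that copy-out pattern with plain pthreads (the state struct is made up):

#include <pthread.h>

struct state { int mode; double setpoint; char name[32]; };   /* made-up shared state */

static struct state shared;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

void process_snapshot(void)
{
    struct state local;

    pthread_mutex_lock(&shared_lock);
    local = shared;                    /* the only work under the lock is a block copy */
    pthread_mutex_unlock(&shared_lock);

    /* examine and process 'local' at leisure, without holding the lock */
    (void)local;
}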
Dense cache-effective data structures – arrays and (binary) heaps in particular – require memory copy operations during insertion and deletion. Using pointers not only grows the per-item size (often significantly), but also tends to reduce data locality (because consecutive items are not necessarily near each other in memory), which tends to reduce the effectiveness of caching schemes. A particularly well known example is naïve matrix multiplication on x86/x86-64, where doing an (in-place) transpose to one of the matrices to make the inner loop access memory consecutively gives a significant efficiency boost.
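For the matrix example, the usual illustration looks like this (a sketch only, using a scratch copy of B rather than an in-place transpose, and with no claim about absolute performance on any particular part):

#include <stddef.h>

#define N 512   /* illustrative size */

/* Naive: B is walked down a column, striding N doubles between accesses. */
void matmul_naive(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Transposed copy first: the inner loop reads both operands sequentially. */
void matmul_transposed(const double A[N][N], const double B[N][N], double C[N][N])
{
    static double BT[N][N];                  /* the extra copy the post refers to */

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            BT[j][i] = B[i][j];

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * BT[j][k];
            C[i][j] = sum;
        }
}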
Pointers aren't a panacea - far from it!
I should have phrased that better, now that you point it out! :-DD
In a multithreaded system, memory copies are essential for efficient operation (RCU (https://en.wikipedia.org/wiki/Read-copy-update)). In particular, if you have a state object protected by a mutex or a rwlock, you always want to copy the object to local storage for general examination and processing, so that the lock duration can be kept to minimum.
Are you sure of that?
Pointers aren't a panacea - far from it!
Exactly. Many cache-efficient algorithms do need efficient copy operations.
The VAX instruction set had at least two block copy instructions: MOVC3, MOVC5. (Depending on how you look at it, add MOVTC and MOVTUC.)
The block move instructions were *all* interruptible, since the transfer length could be much larger than a page. There had to be a way of handling a page fault that had to go to disk. There were also real-time considerations, as a 64KB+ block could take seconds to complete on a 730, for instance.
MOVC3 and MOVC5 even dealt with *overlapping* source and destination. Microcode adjusted the order of transfer (front to back, or back to front) in order to get the "right" result.
MOVC5 would fill the destination buffer with a specified character if the source buffer was smaller than the destination buffer.
The interruptibility of the instructions gave rise to quite a few microcoding issues, as the intermediate state was saved in general purpose registers. Fiddling with the return stack before resuming a MOVCx could produce surprise results. However, the VAX architecture commitment mandated that it could not cause a system hang.
These instructions were very important in the paging system and, believe it or not, to the BACKUP program. And it was faster than the "equivalent" series of MOV (Rx)+,(Ry)+, and AOBLEQ.
Exactly. Many cache-efficient algorithms do need efficient copy operations.
As I said, I'd really like to find (which I have little hope about) or write a tool that could analyze code dynamically (at run time) and extract stats about copy operations. For instance, compiling the following:
struct foo {
double d;
float f;
long o;
int i;
char c;
};
void copy(struct foo *dst, struct foo *src)
{
*dst = *src;
}
using -O2 will generate an open-coded copy, not a call to memcpy()/memmove().
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.
On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.
OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.
A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.
Yes, that's why I said above a certain size. You have to be doing a silly amount of small copies to even care about the cost.
Considerably more loop overhead can be eliminated at some point too with a duff device.
The VAX instruction set had at least two block copy instructions: MOVC3, MOVC5. (Depending on how you look at it, add MOVTC and MOVTUC.)
:
These instructions were very important in the paging system and, believe it or not, to the BACKUP program. And it was faster than the "equivalent" series of MOV (Rx)+,(Ry)+, and AOBLEQ.
Pointers aren't a panacea - far from it!
Yeah, for example garbage collected languages. If your struct/data is small enough it can be better to copy it around, rather than use pointers and increase pressure on the gc.
Yes, that's why I said above a certain size. You have to be doing a silly amount of small copies to even care about the cost.
Considerably more loop overhead can be eliminated at some point too with a duff device.
Duff is just a way to do unrolling without having a leftover part at the end by jumping into the middle of the loop the first time. It only works if the pointers are actually updated after every unit copied, not using offset addressing with one big pointer update at the end of the loop -- so it doesn't apply to typical RISC code.
Yes, they use a computed goto/jump.
Duff devices work on MIPS/MIPS64 and ARM/ARM64. So not sure what "doesn't apply to typical RISC code" means.
#include <inttypes.h>

typedef uint64_t word;

/* Duff's device: the switch jumps into the middle of the unrolled do-while.
   Note that, like Duff's original, this assumes count > 0. */
void foo(word *to, word *from, int count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
/* Plain 8x-unrolled copy loop with a scalar tail for the remainder. */
void bar(word *to, word *from, int count) {
    while (count >= 8) {
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        count -= 8;
    }
    while (count > 0) {
        *to++ = *from++;
        count -= 1;
    }
}
0000000000000000 <foo>:
0: 31001c44 adds w4, w2, #0x7
4: 11003843 add w3, w2, #0xe
8: 1a844063 csel w3, w3, w4, mi // mi = first
c: 6b0203e4 negs w4, w2
10: 12000884 and w4, w4, #0x7
14: 12000842 and w2, w2, #0x7
18: 5a844442 csneg w2, w2, w4, mi // mi = first
1c: 13037c63 asr w3, w3, #3
20: 71000c5f cmp w2, #0x3
24: 54000740 b.eq 10c <foo+0x10c> // b.none
28: 5400060d b.le e8 <foo+0xe8>
2c: 7100145f cmp w2, #0x5
30: 540006a0 b.eq 104 <foo+0x104> // b.none
34: 5400022b b.lt 78 <foo+0x78> // b.tstop
38: 7100185f cmp w2, #0x6
3c: 540000e0 b.eq 58 <foo+0x58> // b.none
40: 71001c5f cmp w2, #0x7
44: 540005a1 b.ne f8 <foo+0xf8> // b.any
48: f9400022 ldr x2, [x1]
4c: 91002000 add x0, x0, #0x8
50: f81f8002 stur x2, [x0, #-8]
54: 91002021 add x1, x1, #0x8
58: f9400024 ldr x4, [x1]
5c: 91002002 add x2, x0, #0x8
60: 91002021 add x1, x1, #0x8
64: f9000004 str x4, [x0]
68: f9400024 ldr x4, [x1]
6c: 91002040 add x0, x2, #0x8
70: 91002021 add x1, x1, #0x8
74: f9000044 str x4, [x2]
78: f9400022 ldr x2, [x1]
7c: 91002004 add x4, x0, #0x8
80: 91002021 add x1, x1, #0x8
84: f9000002 str x2, [x0]
88: f9400020 ldr x0, [x1]
8c: 91002082 add x2, x4, #0x8
90: 91002021 add x1, x1, #0x8
94: f9000080 str x0, [x4]
98: f9400024 ldr x4, [x1]
9c: 91002040 add x0, x2, #0x8
a0: 91002021 add x1, x1, #0x8
a4: f9000044 str x4, [x2]
a8: f9400022 ldr x2, [x1]
ac: 51000463 sub w3, w3, #0x1
b0: f9000002 str x2, [x0]
b4: 7100007f cmp w3, #0x0
b8: 5400020d b.le f8 <foo+0xf8>
bc: 91002021 add x1, x1, #0x8
c0: 91002000 add x0, x0, #0x8
c4: f9400022 ldr x2, [x1]
c8: 91002021 add x1, x1, #0x8
cc: f9000002 str x2, [x0]
d0: 91002000 add x0, x0, #0x8
d4: f9400022 ldr x2, [x1]
d8: 91002021 add x1, x1, #0x8
dc: f9000002 str x2, [x0]
e0: 91002000 add x0, x0, #0x8
e4: 17ffffdd b 58 <foo+0x58>
e8: 7100045f cmp w2, #0x1
ec: 54fffde0 b.eq a8 <foo+0xa8> // b.none
f0: 5400006c b.gt fc <foo+0xfc>
f4: 34fffe82 cbz w2, c4 <foo+0xc4>
f8: d65f03c0 ret
fc: aa0003e2 mov x2, x0
100: 17ffffe6 b 98 <foo+0x98>
104: aa0003e2 mov x2, x0
108: 17ffffd8 b 68 <foo+0x68>
10c: aa0003e4 mov x4, x0
110: 17ffffde b 88 <foo+0x88>
114: d503201f nop
0000000000000118 <bar>:
118: 71001c5f cmp w2, #0x7
11c: 5400052d b.le 1c0 <bar+0xa8>
120: 51002042 sub w2, w2, #0x8
124: aa0103e3 mov x3, x1
128: 53037c47 lsr w7, w2, #3
12c: d37a70e6 ubfiz x6, x7, #6, #29
130: 910100c6 add x6, x6, #0x40
134: 8b060004 add x4, x0, x6
138: f9400065 ldr x5, [x3]
13c: 91010063 add x3, x3, #0x40
140: f9000005 str x5, [x0]
144: 91010000 add x0, x0, #0x40
148: eb00009f cmp x4, x0
14c: f85c8065 ldur x5, [x3, #-56]
150: f81c8005 stur x5, [x0, #-56]
154: f85d0065 ldur x5, [x3, #-48]
158: f81d0005 stur x5, [x0, #-48]
15c: f85d8065 ldur x5, [x3, #-40]
160: f81d8005 stur x5, [x0, #-40]
164: f85e0065 ldur x5, [x3, #-32]
168: f81e0005 stur x5, [x0, #-32]
16c: f85e8065 ldur x5, [x3, #-24]
170: f81e8005 stur x5, [x0, #-24]
174: f85f0065 ldur x5, [x3, #-16]
178: f81f0005 stur x5, [x0, #-16]
17c: f85f8065 ldur x5, [x3, #-8]
180: f81f8005 stur x5, [x0, #-8]
184: 54fffda1 b.ne 138 <bar+0x20> // b.any
188: 8b060021 add x1, x1, x6
18c: 4b070c42 sub w2, w2, w7, lsl #3
190: 7100005f cmp w2, #0x0
194: 5400014d b.le 1bc <bar+0xa4>
198: 51000442 sub w2, w2, #0x1
19c: d2800000 mov x0, #0x0 // #0
1a0: 91000442 add x2, x2, #0x1
1a4: d503201f nop
1a8: f8607823 ldr x3, [x1, x0, lsl #3]
1ac: f8207883 str x3, [x4, x0, lsl #3]
1b0: 91000400 add x0, x0, #0x1
1b4: eb02001f cmp x0, x2
1b8: 54ffff81 b.ne 1a8 <bar+0x90> // b.any
1bc: d65f03c0 ret
1c0: aa0003e4 mov x4, x0
1c4: 17fffff3 b 190 <bar+0x78>
00000000 <foo>:
0: 24c30007 addiu v1,a2,7
4: 04610002 bgez v1,10 <foo+0x10>
8: 3c028000 lui v0,0x8000
c: 24c3000e addiu v1,a2,14
10: 24420007 addiu v0,v0,7
14: 00c23024 and a2,a2,v0
18: 04c10005 bgez a2,30 <foo+0x30>
1c: 000340c3 sra t0,v1,0x3
20: 24c6ffff addiu a2,a2,-1
24: 2402fff8 li v0,-8
28: 00c23025 or a2,a2,v0
2c: 24c60001 addiu a2,a2,1
30: 2cc20008 sltiu v0,a2,8
34: 10400040 beqz v0,138 <foo+0x138>
38: 3c020000 lui v0,0x0
3c: 24420000 addiu v0,v0,0
40: 00063080 sll a2,a2,0x2
44: 00463021 addu a2,v0,a2
48: 8cc20000 lw v0,0(a2)
4c: 00400008 jr v0
50: 00000000 nop
54: 8ca30004 lw v1,4(a1)
58: 8ca20000 lw v0,0(a1)
5c: 24840008 addiu a0,a0,8
60: ac83fffc sw v1,-4(a0)
64: ac82fff8 sw v0,-8(a0)
68: 24a50008 addiu a1,a1,8
6c: 8ca70004 lw a3,4(a1)
70: 8ca60000 lw a2,0(a1)
74: 24820008 addiu v0,a0,8
78: 24a50008 addiu a1,a1,8
7c: ac870004 sw a3,4(a0)
80: ac860000 sw a2,0(a0)
84: 8ca70004 lw a3,4(a1)
88: 8ca60000 lw a2,0(a1)
8c: 24440008 addiu a0,v0,8
90: 24a50008 addiu a1,a1,8
94: ac470004 sw a3,4(v0)
98: ac460000 sw a2,0(v0)
9c: 8ca30004 lw v1,4(a1)
a0: 8ca20000 lw v0,0(a1)
a4: 24860008 addiu a2,a0,8
a8: 24a50008 addiu a1,a1,8
ac: ac830004 sw v1,4(a0)
b0: ac820000 sw v0,0(a0)
b4: 8cab0004 lw t3,4(a1)
b8: 8caa0000 lw t2,0(a1)
bc: 24c20008 addiu v0,a2,8
c0: 24a50008 addiu a1,a1,8
c4: accb0004 sw t3,4(a2)
c8: acca0000 sw t2,0(a2)
cc: 8ca70004 lw a3,4(a1)
d0: 8ca60000 lw a2,0(a1)
d4: 24440008 addiu a0,v0,8
d8: 24a50008 addiu a1,a1,8
dc: ac470004 sw a3,4(v0)
e0: ac460000 sw a2,0(v0)
e4: 8ca30004 lw v1,4(a1)
e8: 8ca20000 lw v0,0(a1)
ec: 2508ffff addiu t0,t0,-1
f0: ac830004 sw v1,4(a0)
f4: 19000010 blez t0,138 <foo+0x138>
f8: ac820000 sw v0,0(a0)
fc: 24a50008 addiu a1,a1,8
100: 24840008 addiu a0,a0,8
104: 8ca30004 lw v1,4(a1)
108: 8ca20000 lw v0,0(a1)
10c: ac830004 sw v1,4(a0)
110: ac820000 sw v0,0(a0)
114: 24a50008 addiu a1,a1,8
118: 8ca30004 lw v1,4(a1)
11c: 8ca20000 lw v0,0(a1)
120: 24840008 addiu a0,a0,8
124: ac830004 sw v1,4(a0)
128: ac820000 sw v0,0(a0)
12c: 24a50008 addiu a1,a1,8
130: 1000ffce b 6c <foo+0x6c>
134: 24840008 addiu a0,a0,8
138: 03e00008 jr ra
13c: 00000000 nop
140: 1000ffd0 b 84 <foo+0x84>
144: 00801025 move v0,a0
148: 1000ffe0 b cc <foo+0xcc>
14c: 00801025 move v0,a0
150: 1000ffd8 b b4 <foo+0xb4>
154: 00803025 move a2,a0
00000158 <bar>:
158: 28c20008 slti v0,a2,8
15c: 14400030 bnez v0,220 <bar+0xc8>
160: 00a01825 move v1,a1
164: 00c03825 move a3,a2
168: 00801025 move v0,a0
16c: 8c690004 lw t1,4(v1)
170: 8c680000 lw t0,0(v1)
174: ac490004 sw t1,4(v0)
178: ac480000 sw t0,0(v0)
17c: 24630040 addiu v1,v1,64
180: 8c69ffcc lw t1,-52(v1)
184: 8c68ffc8 lw t0,-56(v1)
188: ac49000c sw t1,12(v0)
18c: ac480008 sw t0,8(v0)
190: 8c69ffd4 lw t1,-44(v1)
194: 8c68ffd0 lw t0,-48(v1)
198: 24420040 addiu v0,v0,64
19c: ac49ffd4 sw t1,-44(v0)
1a0: ac48ffd0 sw t0,-48(v0)
1a4: 8c69ffdc lw t1,-36(v1)
1a8: 8c68ffd8 lw t0,-40(v1)
1ac: ac49ffdc sw t1,-36(v0)
1b0: ac48ffd8 sw t0,-40(v0)
1b4: 8c69ffe4 lw t1,-28(v1)
1b8: 8c68ffe0 lw t0,-32(v1)
1bc: ac49ffe4 sw t1,-28(v0)
1c0: ac48ffe0 sw t0,-32(v0)
1c4: 8c69ffec lw t1,-20(v1)
1c8: 8c68ffe8 lw t0,-24(v1)
1cc: ac49ffec sw t1,-20(v0)
1d0: ac48ffe8 sw t0,-24(v0)
1d4: 8c69fff4 lw t1,-12(v1)
1d8: 8c68fff0 lw t0,-16(v1)
1dc: ac49fff4 sw t1,-12(v0)
1e0: ac48fff0 sw t0,-16(v0)
1e4: 8c69fffc lw t1,-4(v1)
1e8: 8c68fff8 lw t0,-8(v1)
1ec: 24e7fff8 addiu a3,a3,-8
1f0: 28ea0008 slti t2,a3,8
1f4: ac49fffc sw t1,-4(v0)
1f8: 1140ffdc beqz t2,16c <bar+0x14>
1fc: ac48fff8 sw t0,-8(v0)
200: 24c6fff8 addiu a2,a2,-8
204: 000610c2 srl v0,a2,0x3
208: 24430001 addiu v1,v0,1
20c: 00031980 sll v1,v1,0x6
210: 000210c0 sll v0,v0,0x3
214: 00832021 addu a0,a0,v1
218: 00a32821 addu a1,a1,v1
21c: 00c23023 subu a2,a2,v0
220: 18c00009 blez a2,248 <bar+0xf0>
224: 000630c0 sll a2,a2,0x3
228: 00863021 addu a2,a0,a2
22c: 24a50008 addiu a1,a1,8
230: 8ca3fffc lw v1,-4(a1)
234: 8ca2fff8 lw v0,-8(a1)
238: 24840008 addiu a0,a0,8
23c: ac83fffc sw v1,-4(a0)
240: 1486fffa bne a0,a2,22c <bar+0xd4>
244: ac82fff8 sw v0,-8(a0)
248: 03e00008 jr ra
24c: 00000000 nop
Yes, they use a computed goto/jump.
Duff devices work on MIPS/MIPS64 and ARM/ARM64. So not sure what "doesn't apply to typical RISC code" means.
Let's check.
Any questions?
In both cases, the 2nd version, without Duff, is going to be much preferable.
Questions, yes: why would you even write another memcpy() function in C? Makes no sense. The existing memcpy() is likely to have all kinds of optimisations already implemented.
Here is Go's assembly duff code generator for various CPUs for duffcopy/zero.
https://golang.org/src/runtime/mkduff.go
It's been notorious that throughout the last 42 years REP MOVSB on x86 has been slower than a well written explicit loop on (almost?) every Intel and AMD model.
Threshold to use Enhanced REP MOVSB. Since there is overhead to set up REP MOVSB operation, REP MOVSB isn't faster on short data. The memcpy micro benchmark in glibc shows that 2KB is the approximate value above which REP MOVSB becomes faster than SSE2 optimization on processors with Enhanced REP MOVSB. Since larger register size can move more data with a single load and store, the threshold is higher with larger register size.
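For reference, "using REP MOVSB" in C looks roughly like the following (x86-64, GCC/Clang inline assembly; a sketch, not glibc's actual implementation):

#include <stddef.h>

/* RDI = destination, RSI = source, RCX = count; all three are updated by the instruction. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *d = dst;

    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return dst;
}

Whether this beats an SSE/AVX copy loop depends on the size and the microarchitecture, exactly as the glibc comment above says.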
.global memcpy
# void *memcpy(void* dest, const void* src, size_t n)
# a0=dest, a1=src, a2=n
#
memcpy:
mv a3, a0 # Copy destination
loop:
vsetvli t0, a2, e8,m8,ta,ma # Vectors of 8b
vle8.v v0, (a1) # Load bytes
add a1, a1, t0 # Bump pointer
sub a2, a2, t0 # Decrement count
vse8.v v0, (a3) # Store bytes
add a3, a3, t0 # Bump pointer
bnez a2, loop # Any more?
ret # Return
For a RISC-type instruction set, having an extension to enable byte and full register-wide address autoincrement and autodecrement loads and stores yields most of the possible speedups.
This is not correct. Adding autoincrement loads to an ISA such as RISC-V suddenly requires that it be possible for an integer instruction to write results to two registers instead of one.
That only applies to loads. Another possibility for loads is to merge the autoincrement/decrement into the pointer comparison branching instruction. (Then only two additional instructions are needed: one that autoincrements a register and branches if it is less than (or equal to) another register; and another that autodecrements a register and branches if it is greater than (or equal to) another register. Technically, branching whenever not equal suffices. Whether the register store is done always, or only when the branch is taken, and whether the autoincrement/decrement occurs before or after the comparison, is irrelevant as long as it is consistent.)
Isn't this getting into the "much ado about nothing" when your pipelined instruction stream is churning along at less than 1 cycle (<1ns) per instruction, but your memory is taking 60ns or so for each access? Block copies are going to be more limited by cache architectures and the memory system than by whether you have a specialized instruction or not.
That only applies to loads. Another possibility for loads is to merge the autoincrement/decrement into the pointer comparison branching instruction.
For copies that don't hit in cache, absolutely right, though you do need to consider that your 60 ns access time is then going to give you at least 64 bytes of data.
In order that two instructions in the pipeline can be executed simultaneously, they must not have data dependencies. A sequence of auto-increment instructions like r2=*r1++, r3=*r1++,... actually has data dependencies (via r1), i.e. r3=*r1++ cannot start before r2=*r1++ has finished (unless the CPU splits them internally to microinstructions w/o auto-increment like r2=*r1, r1++, r3=*r1, r1++,... -- then the memory fetch microoperations can overlap again).
Right.
Yet another reason that r2=0[r1]; r3=1[r1]; r4=2[r1]; r5=3[r1]; r1+=4 is a much better idea.
Yet another reason that r2=0[r1]; r3=1[r1]; r4=2[r1]; r5=3[r1]; r1+=4 is a much better idea.
Which works for large block copies, and does nothing to help with the other related operations. :(
EDIT:
Yet another point which has not been considered/discussed yet (or maybe I missed it?) is branch misprediction of the copy loop's conditional branch. A branch misprediction penalty of one to three dozen clock cycles is not unusual. In detail it depends of course on the particular microarchitecture, and the CPU's state and operations in progress when it happens.
Ahhh ... 3 *dozens* of clock cycles? No way. I can't think of any machine that would apply to.
Note that with modern branch predictors that use a branch history table as part of the predictor (a series of true/false bits indicating whether the last N branches were taken or not taken), it is typical for loops that iterate just a few times (anywhere from 10 to 30) to have EVERY branch correctly predicted, including the final branch that exits the loop.
Ahhh ... 3 *dozens* of clock cycles? No way. I can't think of any machine that would apply to.
I'm basically referring to the analysis of the authors of this paper (https://users.elis.ugent.be/~leeckhou/papers/ispass06-eyerman.pdf) which say that
"the branch misprediction penalty can be substantially larger than the frontend pipeline length
(which is often equated with the misprediction penalty)"
if the indirect side-effects of the misprediction are taken into account. Figure 1 shows the analyzed average penalties for a couple of different application profiles, and they range from 9 to 35 cycles. They are obviously very application-dependent. I was surprised, too, that the indirect effects can be so significant.
Another point is that the BPT is basically yet another N-way cache (indexed by a couple of bits from the branch instruction's address), so that the former history of a branch can get evicted, and the entry reused for other branch instructions in the program which happen to map to the same index.
Note that with modern branch predictors that use a branch history table as part of the predictor (a series of true/false bits indicating whether the last N branches were taken or not taken), it is typical for loops that iterate just a few times (anywhere from 10 to 30) to have EVERY branch correctly predicted, including the final branch that exits the loop.
Luckily there are cases where this indeed works, but I still doubt that every branch inside a memcpy function can be predicted correctly, given that the function can be called with a different size argument upon each invocation.
An example piece of RISC-V (RV32I) assembly code for a block copy of N 'words' looks like (do not hesitate to provide more optimized code if there is any):
[Assume: a0 contains N, a1 the source start address and a2 the destination start address]
Something brucehoult mentioned above made me realize we haven't discussed these operations at the pipeline stage level, to discover what the theoretical maximum copy rate is on your target processor. I assume it can only do one RAM access per clock cycle (load or store); and only one register file store per clock cycle.
What operations can be done in parallel within the same clock cycle?
(I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions.)
That's exactly why I wrote, "on [SiliconWizard's] target processor". If we limit to RISC-V alone, then there is nothing to discuss, is there?
What operations can be done in parallel within the same clock cycle?
Depends on the processor's microarchitecture. "RISC-V" is not a particular processor implementation, but an open Instruction Set Architecture. Optimizations at this level are no longer generic, but microarchitecture-dependent.
(I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions.)
Not completely sure what Nominal Animal meant here: "I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions."
The question is, what are the allowed combinations of "many things" on your implementation: which micro-operations your target can do at the same time.
Any instruction can do many things at once as long as the logic path doesn't exceed your requirements.
Based on the above, the utter maximum theoretical copy rate is half a word (of native register size) per clock cycle, or equivalently two clock cycles per native register size word copied. (For string operations, two cycles per byte copied, unless you have a branch instruction "branch if none of the bytes in the target register are zeroes".)
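In the absence of such a branch, the usual software substitute is the SWAR test for "does this word contain a zero byte" (the well-known bit trick; shown here for 32-bit words):

#include <stdint.h>

/* Nonzero if any byte of x is 0x00. */
static inline int has_zero_byte(uint32_t x)
{
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}

That costs a handful of ALU instructions per word instead of one fused branch, which is the gap the hypothetical instruction would close.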
Because we cannot create an unrolled version for every single length, up to one cycle per iteration is taken by the loop branch. As brucehoult showed earlier, that overhead depends on the number of unrolled copies per loop iteration – but then the block size needs to be at least as large as the copy step.
That leaves the address register updates. With offset addressing (a small immediate constant, a multiple of the native register size, added to a register), that too boils down to one or two cycles per loop iteration (depending on whether updating a register value can be done in a single cycle or not). If the register update can be done in parallel with a load or store or the loop branch, its "cost" vanishes.
It seems to me that both a block-memcpy instruction and a normal assembly loop should be able to get arbitrarily close to two cycles per native word copied (using the above assumptions), and that for anything better than this, you need e.g. wider single-cycle memory accesses, or something else outside the box here, so to speak. If you agree, then shouldn't we look at which low-level operations can be done in parallel in the same clock cycle on your processor design, to see how to reach the two cycles per word copy rate? And what kind of instructions would get there?
add limit,src,size
loop:
lw tmp, 0(src); addi src,src,8
sw tmp, 0(dst); addi dst,dst,8
lw tmp, -4(src); cmp src,limit
sw tmp, -4(dst); blt loop
For an example of the other possibilities, let's say you could implement lw rd, (rs1+rs2), i.e. load from an address that is the sum of two registers, and the store-autoinc/dec. Then, you could use rs1 as the target address (autoinc/dec'd), and rs2 as a constant, the difference between the target and source addresses. You'd get the same near-maximum copy rate as the last snippet above, using one more register:
sub a1, a1, a2
.loop:
lw t1, (a2+a1)
swi (a2++), t1
lw t1, (a2+a1)
swi (a2++), t1
bne a2, a3, .loop
addi a2,a2,-4
sub a1,a1,a2
loop:
lw t1,(a2,a1)
sw! t1,4(a2)
bne a2,a3,loop
However I wanted to improve on this in the RISC-V B extension, but without doing something as crude as a special "strlen" function.
It is not just strlen()/strnlen(), but also strcmp()/strncmp() and strcpy()/strncpy(), these being the three most used string functions.
With a dual issue machine, you only need to unroll the loop once to completely hide the update of two pointers, a compare against a limit, and a conditional branch. (the compare and branch are one instruction on RISC-V)
Quite true, and I did not think of that (or rather, that it was pertinent here).
It's cheaper to write back the effective address you already calculated.
This! This is exactly the kind of low-level µ-op detail I wanted to know. Your version is obviously superior to mine; how did I miss it earlier?
But it does require one more adjustment instruction before the loop.
It really depends on the function signature. With known pointers/size, at least GCC will implement the copy loop itself, and avoid the adjustment by folding in the "constants". It does the same for e.g. static inline assembly functions, where the arithmetic is done when supplying input register values, when the parameter values are known at compile time. For unknown pointers/size, for standard C functions, there is at minimum an addition to calculate the limiting address, and possibly some validity checks (at least for the user-programmer facing functions), so one more adjustment is just noise there.
Here is a practical suggestion for extending RV32I (a C model of the resulting copy loop follows the list):
- SBi (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and increment rd by one.
- SBd (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and decrement rd by one.
- SWi (rd), rs: Store register rs to memory address specified by register rd, and increment rd by four.
- SWd (rd), rs: Store register rs to memory address specified by register rd, and decrement rd by four.
- BNEi rd, rs, 1<<n, delta: Compare rd to rs, and increment rd by 1<<n. If not equal, branch by delta.
- BNEd rd, rs, 1<<n, delta: Compare rd to rs, and decrement rd by 1<<n. If not equal, branch by delta.
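Here is the promised C model of the word-copy inner loop built from those forms (the instructions are hypothetical, so the per-statement comments only indicate which proposed form each line stands for):

#include <stdint.h>
#include <stddef.h>

void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    const uint32_t *limit = src + nwords;

    while (src != limit) {
        uint32_t t = *src;        /* lw   t, 0(src)                              */
        *dst = t; dst++;          /* SWi  (dst), t : store, then dst += 4        */
        src++;                    /* BNEi src, limit, 4, loop : increment src,   */
    }                             /*      compare to limit, branch while unequal */
}

Three instructions in the loop body: a load, a post-incrementing store, and the fused increment-compare-branch.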
I see absolutely no issue nor any required added latency for any of the instructions you suggest above. The only difference with existing instructions is the additional inc/dec of a dest register. This can be done fully in parallel with the rest of what they have to do, so no problem.
Excellent; thanks. :-+
However I wanted to improve on this in the RISC-V B extension, but without doing something as crude as a special "strlen" function.
It is not just strlen()/strnlen(), but also strcmp()/strncmp() and strcpy()/strncpy(), these being the three most used string functions.
If the branch does a byte-wise "branch only if none are equal" comparison between the 4 bytes in the two registers, you could also implement strchr()/strnchr(), and speed up strstr()/memmem().
If the branch does a byte-wise "branch only if none are equal" comparison between 4×4 bytes (in the two registers), you could also implement strpbrk()/strspn()/strcspn() with very low cost, since each register could hold up to four initial bytes in a matching sequence.
I see absolutely no issue nor any required added latency for any of the instructions you suggest above.
The only difference with existing instructions is the additional inc/dec of a dest register. This can be done fully in parallel with the rest of what they have to do, so no problem.
Better still if the *programmer* calculates the strlen() first, and uses memcpy()/memcmp(), and remembers the strlen() for later use.
It is still duplicated work; I'd like to avoid that.
Probably nobody did, it was such a poor description!
If the branch does a byte-wise "branch only if none are equal" comparison between 4×4 bytes (in the two registers), you could also implement strpbrk()/strspn()/strcspn() with very low cost, since each register could hold up to four initial bytes in a matching sequence.
I didn't quite get that.
On the most modern systems block-copy performance is limited by the latency and throughput of RAM interface, and therefore a single block-copy CISC instruction would have no noticeable advantage over multiple RISC instructions.
Except cache still carries a latency penalty. For example Skylake’s L1 data cache latency is a whopping 4 cycles per operation (8 cycles in total, since you need to read and write). And that’s if you are lucky enough to have both source and destination lines in the tiny L1 cache.
Except most block-copy calls (and probably most bytes copied in block copies) are in sizes far smaller than typical L1 caches, so a CISCy instruction with lower start-up overhead than a RISC software loop (choosing the appropriate version of copy, initializing registers etc) could have an advantage.
Except Intel have finally recently gotten REP MOVSB faster than a software loop -- for large copies. If I understand certain web pages correctly, on Skylake and newer REP MOVSB takes 20 cycles for any copy less than 12 bytes, and has a 40 cycle startup overhead for any copy greater than 76 bytes, but then copies 64 bytes every 4 cycles.
Writing to a “write back” cache can also have latency. If the line being written to is not present in the cache, it will have to be loaded from RAM first, unless the write covers a whole, perfectly aligned cache line.