EEVblog Electronics Community Forum

Products => Computers => Programming => Topic started by: SiliconWizard on January 13, 2021, 01:20:38 am

Title: Block copy instructions
Post by: SiliconWizard on January 13, 2021, 01:20:38 am
Just having this thought lately, about potential benefits (or lack thereof) of block copy instructions in a CPU.

Memory block copy is a *very* common operation in typical programs. Efficient block copy seems critical. (We can also refer to the other thread dealing with strcpy() and memcpy() for instance.)
Some CPUs implement block copy instructions (such as Intel x86 processors), others do not.

Random thoughts about that:
* As a "complex" kind of instruction, they may be more of a CISC thing than a RISC thing.
* On RISC processors, it may make more sense to "fuse" simpler instructions than implement such "complex" instructions. Fusing instructions, super-scalar execution, and such, is not trivial though.
* Block copy instructions would typically be non-interruptible, possibly taking MANY cycles and thus "blocking" the CPU, which could cause issues with interrupt latency and just thread scheduling.
* They could be made interruptible, but it would make the CPU design more complex.
* A DMA controller could be used instead, but it fits different needs IMO, would have extra setup overhead...

An example piece of RISC-V (RV32I) assembly code for a block copy of N 'words' looks like this (do not hesitate to provide more optimized code if you have any):
[Assume: a0 contains N, a1 the source start address and a2 the destination start address]
Code: [Select]
BlockCopy:
    beqz a0, BlockCopyEnd
    slli a0, a0, 2
    add t1, a1, a0
   
BlockCopyLoop:
    addi a1, a1, 4
    lw t2, -4(a1)
    addi a2, a2, 4
    sw t2, -4(a2)
    bgtu t1, a1, BlockCopyLoop

BlockCopyEnd:

That's 8 instructions, and 5 in the copy loop.

We could imagine a series of block copy instructions doing just that, with varying width (per byte, per half word, per word copy), and 'N' in a register or immediate.
For instance, the block copy instruction dealing with words and 'N' in a register could look like:
Code: [Select]
memcpy a2, a1, a0

Drawback for the "'N' in a register" versions: they would require 3 source registers. The immediate version would require only 2.
Such instructions don't seem very complicated to implement at all.
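To make the proposal concrete, here is a C model of what such an instruction would do architecturally (the operand roles are just the ones assumed above: a0 = word count, a1 = source, a2 = destination; this is only a behavioural sketch, not an implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical semantics of the proposed "memcpy a2, a1, a0" instruction:
   copy n 32-bit words from src to dst. A behavioural model only -- the
   hardware would presumably stream these accesses on the bus. */
static void block_copy_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```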

Your thoughts?
Title: Re: Block copy instructions
Post by: Cerebus on January 13, 2021, 02:07:22 am
I suspect that a few minutes spent with an appropriate search engine will turn up more papers and PhD theses than you'll care to read evaluating block copy instructions versus macro-coded block copies.

Without empirical data any opinions I'd have would largely amount to 'hand waving' and the last thing we need on here is more of that kind of stuff. I will however say that my instinct is that in a well designed RISC CPU a block copy instruction won't be worth the complexity it adds.
Title: Re: Block copy instructions
Post by: T3sl4co1l on January 13, 2021, 04:24:27 am
I mean, that's all the 8086 is doing, via microcode.  In essence, you've shifted some user-space software into the "BIOS", except not even that, it's a deeper level, inside the CPU itself.  It's like having a very brief library built into the hardware.

RISC tends to be a bit verbose on instruction length (preferring fixed length encodings at full bus width, and obviously there are many exceptions, on the ISA level (e.g. ARM Thumb), or within an ISA (an instruction with a full-bus-width parameter e.g. jump-immediate)), so the software version does tend to take up a fair amount of space.  But that space all fits in the instruction cache, so it's no execution penalty.  And such systems usually have more than enough program memory or cache to handle it (as long as you don't mind the cost of that memory, which was a bigger deal say in the 80s and 90s).

As I understand it, the point of RISC is to avoid, as much as possible, anything that can't be done in a single cycle.  So you very commonly get load-store architectures, even very ornate ones that can do quite a lot in that cycle (complex ALU ops on multiple arguments -- gobs and gobs of hardware, deep pipelines).  But never a humble little ADD [mem], imm that requires a read-modify-write cycle.

And because RISC is speed focused, and microcode is subject to the same restrictions, it's not like one can be faster (given the same depth of optimization, obviously).  Your only cost seems to be program space.  And the jump-return overhead, if using it as a library function rather than inlined (which is just a simple time-space tradeoff).

The natural exceptions then being, anywhere caches or multiple buses can fill in for those hazards.  A memory bus could in principle be designed for certain elementary modify operations.  I don't know the implementation details of course, but some AVR instructions or peripheral registers manifest in that way: certain IO locations can be modified at no cycle penalty; or some peripherals present a set of registers that perform the respective operations, and presumably the peripheral is resolving those internally.  Or DSPs for instance, where multiple buses are the norm, so massive throughputs can be had even on very dense operations like MAC [dest], [src1], [src2] (where all three operands are memory).  Anyway, these are architecture specific and I have very little practical knowledge about the architectures that use these methods (caches and types, multiple buses).

Put another way, the reason I think CISCs tend to implement block instructions is that, due to their complex instruction decoding and multiple housekeeping cycles per instruction, they are at a significant disadvantage doing certain very common -- namely, block or string -- operations.  So it is advantageous to provide that shortcut.

(With peculiar side cases being understood, like, wasn't it on several Pentiums that the written-out loop can perform better than REP STOS?  A side effect of the early transition to micro-op architectures, I think?)

Finally, the differences disappear in older and simpler architectures; I've seen 68k called RISCy before, though maybe that's not a good description of it after all.  Its instructions tend to be faster than 8086's?  Though that's also mainly due to 8086's stupid EA modes.  Architectures lacking pipelines tend to perform about the same.  Even AVR (firmly RISC?) has comparable speed to those classic chips, despite being only 8 bits, and a much newer process (though hardly cutting edge <10nm or anything).

Tim
Title: Re: Block copy instructions
Post by: DiTBho on January 13, 2021, 06:51:43 am
[..] because RISC is speed focused, and microcode is subject to the same restrictions, it's not like one can be faster (given the same depth of optimization, obviously).  Your only cost seems to be program space.  And the jump-return overhead, if using it as a library function rather than inlined (which is just a simple time-space tradeoff).

sums up perfectly what I think  :D
Title: Re: Block copy instructions
Post by: brucehoult on January 13, 2021, 08:21:03 am
An example piece of RISC-V (RV32I) assembly code for a block copy of N 'words' looks like this (do not hesitate to provide more optimized code if you have any):
[Assume: a0 contains N, a1 the source start address and a2 the destination start address]
Code: [Select]
BlockCopy:
    beqz a0, BlockCopyEnd
    slli a0, a0, 2
    add t1, a1, a0
   
BlockCopyLoop:
    addi a1, a1, 4
    lw t2, -4(a1)
    addi a2, a2, 4
    sw t2, -4(a2)
    bgtu t1, a1, BlockCopyLoop

BlockCopyEnd:

That's 8 instructions, and 5 in the copy loop.

And 5 clock cycles per word copied, on a typical single-issue RISC pipeline (assuming the branches are correctly predicted).

You can easily drop that to as close to 2 instructions (cycles) per 4 bytes as you want by unrolling a few times -- unrolling four times gives you 11 instructions for 4 words copied (2.75 cycles/word), 16 times gives you 35 instructions for 16 words copied (2.19 cycles/word).
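For reference, the same unrolling idea expressed in C (a sketch; a hand-unrolled RISC-V loop would have the equivalent structure, with the pointer bumps and branch amortised over four load/store pairs):

```c
#include <stddef.h>
#include <stdint.h>

/* Word copy unrolled 4x: the loop overhead (pointer updates + branch) is
   paid once per four load/store pairs instead of once per word.
   n is a count of 32-bit words. */
static void copy_words_unrolled4(uint32_t *dst, const uint32_t *src, size_t n)
{
    while (n >= 4) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        n -= 4;
    }
    while (n--)            /* residual 0..3 words */
        *dst++ = *src++;
}
```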

That's assuming you're copying in SRAM or L1 cache. If you're copying in DRAM then you're going to quickly hit the memory bandwidth limit anyway.

You need a bit of overhead to deal with left-over bytes, misalignment etc.

Newlib unrolls four times. Newlib Nano does a byte loop.

Code: [Select]
/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) \
  (((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))

/* How many bytes are copied each iteration of the 4X unrolled loop.  */
#define BIGBLOCKSIZE    (sizeof (long) << 2)

/* How many bytes are copied each iteration of the word copy loop.  */
#define LITTLEBLOCKSIZE (sizeof (long))

/* Threshhold for punting to the byte copier.  */
#define TOO_SMALL(LEN)  ((LEN) < BIGBLOCKSIZE)

void *
__inhibit_loop_to_libcall
memcpy (void *__restrict dst0,
const void *__restrict src0,
size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  void *save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  char *dst = dst0;
  const char *src = src0;
  long *aligned_dst;
  const long *aligned_src;

  /* If the size is small, or either SRC or DST is unaligned,
     then punt into the byte copy loop.  This should be rare.  */
  if (!TOO_SMALL(len0) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible.  */
      while (len0 >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len0 -= BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible.  */
      while (len0 >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len0 -= LITTLEBLOCKSIZE;
        }

       /* Pick up any residual with a byte copier.  */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

  while (len0--)
    *dst++ = *src++;

  return dst0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}

I actually have a criticism of this code. I'd remove the TOO_SMALL() test as both pointless (the "while (len0 >= BIGBLOCKSIZE)" will do the same job) and harmful, because even going around a word copy loop once (let alone three times) is going to be better than copying a byte at a time.

Through the magic of modern compilers, all those *foo++ get converted to optimal code with offsets from a pointer that is updated once per loop.
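For what it's worth, here is a sketch of that simplification -- the newlib structure with the TOO_SMALL() punt removed, so even a short aligned copy goes through the word loop at least once. The function name and exact shape are mine, not newlib's:

```c
#include <stddef.h>
#include <stdint.h>

/* memcpy variant without the TOO_SMALL() test: the "while (len >= ...)"
   guards already handle short lengths, and a short aligned copy now gets
   word-sized iterations instead of falling straight to the byte loop. */
static void *memcpy_noprep(void *restrict dst0, const void *restrict src0,
                           size_t len)
{
    char *dst = dst0;
    const char *src = src0;
    const size_t ws = sizeof(long);

    if ((((uintptr_t)dst | (uintptr_t)src) & (ws - 1)) == 0) {
        long *ad = (long *)dst;
        const long *as = (const long *)src;
        while (len >= 4 * ws) {                      /* 4x unrolled */
            *ad++ = *as++; *ad++ = *as++;
            *ad++ = *as++; *ad++ = *as++;
            len -= 4 * ws;
        }
        while (len >= ws) {                          /* one word at a time */
            *ad++ = *as++;
            len -= ws;
        }
        dst = (char *)ad;
        src = (const char *)as;
    }
    while (len--)                                    /* residual bytes */
        *dst++ = *src++;
    return dst0;
}
```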
Title: Re: Block copy instructions
Post by: fanOfeeDIY on January 13, 2021, 10:14:39 am
There are probably many reasons for not having a special block copy instruction in RISC-V, but the easiest one I can imagine is that when the pipeline of a RISC-V implementation gets deeper, it will have more read-after-write (RAW) register stalls and become slower.

Code: [Select]
    lw t2, -4(a1)
    addi a2, a2, 4
    sw t2, -4(a2) <- will stall reading register t2 due to the read-after-write (RAW) hazard, if the implementation has more than 6 or 7 stages.

Even modern compilers for x86 do not generate the string copy instructions, even though they exist.
RISC-V is an open specification, so someone could build a highly pipelined implementation in the future, and RISC-V probably wants to stay neutral with respect to implementations.

I presented a similar topic at a RISC-V study meeting.
This is the link to the slides, page 7.
https://1drv.ms/b/s!AnLZDi2WpU1Ogx8bmfm5HvzBGOm-?e=kZ0f8S

I found a nice chart on pipeline hazards here.
https://stackoverflow.com/questions/37396533/pipeline-stall-after-a-load-instruction-but-not-after-an-add-instruction

This was the link to the event.
https://risc-v.connpass.com/event/194872/
Title: Re: Block copy instructions
Post by: westfw on January 13, 2021, 10:52:57 am
Hmm.  Does any implemented block copy instruction copy the data out-of-order to match up with the way memory and/or cache is actually implemented?

Title: Re: Block copy instructions
Post by: DiTBho on January 13, 2021, 11:05:06 am
Hmm.  Does any implemented block copy instruction copy the data out-of-order to match up with the way memory and/or cache is actually implemented?

What is the problem you are thinking about?


Title: Re: Block copy instructions
Post by: andersm on January 13, 2021, 01:06:10 pm
Drawback for the "'N' in a register" versions: they would require 3 source registers.
If you use the third register as a counter that is updated for each iteration, that would make the instruction interruptible without too much complication.

Edit: That kind of resembles the 68010 loop mode.
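A toy C model of that idea (the struct stands in for the three architectural registers; `max_steps` plays the role of an interrupt cutting the instruction short -- all names here are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* If the count register is decremented (and the pointers advanced) on
   every iteration, the instruction can be interrupted at any point and
   simply re-executed: all of its state is architectural. */
struct cpu_regs {
    uint32_t *dst;        /* destination pointer register */
    const uint32_t *src;  /* source pointer register */
    size_t n;             /* remaining word count register */
};

static void memcpy_insn_step(struct cpu_regs *r, size_t max_steps)
{
    while (r->n && max_steps--) {
        *r->dst++ = *r->src++;   /* one word per "cycle" */
        r->n--;                  /* counter updated in the register */
    }
}
```

Re-running the instruction after an "interrupt" just continues from the updated registers, which is exactly why this form is easy to make interruptible.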
Title: Re: Block copy instructions
Post by: DiTBho on January 13, 2021, 01:31:10 pm
That kind of resembles the 68010 loop mode.

m68k does it this way:

Code: [Select]
// Compare two blocks of memory for equality

     lea #Source,A0              // A0 points to source block
     lea #Destination,A1         // A1 points to destination block
     move.w #Count-1,D0          // Compare Count words
rpt:
     cmpm.w (A0)+,(A1)+          // Compare pair of words
     dbne D0,rpt                 // Repeat until all done

dbne: if the compare found the words equal, decrement D0 and branch back; the loop exits on a mismatch or when D0 wraps past zero (hence the Count-1 initialization)

In a RISC design like PowerPC, the equivalent of D0 is not in the general register file but in a dedicated counter register,
so the pipeline knows exactly what's happening without having to rely on branch prediction.

cmpm compares two words in memory for equality; the m68k ISA also has a similar complex move opcode:
Code: [Select]
     move.w (A0)+,(A1)+          // copy mem(p_src) into mem(p_trg)  and auto increment pointers

It was good for CISCs. It's terrible and adds no gain when implemented in a RISC --
except in the PowerPC case, which then pays with extra instructions and extra complexity to manage the hardware loop counter.

There is a reason why MIPS doesn't have anything similar  :D
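For readers not fluent in m68k, the cmpm.w/dbne loop above behaves like this C sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* C rendering of the m68k cmpm.w/dbne loop: compare `count` 16-bit words,
   stopping at the first pair that differs (dbne falls through when the
   compare sets "not equal") or when the counter runs out. */
static int blocks_equal_w(const uint16_t *a, const uint16_t *b, size_t count)
{
    while (count--)
        if (*a++ != *b++)
            return 0;   /* mismatch found */
    return 1;           /* all words equal */
}
```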
Title: Re: Block copy instructions
Post by: tggzzz on January 13, 2021, 02:04:22 pm
Just having this thought lately, about potential benefits (or lack thereof) of block copy instructions in a CPU.

Memory block copy is a *very* common operation in typical programs. Efficient block copy seems critical.

Instruction execution time is no longer the determinant of program execution time. Main memory latency and bandwidth and cache coherence protocols are the killer.

The best way to speed up memory copying was very well illustrated a couple of decades ago when people were optimising LAN protocol stacks up to the application program. The successful technique was, unsurprisingly, don't copy buffers - just pass the relevant pointers up the protocol stack.

That isn't too difficult in special cases such as network stacks, but correctly sharing buffers/references in complex large applications is inherently complex. The best starting point is a language that takes care of it for you, i.e. not C/C++.

The dynamics of such applications is one reason that the microbenchmarks beloved of the C/C++ community can be very misleading with large complex applications.
Title: Re: Block copy instructions
Post by: DiTBho on January 13, 2021, 02:11:22 pm
The best starting point is a language that takes care of it for you, i.e. not C/C++

Probably here Rust gains some points :D
Title: Re: Block copy instructions
Post by: techman-001 on January 13, 2021, 02:13:15 pm
The Z80 had special commands just for this purpose:

Block Transfer and Search

      Table 9 lists the extremely powerful block transfer instructions. These instructions operate
      with three registers.

      · HL points to the source location
      · DE points to the destination location
      · BC is a byte counter


After the programmer initializes these three registers, any of these four instructions can be used. The Load and Increment (LDI) instruction moves one byte from the location pointed to by HL to the location pointed to by DE. Register pairs HL and DE are then automatically incremented and are ready to point to the following locations. The byte counter (i.e., register pair BC) is also decremented at this time. This instruction is valuable when the blocks of data must be moved but other types of processing are required between each move. The Load, Increment and Repeat (LDIR) instruction is an extension of the LDI instruction. The same load and increment operation is repeated until the byte counter reaches the count of zero. As a result, this single instruction can move any block of data from one location to any other.

Because 16-bit registers are used, the size of the block can be up to 64KB long
(1KB = 1024 bytes) and can be moved from any location in memory to any other location.
Furthermore, the blocks can be overlapping because there are no constraints on the data
used in the three register pairs.

The LDD and LDDR instructions are similar to LDI and LDIR. The only difference is that
register pairs HL and DE are decremented after every move so that a block transfer starts
from the highest address of the designated block rather than the lowest.
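A behavioural C model of the two repeat forms (HL/DE/BC as described above; simplified in that a real Z80 treats BC = 0 as 65536 bytes):

```c
#include <stdint.h>

/* Z80 LDIR: repeat mem[DE++] = mem[HL++] until BC reaches zero.
   HL = source, DE = destination, BC = byte counter. */
static void z80_ldir(uint8_t *mem, uint16_t hl, uint16_t de, uint16_t bc)
{
    while (bc--)
        mem[de++] = mem[hl++];
}

/* Z80 LDDR: same, but pointers decrement, so the transfer starts from
   the highest address -- safe for overlapping moves toward higher
   addresses. */
static void z80_lddr(uint8_t *mem, uint16_t hl, uint16_t de, uint16_t bc)
{
    while (bc--)
        mem[de--] = mem[hl--];
}
```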

Table 10 specifies the op codes for the four block search instructions. The first, CPI (Compare and Increment) compares the data in the Accumulator with the contents of the memory location pointed to by Register HL. The result of the compare is stored in one of the flag bits and the HL register pair is then incremented and the byte counter (register pair BC) is decremented.

http://www.zilog.com/docs/z80/um0080.pdf
Title: Re: Block copy instructions
Post by: Nominal Animal on January 13, 2021, 03:02:37 pm
Just having this thought lately, about potential benefits (or lack thereof) of block copy instructions in a CPU.
I, too, have wondered and examined these.  I came to the conclusion that auto-increment loads, stores, and comparisons (if using a separate comparison status register, unlike e.g. RISC-V where the comparison operation is inherent in the conditional branch instruction) would make the biggest difference.  It does not even matter if the increments occur before or after the access, nor even if it differs between loads and stores.  If you explore the assembly needed to implement the most commonly used operations – memset(); memcpy_inc() and memcpy_dec() (implements memcpy(), memmove(), memrepeat()); strncmp(), strnlen(), memcmp(), etc.; whichever are most common for your workloads – you'll see that the address increment/decrement is the only part common to all of these, and the one that saves the most cycles per iteration if folded to the load/store/comparison operations.

Note that the autoincrement/autodecrement matching the access size is perfectly acceptable, and that larger than byte/char accesses may have to be aligned.  I personally would like to see both 8-bit and 32/64-bit (native register size) versions; everything else can be done in software.

bzero()/memset(,0,) is only worth accelerating for full aligned cachelines or pages, and then basically only if there is some trick you can do with the memory management to do the clearing efficiently.

There is a rare case of copying unaligned memory to aligned memory, or vice versa, in e.g. the network stack and binary file format handling.  If the architecture does not support unaligned memory accesses, having a special load operation that shifts the target register up by a byte as it loads the byte from the address specified by another register (and the inverse for a store), preferably with optional autoincrement/autodecrement, would help a lot.  That way, on a 32-bit architecture, such copies would take just 4 loads, 1 store, 1 conditional branch per 32-bit word (or 1 load, 4 stores, 1 conditional branch); but might be useful for other scatter/gather byte composition/extraction too.  Note that little-endian byte order uses autoincrement, but big-endian autodecrement.

I am sure you can see the difference these make for the most common cases – str*() and mem*() functions, but also for endian-specific unaligned accessor functions (that I intend my base library to provide by default as static inline functions) – but if not, I'd be happy to post some particular examples in pseudo-assembly.
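As one concrete illustration of the unaligned-copy technique (the common software approach on machines without unaligned loads, not necessarily the exact instruction proposed above), using only aligned word loads plus shifts, little-endian assumed:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy from an unaligned source to an aligned destination using only
   aligned word loads plus shifts. Little-endian byte order is assumed.
   Note: this reads the containing aligned words, so it may touch up to
   3 bytes outside [src, src + 4*nwords); the caller must ensure that
   overread is safe (e.g. the source sits inside a larger buffer). */
static void copy_unaligned_src(uint32_t *dst, const uint8_t *src,
                               size_t nwords)
{
    size_t off = (uintptr_t)src & 3;          /* misalignment in bytes */
    if (off == 0) {                           /* already aligned */
        for (size_t i = 0; i < nwords; i++)
            dst[i] = ((const uint32_t *)src)[i];
        return;
    }
    const uint32_t *asrc = (const uint32_t *)(src - off); /* align down */
    uint32_t lo = *asrc++;
    for (size_t i = 0; i < nwords; i++) {
        uint32_t hi = *asrc++;
        /* stitch each output word from two adjacent aligned words */
        dst[i] = (lo >> (8 * off)) | (hi << (8 * (4 - off)));
        lo = hi;
    }
}
```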
Title: Re: Block copy instructions
Post by: DiTBho on January 13, 2021, 03:24:18 pm
auto-increment loads, stores, and comparisons [..] would make the biggest difference

So basically you want these two, right?
Code: [Select]
load_postinc.size D_reg, (A_reg)   // load D_reg from mem[A_reg], and then increment A_reg+=size
store_postinc.size D_reg, (A_reg)  // store D_reg into mem[A_reg], and then increment A_reg+=size
size={ sizeof(u8), sizeof(u16), sizeof(u32)} = { 1, 2, 4} bytes
D_reg is the data register.
A_reg is the address register.

In a RISC design both D_reg and A_reg belong to the register file set.

MIPS doesn't have this kind of special load/store instructions with post-increment, but I think that's possible job and makes sense even for a pure RISC design.

But I wouldn't add anything more than this, otherwise the "RISC" will become a "CRISC" :D
Title: Re: Block copy instructions
Post by: SiliconWizard on January 13, 2021, 03:55:39 pm
Judging from the sheer amount of benchmarks, studies and available pieces of code for optimizing block copy, this is certainly NOT a solved issue IMHO, at least not in the general case.

And I actually found *very few* articles about block copy instructions. Very few. Most articles you can find are about optimizing block copy in assembly with instruction sets that don't have dedicated instructions. The exception, as said earlier, is mostly x86 processors, on which you can find various comparisons using either the block copy instructions (which are x86 legacy and are usually not particularly well optimized on modern x86 processors) or vectored operations, with or without loop unrolling, etc. And even where optimizations are possible with specific instruction scheduling, vector operations, etc., they require specific CPU features and are almost an art in themselves.

@bruce: taking the RISC-V example, loop unrolling (along with using the offset for load and store operations) will indeed make it almost as efficient as a dedicated controller. Drawbacks: it's not a general solution (unrolling depends entirely on the code), and it makes code size grow quickly. It's hard to beat 1 instruction when the equivalent would be anything from 8 to possibly tens of instructions (with loop unrolling). Code size is not just a size issue; it also fills up instruction caches unnecessarily.

Of course if code size is no issue, and/or you have a super-scalar CPU, and/or with vector operations, and/or huge caches, dedicated instructions/a dedicated controller are probably not useful. In other cases, I'm convinced at this point that it can be.

And implementing such instructions would definitely take far fewer resources than implementing a super-scalar processor and/or vector operations.

@andersm: a number of ways to work around the interruptibility issue and the 3 sources issue indeed. Not a big problem. As for interruptibility specifically, I'm thinking it could be interesting to have instructions that would optionally make the block copy operation either interruptible or not: for implementing exclusive access to memory at some point, a non-interruptible version would certainly be efficient IMO, compared to mutexes or the like.

For people who still think that's a useless feature, especially on a RISC CPU, I found this report:
"Hardware Acceleration for Memory to Memory Copies" - UCB/EECS-2017-2 - supervised by Randy Katz and Krste Asanovic themselves. An interesting reading.

Another read, less interesting but still dealing with the issue: AN12628 - "Optimizing Memory Copy Routines", NXP.

Finally, as to how frequent "block copy" operations are, I think many do not realize how much.
They are certainly not just coming from explicit memcpy()-like calls. Anytime data is moved around (variables, objects, ...), a "block copy" is actually issued. I have no "benchmarks" to give figures (and I'd be interested in seeing some), but I'm certain moving memory around is a very significant part of most programs. And even if your particular microarch *and* compiler allow copies as efficient as dedicated instructions/dedicated controllers, that would potentially save significant code memory. Anyway, I recommend reading the above UCB report.

Title: Re: Block copy instructions
Post by: SiliconWizard on January 13, 2021, 04:17:32 pm
(...)
But I wouldn't add anything more than this, otherwise the "RISC" will become a "CRISC" :D

This is one of the points I mentioned in my first post. But thinking about it, I'm not sure it's really a valid point.

Instead of calling such instructions "complex instructions", just call the block copy feature an "accelerator" - that happens to be invocable through single instructions, part of some IS extension - and suddenly that all becomes trendy, and not many will object.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 13, 2021, 05:17:37 pm
auto-increment loads, stores, and comparisons [..] would make the biggest difference
So basically you want these two, right?
Code: [Select]
load_postinc.size D_reg, A_reg   // load D_reg from mem[A_reg], and then increment A_reg+=size
store_postinc.size D_reg, A_reg  // store D_reg into mem[A_reg], and then increment A_reg+=size
size={1,2,4} as for {u8, u16, u32} I/O, D_reg, data register; A_reg, address register; in a RISC design Both D_reg and A_reg belong to the register file set.
I'd trade 16-bit for autodecrement, and I don't care if the increment is preincrement or postincrement, but yes.

Autodecrement is necessary for memmove() when dst > src within the same array, and very commonly used when inserting data into a sorted array.
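In C terms, the descending copy that autodecrement would accelerate looks like this (the overlapping-regions case of memmove() with dst > src):

```c
#include <stddef.h>

/* When dst > src and the regions overlap, a forward copy would clobber
   not-yet-read bytes, so the copy runs from the top down -- exactly the
   access pattern autodecrement load/store would fold into one
   instruction per byte/word. */
static void copy_backward(char *dst, const char *src, size_t n)
{
    dst += n;
    src += n;
    while (n--)
        *--dst = *--src;
}
```

Copying "abcd" two bytes up within the same buffer this way preserves the data, whereas a naive forward loop would duplicate the first two bytes over and over.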

single instructions, part of some IS extension
That's what I'd prefer, too.  Essentially, there are ten instruction forms that I'd like to see:
From the first four, either pair is acceptable, both forms have their uses.  I mean, for block memory functions, either keep-upper-bits-unchanged or zero-upper-bits is fine; just do not do sign extension.

I've always had a slight dislike for sign extending a byte to a full word, because it is so rarely needed.  In most code I've seen, it is actually a bug, where a single character from a string is cast to int, and the underlying char type happens to be signed.  It is certainly not needed in block memory functions.

For example, in C, <ctype.h> functions must not be used on char; an explicit cast to (unsigned char) is required, but most programmers forget/don't know/don't care.  That is, the following code is buggy:
Code: [Select]
static inline const char *skipspaces(const char *src) {
    if (src)
        while (isspace(*src))
            src++;
    return src;
}
whereas the correct version is
Code: [Select]
static inline const char *skipspaces(const char *src) {
    if (src)
        while (isspace((unsigned char)(*src)))
            src++;
    return src;
}
The only actual case for sign extension from byte to full register is when you have a variable of type signed char or int8_t, and you load it for integer manipulation.  So I do see a need for sign extension for a single byte load to a register; but not as part of memory block functions.
Title: Re: Block copy instructions
Post by: tggzzz on January 13, 2021, 07:03:55 pm

Finally, as to how frequent "block copy" operations are, I think many do not realize how much.
They are certainly not just coming from explicit memcpy()-like calls. Anytime data is moved around (variables, objects, ...), a "block copy" is actually issued. I have no "benchmarks" to give figures (and I'd be interested in seeing some), but I'm certain moving memory around is a very significant part of most programs. And even if your particular microarch *and* compiler allow copies as efficient as dedicated instructions/dedicated controllers, that would potentially save significant code memory. Anyway, I recommend reading the above UCB report.

Language semantics have a significant effect. Some languages mandate copying, others don't.

To some extent optimisation can reduce such overhead, if and only if there can be no aliasing (to the specific data) anywhere in the program.
Title: Re: Block copy instructions
Post by: SiliconWizard on January 13, 2021, 08:29:20 pm

Finally, as to how frequent "block copy" operations are, I think many do not realize how much.
They are certainly not just coming from explicit memcpy()-like calls. Anytime data is moved around (variables, objects, ...), a "block copy" is actually issued. I have no "benchmarks" to give figures (and I'd be interested in seeing some), but I'm certain moving memory around is a very significant part of most programs. And even if your particular microarch *and* compiler allow copies as efficient as dedicated instructions/dedicated controllers, that would potentially save significant code memory. Anyway, I recommend reading the above UCB report.

Language semantics have a significant effect. Some languages mandate copying, others don't.

To some extent optimisation can reduce such overhead, if and only if there can be no aliasing (to the specific data) anywhere in the program.

True, although even so, data copy is IMO still significant. What language do you have in mind?

I'd like to find or write a tool that allows identifying data copy at runtime and yield statistics such as min, max, average block size and total amount of copied data, so we get real figures instead of just guessing.
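A low-tech starting point for such a tool, as a sketch: route copies through a wrapper that records statistics. In a real build you would interpose on memcpy() itself (e.g. with GNU ld's --wrap option or an LD_PRELOAD shim) rather than call a differently-named function; the names here are hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Copy profiler sketch: records call count, total bytes, and min/max
   block size for every copy routed through it. */
static struct {
    size_t calls, total, min, max;
} copy_stats = {0, 0, (size_t)-1, 0};

static void *counted_memcpy(void *dst, const void *src, size_t n)
{
    copy_stats.calls++;
    copy_stats.total += n;
    if (n < copy_stats.min) copy_stats.min = n;
    if (n > copy_stats.max) copy_stats.max = n;
    return memcpy(dst, src, n);
}
```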
Title: Re: Block copy instructions
Post by: tggzzz on January 13, 2021, 08:57:34 pm

Finally, as to how frequent "block copy" operations are, I think many do not realize how much.
They are certainly not just coming from explicit memcpy()-like calls. Anytime data is moved around (variables, objects, ...), a "block copy" is actually issued. I have no "benchmarks" to give figures (and I'd be interested in seeing some), but I'm certain moving memory around is a very significant part of most programs. And even if your particular microarch *and* compiler allow copies as efficient as dedicated instructions/dedicated controllers, that would potentially save significant code memory. Anyway, I recommend reading the above UCB report.

Language semantics have a significant effect. Some languages mandate copying, others don't.

To some extent optimisation can reduce such overhead, if and only if there can be no aliasing (to the specific data) anywhere in the program.

True, although even so, data copy is IMO still significant. What language do you have in mind?

C/C++ mandate copying of structs/objects passed as parameters to function calls. Many other languages copy a reference to the struct/object, but that can have its own disadvantages.

Rust has an intriguing approach to ameliorating the problems, but I haven't "kicked the tyres".

Quote
I'd like to find or write a tool that allows identifying data copy at runtime and yield statistics such as min, max, average block size and total amount of copied data, so we get real figures instead of just guessing.

Don't forget to also measure the size of the working set vs cache size, and cache-line alignment.
Title: Re: Block copy instructions
Post by: brucehoult on January 14, 2021, 01:49:16 am
C/C++ mandate copying of structs/objects passed as parameters to function calls. Many other languages copy a reference to the struct/object, but that can have its own disadvantages.

Yes, but programmers who care about performance never do that for large objects/structs.
Title: Re: Block copy instructions
Post by: RenThraysk on January 14, 2021, 02:37:00 am
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.

The general solution if copying is the performance bottleneck, is to completely eliminate the need to copy. AKA Zero-copy.

Most modern OS's have zero copy APIs for things like files and networking.

Title: Re: Block copy instructions
Post by: brucehoult on January 14, 2021, 03:27:05 am
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.

On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.

OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.

A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.
Title: Re: Block copy instructions
Post by: tggzzz on January 14, 2021, 10:18:30 am
C/C++ mandate copying of structs/objects as parameters to function calls. Many other languages copy a reference to the struct/object, but that can have its own disadvantages.

Yes, but programmers who care about performance never do that for large objects/structs.

Yes, they accept the difficulties and problems (in C/C++) with the alternatives.

But that becomes much more problematic where code from multiple people, projects, organisations and companies has to be combined.

In such cases language semantics can and does make a big difference, especially with the vast majority of programmers.

Title: Re: Block copy instructions
Post by: DiTBho on January 14, 2021, 10:34:27 am
[..] zero copy APIs [..]

That's what is written in the first chapter of "Optimizing MIPS code" Book ;D

Title: Re: Block copy instructions
Post by: tggzzz on January 14, 2021, 11:09:39 am
The general solution if copying is the performance bottleneck, is to completely eliminate the need to copy. AKA Zero-copy.

Most modern OS's have zero copy APIs for things like files and networking.

That solution can easily bring its own problems, e.g. aliasing (and not being able to prove the absence of aliasing), ownership, memory reclamation.

Language support helps, and its absence hinders optimisation and correctness.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 14, 2021, 11:15:49 am
You cannot avoid memory copies completely.

Consider I/O in a general purpose OS with paged virtual memory, where recently used files are cached in otherwise unused ("free") RAM.  Standard I/O involves a minimum of one copy from the page cache to the userspace process buffer; typically two, with the second being within the userspace, copying from its own file buffer to the target.  Memory-mapped I/O can have significant cost in (kernel) metadata structures and setup and teardown costs, and is not possible when the source is a pipe, socket, character device, or any other non-seekable non-file-like object.

(Async I/O directly to userspace buffers would be nice, but its implementation is definitely nontrivial.  The security aspects alone are very hairy: if DMA is used, the target pages need to be locked, and that opens up the risk of denial-of-service.  Per-page DMA and locking may have too much overhead.  And so on.  And even if you do have async I/O, to have true zero-copy I/O, you need functions that access the data directly in the userspace buffer – the interface needed differs a lot from that of C string functions! – and the way they handle encountering the end of the buffer is critical: they should return inconclusive, so that the buffer can be extended or remapped.  Very, very nontrivial.)

(Personally, I have had to fight several "experts" who claim async I/O, for example MPI async I/O, is "inherently prone to deadlock", simply because they cannot wrap their mind around how it works.  I've basically given up trying to push for true async I/O because of how few actually understand and can use it correctly, and instead nowadays use a thread to handle blocking or non-blocking I/O in an asynchronous manner.)

In a multithreaded system, memory copies are essential for efficient operation (RCU (https://en.wikipedia.org/wiki/Read-copy-update)).  In particular, if you have a state object protected by a mutex or a rwlock, you always want to copy the object to local storage for general examination and processing, so that the lock duration can be kept to minimum.

Dense cache-effective data structures – arrays and (binary) heaps in particular – require memory copy operations during insertion and deletion.  Using pointers not only grows the per-item size (often significantly), but also tends to reduce data locality (because consecutive items are not necessarily near each other in memory), which tends to reduce the effectiveness of caching schemes.  A particularly well known example is naïve matrix multiplication on x86/x86-64, where doing an (in-place) transpose to one of the matrices to make the inner loop access memory consecutively gives a significant efficiency boost.

Finally, I personally do not wish to limit this to block copy operations, as block comparison and block examination (zero search in particular) are very similar, and when used, tend to be heavily used (a significant fraction of the CPU time used by a process).
Title: Re: Block copy instructions
Post by: RenThraysk on January 14, 2021, 11:35:15 am
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.

On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.

OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.

A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.

Yes, that's why I said above a certain size. You have to be doing a silly amount of small copies to even care about the cost.
Considerably more loop overhead can be eliminated at some point too with Duff's device.

Title: Re: Block copy instructions
Post by: RenThraysk on January 14, 2021, 11:41:04 am
The general solution if copying is the performance bottleneck, is to completely eliminate the need to copy. AKA Zero-copy.

Most modern OS's have zero copy APIs for things like files and networking.

That solution can easily bring its own problems, e.g. aliasing (and not being able to prove the absence of aliasing), ownership, memory reclamation.

Language support helps, and its absence hinders optimisation and correctness.

But you have to have identified that you have a problem with copy performance before you'd even consider attempting to change strategy to a zero copy one.
You certainly shouldn't attempt a zero copy implementation until you have written the simpler copying version first, and benchmarked it to determine that the copy operations are the performance issue.
It's highly unlikely they will be.

Premature optimisation is the root of all evil, and all that.

Or as Dave would put it, a trap for young players.
Title: Re: Block copy instructions
Post by: tggzzz on January 14, 2021, 02:30:22 pm
You cannot avoid memory copies completely.

Of course.

But I was considering the usual copy-by-accident because people don't understand what the (magic) compiler does. Yes, that is not uncommon :(

Quote
Consider I/O in a general purpose OS with paged virtual memory, where recently used files are cached in otherwise unused ("free") RAM.  Standard I/O involves a minimum of one copy from the page cache to the userspace process buffer; typically two, with the second being within the userspace, copying from its own file buffer to the target.  Memory-mapped I/O can have significant cost in (kernel) metadata structures and setup and teardown costs, and is not possible when the source is a pipe, socket, character device, or any other non-seekable non-file-like object.

ISTR that HP significantly reduced the copies needed during network IO, but that was 25 years ago and I wasn't involved. ISTR one trick was to combine memory traversal for CRC computation with moving a buffer.

Quote
In a multithreaded system, memory copies are essential for efficient operation (RCU (https://en.wikipedia.org/wiki/Read-copy-update)).  In particular, if you have a state object protected by a mutex or a rwlock, you always want to copy the object to local storage for general examination and processing, so that the lock duration can be kept to minimum.

Are you sure of that? Have you looked at how modern HotSpot (Java JVM) implementations work?

Quote
Dense cache-effective data structures – arrays and (binary) heaps in particular – require memory copy operations during insertion and deletion.  Using pointers not only grows the per-item size (often significantly), but also tends to reduce data locality (because consecutive items are not necessarily near each other in memory), which tends to reduce the effectiveness of caching schemes.  A particularly well known example is naïve matrix multiplication on x86/x86-64, where doing an (in-place) transpose to one of the matrices to make the inner loop access memory consecutively gives a significant efficiency boost.

Pointers aren't a panacea - far from it!
Title: Re: Block copy instructions
Post by: radiogeek381 on January 14, 2021, 04:03:41 pm
The VAX instruction set had at least two block copy instructions: MOVC3, MOVC5.  (Depending on how you look at it, add MOVTC and MOVTUC.)

The block move instructions were *all* interruptible, since the transfer length could be much larger than a page. There had to be a way of handling a page fault that had to go to disk. There were also real-time considerations, as a 64KB+ block could take seconds to complete on a 730, for instance.

MOVC3 and MOVC5 even dealt with *overlapping* source and destination.  Microcode adjusted the order of transfer (front to back, or back to front) in order to get the "right" result.

MOVC5 would fill the destination buffer with a specified character if the source buffer was smaller than the destination buffer.

The interruptibility of the instructions gave rise to quite a few microcoding issues, as the intermediate state was saved in general purpose registers.  Fiddling with the return stack before resuming a MOVCx could produce surprise results.  However, the VAX architecture commitment mandated that it could not cause a system hang.

These instructions were very important in the paging system and, believe it or not, to the BACKUP program.   And it was faster than the "equivalent" series of MOV (Rx)+,(Ry)+,  and AOBLEQ.
Title: Re: Block copy instructions
Post by: RenThraysk on January 14, 2021, 04:08:55 pm
Pointers aren't a panacea - far from it!

Yeah, for example garbage collected languages. If your struct/data is small enough it can be better to copy it around, rather than use pointers and increase pressure on the gc.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 14, 2021, 04:22:40 pm
Quote
In a multithreaded system, memory copies are essential for efficient operation (RCU (https://en.wikipedia.org/wiki/Read-copy-update)).  In particular, if you have a state object protected by a mutex or a rwlock, you always want to copy the object to local storage for general examination and processing, so that the lock duration can be kept to minimum.
Are you sure of that?
I should have phrased that better, now that you point it out!  :-DD

It is true for RCU objects, and these are rather common on the OS kernel side.  I did not mean all state objects protected by a mutex or rwlock need to be copied.  (If you've looked at any example code I've supplied, you'll see that when using mutexes with normal state objects, I don't do copies.)

A typical example would be a set of resource counters for some process or thread.  While e.g. the calculations needed by a scheduler to apply priorities etc. to decide the next process to be run are simple, they are non-negligible, and instead of holding the lock for the duration of the calculation, you only hold the lock for the duration of copying the data. The higher the contention on the state object, the more important it is to minimize the lock hold time, so that you do not end up serializing the operations done using the information in the state object.

Pointers aren't a panacea - far from it!
Exactly.  Many cache-efficient algorithms do need efficient copy operations.
Title: Re: Block copy instructions
Post by: SiliconWizard on January 14, 2021, 05:16:39 pm
The VAX instruction set had at least two block copy instructions: MOVC3, MOVC5.  (Depending on how you look at it, add MOVTC and MOVTUC.)

The block move instructions were *all* interruptible, since the transfer length could be much larger than a page. There had to be a way of handling a page fault that had to go to disk. There were also real-time considerations, as a 64KB+ block could take seconds to complete on a 730, for instance.

MOVC3 and MOVC5 even dealt with *overlapping* source and destination.  Microcode adjusted the order of transfer (front to back, or back to front) in order to get the "right" result.

MOVC5 would fill the destination buffer with a specified character if the source buffer was smaller than the destination buffer.

The interruptibility of the instructions gave rise to quite a few microcoding issues, as the intermediate state was saved in general purpose registers.  Fiddling with the return stack before resuming a MOVCx could produce surprise results.  However, the VAX architecture commitment mandated that it could not cause a system hang.

These instructions were very important in the paging system and, believe it or not, to the BACKUP program.   And it was faster than the "equivalent" series of MOV (Rx)+,(Ry)+,  and AOBLEQ.

Thanks for this piece of history. I admit I know next to nothing about VAX.
Title: Re: Block copy instructions
Post by: SiliconWizard on January 14, 2021, 05:46:52 pm
Exactly.  Many cache-efficient algorithms do need efficient copy operations.

Absolutely!

Now there is one obvious reason for copying: keeping the original "value" of a given "object". This is a pretty common need.
Doing it when it's not required is wasteful, and some languages are better than others at helping you avoid that, but there are still many, many cases where copying is done for a good reason.

As I said, I'd really like to find (which I have little hope about) or write a tool that could analyze code dynamically (at run time) and extract stats about copy operations. My experience and intuition tell me that they are a significant part of a large range of programs, and since that's just intuition and not facts, a tool would definitely help figure that out. It could also help optimize code to limit copy operations when they are not strictly needed, or even help devise programming languages that are more efficient in that regard. Of course, it would also help determine whether block copy instructions would have a real benefit or not.

Title: Re: Block copy instructions
Post by: Nominal Animal on January 14, 2021, 06:31:15 pm
As I said, I'd really like to find (which I have little hope about) or write a tool that could analyze code dynamically (at run time) and extract stats about copy operations.
While
    ltrace -o logfile -e memcpy+memmove+memset command
will log all memcpy(), memmove(), and memset() calls to logfile, I suspect one would need to instrument GCC itself to find out.

For example, using GCC 7.5.0 on x86-64 Linux,
Code: [Select]
struct foo {
    double  d;
    float   f;
    long    o;
    int     i;
    char    c;
};

void copy(struct foo *dst, struct foo *src)
{
    *dst = *src;
}
using -O2 will generate an open-coded copy, not a call to memcpy()/memmove().

If you want to see something really funny, create an empty empty.py and run 'ltrace -o log -e memset+memcpy+memmove python3 empty.py'.
On my machine, that yields a log of 57,946 calls to those three functions alone...
Title: Re: Block copy instructions
Post by: brucehoult on January 14, 2021, 09:57:54 pm
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.

On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.

OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.

A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.

Yes, that's why I said above a certain size. You have to be doing a silly amount of small copies to even care about the cost.
Considerably more loop overhead can be eliminated at some point too with a duff device.

Duff is just a way to do unrolling without having a leftover part at the end by jumping into the middle of the loop the first time. It only works if the pointers are actually updated after every unit copied, not using offset addressing with one big pointer update at the end of the loop -- so it doesn't apply to typical RISC code.

If the unroll factor is small (four, say) then it's questionable whether Duff is even worth it at all, as it (depending on the ISA) often needs a few instructions and maybe a memory reference to calculate the address of the instruction to jump to.
Title: Re: Block copy instructions
Post by: brucehoult on January 14, 2021, 10:08:58 pm
The VAX instruction set had at least two block copy instructions: MOVC3, MOVC5.  (Depending on how you look at it, add MOVTC and MOVTUC.)

:

These instructions were very important in the paging system and, believe it or not, to the BACKUP program.   And it was faster than the "equivalent" series of MOV (Rx)+,(Ry)+,  and AOBLEQ.

Are you sure? Maybe on some models.

Dave Patterson's actual measurements of the slowness of VAX 780 instructions such as MOVC3, CALLS/CALLG and the more complicated addressing modes vs doing the same thing using simple instructions was what led him to formulate RISC in the first place.

If you're doing a loop with MOVL (Rx)+,(Ry)+ and AOBLEQ then those may already be too complicated instructions that invoke a slow microcode path.

It's been notorious that throughout the last 42 years REP MOVSB on x86 has been slower than a well written explicit loop on (almost?) every Intel and AMD model.
Title: Re: Block copy instructions
Post by: tggzzz on January 14, 2021, 10:25:32 pm
Pointers aren't a panacea - far from it!

Yeah, for example garbage collected languages. If your struct/data is small enough it can be better to copy it around, rather than use pointers and increase pressure on the gc.

Very few people from the C/C++ world (or those who only dimly remember reference counting from university lectures) realise how effective modern garbage collectors are - provided there are appropriate language semantics to support them.

Too many such people think they can manage memory more efficiently manually. (I can manage memory arbitrarily efficiently if I don't have to guarantee correctness :) )
Title: Re: Block copy instructions
Post by: RenThraysk on January 15, 2021, 12:00:34 am
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.

On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.

OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.

A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.

Yes, that's why I said above a certain size. You have to be doing a silly amount of small copies to even care about the cost.
Considerably more loop overhead can be eliminated at some point too with a duff device.

Duff is just a way to do unrolling without having a leftover part at the end by jumping into the middle of the loop the first time. It only works if the pointers are actually updated after every unit copied, not using offset addressing with one big pointer update at the end of the loop -- so it doesn't apply to typical RISC code.

Yes, they use a computed goto/jump.
Duff's devices work on MIPS/MIPS64 and ARM/ARM64. So not sure what "doesn't apply to typical RISC code" means.
Title: Re: Block copy instructions
Post by: brucehoult on January 15, 2021, 12:45:29 am
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.

On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.

OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.

A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.

Yes, that's why I said above a certain size. You have to be doing a silly amount of small copies to even care about the cost.
Considerably more loop overhead can be eliminated at some point too with a duff device.

Duff is just a way to do unrolling without having a leftover part at the end by jumping into the middle of the loop the first time. It only works if the pointers are actually updated after every unit copied, not using offset addressing with one big pointer update at the end of the loop -- so it doesn't apply to typical RISC code.

Yes, they use a computed goto/jump.
Duff's devices work on MIPS/MIPS64 and ARM/ARM64. So not sure what "doesn't apply to typical RISC code" means.

Let's check.

Code: [Select]
#include <inttypes.h>

typedef uint64_t word;

void foo(word *to, word *from, int count) {
  int n = (count + 7) / 8;
  switch (count % 8) {
  case 0: do { *to++ = *from++;
  case 7:      *to++ = *from++;
  case 6:      *to++ = *from++;
  case 5:      *to++ = *from++;
  case 4:      *to++ = *from++;
  case 3:      *to++ = *from++;
  case 2:      *to++ = *from++;
  case 1:      *to++ = *from++;
    } while (--n > 0);
  }
}


void bar(word *to, word *from, int count) {
  while (count >= 8) {
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    count -= 8;
  }
  while (count > 0) {
    *to++ = *from++;
    count -= 1;
  }
}


Compiled for arm64 with gcc -O2:

Code: [Select]
0000000000000000 <foo>:
   0:   31001c44        adds    w4, w2, #0x7
   4:   11003843        add     w3, w2, #0xe
   8:   1a844063        csel    w3, w3, w4, mi  // mi = first
   c:   6b0203e4        negs    w4, w2
  10:   12000884        and     w4, w4, #0x7
  14:   12000842        and     w2, w2, #0x7
  18:   5a844442        csneg   w2, w2, w4, mi  // mi = first
  1c:   13037c63        asr     w3, w3, #3
  20:   71000c5f        cmp     w2, #0x3
  24:   54000740        b.eq    10c <foo+0x10c>  // b.none
  28:   5400060d        b.le    e8 <foo+0xe8>
  2c:   7100145f        cmp     w2, #0x5
  30:   540006a0        b.eq    104 <foo+0x104>  // b.none
  34:   5400022b        b.lt    78 <foo+0x78>  // b.tstop
  38:   7100185f        cmp     w2, #0x6
  3c:   540000e0        b.eq    58 <foo+0x58>  // b.none
  40:   71001c5f        cmp     w2, #0x7
  44:   540005a1        b.ne    f8 <foo+0xf8>  // b.any
  48:   f9400022        ldr     x2, [x1]
  4c:   91002000        add     x0, x0, #0x8
  50:   f81f8002        stur    x2, [x0, #-8]
  54:   91002021        add     x1, x1, #0x8
  58:   f9400024        ldr     x4, [x1]
  5c:   91002002        add     x2, x0, #0x8
  60:   91002021        add     x1, x1, #0x8
  64:   f9000004        str     x4, [x0]
  68:   f9400024        ldr     x4, [x1]
  6c:   91002040        add     x0, x2, #0x8
  70:   91002021        add     x1, x1, #0x8
  74:   f9000044        str     x4, [x2]
  78:   f9400022        ldr     x2, [x1]
  7c:   91002004        add     x4, x0, #0x8
  80:   91002021        add     x1, x1, #0x8
  84:   f9000002        str     x2, [x0]
  88:   f9400020        ldr     x0, [x1]
  8c:   91002082        add     x2, x4, #0x8
  90:   91002021        add     x1, x1, #0x8
  94:   f9000080        str     x0, [x4]
  98:   f9400024        ldr     x4, [x1]
  9c:   91002040        add     x0, x2, #0x8
  a0:   91002021        add     x1, x1, #0x8
  a4:   f9000044        str     x4, [x2]
  a8:   f9400022        ldr     x2, [x1]
  ac:   51000463        sub     w3, w3, #0x1
  b0:   f9000002        str     x2, [x0]
  b4:   7100007f        cmp     w3, #0x0
  b8:   5400020d        b.le    f8 <foo+0xf8>
  bc:   91002021        add     x1, x1, #0x8
  c0:   91002000        add     x0, x0, #0x8
  c4:   f9400022        ldr     x2, [x1]
  c8:   91002021        add     x1, x1, #0x8
  cc:   f9000002        str     x2, [x0]
  d0:   91002000        add     x0, x0, #0x8
  d4:   f9400022        ldr     x2, [x1]
  d8:   91002021        add     x1, x1, #0x8
  dc:   f9000002        str     x2, [x0]
  e0:   91002000        add     x0, x0, #0x8
  e4:   17ffffdd        b       58 <foo+0x58>
  e8:   7100045f        cmp     w2, #0x1
  ec:   54fffde0        b.eq    a8 <foo+0xa8>  // b.none
  f0:   5400006c        b.gt    fc <foo+0xfc>
  f4:   34fffe82        cbz     w2, c4 <foo+0xc4>
  f8:   d65f03c0        ret
  fc:   aa0003e2        mov     x2, x0
 100:   17ffffe6        b       98 <foo+0x98>
 104:   aa0003e2        mov     x2, x0
 108:   17ffffd8        b       68 <foo+0x68>
 10c:   aa0003e4        mov     x4, x0
 110:   17ffffde        b       88 <foo+0x88>
 114:   d503201f        nop

0000000000000118 <bar>:
 118:   71001c5f        cmp     w2, #0x7
 11c:   5400052d        b.le    1c0 <bar+0xa8>
 120:   51002042        sub     w2, w2, #0x8
 124:   aa0103e3        mov     x3, x1
 128:   53037c47        lsr     w7, w2, #3
 12c:   d37a70e6        ubfiz   x6, x7, #6, #29
 130:   910100c6        add     x6, x6, #0x40
 134:   8b060004        add     x4, x0, x6
 138:   f9400065        ldr     x5, [x3]
 13c:   91010063        add     x3, x3, #0x40
 140:   f9000005        str     x5, [x0]
 144:   91010000        add     x0, x0, #0x40
 148:   eb00009f        cmp     x4, x0
 14c:   f85c8065        ldur    x5, [x3, #-56]
 150:   f81c8005        stur    x5, [x0, #-56]
 154:   f85d0065        ldur    x5, [x3, #-48]
 158:   f81d0005        stur    x5, [x0, #-48]
 15c:   f85d8065        ldur    x5, [x3, #-40]
 160:   f81d8005        stur    x5, [x0, #-40]
 164:   f85e0065        ldur    x5, [x3, #-32]
 168:   f81e0005        stur    x5, [x0, #-32]
 16c:   f85e8065        ldur    x5, [x3, #-24]
 170:   f81e8005        stur    x5, [x0, #-24]
 174:   f85f0065        ldur    x5, [x3, #-16]
 178:   f81f0005        stur    x5, [x0, #-16]
 17c:   f85f8065        ldur    x5, [x3, #-8]
 180:   f81f8005        stur    x5, [x0, #-8]
 184:   54fffda1        b.ne    138 <bar+0x20>  // b.any
 188:   8b060021        add     x1, x1, x6
 18c:   4b070c42        sub     w2, w2, w7, lsl #3
 190:   7100005f        cmp     w2, #0x0
 194:   5400014d        b.le    1bc <bar+0xa4>
 198:   51000442        sub     w2, w2, #0x1
 19c:   d2800000        mov     x0, #0x0                        // #0
 1a0:   91000442        add     x2, x2, #0x1
 1a4:   d503201f        nop
 1a8:   f8607823        ldr     x3, [x1, x0, lsl #3]
 1ac:   f8207883        str     x3, [x4, x0, lsl #3]
 1b0:   91000400        add     x0, x0, #0x1
 1b4:   eb02001f        cmp     x0, x2
 1b8:   54ffff81        b.ne    1a8 <bar+0x90>  // b.any
 1bc:   d65f03c0        ret
 1c0:   aa0003e4        mov     x4, x0
 1c4:   17fffff3        b       190 <bar+0x78>


Let's try MIPS with gcc -O2 -march=mips64

Code: [Select]
00000000 <foo>:
   0:   24c30007        addiu   v1,a2,7
   4:   04610002        bgez    v1,10 <foo+0x10>
   8:   3c028000        lui     v0,0x8000
   c:   24c3000e        addiu   v1,a2,14
  10:   24420007        addiu   v0,v0,7
  14:   00c23024        and     a2,a2,v0
  18:   04c10005        bgez    a2,30 <foo+0x30>
  1c:   000340c3        sra     t0,v1,0x3
  20:   24c6ffff        addiu   a2,a2,-1
  24:   2402fff8        li      v0,-8
  28:   00c23025        or      a2,a2,v0
  2c:   24c60001        addiu   a2,a2,1
  30:   2cc20008        sltiu   v0,a2,8
  34:   10400040        beqz    v0,138 <foo+0x138>
  38:   3c020000        lui     v0,0x0
  3c:   24420000        addiu   v0,v0,0
  40:   00063080        sll     a2,a2,0x2
  44:   00463021        addu    a2,v0,a2
  48:   8cc20000        lw      v0,0(a2)
  4c:   00400008        jr      v0
  50:   00000000        nop
  54:   8ca30004        lw      v1,4(a1)
  58:   8ca20000        lw      v0,0(a1)
  5c:   24840008        addiu   a0,a0,8
  60:   ac83fffc        sw      v1,-4(a0)
  64:   ac82fff8        sw      v0,-8(a0)
  68:   24a50008        addiu   a1,a1,8
  6c:   8ca70004        lw      a3,4(a1)
  70:   8ca60000        lw      a2,0(a1)
  74:   24820008        addiu   v0,a0,8
  78:   24a50008        addiu   a1,a1,8
  7c:   ac870004        sw      a3,4(a0)
  80:   ac860000        sw      a2,0(a0)
  84:   8ca70004        lw      a3,4(a1)
  88:   8ca60000        lw      a2,0(a1)
  8c:   24440008        addiu   a0,v0,8
  90:   24a50008        addiu   a1,a1,8
  94:   ac470004        sw      a3,4(v0)
  98:   ac460000        sw      a2,0(v0)
  9c:   8ca30004        lw      v1,4(a1)
  a0:   8ca20000        lw      v0,0(a1)
  a4:   24860008        addiu   a2,a0,8
  a8:   24a50008        addiu   a1,a1,8
  ac:   ac830004        sw      v1,4(a0)
  b0:   ac820000        sw      v0,0(a0)
  b4:   8cab0004        lw      t3,4(a1)
  b8:   8caa0000        lw      t2,0(a1)
  bc:   24c20008        addiu   v0,a2,8
  c0:   24a50008        addiu   a1,a1,8
  c4:   accb0004        sw      t3,4(a2)
  c8:   acca0000        sw      t2,0(a2)
  cc:   8ca70004        lw      a3,4(a1)
  d0:   8ca60000        lw      a2,0(a1)
  d4:   24440008        addiu   a0,v0,8
  d8:   24a50008        addiu   a1,a1,8
  dc:   ac470004        sw      a3,4(v0)
  e0:   ac460000        sw      a2,0(v0)
  e4:   8ca30004        lw      v1,4(a1)
  e8:   8ca20000        lw      v0,0(a1)
  ec:   2508ffff        addiu   t0,t0,-1
  f0:   ac830004        sw      v1,4(a0)
  f4:   19000010        blez    t0,138 <foo+0x138>
  f8:   ac820000        sw      v0,0(a0)
  fc:   24a50008        addiu   a1,a1,8
 100:   24840008        addiu   a0,a0,8
 104:   8ca30004        lw      v1,4(a1)
 108:   8ca20000        lw      v0,0(a1)
 10c:   ac830004        sw      v1,4(a0)
 110:   ac820000        sw      v0,0(a0)
 114:   24a50008        addiu   a1,a1,8
 118:   8ca30004        lw      v1,4(a1)
 11c:   8ca20000        lw      v0,0(a1)
 120:   24840008        addiu   a0,a0,8
 124:   ac830004        sw      v1,4(a0)
 128:   ac820000        sw      v0,0(a0)
 12c:   24a50008        addiu   a1,a1,8
 130:   1000ffce        b       6c <foo+0x6c>
 134:   24840008        addiu   a0,a0,8
 138:   03e00008        jr      ra
 13c:   00000000        nop
 140:   1000ffd0        b       84 <foo+0x84>
 144:   00801025        move    v0,a0
 148:   1000ffe0        b       cc <foo+0xcc>
 14c:   00801025        move    v0,a0
 150:   1000ffd8        b       b4 <foo+0xb4>
 154:   00803025        move    a2,a0

00000158 <bar>:
 158:   28c20008        slti    v0,a2,8
 15c:   14400030        bnez    v0,220 <bar+0xc8>
 160:   00a01825        move    v1,a1
 164:   00c03825        move    a3,a2
 168:   00801025        move    v0,a0
 16c:   8c690004        lw      t1,4(v1)
 170:   8c680000        lw      t0,0(v1)
 174:   ac490004        sw      t1,4(v0)
 178:   ac480000        sw      t0,0(v0)
 17c:   24630040        addiu   v1,v1,64
 180:   8c69ffcc        lw      t1,-52(v1)
 184:   8c68ffc8        lw      t0,-56(v1)
 188:   ac49000c        sw      t1,12(v0)
 18c:   ac480008        sw      t0,8(v0)
 190:   8c69ffd4        lw      t1,-44(v1)
 194:   8c68ffd0        lw      t0,-48(v1)
 198:   24420040        addiu   v0,v0,64
 19c:   ac49ffd4        sw      t1,-44(v0)
 1a0:   ac48ffd0        sw      t0,-48(v0)
 1a4:   8c69ffdc        lw      t1,-36(v1)
 1a8:   8c68ffd8        lw      t0,-40(v1)
 1ac:   ac49ffdc        sw      t1,-36(v0)
 1b0:   ac48ffd8        sw      t0,-40(v0)
 1b4:   8c69ffe4        lw      t1,-28(v1)
 1b8:   8c68ffe0        lw      t0,-32(v1)
 1bc:   ac49ffe4        sw      t1,-28(v0)
 1c0:   ac48ffe0        sw      t0,-32(v0)
 1c4:   8c69ffec        lw      t1,-20(v1)
 1c8:   8c68ffe8        lw      t0,-24(v1)
 1cc:   ac49ffec        sw      t1,-20(v0)
 1d0:   ac48ffe8        sw      t0,-24(v0)
 1d4:   8c69fff4        lw      t1,-12(v1)
 1d8:   8c68fff0        lw      t0,-16(v1)
 1dc:   ac49fff4        sw      t1,-12(v0)
 1e0:   ac48fff0        sw      t0,-16(v0)
 1e4:   8c69fffc        lw      t1,-4(v1)
 1e8:   8c68fff8        lw      t0,-8(v1)
 1ec:   24e7fff8        addiu   a3,a3,-8
 1f0:   28ea0008        slti    t2,a3,8
 1f4:   ac49fffc        sw      t1,-4(v0)
 1f8:   1140ffdc        beqz    t2,16c <bar+0x14>
 1fc:   ac48fff8        sw      t0,-8(v0)
 200:   24c6fff8        addiu   a2,a2,-8
 204:   000610c2        srl     v0,a2,0x3
 208:   24430001        addiu   v1,v0,1
 20c:   00031980        sll     v1,v1,0x6
 210:   000210c0        sll     v0,v0,0x3
 214:   00832021        addu    a0,a0,v1
 218:   00a32821        addu    a1,a1,v1
 21c:   00c23023        subu    a2,a2,v0
 220:   18c00009        blez    a2,248 <bar+0xf0>
 224:   000630c0        sll     a2,a2,0x3
 228:   00863021        addu    a2,a0,a2
 22c:   24a50008        addiu   a1,a1,8
 230:   8ca3fffc        lw      v1,-4(a1)
 234:   8ca2fff8        lw      v0,-8(a1)
 238:   24840008        addiu   a0,a0,8
 23c:   ac83fffc        sw      v1,-4(a0)
 240:   1486fffa        bne     a0,a2,22c <bar+0xd4>
 244:   ac82fff8        sw      v0,-8(a0)
 248:   03e00008        jr      ra
 24c:   00000000        nop


Any questions?

In both cases, the 2nd version, without Duff, is going to be much preferable.

Title: Re: Block copy instructions
Post by: RenThraysk on January 15, 2021, 01:25:00 am
m68k has movem, for block copies above a certain size. Could copy something like 48 bytes in 2 instructions.

On the actual M68000 that is 18 cycles per 4 bytes copied, plus 20 cycles overhead. You can't use all 16 registers (64 bytes), obviously, because you need some for the src and dst pointers, minimum. And whatever registers you do use, you probably have to save them somewhere at the start, and reload them later.

OK, let's say 12 registers. So 12*18+20 = 236 cycles to save them at the start and restore them after. And then 236 cycles plus some loop overhead (dbnz, say, 10 cycles) for each 48 bytes copied.

A simple MOVE.L (An)+,(An)+ is 20 cycles and doesn't use any registers except the src and dst pointers so MOVEM saves very little if you don't mind unrolling the MOVE.L loop a little.

Yes, that's why I said above a certain size. You have to be doing a silly amount of small copies to even care about the cost.
Considerably more loop overhead can be eliminated at some point too with a Duff's device.

Duff is just a way to do unrolling without having a leftover part at the end by jumping into the middle of the loop the first time. It only works if the pointers are actually updated after every unit copied, not using offset addressing with one big pointer update at the end of the loop -- so it doesn't apply to typical RISC code.

Yes, they use a computed goto/jump.
Duff's devices work on MIPS/MIPS64 and ARM/ARM64. So not sure what "doesn't apply to typical RISC code" means.

Let's check.

Any questions?

In both cases, the 2nd version, without Duff, is going to be much preferable.

Questions, yes.. why would you even write another memcpy() function in C? Makes no sense. Existing memcpy() is likely to have all kinds of optimisations already implemented.

Here is Go's assembly duff code generator for various CPUs for duffcopy/zero.
https://golang.org/src/runtime/mkduff.go

Title: Re: Block copy instructions
Post by: magic on January 15, 2021, 03:00:40 am
I'm not really fluent in ARM assembly and calling conventions, so I'm not sure what's happening in your code, but why is the Duff version so long?
And what stops you from jumping into the middle of a loop which uses offset addressing? You mean R+constant addressing mode by that, with an increasing constant for each unrolled iteration?
Title: Re: Block copy instructions
Post by: brucehoult on January 15, 2021, 04:00:32 am
Questions, yes.. why would you even write another memcpy() function in C? Makes no sense. Existing memcpy() is likely to have all kinds of optimisations already implemented.

Someone has to write the library one.

Quote
Here is Go's assembly duff code generator for various CPUs for duffcopy/zero.
https://golang.org/src/runtime/mkduff.go

If it's assembly language then it's not Duff's device.

Duff specifically means a trick in C interleaving a loop and a switch statement to get an effect similar to what you can trivially do in assembly language.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 15, 2021, 06:02:11 pm
It's not just memcpy()!

First of all, there are two variants of memcpy(): one that copies in increasing addresses, and one that copies in decreasing addresses.  I call these memcpy_inc() and memcpy_dec(), respectively; with memcpy() itself being arbitrarily either one.

memmove(dst, src, len) uses memcpy_inc() if dst < src, and memcpy_dec() if dst > src.

memrepeat(dest, off, totallen) uses memcpy_inc() with src = dest, dst = dest + off, and len = totallen - off.  It speeds up array initialization significantly, works for any array members including structures, and in several old Fortran-C comparisons (Fortran using something similar internally to initialize arrays, and standard C not having one) was the reason Fortran was faster than C.  (Fortran as a language is easier to optimize, and array slicing allows some very powerful techniques, but all of those can be implemented in C; they just are not in the standard C library.  When I started writing MD code, I often had to explain why I was using C and not Fortran, and Fortran users weren't happy when I showed how I could do everything at least as efficiently as anyone's Fortran code could.  I seem to have a knack for offending people by showing them facts they really don't like.)

Then we have the very often used comparison functions, which are very similar to copies, except that they read both sources and abort at the first difference.
This includes memcmp(), memmem(), strcmp(), strncmp(), strstr(), and (the rarely implemented) strnstr().

Then we have many of the string functions, which use a single character or a small lookup table to determine when to return.
This includes strspn(), strcspn(), strchr(), strrchr(), memchr(), strsep(), strpbrk().

In assembly, these functions have the same base loop, and really only differ in the loop condition.  The lookup table based ones do an extra load from the lookup table (using a byte from the string as the offset to the table) and test a bit in the result (since in the lookup table, each entry contains a bit mask, with each bit specifying a class to be checked).

(That said, an instruction to test a bit in a possibly large table by bit index alone – so bits 0,1,2 identify the bit, and bits 3-31 (or whatever the native word size is) the byte offset – is something that is not very often needed, but when needed, might make a significant difference.)

My point earlier was that autoincrement and autodecrement speeds up all these, so much so that the inner loops have no other common code than the autoincrement/decrement load from memory.  Looking at ltrace -e str*+mem* output – even for something as simple as the date binary! – shows that these library functions are extremely common; at least as common as memcpy()/memset()/memmove() in typical userspace application code, so picking an approach that makes all these more efficient is a valid strategy in my opinion.

Having watched the x86/x86-64 evolve with respect to its block instructions – rep/repz/repnz cmps/ins/lods/movs/outs/scas/stos –, I do not believe optimizing block operations further than address autoincrement and autodecrement is worth the effort.  Indeed, for a number of x86 processors, these instructions were far from optimal, until Intel/AMD put some effort in the microcode to make them efficient again.

For a RISC-type instruction set, having an extension to enable byte and full register-wide address autoincrement and autodecrement loads and stores yields most of the possible speedups.  Zeroing aligned memory, entire cachelines or pages, is a separate task, and probably something that should be done in cooperation with the memory controller; but only really necessary/useful for general purpose OSes, where privilege separation and avoiding kernel information leaks are important, and often require zeroing memory.

Apologies for any repetition, but this is important.  This does have a measurable impact on many common applications – and a significant one on e.g. the standard Python interpreter (CPython).  Run an application via the aforementioned ltrace command, but do note that current C compilers open-code many memory copies etc., and this only logs the library calls.
Title: Re: Block copy instructions
Post by: gf on January 15, 2021, 07:10:05 pm
It's been notorious that throughout the last 42 years REP MOVSB on x86 has been slower than a well written explicit loop on (almost?) every Intel and AMD model.

It was rather slow in the past, but starting with Ivy Bridge, Enhanced REP MOVSB did become very fast.

For instance, glibc's memcpy() uses it to copy large blocks.
https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html

Quote
Threshold to use Enhanced REP MOVSB.  Since there is overhead to set up REP MOVSB operation, REP MOVSB isn't faster on short data.  The memcpy micro benchmark in glibc shows that 2KB is the approximate value above which REP MOVSB becomes faster than SSE2 optimization on processors with Enhanced REP MOVSB.  Since larger register size can move more data with a single load and store, the threshold is higher with larger register size.
Title: Re: Block copy instructions
Post by: brucehoult on January 16, 2021, 12:27:57 am
Quote
Threshold to use Enhanced REP MOVSB.  Since there is overhead to set up REP MOVSB operation, REP MOVSB isn't faster on short data.  The memcpy micro benchmark in glibc shows that 2KB is the approximate value above which REP MOVSB becomes faster than SSE2 optimization on processors with Enhanced REP MOVSB.  Since larger register size can move more data with a single load and store, the threshold is higher with larger register size.

2 KB!!?? That's freaking huge.

Once it hits production I'm sure memcpy() using RISC-V V extension will be optimal for ... I don't know ... maybe 16 bytes and up.

Code: [Select]
    .global memcpy
    # void *memcpy(void* dest, const void* src, size_t n)
    # a0=dest, a1=src, a2=n
    #
  memcpy:
      mv a3, a0 # Copy destination
  loop:
    vsetvli t0, a2, e8,m8,ta,ma   # Vectors of 8b
    vle8.v v0, (a1)               # Load bytes
      add a1, a1, t0              # Bump pointer
      sub a2, a2, t0              # Decrement count
    vse8.v v0, (a3)               # Store bytes
      add a3, a3, t0              # Bump pointer
      bnez a2, loop               # Any more?
      ret                         # Return

For any call copying up to 8 times the vector register size (typically 128 to 512 bits, so 8 * 16 to 64 bytes, or 128-512 bytes) this will execute exactly ten instructions, including the call and return (or 8 if inlined).

If you compare that to the Newlib memcpy() I posted earlier in the thread, for a 1 register memcpy() it actually executes more instructions because it first has to figure out whether this is a big or a small memcpy() and jump to the simple loop -- which still calculates the limit and increments the src and dst in executing the loop once.
Title: Re: Block copy instructions
Post by: brucehoult on January 16, 2021, 12:58:34 am
For a RISC-type instruction set, having an extension to enable byte and full register-wide address autoincrement and autodecrement loads and stores yields most of the possible speedups.

The hole in your argument is your assumption that a program with fewer instructions is faster.

This is not correct. Adding autoincrement loads to an ISA such as RISC-V suddenly requires that it be possible for an integer instruction to write results to two registers instead of one.

There are two basic implementation choices:

1) build a register file that can take two writes in the same clock cycle. This is expensive, and a huge waste given that no other instruction will make use of this capability.

2) split the autoincrement load into two uops, executing in two clock cycles. The execution time will be the same as using two instructions in the first place, with the only savings being a small amount of program code storage.


OK, there's another option -- make a queue of results waiting for writeback to the register file and do the extra write in a cycle when an instruction is executing that doesn't have a result, such as a conditional branch or a store.

That will work, but it's a lot of complexity added, including the complexity of checking whether the current value for every register read is in the actual register or in the writeback queue. That's getting frighteningly close to full OoO with register renaming.

If you want to implement a register file that can take two result writes every clock cycle then you're probably better off keeping the simple instructions and doing lockstep dual-issue to two pipelines. Then your load and your pointer bump instructions can execute at the same time in the two pipelines -- but so can a whole bunch of other pairs of instructions. Generally experience shows a good implementation will dual-issue about 60% of the time, at a cost of about an extra 15% in core area and power use, making it a very good option for cores intended for maximum energy efficiency such as the ARM A7/A8/A9/A53/A55, SiFive U74 (etc), Western Digital SweRV.
Title: Re: Block copy instructions
Post by: SiliconWizard on January 16, 2021, 01:51:19 am
In my thoughts, I clearly wanted to avoid any of this. No two destination registers. It would not just require a register file that allows two writes per cycle (which is already bad enough), but more importantly, it would drastically increase the complexity of data forwarding (bypassing) in the pipeline. That's way worse IMO. So, basically, nope!

That's why the "accelerator" I have in mind would actually NOT modify any register - it would have its own address registers.
Title: Re: Block copy instructions
Post by: westfw on January 16, 2021, 02:21:05 am
Isn't this getting into "much ado about nothing" territory when your pipelined instruction stream is churning along at less than 1 cycle (<1 ns) per instruction, but your memory is taking 60 ns or so for each access?  Block copies are going to be more limited by cache architectures and the memory system than by whether you have a specialized instruction or not.

(ok, I'm curious - the cache on modern CPUs is pretty complex and generally under-documented.  For an N-byte block copy, what percentage of the memory accesses (both source and destination) are likely to hit what level of cache?)

(and this overlooks common block copies, ie for IO, where either source or destination is KNOWN to be in "external" memory.)

Title: Re: Block copy instructions
Post by: Nominal Animal on January 16, 2021, 03:17:51 am
This is not correct. Adding autoincrement loads to an ISA such as RISC-V suddenly requires that it be possible for an integer instruction to write results to two registers instead of one.
That only applies to loads.  Another possibility for loads is to merge the autoincrement/decrement into the pointer comparison branching instruction. (Then only two additional instructions are needed: one that autoincrements a register and branches if it is less than (or equal to) another register; and another that autodecrements a register and branches if it is greater than (or equal to) another register.  Technically, branching whenever not equal suffices.  Whether the register store is done always, or only when the branch is taken, and whether the autoincrement/decrement occurs before or after the comparison, is irrelevant as long as it is consistent.)

Note that my analysis is not based on instruction count: it is based on the commonality among all these block functions.  The address increment/decrement is the common factor, and if it can be merged with one of the other functions without additional cycle cost, it will bring the savings.  The reason I assumed they would be done in conjunction with the loads and stores – although I explicitly said I don't care if the addresses are pre- or post-incremented/decremented – is that many architectures have that.

Just because autoincrement/decrement is problematic with loads, I would not throw out the entire idea.
Title: Re: Block copy instructions
Post by: brucehoult on January 16, 2021, 03:19:42 am
Isn't this getting into the "much ado about nothing" when your pipelined instruction stream is churning along at less than 1cycle (<1ns) per instruction, but your memory is taking 60ns or so for each access?  block copies are going to be more limited by cache architectures and memory system than whether you have a specialized instruction or not.

For copies that don't hit in cache, absolutely right, though you do need to consider that your 60 ns access time is then going to give you at least 64 bytes of data -- so 16 loops of register copies on a 32 bit machine, or 8 loops on a 64 bit machine.

Also loads from L2 into L1 will for sure be in the L1 cache line size and will be a lot less latency than 60 ns (I think 5 to 7 ns is typical), while loads from RAM into L2 have big latency but may return several L1 cache lines worth of data.

For copying things that are in cache, every instruction matters.
Title: Re: Block copy instructions
Post by: brucehoult on January 16, 2021, 03:24:54 am
That only applies to loads.  Another possibility for loads is to merge the autoincrement/decrement into the pointer comparison branching instruction.

Agreed that the argument doesn't apply to stores (which is why I didn't mention them), and also that inc/dec and branch is a much less disruptive instruction.

There is also the possibility of combining autoincrement stores and loads with base+offset addressing, which I have written about here before. This fits well with typical code that reads more than one data item per loop but only stores one result.
Title: Re: Block copy instructions
Post by: gf on January 16, 2021, 10:31:26 am
For copies that don't hit in cache, absolutely right, though you do need to consider that your 60 ns access time is then going to give you at least 64 bytes of data

In order to utilize the full memory bandwidth, multiple memory transactions must overlap over time. Consequently it is important to fill the CPU pipeline with enough memory fetch microinstructions, which can then be pending simultaneously, and/or to arrange asynchronous prefetch to the L1 cache. The former is usually achieved by unrolling the loop; the latter can be done explicitly via prefetch instructions, or happens automatically. The CPU's automatic prefetch may take a couple of memory transactions, though, until it detects the sequential access pattern and joins in. Optimizing all this stuff depends very much on the particular microarchitecture.

In order that two instructions in the pipeline can be executed simultaneously, they must not have data dependencies. A sequence of auto-increment instructions like r2=*r1++, r3=*r1++,... actually has data dependencies (via r1), i.e. r3=*r1++ cannot start before r2=*r1++ has finished (unless the CPU splits them internally to microinstructions w/o auto-increment like r2=*r1, r1++, r3=*r1, r1++,... -- then the memory fetch microoperations can overlap again).

EDIT:
Yet another point which has not been considered/discussed yet (or maybe I missed it?) is branch misprediction of the copy loop's conditional branch. Branch misprediction penalty of 1..3 dozens of clock cycles is not unusual. In detail it depends of course on the particular microarchitecture, and the CPU's state and operations in progress when it happens.
Title: Re: Block copy instructions
Post by: brucehoult on January 16, 2021, 10:35:19 am
In order that two instructions in the pipeline can be executed simultaneously, they must not have data dependencies. A sequence of auto-increment instructions like r2=*r1++, r3=*r1++,... actually has data dependencies (via r1), i.e. r3=*r1++ cannot start before r2=*r1++ has finished (unless the CPU splits them internally to microinstructions w/o auto-increment like r2=*r1, r1++, r3=*r1, r1++,... -- then the memory fetch microoperations can overlap again).

Right.

Yet another reason that r2=0[r1]; r3=1[r1]; r4=2[r1]; r5=3[r1]; r1+=4 is a much better idea.
Title: Re: Block copy instructions
Post by: gf on January 16, 2021, 11:13:51 am
In order that two instructions in the pipeline can be executed simultaneously, they must not have data dependencies. A sequence of auto-increment instructions like r2=*r1++, r3=*r1++,... actually has data dependencies (via r1), i.e. r3=*r1++ cannot start before r2=*r1++ has finished (unless the CPU splits them internally to microinstructions w/o auto-increment like r2=*r1, r1++, r3=*r1, r1++,... -- then the memory fetch microoperations can overlap again).

Right.

Yet another reason that r2=0[r1]; r3=1[r1]; r4=2[r1]; r5=3[r1]; r1+=4 is a much better idea.

:-+ Yes, this eliminates even the data dependency via the ADD instructions between the memory fetch instructions (although the ADD is usually just a single cycle, i.e. likely still negligible when the memory fetch does not hit the cache).
Title: Re: Block copy instructions
Post by: Nominal Animal on January 16, 2021, 03:00:07 pm
Yet another reason that r2=0[r1]; r3=1[r1]; r4=2[r1]; r5=3[r1]; r1+=4 is a much better idea.
Which works for large block copies, and does nothing to help with the other related operations. :(
Title: Re: Block copy instructions
Post by: brucehoult on January 16, 2021, 11:51:53 pm
EDIT:
Yet another point which has not been considered/discussed yet (or maybe I missed it?) is branch misprediction of the copy loop's conditional branch. Branch misprediction penalty of 1..3 dozens of clock cycles is not unusual. In detail it depends of course on the particular microarchitecture, and the CPU's state and operations in progress when it happens.

Ahhh ... 3 *dozens* of clock cycles? No way. I can't think of any machine that would apply to.

The penalty for a branch mispredict -- with both possible targets already in cache -- is strictly less than or equal to the pipeline length.

For example, the SiFive FE-310 (32 bit microcontroller) and FU-540 (64 bit Linux capable) both have 5 stage pipelines, and both have 3 cycle penalty for mispredicted branches.

The SiFive FU-740 has an 8 stage pipeline (two pipelines, as it is dual-issue). Mispredicted conditional branches usually have a 4 cycle penalty, except when they depend on the result of an instruction that issues in the same clock cycle as the branch (in the other pipeline). In this case there is a 6 cycle penalty.

Mispredicted indirect branches (i.e. JALR) have a 6 cycle penalty. As these are not conditional branches, the "misprediction" involved is failure of branch target prediction (including a mismatch in the return stack cache in the case of RET) rather than branch taken/not taken prediction.


The Out-of-Order U84 core has a 12 stage pipeline, and a 10 cycle branch mispredict penalty.

The AMD Zen 3 has an 11 to 18 cycle branch mispredict penalty. Skylake and Zen 1 are 19 or 20 cycles. ARM A72 is 15 cycles. A76 is 11 cycles.

I don't know of anything that is 2 dozen, let alone 3 dozen.

Note that with modern branch predictors that use a branch history table as part of the predictor (a series of true/false bits indicating whether the last N branches were taken or not taken), it is typical for loops that iterate just a few times (anywhere from 10 to 30) to have EVERY branch correctly predicted, including the final branch that exits the loop.
Title: Re: Block copy instructions
Post by: gf on January 17, 2021, 01:11:41 pm
Ahhh ... 3 *dozens* of clock cycles? No way. I can't think of any machine that would apply to.

I'm basically referring to the analysis by the authors of this paper (https://users.elis.ugent.be/~leeckhou/papers/ispass06-eyerman.pdf), who say that

    "the branch misprediction penalty can be substantially larger than the frontend pipeline length
     (which is often equated with the misprediction penalty)"

if the indirect side-effects of the misprediction are taken into account. Figure 1 shows the analyzed average penalties for a couple of different application profiles, and they range from 9 to 35 cycles. They are obviously very application-dependent. I was surprised, too, that the indirect effects can be so significant.

Quote
Note that with modern branch predictors that use a branch history table as part of the predictor (a series of true/false bits indicating whether the last N branches were taken or not taken), it is typical for loops that iterate just a few times (anywhere from 10 to 30) to have EVERY branch correctly predicted, including the final branch that exits the loop.

Luckily there are cases where this indeed works, but I still doubt that every branch inside a memcpy function can be predicted correctly, given that the function can be called with a different size argument upon each invocation.

Another point is that the BPT is basically yet another N-way cache (indexed by a couple of bits from the branch instruction's address), so the former history of a branch can get evicted, and the entry reused for other branch instructions in the program which happen to map to the same index.
Title: Re: Block copy instructions
Post by: SiliconWizard on January 17, 2021, 03:44:10 pm
Ahhh ... 3 *dozens* of clock cycles? No way. I can't think of any machine that would apply to.

I'm basically referring to the analysis by the authors of this paper (https://users.elis.ugent.be/~leeckhou/papers/ispass06-eyerman.pdf), who say that

    "the branch misprediction penalty can be substantially larger than the frontend pipeline length
     (which is often equated with the misprediction penalty)"

if the indirect side-effects of the misprediction are taken into account. Figure 1 shows the analyzed average penalties for a couple of different application profiles, and they range from 9 to 35 cycles. They are obviously very application-dependent. I was surprised, too, that the indirect effects can be so significant.

Thanks for the paper, I'll have a look. Certainly things can get hairy with superscalar CPUs.

Another point is that the BPT is basically yet another N-way cache (indexed by a couple of bits from the branch instruction's address), so the former history of a branch can get evicted, and the entry reused for other branch instructions in the program which happen to map to the same index.

Yep, but typical memory copy loops are pretty tight, which means addresses are relatively near, so the probability of a branch target being evicted in a tight loop is pretty low (since BTB indexing is most often on the lower bits of the addresses).

Of course, that is true only if the whole copy executes without any context switch. In case of a context switch in the middle of the copy, all bets are off.
Title: Re: Block copy instructions
Post by: brucehoult on January 17, 2021, 04:49:55 pm
Ahhh ... 3 *dozens* of clock cycles? No way. I can't think of any machine that would apply to.

I'm basically referring to the analysis by the authors of this paper (https://users.elis.ugent.be/~leeckhou/papers/ispass06-eyerman.pdf), who say that

    "the branch misprediction penalty can be substantially larger than the frontend pipeline length
     (which is often equated with the misprediction penalty)"

if the indirect side-effects of the misprediction are taken into account. Figure 1 shows the analyzed average penalties for a couple of different application profiles, and they range from 9 to 35 cycles. They are obviously very application-dependent. I was surprised, too, that the indirect effects can be so significant.

So they are talking about a different thing.

Their basic results seem to be that mispredicted branches frequently depend on a value loaded from RAM or L2 and that the ROB drains while waiting to resolve it. And that this happens often enough to be a significant factor.

I can well believe this is true for the machine they base their results on!

The paper is rather old (written in 2005 for a conference in early 2006), and studies only one particular invented/simulated machine that is not at all representative of the machines of today:

- 4 wide Out-of-order superscalar
- 64 entry ROB
- 16 KB L1 dcache, 1 MB L2
- 12 cycle L2 access time, 200 cycle RAM access time

Machines we discuss here tend to be either microcontrollers running out of SRAM or else OoO desktop/smartphone CPUs with rather more execution resources.

4-wide is still typical for modern OoO CPUs. Apple has built an 8-wide monster, but let's ignore that for now -- benchmarks clearly indicate that they have the effect of mispredicts well under control.

- 64 entry ROB with 4-wide execution means there is only 16 cycles of instructions buffered. This compares badly to the 12 cycle L1 miss time, let alone RAM. Current CPUs from Intel and AMD have typical ROB sizes of 192 or 224 with the latest Zen 3 having 256 and Intel’s Sunny Cove and Willow Cove 352. Tests show Apple's M1 has around a 630 entry ROB.

- 12 cycle L1 miss / L2 access time is still typical, but happens a good bit less often because L1 cache sizes are typically 32 KB. Also now when L2 misses we have 8 MB or more of L3 with typical access times around 40 cycles, which is vastly different to 200 for RAM.

- with clock speeds now around 4 to 5 GHz and RAM stuck at 60 ns, RAM access is even a bit more than assumed in the paper, at 240 to 300 cycles. But with those big L3 caches it happens a lot less frequently.


For me, the glaring thing here is the ROB containing only 16 cycles of instructions, barely enough to cover an L1 miss, and simply dying on an L2 miss.

Current CPUs' ROBs buffer 50 to 90 cycles' worth of instructions, while servicing most L2 cache misses from L3 in around 40 cycles instead of 200+ from RAM.

I think the same study done with real current CPUs would show very different results.

Quote
Quote
Note that with modern branch predictors that use a branch history table as part of the predictor (a series of true/false bits indicating whether the last N branches were taken or not taken), it is typical for loops that iterate just a few times (anywhere from 10 to 30) to have EVERY branch correctly predicted, including the final branch that exits the loop.

Luckily there are cases where this indeed works, but I still doubt that every branch inside a memcpy function can be predicted correctly, given that the function can be called with a different size argument upon each invocation.

You don't have to get *every* branch correct, only the vast majority.

Branch prediction uses the history of where the memcpy() was called from. In most cases any particular caller calls memcpy() with the same size.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 17, 2021, 05:27:19 pm
An example piece of RISC-V (RV32I) assembly code for a block copy of N 'words' looks like (do not hesitate to provide more optimized code if there is):
[Assume: a0 contains N, a1 the source start address and a2 the destination start address]
Something brucehoult mentioned above made me realize we haven't discussed these operations at the pipeline stage level, to discover what the theoretical maximum copy rate is on your target processor.  I assume it can only do one RAM access per clock cycle (load or store), and only one register file write per clock cycle.
What else? What operations can be done in parallel within the same clock cycle?

(I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions.)

Based on the above, the utter maximum theoretical copy rate is half a word (of native register size) per clock cycle, or equivalently two clock cycles per native register size word copied.  (For string operations, two cycles per byte copied, unless you have a branch instruction "branch if none of the bytes in the target register are zeroes".)

Because we cannot create an unrolled version for every single length, up to one cycle per iteration is taken by the loop branch.  As brucehoult showed earlier, that overhead depends on the number of unrolled copies per loop iteration – but then the block size needs to be at least as large as the copy step.

That leaves the address register updates.  With offset addressing (a small immediate constant, a multiple of the native register size, added to a register), that too boils down to one or two cycles per loop iteration (depending on whether updating a register value can be done in a single cycle or not).  If the register update can be done in parallel with a load or store or the loop branch, its "cost" vanishes.

It seems to me that both a block-memcpy instruction and a normal assembly loop should be able to get arbitrarily close to two cycles per native word copied (under the above assumptions); to do any better than that, you need e.g. wider single-cycle memory accesses, or something else outside the box, so to speak.  If you agree, then shouldn't we look at which low-level operations can be done in parallel in the same clock cycle on your processor design, to see how to reach the two-cycles-per-word copy rate?  And what kind of instructions would get there?
Title: Re: Block copy instructions
Post by: gf on January 17, 2021, 07:13:12 pm
What operations can be done in parallel within the same clock cycle?
(I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions.)

Depends on the processor's microarchitecture. "RISC-V" is not a particular processor implementation, but an open Instruction Set Architecture. Optimizations at this level are no longer generic, but microarchitecture-dependent.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 17, 2021, 09:51:39 pm
What operations can be done in parallel within the same clock cycle?
(I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions.)
Depends on the processor's microarchitecture. "RISC-V" is not a particular processor implementation, but an open Instruction Set Architecture. Optimizations at this level are no longer generic, but microarchitecture-dependent.
That's exactly why I wrote, "on [SiliconWizard's] target processor".  If we limit to RISC-V alone, then there is nothing to discuss, is there?

As SiliconWizard has mentioned, they envision these instructions as a custom extension; and I do believe discussing the limitations/abilities of the underlying implementation is necessary to discuss what kind of instructions would be most effective.
Title: Re: Block copy instructions
Post by: SiliconWizard on January 17, 2021, 10:38:50 pm
So that we don't run in circles, if possible...
As I said earlier:

* On superscalar architectures, on which the increment/decrement instructions can be executed in parallel with the loads/stores, dedicated instructions would probably not have any sizeable performance benefit. OTOH, on simpler architectures, they can be considered, and my rationale is that dedicated instructions with a relatively simple "accelerator" would be much, much simpler to implement than a superscalar architecture - so they could have a benefit for simpler processors.

* As we discussed with Bruce, loop unrolling and offset addressing can mitigate the inc/dec, but that takes a lot more instructions overall.

* And in any case: dedicated instructions would yield smaller code size.

Not completely sure what Nominal Animal meant here: "I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions."
Any instruction can do many things at once as long as the logic path doesn't exceed your requirements. For instance, RISC-V CSR instructions involve read, modify and write of CSRs and that can be done (that's what I do) in a single cycle.

Loading or storing AND incrementing or decrementing of the address with a single instruction while having a CPI=1 throughput is no problem whatsoever.
(The problem, though, as we said earlier, is with loads, because if the instruction modifies 2 registers instead of one - the read value and the address register - this is a royal pain. Stores are no problem.)
Title: Re: Block copy instructions
Post by: Nominal Animal on January 18, 2021, 01:04:25 am
Not completely sure what Nominal Animal meant here: "I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions."
Any instruction can do many things at once as long as the logic path doesn't exceed your requirements.
The question is, what are the allowed combinations of "many things" on your implementation: which micro-operations your target can do at the same time.

Here is a practical suggestion for extending RV32I:
  • SBi (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and increment rd by one.
  • SBd (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and decrement rd by one.
  • SWi (rd), rs: Store register rs to memory address specified by register rd, and increment rd by four.
  • SWd (rd), rs: Store register rs to memory address specified by register rd, and decrement rd by four.
  • BNEi rd, rs, 1<<n, delta: Compare rd to rs, and increment rd by 1<<n.  If not equal, branch by delta.
  • BNEd rd, rs, 1<<n, delta: Compare rd to rs, and decrement rd by 1<<n.  If not equal, branch by delta.
Note that BNEi and BNEd are really 2-5 instructions each, for at least n=0 (change by 1) and n=2 (change by 4).

If these are implementable without extra latencies or overheads compared to their non-autoincrement versions in RV32I, then these are one possible set that provide very close to the maximum throughput for block memory copies, but also allowing efficient implementations for the other memory/string operations.

But my point is, there are other such sets, depending on the implementation of the target processor.  Without knowing which micro-ops the implementation can do in parallel, it is useless to discuss such sets, because efficiency-wise they are equivalent.  We'd just be listing the possible sets.

For bulk RV32I memcpy(), unrolling the inner iterations using an immediate offset as brucehoult described earlier,
    .loop:
        lw    t1, 0(a1)
        sw    t1, 0(a2)
        lw    t1, 4(a1)
        sw    t1, 4(a2)
        addi  a1, a1, 8
        addi  a2, a2, 8
        bne   a3, a1, .loop
yields a copy rate of 4N/(2N+3) bytes per instruction with N copies per iteration (0.8, 1.143, 1.333, 1.455, ...), if the size is a multiple of 4N.  However,
    .loop:
        lw    t1, (a1)
        swi   (a2++), t1
        bnei  a1, a3, 1<<2, .loop
would yield a copy rate of 4/3 = 1.333 bytes per instruction for any size a multiple of 4, and
    .loop:
        lw    t1, 0(a1)
        swi   (a2++), t1
        lw    t1, 4(a1)
        swi   (a2++), t1
        bnei  a1, a3, 1<<3, .loop
would yield a copy rate of 4N/(2N+1) bytes per instruction with N copies per iteration (N=1: 1.333, N=2: 1.6, N=4: 1.778, N=8: 1.882, N=16: 1.939).  That is, unrolling the inner loop 16 times, copying 64 bytes per iteration, we'd reach 97% of the theoretical maximum copy rate. (You need to unroll the RV32I loop 48 times, copying 192 bytes at a time, to reach the same.)

For an example of the other possibilities, let's say you could implement lw rd, (rs1+rs2), i.e. load from an address that is the sum of two registers, and the store-autoinc/dec.  Then, you could use rs1 as the target address (autoinc/dec'd), and rs2 be a constant, the difference between the target and source addresses.  You'd get the same near-maximum copy rate as the last snippet above, using one more register:
        sub   a1, a1, a2
    .loop:
        lw    t1, (a2+a1)
        swi   (a2++), t1
        lw    t1, (a2+a1)
        swi   (a2++), t1
        bne  a2, a3, .loop

A yet another possible set is sum-of-two-register address loads and stores, and an autoadd/subtract branch-if-not-equal.

For alignment tests, a branch if any of the N least significant bits in a source register are nonzero, would be useful.
For string operations, a branch comparing each byte in first source register to the corresponding or least significant byte in the second source register, and taking the branch only if none are equal, would allow up to 4× speedup for many operations (with the caveat that internally, the function may access up to three bytes past the end-of-string nul byte).  (This is why I am partial to the autoinc/dec stores and conditional branches as an extension: many birds with few similar stones.)
Title: Re: Block copy instructions
Post by: brucehoult on January 18, 2021, 03:18:40 am
Based on the above, the utter maximum theoretical copy rate is half a word (of native register size) per clock cycle, or equivalently two clock cycles per native register size word copied.  (For string operations, two cycles per byte copied, unless you have a branch instruction "branch if none of the bytes in the target register are zeroes".)

It's not that bad. The following will detect a null byte in a 64 bit register without any special instructions.

#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) & 0x8080808080808080)

Example

X = 'Hello\0xx' = 0x48454c4c4f007878
W = X - 0x0101010101010101 = 0x47444b4b4dff7777
V = ~X = 0xb7bab3b3b0ff8787
U = W & V = 0x0700030300ff0707
T = U & 0x8080808080808080 = 0x0000000000800000

T is not 0, so some byte in X was 0.

This is only four instructions, or three instructions if your CPU has a BIC / ANDN / ANDC instruction (as RISC-V does in the B extension, and also ARM, MSP430, PDP-11 and VAX).

You can find the exact byte using CLZ (Count Leading Zeroes) if you have it, otherwise just iterate through that one register worth of data.

However I wanted to improve on this in the RISC-V B extension, but without doing something as crude as a special "strlen" function.

We already were considering a "Generalized Reverse" (GREV) instruction which takes a 64 bit value from a register and passes it through a 6-level network controlled by 6 bits from the 2nd operand (5 bits for a 32 bit machine):

- bit 0 controls whether pairs of bits in the 1st operand are swapped
- bit 1 controls whether bits 2 apart in each nybble are swapped
- bit 2 controls whether bits 4 apart in each byte are swapped
- bit 3 controls whether bits 8 apart in each short are swapped
- bit 4 controls whether bits 16 apart in each word are swapped
- bit 5 controls whether bits 32 apart are swapped

Some uses:

- a control value of 000000 does nothing
- a control value of 000111 reverses the bits in each byte
- a control value of 111000 reverses the bytes in a register, keeping each individual byte intact (i.e. reverse a string)
- a control value of 111111 reverses all the bits in the register


I realized that another useful instruction (which we are calling GORC for now) could reuse the same network, but instead of conditionally swapping pairs of bits, use another function. Specifically, GORC takes each pair of bits and either passes them through unchanged or else replaces both of them by their OR. You can get GREV+GORC for maybe 10% more than the cost of GREV alone.

- a control value of 000000 does nothing
- a control value of 000111 ORs together all the bits in each byte

Bingo!

Taking our X = 'Hello\0xx' = 0x48454c4c4f007878 example again, GORC X, 7 produces 0xffffffffff00ffff

If the result is not all 1s (-1) then one of the bytes was zero.


Quote
Because we cannot create an unrolled version for every single length, up to one cycle per iteration is taken by the loop branch.  As brucehoult showed earlier, that overhead depends on the number of unrolled copies per loop iteration – but then the block size needs to be at least as large as the copy step.

That leaves the address register updates.  With offset addressing (a small immediate constant, a multiple of the native register size, added to a register), that too boils down to one or two cycles per loop iteration (depending on whether updating a register value can be done in a single cycle or not).  If the register update can be done in parallel with a load or store or the loop branch, its "cost" vanishes.

It seems to me that both block-memcpy instruction and a normal assembly loops should be able to get arbitrarily close to two cycles per native word copied (using the above assumptions), and that anything better than this, you need e.g. wider single-cycle memory accesses, or something else outside the box here, so to speak.  If you agree, then shouldn't we look at which low-level operations can be done in parallel in the same clock cycle on your processor design, to see how to reach the two cycles per word copy rate?  And what kind of instructions would get there?

Seems all correct.

I think a dual issue pipeline will be the minimum within a few years, for all but the tiniest embedded cores, as it is *so* little added area/complexity/expense for a large gain in efficiency. See ARM M7 / A7 / A8 / A9, SiFive 7-series, Western Digital SweRV (open sourced).

With a dual issue machine, you only need to unroll the loop once to completely hide the update of two pointers, a compare against a limit, and a conditional branch. (the compare and branch are one instruction on RISC-V)

Code: [Select]
add limit,src,size
loop:
  lw tmp, 0(src);  addi src,src,8
  sw tmp, 0(dst);  addi dst,dst,8
  lw tmp, -4(src); cmp src,limit
  sw tmp, -4(dst); blt loop
Title: Re: Block copy instructions
Post by: brucehoult on January 18, 2021, 03:32:31 am
For an example of the other possibilities, let's say you could implement lw rd, (rs1+rs2), i.e. load from an address that is the sum of two registers, and the store-autoinc/dec.  Then, you could use rs1 as the target address (autoinc/dec'd), and rs2 be a constant, the difference between the target and source addresses.  You'd get the same near-maximum copy rate as the last snippet above, using one more register:
        sub   a1, a1, a2
    .loop:
        lw    t1, (a2+a1)
        swi   (a2++), t1
        lw    t1, (a2+a1)
        swi   (a2++), t1
        bne  a2, a3, .loop

I have proposed pretty much this before, except my version was more like:

Code: [Select]
  addi a2,a2,-4
  sub a1,a1,a2
loop:
  lw t1,(a2,a1)
  sw! t1,4(a2)
  bne a2,a3,loop

It's cheaper to write back the effective address you already calculated. But it does require one more adjustment instruction before the loop.

This also extends naturally to code that steps through 2 or 3 or more arrays, loading data and doing some element-wise calculation on it, before writing a single result to another array. Which is pretty common. It's just that all the arrays need to have the same element size.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 18, 2021, 03:01:40 pm
However I wanted to improve on this in the RISC-V B extension, but without doing something as crude as a special "strlen" function.
It is not just strlen()/strnlen(), but also for strcmp()/strncmp() and strcpy()/strncpy(), these being the three most used string functions.

If the branch does a byte-wise "branch only if none are equal" comparison between the 4 bytes in the two registers, you could also implement strchr()/strnchr(), and speed up strstr()/memmem().

If the branch does a byte-wise "branch only if none are equal" comparison between 4×4 bytes (in the two registers), you could also implement strpbrk()/strspn()/strcspn() with very low cost, since each register could hold up to four initial bytes in a matching sequence.

Other than this nitpick, I agree.  I don't like the idea of enhancing just one operation either; and I'd prefer to try to keep the patterns so simple there is hope of teaching the compilers to use them.

With a dual issue machine, you only need to unroll the loop once to completely hide the update of two pointers, a compare against a limit, and a conditional branch. (the compare and branch are one instruction on RISC-V)
Quite true, and I did not think of that (or rather, that it was pertinent here).

You see, in HPC, many AMD cores have two SSE/AVX ALUs, so that 128-bit vector operations are essentially dual issue.  I've exploited this a lot in my own code, using compiler intrinsics for the vector operations – perhaps not as efficient as hand-tuned and cycle-profiled assembly, but gets most of the speedup.

It's cheaper to write back the effective address you already calculated.
This!  This is exactly the kind of low-level µ-op details I wanted to know.  Your version is obviously superior to mine; how did I miss it earlier?

But it does require one more adjustment instruction before the loop.
It really depends on the function signature.  With known pointers/size, at least GCC will implement the copy loop itself, and avoid the adjustment by folding in the "constants".  It does the same for e.g. static inline assembly functions, where the arithmetic is done when supplying input register values, when the parameter values are known at compile time.  For unknown pointers/size, for standard C functions, there is at minimum an addition to calculate the limiting address, and possibly some validity checks (at least for the user-programmer facing functions), so one more adjustment is just noise there.
Title: Re: Block copy instructions
Post by: SiliconWizard on January 18, 2021, 03:07:27 pm
Not completely sure what Nominal Animal meant here: "I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions."
Any instruction can do many things at once as long as the logic path doesn't exceed your requirements.
The question is, what are the allowed combinations of "many things" on your implementation: which micro-operations your target can do at the same time.

Here is a practical suggestion for extending RV32I:
  • SBi (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and increment rd by one.
  • SBd (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and decrement rd by one.
  • SWi (rd), rs: Store register rs to memory address specified by register rd, and increment rd by four.
  • SWd (rd), rs: Store register rs to memory address specified by register rd, and decrement rd by four.
  • BNEi rd, rs, 1<<n, delta: Compare rd to rs, and increment rd by 1<<n.  If not equal, branch by delta.
  • BNEd rd, rs, 1<<n, delta: Compare rd to rs, and decrement rd by 1<<n.  If not equal, branch by delta.

I see absolutely no issue nor any required added latency for any of the instructions you suggest above.
The only difference with existing instructions is the additional inc/dec of a dest register. This can be done fully in parallel with the rest of what they have to do, so no problem.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 18, 2021, 04:36:28 pm
I see absolutely no issue nor any required added latency for any of the instructions you suggest above.
The only difference with existing instructions is the additional inc/dec of a dest register. This can be done fully in parallel with the rest of what they have to do, so no problem.
Excellent; thanks.  :-+

How wide are your µ-op memory accesses, then?  Native word sized, or larger?

(The above can reach very close to the theoretical maximum copy rate with one memory access per cycle, in a single-issue core, using native register sized words; with just the branch instruction as the loop iteration "overhead".  If you want faster, you need wider than native word memory accesses per cycle.)
Title: Re: Block copy instructions
Post by: brucehoult on January 18, 2021, 08:47:17 pm
However I wanted to improve on this in the RISC-V B extension, but without doing something as crude as a special "strlen" function.
It is not just strlen()/strnlen(), but also for strcmp()/strncmp() and strcpy()/strncpy(), these being the three most used string functions.

strlen() is implicit in those -- often checked in parallel, but on some machines it is better to explicitly calculate strlen() first and then use a counted loop for the copy or compare.

Better still if the *programmer* calculates the strlen() first, and uses memcpy()/memcmp(), and remembers the strlen() for later use.


Quote
If the branch does a byte-wise "branch only if none are equal" comparison between the 4 bytes in the two registers, you could also implement strchr()/strnchr(), and speed up strstr()/memmem().

That only requires an XOR and then the "is any byte zero" test.

A new branch instruction would fit, but reg + reg + 12 bit branch offset uses a *lot* of encoding space. That's ok for beq/bne/blt/bltu/ble/bleu because they are used absolutely everywhere in programs (~20% of all instructions). The same goes for new reg + reg + 12 bit immediate ALU operations. Personally I would probably remove (or never have added) XORI, as it is used *so* seldom, and when it is used it's usually with 0xffffffff, so it can be done as a subtract from -1 anyway. ORI is also very marginal whether it's worth having; it is used a lot less than ANDI.

The advantage of src1+src2+dst register to register instructions is they use comparatively little opcode space, so you can have a ton of them (eventually). It takes 128 of them to use as much encoding space as a single conditional branch or immediate ALU instruction.

The immediate shift instructions SLLI, SRLI, SRAI use much less encoding space than the other immediate instructions because they only use a 6 bit immediate (5 for RV32), so you can have 64 of them for the same encoding space as a single normal 12 bit immediate instruction. So GREVI and GORCI and other "shift-like" instructions are easy to add.

Quote
If the branch does a byte-wise "branch only if none are equal" comparison between 4×4 bytes (in the two registers), you could also implement strpbrk()/strspn()/strcspn() with very low cost, since each register could hold up to four initial bytes in a matching sequence.

I didn't quite get that.
Title: Re: Block copy instructions
Post by: brucehoult on January 18, 2021, 09:27:00 pm
Not completely sure what Nominal Animal meant here: "I am talking about lowest-level operations, within the instruction pipeline; not about superscalarity or individual full instructions."
Any instruction can do many things at once as long as the logic path doesn't exceed your requirements.
The question is, what are the allowed combinations of "many things" on your implementation: which micro-operations your target can do at the same time.

Here is a practical suggestion for extending RV32I:
  • SBi (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and increment rd by one.
  • SBd (rd), rs: Store least significant byte of register rs to memory address specified by register rd, and decrement rd by one.
  • SWi (rd), rs: Store register rs to memory address specified by register rd, and increment rd by four.
  • SWd (rd), rs: Store register rs to memory address specified by register rd, and decrement rd by four.
  • BNEi rd, rs, 1<<n, delta: Compare rd to rs, and increment rd by 1<<n.  If not equal, branch by delta.
  • BNEd rd, rs, 1<<n, delta: Compare rd to rs, and decrement rd by 1<<n.  If not equal, branch by delta.

I see absolutely no issue nor any required added latency for any of the instructions you suggest above.
The only difference with existing instructions is the additional inc/dec of a dest register. This can be done fully in parallel with the rest of what they have to do, so no problem.

Correct for pipeline design, except that post-increment and post-decrement need an adder which is not already present, and not used by anything else. Both the stores and the branches would fit better if they did the add/increment/decrement *before* the store or compare.

There is also the quibble that for consistency with everything else in the RV32I ISA rd should be changed to rs1 and rs to rs2.

Which brings me to opcode space usage :-)

If they can't have offsets then SBi, SBd, SWi, SWd are very very cheap on opcode space. There would be no reason to leave SH and SD versions out.

On the other hand BNEi and BNEd are very very expensive in opcode space, at least if delta has the same range as the other conditional branch instructions. If n can be 0..3 then BNEi and BNEd use 33% more opcode space than all the existing BNE, BEQ, BLT, BLTU, BGE, BGEU combined. At the very least delta should be shortened by three bits (to +/- 512 bytes instead of +/- 4k) to make room for 2 bits for n and 1 bit for i/d.

In the base ISA func3 (3 bits) is used to choose between BNE, BEQ, BLT, BLTU, BGE, BGEU with values 010 and 011 unused. Nothing else in the currently ratified extensions (MAFD) uses those. I don't know offhand whether one of the extensions in progress does.

My opinion is it might be worth using one of those for BNEi and BNEd, but maybe shortening delta by even one or two bits more, to allow room for other specialized "short conditional" branch instructions in future, and keep one func3 value completely free.
Title: Re: Block copy instructions
Post by: Nominal Animal on January 18, 2021, 09:54:23 pm
Better still if the *programmer* calculates the strlen() first, and uses memcpy()/memcmp(), and remembers the strlen() for later use.
It is still duplicated work; I'd like to avoid that.

Also, a lot of existing C code relies heavily on nul-terminated string operations (instead of known-length string operations); python3 (the standard CPython interpreter) in particular.  Based on ltrace -o log -e strlen python3 empty.py, python-3.6.9 on my machine does 15,509 strlen() calls on interpreter startup; about half of them on strings 9 characters or longer, and over 3000 on strings longer than 16 characters.

Quote
If the branch does a byte-wise "branch only if none are equal" comparison between 4×4 bytes (in the two registers), you could also implement strpbrk()/strspn()/strcspn() with very low cost, since each register could hold up to four initial bytes in a matching sequence.
I didn't quite get that.
Probably nobody did, it was such a poor description!

Consider registers a and b as four-byte arrays.  The instruction I was trying to describe is
    if (a[0] == b[0] || a[0] == b[1] || a[0] == b[2] || a[0] == b[3] ||
        a[1] == b[0] || a[1] == b[1] || a[1] == b[2] || a[1] == b[3] ||
        a[2] == b[0] || a[2] == b[1] || a[2] == b[2] || a[2] == b[3] ||
        a[3] == b[0] || a[3] == b[1] || a[3] == b[2] || a[3] == b[3]) take_branch;

Typical strspn(), strcspn(), strtok(), and strpbrk() delimiter sets are relatively small, typically 6 (ASCII whitespace characters).  If you have the above branch test instruction, a 32-bit register can describe up to four different delimiters, or four initial characters for a parallel multi-string strstr()/memmem() function, with a match jumping to a character-by-character test.

Let's say t1 contains the next four bytes in the string, and t2 and t3 contain 5 to 8 delimiters.  (If there are fewer, then some are just duplicated.)  If we call the above branch instruction blah, the inner loop would be something like
    .loop:   ; src=a1, end=a2, delimiters in t2 and t3
        lw    t1, (a1)
        blah  t1, t2, .check
        blah  t1, t3, .check   
        bnei  a1, a2, 1<<2, .loop
        return, no matches
    .check:
        byte-by-byte check at a1

The idea is that matches are relatively rare, and we can afford examining them one by one when we know a match exists; we just want to speed up the "no match" case.

However, string operations seem much less common than memcpy(), memmove(), and memcmp().   On my machine, python-3.6.9 makes 43,435 memcpy()/memmove()/memcmp() calls on startup.  Thus, I consider the string stuff secondary: wishlist kind of thing.  And this 4×4-byte register comparison operator, just an example of what could work from a programmer perspective.
Title: Re: Block copy instructions
Post by: vad on January 19, 2021, 01:16:30 am
On the most modern systems block-copy performance is limited by the latency and throughput of RAM interface, and therefore a single block-copy CISC instruction would have no noticeable advantage over multiple RISC instructions.
Title: Re: Block copy instructions
Post by: brucehoult on January 19, 2021, 02:19:34 am
On the most modern systems block-copy performance is limited by the latency and throughput of RAM interface, and therefore a single block-copy CISC instruction would have no noticeable advantage over multiple RISC instructions.

Except most block-copy calls (and probably most bytes copied in block copies) are in sizes far smaller than typical L1 caches, so a CISCy instruction with lower start-up overhead than a RISC software loop (choosing the appropriate version of copy, initializing registers etc) could have an advantage.

Except Intel has finally gotten REP MOVSB faster than a software loop -- for large copies. If I understand certain web pages correctly, on Skylake and newer REP MOVSB takes 20 cycles for any copy less than 12 bytes, and has a 40 cycle startup overhead for any copy greater than 76 bytes, but then copies 64 bytes every 4 cycles.
Title: Re: Block copy instructions
Post by: vad on January 19, 2021, 02:51:48 am
On the most modern systems block-copy performance is limited by the latency and throughput of RAM interface, and therefore a single block-copy CISC instruction would have no noticeable advantage over multiple RISC instructions.

Except most block-copy calls (and probably most bytes copied in block copies) are in sizes far smaller than typical L1 caches, so a CISCy instruction with lower start-up overhead than a RISC software loop (choosing the appropriate version of copy, initializing registers etc) could have an advantage.

Except Intel have finally recently gotten REP MOVSB faster than a software loop -- for large copies. If I understand certain web pages correctly, on Skylake and newer REP MOVSB takes 20 cycles for any copy less than 12 bytes, and has a 40 cycle startup overhead for any copy greater than 76 bytes, but then copies 64 bytes every 4 cycles.
Except the cache still carries a latency penalty. For example, Skylake's L1 data cache latency is a whopping 4 cycles per operation (8 cycles in total, since you need to both read and write). And that's if you are lucky enough to have both the source and destination lines in the tiny L1 cache.
Title: Re: Block copy instructions
Post by: brucehoult on January 19, 2021, 06:30:37 am
On most modern systems, block-copy performance is limited by the latency and throughput of the RAM interface, and therefore a single block-copy CISC instruction would have no noticeable advantage over multiple RISC instructions.

Except most block-copy calls (and probably most bytes copied in block copies) involve sizes far smaller than a typical L1 cache, so a CISCy instruction with lower start-up overhead than a RISC software loop (choosing the appropriate copy variant, initializing registers, etc.) could have an advantage.

Except Intel have finally gotten REP MOVSB faster than a software loop -- for large copies. If I understand certain web pages correctly, on Skylake and newer REP MOVSB takes 20 cycles for any copy of less than 12 bytes, and has a 40-cycle startup overhead for any copy of more than 76 bytes, but then copies 64 bytes every 4 cycles.
Except the cache still carries a latency penalty. For example, Skylake's L1 data cache latency is a whopping 4 cycles per operation (8 cycles in total, since you need to both read and write). And that's if you are lucky enough to have both the source and destination lines in the tiny L1 cache.

Writes don't have latency. They go into a write queue immediately and you don't care what happens to them after that (unless you try to read the value back again before it gets written to the L1 cache, in which case you get up to a 10-cycle penalty on Skylake and newer).
Title: Re: Block copy instructions
Post by: DiTBho on January 19, 2021, 09:31:49 am
Write Through and Write Deferred (aka Write Back) in Cache


Title: Re: Block copy instructions
Post by: vad on January 20, 2021, 02:37:06 am
Writing to a "write back" cache can also have latency. If the line being written to is not present in the cache, the line will have to be loaded from RAM first, unless the data being written is the size of a full cache line and is perfectly aligned.
Title: Re: Block copy instructions
Post by: brucehoult on January 20, 2021, 02:58:37 am
Writing to a "write back" cache can also have latency. If the line being written to is not present in the cache, the line will have to be loaded from RAM first, unless the data being written is the size of a full cache line and is perfectly aligned.

Not if you have a store buffer.

(unless you have so many cache misses on write in a row that the store buffer backs up)