EEVblog Electronics Community Forum

Products => Computers => Programming => Topic started by: HwAoRrDk on July 03, 2020, 10:44:00 am

Title: Writing a memmove() implementation
Post by: HwAoRrDk on July 03, 2020, 10:44:00 am
I'm writing a memmove() implementation in assembly for a microcontroller platform that doesn't have a particularly efficient version in the compiler's standard C library.

I am pondering whether or not to implement a certain shortcut in the code: in the case that source and destination pointers are the same, should I have the function do nothing and return immediately?

One of the reasons I have doubts about whether this is a good idea is regarding side effects. For example, perhaps the region of memory is mapped to some peripheral register that triggers certain actions when the register is read or rewritten? It may confuse the user if they are expecting this behaviour and it doesn't occur because my memmove() made the operation a no-op.

Any thoughts?
Title: Re: Writing a memmove() implementation
Post by: greenpossum on July 03, 2020, 10:55:46 am
The relevant standard for memmove states what the result of the function shall be and the implementor is free to optimise as long as that result is obtained. Any other side effect is unspecified.
Title: Re: Writing a memmove() implementation
Post by: HwAoRrDk on July 03, 2020, 11:30:25 am
I suppose following that, no optimisation like this should be allowed, as the standard states that the result shall be that data is copied. :)

What I'm wanting to do is follow the 'principle of least surprise' - that is, do what other memmove() implementations commonly do.

I have actually found a Stack Overflow post that says that glibc's memmove() does such a shortcut (at least, on x86-64), so other implementations do indeed do this. But, again, there is the matter of side effects. A desktop PC platform is not the same as a microcontroller. I would assume the potential register read/write side effects issue would not be a concern on x86.
Title: Re: Writing a memmove() implementation
Post by: Siwastaja on July 03, 2020, 11:37:59 am
I think there is no problem. In the case you describe (memory must be read, no optimizations allowed), the memory must be qualified volatile, and the standard memmove definition does not have a volatile qualifier:
    void *memmove(void *dest, const void *src, size_t n);
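
A minimal test program along these lines reproduces the issue (this is a sketch, not the exact t.c from the warnings below, so line and column numbers differ):
Code: [Select]
#include <string.h>

volatile char dst[123];
volatile char src[123];

int main(void)
{
    memmove(dst, src, 123);   /* both arguments silently discard volatile */
    return 0;
}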

With GCC, passing a volatile pointer to memmove produces the following warnings, even without -Wall:
Code: [Select]
t.c: In function ‘main’:
t.c:9:10: warning: passing argument 1 of ‘memmove’ discards ‘volatile’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  memmove(dst, src, 123);
          ^~~
In file included from t.c:2:0:
/usr/include/string.h:46:14: note: expected ‘void *’ but argument is of type ‘volatile char *’
 extern void *memmove (void *__dest, const void *__src, size_t __n)
              ^~~~~~~
t.c:9:15: warning: passing argument 2 of ‘memmove’ discards ‘volatile’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  memmove(dst, src, 123);
               ^~~
In file included from t.c:2:0:
/usr/include/string.h:46:14: note: expected ‘const void *’ but argument is of type ‘volatile char *’
 extern void *memmove (void *__dest, const void *__src, size_t __n)
              ^~~~~~~

so the developer would instantly notice their mistake. If they work around it by not declaring the variables volatile, then the problem isn't confined to the memmove function; it exists everywhere, in their own code as well, because no read operation is guaranteed.

If you want to be helpful for some particular special case, you could introduce your own memmove_volatile() with volatile arguments; then it would make sense to guarantee things like:
1) All src bytes are read,
2) All dst bytes are written to,
3) preferably, the order of reads/writes,
and document exactly how it works; a rough sketch follows below.
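A minimal sketch of what such a memmove_volatile() could look like (the name and the exact guarantees are illustrative, not any standard API):
Code: [Select]
#include <stddef.h>

/* Every src byte is read and every dst byte is written, exactly once:
 * forwards when dest is at or below src, backwards otherwise, so
 * overlapping regions are still handled. */
volatile void *memmove_volatile(volatile void *dest,
                                const volatile void *src, size_t n)
{
    volatile unsigned char *d = (volatile unsigned char *)dest;
    const volatile unsigned char *s = (const volatile unsigned char *)src;

    if (d <= s) {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];                /* forward copy */
    } else {
        while (n--)
            d[n] = s[n];                /* backward copy */
    }
    return dest;
}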

But I bet whoever needs such special requirements writes their own implementation in their own code instead.
Title: Re: Writing a memmove() implementation
Post by: greenpossum on July 03, 2020, 11:42:49 am
I suppose following that, no optimisation like this should be allowed, as the standard states that the result shall be that data is copied. :)

No, the standard says that the destination shall be the same as the source upon return. How that's achieved is not mandated.
Title: Re: Writing a memmove() implementation
Post by: Siwastaja on July 03, 2020, 11:56:37 am
It's such an obvious optimization that I would be surprised to see a modern, actually-in-wide-use implementation that does not do that.
Title: Re: Writing a memmove() implementation
Post by: HwAoRrDk on July 03, 2020, 01:28:51 pm
No, the standard says that the destination shall be the same as the source upon return. How that's achieved is not mandated.

Where in the standard (or which standard?) does it say that? I'm reading ISO/IEC 9899:1999 section 7.21.2.2, and it doesn't say that. It just says that characters will be copied from source to destination. As far as I can see, the 2011 edition has identical wording for the definition of memmove(). (I would paste the section here, but the text when copied from the PDF loses all sense of spacing and formatting.)

Anyway, Siwastaja makes a good point. :-+ The function arguments are not volatile, so no expectation of guaranteed reading/writing of memory should be made.

I think I will add the shortcut optimisation. It's literally one additional branch instruction.
Title: Re: Writing a memmove() implementation
Post by: greenpossum on July 03, 2020, 01:36:32 pm
The point is: can you really tell, after return, whether your function has done a read and a write, or has done nothing? Both satisfy the word copy. Copy does not imply read and write; it just means the destination content is the same as the source content. You're attributing to it a particular implementation of copy.

Also note that the overlap case, of which this is a perfect overlap, is defined in terms of "as if" a temporary array were used. No serious implementation would actually allocate a temp buffer to do the copy.

The thing is you have to look at the intent of the standard, not the letter of it. There are various implementations which are allowed under the definition. On a machine which can efficiently copy multibyte words at a time, one could imagine a routine that notices the source and destination are word aligned, does word copies, and finishes with byte copies for the tail.
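Something along these lines, roughly (illustrative only: forward copy, no overlap handling, and both pointers assumed to really be 32-bit aligned):
Code: [Select]
#include <stddef.h>
#include <stdint.h>

/* The bulk of the data moves as 32-bit words; the leftover tail bytes
 * are copied one at a time. */
static void copy_aligned(void *dest, const void *src, size_t n)
{
    uint32_t *d = dest;
    const uint32_t *s = src;

    while (n >= sizeof(uint32_t)) {     /* word copies */
        *d++ = *s++;
        n -= sizeof(uint32_t);
    }

    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    while (n--)                         /* byte copies for the tail */
        *db++ = *sb++;
}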
Title: Re: Writing a memmove() implementation
Post by: Jeroen3 on July 03, 2020, 02:44:54 pm
You can't reliably use memmove on peripheral registers since it uses void pointers. This means the access width is lost.
On many embedded targets you can't do all possible widths on some address regions. You may even hit an unaligned-access error if you start optimizing.

For example, perhaps the region of memory is mapped to some peripheral register that triggers certain actions when the register is read or rewritten?
I'm highly doubtful memmove would be the function to use in that situation. The difference between move and copy is that move allows overlap and copy does not.
If you write your own, you create even more implicit exceptions about when it is not OK to use that function.

This is what you will be optimizing against:
https://github.com/torvalds/linux/blob/7bd1d5edd0160b615ab8748cf94dabcab1fb01cb/arch/x86/boot/compressed/string.c#L53
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memmove.c#L5

It's such an obvious optimization that I would be surprised to see a modern, actually-in-wide-use implementation that does not do that.
They do not. Calling move or copy where both pointers are equal suggests an error somewhere else in the code. It would be a useless comparison in 99.99% of other calls.
Title: Re: Writing a memmove() implementation
Post by: HwAoRrDk on July 03, 2020, 04:56:23 pm
The point is: can you really tell, after return, whether your function has done a read and a write, or has done nothing?

Yes, that is true. :) I suppose it boils down to arguments over semantics. Whether the verb "copy" means "perform the process of replication", or "end up with an identical replica" (by whatever means).

You can't reliably use memmove on peripheral registers since it uses void pointers. This means the access width is lost.
On many embedded targets you can't do all possible widths on some address regions. You may even hit an unaligned-access error if you start optimizing.

This is for an 8-bit microcontroller. Alignment and word width don't really come into it. (And I sure as hell wouldn't be attempting to do this for something that isn't a simple 8-bitter.)

I'm highly doubtful memmove would be the function to use in that situation. The difference between move and copy is that move allows overlap and copy does not.
If you write your own, you create even more implicit exceptions about when it is not OK to use that function.

This will ultimately be for third-party use, so you never know what other people will do with it. Although it may seem a contrived case, if it's possible to use memmove() on a memory-mapped peripheral register, it will probably happen sooner or later. :palm:

This is what you will be optimizing against:
https://github.com/torvalds/linux/blob/7bd1d5edd0160b615ab8748cf94dabcab1fb01cb/arch/x86/boot/compressed/string.c#L53
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memmove.c#L5

The current compiler's standard library memmove() is essentially the same as the latter; but the problem is that it doesn't compile well for this particular architecture.
Title: Re: Writing a memmove() implementation
Post by: greenpossum on July 03, 2020, 10:44:20 pm
The point is: can you really tell, after return, whether your function has done a read and a write, or has done nothing?

Yes, that is true. :) I suppose it boils down to arguments over semantics. Whether the verb "copy" means "perform the process of replication", or "end up with an identical replica" (by whatever means).

For some reason I'm reminded of the Monty Python homicidal barber sketch where he just pretends to cut the customer's hair and says All done.   ;D
Title: Re: Writing a memmove() implementation
Post by: golden_labels on July 05, 2020, 06:55:07 pm
I would question the rationality of such an optimisation.

How often are you passing two identical pointers to memmove? Is it worth wasting instructions on doing a test that is expected to be nearly always false and therefore saves nothing?

If it ever happens to you that memmove is applied in that way, wrap it in an if.
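That is, pay for the check only at the rare call site that can actually hit the equal-pointer case; a sketch (copy_region, dst, src, and n are just made-up names):
Code: [Select]
#include <string.h>

/* Only this caller, which can legitimately end up with dst == src,
 * pays for the comparison. */
static void copy_region(char *dst, const char *src, size_t n)
{
    if (dst != src)
        memmove(dst, src, n);
}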
Title: Re: Writing a memmove() implementation
Post by: brucehoult on July 05, 2020, 10:41:43 pm
I'm sure this is a legal "optimization".

What I'm not at all sure of is whether it is actually an optimization i.e. something that will make programs go faster, on average.

At the very least, it's certainly not much of a pessimization. memmove() already has a bit of unavoidable overhead comparing src against dst (and maybe src+sz against dst) to decide whether to copy upwards or downwards. Once you've compared src and dst in order to do a blt, it's very cheap to also do a beq.

If src and dst compare as different then you've wasted an instruction. But if they compare as equal then you've saved the length of the memory region multiplied by somewhere between 1 and maybe 5 instructions per byte depending on the exact ISA and whether you unroll the loop. (assuming the OP's 8 bit micro so copying using a larger register isn't an option)

The question is how often does each case happen?

Intuitively, I'm struggling to come up with a programming idiom that would generate memmove() with the same src and dst. Maybe some kind of buffer insertion/deletion function that is asked to insert or delete a zero length string and needlessly tries to slide the tail of the buffer up or down by 0 bytes? Better to optimize that function instead.

But the real way to answer this is with data. Make a version of memmove() that counts the total number of calls, the number of calls with src and dst equal, and the sz when they are equal. Then run, preferably, your actual system with this version installed, or else at least run a large corpus of code using it. SPECINT is widely used even to evaluate a possible optimization for embedded code. Or make the change in glibc and run your entire Linux desktop system with the patch installed for a while.
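A sketch of such a counting version (counting_memmove is a made-up name, and here it simply defers to the library memmove() after bumping the counters):
Code: [Select]
#include <stddef.h>
#include <string.h>

/* Counters to dump over a debug UART, or inspect in a debugger, after a long run. */
static unsigned long mm_total_calls;
static unsigned long mm_equal_calls;
static unsigned long mm_equal_bytes;

void *counting_memmove(void *dest, const void *src, size_t n)
{
    mm_total_calls++;
    if (dest == src) {
        mm_equal_calls++;       /* how often would the shortcut fire?  */
        mm_equal_bytes += n;    /* and how much copying would it skip? */
    }
    return memmove(dest, src, n);   /* defer to the real implementation */
}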

My bet is this case would pretty much never be hit. But I could be wrong. That's why it's best to run the test.

Ideally, you'd like to log the file and line number of the *caller* of memmove() when this optimization hits (or ideally the entire backtrace, but at minimum the current return address), and go in and investigate why it happens. Linux, Windows, and OS X all have (different) ways to do this. Or you can use debugger breakpoints with actions to do it (probably much more slowly).

Or just wave a wet finger in the air and put this check in your memmove(). Worst case, it's not going to hurt much -- probably too little to measure.
Title: Re: Writing a memmove() implementation
Post by: RenThraysk on July 06, 2020, 01:56:51 pm
I would also guess that the extra test is not worth it.

For discussion's sake, if the test saved enough cycles to be measurable, then you'd still be able to optimise further by avoiding the call (and parameter setup) altogether.

For instance, if the C compiler has inlining capability then

Code: [Select]
#include <stddef.h>

void *asmmemmove(void *dest, const void *src, size_t len); /* the actual asm copy routine */

void *memmove(void *dest, const void *src, size_t len) {
     if (dest == src || len == 0) {
          return dest;
     }
     return asmmemmove(dest, src, len);
}

might be one way to remove the call with slight impact on output size.

So I wouldn't put the test in the asm anyway.
Title: Re: Writing a memmove() implementation
Post by: SiliconWizard on July 06, 2020, 03:53:59 pm
I would question the rationality of such an optimisation.

How often are you passing two identical pointers to memmove? Is it worth wasting instructions on doing a test that is expected to be nearly always false and therefore saves nothing?

If it ever happens to you that memmove is applied in that way, wrap it in an if.

I agree with that.

There is a way to optimize this approach, though, if you really want to use it: just declare your own memmove() function static.
If it's going to be used in just one source file, no problem. Just define it there with a static qualifier. If you want to use it in more places, you can define it in a header file (the whole body, not just the prototype) with a static qualifier.

It will give the optimizer the opportunity to optimize it everywhere it's called, usually automatically inline it, and even remove unnecessary tests (as long as they can be inferred statically from your source code). Downside is, code size will grow a little if you call this function in many places.

GCC will do this rather cleverly in most cases, and automatically inlines static functions where it's worthwhile doing so - you don't even have to use the inline qualifier.
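A sketch of the header-file variant (my_memmove and the file name are made-up names; an actual drop-in replacement for memmove() itself would also have to avoid clashing with the declaration in string.h):
Code: [Select]
/* my_memmove.h (hypothetical) */
#ifndef MY_MEMMOVE_H
#define MY_MEMMOVE_H

#include <stddef.h>

/* The whole body lives in the header, marked static, so every
 * translation unit that includes it hands the optimizer a local
 * definition it can inline and specialize at each call site. */
static void *my_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (d == s)                     /* the shortcut under discussion */
        return dest;

    if (d < s) {
        while (n--) *d++ = *s++;    /* copy forwards */
    } else {
        while (n--) d[n] = s[n];    /* copy backwards to handle overlap */
    }
    return dest;
}

#endif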

And with all this said, do not underestimate the ability of good compilers (including GCC) to optimize this use case even if you use the standard memmove() function.
If it can be inferred from source code that the destination and source pointers are the same, it will usually optimize out the memmove() call itself, as the standard prototype doesn't define its parameters as volatile.

Of course there are always cases where this can't be inferred statically. But frankly, as golden_labels said, I don't think there are really grounds for bothering with it here.
If this case happens often in some part of your code, that may be an indication that your code is not well implemented to begin with.
Title: Re: Writing a memmove() implementation
Post by: golden_labels on July 06, 2020, 04:43:33 pm
For anyone who wants to see what black magic compilers can do nowadays: some time ago a friend and I made a set of tests with different, often quite convoluted, implementations of some simple operations. The code below is the only one in the set that failed to produce optimal output. Coincidentally, it calls memmove:
Code: [Select]
#include <string.h>

/* Reverses the order of four bytes: swaps v[0] with v[3] and v[1] with v[2]. */
void swap(unsigned char* v) {
    unsigned char tmp;

    memmove(&tmp, v, 1);
    memmove(v, v + 3, 1);
    memmove(v + 3, &tmp, 1);
    memmove(&tmp, v + 1, 1);
    memmove(v + 1, v + 2, 1);
    memmove(v + 2, &tmp, 1);
}
But it wasn't far from the optimum (2 instructions) and, considering what most people would expect, it does it pretty well. gcc on x86_64 produces:
Code: [Select]
   0:   0f b6 07                movzbl (%rdi),%eax
   3:   0f b6 57 03             movzbl 0x3(%rdi),%edx
   7:   66 c1 47 01 08          rolw   $0x8,0x1(%rdi)
   c:   88 17                   mov    %dl,(%rdi)
   e:   88 47 03                mov    %al,0x3(%rdi)
  11:   c3                      retq
The optimum is, of course, using the bswap instruction, avoiding two extra moves.

All implementations of swap, including this one, if used on a constant and declared static, are completely eliminated and replaced by constants.

This is why doing optimizations before actually detecting a performance problem is seen as bad practice. Modern tools produce output that may be far from what you expect, to the point of replacing whole portions of code with constants. This is not always true, and sometimes doing early optimizations may be beneficial, but usually only when you have already done something 20 times and you know that the compiler generated poor code all 20 times for that particular solution.

Title: Re: Writing a memmove() implementation
Post by: westfw on July 07, 2020, 06:36:00 am
FWIW, the avr-libc memmove() does not do such an optimization. It may date from a time when size was more important than speed, though - it would have added 3 instructions to an 11-instruction subroutine...
Title: Re: Writing a memmove() implementation
Post by: RenThraysk on July 07, 2020, 10:30:30 am
I occasionally use godbolt.org just to get an idea of what various C compilers output for small functions. Quite handy.

https://godbolt.org/z/LxsY4m