On the Z80, you do have atomic 16-bit loads and stores using the register pairs (BC, DE, HL), an atomic 16-bit swap (but only with the top of the stack, via EX (SP),HL), and also 16-bit addition and subtraction, so there are several different ways one could do it atomically. Expressing them in C is difficult, though.
Cortex-M4 (see PM0214 at st.com) implements load-linked/store-conditional primitives: the exclusive-access LDREX/STREX instructions. Essentially, you have a short interval, a few instructions, to do some operations, and then you can do a store-conditional write to that variable. If that variable was not accessed in between (by anything in the system), the store succeeds; otherwise the store fails (the value is not updated!), with the success/failure reported in a (chosen) register. CMSIS provides these as the __LDREXB()/__LDREXH()/__LDREXW() and __STREXB()/__STREXH()/__STREXW() macros or functions.
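For illustration, here is a minimal sketch of an atomic 32-bit addition built on those intrinsics (atomic_add_u32 is just an illustrative name, not something CMSIS provides; __STREXW() returns 0 on success, non-zero if exclusivity was lost):

#include <stdint.h>

/* Sketch only: atomically add 'delta' to *target using load/store-exclusive. */
static inline uint32_t atomic_add_u32(volatile uint32_t *target, uint32_t delta)
{
    uint32_t newval;
    do {
        newval = __LDREXW(target) + delta;  /* load-linked, compute the new value */
    } while (__STREXW(newval, target));     /* store-conditional; retry if it failed */
    return newval;
}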
For the 64-bit cycle counter, you can make it "lockless" by using two copies of the 64-bit counter, and an 8-bit LL/SC index or offset indicating which copy is valid:
static volatile word64 cycles_count[2];
static volatile uint8_t cycles_index; /* 0 if cycles_count[0] is valid, 8 if cycles_count[1] is valid */
uint64_t cycles(void)
{
    word64 count;
    uint32_t i;

    /* Load the most recently fully updated value */
    do {
        i = __LDREXB(&cycles_index) & 8;
        count = cycles_count[i >> 3];
    } while (__STREXB(i, &cycles_index));

    const uint32_t cyccnt = ARM_DWT_CYCCNT; /* Or DWT->CYCCNT */

    /* Update the 64-bit count; bump the high word if the 32-bit counter has wrapped */
    count.u32_hi += (cyccnt < count.u32_lo);
    count.u32_lo = cyccnt;

    /* Store the updated count into the other slot; if the store-exclusive
       succeeds, the index is updated to mark that slot as the valid one. */
    do {
        i = (__LDREXB(&cycles_index) ^ 8) & 8;
        cycles_count[i >> 3] = count;
    } while (__STREXB(i, &cycles_index));

    return count.u64;
}
The cycles_index contains the byte offset of the currently valid cycles_count entry (0 or 8, since each word64 is 8 bytes).
The first do { .. } while loop obtains the currently valid 64-bit counter state. It almost always does just one iteration: it only repeats if the operation is interrupted (or some other code accesses cycles_index in between). Because we cannot read the 64-bit counter state atomically, we do need a loop here.
We then read the 32-bit cycle counter and update the local copy of the 64-bit counter state; if the new reading is smaller than the previously stored low word, the hardware counter has wrapped around, so the high word is incremented. (We cannot read the cycle counter before we know the old counter state, because otherwise we might see time flowing backwards.)
Finally, in the second do { .. } while loop, we store the updated 64-bit counter value to the not-currently-valid entry. If the store is not interrupted and no other code accesses cycles_index while we do it, cycles_index is updated to reflect that this new entry is now valid. We do need a loop here, because there are only two slots we can use. If there were more slots than the maximum possible number of nested cycles() calls (consider interrupts at different priorities!), a single iteration would suffice. In practice, the loop repeats so rarely that it is not worth using more than two slots.
Because cycles_index is only modified when the corresponding entry has been updated, the currently valid entry is never trashed.
If a cycles() call is interrupted by something that also calls cycles(), the valid entry afterwards reflects the 64-bit cycle counter value from the outermost call (which is obviously a bit earlier than the value the innermost cycles() call obtained). The return values from cycles() are still monotonic; it is only concurrent cycles() calls that can cause the internal state to change in a non-monotonic manner. This only matters if you are debugging and examining the cycle counter state variables: then, with concurrent calls to cycles(), say one call interrupted by a function that also calls cycles(), you can see the internal state updated in a non-monotonic order.
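In use, cycles() can simply be called from any context, for example to measure a duration (do_work() here is just a placeholder for whatever is being timed):

uint64_t start = cycles();
do_work();                            /* placeholder for the code being timed */
uint64_t elapsed = cycles() - start;  /* 64 bits, so wraparound is not a practical concern */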
The above probably needs the following definitions:
#include <stdint.h>

typedef union {
    uint64_t u64;
    int64_t  i64;
    uint32_t u32[2];
    int32_t  i32[2];
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && __BYTE_ORDER__-0 == __ORDER_LITTLE_ENDIAN__-0
    struct {
        uint32_t u32_lo;
        uint32_t u32_hi;
    };
    struct {
        int32_t i32_lo;
        int32_t i32_hi;
    };
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && __BYTE_ORDER__-0 == __ORDER_BIG_ENDIAN__-0
    struct {
        uint32_t u32_hi;
        uint32_t u32_lo;
    };
    struct {
        int32_t i32_hi;
        int32_t i32_lo;
    };
#else
#error Unknown byte order. Define __BYTE_ORDER__ to __ORDER_LITTLE_ENDIAN__ or to __ORDER_BIG_ENDIAN__.
#endif
} word64;
__attribute__((always_inline))
static inline uint8_t __LDREXB(const volatile uint8_t *addr)
{
    uint32_t result;
    asm volatile ("ldrexb\t%0, [%1]\n\t"
                  : "=r" (result)
                  : "r" (addr)
                 );
    return result;
}

__attribute__((always_inline))
static inline uint32_t __STREXB(uint8_t val, volatile uint8_t *addr)
{
    uint32_t result;
    asm volatile ("strexb\t%0, %1, [%2]\n\t"
                  : "=&r" (result)
                  : "r" (val), "r" (addr)
                 );
    return result;
}

__attribute__((always_inline))
static inline uint16_t __LDREXH(const volatile uint16_t *addr)
{
    uint32_t result;
    asm volatile ("ldrexh\t%0, [%1]\n\t"
                  : "=r" (result)
                  : "r" (addr)
                 );
    return result;
}

__attribute__((always_inline))
static inline uint32_t __STREXH(uint16_t val, volatile uint16_t *addr)
{
    uint32_t result;
    asm volatile ("strexh\t%0, %1, [%2]\n\t"
                  : "=&r" (result)
                  : "r" (val), "r" (addr)
                 );
    return result;
}
__attribute__((always_inline))
static inline uint32_t __LDREXW(const volatile uint32_t *addr)
{
    uint32_t result;
    asm volatile ("ldrex\t%0, [%1]\n\t"
                  : "=r" (result)
                  : "r" (addr)
                 );
    return result;
}

__attribute__((always_inline))
static inline uint32_t __STREXW(uint32_t val, volatile uint32_t *addr)
{
    uint32_t result;
    asm volatile ("strex\t%0, %1, [%2]\n\t"
                  : "=&r" (result)
                  : "r" (val), "r" (addr)
                 );
    return result;
}
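All of this also assumes the DWT cycle counter is actually running; on many Cortex-M4 parts it is not enabled out of reset. A typical sketch to enable it, assuming the standard CMSIS core definitions (core_cm4.h, normally pulled in via the device header), is:

static void cycles_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT/ITM trace block */
    DWT->CYCCNT = 0;                                 /* start counting from zero */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             /* enable the cycle counter */
}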
Note that the same approach works no matter how large each slot (each cycles_count entry here) is; it could very well be a structure. Each such state does need its own index or offset variable, of course. Because the load-linked/store-conditional pair only brackets the copy of the structure (plus the index update in the store case), even a larger structure does not meaningfully increase the retry count: the window for interruption is just the few clock cycles each copy takes. Again, there is no race window per se, because even if an interrupt occurs at the worst possible moment, that only means the loop does another iteration, and each iteration only takes a few clock cycles anyway.
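As a rough sketch of that generalization (the state_t type and the state_get()/state_put() names are purely illustrative, not part of the code above; the index here holds a slot number 0 or 1 rather than a byte offset), using the __LDREXB()/__STREXB() helpers defined earlier:

typedef struct {
    uint64_t cycles;
    uint32_t events;
    uint32_t overruns;
} state_t;

static volatile state_t state_slot[2];
static volatile uint8_t state_index;   /* which slot is currently valid: 0 or 1 */

static state_t state_get(void)
{
    state_t snapshot;
    uint8_t i;
    do {
        i = __LDREXB(&state_index) & 1;
        snapshot = state_slot[i];          /* copy the currently valid slot */
    } while (__STREXB(i, &state_index));   /* retry if something intervened */
    return snapshot;
}

static void state_put(const state_t *update)
{
    uint8_t i;
    do {
        i = (__LDREXB(&state_index) ^ 1) & 1;  /* pick the not-currently-valid slot */
        state_slot[i] = *update;               /* fill it in */
    } while (__STREXB(i, &state_index));       /* publish it by flipping the index */
}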