Topic: Pi Pico, memory/cache coherency, and __dmb()

profdc9 (Topic starter)
Pi Pico, memory/cache coherency, and __dmb()
« on: June 10, 2024, 09:51:23 pm »
Hello,

I was wondering if there are any concurrency experts that could help me understand how to synchronize events between cores.

One question I have is when an atomic write to a memory location on one core (for example, a single 32-bit write) becomes visible to the other core.  I understand that, within a single thread, a __dmb() must be placed between two sets of memory operations to ensure that all accesses before the barrier complete before any access after it, even if the accesses could otherwise be reordered.  So, for example, if one updates a data structure and then sets a flag to indicate that the structure has been updated, a __dmb() should be placed between the update and the flag write so that, for example, an IRQ handler monitoring the flag sees the update already in place.
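
A minimal sketch of the update-then-flag pattern described above, written with C11 atomics so that it compiles off-target. The names `sample_block_t`, `publish()` and `try_consume()` are made up for illustration. On the Pico, `__dmb()` would sit where the fences are, assuming it acts as a full data memory barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared structure: a payload plus a "ready" flag. */
typedef struct {
    int32_t left;
    int32_t right;
    atomic_bool ready;
} sample_block_t;

/* Writer: fill in the payload first, then raise the flag. */
static void publish(sample_block_t *b, int32_t left, int32_t right) {
    assert(b != NULL);
    b->left = left;
    b->right = right;
    /* Release fence: the payload writes are ordered before the flag write.
       On the Pico this is where __dmb() would go. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&b->ready, true, memory_order_relaxed);
}

/* Reader: only touch the payload after seeing the flag. */
static bool try_consume(sample_block_t *b, int32_t *left, int32_t *right) {
    assert(b != NULL && left != NULL && right != NULL);
    if (!atomic_load_explicit(&b->ready, memory_order_relaxed)) {
        return false;
    }
    /* Acquire fence: the payload reads are ordered after the flag read. */
    atomic_thread_fence(memory_order_acquire);
    *left = b->left;
    *right = b->right;
    /* Hand the block back; release so the reads above finish first. */
    atomic_store_explicit(&b->ready, false, memory_order_release);
    return true;
}
```

The sketch assumes the writer does not call `publish()` again until the reader has cleared `ready`; it does not check for that.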

Furthermore, there are hardware spinlocks to arbitrate access to resources, and SEV / WFE to asynchronously deliver a signal from one core to another.  I am trying to figure out how to solve my problem, perhaps using these tools.

What I would like to do is to have both cores performing 1/2 of a calculation simultaneously every 40 microseconds like clockwork.  Every 40 microseconds I want them to sync up and combine their results.  For a single core, I can just create an alarm and trigger it every 40 microseconds.  Each core processes an audio channel sample and the results are combined together and so this must happen in real time.

I could, for example, have core 0 triggering every 40 microseconds using an alarm or interrupt, and core 1 polling a flag that is set when core 0 is done with its half of the calculation.  On some processors, there are cache coherency issues, so a flag written into memory by one core would not necessarily be visible to the other.  I do not think that __dmb() ensures that a write on one processor is visible to another, only that the writes happen in a particular order.

Also, I have noticed that sometimes __dmb() has a very long delay and so if it is placed in an interrupt, the interrupt may not finish in time for the next sample.

I have looked at some of the built-in synchronization primitives in the pico library, and it looks like the pico_queue library might work.  I do not like that it uses calloc(), so I may add a function that takes a caller-supplied pointer instead, like this:

Code:
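// Variant of queue initialisation that takes caller-supplied storage
// instead of calling calloc(). The buffer must hold at least
// (element_count + 1) * element_size bytes, because the index arithmetic
// in queue_get_level_unsafe() wraps at element_count + 1 slots.
void queue_init_with_spinlock_user_buf(queue_t *q, uint element_size, uint element_count, uint spinlock_num, void *data) {
    lock_init(&q->core, spinlock_num);
    q->data = (uint8_t *)data;
    q->element_count = (uint16_t)element_count;
    q->element_size = (uint16_t)element_size;
    q->wptr = 0;
    q->rptr = 0;
}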

Would pico_queue be better to use here?  I am trying to understand how it works, and how the code relies on knowing that the inspected memory is shared between the cores.  If I understood better how it is known that the memory written by one core can be read by another, I would be able to perhaps write my own code that uses the spinlocks to synchronize the cores.
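
To make the mechanics concrete, here is a host-side sketch of a ring buffer with caller-supplied storage that uses the same index convention as the SDK's queue_get_level_unsafe(): one slot more than the usable capacity. `ring_t`, `RING_BYTES` and the `ring_*` functions are hypothetical names, not part of the SDK, the wrap rule in `ring_next()` is an assumption about how the index advances, and the locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for queue_t, without the lock. */
typedef struct {
    uint8_t *data;
    uint16_t element_size;
    uint16_t element_count;
    uint16_t wptr;
    uint16_t rptr;
} ring_t;

/* Bytes the caller must provide: one slot more than the usable capacity. */
#define RING_BYTES(size, count) ((size_t)(size) * ((size_t)(count) + 1u))

static void ring_init(ring_t *r, void *buf, size_t buf_bytes,
                      uint16_t element_size, uint16_t element_count) {
    assert(buf != NULL && buf_bytes >= RING_BYTES(element_size, element_count));
    r->data = (uint8_t *)buf;
    r->element_size = element_size;
    r->element_count = element_count;
    r->wptr = 0;
    r->rptr = 0;
}

/* Advance an index, wrapping after slot number element_count. */
static uint16_t ring_next(const ring_t *r, uint16_t index) {
    return (uint16_t)(index == r->element_count ? 0 : index + 1);
}

/* Number of stored elements, using the same arithmetic as the SDK helper. */
static unsigned ring_level(const ring_t *r) {
    int32_t level = (int32_t)r->wptr - (int32_t)r->rptr;
    if (level < 0) {
        level += r->element_count + 1;
    }
    return (unsigned)level;
}

static bool ring_try_add(ring_t *r, const void *element) {
    if (ring_level(r) == r->element_count) {
        return false;                       /* full */
    }
    memcpy(r->data + (size_t)r->wptr * r->element_size, element, r->element_size);
    r->wptr = ring_next(r, r->wptr);        /* publish after the copy */
    return true;
}

static bool ring_try_remove(ring_t *r, void *element) {
    if (ring_level(r) == 0) {
        return false;                       /* empty */
    }
    memcpy(element, r->data + (size_t)r->rptr * r->element_size, r->element_size);
    r->rptr = ring_next(r, r->rptr);        /* free the slot after the copy */
    return true;
}
```

A static array declared as `uint8_t storage[RING_BYTES(sizeof(int32_t), 2)]` would be enough for a two-element queue of 32-bit samples.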

Thank you,

Dan

 

ataradov
Re: Pi Pico, memory/cache coherency, and __dmb()
« Reply #1 on: June 10, 2024, 10:09:24 pm »
There is no cache in RP2040 or in Cortex-M0+ in general. On systems with a cache, any cache coherency issues are resolved via a cache controller.

I can't imagine a scenario where DMB would be slow. If DMB is slow, it would mean that a pending access is also slow, which would have the same potential to impact ISR performance. But you would have to provide more details; it is likely that you are not measuring it correctly.

But in this case you don't need any barriers. Just a shared variable is sufficient. As long as one core sets it, and the other core reads and clears it, there should be no issues.
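
A minimal sketch of that shared-variable handshake, assuming an in-order core with no cache as described above. `tick_pending`, `tick_signal()` and `tick_take()` are illustrative names only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared between the cores; volatile so the compiler re-reads it on every poll. */
static volatile uint32_t tick_pending;

/* Core 0, for example from the 40 us alarm handler: raise the flag. */
static void tick_signal(void) {
    tick_pending = 1;
}

/* Core 1, polled from its main loop: returns true once per raised flag. */
static bool tick_take(void) {
    if (tick_pending == 0) {
        return false;
    }
    tick_pending = 0;   /* only the reader clears, only the writer sets */
    return true;
}
```

If core 1 falls a whole period behind, a second `tick_signal()` lands on a flag that is already set and that tick is lost, so a real-time design would probably want to detect that case.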
« Last Edit: June 10, 2024, 10:11:46 pm by ataradov »
Alex
 

SiliconWizard
Re: Pi Pico, memory/cache coherency, and __dmb()
« Reply #2 on: June 10, 2024, 10:23:34 pm »
A single 32-bit write or read is by nature atomic on 32-bit ARM, so no need to even use atomics for that. Of course, if the data is wider than this, then you'll need atomics.
Another use case where you need atomic operations even for 32-bit data is for anything requiring more than a single read or write access, such as incrementing or decrementing a value in memory.
Note that Cortex-M0(/+) cores do not have specific atomic instructions, so atomic operations are usually implemented in software via various approaches, such as disabling interrupts.
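
As a rough sketch of the interrupt-disabling approach: the stubs below are hypothetical host stand-ins so the example compiles off-target. On the Pico, save_and_disable_interrupts() and restore_interrupts() from hardware/sync.h would take their place.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the interrupt mask; 1 means masked. */
static uint32_t irq_masked;

static uint32_t irq_save_stub(void) {
    uint32_t previous = irq_masked;
    irq_masked = 1;
    return previous;
}

static void irq_restore_stub(uint32_t previous) {
    irq_masked = previous;
}

static volatile uint32_t sample_count;

/* Increment is a load, an add and a store. Masking interrupts keeps an ISR
   on the calling core from running between those steps. */
static uint32_t sample_count_increment(void) {
    uint32_t previous = irq_save_stub();
    uint32_t next = sample_count + 1;
    sample_count = next;
    irq_restore_stub(previous);
    return next;
}
```

Masking interrupts only protects against handlers on the same core. If both cores modify the same counter, something that excludes the other core, such as one of the hardware spinlocks, would be needed as well.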

The queues from the pico-sdk work very well, and unless you need some very specific, hand-crafted code to optimize performance down to a few cycles, I recommend using them. The benefit is that they serve both as "safe" data passing between cores and as a sync mechanism, since you can wait on a queue. So if you typically have one core that produces data and another core that consumes it, queues are ideal.
« Last Edit: June 10, 2024, 10:25:12 pm by SiliconWizard »
 

profdc9 (Topic starter)
Re: Pi Pico, memory/cache coherency, and __dmb()
« Reply #3 on: June 11, 2024, 02:15:44 am »
I am not sure I understand how __dmb() works then.  For example, take the following code from the Pi Pico library:

Code:
static inline uint queue_get_level_unsafe(queue_t *q) {
    int32_t rc = (int32_t)q->wptr - (int32_t)q->rptr;
    if (rc < 0) {
        rc += q->element_count + 1;
    }
    return (uint)rc;
}

static bool queue_add_internal(queue_t *q, const void *data, bool block) {
    do {
        uint32_t save = spin_lock_blocking(q->core.spin_lock);
        if (queue_get_level_unsafe(q) != q->element_count) {
            memcpy(element_ptr(q, q->wptr), data, q->element_size);
            q->wptr = inc_index(q, q->wptr);
            lock_internal_spin_unlock_with_notify(&q->core, save);
            return true;
        }
        if (block) {
            lock_internal_spin_unlock_with_wait(&q->core, save);
        } else {
            spin_unlock(q->core.spin_lock, save);
            return false;
        }
    } while (true);
}

static bool queue_remove_internal(queue_t *q, void *data, bool block) {
    do {
        uint32_t save = spin_lock_blocking(q->core.spin_lock);
        if (queue_get_level_unsafe(q) != 0) {
            memcpy(data, element_ptr(q, q->rptr), q->element_size);
            q->rptr = inc_index(q, q->rptr);
            lock_internal_spin_unlock_with_notify(&q->core, save);
            return true;
        }
        if (block) {
            lock_internal_spin_unlock_with_wait(&q->core, save);
        } else {
            spin_unlock(q->core.spin_lock, save);
            return false;
        }
    } while (true);
}

A single 32-bit write is atomic, however, it is not guaranteed that if several writes are performed which order they are performed in, unless there is a __dmb() delineating the writes that should happen first from those that should happen afterwards, correct?  So let's say that on core 0 we call queue_add_internal(), where q->wptr is set after the memcpy().  Without a __dmb(), there doesn't seem to be a guarantee that the memcpy() writes happen before the write to q->wptr.  Now let's say that right after q->wptr is written, core 1 calls queue_remove_internal().  The queue_get_level_unsafe() function reads the number of elements in the queue, but there is no guarantee that the memcpy() from queue_add_internal() has completed by the time it does this.  So there's no guarantee that the data to be memcpy() is actually in the queue when it is going to be removed.

I would think that the writers of these functions would not make a mistake, so why isn't a __dmb() required, as in this modified version:

Code:
static bool queue_add_internal(queue_t *q, const void *data, bool block) {
    do {
        uint32_t save = spin_lock_blocking(q->core.spin_lock);
        if (queue_get_level_unsafe(q) != q->element_count) {
            memcpy(element_ptr(q, q->wptr), data, q->element_size);
            __dmb();
            q->wptr = inc_index(q, q->wptr);
            lock_internal_spin_unlock_with_notify(&q->core, save);
            return true;
        }
        if (block) {
            lock_internal_spin_unlock_with_wait(&q->core, save);
        } else {
            spin_unlock(q->core.spin_lock, save);
            return false;
        }
    } while (true);
}

static bool queue_remove_internal(queue_t *q, void *data, bool block) {
    do {
        uint32_t save = spin_lock_blocking(q->core.spin_lock);
        if (queue_get_level_unsafe(q) != 0) {
            memcpy(data, element_ptr(q, q->rptr), q->element_size);
            __dmb();
            q->rptr = inc_index(q, q->rptr);
            lock_internal_spin_unlock_with_notify(&q->core, save);
            return true;
        }
        if (block) {
            lock_internal_spin_unlock_with_wait(&q->core, save);
        } else {
            spin_unlock(q->core.spin_lock, save);
            return false;
        }
    } while (true);
}

Wouldn't these __dmb() calls be required to ensure that the data is in the queue before the write pointer is advanced, and similarly that the data is copied out of the queue before the read pointer is advanced?

 

ataradov
Re: Pi Pico, memory/cache coherency, and __dmb()
« Reply #4 on: June 11, 2024, 02:44:23 am »
Quote from: profdc9 on June 11, 2024, 02:15:44 am
A single 32-bit write is atomic, however, it is not guaranteed that if several writes are performed which order they are performed in, unless there is a __dmb() delineating the writes that should happen first from those that should happen afterwards, correct?
This is correct, but it only applies to regular memories, not strongly ordered memories. All ARM devices have a memory map with a number of regions. All those regions have a number of attributes. MPU (if present) will let you override those things.

So yes, in theory multiple sequential writes to SRAM may happen in any order. In practice none of the Cortex-M0+ devices do any out of order access.

Quote from: profdc9 on June 11, 2024, 02:15:44 am
So there's no guarantee that the data to be memcpy() is actually in the queue when it is going to be removed.
In a general abstract case this is correct. In practice on this specific device it is not necessary. And on devices where it is necessary, you need to use much more complicated things, not just DMB.

Most cases where you need memory barriers involve writes to different types of memories. For example, a flash controller may take data over AHB bus and a command over APB bus. In this case access order may get messed up in some cases. But again, this will only happen on Cortex-M3/M4/M7. Cortex-M0+ cores are primitive, they do everything in order over the same AHB interface. There is no buffering or reordering.
« Last Edit: June 11, 2024, 02:49:46 am by ataradov »
Alex
 

ataradov
Re: Pi Pico, memory/cache coherency, and __dmb()
« Reply #5 on: June 11, 2024, 02:55:39 am »
One place where you need to be careful is when signalling to the other core after the data write. If this signalling does not happen through a regular variable, but for example using SEV, you do need a barrier before SEV. But in this case you need DSB, not DMB.

And again, this only applies on Cortex-M4/M7. Those cores still do all accesses in order, but they have a somewhat deep write buffer. So, the data may be hanging in that buffer when SEV wakes up the other core. This is not an issue if signalling happens over a variable, since that variable would be written last, so by the time the second core reads it as ready, all other data would have been written.
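
A small sketch of that ordering for a core with a write buffer: write the data, then DSB, then SEV. The two stubs are hypothetical host stand-ins that only record the call order. On target, the barrier and event instructions (for example __dsb() and __sev() in the pico-sdk) would be used instead.

```c
#include <stdint.h>

/* Hypothetical stand-ins that log the order in which they were called. */
static char call_log[8];
static unsigned call_count;

static void dsb_stub(void) {
    if (call_count < sizeof call_log) {
        call_log[call_count++] = 'D';
    }
}

static void sev_stub(void) {
    if (call_count < sizeof call_log) {
        call_log[call_count++] = 'S';
    }
}

/* Data the woken core will read after its WFE returns. */
static volatile uint32_t mailbox;

static void post_and_wake(uint32_t value) {
    mailbox = value;   /* 1. write the data */
    dsb_stub();        /* 2. wait until the write has left the write buffer */
    sev_stub();        /* 3. only then wake the other core */
}
```

When the wake-up is a polled variable rather than an event, steps 2 and 3 collapse into a plain write of that variable, which is the case described as safe above.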
Alex
 

SiliconWizard
Re: Pi Pico, memory/cache coherency, and __dmb()
« Reply #6 on: June 11, 2024, 03:00:24 am »
- There is no cache in the RP2040, except for XIP (and so only for read-only memory, so there is no cache issue possible) and the M0+ core is very simple, so memory order issues are a bit unlikely.
- Even so, as you noticed, all queue accesses are guarded by a spinlock, so memory access order wouldn't matter anyway. The whole section is guarded.
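
A rough host-side sketch of why a guarded section does not need barriers of its own, assuming the lock's acquire and release carry the ordering, as the C11 memory orders below do. `guard`, `shared_total` and the functions are illustrative names. This is a software lock built on C11 atomic_flag, not the RP2040 hardware spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_flag guard = ATOMIC_FLAG_INIT;
static uint32_t shared_total;   /* only touched while holding guard */

static void guard_lock(void) {
    /* Acquire: accesses after the lock cannot move before it. */
    while (atomic_flag_test_and_set_explicit(&guard, memory_order_acquire)) {
        /* spin until the holder releases */
    }
}

static void guard_unlock(void) {
    /* Release: accesses before the unlock cannot move after it. */
    atomic_flag_clear_explicit(&guard, memory_order_release);
}

/* Everything between lock and unlock is seen by the next lock holder as a
   unit, whatever order the accesses inside were performed in. */
static uint32_t add_guarded(uint32_t amount) {
    guard_lock();
    shared_total += amount;
    uint32_t snapshot = shared_total;
    guard_unlock();
    return snapshot;
}
```

In the queue code, the memcpy() and the pointer update both sit between taking and releasing the lock, so the other core can only observe the state before or after the whole section.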

Apart from that specific piece of code, if you're interested in knowing more about memory barrier instructions on Cortex-M: https://documentation-service.arm.com/static/5efefb97dbdee951c1cd5aaf
« Last Edit: June 11, 2024, 03:02:28 am by SiliconWizard »
 

profdc9 (Topic starter)
Re: Pi Pico, memory/cache coherency, and __dmb()
« Reply #7 on: June 13, 2024, 03:32:21 am »
Thanks.  I got it working so that the two cores work together to process samples:  synth on core 1, effects on core 0.
 

