Author Topic: Why does Cortex-M0(+) implement instructions like ISB if it has no cache? (Read 11525 times)

incf · « **on:** February 26, 2025, 11:58:23 pm »

Why does Cortex-M0(+) implement instructions like ISB (Instruction Synchronization Barrier) if it has no cache?

Are there any systems/SoCs that actually go to the effort of implementing external caching while at the same time deliberately using the very slow/limited Cortex-M0/M0+ core?

My impression is that all (most?) popular MCUs using Cortex-M0/M0+ cores simply never require the user to use synchronization instructions.

Do those instructions have more value when multiple Cortex-M0/M0+ cores are sharing the same memory? (Which is also exceedingly rare, I think.)

edit: I suppose maybe there are some big.LITTLE embedded systems that I overlooked when I asked the question.

coppice · « **Reply #1 on:** February 27, 2025, 12:03:39 am »

If any application for the M0 core, anywhere, in any strange heterogeneous product you never thought of, needs a feature, then everybody has it. The licensing for ARM cores means you are not allowed to tamper with the core in even the most trivial way.

cfbsoftware · « **Reply #2 on:** February 27, 2025, 12:48:02 am »

Quote

The ISB instruction flushes the pipeline in Cortex-M processors and ensures effects of all context
altering operations prior to the ISB are recognized by subsequent operations. It should be used
after the CONTROL register is updated.

REF: Application Note 321 ARM Cortex-M Programming Guide to Memory Barrier Instructions

You can see an example of its use in the Run procedure in the Oberon-RTK Kernel for the Cortex-M0+ RP2040:

https://github.com/ygrayne/oberon-rtk
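
The CONTROL-register case from the quoted application note can be sketched in C as below. This is a minimal illustration, assuming GCC or Clang inline assembly; the helper names (`read_control`, `write_control`, `isb`, `drop_thread_privilege`) are hypothetical and not taken from Oberon-RTK or CMSIS. On a non-Cortex-M build the register is simulated with a plain variable so the snippet still compiles and runs.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
/* Cortex-M target: access the CONTROL special register directly. */
static inline uint32_t read_control(void)
{
    uint32_t value;
    __asm__ volatile ("mrs %0, control" : "=r" (value));
    return value;
}
static inline void write_control(uint32_t value)
{
    __asm__ volatile ("msr control, %0" : : "r" (value) : "memory");
}
static inline void isb(void)
{
    __asm__ volatile ("isb" : : : "memory");
}
#else
/* Host build (assumption, illustration only): CONTROL is a plain variable. */
static uint32_t simulated_control;
static inline uint32_t read_control(void) { return simulated_control; }
static inline void write_control(uint32_t value) { simulated_control = value; }
static inline void isb(void) { __asm__ volatile ("" : : : "memory"); }
#endif

/* nPRIV: 1 = Thread mode runs unprivileged (on cores that implement it). */
#define CONTROL_NPRIV (1u << 0)

/* Hypothetical helper: drop Thread mode to unprivileged execution. */
static inline void drop_thread_privilege(void)
{
    write_control(read_control() | CONTROL_NPRIV);
    isb(); /* instructions after this point run with the new CONTROL value */
}
```

Without the ISB, instructions already fetched could in principle still execute under the old CONTROL setting; whether that is observable on a given M0/M0+ part is a separate question, which is what the rest of the thread discusses.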

jnk0le · « **Reply #3 on:** February 27, 2025, 06:52:57 pm »

Cortex-M0 implements the ARMv6-M architecture.
And ARMv6-M could be used for something bigger than the M0. ARM didn't go that route (and no one else is allowed to, of course), but the possibility is still there.

There is also the matter of forward compatibility, so that ARMv6-M programs work correctly on ARMv7-M.

Siwastaja · « **Reply #4 on:** March 01, 2025, 07:36:56 pm »

Remember that ISB and DSB instructions have absolutely nothing to do with cache. So if you are working with e.g. an STM32H7 or something else with caches, want to run with caches enabled, and need to flush them for whatever reason (e.g., DMA), DSB won't do that. You will need a separate flush command for that.

incf · « **Reply #5 on:** March 01, 2025, 07:46:50 pm »

Quote from: Siwastaja on March 01, 2025, 07:36:56 pm

Remember that ISB and DSB instructions have absolutely nothing to do with cache. ...

That begs the question... what are they for?

After reading "REF: Application Note 321 ARM Cortex-M Programming Guide to Memory Barrier Instructions" I find myself struggling to imagine a situation on a Cortex M0 (Plus, or no plus) type device where instructions like ISB or DSB do anything except possibly make certain events occur a few clock cycles earlier than they would have.

They dedicated several op-codes to this, but I don't see what is gained by it in this specific case.

hli · « **Reply #6 on:** March 01, 2025, 10:58:19 pm »

ISB prevents instruction re-ordering, and finishes all delayed writes.
It ensures that all writes which are in the program before the ISB are completed when the ISB is finished, and that any instructions after the ISB will get to see the results of these writes. (See https://developer.arm.com/documentation/dui0497/a/the-cortex-m0-processor/memory-model/software-ordering-of-memory-accesses for the actual wording on that, and for some examples of when to use it.)

coppice · « **Reply #7 on:** March 01, 2025, 11:08:07 pm »

Quote from: hli on March 01, 2025, 10:58:19 pm

ISB prevents instruction re-ordering, and finishes all delayed writes.
It ensures that all writes which are in the program before the ISB are completed when the ISB is finished, and that any instructions after the ISB will get to see the results of these writes. (See https://developer.arm.com/documentation/dui0497/a/the-cortex-m0-processor/memory-model/software-ordering-of-memory-accesses for the actual wording on that, and for some examples of when to use it.)

I don't think anyone has an argument about the purpose of the instructions. What is not clear is what, on a processor as simple and basic as an M0, ever gets things out of order? Are these instructions just there for compatibility with more complex processors which might use the same instruction set, or do they really do something on the M0? Although it's a fairly simple core, it does have a pipeline, and that frequently has the ability to spring surprises about ordering.

ejeffrey · « **Reply #8 on:** March 02, 2025, 01:05:11 am »

Even in-order pipeline stages are a cache. And while the M0+ is quite simple and only has two stages (fetch and execute), the bus it is attached to can be quite complex. Without memory barriers there is no way to force memory transactions to different endpoints (such as SRAM vs a peripheral) to act in order.

Siwastaja · « **Reply #9 on:** March 02, 2025, 08:08:13 am »

Quote from: ejeffrey on March 02, 2025, 01:05:11 am

Even in-order pipeline stages are a cache.

No, pipelines are not caches. For example, Wikipedia's definition is quite decent:
"In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster"

Pipelining, buffering and prefetching are not examples of caching. Caching involves fetching the stored value "back". FIFOs are not caches; synchronization is not cache; pipelining so that two operations can run in parallel with different data is not cache. Let's not create confusion by redefining commonly used terms even more broadly than their already-quite-broad meaning.

These barrier instructions are needed because, as others have pointed out, buses can have different latencies end-to-end due to different buffering or different clock domain crossings. A typical use case is ensuring that data has been written to RAM before triggering a DMA transfer accessing that RAM.

Caches have similar considerations with coherency, but they are not the same thing.
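
The DMA use case mentioned above could look roughly like the following. It is a sketch only: the `dma_regs_t` layout, the "bit 0 starts the transfer" convention and the `dma_send` helper are made up for illustration and do not describe any real DMA controller. On a non-Cortex-M build the `DSB()` macro falls back to a compiler barrier so the snippet can be compiled and run on a host.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
#define DSB() __asm__ volatile ("dsb" : : : "memory")
#else
#define DSB() __asm__ volatile ("" : : : "memory") /* host: compiler barrier only */
#endif

/* Hypothetical DMA channel registers; a real part defines its own layout. */
typedef struct {
    volatile uint32_t src;  /* source address */
    volatile uint32_t len;  /* transfer length in bytes */
    volatile uint32_t ctrl; /* bit 0 = start (assumed) */
} dma_regs_t;

static uint8_t tx_buf[16];

/* Copy data into the RAM buffer, then hand the buffer to the DMA channel. */
void dma_send(dma_regs_t *dma, const uint8_t *data, size_t n)
{
    if (n > sizeof tx_buf) {
        n = sizeof tx_buf;
    }
    for (size_t i = 0; i < n; i++) {
        tx_buf[i] = data[i]; /* ordinary RAM writes */
    }
    dma->src = (uint32_t)(uintptr_t)tx_buf;
    dma->len = (uint32_t)n;

    DSB(); /* all earlier writes have completed before the start bit is written */

    dma->ctrl = 1u;
}
```

The barrier sits between the last write the DMA engine depends on and the write that starts the engine, which is the ordering the post describes.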

jnk0le · « **Reply #10 on:** March 02, 2025, 01:05:45 pm »

A prefetch buffer plus a multi-stage pipeline can bite with self-modifying code, even on the M0:

Code:

.balign 4
	adr r0, abc
	strh r1, [r0]
	//isb
abc:
	nop // in decode (M0) or fetch (M0+) stage when strh finishes

Using ISB instead of a sufficient number of spacing instructions makes it portable to any microarchitecture.

coppice · « **Reply #11 on:** March 02, 2025, 06:43:55 pm »

Is the flash in M0 and M0+ parts 32 bits or 64 bits wide? A lot of MCUs that appear 32 bits wide actually have 64 or more bits across the flash words, which they use to increase the apparent speed of the flash. Muxing and demuxing that can lead to some out of orderiness.

Sacodepatatas · « **Reply #12 on:** March 02, 2025, 07:26:42 pm »

Quote from: coppice on March 02, 2025, 06:43:55 pm

Is the flash in M0 and M0+ parts 32 bits or 64 bits wide? A lot of MCUs that appear 32 bits wide actually have 64 or more bits across the flash words, which they use to increase the apparent speed of the flash. Muxing and demuxing that can lead to some out of orderiness.

The STM32G0xx (M0+) has a prefetch which, as you said, reads a whole 64 bits of flash at a time and then executes 2 to 4 instructions until the next read. It also has 16 bytes of cache to hide the flash wait states...

westfw · « **Reply #13 on:** March 03, 2025, 07:08:53 am »

Quote

Is the flash in M0 and M0+ parts 32 bits or 64 bits wide? A lot of MCUs that appear 32 bits wide actually have 64 or more bits across the flash words, which they use to increase the apparent speed of the flash.

I don't think that the flash memory width is considered part of the ARM side of the M0/M0+ definition.
As you say, many implementations have "wide" flash memory, plus other forms of "flash acceleration" that are frequently not very well documented.
Then you have chips like the RP2040 that implement an actual cache in front of flash, even though the M0 architecture doesn't include cache. Cached memory SHOULD be pretty much invisible to the CPU, other than making memory access times rather variable. (Hmm. I suppose it's an interesting question whether the barrier instructions interface properly with the 3rd party caches...)

Jeroen3 · « **Reply #14 on:** March 03, 2025, 12:58:40 pm »

These are really low-level instructions, and they are certainly there for a purpose. It's just that the purpose is advanced functionality.
Most people don't have to worry about them. But there are also chips running multi-core M0+, or even big-little M4/M0.
That is where these barrier instructions come into the picture, for synchronisation purposes and for deterministic initialisation.

Let me cut and paste this from the ARM docs. It's about ARMv8, but v6 should work the same. It should answer the question.

Quote from: https://developer.arm.com/documentation/dui0662/b/Cortex-M0--Peripherals/Memory-Protection-Unit

When do you need a DSB followed by an ISB?

Here are few example scenarios where you need a Data Synchronization Barrier (DSB) followed by an Instruction Synchronization Barrier (ISB).

MPU configuration:

A DSB is used after enabling the MPU to ensure that the subsequent ISB instruction is executed only after the write to the MPU Control register is completed. The ISB instruction is used after the DSB to ensure the processor pipeline is flushed and subsequent instructions are re-fetched with new MPU configuration settings.

Enable or disable the floating-point unit (FPU):

Before using an FPU, you need to program CPACR (Coprocessor Access Control Register) to enable the FPU. The CPACR register lets you enable or disable the FPU. Because writing to the CPACR register affects subsequent floating-point instructions, a DSB instruction is executed to ensure that the write to the CPACR register is completed. The ISB instruction is used after the DSB to ensure that new FPU settings are applied to subsequent floating-point instructions.

Enabling interrupts using NVIC:

If a pended interrupt request needs to be recognized immediately after being enabled in the NVIC, a DSB instruction followed by an ISB instruction is recommended. The DSB instruction ensures that the write to the NVIC enable register is complete, while the ISB instruction ensures that IRQ is executed.

Vector table configuration:

In Cortex-M processors, typically the location of the vector table is determined by the Vector Table Offset Register (VTOR). If you need to change the vector table base address, then a DSB instruction should be used after writing a new value to the VTOR register. This ensures that the write to the VTOR register is complete. An ISB followed by a DSB is required to ensure that any subsequent exceptions and interrupts use the new vector table base address.
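
As a rough illustration of the vector table case in the quote, here is a minimal sketch of a write to VTOR followed by DSB and then ISB. The `set_vector_table` helper and the macro names are hypothetical; `0xE000ED08` is the architectural VTOR address in the System Control Block. The plain Cortex-M0 has no VTOR and on the M0+ it is optional, so this only applies to parts that implement it. On a non-Cortex-M build the register is simulated so the snippet compiles and runs on a host.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
#define SCB_VTOR (*(volatile uint32_t *)0xE000ED08u)
#define DSB() __asm__ volatile ("dsb" : : : "memory")
#define ISB() __asm__ volatile ("isb" : : : "memory")
#else
/* Host build (assumption, illustration only): VTOR is a plain variable. */
static volatile uint32_t simulated_vtor;
#define SCB_VTOR simulated_vtor
#define DSB() __asm__ volatile ("" : : : "memory")
#define ISB() __asm__ volatile ("" : : : "memory")
#endif

/*
 * Hypothetical helper: point the core at a new vector table.
 * 'base' must satisfy the alignment rules of the specific part.
 */
void set_vector_table(uint32_t base)
{
    SCB_VTOR = base;
    DSB(); /* the write to VTOR has completed */
    ISB(); /* later instructions and exceptions see the new setting */
}
```

The same write, DSB, ISB shape applies to the MPU, CPACR and NVIC cases listed in the quote, with the relevant register substituted.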

peter-h · « **Reply #15 on:** March 03, 2025, 08:45:03 pm »

Quote

a DSB instruction should be used after writing a new value to the VTOR register. This ensures that the write to the VTOR register is complete.

This is very interesting stuff, and makes sense, even to me

The Q is why the thousands of designs out there, which don't do any of this, don't just blow up. The answer must be that in practical scenarios you set up e.g. VTOR many many instructions before interrupts get enabled. And probably same with enabling the FPU. STM's own "HAL" code (as generated by e.g. Cube MX) doesn't do DSB either.

incf · « **Reply #16 on:** March 03, 2025, 10:07:44 pm »

Quote from: peter-h on March 03, 2025, 08:45:03 pm

Quote
a DSB instruction should be used after writing a new value to the VTOR register. This ensures that the write to the VTOR register is complete.

This is very interesting stuff, and makes sense, even to me

The Q is why the thousands of designs out there, which don't do any of this, don't just blow up. The answer must be that in practical scenarios you set up e.g. VTOR many many instructions before interrupts get enabled. And probably same with enabling the FPU. STM's own "HAL" code (as generated by e.g. Cube MX) doesn't do DSB either.

I think ARM wraps certain registers with logic that triggers DSB/ISB behavior. Maybe chip manufacturers like ST might do something similar with certain critical peripheral registers?

Hard-earned experience has likely taught them that they retain their customers better when they auto-magically handle those sorts of low level details. (Evidently they do a good job of it.)

Jeroen3 · « **Reply #17 on:** March 04, 2025, 07:13:14 am »

Quote from: peter-h on March 03, 2025, 08:45:03 pm

Quote
a DSB instruction should be used after writing a new value to the VTOR register. This ensures that the write to the VTOR register is complete.

This is very interesting stuff, and makes sense, even to me

The Q is why the thousands of designs out there, which don't do any of this, don't just blow up. The answer must be that in practical scenarios you set up e.g. VTOR many many instructions before interrupts get enabled. And probably same with enabling the FPU. STM's own "HAL" code (as generated by e.g. Cube MX) doesn't do DSB either.

These instructions guarantee certain things. Often you have enough fluff around your code that the correct behaviour gets implied.

If you look at where CMSIS uses them, they aren't used that much in the cm0plus, but for example:

Code:

/**
  \brief   Disable Interrupt
  \details Disables a device specific interrupt in the NVIC interrupt controller.
  \param [in]      IRQn  Device specific interrupt number.
  \note    IRQn must not be negative.
 */
__STATIC_INLINE void __NVIC_DisableIRQ(IRQn_Type IRQn)
{
  if ((int32_t)(IRQn) >= 0)
  {
    NVIC->ICER[0U] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL));
    __DSB();
    __ISB();
  }
}

For example, using the above function without barriers would probably work fine if none of the next instructions do anything with the NVIC or with the peripheral sending the IRQ.

Race conditions, these instructions prevent them.

Quote from: incf on March 03, 2025, 10:07:44 pm

...
I think ARM wraps certain registers with logic that triggers DSB/ISB behavior. Maybe chip manufacturers like ST might do something similar with certain critical peripheral registers?

I don't think they do. Some registers are just slower because they are on a slower clock, which might produce some of this effect.
I only had to use a DMB barrier once, to ensure writes to the IRQ flag register had finished by exception return. On the F103 it sometimes appeared that they hadn't.

peter-h · « **Reply #18 on:** March 04, 2025, 09:11:20 am »

Yeah, you are not wrong, and this is 32F417:

Code:



/**
  \brief   Disable Interrupt
  \details Disables a device specific interrupt in the NVIC interrupt controller.
  \param [in]      IRQn  Device specific interrupt number.
  \note    IRQn must not be negative.
 */
__STATIC_INLINE void __NVIC_DisableIRQ(IRQn_Type IRQn)
{
  if ((int32_t)(IRQn) >= 0)
  {
    NVIC->ICER[0U] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL));
    __DSB();
    __ISB();
  }
}

Interestingly that function is never called

And DSB is not found in any other NVIC code.

langwadt · « **Reply #19 on:** March 04, 2025, 02:11:40 pm »

Quote from: Jeroen3 on March 04, 2025, 07:13:14 am

These instructions guarantee certain things. Often you have enough fluff around your code that the correct behaviour gets implied.

If you look at where CMSIS uses them, they aren't used that much in the cm0plus, but for example:

Code:

/**
  \brief   Disable Interrupt
  \details Disables a device specific interrupt in the NVIC interrupt controller.
  \param [in]      IRQn  Device specific interrupt number.
  \note    IRQn must not be negative.
 */
__STATIC_INLINE void __NVIC_DisableIRQ(IRQn_Type IRQn)
{
  if ((int32_t)(IRQn) >= 0)
  {
    NVIC->ICER[0U] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL));
    __DSB();
    __ISB();
  }
}

For example, using the above function without barriers would probably work fine if none of the next instructions do anything with the NVIC or with the peripheral sending the IRQ.

Race conditions, these instructions prevent them.

seems so, https://github.com/ARM-software/CMSIS_5/issues/110

Jeroen3 · « **Reply #20 on:** March 04, 2025, 02:17:23 pm »

Hah, nice find.

peter-h · « **Reply #21 on:** March 04, 2025, 03:15:19 pm »

Well, there is a lot of this in arm32 e.g. the WFI instruction
https://www.eevblog.com/forum/microcontrollers/how-fast-does-st-32f417-enter-standby-mode/msg4056850/#msg4056850
where you have to put it into a loop, so if there is a delay in it, it always happens, just a tiny bit later.

Very few coders will do a DI() or similar and immediately expect it to be effective. There will always be pending interrupts... It's just a messy topic. But if writing in C, there will nearly always be a ton of other machine code after DSB/ISB etc.

This stuff is so deep that if it was a common issue, few products would ever get finished

Another thing is that the STM libs are done for other chips like the H7 and same code is used. You also get this issue in the infamous ETH low level code (LWIP to ETH glue) where DSB is needed on the H7 but not the F4. I spent many hours digging around this stuff.

newbrain · « **Reply #22 on:** March 06, 2025, 11:50:05 am »

Quote from: peter-h on March 04, 2025, 03:15:19 pm

DSB is needed on the H7 but not the F4.

Typical example: clearing interrupt flags in an ISR and then returning.
If __DSB() is not used, the IRQ will still be pending at return time and the ISR will immediately be called again.

Simplified actual code of my I2C driver:

Code:

void I2C1_EV_IRQHandler(void)
{
    FsmI2c state = tr->state;
    /* Check what happened! */
    if (i2c->ISR & I2C_ISR_TXIS)
    {
        switch (state)
        {
        ...
        case TxI2CaddrR1: /* Write Address and go to R0 */
            --state;      /* R2 -> R1 -> R0 or W2 -> W1 -> W0 == TxData */
            i2c->TXDR = tr->subArr[--tr->subCount];
            break;
        ...
        }
    }
    else if (i2c->ISR & (I2C_ISR_TC | I2C_ISR_STOPF))
    {
        ...
    }
    else if (i2c->ISR & I2C_ISR_RXNE)
    {
        switch (state)
        {
        case RxData: /* Do not expect any other state! */
            *tr->dataPtr++ = i2c->RXDR;
            break;
        ....
        }
    }
    else
    {
        /* All other error situations, including NACK */
        state = Error;
    }
    tr->state = state;

    /* Everything is complete (state == Idle) or there's an error, wrap up transaction */
    if (state <= Error)
    {
        ...
    }

    /* Avoid retriggering IRQs by forcing data transfers to be complete. */
    /* This makes sure the IRQ flags are updated before exiting the ISR */
    __DSB();
}

Here, flags in I²C's ISR register are cleared automatically by the reading or writing of the data registers, but the change has not yet taken place when the return is executed.
Note that caching has nothing to do with this, as peripheral registers reside in memory of Device type which is never cached.
Without __DSB(), the whole ISR was called again.

peter-h · « **Reply #23 on:** March 06, 2025, 12:09:28 pm »

So how come ISRs work on the 32F4?

Is it because the clearing of the IRQ happens far enough back from the ISR exit, as in here

Code:


void serial_transmit_irq_handler(UART_HandleTypeDef *huart, uint8_t port)
{

	port--;	// offset from 0

	// Is there anything left to send?
	if (serial_transmit_buffer_tx_index[port] == serial_transmit_buffer_put_index[port])
	{
		// Disable the TXEmpty interrupt
		huart->Instance->CR1 &= ~(UART_IT_TXE);
	}
	else
	{
		// Send the next byte
		huart->Instance->DR = serial_transmit_buffer[port][serial_transmit_buffer_tx_index[port]++];
		if( serial_transmit_buffer_tx_index[port] >= SERIAL_TX_BUFFER_SIZE)
		{
			serial_transmit_buffer_tx_index[port] = 0;
		}
		g_comms_act[port+1]=LED_COM_TC;	// indicate data flow on LED 1-4
	}

}

but then how does this one work

Code:


// This ISR turns off the 485 driver on p3/p4

void serial_transmit_complete_irq_handler(UART_HandleTypeDef *huart, uint8_t port)
{

	if ( (port == 3) || (port == 4) )
	{
		// If this is an RS485 port and auto driver mode is set, disable the driver now
		if (serial_driver_mode[port-1] == 2)
		{
			serial_485_driver_control_internal(port, 0);
		}
	}

	// Transmit complete
	serial_transmit_active[port-1] = false;

	// Clear and disable the TC interrupt. This is supposed to be done by a read of SR and a write to DR
	// but in this case we have nothing to write to DR so we clear the TC bit by writing a 0 to it but
	// avoiding RMW on SR - see http://efton.sk/STM32/gotcha/g9.html
	huart->Instance->SR = ~(UART_FLAG_TC);
	huart->Instance->CR1 &= ~(UART_IT_TC);

}

Sauli · « **Reply #24 on:** March 06, 2025, 01:07:50 pm »

Quote

but then how does this one work

Maybe because you are executing at least a function epilogue before returning from the interrupt?

