Author Topic: Interrupt latency benchmarking on CH32V003 w/ and w/o hardware-stacking (HPE)  (Read 4422 times)


Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
The WCH CH32V003 microcontroller with its QingKeV2 RISC-V CPU core has a slightly different interrupt controller (named Programmable Fast Interrupt Controller, or PFIC) than the higher-spec CH32V30x series with the QingKeV4 CPU core, specifically related to how the Hardware Prologue/Epilogue (HPE) feature works.

The HPE hardware stacking feature, when enabled, automatically saves a certain set of CPU registers when an interrupt fires, and restores them again when the ISR returns. This way, code within the ISR does not have to perform this work, thus enhancing performance by supposedly decreasing interrupt latency (i.e. the time between the event that triggers the interrupt and the ISR being able to perform whatever actions in response).

On the CH32V30x, the HPE feature saves registers in a single cycle to a private internal hardware stack area which supports 3 levels of depth. However, on the CH32V003, HPE saves registers to the general stack area in RAM (and with a maximum depth of 2).

But in another recent thread, the discussion turned to whether, given that the CH32V003 just saves to the in-RAM stack like normal, there is actually any interrupt latency advantage at all.

So, I decided to benchmark it, following roughly the same testing methodology as newbrain in a thread about the CH32V307.

The code I wrote to perform the benchmark goes as follows:

- PC1 and PC2 are configured as outputs, and PC3 as an input.
- PC1 is connected externally to PC3.
- An EXTI interrupt is configured for PC3 with a rising-edge trigger.
- In the main loop, PC1 is set high.
- In the EXTI ISR, PC2 is set high.
- The ISR also makes a call to a sub-function, in order to create a worst-case scenario where all registers have to be saved.

Using an oscilloscope, I measured the period between the rising edges of the two output signals (PC1 in main loop to PC2 in ISR).

For the test case with HPE disabled, the ISR was marked with __attribute__((interrupt)). For the case with HPE enabled, the ISR uses __attribute__((interrupt("WCH-Interrupt-fast"))). The without-HPE case was compiled using mainline GCC 12.2 and -mabi=ilp32e -march=rv32ec_zicsr -mcmodel=medany -misa-spec=2.2 options. The with-HPE case was compiled using WCH's GCC 8.2 fork and -mabi=ilp32e -mcmodel=medany -march=rv32ecxw options. Both were also compiled with -Os.

The chip was initialised to run at the default 24 MHz using the HSI oscillator.

Here is the ISR disassembly for the case without HPE:

Code: [Select]
0000021a <EXTI7_0_IRQHandler>:
 21a: fd810113          addi sp,sp,-40
 21e: c23a                sw a4,4(sp)
 220: c03e                sw a5,0(sp)
 222: d206                sw ra,36(sp)
 224: d016                sw t0,32(sp)
 226: ce1a                sw t1,28(sp)
 228: cc1e                sw t2,24(sp)
 22a: ca2a                sw a0,20(sp)
 22c: c82e                sw a1,16(sp)
 22e: c632                sw a2,12(sp)
 230: c436                sw a3,8(sp)
 232: 400117b7          lui a5,0x40011
 236: 4711                li a4,4
 238: cb98                sw a4,16(a5)
 23a: 3f31                jal 156 <foo>
 23c: 400107b7          lui a5,0x40010
 240: 40078793          addi a5,a5,1024 # 40010400 <__global_pointer$+0x2000fc00>
 244: 577d                li a4,-1
 246: cbd8                sw a4,20(a5)
 248: 5092                lw ra,36(sp)
 24a: 5282                lw t0,32(sp)
 24c: 4372                lw t1,28(sp)
 24e: 43e2                lw t2,24(sp)
 250: 4552                lw a0,20(sp)
 252: 45c2                lw a1,16(sp)
 254: 4632                lw a2,12(sp)
 256: 46a2                lw a3,8(sp)
 258: 4712                lw a4,4(sp)
 25a: 4782                lw a5,0(sp)
 25c: 02810113          addi sp,sp,40
 260: 30200073          mret

And the ISR disassembly for the case with HPE:

Code: [Select]
0000021a <EXTI7_0_IRQHandler>:
 21a: 400117b7          lui a5,0x40011
 21e: 4711                li a4,4
 220: cb98                sw a4,16(a5)
 222: 3f15                jal 156 <foo>
 224: 400107b7          lui a5,0x40010
 228: 577d                li a4,-1
 22a: 40e7aa23          sw a4,1044(a5) # 40010414 <__global_pointer$+0x2000fc14>
 22e: 40078793          addi a5,a5,1024
 232: 30200073          mret

The results are as follows:

Without HPE: 1.45 us
With HPE: 0.87 us

So, using HPE results in an interrupt latency that is 0.58 us shorter!

The supposition was that because the CH32V003 simply does in hardware what software would be doing anyway, it would be little-to-no faster. But here we see that it is over half a microsecond faster. This must mean that there is definitely something special going on when it is saving to the stack with hardware.
« Last Edit: March 31, 2023, 06:38:25 am by HwAoRrDk »
 
The following users thanked this post: thm_w, rhodges

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
The results are as follows:

Without HPE: 1.45 us
With HPE: 0.87 us

So, using HPE results in an interrupt latency that is 0.58 us shorter!

The supposition was that because the CH32V003 simply does in hardware what software would be doing anyway, it would be little-to-no faster. But here we see that it is over half a microsecond faster. This must mean that there is definitely something special going on when it is saving to the stack with hardware.

So that's 14 clock cycles.

Your non-HPE code is executing 22 instructions more than the HPE code: 10 stores, 10 loads, and 2 ADDIs to SP. And it is 48 bytes of code, as the loads and stores are 2-byte instructions, but the ADDIs exceed the range for 2-byte instructions and each uses 4 bytes of code.

Assuming that code is always fetched at 4 bytes per cycle, regardless of whether it contains one 4-byte instruction or two 2-byte instructions, that 48 bytes of code accounts for 12 clock cycles of the 14 cycles difference.

So there is nothing very magical in the hardware stacking. It can load/store one register per cycle, while the software stacking needs 3 cycles per 2 registers due to the instruction fetches (there is no icache).


BUT ... there is a flaw in your test.

Part of the point of using software stacking is that you only have to save and restore the registers you actually use.

Your code uses only a4 and a5, so those (plus ra) are all that actually need to be saved and restored. But the code is saving a0-a5 plus t0-t2.  Everything!

This is because you call the standard C function foo().  And so everything has to be saved, in case foo() nukes it.

I suggest you declare foo() also as __attribute__((interrupt)) and measure again. Oh, wait, crud, you can't just do that because it will end with mret not ret. We need something that is like __attribute__((interrupt)) but doesn't do that.

If the function is simple (doesn't create a stack frame) then jam an asm volatile ("ret") into it as well.
« Last Edit: March 31, 2023, 09:56:15 am by brucehoult »
 

Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
BUT ... there is a flaw in your test.

Part of the point of using software stacking is that you only have to save and restore the registers you actually use.

Your code uses only a4 and a5, so those (plus ra) are all that actually need to be saved and restored. But the code is saving a0-a5 plus t0-t2.  Everything!

No flaw, the test intentionally saves all registers. :) I did say it was supposed to be a worst-case scenario. I wanted to make it an apples-to-apples comparison - that each case was doing the same work. The docs say HPE saves 10 registers, so make the other case do the same.

I might do some more testing tomorrow with a best-case where the software-only version saves only a couple of registers (i.e. eliminate the function call). I'm sure the difference will be negligible, or even in favour of non-HPE.
« Last Edit: March 31, 2023, 10:37:20 am by HwAoRrDk »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
BUT ... there is a flaw in your test.

Part of the point of using software stacking is that you only have to save and restore the registers you actually use.

Your code uses only a4 and a5, so those (plus ra) are all that actually need to be saved and restored. But the code is saving a0-a5 plus t0-t2.  Everything!

No flaw, the test intentionally saves all registers. :) I did say it was supposed to be a worst-case scenario. I wanted to make it an apples-to-apples comparison - that each case was doing the same work. The docs say HPE saves 10 registers, so make the other case do the same.

OK, fair enough.

And you got 0.58 µs difference at 24 MHz, and uliano got 0.44 µs difference at 48 MHz.

Prediction:

Code: [Select]
EXTI7_0_IRQHandler:
    addi sp,sp,-16
    sw a4,12(sp)
    sw a5,8(sp)
    lui a5,0x40011
    li a4,4
    sw a4,16(a5)
    lui a5,0x40010
    li a4,-1
    sw a4,1044(a5)
    lw a4,12(sp)
    lw a5,8(sp)
    addi sp,sp,16
    mret

0.6 µs at 24 MHz.

If you compile your main program with -ffixed-a4 -ffixed-a5 (which is of course likely to make main program code a little bigger and slower) then you can cut the interrupt routine down to (NOT using WCH fast interrupt mode)

Code: [Select]
EXTI7_0_IRQHandler:
    lui a5,0x40011
    li a4,4
    sw a4,16(a5)
    lui a5,0x40010
    li a4,-1
    sw a4,1044(a5)
    mret

This is a little bit expensive on RV32E with just 16 registers, but on RV32I with 32 registers you can easily partition off a few registers for use only by interrupt code with almost no effect on performance of the main program.
« Last Edit: April 01, 2023, 12:15:50 am by brucehoult »
 

Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Prediction:

0.6 µs at 24 MHz.

Close, but no cigar. :) It came out to 0.746 us with no sub-function call.

The ISR assembly was as follows:

Code: [Select]
0000020a <EXTI7_0_IRQHandler>:
 20a: 1161                addi sp,sp,-8
 20c: c23a                sw a4,4(sp)
 20e: c03e                sw a5,0(sp)
 210: 4711                li a4,4
 212: 400117b7          lui a5,0x40011
 216: cb98                sw a4,16(a5)
 218: 400107b7          lui a5,0x40010
 21c: 40078793          addi a5,a5,1024 # 40010400 <__global_pointer$+0x2000fc00>
 220: 577d                li a4,-1
 222: cbd8                sw a4,20(a5)
 224: 4712                lw a4,4(sp)
 226: 4782                lw a5,0(sp)
 228: 0121                addi sp,sp,8
 22a: 30200073          mret

So faster than with HPE, but not by as much as you predicted - only 124 ns. I see your predicted code has one fewer instruction (no addi when constructing the 2nd register's base address), but that would only account for 0.042 us less at 24 MHz.

I estimate the cross-over point, where using HPE gains you shorter ISR latency, to be 4 or more registers being saved/restored.

The take-away here seems to be that it is indeed faster not to use HPE in some scenarios - namely where the code doesn't call any other functions and is short and simple enough not to use more than 3 registers. But in other scenarios, HPE brings some benefit to interrupt latency.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
I estimate the cross-over point, where using HPE gains you shorter ISR latency, to be 4 or more registers being saved/restored.

Sounds reasonable. The crossover will be even lower at 48 MHz -- you might not even get any gain from avoiding HPE at 2 registers, and it is hard to imagine an ISR that doesn't need at least 2.

Thanks for running the tests!

I've got the 003 kit, but it's sitting unopened at the moment. What I've been playing with today is at the other end of the RISC-V spectrum: an SG2042 machine with 64 2.0 GHz OoO THead C910 cores, each with dual vector pipelines with 256 bit ALUs. Each core is maybe 20% faster than those in a Pi 4 (mostly because of the extra MHz), but it has 16x as many cores :-)
 

Offline jnk0le

  • Contributor
  • Posts: 41
  • Country: pl
BTW, with XTightlyCoupledIO this entire test sample would be just two `tio.addi` instructions (and the ISR return). No registers used.

https://github.com/jnk0le/XTightlyCoupledIO
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
What I've been playing with today is at the other end of the RISC-V spectrum: an SG2042 machine with 64 2.0 GHz OoO THead C910 cores, each with dual vector pipelines with 256 bit ALUs.

 :o
 

Offline langwadt

  • Super Contributor
  • ***
  • Posts: 4427
  • Country: dk
So there is nothing very magical in the hardware stacking. It can load/store one register per cycle, while the software stacking needs 3 cycles per 2 registers due to the instruction fetches (there is no icache).

I don't know much about RISC-V, but depending on the hardware and memory, on some architectures the stacking, fetching the vector address, and fetching the first ISR instructions can be done in parallel; a Cortex-M4 does that.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
So there is nothing very magical in the hardware stacking. It can load/store one register per cycle, while the software stacking needs 3 cycles per 2 registers due to the instruction fetches (there is no icache).

I don't know much about RISC-V, but depending on the hardware and memory, on some architectures the stacking, fetching the vector address, and fetching the first ISR instructions can be done in parallel; a Cortex-M4 does that.

This is not a question of RISC-V, but of the size and complexity of the CPU core. Cortex-M4 is a couple of steps up from this core which is more on a level with M0.

I'd assume that this very low end core can't do two memory operations at the same time e.g. write a register to RAM at the same time as reading an instruction from flash.

The most basic RISC-V interrupt handling mode simply switches execution mode to M (but this chip only has M anyway), disables interrupts, saves the old execution mode and interrupt enable status, saves the PC to MEPC, puts the exception/interrupt number and associated data into special CSRs, and copies MTVEC to PC. That's all in 1 cycle. The CPU then simply does the same as for any branch instruction, fetching and executing the instruction at the new PC, which on a typical low-end core has 1 instruction fetch wait cycle.

If you use the original vectored mode then PC is MTVEC + 4*cause, it takes typically 2 cycles to fetch and execute the jump instruction in the handler table, and 2 cycles more to execute the first instruction in the actual handler. This imposes a minimum of complexity and circuitry on the core (no state machine is needed other than the usual instruction fetch mechanism).

A more advanced vectored interrupt mode stores the address of the handler in the interrupt vector table. This is a couple of cycles faster and allows the interrupt handler to be anywhere in the address space, but it adds complexity to the core for not all that much gain.
 

Offline langwadt

  • Super Contributor
  • ***
  • Posts: 4427
  • Country: dk
So there is nothing very magical in the hardware stacking. It can load/store one register per cycle, while the software stacking needs 3 cycles per 2 registers due to the instruction fetches (there is no icache).

I don't know much about RISC-V, but depending on the hardware and memory, on some architectures the stacking, fetching the vector address, and fetching the first ISR instructions can be done in parallel; a Cortex-M4 does that.

This is not a question of RISC-V, but of the size and complexity of the CPU core. Cortex-M4 is a couple of steps up from this core which is more on a level with M0.

I'd assume that this very low end core can't do two memory operations at the same time e.g. write a register to RAM at the same time as reading an instruction from flash.

Yeh, it seems like for ARM only M3 and up are Harvard.
 

Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
A more advanced vectored interrupt mode stores the address of the handler in the interrupt vector table. This is a couple of cycles faster and allows the interrupt handler to be anywhere in the address space, but it adds complexity to the core for not all that much gain.

It would appear that the latter is what the WCH-supplied startup assembly code configures the CH32V003 to do. At least, that's my interpretation of the assembly. It stores the address of a default dummy interrupt handler in the vector table, and configures the MODE0 & MODE1 bits of mtvec to both be 1. All the default interrupt handler labels are declared 'weak' so they can of course be overridden in user code.

Code: [Select]
_start:
.option   norvc;
    j       handle_reset
    .word   0
    .word   NMI_Handler                  /* NMI Handler */
    .word   HardFault_Handler            /* Hard Fault Handler */
    /* ...etc... */
    .word   TIM2_IRQHandler            /* TIM2 */

NMI_Handler:              1: j 1b
HardFault_Handler:        1: j 1b
/* ...etc... */
TIM2_IRQHandler:          1: j 1b

    la t0, _start
    ori t0, t0, 3
    csrw mtvec, t0

This reminds me - I wonder if anyone can explain this: I have noticed a weird thing when looking at the disassembly of compiled code for the default dummy interrupt handlers. They are just a simple infinite loop, comprised of a single jump instruction back to themselves. But, for some of the default handlers, the compiler emits multiple jump instructions, despite there only being a single one in the startup assembly code.

For example:

Code: [Select]
000002ea <DMA1_Channel7_IRQHandler>:
     2ea: a001                j 2ea <DMA1_Channel7_IRQHandler>

000002ec <ADC1_IRQHandler>:
     2ec: a001                j 2ec <ADC1_IRQHandler>
     2ee: a001                j 2ee <ADC1_IRQHandler+0x2>
     2f0: a001                j 2f0 <ADC1_IRQHandler+0x4>
     2f2: a001                j 2f2 <ADC1_IRQHandler+0x6>

000002f4 <SPI1_IRQHandler>:
     2f4: a001                j 2f4 <SPI1_IRQHandler>

Obviously, for ADC1_IRQHandler here, anything following the first instruction is redundant and pointless. What I think is going on is that these additional instructions are 'leftovers' from the dummy default handlers for other interrupts (I2C and UART) which normally sit between the ADC and SPI handlers. In this instance, I have my own ISRs defined for I2C1_EV_IRQHandler, I2C1_ER_IRQHandler, and USART1_IRQHandler, which correlate with the 3 excess instructions.

Why can't the compiler/linker get rid of this redundant code? Is it because it doesn't realise that, once the 'weak' reference is overridden, these instructions don't actually belong to the nearest preceding label? Is there any way we can tidy this up?

As far as I understand, the way the linker omits unused code is via the -ffunction-sections option, so could something be done to put each default dummy interrupt handler into its own section, so that the linker knows to omit ones that have been overridden?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
-ffunction-sections is an option for the C compiler, to make a different section for each function.

In assembly language you make your own sections regardless, so if the source doesn't do that then it's all one big section for the default handlers.

The linker can't just go mucking about deleting stuff from the INSIDE of a section.
 

Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
-ffunction-sections is an option for the C compiler, to make a different section for each function.

Silly me, I mis-remembered - I was thinking of --gc-sections.

So, could it work if one were to put each default interrupt handler in its own section?

I note that they are all currently put in a section called .text.vector_handler, but what if one were to do the following?

Code: [Select]
.section .text.vector_handler.NMI_Handler
NMI_Handler:              1: j 1b
.section .text.vector_handler.HardFault_Handler
HardFault_Handler:        1: j 1b
/* ...etc... */
.section .text.vector_handler.TIM2_IRQHandler
TIM2_IRQHandler:          1: j 1b

Then the linker would be able to (with --gc-sections specified) recognise any overridden default interrupt handlers and exclude them from the compiled binary, right? Thus saving space by omitting redundant code.

BTW, I can't quite figure out the flags and type argument on the default section directive for the 'vector_handler' section: .section  .text.vector_handler, "ax", @progbits. What is an 'allocatable' section (the 'a' flag), and why does it need to be marked '@progbits' (meaning the section contains data)?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
That looks like it should work. You might want to check the linker script to see if it will gather the (non-GC'd) handler sections together, if you care. On a CPU without cache it's probably not important.
 

Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
The principle seems to work. :-+ Or, at least, it compiles and links apparently fine. I would need to actually test running the resultant code on physical hardware.

If I list each default dummy interrupt handler as follows (using the same section name as a user-implemented ISR function would get):

Code: [Select]
.section .text.NMI_Handler, "ax", @progbits
.weak NMI_Handler
NMI_Handler:              1: j 1b

Then there is no redundant code emitted for handlers that are overridden.

For instance, if I have a user-implemented handler for EXTI7_0_IRQHandler, whose default dummy handler normally resides between the RCC and AWU handlers, I get the following:

Code: [Select]
000000ae <RCC_IRQHandler>:
  ae: a001                j ae <RCC_IRQHandler>

000000b0 <AWU_IRQHandler>:
  b0: a001                j b0 <AWU_IRQHandler>

No extra redundant code at the tail end of RCC_IRQHandler. :) And the ISR from my code appears further on as expected:

Code: [Select]
00000208 <EXTI7_0_IRQHandler>:
 208: 1161                addi sp,sp,-8
 20a: c23a                sw a4,4(sp)
 20c: c03e                sw a5,0(sp)
 20e: 4711                li a4,4
 210: 400117b7          lui a5,0x40011
[...etc...]

All of this is of course nitpicking with regard to wasted space, as you'd probably be saving at most a few dozen bytes of space, but when you only have 16 kB to play with, it could make all the difference. ;D
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
If you want to hunt down every byte then having a different infinite loop for each default handler is a waste. Unless you actually do get an unexpected trap and connect with a debugger to see exactly which infinite loop you are in.

Also, I seem to remember something zeroing all the registers on startup. There shouldn't be any code relying on that, so it could go.
 
The following users thanked this post: SiliconWizard

Offline langwadt

  • Super Contributor
  • ***
  • Posts: 4427
  • Country: dk
If you want to hunt down every byte then having a different infinite loop for each default handler is a waste. Unless you actually do get an unexpected trap and connect with a debugger to see exactly which infinite loop you are in.

the STM Cube startup code uses .thumb_set to alias all the different handler names to a single default_handler
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
If you want to hunt down every byte then having a different infinite loop for each default handler is a waste. Unless you actually do get an unexpected trap and connect with a debugger to see exactly which infinite loop you are in.

the STM Cube startup code uses .thumb_set to alias all the different handler names to a single default_handler

It's not a problem to put all the default handler labels one after another with a single j . following them instead of one each. It just loses information in the event one of them gets called unexpectedly. Not totally, as you can check the mcause CSR in the debugger if need be.
 

Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Also, I seem to remember something zeroing all the registers on startup. There shouldn't be any code relying on that, so it could go.

No, their startup code doesn't have any register zeroing. But I have seen that in other startup code (I think it was some guy's recent article on using Rust on CH32V); just a long train of mv xN, zero instructions for x1 through x15.

Are RISC-V CPU registers not guaranteed to be zero at reset then? But yes, I would imagine it is a rare case that would have problems without that.

It's not a problem to put all the default handler labels one after another with a single j . following them instead of one each. It just loses information in the event one of them gets called unexpectedly. Not totally, as you can check the mcause CSR in the debugger if need be.

That's neat. If it's possible to determine which interrupt/exception it was with mcause, one wonders why they bothered wasting space with separate default handlers at all.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Are RISC-V CPU registers not guaranteed to be zero at reset then? But yes, I would imagine it is a rare case that would have problems without that.

I don't think there is anything in the ISA spec saying anything about initial values. Individual core designers might well zero on reset -- maybe even most of them. I think it would be bad coding practice for code to depend on either the hardware or the start code doing it.

Quote
That's neat. If it's possible to determine which interrupt/exception it was with mcause, one wonders why they bothered wasting space with separate default handlers at all.

Well, sure it's possible. Many setups use a single handler for all exceptions, put register save/restore in a single location, implement features such as late preemption by a higher priority interrupt coming in during register stacking for lower priority interrupt, interrupt chaining etc. Then at some point you do have to figure out what the interrupt was and dispatch to the correct handling for it.
 

Offline HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Many setups use a single handler for all exceptions, put register save/restore in a single location, implement features such as late preemption by a higher priority interrupt coming in during register stacking for lower priority interrupt, interrupt chaining etc. Then at some point you do have to figure out what the interrupt was and dispatch to the correct handling for it.

I guess if you wanted to do that you would do so with the mtvec mode setting where it will always just jump to the specified base address on interrupt, rather than base+4*N.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Are RISC-V CPU registers not guaranteed to be zero at reset then? But yes, I would imagine it is a rare case that would have problems without that.

I don't think there is anything in the ISA spec saying anything about initial values. Individual core designers might well zero on reset -- maybe even most of them. I think it would be bad coding practice for code to depend on either the hardware or the start code doing it.

There isn't as far as I remember. And I've rarely seen any piece of code relying on that on any CPU. That would sound odd.
Of course we're talking about general-purpose registers here.
The PC and a few other dedicated registers are of course initialized to a known value upon reset, not necessarily zero either.

Quote
That's neat. If it's possible to determine which interrupt/exception it was with mcause, one wonders why they bothered wasting space with separate default handlers at all.

Well, sure it's possible. Many setups use a single handler for all exceptions, put register save/restore in a single location, implement features such as late preemption by a higher priority interrupt coming in during register stacking for lower priority interrupt, interrupt chaining etc. Then at some point you do have to figure out what the interrupt was and dispatch to the correct handling for it.

There is a vectored mode in the RISC-V ISA.

Of course, many actual implementations add their own interrupt controller (which is one of the reasons they often need a patched compiler), which can make things more efficient than the base exception/interrupt modes.

My own rendition of an interrupt controller directly sets the PC to the corresponding handler address, stored in a dedicated table only accessible by the interrupt controller. It's faster and more secure. But there are plenty of other approaches out there.

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Quote
That's neat. If it's possible to determine which interrupt/exception it was with mcause, one wonders why they bothered wasting space with separate default handlers at all.

Well, sure it's possible. Many setups use a single handler for all exceptions, put register save/restore in a single location, implement features such as late preemption by a higher priority interrupt coming in during register stacking for lower priority interrupt, interrupt chaining etc. Then at some point you do have to figure out what the interrupt was and dispatch to the correct handling for it.

There is a vectored mode in the RISC-V ISA.

I'm not sure what this adds to the discussion?

The ENTIRE THREAD has been about a RISC-V vectored interrupt mode.

The message you replied to is pointing out that there is ALSO a NON-vectored mode, which is used by many systems, especially those running Linux or similar on >1 GHz CPUs, where you don't care about shaving nanoseconds off the interrupt response times and you usually have some simple 32 or 64 bit auxiliary CPU core to handle real-time tasks.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Quote
That's neat. If it's possible to determine which interrupt/exception it was with mcause, one wonders why they bothered wasting space with separate default handlers at all.

Well, sure it's possible. Many setups use a single handler for all exceptions, put register save/restore in a single location, implement features such as late preemption by a higher priority interrupt coming in during register stacking for lower priority interrupt, interrupt chaining etc. Then at some point you do have to figure out what the interrupt was and dispatch to the correct handling for it.

There is a vectored mode in the RISC-V ISA.

I'm not sure what this adds to the discussion?

Possibly nothing? ;D

But if that helps, I think my point may have not been very clear.

There is a vectored mode in the RISC-V ISA, *but* you still need to read the xcause register to figure out if this is an interrupt or an exception.
And you use exception codes >= 16 (at least if I understand the specs correctly) for your custom interrupt sources if you want vectored interrupt handlers - otherwise the basic approach is to use the external interrupt code and then read the interrupt source from some register in your interrupt controller, which is not the most efficient.

But vendors do implement interrupt controllers in various ways, more or less deviating from the base exception mechanism.

I remember a talk from Krste Asanovic himself about fast interrupt handling, if that can be of any interest to anyone. Can be found easily on YT.

The message you replied to is pointing out that there is ALSO a NON-vectored mode, which is used by many systems, especially those running Linux or similar on >1 GHz CPUs, where you don't care about shaving nanoseconds off the interrupt response times and you usually have some simple 32 or 64 bit auxiliary CPU core to handle real-time tasks.

Sure, but the part I was actually replying to didn't make it obvious you were talking about very fast systems when the whole topic was about MCUs.
And on a MCU running at a few tens of MHz, having to check the interrupt flag in xcause, then test against the exception code and execute code based on that when you have a number of possible interrupt sources, that sure wouldn't be insignificant.

 

Offline bson

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
This is because you call the standard C function foo().  And so everything has to be saved, in case foo() nukes it.
Does the RV ABI used mandate this?  Normally in other ABIs a called function saves any registers it modifies, so he should be fine with a minimal saved set.

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
This is because you call the standard C function foo().  And so everything has to be saved, in case foo() nukes it.
Does the RV ABI used mandate this?

Absolutely!

Any function following the standard RISC-V ABI is free to nuke ra, a0-a7, and t0-t6.

Quote
Normally in other ABIs a called function saves any registers it modifies, so he should be fine with a minimal saved set.

That is not correct:

32 bit Arm: functions are free to nuke r0-r3, r12, r14 (LR)

64 bit Arm: functions are free to nuke x0-x17, x18 (PR) if not reserved by the platform

64 bit x86: functions are free to nuke rax, rcx, rdx, rsi, rdi, r8-r11 (SysV e.g. Linux/Mac)

AVR: functions are free to nuke r0, r18-r27, r30-r31

PowerPC: functions are free to nuke r0, r3-r12

MSP430: functions are free to nuke r11-r15


Perhaps you were thinking of ancient (and inefficient) ABIs that pass arguments and function results on the stack, such as VAX, M68000, and (most) 32 bit x86 calling conventions?

I don't think any ABI designed since about 1985 requires called functions to preserve all registers.
 
The following users thanked this post: paf, willmore

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Strange that they save registers for the interrupts in the stack in such a modern design. There are better ways. You can use shadow register sets and such. Even PIC16s are doing this (and have been doing this for 10 years or so), so the interrupt latency is 500 ns (for 8 MHz CPU cycle), faster than what you measured for HPE.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Strange that they save registers for the interrupts in the stack in such a modern design. There are better ways. You can use shadow register sets and such. Even PIC16s are doing this (and have been doing this for 10 years or so), so the interrupt latency is 500 ns (for 8 MHz CPU cycle), faster than what you measured for HPE.

Shadow registers are a whole rabbit hole with - in general - more critics these days than supporters.
Regarding RISC-V, keep in mind the base ISA is meant to be as barebones as possible while allowing extending it easily.

Nothing would prevent a RISC-V extension from including shadow registers, and in fact there may be one out there that I don't know of.
Meanwhile, vendors are free to implement extensions as they see fit, and shadow registers are not something I think I've seen at this point for RISC-V MCUs.

Then there is the ever-recurring topic of interrupt latency on "modern" MCUs.
Shaving off a few cycles on something running @100MHz+ with - for the most part - single-cycle instructions doesn't necessarily bring much to the table. It was obviously much more critical on a few-MHz MCU with 4-cycle instructions.

And, as some of us often say, if you need ultra-low latency for some very low-level function, you're usually better served with dedicated peripherals these days, and many modern MCUs even have hardware triggers that make the need for very-low-latency interrupts even less relevant. Outside maybe of some very niche stuff.

Now to be fair, RISC-V MCUs are still relatively few and often not as featureful as the more established ones when it comes to peripherals, hardware triggers and such. But that will come.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
The message you replied to is pointing out that there is ALSO a NON-vectored mode, which is used by many systems, especially those running Linux or similar on >1 GHz CPUs, where you don't care about shaving nanoseconds off the interrupt response times and you usually have some simple 32 or 64 bit auxiliary CPU core to handle real-time tasks.

Sure, but the part I was actually replying to didn't make it obvious you were talking about very fast systems when the whole topic was about MCUs.

I wasn't. MCUs also have the non-vectored mode.

Quote
And on a MCU running at a few tens of MHz, having to check the interrupt flag in xcause, then test against the exception code and execute code based on that when you have a number of possible interrupt sources, that sure wouldn't be insignificant.

The total time to save registers and branch on mcause is comparable to the interrupt response time on Cortex-M0+ or M3.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Strange that they save registers for the interrupts in the stack in such a modern design. There are better ways. You can use shadow register sets and such. Even PIC16s are doing this (and have been doing this for 10 years or so), so the interrupt latency is 500 ns (for 8 MHz CPU cycle), faster than what you measured for HPE.

We are talking about the smallest cheapest core here. If you want shadow register sets then spend a few cents more on a CH32V3xx instead of the CH32V003. The same HPE set-up code and interrupt handler code works transparently on either.

Shadow register sets are not for free. They use a lot of transistors which could be spent on something else, or not spent at all. It makes little sense to cut the register set down from 32 to 16 to save transistors, and then spend a ton of transistors on having multiple sets of them!

Not every application needs 10ns interrupt response time, or even sub µs. We used to achieve a lot with 1 MHz 6502 and 8 MHz M68000, both of which take around 6 µs to execute the first instruction of an interrupt handler.

 
The following users thanked this post: willmore

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Nothing would prevent a RISC-V extension to include shadow registers, and in fact there may be one out there that I don't know of.
Meanwhile, vendors are free to implement extensions as they see fit and shadow registers is not something that I think I've seen at this point for RISC-V MCUs.

WCH's other, higher end, MCUs have shadow register sets for the HPE feature instead of writing the registers to the stack. The CH32V3xx series, for example.

This is their cheapest, most cut down, one, remember?
 
