Author Topic: Cortex SAM3X interrupt latency -- what gives?! (Read 4328 times)

ebastler · « **on:** September 19, 2017, 06:16:04 am »

I have started a little project using the SAM3X for an embedded application -- simulating the (broken) magnetic drum memory in a rather old computer, the LGP-30. I am using an Arduino Due board for this: Figured the simple high-level libraries might come in handy, e.g. to store drum contents on an SD card. For the time-critical steps, I use direct hardware access instead of the inefficient Arduino I/O.

This works in principle, but I am struggling with an interrupt latency which is much larger than expected. My code needs to react to an incoming pulse (the LGP-30 system clock, running at 130 kHz). I know that reacting to a state change of a general PIO pin is slow on the SAM3X: All inputs of a PIO controller share a single interrupt, so the pin that has changed needs to be decoded first. But I also need to trigger a one-shot timer on each incoming pulse. I have configured a timer/counter to react to this pulse (external event on TIOB), and have enabled this timer's interrupt for this external event:

Code: [Select]

  REG_TC2_CMR0 =
    TC_CMR_WAVE
    | TC_CMR_TCCLKS_TIMER_CLOCK1  // MCK/2
    | TC_CMR_CPCSTOP              // stop when reaching RC
    | TC_CMR_EEVT_TIOB            // external event is TIOB input
    | TC_CMR_EEVTEDG_RISING       // TIOB rising edge
    | TC_CMR_ENETRG               // ... resets counter and starts clock
    | TC_CMR_WAVSEL_UP            // count up (and stop there)
    | TC_CMR_AEEVT_SET            // ext event sets TIOA
    | TC_CMR_ACPC_CLEAR;          // reaching RC clears TIOA

  REG_TC2_RC0 = 55;               // one-shot duration in MCLK/2 ticks (24ns)
  REG_TC2_CCR0 = TC_CCR_CLKEN;    // Enable clock

  // Set up timer interrupt handler TC6_Handler.
  // This has lower latency than the PIO interrupt, "only" 600ns(?) instead of 940ns.

  REG_TC2_IER0 = TC_IER_ETRGS;    // enable interrupt on external trigger
  // (TC_IMR is a read-only mask status register; writing TC_IER is sufficient.)
  //  NVIC_SetPriority (TC6_IRQn, 0); // test -- does not affect latency
  NVIC_EnableIRQ(TC6_IRQn);

My interrupt routine is the corresponding TC6_Handler -- which I understand to be called via a direct entry in the processor's interrupt vector table, without further overhead. The first thing I do in the TC6_Handler is to set a digital output pin, so I can observe the timing on a scope. (The pin is set via direct PIO_SODR manipulation, not the Arduino's slow digitalWrite call.) What I get is shown in the attached scope screenshot:

Channel 1 (yellow): Incoming clock pulse.
Channel 2 (blue): Single-shot timer output. There is a delay of two MCLK/2 ticks (2*24ns) before this starts; one more tick than I had expected, but I don't mind.
Channel 3 (pink): Output set by TC6_Handler.

The TC6_Handler becomes active ~550ns after the trigger pulse (with some jitter, which is fine). That's 46 clock cycles at the Due's 84 MHz clock rate! But the SAM3X interrupt latency should be 12 cycles only?!

I am aware that there may be the occasional wait state in accessing the flash memory. But most of the 12 cycles are for storing the processor context internally, I believe, and should not incur waits. My "set the digital output" instruction should add another 2 cycles. So I would expect around 16 cycles of actual latency -- but where does the much longer delay come from?!
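The cycle figures above follow from simple unit conversion. A minimal sketch of that arithmetic, assuming the Due's 84 MHz core clock and the MCK/2 timer clock from the configuration above; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Convert a duration in nanoseconds to clock cycles, rounded to the nearest cycle.
// clock_hz is the clock frequency in Hz (84000000 for the Due's core clock).
inline uint32_t ns_to_cycles(uint32_t ns, uint32_t clock_hz) {
    assert(clock_hz > 0);
    const uint64_t scaled = static_cast<uint64_t>(ns) * clock_hz;
    return static_cast<uint32_t>((scaled + 500000000ULL) / 1000000000ULL);
}

// Convert a cycle (or timer tick) count to nanoseconds, rounded to the nearest ns.
inline uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz) {
    assert(clock_hz > 0);
    const uint64_t scaled = static_cast<uint64_t>(cycles) * 1000000000ULL;
    return static_cast<uint32_t>((scaled + clock_hz / 2) / clock_hz);
}
```

With these, ns_to_cycles(550, 84000000) gives 46 cycles for the measured delay, cycles_to_ns(16, 84000000) gives about 190 ns for the expected one, and cycles_to_ns(55, 42000000) gives about 1310 ns for the one-shot duration set in RC.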

I must be overlooking something here. Thanks for your thoughts!

ataradov · « **Reply #1 on:** September 19, 2017, 06:46:32 am »

Just some things that will contribute:
1. Digital synchronization and filtering on the PIO pin.
2. Vector fetch. Your vector table is most likely located in the flash, and a fetch of the handler address will be subject to wait states. This can be solved by moving the whole table into RAM.
3. Compiler prologue in the interrupt handler. Look at it in the disassembly and count instructions there (also subject to wait states). Same solution will apply - move the whole handler into RAM.

andersm · « **Reply #2 on:** September 19, 2017, 07:50:04 am »

General suggestions: Verify that you're actually running at the speed you think you are. Make sure that you've enabled and configured whatever flash acceleration features the chip has.

ebastler · « **Reply #3 on:** September 19, 2017, 03:42:32 pm »

Thank you both!
A couple more comments:

Quote from: ataradov on September 19, 2017, 06:46:32 am

Just some things that will contribute:
1. Digital synchronization and filtering on the PIO pin.

Yes, I was aware of that one. That's probably the reason that the single-shot timer starts with a little delay. But once that timer has started (48ns later), I trust that the SAM3X has processed and internalized the incoming signal. So this can only be responsible for a small part of the extra latency.

Quote

2. Vector fetch. Your vector table is most likely located in the flash, and a fetch of the handler address will be subject to wait states. This can be solved by moving the whole table into RAM.
3. Compiler prologue in the interrupt handler. Look at it in the disassembly and count instructions there (also subject to wait states). Same solution will apply - move the whole handler into RAM.

This is the kind of "dark matter" I am concerned about... I would certainly hope that the CPU's vector table is in RAM once the system has booted up (it can be relocated anywhere via a vector table offset register), and that it points directly to the TC6_Handler. But who knows if the clever folks at Arduino haven't added another level of redirection... Same for potential prologue code in the handler. I will need to figure out where the Arduino toolchain lets me see the assembly code; I have not had reason to look for it in the simple AVR projects I have done in earlier years.

Quote from: andersm on September 19, 2017, 07:50:04 am

General suggestions: Verify that you're actually running at the speed you think you are. Make sure that you've enabled and configured whatever flash acceleration features the chip has.

Valid point. I do know that the peripheral clocks (and hence the processor's main clock) run at the speed I expect. But I should reconfirm the core's clock and potential memory wait states. On the other hand, everything on the Arduino Due runs inside the Cortex' internal memory, and I would hope that they have not slowed that down by a factor of 3?!

I will look into the above once I get back home. Will post an update if I find anything. Thanks again!

ataradov · « **Reply #4 on:** September 19, 2017, 03:57:13 pm »

Quote from: ebastler on September 19, 2017, 03:42:32 pm

I would certainly hope that the CPU's vector table is in RAM once the system has booted up

This is typically not the case in most projects, so I would not just hope that it is in RAM in Arduino code.

westfw · « **Reply #5 on:** September 19, 2017, 10:28:33 pm »

Quote

I would certainly hope that the CPU's vector table is in RAM once the system has booted up (it can be relocated anywhere via a vector table offset register)

I very much doubt it. RAM tends to be more precious than speed.

Quote

Same for potential prologue code in the handler.

In the absence of additional indirection, handler functions on CM3 do not contain any additional prologue, because the ISR hardware matches the programmatic ABI. ISRs don't even have any designation as such; they're just normal C functions with the right names...

Quote

I will need to figure out where the Arduino toolchain lets me see the assembly code

Yes, this is certainly important to do if you're trying to figure out latencies.
The Arduino toolchain does not save assembly code, but it leaves a .elf file lying around in a mysterious temp directory, where it can be disassembled using objdump:

Code: [Select]

.../arduino/tools/arm-none-eabi-gcc/4.8.3-2014q1/bin/arm-none-eabi-objdump -SC -I *.elf

Quote

I would hope that they have not slowed that down by a factor of 3

FLASH apparently needs 5 cycles (4 Wait states) at 84MHz. There is a 2x128bit flash buffer for "acceleration", but the documentation is a bit too skimpy to figure out exactly how it will behave in any particular situation :-(

nctnico · « **Reply #6 on:** September 19, 2017, 11:21:44 pm »

Quote from: westfw on September 19, 2017, 10:28:33 pm

Quote
I would certainly hope that the CPU's vector table is in RAM once the system has booted up (it can be relocated anywhere via a vector table offset register)
I very much doubt it. RAM tends to be more precious than speed.

Besides, RAM is way more prone to corruption (think buffer overflow or pointer error!) than flash, so I prefer to have interrupt vectors in flash.

ebastler · « **Reply #7 on:** September 20, 2017, 06:20:37 am »

Thank you all for further input. Here's a quick update:

Quote from: westfw on September 19, 2017, 10:28:33 pm

The Arduino toolchain does not save assembly code, but it leaves a .elf file lying around in a mysterious temp directory, where it can be disassembled using objdump:

Code: [Select]
.../arduino/tools/arm-none-eabi-gcc/4.8.3-2014q1/bin/arm-none-eabi-objdump -SC -I *.elf

Yep, that seems to be the way to go, thanks! Figured that out the other night, and had a look at the assembly code.

Quote

In the absence of additional indirection, handler functions on CM3 do not contain any additional prologue, because the ISR hardware matches the programmatic ABI. ISRs don't even have any designation as such; they're just normal C functions with the right names...

The assembly does show a bit of extra prologue code. The CM3 context switch only saves registers r0..r3, so the generated code moves nine further registers onto the stack. Definitely explains a significant part of the extra latency.

Quote

FLASH apparently needs 5 cycles (4 Wait states) at 84MHz. There is a 2x128bit flash buffer for "acceleration", but the documentation is a bit too skimpy to figure out exactly how it will behave in any particular situation :-(

It can't be quite as bad. There are some threads where people have timed the execution (from flash) of 1000 sequential NOPs, and it comes out at 1003 cycles or so. Pipelining to the rescue, I assume. But, of course, for looking up an interrupt vector from flash and then jumping to the ISR, there would be a couple of cycles which take the full hit regarding wait states.

Quote

Quote
I would certainly hope that the CPU's vector table is in RAM once the system has booted up (it can be relocated anywhere via a vector table offset register)
I very much doubt it. RAM tends to be more precious than speed.

That part I still have not figured out. From looking at the generated assembly, or the system libraries' source code, I can't deduce where the vector table actually lives. I probably need to do some runtime debugging to see where the vector table is located, and whether its timer interrupt vector points straight to my handler. -- I have found some published code which modifies the vector table at runtime however, so I still tend to think it lives in RAM.

In summary, it seems that I will have to live with the higher latency, unless I dive in very deeply. (ISR in assembly to minimize the context switch overhead; potentially relocate the vector table to RAM...) I had hoped that I had just committed some easily fixable goof, which was responsible for a major delay. Given that there is no simple fix, I'll probably see whether I can arrange things so that I can live with the present latency.

Anyway, a bit of a disappointment that the advertised "12 cycles latency, and we already have taken care of the context switch for you" is not the full truth...

ataradov · « **Reply #8 on:** September 20, 2017, 06:26:23 am »

Quote from: ebastler on September 20, 2017, 06:20:37 am

That part I still have not figured out. From looking at the generated assembly, or the system libraries' source code, I can't deduce where the vector table actually lives.

If you have full source code, then just search for "SCB->VTOR". This register contains a pointer to the table. If you have a debugger, then just look at its value.
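Once the VTOR value is known (from a debugger, or printed from the sketch), it can be interpreted with the architectural Cortex-M3 memory map: addresses below 0x20000000 are in the Code region, which holds the on-chip flash, and 0x20000000 to 0x3FFFFFFF is the SRAM region. A minimal host-side sketch; the helper names are hypothetical, and the IRQ number has to be taken from the device header:

```cpp
#include <cassert>
#include <cstdint>

enum class MemRegion { Code, Sram, Other };

// Classify an address by the architectural Cortex-M3 memory map.
MemRegion classify_address(uint32_t addr) {
    if (addr < 0x20000000u) return MemRegion::Code;   // on-chip flash lives here
    if (addr < 0x40000000u) return MemRegion::Sram;   // on-chip RAM
    return MemRegion::Other;                          // peripherals, external memory, ...
}

// Address of the vector table slot for external interrupt irq_number.
// The first 16 slots hold the initial stack pointer and the system exceptions;
// every slot is one 32-bit word.
uint32_t vector_slot_address(uint32_t vtor, int irq_number) {
    assert(irq_number >= 0 && irq_number < 240);      // Cortex-M3 allows at most 240 IRQs
    return vtor + 4u * (16u + static_cast<uint32_t>(irq_number));
}

// On the target this would be fed from the real register, for example:
//   MemRegion where = classify_address(SCB->VTOR);
```

A table at 0x00080000 classifies as Code (flash), so the vector fetch is subject to flash wait states. Reading the word at the computed slot address and comparing it with the handler's address (with bit 0 set, since the Cortex-M3 executes Thumb code) shows whether the vector points straight at the handler.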

Quote from: ebastler on September 20, 2017, 06:20:37 am

Anyway, a bit of a disappointment that the advertised "12 cycles latency, and we already have taken care of the context switch for you" is not the full truth...

That part is advertised by ARM, not a chip vendor. And they do say that additional delays may apply. It is absolutely possible to get to that number, but not with Arduino code. Being efficient is not what Arduino is all about.

andersm · « **Reply #9 on:** September 20, 2017, 07:00:58 am »

Quote from: ebastler on September 20, 2017, 06:20:37 am

The assembly does show a bit of extra prologue code. The CM3 context switch only saves registers r0..r3, so the generated code moves nine further registers onto the stack. Definitely explains a significant part of the extra latency.

The hardware pushes the registers designated as caller-saved in the architecture calling convention (which is what allows ISRs to be plain C functions). If you have a complex ISR, then you probably need more registers, especially if you are calling other functions.

AIUI, the core is designed to fetch the interrupt vector in parallel with pushing the registers, in which case having the vectors in RAM could add latency via bus contention.
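For reference, the basic frame that a Cortex-M3 pushes on exception entry consists of eight words: r0 to r3, r12, lr, the return address, and xPSR. A sketch of that layout as a C++ struct; the struct and function names are illustrative and not part of CMSIS:

```cpp
#include <cassert>
#include <cstdint>

// Registers stacked by the Cortex-M3 hardware on exception entry,
// in order of increasing address starting at the stack pointer.
struct ExceptionFrame {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    // link register of the interrupted code
    uint32_t pc;    // return address
    uint32_t xpsr;  // program status register
};

static_assert(sizeof(ExceptionFrame) == 8 * sizeof(uint32_t),
              "hardware frame is eight words");

// Copy a frame out of a captured stack image; sp points at the stacked r0.
ExceptionFrame read_frame(const uint32_t *sp) {
    assert(sp != nullptr);
    return ExceptionFrame{sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7]};
}
```

Registers r4 to r11 are not part of this frame. They are callee-saved under the ABI, so the compiler emits a push for exactly those that the function body uses.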

tszaboo · « **Reply #10 on:** September 20, 2017, 08:19:43 am »

Maybe the issue is simpler than this. Did you turn on all the optimization in the compiler? I've seen a C compiler which, at its lowest optimization setting (slowest speed), translated a bare ; into a NOP so that you could place a breakpoint on it.

ebastler · « **Reply #11 on:** September 20, 2017, 08:27:40 am »

Quote from: ataradov on September 20, 2017, 06:26:23 am

Quote from: ebastler on September 20, 2017, 06:20:37 am
Anyway, a bit of a disappointment that the advertised "12 cycles latency, and we already have taken care of the context switch for you" is not the full truth...
That part is advertised by ARM, not a chip vendor. And they do say that additional delays may apply. It is absolutely possible to get to that number, but not with Arduino code. Being efficient is not what Arduino is all about.

Yes, I know... Although I think I'm largely seeing the overhead from the C++ compiler, not the Arduino-specific infrastructure and libraries. Hence, as mentioned above, a major improvement would probably require switching to assembly code, and trying to limit register use to those which are implicitly saved in the context. For the present project, I don't think I can justify that effort.

Anyway, I was a bit naïve in assuming that the core-based context switch would take care of everything (i.e. all registers). Should have read the small print...

ebastler · « **Reply #12 on:** September 20, 2017, 08:39:34 am »

Quote from: NANDBlog on September 20, 2017, 08:19:43 am

Maybe the issue is simpler than this. Did you turn on all the optimization in the compiler? I've seen a C compiler which, at its lowest optimization setting (slowest speed), translated a bare ; into a NOP so that you could place a breakpoint on it.

Worth a look; thanks for the suggestion. The assembly code I reviewed did seem a bit awkward, but that may well be by necessity due to the RISC instruction set -- I never studied the Cortex instruction set in any detail. I will play with compiler options tonight. I can certainly have the compiler optimize for speed only, since the SAM3X has generous memory for my application.

tggzzz · « **Reply #13 on:** September 20, 2017, 08:43:31 am »

For hard real-time, consider XMOS xCORE processors.

The IDE guarantees maximum timings - which is easy since there are no interrupts and no cache and so many cores that you can dedicate a separate core to each peripheral.

Each "FPGA-like" i/o port (i.e. "peripheral") can be clocked at up to 250MHz (programmable), has SERDES, and has associated timers so that the time at which input occurred is stored, and outputs can be set to occur on a specific clock cycle.

All in all it is ideal for "event driven" programming.

The £12 startKIT dev board has 8 free 100MHz cores, plus USB connectivity. The IDE is cost-free, based on Eclipse/LLVM/gdb.

I've used one to count transitions in two 62.5Mb/s inputs plus USB comms plus front panel i/o - all simultaneously and all timings guaranteed. Remarkably easy and pleasant.

Example below is of the IDE, showing a call-flow graph of -O3 optimised binary code, and that the critical path (inside the grey highlighted area) is 700ns.

westfw · « **Reply #14 on:** September 20, 2017, 08:41:21 pm »

Quote

The assembly does show a bit of extra prologue code. The CM3 context switch only saves registers r0..r3, so the generated code moves nine further registers onto the stack. Definitely explains a significant part of the extra latency.

Really? That's actually rather strange. If I take a random Arduino sketch and add:

Code: [Select]

void TC6_Handler() {
  digitalWrite(13, 1);
}

I get very minimal code:

Code: [Select]

00080148 <TC6_Handler>:
   80148:       200d            movs    r0, #13
   8014a:       2101            movs    r1, #1
   8014c:       f000 bcca       b.w     80ae4 <digitalWrite>

The existing ISRs (UART_Handler, USART1_Handler, etc.) are similarly simple, and they call C++ methods.
You must either have a very complex ISR, or something else weird is going on.
If your ISR must be complex, I think that this implies that you should be able to divide up your code:

Code: [Select]

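// Declared up front so the call below compiles outside the Arduino IDE;
// noinline (a GCC attribute) keeps part 2 and its register saves out of the handler.
void TC6_Handler_part2() __attribute__((noinline));

void TC6_Handler()
{
   // simple code requiring low-latency
   TC6_Handler_part2();
}
void TC6_Handler_part2()
{
   // more complex code that doesn't need low latency
}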

(The way that function prologues work, all of the context saving required by a function will end up at the start of the function. Watch out for inlining, though!)

Quote

The hardware pushes the registers designated as caller-saved ... If you have a complex ISR, then you probably need more registers, especially if you are calling other functions.

Actually, the clever thing about having the HW ISR save match the ABI "caller-save" list is that calling other functions does NOT suddenly require saving additional registers - they'll all be saved by the callees (as required by the ABI.) This is in contrast with the AVR, which has a long list of registers that are not caller-saved that an ISR must "suddenly save" if calling another function, even though they're seldom used...

Can you post a trimmed-down version of your ISR code that demonstrates this large prologue generation? (This is starting to look mysterious enough to be "interesting.")

westfw · « **Reply #15 on:** September 21, 2017, 01:44:40 am »

Quote

Quote
I can't deduce where the vector table actually lives.
If you have full source code, then just search for "SCB->VTOR"

For Arduino Due, the code DOES change the Vector Table, since there will normally be a bootloader at the beginning of the address space, with the actual user sketch at 0x80000.
The setup looks like it's done in
.../arduino/tools/CMSIS-Atmel/1.1.0/CMSIS/Device/ATMEL/sam3xa/source/as_gc/startup_sam3xa.c
as part of the Reset_Handler code (ie REALLY early in startup code.) (The actual vectors are in the same module.)

ebastler · « **Reply #16 on:** September 21, 2017, 06:23:22 am »

Quote from: westfw on September 20, 2017, 08:41:21 pm

You must either have a very complex ISR, or something else weird is going on.
...
Can you post a trimmed-down version of your ISR code that demonstrates this large prologue generation?

Well, I wouldn't exactly call it a "large prologue"; but it is large enough to take a non-negligible number of cycles. With the compiler's default settings, the prologue is

stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}

i.e. nine extra registers saved to the stack. Setting the optimization option to "-O3" actually reduces the register use quite a bit:

stmdb sp!, {r4, r5, r6, r7, r8}
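The cost of these two prologues can be estimated from the register lists. The sketch below assumes the Cortex-M3 timing of 1 + N cycles for a store-multiple of N registers and zero-wait-state RAM for the stack; it ignores the time needed to fetch the instruction itself from flash. The function name and the mask encoding (bit n set for register rn) are chosen for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Estimated cycles for a store-multiple (stmdb/push) on a Cortex-M3:
// one cycle plus one per stored register, assuming no memory wait states.
// reglist_mask has bit n set when register rn is in the list.
unsigned stm_cycles(uint16_t reglist_mask) {
    assert(reglist_mask != 0);  // an empty register list is not a valid instruction
    unsigned count = 0;
    // Each iteration clears the lowest set bit, so the loop runs once per register.
    for (uint16_t m = reglist_mask; m != 0; m &= static_cast<uint16_t>(m - 1)) {
        ++count;
    }
    return 1 + count;
}
```

The list {r4-r9, sl, fp, lr} corresponds to mask 0x4FF0 and gives 10 cycles (about 119 ns at 84 MHz); {r4-r8} corresponds to mask 0x01F0 and gives 6 cycles (about 71 ns).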

My ISR is somewhat complex indeed. I struggle to shorten the code in a meaningful way, so let me just describe it verbally:

By way of background, I am simulating the magnetic drum for the LGP-30, a bit-serial computer from the 1950's. The drum stores the main memory (just 1 bit to be read/written per clock cycle, dependent on a track address). But it also stores the three CPU registers (program counter, current instruction, and accumulator). Those are implemented as "recycling registers": Each of them has its own track, and the 32-bit value repeats in every sector/word along the track. Separate read and write heads are provided for each register. The write head (placed 32 bit upstream from the read head) either just re-writes what the read head has just presented, for reference in the next instruction cycle and sector, or writes a modified value coming back from the LGP-30 logic unit.
In total, I need to read 12 input bits (in two groups -- some change ~300ns after the clock pulse, some only later after I have set some of the outputs), and write 16 output bits (again in two timing groups). Some of the outputs are pre-determined, i.e. I can look them up at the end of the previous ISR call and have them ready for output in the next interrupt; some depend on the input bits I see in the present interrupt.
The behavior of the ISR is also state-dependent: The simulated drum starts up in write-protected mode, and write access only gets activated after the real LGP-30 CPU has provided meaningful input for a defined time. (The ISR has to monitor this while in write-protected mode.) The actual drum memory (registers and main memory) resides in arrays, and the ISR code involves various array look-ups.
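The recirculating-register behavior described above can be modeled in a few lines. This is a hypothetical sketch, not the actual ISR: the type and function names, the LSB-first bit order, and the single modify flag are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Model of one recirculating register track: the 32 bits that are on the drum
// between the write head and the read head at any moment.
struct RecirculatingRegister {
    uint32_t bits;  // bit 0 is the bit currently under the read head (assumed order)

    // Advance the drum by one clock. Returns the bit seen by the read head.
    // If modify is false, that bit is re-written unchanged; otherwise new_bit
    // (coming back from the logic unit) is written instead.
    bool clock(bool modify, bool new_bit) {
        const bool read_bit = (bits & 1u) != 0;
        const bool write_bit = modify ? new_bit : read_bit;
        bits = (bits >> 1) | (static_cast<uint32_t>(write_bit) << 31);
        return read_bit;
    }
};

// Run one full word time (32 clocks), replacing the contents with new_value,
// and return the word that was read out meanwhile.
uint32_t exchange_word(RecirculatingRegister &reg, uint32_t new_value) {
    uint32_t old_value = 0;
    for (int i = 0; i < 32; ++i) {
        const bool out = reg.clock(true, ((new_value >> i) & 1u) != 0);
        old_value |= static_cast<uint32_t>(out) << i;
    }
    return old_value;
}

// Run one full word time without modification and return the word read out.
uint32_t recirculate_word(RecirculatingRegister &reg) {
    const uint32_t before = reg.bits;
    uint32_t value = 0;
    for (int i = 0; i < 32; ++i) {
        value |= static_cast<uint32_t>(reg.clock(false, false)) << i;
    }
    assert(reg.bits == before);  // a plain re-write leaves the stored word intact
    return value;
}
```

In this model the value survives from one sector to the next only because every bit that passes the read head is written back 32 bit positions upstream, which is why the ISR has to produce an output bit on every clock pulse.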

So I can't really fault the compiler for using multiple registers, especially given the instruction set's heavy reliance on register-based addressing modes. It seems that I may be able to get some minor improvements in latency, but will mainly have to live with it. Specifically, I have opted for an external flip-flop to latch those input bits where I have only ~300ns before they change. That seems much easier and more reliable than trying to get the latency below that threshold...

By the way: Yes, I am aware that a microcontroller is fundamentally the "wrong" platform to implement this.

I do know about CPLDs and FPGAs, and as you may have seen in another thread, have in fact implemented my complete replica of the LGP-30 in an FPGA. My justification for doing this in a microcontroller, and an Arduino on top of that, is the easy implementation of the user interface, and specifically the storage of drum images on an SD card. (I want to be able to move drum images between a PC and the real LGP-30 easily.)

Thanks for all your input! I think I am ready to move ahead with this solution and work around its shortcomings. Now on to adding the level converters, to make the µC actually talk to the tube-based LGP-30 logic...

westfw · « **Reply #17 on:** September 21, 2017, 07:19:23 am »

Quote

Well, I wouldn't exactly call it a "large prologue"; but it is large enough to take a non-negligible number of cycles. With the compiler's default settings, the prologue is
stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
My ISR is somewhat complex indeed.

Ahh. I'd normally define "ISR prologue" as "stuff done at the start of an ISR whether it's needed or not." AVR-GCC has an annoying 5-instruction ISR prologue where it saves CPU flags and sets up the known-zero register, even if the ISR code doesn't use them (recently fixed, I think.)

In your case, the ISR is saving a lot of registers because your code USES a lot of registers, and it's pretty difficult to get around that. :-( (There are some CPUs with a separate SET of registers for ISRs, specifically to address this sort of latency issue. But it's probably a bit late to change CPUs... Hmm. I think PIC32 is one of them, and it's also supported by an Arduino-like environment.)
(Annoyingly, simple processors that have less state to save can respond more quickly, but they need more code to DO anything. The AVR isn't one of them; AVRs have a really obnoxious ratio of state to RAM...)

