Author Topic: [ARM] optimization and inline functions (Read 7666 times)

NorthGuy · « **Reply #50 on:** January 19, 2022, 05:40:47 pm »

Quote from: nctnico on January 19, 2022, 04:55:14 pm

IMHO products become way too fragile when they need to rely on cycle accurate execution. That is something you don't want to really care about when writing software in C for a product.

You cannot possibly rely on cycle accuracy if you write in C. The C compiler is itself a source of uncertainty: the next compilation may differ from the current one. Caches, prefetchers, and bus contention are other sources of uncertainty. Therefore, you need to have a margin. As in Siwastaja's example: you get 17 cycles while anything below 100 cycles would do. These 17 cycles can grow some, but probably will never grow beyond 100 cycles, so the project is safe. This example uses roughly 500% margin.
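The margin figure can be checked with a few lines of arithmetic. This is only a sketch; `margin_percent` is a hypothetical helper name, not something from the thread.

```c
#include <stdint.h>

/* Headroom between the measured latency and the allowed limit, expressed
 * as a percentage of the measured value. Hypothetical helper, for
 * illustration only. */
uint32_t margin_percent(uint32_t measured_cycles, uint32_t limit_cycles)
{
    if (measured_cycles == 0 || limit_cycles <= measured_cycles)
        return 0;  /* nothing measured, or no headroom left */
    return ((limit_cycles - measured_cycles) * 100u) / measured_cycles;
}
```

With 17 measured cycles against a 100-cycle limit this gives 488, which matches the "roughly 500%" quoted above.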

You only need cycle accuracy when you want to push the limit and eliminate the safety margin. For that you need zero uncertainty. This lets you increase performance dramatically, but is rarely needed, because there's no premium on doing things faster than required by the project.

hans · « **Reply #51 on:** January 19, 2022, 05:56:10 pm »

@Siwastaja: Complete portability is definitely a unicorn. But that doesn't make a BSP a bad idea to have. It's likely only a small portion of code ends up in there. I don't think most MCU requirements are anything extraordinary. Canonical implementations of SPI, I2C, UART, etc. can be assumed to all have a similar "interface" (in terms of software, that is). Anything outside of the BSP can easily be written in a portable way, and be tested with more productive tools than an electronics bench + slow debug dongles.

I agree with you that fighting for the last cycle (or explanation thereof) quickly becomes chasing red herrings or your own tail. But there is a difference between allowing for some overhead and having timing-sensitive code that actually breaks.

Quote

But in the end, having the port register qualified volatile also means, the compiler cannot reorder the ISR so that the port write would be the last thing. Compiler is of course allowed to insert an unnecessary calculation of Pi before that, but why would it do that?

Finally, you measure the latency with an oscilloscope and see that it's actually taking 17 cycles +/- 1 cycle of jitter and once in a year, when a DMA transfer is triggered during full moon and Michael Jackson's Thriller is playing on the radio, it has 3 cycles of jitter(!!).

Now, tggzzz would say you have proven nothing. But realistically, what are the chances that this breaks down beyond 100 clock cycles when the GCC version updates?

The C/C++ specification isn't always that well defined. Some things are left to the compiler's infinite wisdom. For example, in my example with the race condition, the access of 2 volatile values was reordered within that 1 statement. Did that code rely on UB? Yes, it probably did.
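The race-condition code referred to here is not reproduced in this part of the thread, so the following is only a sketch of the general pattern, with hypothetical function and parameter names. When two volatile reads sit in one expression, C leaves their relative order unspecified; giving each read its own statement pins the order.

```c
#include <stdint.h>

/* Both reads sit in one expression. The operands of | are unsequenced,
 * so the compiler may read *lo_reg before *hi_reg or the other way round. */
uint32_t read_either_order(const volatile uint16_t *hi_reg,
                           const volatile uint16_t *lo_reg)
{
    return ((uint32_t)*hi_reg << 16) | *lo_reg;
}

/* One volatile read per statement: the high word is read first, then the
 * low word, in program order. */
uint32_t read_hi_then_lo(const volatile uint16_t *hi_reg,
                         const volatile uint16_t *lo_reg)
{
    uint32_t hi = *hi_reg;
    uint32_t lo = *lo_reg;
    return (hi << 16) | lo;
}
```

Both functions return the same value when the registers are not changing; the difference only matters when the order of the two accesses has side effects in hardware or races with an interrupt.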

I agree that the chances of a GCC update making such a large difference are very, very slim. But "academically speaking", I can see where tggzzz is coming from that nothing has been proven. Measurements only suggest that it's unlikely to happen.

On a philosophical tangent: it's odd that some industries (including automotive) have very strict requirements on only using cache-free microcontrollers, so that worst-case execution times can be accurately determined and therefore the firmware can be verified to work reliably in all cases. On the other hand people are also working on self-driving cars, that have such a plethora of unknowns and high complexity, that we'll have to accept that also these computer systems won't be 100% safe. Nonetheless, some people still put it forth as a requirement or expectation.

MK14 · « **Reply #52 on:** January 19, 2022, 06:55:17 pm »

Quote from: hans on January 19, 2022, 05:56:10 pm

On the other hand people are also working on self-driving cars, that have such a plethora of unknowns and high complexity, that we'll have to accept that also these computer systems won't be 100% safe. Nonetheless, some people still put it forth as a requirement or expectation.

Even if the timings and most other functionality were nailed down perfectly, we could never really be sure how the self-driving car is going to perform/behave, given the numerous variations in real-life situations it will experience.
Posted somewhere (I vaguely remember it being here in the Tea thread) is a video of a Tesla on Autopilot (without getting into arguments as to whether Autopilot is fully self-driving or not), where a lorry in front had a large number of traffic lights in the back of it. Because there is normally only a limited number of traffic lights, and traffic lights don't usually drive off in front of the Tesla, the software went partly crazy on some kind of autopilot status screen: it recognized the traffic lights, but got wildly confused.

First link (Twitter Video) is the truck with the traffic lights in back, the second link is a youtube video about it getting mixed up with the Moon.

https://twitter.com/i/status/1400207129479352323

Simon · « **Reply #53 on:** January 19, 2022, 07:14:26 pm »

Quote from: Siwastaja on January 19, 2022, 04:15:23 pm

Well, microcontroller projects (the ones that actually control things, using peripherals etc.) just are not portable, we have to accept this. By trying to make them portable, we either sacrifice so much of the MCU capability that we just say "no" to the projects - or use FPGAs to implement them. Or, we are writing and verifying so much extra abstraction that the original project would have been manually ported ten times when the "portable" project finishes.

IMHO, the key to success is not to strive for ideal world that does not exist in MCUs. After you accept this, you can leverage the features that are there, for the low cost, and save a lot of time and money compared to rolling out an FPGA design (or using an esoteric, vendor-lock-in solution like the XMOS).

Which is why the arduino is a piece of shit!

MK14 · « **Reply #54 on:** January 19, 2022, 07:30:06 pm »

Quote from: Simon on January 19, 2022, 07:14:26 pm

Which is why the arduino is a piece of s**t!

On the one hand it has an unbelievably massive amount of RAM for such a small device (2K), making it especially suited for massive programming projects (sarcasm).

But on the other hand, as you imply, there is its huge compatibility (with itself), tremendous popularity, the availability of the Arduino (IDE, open source and cheap clones) and its huge ecosystem (a massive range of libraries and example/completed software and hardware for it). It has been a massive success. It's just a pity they couldn't have chosen some kind of upcoming ARM chip instead of the AVR they use. I know you can now get ARM-based ones, but the huge inertia of pre-existing stuff means that Arduino/ARM is somewhat unheard of compared to the usual Atmel (now Microchip) AVR versions.

Simon · « **Reply #55 on:** January 19, 2022, 07:44:22 pm »

As brief as my experience of programming is, every time I have to use an arduino I find myself highly frustrated by lack of access to things that I know the micro-controller has and take for granted but are unsupported in Arduino like interrupts. The only interrupts available are the pin interrupts. Given the laughable utility of the system, to have pin interrupts is over the top and is well, only useful for keeping some part of the system working when the gobshite that some libraries are, are running slow.

Let's face it, the best way to do code scheduling in Arduino is to get a PWM pin going and then set up an interrupt on it, or on another connected pin, to trigger a scheduler. Any attempt at an Arduino scheduler that I have seen is an incomplete mess, because you can't have one scheduler working against the existing one, and surprise surprise, the one I looked at still used the delay function to flash the LED it was supposed to be flashing with the interrupt.

Trying to use millis and micros proved a nightmare for me and I was trying to do things that normally I would simply do with a counter firing interrupts.....

Simon · « **Reply #56 on:** January 19, 2022, 07:45:11 pm »

Oh and want a good one? apparently on the arduino in integer math 0/100 = 1!!!!!!!! and yes I had to put a check in the program that would only allow the calculation to go ahead if the numerator was more than 100.............
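For comparison, C (C99 and later) and C++ define integer division as truncating toward zero, so `0/100` evaluates to 0 with a conforming compiler. A result of 1 would have to come from something else in the surrounding calculation, which is not shown in the post. A minimal check, where `scaled` is a hypothetical name:

```c
/* Plain integer division: the fractional part is discarded. */
int scaled(int numerator, int denominator)
{
    return numerator / denominator;  /* caller must ensure denominator != 0 */
}
```

Any numerator from 0 to 99 divided by 100 gives 0 here, and 100 gives 1.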

brucehoult · « **Reply #57 on:** January 19, 2022, 08:32:57 pm »

Quote from: NorthGuy on January 19, 2022, 03:57:12 pm

Quote from: brucehoult on January 19, 2022, 11:07:42 am
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

The absence of branch prediction doesn't make the CPU cycle-accurate. There are caches and bus contention. For example, the CPU may compete with DMA for bus cycles which may lead to unpredictable delays.

Cortex M0{+} doesn't have caches.

Many systems they are used in don't have DMA (you should know whether you have DMA or not)

Quote

BTW: MIPS eliminates "branch prediction penalty" completely by using delay slots. This doesn't make it cycle-accurate either.

One or two early simple MIPS implementations with short pipelines and single-issue managed to cover the delay in fetching code on the branch-taken path by executing the instruction in the delay slot.

Later implementations with longer pipelines or dual or multiple issue or long latency RAM need branch prediction for performance just as much as anything else does.

Even with the simple short pipeline single-issue CPUs it's not always easy to find something useful for the branch delay slot -- something that is needed on both the branch taken and branch not taken paths, but that isn't needed to determine the branch condition -- and some large percentage of the time you end up with a NOP there, wasting time and program size.

nctnico · « **Reply #58 on:** January 19, 2022, 08:51:10 pm »

Quote from: brucehoult on January 19, 2022, 08:32:57 pm

Quote from: NorthGuy on January 19, 2022, 03:57:12 pm
Quote from: brucehoult on January 19, 2022, 11:07:42 am
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

The absence of branch prediction doesn't make the CPU cycle-accurate. There are caches and bus contention. For example, the CPU may compete with DMA for bus cycles which may lead to unpredictable delays.

Cortex M0{+} doesn't have caches.

Many have 'caches' in the form of flash accelerators (pre-fetch buffers) so the execution time of branches isn't 100% predictable.

SiliconWizard · « **Reply #59 on:** January 19, 2022, 09:24:33 pm »

Yes. Some can reply that you can always (usually) force some piece of code to run from RAM to avoid Flash caching issues. Then, even without DMA, you may still have interrupts. Having to disable interrupts may or may not cause other issues... etc.

So anyway. Relying on exact cycle execution on those modern CPUs often looks like tinkering and is very rarely worth it. As I said numerous times, using appropriate peripherals to do the job is the way to go in most cases. Good thing that many of those "modern" MCUs do embed a lot of peripherals with which you can implement a lot of stuff. Sure, NXP's FlexIO (or RPi's PIO) is very flexible for this, but even with just timers, PWM, input/output compares, FMC... you can implement a lot of "bit banging" with accurate timings.

If you still need to really directly toggle GPIOs with some *minimal* delay - again it will be hard to make that accurate - there is often one thing you can do that many people do not think about: set the clock rate for the GPIO peripheral to an appropriate frequency. Most ARM Cortex-based MCUs allow this. So you could have the core run at, say, 100 MHz, and the GPIOs clocked at, say, 10 MHz (if this divider is available). Then, just toggling a GPIO with two consecutive instructions, without any explicit delay, will always occur at least 100 ns apart. Maybe obvious, but something to think about.
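The 100 ns figure follows directly from the GPIO clock period. A small sketch of that arithmetic, assuming each GPIO write occupies one GPIO clock cycle (the actual bus behaviour depends on the specific MCU); `min_toggle_spacing_ns` is a hypothetical helper name:

```c
#include <stdint.h>

/* Minimum spacing, in nanoseconds, between two back-to-back GPIO writes
 * when the GPIO block is clocked at gpio_clock_hz, assuming one write per
 * GPIO clock cycle. Hypothetical helper, for illustration only. */
uint32_t min_toggle_spacing_ns(uint32_t gpio_clock_hz)
{
    if (gpio_clock_hz == 0)
        return 0;  /* clock stopped: no meaningful answer */
    return 1000000000u / gpio_clock_hz;
}
```

For a 10 MHz GPIO clock this returns 100, regardless of how fast the core itself runs.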

NorthGuy · « **Reply #60 on:** January 19, 2022, 09:28:45 pm »

Quote from: brucehoult on January 19, 2022, 08:32:57 pm

One or two early simple MIPS implementations with short pipelines and single-issue managed to cover the delay in fetching code on the branch-taken path by executing the instruction in the delay slot.

Later implementations with longer pipelines or dual or multiple issue or long latency RAM need branch prediction for performance just as much as anything else does.

We're talking about "small" MCUs which don't have branch prediction, remember? This includes, for example, PIC32MZ running at 250 MHz. For these, the delay slot works very well.

Look, for example, at the OP's original post. His loop runs at 1 MHz on a 4 MHz MCU. If this was MIPS, it would run at 2 MHz - twice as fast. Thanks to the delay slot.

Quote from: brucehoult on January 19, 2022, 08:32:57 pm

Even with the simple short pipeline single-issue CPUs it's not always easy to find something useful for the branch delay slot -- something that is needed on both the branch taken and branch not taken paths, but that isn't needed to determine the branch condition -- and some large percentage of the time you end up with a NOP there, wasting time and program size.

Usually you can find something quite easily. If you cannot, you don't have to use it - just put a NOP there. It's certainly better than M0, where you cannot put any instruction into the delay slot because there is no delay slot at all. On M0, you get the delay whether you like it or not.

westfw · « **Reply #61 on:** January 19, 2022, 11:40:37 pm »

Quote

Cortex M0{+} doesn't have caches.

This doesn't stop vendors from implementing caches "outside" of the "CPU Proper" (as part of the memory or flash system, for instance.)

The Raspberry Pi RP2040 has a 16k cache (sorely needed since it fronts a QSPI program store), and almost every M0 implementation I've seen that runs faster than 32MHz has some kind of "flash accelerator" that is more complicated than just a wide bus. I've seen the time taken by cycle-counting "delay" functions vary wildly depending on exactly where they ended up in the code. (though that was on a CM4 which apparently had both cache AND flash acceleration. (still not ARM-defined cache, though.))

Often the cache and/or accelerator behavior is not documented very well.

westfw · « **Reply #62 on:** January 19, 2022, 11:54:30 pm »

Quote

By trying to make them portable, we ... sacrifice so much of the MCU capability
Quote
Which is why the arduino is a piece of shit!

Arduino certainly sacrifices much of the MCU capabilities in its provided "common function libraries."
But complaining about that discounts the VAST number of applications that do not in fact need any of the specialized MCU capabilities.
(And also the fact that you can still access those MCU capabilities with only slightly more trouble than it would have been to access them via bare metal programming or a vendor-specific "bloated intentionally to provide access to all features!" library.)

All my life I've watched features implemented to make efficient use of some special capability, that in time became deprecated and ignored in favor of simply using faster infrastructure (CPU, memory, bandwidth.) Just the other night at a cisco Anniversary Pizza Party (virtual), we were talking about "remember how we thought doing video over Internet would be impossible without IP Multicast?" Sigh.

mikerj · « **Reply #63 on:** January 20, 2022, 12:33:47 am »

Quote from: Simon on January 19, 2022, 07:44:22 pm

As brief as my experience of programming is, every time I have to use an arduino I find myself highly frustrated by lack of access to things that I know the micro-controller has and take for granted but are unsupported in Arduino like interrupts. The only interrupts available are the pin interrupts.

Untrue. Only external interrupts are supported for the attachInterrupt() function (which uses function pointers), but regular avr-libc interrupt handlers can be included for any peripheral not already used by the arduino ecosystem e.g.

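Code:

#include <avr/interrupt.h>  /* avr-libc: provides ISR() and the vector names */

ISR(TIMER3_COMPA_vect)      /* Timer 3 compare match A; needs an AVR that has a Timer 3 */
{
    /* Handle interrupt */
}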

Siwastaja · « **Reply #64 on:** January 20, 2022, 07:54:34 am »

Quote from: NorthGuy on January 19, 2022, 05:40:47 pm

You cannot possibly rely on cycle accuracy if you write in C.

You can, if you put the timing critical part in its own module, adjust it until it matches the expectations, using a scope for example, keep it small, compile it once, verify the module, then just keep the object file.

For someone who doesn't feel confident writing assembly, this could be the easiest way. And if the other option is to explode the $2 BOM to $50 by introducing an FPGA and then start learning VHDL or hiring someone who can do it, I can see the appeal of using non-optimal tools to get to the goal. Sometimes you just use your screwdriver as a hammer.

Siwastaja · « **Reply #65 on:** January 20, 2022, 08:01:18 am »

Arduino, by their own definition, is not supposed to be used by programmers or electronics designers, but by artists. The whole idea is you can just buy a shield and write led.blink(); and get an art project out of it. It needs to be dumbed down, it needs to be limited. Art projects also don't have strict requirements so you can always work with what you have.

By trying anything more challenging than that, you hit the limits, and it's not Arduino's fault. If you want to blame someone, blame fanboys who don't understand the limits.

But instead, I suggest you just completely ditch the Arduino software ecosystem. You can still use the boards, just program them like you program the microcontroller on the board.

Simon · « **Reply #66 on:** January 20, 2022, 08:23:19 am »

Quote from: Siwastaja on January 20, 2022, 08:01:18 am

Arduino, by their own definition, is not supposed to be used by programmers or electronics designers, but by artists. The whole idea is you can just buy a shield and write led.blink(); and get an art project out of it. It needs to be dumbed down, it needs to be limited. Art projects also don't have strict requirements so you can always work with what you have.

By trying anything more challenging than that, you hit the limits, and it's not Arduino's fault. If you want to blame someone, blame fanboys who don't understand the limits.

But instead, I suggest you just completely ditch the Arduino software ecosystem. You can still use the boards, just program them like you program the microcontroller on the board.

That is exactly what I tell people who come to me excited about the Arduino. I inherited one Arduino-based project, and my boss, who initially came to me with this revelation that he had seen it in action, has accepted my opinion and is not insisting I use it. Not to mention the fact that, uh, all the SAMD micros are out of stock, not sure why that is.......

Nominal Animal · « **Reply #67 on:** January 20, 2022, 09:07:47 am »

If you want cycle accuracy, C is indeed not the right tool. Even when you go through the steps to get the binary that does the thing right, you need to keep using that binary (and not the C source), because any little change in the compiler or linker can/will throw the timing off.

Me, I like using GCC/IntelCC/Clang extended asm for such critical parts. It differs from external assembly sources in that the extended asm construction explicitly tells the C compiler about the input and output registers (and if so wanted, even lets the C compiler choose the exact registers used), and what registers or memory gets clobbered. When used in an inlined function, the compiler can adjust the assembly (register use) to best fit the surrounding code (and vice versa, because it knows exactly what registers etc. are used in the extended asm).
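As a rough illustration of the constraint syntax described above, here is a minimal sketch for GCC/Clang. The asm templates are deliberately empty so that it compiles on any target; real timing-critical code would put target-specific instructions in the template. The function names are hypothetical.

```c
#include <stdint.h>

/* Empty template: emits no instructions, but the "+r" constraint tells the
 * compiler that `value` lives in a register of its own choosing and is
 * both read and written by the asm statement. */
static inline uint32_t pass_through_register(uint32_t value)
{
    __asm__ volatile ("" : "+r" (value));
    return value;
}

/* The "memory" clobber tells the compiler that memory may have been
 * modified, so it will not keep memory values cached in registers
 * across this point. No instruction is emitted. */
static inline void compiler_barrier(void)
{
    __asm__ volatile ("" ::: "memory");
}
```

Because the compiler picks the register for `value`, the same function inlines cleanly into different callers without any fixed register assignment.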

Things like timing-critical interrupt handlers are better written as external assembly files. The C compiler really only needs its calling address, which is available at link time, and can be exported to C using a simple extern declaration. (I like to rely on other ELF object file properties, like section attribute, to make the build machinery smoother, more capable, but still robust wrt. source code changes.)

For single-instruction multiple-data or SIMD stuff, I like using GCC/IntelCC/Clang vector extensions via the vector_size attribute (on the variable type). Addition, subtraction, and multiplication on vector variables do the corresponding component-wise operation, and for the rest, the compiler provides built-in intrinsics (and a standardized set for x86-64 in <immintrin.h>). The basic vector extensions work regardless of hardware support (for example, when the hardware only supports say two components, the compiler uses two registers for a vector with four components, transparently and quite efficiently), so such code is actually portable across hardware. For the same reasons as for extended asm, the compiler also generates quite efficient code for the intrinsics; typically much better than hand-written assembly, if we include the compiler-generated surrounding code in the consideration.
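A minimal sketch of the vector_size attribute, assuming GCC or Clang; the type and function names (`vec4f`, `mul_add`) are made up for the example.

```c
/* Four floats packed into one 16-byte vector type (GCC/Clang extension). */
typedef float vec4f __attribute__((vector_size(16)));

/* Component-wise a*b + c, written with the ordinary arithmetic operators.
 * The compiler maps this onto whatever SIMD support the target has, or
 * falls back to scalar code if it has none. */
static inline vec4f mul_add(vec4f a, vec4f b, vec4f c)
{
    return a * b + c;
}
```

Individual components can be read with ordinary subscripts, e.g. `result[0]`, which is handy when checking results.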

Arduino is a somewhat toy-like/silly/coddling environment, designed to make quick development easy for non-programmers and non-technical people.
As usual, a lot of Arduino stuff is quite crappy, but there are some very nice nuggets in there among the cores and support libraries. So, it's not all bad.
I happily use it for quick prototyping, although I am quite familiar with the cores (Arduino core code for the specific microcontroller chip) and the libraries I use, as I look through their source codes before I rely on them even a tiny little bit. (That is the difference between "I think there is a library for that" and "I use this library for that" in my case; involving several hours of source code examination.)

In all this, the ability to read and understand assembly code is quite useful and rather important. If you are like me, you rarely write any assembly from scratch, but you end up reading (the important or strange or critical parts) of compiler-generated assembly at least once every week. I myself often end up examining the generated assembly for wildly different hardware –– x86-64, MIPS, AVR, some ARM Cortex –– to find out if a particular C expression has issues on any hardware when compiled given a small set of compilers and compiler versions.

That is, I am basically never interested in what code ends up being optimal; I am interested in what code performs acceptably well in all situations, and what code patterns have issues with specific compilers and/or specific hardware architectures – the latter being more important than the former. Even when I'm using the nonstandard C compiler and linker features above, I like to know what weaknesses the pattern I am applying, has. That way, not only do I have that tool in my toolbox, but it also has a small note with it that lists the known risks/deficiencies/weaknesses in addition to its strengths, and I can rummage quickly through those to find a tool that suits a particular situation.

(From my point of view, this also explains why I dislike language-lawyerism: from this point of view, the language-lawyers are claiming the text of the standard is more important than those notes based on real life observations. The standard is better than nothing, but can never override the behaviour observed in the real world.)

Whether the same approach works for others, depends on their personal strengths and what they get paid for, I believe.

Simon · « **Reply #68 on:** January 20, 2022, 11:07:20 am »

The SAMC has a cache for the non volatile memory controller, enabled by default. That may explain it.

josip · « **Reply #69 on:** January 20, 2022, 11:15:50 am »

Quote from: westfw on January 19, 2022, 11:40:37 pm

The Raspberry Pi RP2040 has a 16k cache (sorely needed since it fronts a QSPI program store), and almost every M0 implementation I've seen that runs faster than 32MHz has some kind of "flash accelerator" that is more complicated than just a wide bus. I've seen the time taken by cycle-counting "delay" functions vary wildly depending on exactly where they ended up in the code...

I am using Kinetis M0+ @96 MHz with cycle aligned code (bit banging), that is running from flash or RAM without issues. But I am coding in assembler. Branching also can be done right. Yes, undocumented things must be resolved by yourself. FlexIO is great, but some things can't be done with it.

DiTBho · « **Reply #70 on:** January 20, 2022, 11:57:54 am »

Quote

BTW: MIPS eliminates "branch prediction penalty" completely by using delay slots. This doesn't make it cycle-accurate either.

Quote from: brucehoult on January 19, 2022, 08:32:57 pm

One or two early simple MIPS implementations with short pipelines and single-issue managed to cover the delay in fetching code on the branch-taken path by executing the instruction in the delay slot.

Later implementations with longer pipelines or dual or multiple issue or long latency RAM need branch prediction for performance just as much as anything else does.

I have worked with MIPS II, III, and IV, including the R10K, R12K, R14K and R16K: yup, definitely long-pipeline MIPS, very different from the R2K (5 stages).

MIPS64R2 is also a long-pipeline MIPS (I have never seen a short one yet).
MIPS32R2 ... well, it depends on the chip manufacturer. Microchip's is a short-pipeline version, Atheros' is a long-pipeline one.

Quote from: brucehoult on January 19, 2022, 08:32:57 pm

Even with the simple short pipeline single-issue CPUs it's not always easy to find something useful for the branch delay slot -- something that is needed on both the branch taken and branch not taken paths, but that isn't needed to determine the branch condition -- and some large percentage of the time you end up with a NOP there, wasting time and program size.

Yup, that's why the academic code for educational MIPS32-5-stages-pipeline is often stuffed with a NOP.

MK14 · « **Reply #71 on:** January 20, 2022, 01:14:51 pm »

Quote from: Nominal Animal on January 20, 2022, 09:07:47 am

If you want cycle accuracy, C is indeed not the right tool.

I like and appreciate the rest of the post, thanks for making it. But I find the bit I quoted misleading. Technically speaking, you are right, it is not the right tool. But real life is not that simple.
For various reasons, even though assembly code might well be the 'RIGHT' tool for the job, it is not necessarily an allowed option. It could be a place of work where most of the programmers don't do assembly code, or an open source project where a decision has been made to leave assembly code out of it. There are numerous other examples of why assembly code is either disallowed by rules (e.g. by the company that wants the software), or considered a very bad idea because different architectures might be used, either now or in the future.
As already mentioned in this thread, by a number of people, there are various suggestions on how to use C code, and get the timing accuracy you desire. So there are many ways of achieving it, such as compiling it, looking at the assembly listing, and modifying the code if the listing doesn't appear as you intended, and/or watching/measuring what it does with a scope, adjusting the code as necessary.
Although MCUs usually have hardware timers, they can be tied up doing other things, or be unsuitable for extremely rapid transactions, and there are many other reasons why they can be unsuitable for the task in hand.

Increasingly these days, even assembly code itself is NOT that simple as regards consistent/exact timing. Cache/other stuff has already been mentioned, but once you get to out-of-order execution (and hence it is already superscalar), e.g. the Raspberry Pi 4, tiny changes to the assembly code may have dramatic effects on its running time/efficiency (i.e. which instructions you put in, and where they are, can make the difference between 3 things being done in 1 clock cycle, or it taking 3 clock cycles to do the 3 instructions). Just putting in an apparently simple memory access may slow it down by a huge amount, because it is relying on relatively slow DRAM accesses instead of using the cache (which has run out of capacity, or is getting too many random accesses to keep up).
This can make C the better tool, as the compiler will tend to automatically try to avoid such slowdowns via its optimization. I suppose I'm saying it's a double-edged sword, because on the one hand you might want consistent and fast software responses, and hence need to keep the code optimized. But hand coding in assembly becomes considerably harder if you have to keep it optimised on modern architectures.

Sorry, the last paragraph, strayed from 'cycle accurate' coding, into potentially highly optimised coding. But some projects, need to do BOTH. I.e. They need to be fast (on a cost efficient MCU) and reasonably consistent with its timing.

Siwastaja · « **Reply #72 on:** January 20, 2022, 01:33:14 pm »

"Reasonably consistent" is well said actually, and that is what even the most modern microcontrollers easily achieve. Running the code from external SD card aside (which you shouldn't do for timing-critical routines), it's all pretty much noise. Is this thing going to take 57µs or maybe sometimes 57.1µs? Who cares? Compare this to CS (software) mindset where Java suddenly garbage collecting for five seconds is A-OK.

Even with the old "simple" microcontrollers, actually writing cycle-accurate assembly was not the typical case, but a special rarity. Sometimes to bitbang an interface with no peripheral available.

But today, we have better selection of peripherals. Even then, if no peripheral is available, thanks to just more processing power, you can "bitbang" without cycle accurate code; examples would be combining timer peripherals and code (polling for a flag), or interrupt-driven code. (Example would be an SDLC implementation at 1Mbit/s, which is well possible on Cortex-M7 @ 400MHz without cycle accuracy, but challenging on an AVR @ 16MHz with cycle accuracy).

Just STM32's basic timer's One Pulse Mode, combined with ITCM and relocatable RAM-based vector table, configurable interrupt priorities and software interrupts, creates a freely programmable state machine engine capable of timesteps in excess of 5-10MHz equivalent or so, with jitter in just hundreds of nanoseconds; and it can run the rest of the application in "parallel"! For the speed, this is only one order of magnitude behind FPGAs, basically.

So the fact that writing cycle accurate code is "difficult" is quite uninteresting in practice, because it is very rarely needed. But every now and then that special niche pops up, and if you can save $50 in BOM and 1 year in development time by avoiding turning it into an FPGA project, by using a screwdriver as a hammer and leaving a few ugly dents in the process, maybe, why not.

Nominal Animal · « **Reply #73 on:** January 20, 2022, 02:07:00 pm »

Quote from: MK14 on January 20, 2022, 01:14:51 pm

As already mentioned in this thread, by a number of people, there are various suggestions on how to use C code, and get the timing accuracy you desire.

You do not actually get the timing accuracy from the C code; you get it from that specific version of C compiler and linker and options.

I am not saying that one should never do such stuff in C. I am saying that if you do, the timing is dependent on the C compiler and linker, their exact version, and on the compile options (including target options, obviously). There are absolutely no guarantees or even hints that a different compiler, or even a future version of the same compiler, optimizes it the same way.

While you know this, many beginner C programmers do not, even though they tend to be very keen on optimization. Thus, being careful on the wording here, to convey an accurate understanding on the limitations, is important.

MK14 · « **Reply #74 on:** January 20, 2022, 02:41:46 pm »

Quote from: Nominal Animal on January 20, 2022, 02:07:00 pm

Quote from: MK14 on January 20, 2022, 01:14:51 pm
As already mentioned in this thread, by a number of people, there are various suggestions on how to use C code, and get the timing accuracy you desire.
You do not actually get the timing accuracy from the C code; you get it from that specific version of C compiler and linker and options.

I am not saying that one should never do such stuff in C. I am saying that if you do, the timing is dependent on the C compiler and linker, their exact version, and on the compile options (including target options, obviously). There are absolutely no guarantees or even hints that a different compiler, or even a future version of the same compiler, optimizes it the same way.

While you know this, many beginner C programmers do not, even though they tend to be very keen on optimization. Thus, being careful on the wording here, to convey an accurate understanding on the limitations, is important.

I agree with you. It is a problem, as many people, from complete beginners to advanced C programmers and beyond, may well read this and other threads.

You can partially control the timing of the C software by having the MCU's hardware timers call it (via interrupts), or by polling those timers, right at the start of the applicable software routine. That gives it a degree of timing stability/consistency. You can also use the hardware timers to record the precise end time, so that you can create diagnostic timing-jitter information. This helps get the software somewhat right, even when expensive test equipment is not being used.
For example, with PID control loops, it is not necessarily about being run at precisely the right time, perhaps once every millisecond. It is more about using the MCU's hardware timers to read in what the precise time is NOW, doing the PID calculations while taking the jitter into account, producing new outputs, and then re-enabling interrupts (if applicable; obviously disable them towards/at the beginning of that section in the code).
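One way to read the PID suggestion above is to feed the measured time step into the calculation instead of assuming a fixed period. The following is a sketch under that assumption, not code from the thread. The timer reading is passed in by the caller, so nothing here is tied to a particular MCU, and all names (`pid_state`, `pid_step`, `ticks_per_second`) are hypothetical.

```c
#include <stdint.h>

typedef struct {
    float kp, ki, kd;      /* controller gains */
    float integral;        /* accumulated error * time */
    float prev_error;      /* error at the previous step */
    uint32_t prev_ticks;   /* timer reading at the previous step */
} pid_state;

/* One PID step. now_ticks is a free-running hardware timer reading taken
 * at the start of the routine; ticks_per_second is that timer's rate. */
float pid_step(pid_state *s, float setpoint, float measured,
               uint32_t now_ticks, float ticks_per_second)
{
    /* Unsigned subtraction stays correct across a timer wrap-around. */
    uint32_t elapsed = now_ticks - s->prev_ticks;
    float dt = (float)elapsed / ticks_per_second;
    float error = setpoint - measured;
    float derivative = 0.0f;

    if (dt > 0.0f) {
        /* Use the measured step, so jitter in the call time is absorbed. */
        s->integral += error * dt;
        derivative = (error - s->prev_error) / dt;
    }
    s->prev_error = error;
    s->prev_ticks = now_ticks;

    return s->kp * error + s->ki * s->integral + s->kd * derivative;
}
```

A caller would read the timer first thing in the interrupt handler or polling loop and pass that value in, so that a late call simply results in a slightly larger `dt` rather than a wrong integral or derivative term.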

There are lots of pitfalls with assembly code. Some customers/products have far too woolly/vague requirements and specifications, and/or you can't trust the people in charge NOT to change their mind on what needs to be done on a week-by-week basis. This can make assembly code especially problematic. A long time ago, there were many good assembly language programmers around. Increasingly these days, decent assembly language programmers are becoming a rare thing.
I like to do fun things in assembly. But the latest processors, with thousands of complicated instructions, can be a real chore to write for, rather than the fun experience that older architectures can be.

EDIT: The vague spec is not necessarily a problem, as long as the assembly code is strictly limited to just one or two interface ports and well-defined stuff like that. It is when large chunks of the project are coded in assembly that regularly changing requirements would be an issue.
The details in this post only apply to some particular embedded projects. In the big wide world there is a huge variety of possibilities, and anyway, some projects use a real time operating system (RTOS), which is another big ball game in itself.

