Author Topic: Multi Microcontroller design.  (Read 7431 times)


Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #25 on: May 03, 2020, 07:30:45 pm »
tggzzz's silver bullet is the XMOS,

Sorry, that is not true.

I wrote "look at the XMOS devices" and listed some of their characteristics. I did not write "use the XMOS devices"

Quote
Solution:
Use XMOS so that the IDE proves timing to 4ns!!!! You LOSE otherwise!

Sorry, that is not true. See my reply #11.

Quote
What we really need, is better specification of the problem.

... as is implied in my reply #11.

Quote
Or maybe the OP needs a bicycle, or a potato. We don't know.

... as is implied in my response #11.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Multi Microcontroller design.
« Reply #26 on: May 03, 2020, 09:50:24 pm »
Sorry, that is not true.

I wrote "look at the XMOS devices"
A taste of your own medicine: I wrote "4x higher freq ARM MCU could be enough for the job", yet you did not hesitate to jump on me with "what makes you qualified" and deadline-guarantee "arguments" [grin]
 
The following users thanked this post: Siwastaja

Offline hans

  • Super Contributor
  • ***
  • Posts: 1641
  • Country: nl
Re: Multi Microcontroller design.
« Reply #27 on: May 04, 2020, 03:34:40 pm »
These threads are always fun; no one has a freaking idea about anything that's needed to solve the problem, yet the opinions are strong. tggzzz's silver bullet is the XMOS, and nctnico's silver bullet is never ever even considering multi-MCU.

Task:
Turn some relays on/off without delays.

Solution:
Use XMOS so that the IDE proves timing to 4ns!!!! You LOSE otherwise!

Reality:
Relays have milliseconds of delays and timing jitter!

What we really need, is better specification of the problem.

For really tight timing in really complex situations, options are FPGA, CPLD, or maybe the XMOS product series if it hits the sweet spot.

Sometimes, multi-MCU is a fairly easy and low-cost solution, but I'd default to not recommending it. You'll have to judge whether it makes your life easier or harder in your particular design.

In most cases, any properly selected single MCU does the job. ARM MCUs implement interrupt priorities and pre-empting interrupts; they also implement 12-cycle interrupt latency; running from flash induces slightly variable timing, but it's easy to prove the timing approximately. It's irrelevant whether the "turn off the relay" operation takes 15 or 20 cycles, if the specification requires doing it in less than 20000 cycles. Also the most common use cases tend to be covered by the peripherals, IN HARDWARE, for example cycle-by-cycle current limiting and safety cutoffs in motor control applications.

The chances are, the problem in the opening post is easily solved by making the safety critical things happen in a highest-priority interrupt, and the UI code run in the main loop. Make the highest-priority interrupt unmaskable by modifying BASEPRI instead of turning interrupts on/off completely. But yes, this is just guesswork. It may be that XMOS is just the tool that the OP happens to need in this completely unspecified problem.
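The BASEPRI pattern described above can be sketched roughly as follows. The priority values and helper names here are hypothetical; on a real Cortex-M, `__set_BASEPRI()` is the CMSIS intrinsic, stubbed here with a shadow variable only so the logic can be checked off-target:

```c
#include <stdint.h>

/* Stand-in for the Cortex-M BASEPRI register; in real firmware
 * __set_BASEPRI() comes from the CMSIS core headers. */
static uint32_t basepri_shadow;
static void __set_BASEPRI(uint32_t v) { basepri_shadow = v; }

/* Hypothetical priorities: on Cortex-M, a lower number means higher
 * urgency, and priorities live in the upper bits of an 8-bit field. */
#define SAFETY_PRIO  (1u << 4)   /* safety interrupt: highest used */
#define UI_PRIO      (8u << 4)   /* UI and housekeeping interrupts */

/* Mask UI-level interrupts without a blanket __disable_irq():
 * anything with a priority value >= UI_PRIO is held off, while the
 * safety interrupt (numerically lower) can still preempt. */
static void enter_critical(void) { __set_BASEPRI(UI_PRIO); }
static void exit_critical(void)  { __set_BASEPRI(0u); }  /* 0 = mask nothing */
```

The point of the pattern: the safety-critical ISR never sees added latency from the main loop's critical sections, because those sections raise BASEPRI only to the UI level instead of disabling interrupts globally.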

Or maybe the OP needs a bicycle, or a potato. We don't know.

"Prove", "approximately". Tight timing is not real-time. Real-time means that the environment imposes constraints on the processing and timing of the system, where missing a deadline can have serious consequences. This thread started with us guessing what the deadlines of this application really are; they turn out to be rather relaxed, but other than that it's still rather unspecified. This doesn't really help if you need to define the specifications of a real-time system.

In addition, proving that software is indeed never going to miss a deadline, from a digital engineering and/or firmware perspective, is not trivial. I agree with tggzzz that measurement is not a correct way to determine, for example, the WCETs of such systems, and application-oriented 'cached' ARM designs are incredibly hard to design for in this environment. Gaining confidence in hard real-time systems is more than a matter of saying "it ran fine on the bench". I liked tggzzz's comment for its technical content, not for his comment on "are you qualified".

Therefore, a 4x higher frequency ARM microcontroller is an odd solution to recommend. How do you know that CPU load is the limitation, and by how much? Many (online) scheduling algorithms cannot safely approach full CPU utilisation in real-time systems anyway - rate-monotonic schedulers, for example - so I think the design motive of "it will run fine on a high-speed ARM chip" is void: define "fine", and define "high speed".

Performing critical tasks in hardware (such as FPGAs) can be beneficial because you can design & verify for latency and timing. But it's also incredibly expensive and complex for many systems.

I agree with @nctnico that doing multi-MCU designs for the sake of multi-tasking is a bad reason. If anything, communication and synchronisation between systems is a much harder problem to solve than most people think. Writing software in interrupts with UI code in main() could perhaps be acceptable if there are sufficient CPU cycles left for the interface.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #28 on: May 04, 2020, 06:07:51 pm »
These threads are always fun; no one has a freaking idea about anything that's needed to solve the problem, yet the opinions are strong. tggzzz's silver bullet is the XMOS, and nctnico's silver bullet is never ever even considering multi-MCU.

Task:
Turn some relays on/off without delays.

Solution:
Use XMOS so that the IDE proves timing to 4ns!!!! You LOSE otherwise!

Reality:
Relays have milliseconds of delays and timing jitter!

What we really need, is better specification of the problem.

For really tight timing in really complex situations, options are FPGA, CPLD, or maybe the XMOS product series if it hits the sweet spot.

Sometimes, multi-MCU is a fairly easy and low-cost solution, but I'd default to not recommending it. You'll have to judge whether it makes your life easier or harder in your particular design.

In most cases, any properly selected single MCU does the job. ARM MCUs implement interrupt priorities and pre-empting interrupts; they also implement 12-cycle interrupt latency; running from flash induces slightly variable timing, but it's easy to prove the timing approximately. It's irrelevant whether the "turn off the relay" operation takes 15 or 20 cycles, if the specification requires doing it in less than 20000 cycles. Also the most common use cases tend to be covered by the peripherals, IN HARDWARE, for example cycle-by-cycle current limiting and safety cutoffs in motor control applications.

The chances are, the problem in the opening post is easily solved by making the safety critical things happen in a highest-priority interrupt, and the UI code run in the main loop. Make the highest-priority interrupt unmaskable by modifying BASEPRI instead of turning interrupts on/off completely. But yes, this is just guesswork. It may be that XMOS is just the tool that the OP happens to need in this completely unspecified problem.

Or maybe the OP needs a bicycle, or a potato. We don't know.

"Prove", "approximately". Tight timing is not real-time. Real-time means that the environment imposes constraints on the processing and timing of the system, where missing a deadline can have serious consequences. This thread started with us guessing what the deadlines of this application really are; they turn out to be rather relaxed, but other than that it's still rather unspecified. This doesn't really help if you need to define the specifications of a real-time system.

In addition, proving that software is indeed never going to miss a deadline, from a digital engineering and/or firmware perspective, is not trivial. I agree with tggzzz that measurement is not a correct way to determine, for example, the WCETs of such systems, and application-oriented 'cached' ARM designs are incredibly hard to design for in this environment. Gaining confidence in hard real-time systems is more than a matter of saying "it ran fine on the bench". I liked tggzzz's comment for its technical content, not for his comment on "are you qualified".

Therefore, a 4x higher frequency ARM microcontroller is an odd solution to recommend. How do you know that CPU load is the limitation, and by how much? Many (online) scheduling algorithms cannot safely approach full CPU utilisation in real-time systems anyway - rate-monotonic schedulers, for example - so I think the design motive of "it will run fine on a high-speed ARM chip" is void: define "fine", and define "high speed".

Performing critical tasks in hardware (such as FPGAs) can be beneficial because you can design & verify for latency and timing. But it's also incredibly expensive and complex for many systems.

I agree with @nctnico that doing multi-MCU designs for the sake of multi-tasking is a bad reason. If anything, communication and synchronisation between systems is a much harder problem to solve than most people think. Writing software in interrupts with UI code in main() could perhaps be acceptable if there are sufficient CPU cycles left for the interface.

I agree with your points, including 'I liked tggzzz's comment for its technical content, not for his comment on "are you qualified"'. That was a direct response to ogden's inaccurate "Shame to Xcore preachers who are offering overkill solution"! (reply #10)
« Last Edit: May 04, 2020, 06:16:28 pm by tggzzz »
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: Multi Microcontroller design.
« Reply #29 on: May 05, 2020, 01:19:50 pm »
Real-time means that the environment imposes constraints on the processing and timing of the system, where missing a deadline can have serious consequences.

And anyone can play a critical real-time expert on the Internet. Yet it's unbelievably complex, and in many ways it's far more than just logic timing.

Further, deadlines are not specified in clock cycles, but in SI units of time. Therefore talking about CPU clock cycles alone is irrelevant (especially when discussing relays). A 5-clock-cycle interrupt latency on a 4MHz 8-bit CPU cannot meaningfully be compared to a 12-clock-cycle interrupt latency on a 200MHz 32-bit CPU.
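To put those two latency figures into actual time, the conversion is trivial (clock rates taken from the comparison above):

```c
/* Interrupt latency in real time: cycles divided by the clock rate. */
static double latency_ns(unsigned cycles, double f_clk_hz) {
    return (double)cycles / f_clk_hz * 1e9;
}
```

With the numbers above, `latency_ns(5, 4e6)` is 1250 ns for the 8-bitter, versus `latency_ns(12, 200e6)` = 60 ns for the 32-bit part: the CPU that is "slower to respond" in cycles responds roughly twenty times faster in time.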

For safety critical, you need to prove a thing like:
"A voltage level exceeding 2.5V on input pin #42 for more than 1 microsecond leads to the output at pin #57 falling below 0.5V in less than 10 microseconds."

The design work for ARM and XCORE is equal; after the many steps of verifying analog input protection, and output driver functionality, you end up with the question, how do you prove that the CPU is reliable enough to run the code at all? The ISR likely is just a line of code that drives '0' to the IO port, and is triggered by an input pin. The XCORE solution, if I understood it correctly, might be a busy loop in one of the cores, looking at the input pin. Both will definitely work in less than 10µs, given that the CPU itself isn't having some kind of failure mode. How do you verify that?
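As a sketch of that "one line of code" ISR: the register and pin mapping below are hypothetical, and a stand-in variable replaces the real memory-mapped GPIO output register (e.g. `GPIOx->ODR` on an STM32) so the logic can be exercised off-target:

```c
#include <stdint.h>

/* Stand-in for the GPIO output-data register; on target this would
 * be the memory-mapped register itself. */
static volatile uint32_t odr_shadow = 0xFFFFFFFFu;

#define OUT57_MASK (1u << 7)   /* hypothetical bit mapping for output #57 */

/* ISR triggered by the overvoltage condition on input #42: a single
 * store drives the output low, so the handler's own contribution to
 * the worst case is interrupt latency plus a handful of cycles. */
void overvoltage_isr(void) {
    odr_shadow &= ~OUT57_MASK;
}
```

The hard part of the 10µs proof is, as argued above, not this code; it is showing that the CPU is alive and able to take the interrupt at all.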


In addition to just timing, robustness against all other sources of error is equally important. Tightly clocked logic that always does the sequence in a fixed number of clock cycles is equally prone to all electrical, mechanical, etc. sources of error as a system whose branch predictor causes a variability of a few nanoseconds between iterations. By reading tggzzz's marketing speeches on XMOS, you may get the impression that variable timing caused by a pipelined CPU, flash memory, and possibly caches is a complete turn-off for getting any safety-critical timing out of the system - that you have lost already. This is obviously false, because such critical systems are built using such processors all the time. Another fallacy is that just by introducing cycle-predictable timing, everything is made so much easier. In reality, if being cycle-accurate is your #1 concern, I'm sure you are not seeing the forest for the trees.

There are just so many aspects to it that concentrating on such a tiny detail definitely makes you lose the big picture.

Yes, you can buy a clock-accurate SystemC (or equivalent) simulation model for the ARM Cortex-M7 and actually prove the timing. Safety-critical is so freaking expensive and slow anyway.

Or you can do what I did just a few days ago: turn off the dynamic branch prediction, copy the program into ITCM, and verify the timing on an oscilloscope. Yes, I needed cycle-accurate. Yes, it is cycle-accurate. Yes, it works exactly the same every time. No, the system isn't safety critical, so if I'm wrong, no one dies; and finally, no, safety critical systems almost never rely on being cycle-accurate. If it seems that way, please reconsider whether the safety really depends on the cycle-accurate timing, or whether you could do something better to avoid such a requirement.
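On-target, the oscilloscope check can be complemented with the Cortex-M DWT cycle counter (CMSIS `DWT->CYCCNT`, enabled once via the TRCENA bit in `CoreDebug->DEMCR` and the CYCCNTENA bit in `DWT->CTRL`). The counter read is stubbed below, along with a dummy timed section, only so the worst-case-tracking harness itself can be tested off-target:

```c
#include <stdint.h>

/* Stand-in for DWT->CYCCNT; on target, read the free-running
 * counter register directly after enabling it as described above. */
static uint32_t cyccnt;
static uint32_t read_cyccnt(void) { return cyccnt; }

/* Run the timed section repeatedly and keep the worst observed cycle
 * count. Note this measures rather than proves: the bound is only as
 * good as the coverage of the runs. */
static uint32_t worst_cycles(void (*section)(void), int runs) {
    uint32_t worst = 0;
    for (int i = 0; i < runs; i++) {
        uint32_t start = read_cyccnt();
        section();
        uint32_t elapsed = read_cyccnt() - start;
        if (elapsed > worst) worst = elapsed;
    }
    return worst;
}

/* Dummy section for off-target testing: pretends to take 37 cycles. */
static void demo_section(void) { cyccnt += 37u; }
```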

Quote
where application-oriented 'cached' ARM designs are incredibly hard to design for in this environment.

I think I understand what you mean by "application-oriented 'cached'" design.

Yet, no one forces you to use ARM Cortex-M series devices in that "application-oriented cached" way. Heck, I've been using them for years and have yet to build such an application-oriented thing; I have not even turned the caches on, not even once! Yes, an M7 can be used a bit like a high-performance application processor, and I'm aware some treat them like application processors, but you don't
1) have to use the M7;
2) have to use the M7 that way. It's still an MCU!

The heavily pipelined dual-issue structure with quite advanced dynamic branch prediction means it's more difficult, possibly "insane", to build cycle-accurate things on the M7 (still, I have done it), but that doesn't mean your loop which normally runs in 200ns would suddenly take 200µs. On a Cortex-A running Linux, that is a real problem.

If you really need safety, it's all about the big picture, not a detail in whether an instruction takes 1 or 2 cycles. Providing independent watchdog systems is highly recommended; they can be analog, or even mechanical, for example.

« Last Edit: May 05, 2020, 01:49:06 pm by Siwastaja »
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: Multi Microcontroller design.
« Reply #30 on: May 05, 2020, 01:36:43 pm »
Finally, Xcore likely isn't "overkill", and if that's what you are familiar with, and are buying anyway, it's likely the most sensible choice to go with (as usual with microcontrollers, "use what you know"). So I'm sure tggzzz would use an XMOS product and solve the problem in no time, with great results and minimal time-to-market.

I would likely use an STM32 though, and make the project happen similarly quickly. Someone else would do it with an 8-bit PIC, equally successfully. (All three examples are capable of safety-critical real-time timing. Whether the solution is safe in a safety-critical environment is completely up to other factors.)

If it were really, really safety-critical, I'm sure all three solutions - tggzzz's, mine, and the "someone else's" - would fail in vigorous real-world testing, and someone would die as a result. We all went for minimized time-to-market using what we are familiar with, after all. That means some mistakes made while being comfortable!

From the OP's viewpoint though, they very likely have never heard about Xcore processors. With absolutely zero information about the requirements of the application (and some strong hints that it's likely not timing-critical; relays!), offering such a specific product series may send the OP down the wrong track - wrong in the sense of wasting time, when we should be discussing the requirements.

ARM MCUs are different; you very likely have used one, or will use one. They are everywhere, they run the world. If you have never heard about them, you can't go on designing microcontroller-based anything without fixing this massive gap.

Popularity has a real technical meaning here. Also for design-for-manufacturability, product lifetime, part availability, price.

If "the default" does the job just fine, start from there.

Your "default" may vary, though. Someone is an FPGA wiz and will use it even if not strictly required. Someone else defaults to XMOS and makes a business around that.
« Last Edit: May 05, 2020, 01:39:10 pm by Siwastaja »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #31 on: May 05, 2020, 02:01:07 pm »
Further, deadlines are not specified in clock cycles, but in SI units of time. Therefore talking about CPU clock cycles alone is irrelevant (especially when discussing relays). A 5-clock-cycle interrupt latency on a 4MHz 8-bit CPU cannot meaningfully be compared to a 12-clock-cycle interrupt latency on a 200MHz 32-bit CPU.

It is inconceivable that the clock rate is unknown; I didn't think that needed saying!

It is entirely conceivable that, in most processors, the number of clock cycles is unknown - and variable over a wide range. Even in 486-class processors with their tiny cache and interrupts, a simple test case could demonstrate a 5:1 timing variation (from memory, 30 years ago!). The only mitigation would be to disable all the caches and performance enhancements, thus greatly reducing the performance and wasting a lot of silicon. Better to use that silicon area as independent cores, as in the xCORE devices and Sun's Niagara processors.

Hence having no interrupts and a defined clock-cycle count is a significant advantage.

Quote
In addition to just timing, robustness against all other sources of error is equally important. Tightly clocked logic that always does the sequence in a fixed number of clock cycles is equally prone to all electrical, mechanical, etc. sources of error as a system whose branch predictor causes a variability of a few nanoseconds between iterations. By reading tggzzz's marketing speeches on XMOS, you may get the impression that variable timing caused by a pipelined CPU, flash memory, and possibly caches is a complete turn-off for getting any safety-critical timing out of the system - that you have lost already. This is obviously false, because such critical systems are built using such processors all the time. Another fallacy is that just by introducing cycle-predictable timing, everything is made so much easier. In reality, if being cycle-accurate is your #1 concern, I'm sure you are not seeing the forest for the trees.

There is the concept of necessary and sufficient. Having a defined maximum timing is necessary but is not sufficient. That should be obvious.

Quote
Or you can do what I did just a few days ago: turn off the dynamic branch prediction, copy the program into ITCM, and verify the timing on an oscilloscope. Yes, I needed cycle-accurate. Yes, it is cycle-accurate. Yes, it works exactly the same every time.

OK, you slugged the system. That's sufficient for the timing, but wasteful and inelegant.

You still had to measure the timing. That is equivalent to inspection, and as engineers used to be taught, "you can't inspect quality into a product".

Quote
No, the system isn't safety critical so if I'm wrong, no one dies; finally, no, safety critical systems almost never rely on being cycle accurate. If it seems that way, please reconsider if the safety really depends on the cycle-accurate timing or if you could do something better to avoid such requirement.

I've designed and implemented a system where it could kill people - even if it worked as designed and to specification :)

Quote
If you really need safety, it's all about the big picture, not a detail in whether an instruction takes 1 or 2 cycles. Providing independent watchdog systems is highly recommended; they can be analog, or even mechanical, for example.

No, it isn't all about the big picture - the details are also critical.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #32 on: May 05, 2020, 02:22:06 pm »
Finally, Xcore likely isn't "overkill", and if that's what you are familiar with, and are buying anyway, it's likely the most sensible choice to go with (as usual with microcontrollers, "use what you know"). So I'm sure tggzzz would use an XMOS product and solve the problem in no time, with great results and minimal time-to-market.

Nonsense.

Every situation is different, and I choose an appropriate tool that gets the entire job done. Sometimes, for a very relaxed specification, that can be an Arduino!

At HP it was drummed into us that time-to-market wasn't the most important thing. Read the last paragraph on this page from "The HP Phenomenon"
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Multi Microcontroller design.
« Reply #33 on: May 05, 2020, 04:00:16 pm »
At HP it was drummed into us that time-to-market wasn't the most important thing.
Any specialist low volume, high value product manufacturer will laugh at such a "rule". In short: it depends.

There are electronics product segments where you copy-paste your "go-to MCU", like an XCORE, into every product, because it is cheaper to not change anything. Then there are product segments where you not only hunt for the cheapest MCU of the moment, but optimize code out of your mind to squeeze into the smallest, thus cheapest, model of that cheapest MCU family.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #34 on: May 05, 2020, 04:21:04 pm »
At HP it was drummed into us that time-to-market wasn't the most important thing.
Any specialist low volume, high value product manufacturer will laugh at such a "rule". In short: it depends.

I'm not sure what you mean there.

HP was not only a "low volume high value" manufacturer. It also sold into extremely cost sensitive and time sensitive markets: consider PCs and peripherals.

I was chatting to the person in charge of HP's PC division (before he went on to become a Microsoft VP). He explicitly regarded PCs as being like bananas: low margin, and if they stayed on the shelf for too long they began to smell.

OTOH I know of other HP divisions where they spent large amounts of money on hardware accelerators used to simulate ICs. The reasoning was simple: if you can accelerate a novel product's development so it hits the market earlier, then you gain a lot of sales over the product lifecycle.

Nonetheless, in all those cases not meeting the specification was regarded as unacceptable - and several "Bill and Dave" stories were used to drive home that point.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Multi Microcontroller design.
« Reply #35 on: May 05, 2020, 04:28:14 pm »
At HP it was drummed into us that time-to-market wasn't the most important thing.
Any specialist low volume, high value product manufacturer will laugh at such a "rule". In short: it depends.
I'm not sure what you mean there.

You won't get what I mean if you don't read (on purpose or not). I said "Any specialist low volume, high value product manufacturer".
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #36 on: May 05, 2020, 04:39:14 pm »
At HP it was drummed into us that time-to-market wasn't the most important thing.
Any specialist low volume, high value product manufacturer will laugh at such a "rule". In short: it depends.
I'm not sure what you mean there.

You won't get what I mean if you don't read (on purpose or not). I said "Any specialist low volume, high value product manufacturer".

Could you at least read to the end of my post before commenting? You then choose to make a point, and deliberately snip the context that indicates your point is little more than a strawman argument. That does not reflect well on you.

To make that point clear, I reiterate my previous post:

HP was not only a "low volume high value" manufacturer. It also sold into extremely cost sensitive and time sensitive markets: consider PCs and peripherals.

I was chatting to the person in charge of HP's PC division (before he went on to become a Microsoft VP). He explicitly regarded PCs as being like bananas: low margin, and if they stayed on the shelf for too long they began to smell.

OTOH I know of other HP divisions where they spent large amounts of money on hardware accelerators used to simulate ICs. The reasoning was simple: if you can accelerate a novel product's development so it hits the market earlier, then you gain a lot of sales over the product lifecycle.

Nonetheless, in all those cases not meeting the specification was regarded as unacceptable - and several "Bill and Dave" stories were used to drive home that point.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Multi Microcontroller design.
« Reply #37 on: May 05, 2020, 04:52:55 pm »
When I said "specialist low volume high value manufacturer" I was not talking about HP. I meant small shops specializing in custom projects. Such shops can't afford an MCU change, and they don't need to, because MCU price is irrelevant in that kind of business. Get it now?
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #38 on: May 05, 2020, 05:02:22 pm »
When I said "specialist low volume high value manufacturer" I was not talking about HP. I meant small shops specializing in custom projects. Such shops can't afford an MCU change, and they don't need to, because MCU price is irrelevant in that kind of business. Get it now?

In many markets HP was a "specialist low volume high value manufacturer". I'm sure you are aware of all their test equipment; that tradition is carried on by the other "parts" of HPAK: Agilent, now Keysight.

Hence it was entirely reasonable to presume your description was referring to HP. You have now chosen to, ahem, "clarify" that. OK.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Multi Microcontroller design.
« Reply #39 on: May 05, 2020, 05:34:26 pm »
Hence it was entirely reasonable to presume your description was referring to HP. You have now chosen to, ahem, "clarify" that. OK.
Wishful thinking. I will let others decide whether "specialist product manufacturer" sounds like HP in the following context or not, especially knowing that the explanation of HP as a specialist product manufacturer came from you *after* my post:

At HP it was drummed into us that time-to-market wasn't the most important thing.
Any specialist low volume, high value product manufacturer will laugh about such "rule". In short: it depends.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: Multi Microcontroller design.
« Reply #40 on: May 07, 2020, 12:39:45 pm »
Quote from: Siwastaja
Or you can do what I did just a few days ago: turn off the dynamic branch prediction, copy the program into ITCM, and verify the timing on an oscilloscope. Yes, I needed cycle-accurate. Yes, it is cycle-accurate. Yes, it works exactly the same every time.

OK, you slugged the system. That's sufficient for the timing, but wasteful and inelegant.

Good guess, but just wrong.

The system has a small cycle-accurate bit, but before and after that, it does normal processing, which may be written by others. Branch prediction is turned off for the cycle-accurate bit only, which is mostly unrolled anyway, so no significant performance loss. For this bit, only worst-case performance matters. Can't share the exact details, but let's say, it needs to generate a 3.000MHz signal (with some other details). With branch prediction enabled, it initially generates a 3.000MHz signal for one cycle, then something like 3.050MHz. Nothing is "crippled" by forcing it to generate the desired cycle-accurate signal equal to the worst-case performance. (You could achieve the same result keeping the branch predictor enabled; but I feel it's more elegant to turn off the error source, than to "trick" it, even if "tricking" could be proved to work consistently.)

After the small cycle-accurate bit has been run, all CPU features are back on. Now an M7 core is a lot more capable, performance-wise, than an XMOS core, thanks to strong pipelining, dual-issue, branch prediction, caches, and so on. Single-thread performance helps software development; the subsequent processing can easily be written in C, without taking extra measures to parallelise the code (parallel data processing isn't always trivial!), which would be necessary on a multicore system where each core is comparatively inefficient.

Because most of the code is where M7 shines - general purpose (not constrained by worst-case timing requirements; average case is important) data processing and user interfaces - and only a tiny part is cycle-accurate, it makes sense to choose M7 and "force" it to do the cycle-accurate part as well, even though it's extra work. M7 is low-cost, as well.

You could do it otherwise - choose something that's better suited for cycle-accurate work, but less suited for available developers to write efficient UI and data processing code on something they are familiar with. Or, you could build a multi-chip solution. All of these are more expensive, and the end result is not magically any better.

Quote
You still had to measure the timing. That is equivalent to inspection, and as engineers used to be taught, "you can't inspect quality into a product".

Nice try, but you are just plain wrong.

What the XMOS IDE does is simulate the internal operations of the CPU and report the result to the user.

In digital logic, simulating (with proper simulation models) and measuring the number of clock cycles provide equivalent results. With improper simulation models, the actual hardware is right.

Buying the Cortex-M7 simulation model does exactly the same - the only differences are that it's expensive and likely has a more difficult user interface (like having to write your own SystemC wrapper).

Running the code on Cortex-M7 and measuring it does the same, though. It's, after all, implementing the actual digital logic. The operation runtime can be measured, and it will match the cycle-accurate simulation model.

So, cycle-accurate simulation is only a convenience. It's fundamentally not magically more "accurate" than running the actual digital logic. It's actually the opposite: the simulation model must contain zero bugs, whereas measuring the actual device provides the correct result even if the simulation model has a problem in it.

The error you make is that you seem to think measuring the number of clock cycles a digital system takes is somehow equivalent to measuring the voltage of a single prototype voltage reference, or the length of a single piece of cut metal. It is not.

You are not measuring a system with random error terms; you are measuring a circuit consisting of known digital logic, and if you can enumerate the possible sources of variation and show they do not exist, you have a guaranteed number of cycles.

But I'm sure it's way easier to do on XMOS, though. I'm sure they have put a lot of thought into providing an easy-to-use yet accurate simulation model integrated into the IDE. And, most importantly, they have "crippled" the CPU performance by deliberately not including modern-day CPU optimizations that would increase the average processing performance at the cost of variability in timing.

If you do timing-accurate work on a modern complex processor like an M7 MCU, you need to know which elements introduce timing variability, and know how to either disable them (note: this does not necessarily cripple performance, because they can be selectively disabled - like caches on certain memory areas - or just turned on/off periodically), or how to calculate the worst-case timing. The latter is more challenging, but completely possible.

I'm sure that if you have a lot of cycle-accurate work to do, Xcore is likely orders of magnitude easier to work with than a Cortex-M7. And I would love to work with one, if someone would pay me for doing that. For my own projects, I default to the most commonly available, most easily second-sourceable parts whenever possible, and only resort to specialist products with their special best-in-class features when I absolutely must.
« Last Edit: May 07, 2020, 12:54:28 pm by Siwastaja »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #41 on: May 07, 2020, 01:30:21 pm »
Quote from: Siwastaja
Or you can do what I did just a few days ago: turn off the dynamic branch prediction, copy the program into ITCM, and verify the timing on an oscilloscope. Yes, I needed cycle-accurate. Yes, it is cycle-accurate. Yes, it works exactly the same every time.

OK, you slugged the system. That's sufficient for the timing, but wasteful and inelegant.

Good guess, but just wrong.

The system has a small cycle-accurate bit, but before and after that, it does normal processing, which may be written by others. Branch prediction is turned off for the cycle-accurate bit only, which is mostly unrolled anyway, so no significant performance loss. For this bit, only worst-case performance matters. Can't share the exact details, but let's say, it needs to generate a 3.000MHz signal (with some other details). With branch prediction enabled, it initially generates a 3.000MHz signal for one cycle, then something like 3.050MHz. Nothing is "crippled" by forcing it to generate the desired cycle-accurate signal equal to the worst-case performance. (You could achieve the same result keeping the branch predictor enabled; but I feel it's more elegant to turn off the error source, than to "trick" it, even if "tricking" could be proved to work consistently.)

After the small cycle-accurate bit has run, all CPU features are back on. An M7 core is then a lot more capable, performance-wise, than an XMOS core, thanks to strong pipelining, dual-issue, branch prediction, caches, and so on. Single-thread performance helps software development; later processing can easily be written in C without taking extra measures to parallelize the code (parallel data processing isn't always trivial!), which would be necessary on a multicore system where each core is comparatively inefficient.

Because most of the code is where M7 shines - general purpose (not constrained by worst-case timing requirements; average case is important) data processing and user interfaces - and only a tiny part is cycle-accurate, it makes sense to choose M7 and "force" it to do the cycle-accurate part as well, even though it's extra work. M7 is low-cost, as well.

You could do it otherwise - choose something that's better suited for cycle-accurate work, but less suited for available developers to write efficient UI and data processing code on something they are familiar with. Or, you could build a multi-chip solution. All of these are more expensive, and the end result is not magically any better.

So, if I understand you, there is a small bit which has tight timing, then a lot with little timing constraint. And I guess the tight timing section has to be repeated starting at a precise instant. If that's the case then how much spare capacity do you have to allow in order for the tight timing section to restart at the right instant?

That's a rather peculiar workload compared with a typical real time workload where the problem is ensuring the many competing timing requirements are met.

Quote

Quote
You still had to measure the timing. That is equivalent to inspection, and as engineers used to be taught, "you can't inspect quality into a product".

Nice try, but you are just plain wrong.

What the XMOS IDE does is simulate the internal operations of the CPU and report the result to the user.

No. In the XMOS IDE, instructions are not simulated; they are counted. That's a very big difference.

The nearest to simulation is that you inspect an instruction to see if it is a branch instruction, and then "take" both directions and include both the counts in the analysis.
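That counting can be modelled in a few lines. The toy analyser below is purely an illustration of the idea (not XMOS's actual tool): on a core where every instruction has a fixed, data-independent cycle cost, walk the program and, at each branch, follow both directions and keep the larger total.

```go
package main

import "fmt"

// Block is a straight-line run of instructions plus an optional
// two-way branch at the end. Costs are in cycles; on a core with
// data-independent timing these are fixed per instruction.
type Block struct {
	Cycles      int    // cycles for the straight-line instructions
	Taken, Fall *Block // branch successors; both nil at a leaf
}

// worst returns the worst-case cycle count: at each branch, both
// directions are analysed and the larger total is kept.
func worst(b *Block) int {
	if b == nil {
		return 0
	}
	t, f := worst(b.Taken), worst(b.Fall)
	if t > f {
		return b.Cycles + t
	}
	return b.Cycles + f
}

func main() {
	// if (x) { 12 cycles } else { 30 cycles }, after 5 cycles of setup
	prog := &Block{Cycles: 5,
		Taken: &Block{Cycles: 12},
		Fall:  &Block{Cycles: 30},
	}
	fmt.Println(worst(prog)) // 35
}
```

Real tools also handle loops (via bounds supplied by the programmer), but the branch handling is exactly this: count, don't simulate.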

Quote
In digital logic, simulating (with proper simulation models) and measuring the number of clock cycles provide equivalent results. With improper simulation models, the actual hardware is right.

Buying the Cortex-M7 simulation model does exactly the same - the only difference, it's expensive, and likely has a more difficult user interface (like having to write your own SystemC wrapper).

Running the code on Cortex-M7 and measuring it does the same, though. It's, after all, implementing the actual digital logic. The operation runtime can be measured, and it will match the cycle-accurate simulation model.

So, cycle-accurate simulation is only convenient. It's fundamentally not magically more "accurate" than running the actual digital logic. It's actually the opposite: the simulation model must contain zero bugs. Measuring the actual device provides the correct result, in case the simulation model has any problem in it.

The error you make is that you seem to think measuring the number of clock cycles a digital system takes is somehow equivalent to measuring the voltage of a single prototype voltage reference, or the length of a single piece of cut metal. It is not.

You are not measuring a system with random error terms; you are measuring a circuit consisting of known digital logic, and if you can enumerate the possible sources of variation and show they do not exist, you have a guaranteed number of cycles.

But I'm sure it's way easier to do on XMOS, though. I'm sure they have put a lot of thought into providing an easy-to-use yet accurate simulation model integrated into the IDE. And, most importantly, they have "crippled" the CPU performance by deliberately not including modern-day CPU optimizations that would increase the average processing performance at the cost of variability in timing.

If you do timing-accurate work on a modern complex processor like an M7 MCU, you need to know which elements introduce timing variability, and know how to either disable them (note: this does not necessarily cripple performance, because they can be selectively disabled - like caches on certain memory areas - or just turned on/off periodically), or how to calculate the worst-case timing. The latter is more challenging, but completely possible.

I'm sure that if you have a lot of cycle-accurate work to do, Xcore is likely orders of magnitude easier to work with than a Cortex-M7. And I would love to work with one, if someone would pay me for doing that. For my own projects, I default to the most commonly available, most easily second-sourceable parts whenever possible, and only resort to specialist products with their special best-in-class features when I absolutely must.

I don't understand the point you are trying to make.

In this context simulation+measurement is equivalent to execution+measurement; that seems blindingly obvious and is the crux of the limitations. Given the inherent variability in execution times in a system with caches and prediction and interrupts, that's all you can do. Just hope you have stumbled on the worst case execution times. That is inelegant, and uninteresting "business as normal".

The XMOS system removes the variability by design, plus keeps the aggregate throughput high by other means. That's interesting and novel.

Sun managed to do the same in its Niagara processors, but that was in the server market.
« Last Edit: May 07, 2020, 01:38:16 pm by tggzzz »
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26912
  • Country: nl
    • NCT Developments
Re: Multi Microcontroller design.
« Reply #42 on: May 07, 2020, 01:58:16 pm »
I still don't get why you'd need cycle accurate timing in software execution on a modern high speed microcontroller. Usually there are many tasks and modern hardware peripherals are perfectly capable of providing accurate (ns level) timing using timers & DMA to make things happen. When I need accurate timing I choose a microcontroller which has peripherals which can do the accurate timing in hardware. The software then needs to make sure to fill buffers / change parameters in time but the timing requirements for such actions are way more relaxed.
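The "way more relaxed" software deadline is easy to quantify. Assuming a circular DMA buffer serviced half-at-a-time (a common pattern; the buffer size and sample rate below are illustrative, not from the post):

```go
package main

import "fmt"

// refillDeadline returns the time the software has to refill one half
// of a DMA ring buffer, while the peripheral drains the other half.
func refillDeadline(totalSamples int, sampleHz float64) float64 {
	return float64(totalSamples/2) / sampleHz
}

func main() {
	// 1024-sample ring buffer streamed at 48 kHz: the CPU gets roughly
	// 10.7 ms per half-buffer - a far softer deadline than reacting to
	// every individual sample, which the timer/DMA hardware handles.
	fmt.Printf("%.4f s\n", refillDeadline(1024, 48000))
}
```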
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: Siwastaja

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: Multi Microcontroller design.
« Reply #43 on: May 07, 2020, 02:05:15 pm »
I still don't get why you'd need cycle accurate timing in software execution on a modern high speed microcontroller. Usually there are many tasks and modern hardware peripherals are perfectly capable of providing accurate (ns level) timing using timers & DMA to make things happen. When I need accurate timing I choose a microcontroller which has peripherals which can do the accurate timing in hardware. The software then needs to make sure to fill buffers / change parameters in time but the timing requirements for such actions are way more relaxed.

Spot on - cycle accurate code is a rare special case.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #44 on: May 07, 2020, 02:39:02 pm »
I still don't get why you'd need cycle accurate timing in software execution on a modern high speed microcontroller.

You don't need cycle accurate timing per se. It is the means to the benefit: see below.

"High speed" is an irrelevant adjective, unless you add "...relative to...".

Quote
Usually there are many tasks and modern hardware peripherals are perfectly capable of providing accurate (ns level) timing using timers & DMA to make things happen. When I need accurate timing I choose a microcontroller which has peripherals which can do the accurate timing in hardware. The software then needs to make sure to fill buffers / change parameters in time but the timing requirements for such actions are way more relaxed.

And that's the crux of the problem: proving that the software will do that!

For hard realtime systems you need to know the worst case performance. That means you must know the worst case execution time between here and there. The question is how you achieve that. The problem is the unpredictability that caches and predictors must by design introduce - they only increase performance on average, but we need worst case.

The first approach is to measure and hope you happen to have spotted the worst case instruction path and pessimal data-dependent timing. Yuck.

The second approach is to turn off all caches and predictors, then measure, and then hope you have exercised the worst-case instruction path. The consequence is that the processor is much slower - so why didn't you use a lower-cost processor in the first place?

The third approach is to overspecify and underutilise a processor. That's fine (albeit wasteful) except when your problem is pushing the limits of the technology (where I normally operate), or when certification requires proof (where I have operated).

The fourth approach is to use dedicated hardware without a processor. That's fine, but expensive.

The fifth is to know in advance exactly how long a processor will take to execute a task, and that is best achieved by counting instructions and having zero data-dependent timing. That implies a relatively low performance core, as in the second approach. Then the question is how to "rescue" yourself from "lost" performance, and that can be achieved by multicore parallelism. That raises the problems of inter-core comms and multi-core programming, and that's where xC and xCORE are unique.
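For readers unfamiliar with the CSP style the fifth approach relies on: Go's channels are the closest mainstream analogue to xC's channels, so a rough flavour of "one task per job, communicating over channels rather than sharing state" can be sketched in Go. This is a stand-in only - Go gives none of xC's timing guarantees:

```go
package main

import "fmt"

// producer plays the role of an I/O task: it emits values and never
// shares state with the consumer - the channel is the only coupling.
func producer(out chan<- int, n int) {
	for i := 0; i < n; i++ {
		out <- i
	}
	close(out)
}

// consumer plays the role of a processing task on another core.
func consumer(in <-chan int, done chan<- int) {
	sum := 0
	for v := range in {
		sum += v
	}
	done <- sum
}

func main() {
	ch := make(chan int)
	done := make(chan int)
	go producer(ch, 10)
	go consumer(ch, done)
	fmt.Println(<-done) // 45
}
```

On xCORE the two tasks would be pinned to separate logical cores with statically known instruction counts; here the Go runtime schedules them, which is precisely the variability the fifth approach is designed to eliminate.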

If you have low-end requirements, then you can throw technology at the problem and the solution will be adequate (if inelegant :) ).

In all the real-time systems I've developed over the decades, I've always been pushing the limits of what processors can achieve. The discrete hardware I've designed into such systems has been the minimum necessary to bridge between what the processor can and will achieve, and what the system requires.

But then I've never used a processor to blink a LED :) YMMV :)
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 
The following users thanked this post: SiliconWizard

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: Multi Microcontroller design.
« Reply #45 on: May 07, 2020, 04:18:32 pm »
That implies a relatively low performance core, as in the second approach. Then the question is how to "rescue" yourself from "lost" performance, and that can be achieved by multicore parallelism.

Bingo, and improving processing performance by utilizing multicore parallelism is notoriously hard, and full of synchronization traps - I'm sure you know this, and I'm sure you can't be claiming that the XMOS IDE would be the silver bullet to solve such a vast problem field once and for all.

Single-core performance matters. Especially if the algorithm is non-trivial to parallelize, yet you want simplicity and predictability, which is just what is needed in embedded systems.

You talk about interrupts like they are some kind of burden or error source. Actually they aren't, they make your life easier. You get easy timing guarantees with interrupts! Just look at the worst-case interrupt latency, which is easy to calculate (even if you need to take flash fetch latency into account), and look at the priorities, higher-priority interrupt pre-empts lower priority one. No need for guesswork.
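That worst-case latency calculation is literally a sum of worst cases. A sketch, with purely illustrative cycle figures (not from any datasheet): latency is bounded by exception entry, plus the longest interrupts-disabled critical section, plus every higher-priority handler that could run first.

```go
package main

import "fmt"

// worstLatency returns a worst-case latency bound, in cycles, for an
// interrupt at a given priority: exception entry, plus the longest
// critical section, plus every higher-priority handler that may be
// pending and run first. A conservative bound, not an average.
func worstLatency(entryCycles, maxCriticalSection int, higherPrioHandlers []int) int {
	total := entryCycles + maxCriticalSection
	for _, h := range higherPrioHandlers {
		total += h
	}
	return total
}

func main() {
	// illustrative figures: 12-cycle exception entry, a 40-cycle
	// critical section, two higher-priority handlers of 100 and 250 cycles
	fmt.Println(worstLatency(12, 40, []int{100, 250})) // 402
}
```

Flash wait states would be folded into the per-handler figures on a real part; the structure of the bound stays the same.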

As I see it, multi-core parallelism as implemented by XMOS would really shine at independent IO interfaces that no classic hardwired peripheral exists for (such cases are fairly rare, though, but I can see you could replace a more costly FPGA with this in some cases), or at implementing easily parallelizable algorithms.

Though, to be fair, my earlier example with cycle-accurate thing on M7, by the way, would be perfectly well handled by standard MCU peripherals, cycle accurate and no CPU interaction at all, if ST Microelectronics could get their shit together and design MCU peripherals that work.

But when ST fails to provide an SPI interface which can access the most common types of SPI devices, I need to nearly bit-bang, and then I indeed could use XMOS to do that part a bit more easily. But I still choose to do it with M7 with a bit more complexity. The end-result pays off, still, because I can leverage all the processing power, popularity, part availability and price of the ARM ecosystem. If I change to a different MCU, the chances are the SPI and DMA "just work" there and I can ditch that (small) part of the code, but if not, the cycle accurate code will be equally cycle accurate on the other manufacturer's M7!

Quote
I've always been pushing the limits of what processors can achieve. The discrete hardware I've designed into such systems has been the minimum necessary to bridge between what the processor can and will achieve, and what the system requires.

Yet strangely, I have very similar goals. One of my goals is also to be able to depend on very common, second-sourceable, popular parts and take them to their limits. Minimizing BOM line count and BOM cost, making the MCU do whatever it can. Implement a DC/DC controller with software to save a $2 DC/DC controller IC when the MCU is there anyway for other tasks. For microcontroller cores, I default to ARM very much thank you, but that's only for highly practical reasons.

I'm sure a 32-core design will keep most of the cores just waiting most of the time*, though. Which I'm completely OK with, I just don't understand what you actually mean by "bridging between what the processor can and will achieve", but clearly it isn't efficient usage of the resources. Which is interesting, because you seem to be implying repeatedly that an "overperformance" ARM is a problem, but an "over-cored" XMOS somehow isn't.

*) Unless it's a completely trivial small piece implementing a continuous, fully parallelizable data processing algorithm.

Quote
"High speed" are irrelevant adjectives, unless you add a statement "...relative to...".

Relative to the requirements, obviously...
« Last Edit: May 07, 2020, 04:39:05 pm by Siwastaja »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #46 on: May 07, 2020, 05:16:25 pm »
That implies a relatively low performance core, as in the second approach. Then the question is how to "rescue" yourself from "lost" performance, and that can be achieved by multicore parallelism.

Bingo, and improving processing performance by utilizing multicore parallelism is notoriously hard, and full of synchronization traps - I'm sure you know this, and I'm sure you can't be claiming that the XMOS IDE would be the silver-bullet to solve such vast problem field once and for all.

There are no silver bullets.

Hardware is inherently parallel, but people don't complain about it being "notoriously hard" and "full of synchronisation traps". Why not? A significant part of that is because the design patterns and tools have evolved to deal with it and make best use of it. Such design patterns plus tools are notoriously absent from software, especially at the embedded level where ad-hoc "seems to work" hacks are normal.

The CSP/xC/xCORE mentality, design patterns and tools are the best hardware/software plus tools system available, especially in the hard realtime arena. If you know of alternatives where the hardware+software+toolset are all tied together as a coherent whole, please let us know.

Quote
Single-core performance matters. Especially if the algorithm is non-trivial to parallelize, yet you want simplicity and predictability, which is just what is needed in embedded systems.

You talk about interrupts like they are some kind of burden or error source. Actually they aren't, they make your life easier. You get easy timing guarantees with interrupts! Just look at the worst-case interrupt latency, which is easy to calculate (even if you need to take flash fetch latency into account), and look at the priorities, higher-priority interrupt pre-empts lower priority one. No need for guesswork.

Interrupts can, with care and attention, make a few easy timing guarantees. But they bring many problems, especially where there are many interrupts with different priorities, and many tasks with many priorities, and schedulers. Unless your processors can be underutilised, I'm sure you have been bitten by such.

If you think about it, interrupts plus a single core are no more than a workaround for not having multiple independent cores.

Quote
As I see it, multi-core parallelism as implemented by XMOS would really shine at independent IO interfaces that no classic hardwired peripheral exists for (such cases are fairly rare, though, but I can see you could replace a more costly FPGA with this in some cases), or at implementing easily parallelizable algorithms.

Bingo! Except...

My experience in the embedded world is that "classic hardwired peripherals" either don't exist or are limited to clocked/strobed/SERDES interfaces. xCORE has all of those easily accessible by software. Some processors have dedicated silicon for USB and Ethernet peripherals, even though that can be done in xC.

Actually I think they shine with ad-hoc algorithms that are independent but aren't easily parallelisable. Algorithms that are easily parallelisable tend to imply many, many threads of control, and the XMOS cores are a strictly limited resource.

Quote
Though, to be fair, my earlier example with cycle-accurate thing on M7, by the way, would be perfectly well handled by standard MCU peripherals, cycle accurate and no CPU interaction at all, if ST Microelectronics could get their shit together and design MCU peripherals that work.

I can't comment on that, but I was very pleasantly surprised at how easy it was to setup and use the xCORE i/o ports, plus I didn't spot any errata.

None of that is inherent in the xCORE/xC system, but it might be a property that emerges from having people who (appear to) understand both hardware and software, and how they need to work together in embedded realtime systems.

In most places the hardware and software teams occasionally throw things over a partition, and assume that the other people will solve what they haven't got right.

Quote
But when ST fails to provide an SPI interface which can access the most common types of SPI devices, I need to nearly bit-bang, and then I indeed could use XMOS to do that part a bit more easily. But I still choose to do it with M7 with a bit more complexity. The end-result pays off, still, because I can leverage all the processing power, popularity, part availability and price of the ARM ecosystem. If I change to a different MCU, the chances are the SPI and DMA "just work" there and I can ditch that (small) part of the code, but if not, the cycle accurate code will be equally cycle accurate on the other manufacturer's M7!

The breadth of the ARM ecosystem is definitely an advantage. In spite of that, XMOS appears to be successful and growing; they must have some "secret sauce" :)

Quote
Quote
I've always been pushing the limits of what processors can achieve. The discrete hardware I've designed into such systems has been the minimum necessary to bridge between what the processor can and will achieve, and what the system requires.

Yet strangely, I have very similar goals. One of my goals is also to be able to depend on very common, second-sourceable, popular parts and take them to their limits. Minimizing BOM line count and BOM cost, making the MCU do whatever it can. Implement a DC/DC controller with software to save a $2 DC/DC controller IC when the MCU is there anyway for other tasks. For microcontroller cores, I default to ARM very much thank you, but that's only for highly practical reasons.

Those are valid pragmatic reasons. But they don't detract from the xC/xCORE ecosystem's value.

Quote
I'm sure a 32-core design will keep most of the cores just waiting most of the time, though. I don't think this can be called "bridging between what the processor can and will achieve".

A 32-core MCU doesn't have 32 physical cores :) It has 4 "Intel-style" cores, each of which implements 8 "SMT threads" - the silicon cost is merely a duplicated register set and multiplexers.

Sun did exactly the same with their Niagara server chips, where each core's cycle time was the same as the memory's, thus removing the need for out-of-order processing and some levels of cache. Superb for "embarassingly parallel" workloads; I used that to great effect with telecom server applications.
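The arithmetic behind those SMT threads can be sketched. Assuming a round-robin pipeline where each of n active threads issues at most once per max(n, p) cycles - p being the pipeline depth; the 500 MHz clock and p = 5 below are assumptions for illustration, not vendor specifications:

```go
package main

import "fmt"

// guaranteedMIPS returns the minimum instruction rate each logical
// core is guaranteed on a round-robin SMT pipeline: with n active
// threads on a pipeline of depth p, each thread issues at most once
// every max(n, p) cycles, so the guarantee never exceeds clock/p.
func guaranteedMIPS(clockMHz float64, n, p int) float64 {
	d := n
	if p > d {
		d = p
	}
	return clockMHz / float64(d)
}

func main() {
	// assumed figures: 500 MHz tile, 5-deep pipeline
	fmt.Println(guaranteedMIPS(500, 8, 5)) // 62.5 MIPS each with 8 threads
	fmt.Println(guaranteedMIPS(500, 4, 5)) // 100 MIPS each with 4 threads
}
```

The key property is that the per-thread figure is a guaranteed minimum, not an average - which is exactly what hard-realtime analysis needs.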
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 
The following users thanked this post: Siwastaja

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #47 on: May 07, 2020, 05:30:27 pm »
Looks like an edit crossed in the æther :)

I'm sure a 32-core design will keep most of the cores just waiting most of the time*, though. Which I'm completely OK with, I just don't understand what you actually mean by "bridging between what the processor can and will achieve", but clearly it isn't efficient usage of the resources. Which is interesting, because you seem to be implying repeatedly that an "overperformance" ARM is a problem, but an "over-cored" XMOS somehow isn't.

Having an "excessively" underutilised MCU doesn't actually solve the hard realtime problem - it merely makes it less likely the problem will be noticed. Having too many cores is merely inefficient - and once you realise how 8 cores are implemented on one tile, you understand that the inefficiency penalty in terms of silicon area and/or power is small.

Quote
*) Unless it's a completely trivial small piece implementing a continuous, fully parallelizable data processing algorithm.

Actually, that's not the case. "Fully parallelisable" tends to imply many[1] parallel operations, and the xCORE cores are a strictly limited resource.

[1] as in "there are only three numbers: zero, one, and many".
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14490
  • Country: fr
Re: Multi Microcontroller design.
« Reply #48 on: May 07, 2020, 06:53:03 pm »
Sun did exactly the same with their Niagara server chips, where each core's cycle time was the same as the memory's, thus removing the need for out-of-order processing and some levels of cache. Superb for "embarassingly parallel" workloads; I used that to great effect with telecom server applications.

I suppose you're talking about the UltraSPARC T1: https://en.wikipedia.org/wiki/UltraSPARC_T1
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19522
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Multi Microcontroller design.
« Reply #49 on: May 07, 2020, 07:35:12 pm »
Sun did exactly the same with their Niagara server chips, where each core's cycle time was the same as the memory's, thus removing the need for out-of-order processing and some levels of cache. Superb for "embarassingly parallel" workloads; I used that to great effect with telecom server applications.

I suppose you're talking about the UltraSPARC T1: https://en.wikipedia.org/wiki/UltraSPARC_T1

Exactly.

They sounded good on paper, and people in two companies were surprised at the soft realtime performance my Java application achieved. But then event driven telecoms applications have always been embarrassingly parallel.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

