Topic: Asymmetric multiprocessing considered harmful

tabemann · « **on:** April 15, 2023, 10:48:15 pm »

For a long time I have been considering whether I should make a RISC-V port of my Forth-based RTOS, zeptoforth, which currently targets ARM Cortex-M0+/M4/M7. When I first saw the mention of the Pine64 Ox64 I thought "maybe I will do that sometime in the near future" - it is (or will be, once you can buy it) price-competitive with the ARM Cortex-M0+ RP2040-based boards that I have been focusing on lately, and significantly more powerful. But when I saw the actual specs of the thing I immediately recoiled - three cores, each running a different RISC-V architecture, one of them 64-bit and two of them 32-bit?! Designing an RTOS to take advantage of such a design is insane! Were I to make a port, I would just use the 64-bit core and ignore the two 32-bit cores.

This contrasts with the RP2040, which is a very suitable target for symmetric multiprocessing due to its symmetric design. Also consider the case of the STM32H745 DISCOVERY, which I own - it has separate ARM Cortex-M4 and Cortex-M7 cores, and their asymmetry makes it a less attractive target; were I to target it I would probably just target the Cortex-M7 core and largely ignore the Cortex-M4 core (although the board's cost, and the resulting lower demand for support, makes me unlikely to bother; my board is still in its packaging, for one).

All of this taken into consideration, I wish that manufacturers would just make symmetric designs and not bother with asymmetric ones. Asymmetric designs sound attractive to the kind of people who say "well, we can do high performance computing on the 64-bit core, and low-power operation on one of the 32-bit cores, all at the same time!", but they make targeting them with practical RTOS designs a major PITA.

nctnico · « **Reply #1 on:** April 15, 2023, 10:56:49 pm »

As I see it, those designs are not meant to have all cores used by the same OS. They provide a strict separation between high-level communication features, which typically work well under a pre-emptive multitasking OS, and tasks that must meet predictable and very short (sub-1 ms) time intervals, which do not need an OS and may even be hindered by one.

coppice · « **Reply #2 on:** April 15, 2023, 11:11:26 pm »

This looks like a typical radio SoC. You have cores where you put a radio stack, get it approved and then leave it alone as much as you possibly can. Then you have cores where you can put applications, modify them, and not sink yourself back into the mire of complex approval processes every time. You very much DON'T want a single OS running across those cores.

tabemann · « **Reply #3 on:** April 15, 2023, 11:41:42 pm »

Quote from: coppice on April 15, 2023, 11:11:26 pm

This looks like a typical radio SoC. You have cores where you put a radio stack, get it approved and then leave it alone as much as you possibly can. Then you have cores where you can put applications, modify them, and not sink yourself back into the mire of complex approval processes every time. You very much DON'T want a single OS running across those cores.

That is the only way this design even makes sense. I was thinking "even if they were going to have an asymmetric design, could they have been kind enough to use the same architecture across each of them?" - even in the case of my STM32H745, both cores at least are ARMv7-M, so I could use one compiler for both of them, were I ever to bother with targeting that design. But with such a radically asymmetric design as this, you can't even use the same compiler configuration for all the cores at once! You have to compile different parts of your code separately, and then integrate them together somehow. So it would make sense if those 32-bit cores were meant to run binary blobs that you, the programmer, have not even implemented (especially since the thing is marketed as supporting WiFi, BLE, and Zigbee).

SiliconWizard · « **Reply #4 on:** April 16, 2023, 12:34:53 am »

A bit more difficult /= harmful.

And separation of concerns isn't a bad thing.

tabemann · « **Reply #5 on:** April 16, 2023, 12:46:52 am »

Quote from: SiliconWizard on April 16, 2023, 12:34:53 am

A bit more difficult /= harmful.

Not simply a bit more difficult, but practically unusable for a conventional multiprocessing design. And yet, at the same time, probably too closely coupled for the opposite, i.e. the kind of completely separated radio stack I personally favor. In a way, it looks as if you took something like the ESP32, which is a basically symmetric design where the radio stack lives in the same space as user code and under the same RTOS, and made it bizarrely asymmetric.

Quote from: SiliconWizard on April 16, 2023, 12:34:53 am

And separation of concerns isn't a bad thing.

It isn't - I personally favor designs where WiFi or Bluetooth radios are completely separate from the main MCU, because then the main MCU is not complicated by having to deal with the inner workings of the (in most cases) binary blobbed radio. Too bad the two RP2040-based designs with radio that I've looked at both have issues - the Pico W has major issues with licensing due to "non-commercial" restrictions on the CYW43 driver made available by Damien George (and thus any drivers derived from it), and the Wio RP2040's ESP8285 radio is simply very buggy and unreliable, such that I have had to abandon support for it.

brucehoult · « **Reply #6 on:** April 16, 2023, 01:04:26 am »

I really don't get your problem here.

The RP2040 is also asymmetric, with two C-M0+ cores and then the very very limited (far more so than RV32EMC) PIO.

The ox64's 320 MHz E907 core is easily 50% more powerful than the two RP2040 CM0 cores combined, not even counting the FPU. With 32 registers vs the 16 on the CM0 you could write two tasks that each use half of the registers and switch between them in a couple of clock cycles, either using JAL/RET or interrupt/MRET. Note that on RISC-V, x0 (the ZERO register) is the *only* register with a dedicated function -- any register can serve equally well as stack pointer, any register can serve equally well as link register. (The C extension is optimised around the standard ABI, but the only difference that makes is how often you can use a 2-byte opcode instead of a 4-byte opcode, which is done transparently by the assembler)

The E902 can do the job of PIO, but is a lot more powerful -- I imagine it's intended mostly to run the software stack for the radio. It's also more powerful than one CM0 core on the RP2040.

And THEN you have the 480 MHz 64 bit Linux core on top of all that. And 64 MB RAM instead of 0.26 MB.

It seems an amazing value to me. And you can only call RP2040 "symmetrical" if you ignore the PIO.

The three cores on the BL808 at least all use the same basic instruction set and same compiler.

The R Pi Foundation do win on their comprehensive documentation and example code.

tabemann · « **Reply #7 on:** April 16, 2023, 01:23:05 am »

Quote from: brucehoult on April 16, 2023, 01:04:26 am

I really don't get your problem here.

The RP2040 is also asymmetric, with two C-M0+ cores and then the very very limited (far more so than RV32EMC) PIO.

The ox64's 320 MHz E907 core is easily 50% more powerful than the two RP2040 CM0 cores combined, not even counting the FPU. With 32 registers vs the 16 on the CM0 you could write two tasks that each use half of the registers and switch between them in a couple of clock cycles, either using JAL/RET or interrupt/MRET. Note that on RISC-V, x0 (the ZERO register) is the *only* register with a dedicated function -- any register can serve equally well as stack pointer, any register can serve equally well as link register. (The C extension is optimised around the standard ABI, but the only difference that makes is how often you can use a 2-byte opcode instead of a 4-byte opcode, which is done transparently by the assembler)

The E902 can do the job of PIO, but is a lot more powerful -- I imagine it's intended mostly to run the software stack for the radio. It's also more powerful than one CM0 core on the RP2040.

And THEN you have the 480 MHz 64 bit Linux core on top of all that. And 64 MB RAM instead of 0.26 MB.

It seems an amazing value to me. And you can only call RP2040 "symmetrical" if you ignore the PIO.

The three cores on the BL808 at least all use the same basic instruction set and same compiler.

The R Pi Foundation do win on their comprehensive documentation and example code.

Yes, the 64-bit core on this design is by itself much more powerful than the RP2040, since even without the other cores it would be essentially either a very small SBC or a very large MCU, depending on how you look at it. But sometimes sheer power is not the be-all and end-all of things.

Take for instance the RP2040 - part of the big advantage of being a dual-core design is I can put most of my code on one core, and time-critical code on the other core, and yet have them share not only memory space but code and multitasking constructs. I cannot do this on the single-core designs I support, i.e. the STM32F407, the STM32F411, the STM32F746, and the STM32L476, even though, say, the STM32F746 can run circles around the RP2040 when it comes to throughput. This is also why I specifically did not implement load-balancing between cores on the RP2040 (even though I contemplated it): I realized there is value in being able to explicitly stick everything where timing does not matter on one core, and particular things that are timing-sensitive (and which are too complex to be implemented via PIO) on another core.

And yes, you could say this is why the 32-bit cores have been added to this design: so you can run time-critical stuff independently of the 64-bit core, just as I described with the RP2040. However, the fact that a different architecture was chosen for each core makes things, well, inconvenient. Were I to port zeptoforth to this board, I'd immediately run into one of two issues. Either I could not use one compiler (zeptoforth, for the record, includes a native-code compiler) for all three cores, and I could not share code across all three cores, or, if it turns out that some subset of RISC-V will run on all three cores (I must admit that I have not looked at the details of RISC-V that closely), the code would be limited to the lowest common denominator supported by all three cores.

ejeffrey · « **Reply #8 on:** April 16, 2023, 03:44:40 am »

Quote from: tabemann on April 16, 2023, 01:23:05 am

I could not share code across all three cores,

You are not supposed to do that, it defeats the entire purpose of this sort of SoC. You are supposed to run your main OS on the big core and treat the small cores as microcontrollers that happen to be on the same die.

This is basically the equivalent of saying that a modern Intel laptop cpu is bad because you can't share code between the x86_64 cores and the GPU.

This is very common in the ARM world, to have Cortex-A cores for the user OS, and Cortex-M cores as microcontrollers. They can be for realtime tasks, power management, or security, depending on the application.

Given that all three are RISC-V variants you can use the same compiler with appropriate architecture flags, but you won't be using the same binary or migrating code from one to another.

There are asymmetric cases where you do share code, such as big.LITTLE. Then you do ideally want the big and little cores to have the same ISA. Intel had a problem with that on their recent CPUs: because the E cores didn't support AVX-512, they ended up having to disable it on the performance cores as well.

tabemann · « **Reply #9 on:** April 16, 2023, 04:00:18 am »

Quote from: ejeffrey on April 16, 2023, 03:44:40 am

Quote from: tabemann on April 16, 2023, 01:23:05 am
I could not share code across all three cores,

You are not supposed to do that, it defeats the entire purpose of this sort of SoC. You are supposed to run your main OS on the big core and treat the small cores as microcontrollers that happen to be on the same die.

This is basically the equivalent of saying that a modern Intel laptop cpu is bad because you can't share code between the x86_64 cores and the GPU.

This is very common in the ARM world, to have Cortex-A cores for the user OS, and Cortex-M cores as microcontrollers. They can be for realtime tasks, power management, or security, depending on the application.

Given that all three are RISC-V variants you can use the same compiler with appropriate architecture flags, but you won't be using the same binary or migrating code from one to another.

There are asymmetric cases where you do share code, such as big.LITTLE. Then you do ideally want the big and little cores to have the same ISA. Intel had a problem with that on their recent CPUs: because the E cores didn't support AVX-512, they ended up having to disable it on the performance cores as well.

The key thing is that this arrangement makes it hard for zeptoforth to support generating code for all three cores, because it would have to have code generators that would put out instructions for each core separately, and furthermore, because it inlines much of itself into the code it generates, it would have to have triple the code to inline, one version for each core. Furthermore, any code that is compiled would have to be compiled in triplicate, with one version for each core. Of course, this arrangement would be, well, impractical. Consequently, the only real way to practically make use of all but one of the cores is to include precompiled blobs and to not support runtime compilation of code. This is fine if your goal is simply to support a WiFi/BLE/Zigbee stack on a core, which probably is the real intent here, but if one is not using such a stack it is essentially wasted silicon.

ejeffrey · « **Reply #10 on:** April 16, 2023, 04:50:02 am »

Quote from: tabemann on April 16, 2023, 04:00:18 am

This is fine if your goal is simply to support a WiFi/BLE/Zigbee stack on a core, which probably is the real intent here, but if one is not using such a stack it is essentially wasted silicon.

Radio operation seems to be the intent but there are lots of other ways people use small cores like this. If it doesn't work for your application that's fine but lots of devices use asymmetric processors.

Wasted silicon is basically a non-problem.

Siwastaja · « **Reply #11 on:** April 16, 2023, 05:09:34 am »

You have completely misunderstood the idea, no wonder it hurts trying to push the square peg through the round hole.

The idea is not to abstract the whole thing as something that runs a general-purpose OS scheduling arbitrary tasks/threads onto those cores.

The idea is to, for example, use the smaller CM4 core to run a dedicated bare metal project which does a well-defined task of its own, and communicate with the other core, which can then run a different bare metal or maybe OS project. In such a case, the fact that they are of different (but similar) architectures is only a tiny bit of mental load.

If you want to support these things in your OS, the best approach is exactly to ignore the small auxiliary cores and only target the "main" core. Users who need the small cores then know exactly what they are doing.

SiliconWizard · « **Reply #12 on:** April 16, 2023, 05:10:55 am »

Quote from: tabemann on April 16, 2023, 04:00:18 am

Quote from: ejeffrey on April 16, 2023, 03:44:40 am
Quote from: tabemann on April 16, 2023, 01:23:05 am
I could not share code across all three cores,

You are not supposed to do that, it defeats the entire purpose of this sort of SoC. You are supposed to run your main OS on the big core and treat the small cores as microcontrollers that happen to be on the same die.

This is basically the equivalent of saying that a modern Intel laptop cpu is bad because you can't share code between the x86_64 cores and the GPU.

This is very common in the ARM world, to have Cortex-A cores for the user OS, and Cortex-M cores as microcontrollers. They can be for realtime tasks, power management, or security, depending on the application.

Given that all three are RISC-V variants you can use the same compiler with appropriate architecture flags, but you won't be using the same binary or migrating code from one to another.

There are asymmetric cases where you do share code, such as big.LITTLE. Then you do ideally want the big and little cores to have the same ISA. Intel had a problem with that on their recent CPUs: because the E cores didn't support AVX-512, they ended up having to disable it on the performance cores as well.

The key thing is that this arrangement makes it hard for zeptoforth to support generating code for all three cores, because it would have to have code generators that would put out instructions for each core separately, and furthermore, because it inlines much of itself into the code it generates, it would have to have triple the code to inline, one version for each core. Furthermore, any code that is compiled would have to be compiled in triplicate, with one version for each core. Of course, this arrangement would be, well, impractical. Consequently, the only real way to practically make use of all but one of the cores is to include precompiled blobs and to not support runtime compilation of code. This is fine if your goal is simply to support a WiFi/BLE/Zigbee stack on a core, which probably is the real intent here, but if one is not using such a stack it is essentially wasted silicon.

I understand that your issue here is not so much with the non-homogeneous multi-core architecture, but with the architecture of your tool.
I have no doubt making it able to handle all cores kind of transparently is going to be a significant endeavour, and as others have said, I'm not sure if it's the way to go.

But OTOH, that could be interesting as a generalization of your tool. Of course, I wouldn't suggest doing this *only* for a particular target, if you're interested in the concept, but more as a generalization of your system (that I don't really know.) Those non-homogeneous multi-core SoCs are likely to become even more common in the future IMO.
That would imply adding ways for the user to assign specific tasks to a specific core though, as the whole point of these SoCs is to do precisely that.

brucehoult · « **Reply #13 on:** April 16, 2023, 05:11:31 am »

Quote from: tabemann on April 16, 2023, 04:00:18 am

The key thing is that this arrangement makes it hard for zeptoforth to support generating code for all three cores, because it would have to have code generators that would put out instructions for each core separately, and furthermore, because it inlines much of itself into the code it generates, it would have to have triple the code to inline, one version for each core.

Unlikely. All three cores can run the same RV32EMC code, if you are careful about generating it. The whole memory map is in the lowest 4 GB, and everything except the boot ROM is in the first 2 GB, so you can happily use the big fast 64 bit core as a 32 bit core if you want to. Loading a 32 bit pointer from RAM will invisibly sign extend it to 64 bits, which is the same as zero-extending it since all addresses have the hi bit cleared. So then you can happily dereference it as if it was a 64 bit pointer all along. You can have functions just save and restore the lo 32 bits of registers (including function return addresses). The only two things you'd have to watch: don't depend on arithmetic overflow wrapping around at 2³², and be careful about using a left shift followed by a right shift for extracting zero-extended or sign-extended bitfields. If you put the shift count into a register then you can just use 63-N (or just -N) all the time as the 32 bit cores will only look at the lower 5 bits. (It's a spec violation if not) I'm not sure whether those 32 bit cores will trap with illegal instruction if you set bit 5 of the shift amount in a shift with an immediate operand for the shift count. I think those encodings might be reserved now, but weren't in 2019.

If you ignore the smallest core then the same code can run on the bigger 32 bit core and the 64 bit core using all 32 registers, and single-precision FP as well.
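
The two caveats above (pointer loads and 2³² wraparound) can be checked with a small model. This is only a sketch in Python, not RISC-V code: the helper names `lw_rv64`, `add_rv32` and `add_rv64` are made up for illustration, and the sample addresses are arbitrary values, not taken from the BL808 memory map.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def lw_rv64(word32: int) -> int:
    """Model an RV64 `lw`: load 32 bits, sign-extend into a 64-bit register."""
    word32 &= MASK32
    if word32 & 0x8000_0000:          # bit 31 set: upper 32 bits become ones
        word32 |= 0xFFFF_FFFF_0000_0000
    return word32


def add_rv32(a: int, b: int) -> int:
    """Model `add` on a 32-bit core: the result wraps at 2**32."""
    return (a + b) & MASK32


def add_rv64(a: int, b: int) -> int:
    """Model `add` on a 64-bit core: the result wraps at 2**64."""
    return (a + b) & MASK64


# Hypothetical address below 2 GB: bit 31 is clear, so sign-extension
# and zero-extension give the same 64-bit pointer.
low_pointer = 0x2000_1000
assert lw_rv64(low_pointer) == low_pointer

# Hypothetical address at or above 2 GB: the 64-bit register no longer
# holds the same number as the 32-bit pointer.
high_pointer = 0x9000_0000
assert lw_rv64(high_pointer) != high_pointer

# The same `add` wraps on the 32-bit cores but carries into bit 32 on the
# 64-bit core, so shared code must not rely on wraparound at 2**32.
assert add_rv32(0xFFFF_FFFF, 1) == 0
assert add_rv64(0xFFFF_FFFF, 1) == 0x1_0000_0000
```

The model only covers register values; it says nothing about how a particular code generator would avoid the wraparound case.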

tabemann · « **Reply #14 on:** April 16, 2023, 05:58:16 am »

Quote from: brucehoult on April 16, 2023, 05:11:31 am

Quote from: tabemann on April 16, 2023, 04:00:18 am
The key thing is that this arrangement makes it hard for zeptoforth to support generating code for all three cores, because it would have to have code generators that would put out instructions for each core separately, and furthermore, because it inlines much of itself into the code it generates, it would have to have triple the code to inline, one version for each core.

Unlikely. All three cores can run the same RV32EMC code, if you are careful about generating it. The whole memory map is in the lowest 4 GB, and everything except the boot ROM is in the first 2 GB, so you can happily use the big fast 64 bit core as a 32 bit core if you want to. Loading a 32 bit pointer from RAM will invisibly sign extend it to 64 bits, which is the same as zero-extending it since all addresses have the hi bit cleared. So then you can happily dereference it as if it was a 64 bit pointer all along. You can have functions just save and restore the lo 32 bits of registers (including function return addresses). The only two things you'd have to watch: don't depend on arithmetic overflow wrapping around at 2³², and be careful about using a left shift followed by a right shift for extracting zero-extended or sign-extended bitfields. If you put the shift count into a register then you can just use 63-N (or just -N) all the time as the 32 bit cores will only look at the lower 5 bits. (It's a spec violation if not) I'm not sure whether those 32 bit cores will trap with illegal instruction if you set bit 5 of the shift amount in a shift with an immediate operand for the shift count. I think those encodings might be reserved now, but weren't in 2019.

If you ignore the smallest core then the same code can run on the bigger 32 bit core and the 64 bit core using all 32 registers, and single-precision FP as well.

Sign extension is exactly an area that I was thinking would be tricky, even if otherwise lowest common denominator code can be generated. In general I would probably only want 32-bit cells; 64-bit would be a waste, and would introduce unnecessary incompatibility with other versions of zeptoforth. Also, I have no need for greater than 16 registers; zeptoforth on ARM Cortex-M does not even make use of all 16 registers available to it.

tabemann · « **Reply #15 on:** April 16, 2023, 06:05:39 am »

Quote from: Siwastaja on April 16, 2023, 05:09:34 am

You have completely misunderstood the idea, no wonder it hurts trying to push the square peg through the round hole.

The idea is not to abstract the whole thing as something that runs a general-purpose OS scheduling arbitrary tasks/threads onto those cores.

The idea is to, for example, use the smaller CM4 core to run a dedicated bare metal project which does a well-defined task of its own, and communicate with the other core, which can then run a different bare metal or maybe OS project. In such a case, the fact that they are of different (but similar) architectures is only a tiny bit of mental load.

If you want to support these things in your OS, the best approach is exactly to ignore the small auxiliary cores and only target the "main" core. Users who need the small cores then know exactly what they are doing.

What I would ideally want is for one compiler to be able to compile code shared between all cores, and then to offload tasks onto the cores best suited to them (e.g. high-throughput but high-latency code onto the biggest core, and low-throughput but low-latency code onto the smaller cores). This way the user could have one codebase, rather than having to separately compile code offline for a particular core and include it as an opaque binary alongside the other code. As I have now been informed, code can most likely be compiled for a lowest common denominator architecture that will run on all three cores, which would greatly simplify this; not being that familiar with RISC-V, I had been uncertain about this. Of course, that is unnecessary if all that is going to be run on the other cores are precompiled stacks independent of one's own codebase.

tabemann · « **Reply #16 on:** April 16, 2023, 06:09:46 am »

Quote from: SiliconWizard on April 16, 2023, 05:10:55 am

I understand that your issue here is not so much with the non-homogeneous multi-core architecture, but with the architecture of your tool.
I have no doubt making it able to handle all cores kind of transparently is going to be a significant endeavour, and as others have said, I'm not sure if it's the way to go.

But OTOH, that could be interesting as a generalization of your tool. Of course, I wouldn't suggest doing this *only* for a particular target, if you're interested in the concept, but more as a generalization of your system (that I don't really know.) Those non-homogeneous multi-core SoCs are likely to become even more common in the future IMO.
That would imply adding ways for the user to assign specific tasks to a specific core though, as the whole point of these SoCs is to do precisely that.

Currently tasks are always specifically assigned to cores, by design; there is no automatic core assignment (other than to default to running a new task on the same core as the task that spawned it), unlike in many OSes, for the very reason that the user may want to choose which core to run code on.

brucehoult · « **Reply #17 on:** April 16, 2023, 08:48:52 am »

Quote from: tabemann on April 16, 2023, 05:58:16 am

Sign extension is exactly an area that I was thinking would be tricky

As an example, to sign-extend an 8 bit value to full register:

On RV64I:

```
slli x,x,56
srai x,x,56
```

On RV32I:

```
slli x,x,24
srai x,x,24
```

The RV64 code might work on a sloppy RV32 core that doesn't strictly illegal instruction trap on undefined opcodes, but just looks at bits 4:0 of the literal (not bits 5:0 as RV64 does) and sees 56&0x1f = 24.

Works on both RV64 and RV32 (guaranteed by the spec):

```
li y,-8
sll x,x,y
sra x,x,y
```
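
To see why the register-count version gives the same result on both widths, here is a small Python model of `sll` and `sra`. It is a sketch, not an emulator: the function names and the `xlen` parameter are illustrative, and the only rule it relies on is the one stated in the thread, that register shifts use the low 5 bits of the count on RV32 and the low 6 bits on RV64.

```python
def sll(value: int, count: int, xlen: int) -> int:
    """Model `sll`: shift left, using only the low log2(xlen) bits of count."""
    mask = (1 << xlen) - 1
    shamt = count & (xlen - 1)        # 5 bits for xlen=32, 6 bits for xlen=64
    return (value << shamt) & mask


def sra(value: int, count: int, xlen: int) -> int:
    """Model `sra`: arithmetic shift right with the same count masking."""
    mask = (1 << xlen) - 1
    shamt = count & (xlen - 1)
    value &= mask
    if value >> (xlen - 1):           # top bit set: treat as a negative number
        value -= 1 << xlen
    return (value >> shamt) & mask


def sign_extend_byte(value: int, xlen: int) -> int:
    """Model `li y,-8 ; sll x,x,y ; sra x,x,y` for a register of width xlen."""
    count = -8                        # -8 masks to 24 on RV32 and to 56 on RV64
    return sra(sll(value, count, xlen), count, xlen)


for xlen in (32, 64):
    all_ones = (1 << xlen) - 1
    assert sign_extend_byte(0x7F, xlen) == 0x7F
    assert sign_extend_byte(0x80, xlen) == all_ones - 0x7F   # ...FF80
    assert sign_extend_byte(0xFF, xlen) == all_ones          # -1
```

The same masking explains the remark about immediates: `56 & 0x1f` is 24, so a core that ignores bit 5 of the literal would happen to do the right thing, but only the register form is guaranteed.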

DiTBho · « **Reply #18 on:** April 16, 2023, 11:21:07 am »

one core performing AES {encrypt, decrypt}
one core performing ZIP {compress, decompress}
one core performing other CPU tasks

I am done

SiliconWizard · « **Reply #19 on:** April 16, 2023, 07:22:50 pm »

Quote from: tabemann on April 16, 2023, 06:09:46 am

Quote from: SiliconWizard on April 16, 2023, 05:10:55 am
I understand that your issue here is not so much with the non-homogeneous multi-core architecture, but with the architecture of your tool.
I have no doubt making it able to handle all cores kind of transparently is going to be a significant endeavour, and as others have said, I'm not sure if it's the way to go.

But OTOH, that could be interesting as a generalization of your tool. Of course, I wouldn't suggest doing this *only* for a particular target, if you're interested in the concept, but more as a generalization of your system (that I don't really know.) Those non-homogeneous multi-core SoCs are likely to become even more common in the future IMO.
That would imply adding ways for the user to assign specific tasks to a specific core though, as the whole point of these SoCs is to do precisely that.

Currently tasks are always specifically assigned to cores, by design; there is no automatic core assignment (other than to default to running a new task on the same core as the task that spawned it), unlike in many OSes, for the very reason that the user may want to choose which core to run code on.

So your main issue is generating code for different cores from the same language then, and what you call a common code base?

Those different cores in general are likely to have not just differences in the instruction sets, but also peripherals and various specific features that you need to address anyway. It's not just about one core having "lower latency", one core having higher CPU performance, etc. It's also that completely different things can be achieved with them.

As the core assignment is in all likelihood done statically (I wouldn't see a point of doing that dynamically if it's not automatic), your "compiler" can generate code adapted for each as it knows which core does what.

Now one benefit of having a common code base would be to make it easier to communicate between cores - that would be the real added value here IMO, so providing communications channels through some kind of mailbox.
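
One common shape for such a mailbox is a fixed-size single-producer/single-consumer ring buffer in shared memory. The sketch below only models the index logic in Python; the `Mailbox` class and its method names are hypothetical and are not taken from zeptoforth or any BL808 SDK. On real hardware the slots and indices would live in memory visible to both cores, and memory barriers and cache handling would need care that this model ignores.

```python
class Mailbox:
    """Fixed-size ring buffer: one core only sends, the other only receives."""

    def __init__(self, capacity: int):
        # One slot is always left empty so "full" and "empty" look different.
        self._slots = [None] * (capacity + 1)
        self._head = 0   # next slot to read; advanced only by the receiver
        self._tail = 0   # next slot to write; advanced only by the sender

    def try_send(self, message) -> bool:
        """Store a message, or return False without blocking if full."""
        next_tail = (self._tail + 1) % len(self._slots)
        if next_tail == self._head:
            return False
        self._slots[self._tail] = message
        self._tail = next_tail       # publish only after the slot is written
        return True

    def try_receive(self):
        """Return the oldest message, or None if the mailbox is empty."""
        if self._head == self._tail:
            return None
        message = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        return message


box = Mailbox(capacity=2)
assert box.try_send("start-adc")
assert box.try_send("stop-adc")
assert not box.try_send("overflow")      # full: the sender has to retry later
assert box.try_receive() == "start-adc"  # messages come out in order
```

Because each index is written by only one side, no lock is needed in this scheme; using `None` as the "empty" result means `None` itself cannot be sent as a message.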

tabemann · « **Reply #20 on:** April 16, 2023, 07:54:41 pm »

Quote from: SiliconWizard on April 16, 2023, 07:22:50 pm

So your main issue is generating code for different cores from the same language then, and what you call a common code base?

Those different cores in general are likely to have not just differences in the instruction sets, but also peripherals and various specific features that you need to address anyway. It's not just about one core having "lower latency", one core having higher CPU performance, etc. It's also that completely different things can be achieved with them.

As the core assignment is in all likelihood done statically (I wouldn't see a point of doing that dynamically if it's not automatic), your "compiler" can generate code adapted for each as it knows which core does what.

Now one benefit of having a common code base would be to make it easier to communicate between cores - that would be the real added value here IMO, so providing communications channels through some kind of mailbox.

There is a difference between having different peripherals and requiring completely separate language implementations. A lowest common denominator ISA would allow all the cores to share the same compiler and much of the same code, even if each core requires its own multitasker and its own peripheral support. Were I to break my STM32H745 DISCOVERY out of its packaging this would be the case since both the Cortex-M4 and the Cortex-M7 cores support ARMv7-M. But without a lowest common denominator ISA the code generator would need to support multiple ISAs, and any code in the kernel, or generated by it, would have to be kept in multiple versions, one for each ISA it needs to execute under.

Nominal Animal · « **Reply #21 on:** April 17, 2023, 12:47:51 pm »

What do you gain from using the exact same compiler for all cores?

I gain nothing. I habitually create build machinery so that I can trivially switch compilers; and I very often use custom build stages to generate or manipulate data. On ARMv7E-M for example, I want my freestanding C/C++ code to be compiled to sensible machine code using both GCC and clang. It is the same source code, implementing the same overall design, just compiled using various tools for the different types of cores. I like asymmetric multiprocessing a lot. Calling it harmful, even in jest/hyperbole, is utterly stupid in my view.

(As I've described elsewhere, I even like to pick my programming language based on the need. For example, for fully hosted (running under a fully featured OS), I currently like to use Python 3 and Qt 5 for the user interface, because that lets the end users tweak/modify/fix the user interface without having to install any kind of development tools; and it also makes for a nice license break line if I want to include proprietary machinery in a dynamically linked native library.)

This reminds me of a discussion I had a few years ago with a self-professed "MPI Expert", who categorically declared asynchronous I/O harmful and dangerous, just because they themselves could not wrap their mind around how to do it effectively and efficiently without issues.

Instead of forcing your preferred model onto the hardware, either

Pick only hardware that suits your preferred model

or
Pick a model that well exploits the hardware features

Otherwise, you're essentially recommending/suggesting/demanding others to stop using hardware or models that do not suit you, just because they do not suit you.

Siwastaja · « **Reply #22 on:** April 17, 2023, 01:24:35 pm »

Any decent compiler should support all the cores you can buy from the series anyway, and then it's just a matter of compiling with different command line options. I don't see that it matters whether the cores are integrated on the same die and sold together, or you buy them separately. For example, if your compiler supports Cortex-M7 but not Cortex-M4, it's of little use.

Having the exact same ISA (or binary compatibility) would be handy only in the case of automagic core assignment, purely from a performance-resource viewpoint. But that's not the point of multicore microcontrollers at all.

newbrain · « **Reply #23 on:** April 17, 2023, 02:54:24 pm »

Quote from: Siwastaja on April 17, 2023, 01:24:35 pm

Any decent compiler should support all the cores you can buy from the series anyway, and then it's just a matter of compiling with different command line options.

Quote from: Nominal Animal on April 17, 2023, 12:47:51 pm

I habitually create build machinery so that I can trivially switch compilers;

I think a point is being missed in constantly referring to CLI options or build tools.
OP's perspective is quite different: we are not talking about external tools, but about a compiler that is an integral part of the Forth interpreter running on the target, with the inherent limitations.

It is quite common for Forth implementations to include assemblers and all non-tethered Forths include a Forth compiler.
Depending on the Forth model chosen, a word (≈Forth function) can be compiled to threaded code or native code.
Mecrisp-Stellaris chose the latter model (native code); Gforth, for example, the former (a modified direct threading).

I can understand that implementing different compiler "flavours" for different target cores is a PITA, and breaks the simple model "compile a word and run it on any core".

That said, the advantages of having differently capable cores for different purposes in the same MCU have been quite clearly explained.
So the BL808 is definitely not an easy or maybe even "good" target for (this) Forth, as Nominal Animal said:

Quote from: Nominal Animal on April 17, 2023, 12:47:51 pm

Pick only hardware that suits your preferred model

or

Pick a model that well exploits the hardware features

coppice · « **Reply #24 on:** April 17, 2023, 03:27:26 pm »

Quote from: tabemann on April 15, 2023, 11:41:42 pm

Quote from: coppice on April 15, 2023, 11:11:26 pm
This looks like a typical radio SoC. You have cores where you put a radio stack, get it approved and then leave it alone as much as you possibly can. Then you have cores where you can put applications, modify them, and not sink yourself back into the mire of complex approval processes every time. You very much DON'T want a single OS running across those cores.
That is the only way this design even makes sense.

That's the only sense the design was ever intended to make. Look at any of the competing devices. They are ALL highly asymmetric. It's not a bug, it's a feature. The more locked down the radio stack is, the easier it is to convince approvals people that it's securely locked away, and you don't need to keep re-approving every time you change the applications code. Buses, peripherals and memory spaces are all usually partitioned, partly to help with this isolation. You don't want the apps processor being able to tinker with values in peripherals the certified stack is supposed to be managing. So, they use different cores. Who cares? Compilers usually handle entire series of cores. So, whichever you choose, you just select the target core in your compile scripts, and to the programmer they all look much the same. It would be good to be able to run the same RTOS on all the cores, but not essential. Most people use third party stacks, running on whatever RTOS the developer used. What you run on the apps processor could be very different. It would be a terrible idea to run the same instance of an RTOS on all the cores.

The bottom line is if you make this device more symmetric, nobody in the radio business will buy it.

SiliconWizard · « **Reply #25 on:** April 17, 2023, 07:32:11 pm »

Quote from: Nominal Animal on April 17, 2023, 12:47:51 pm

What do you gain from using the exact same compiler for all cores?

Yep, the same question can be asked about using the same language to program all parts of a system. Many people want that, even though it doesn't necessarily make sense and often makes you jump through hoops.

But as I pointed out above, I think one point in favor of having a common code base for all cores would be the ability to handle communication between cores on a language level - possibly using channels as I said, or something like that. *That* would be a benefit.

But it wouldn't necessarily require a compiler that generates code only for one target. It could be done with several compilers.

That said, devising a sound "IPC" mechanism for a non-homogenous multi-core system in an easy-to-use way for the programmer, using language constructs, *that* would be a complex project to embark on. Handling code generation for different targets would be the easy part.

NorthGuy · « **Reply #26 on:** April 17, 2023, 08:32:51 pm »

Quote from: SiliconWizard on April 17, 2023, 07:32:11 pm

That said, devising a sound "IPC" mechanism for a non-homogenous multi-core system in an easy-to-use way for the programmer, using language constructs, *that* would be a complex project to embark on. Handling code generation for different targets would be the easy part.

It's like you have an MCU, a device driver on PC to communicate with the MCU, and a user app on PC. Such components may be written in different languages, don't need to be compiled together. They can be written by different programmers using different principles. Then they all work together. Nobody ever had a problem with that.

tabemann · « **Reply #27 on:** April 17, 2023, 10:35:33 pm »

Quote from: newbrain on April 17, 2023, 02:54:24 pm

Quote from: Siwastaja on April 17, 2023, 01:24:35 pm
Any decent compiler should support all the cores you can buy from the series anyway, and then it's just a matter of compiling with different command line options.
Quote from: Nominal Animal on April 17, 2023, 12:47:51 pm
I habitually create build machinery so that I can trivially switch compilers;
I see a bit of point missing, in constantly referring to CLI options or build tools.
OPs perspective is quite different, we are not talking about external tools, but about a compiler that's an integral part of the Forth interpreter running on the target, with the inherent limitations.

It is quite common for Forth implementations to include assemblers and all non-tethered Forths include a Forth compiler.
Depending on the Forth model chosen, a word (≈Forth function) can be compiled to threaded code or native code.
Mecrisp Stellaris chose the latter Forth model, GForth, e.g., the former (a modified direct threading).

I can understand that implementing different compiler "flavours" for different target cores is a PITA, and breaks the simple model "compile a word and run it on any core".

That said, the advantages of having differently capable cores for different purposes in the same MCU have been quite clearly explained.
So the BL808 is definitely not an easy or maybe even "good" target for (this) Forth, as Nominal Animal said:
Quote from: Nominal Animal on April 17, 2023, 12:47:51 pm
Pick only hardware that suits your preferred model

or

Pick a model that well exploits the hardware features

Precisely. zeptoforth has an integral inlining native-code compiler that runs on the target, like Mecrisp-Stellaris; using zeptoforth is not like compiling binaries offline that are then loaded onto the target. (While there are zeptoforth binaries - zeptoforth is distributed with them - these are actually generated on-target using real hardware, and then downloaded as Intel Hex files over serial, which are then converted into binary and UF2 files.) And yes, zeptoforth includes an ARMv6-M assembler, which can generate code for a least common denominator so the same assembly source can be used on ARM Cortex-M0+, Cortex-M4, and Cortex-M7 targets.

Consequently, a zeptoforth port to a platform with multiple incompatible core ISAs would either need multiple compilers and multiple copies of code on the target itself, or would target only a subset of the cores on the platform. The former is not very practical if Forth code is to run on multiple cores, of course, as has been discussed.

However, in the case of coexisting with a radio stack, a maximal amount of separation between cores is actually highly beneficial, because the radio stack can then live all by itself on one core, completely independent of the core(s) zeptoforth is running on. (This is why the model of separation used by the Pico W and the Wio RP2040, where the core (CYW43 and ESP8285 respectively) with the radio stack is completely separated from the cores zeptoforth runs on, is attractive to me; unfortunately there are other issues with these two devices, which are another story.)

Conversely, an arrangement where heterogeneous cores share resources such as RAM but a radio stack would live on one of these cores is the worst of both worlds, because zeptoforth would have to contend with living in the same space as the radio stack yet would not be able to effectively utilize all the cores by itself. In that case, either using elaborate means of working around the radio stack, or simply not using a radio stack and ignoring some of the cores, would be the only options. Note that the issues involved with working around a radio stack are a big part of why I have not ported zeptoforth to the nRF dongle I own.

brucehoult · « **Reply #28 on:** April 18, 2023, 12:41:23 am »

Quote from: tabemann on April 17, 2023, 10:35:33 pm

zeptoforth includes an ARMv6-M assembler, which can generate code for a least common denominator so the same assembly source can be used on ARM Cortex-M0+, Cortex-M4, and Cortex-M7 targets.

Consequently, a zeptoforth port to a platform that involves multiple incompatible core ISA's

Ouch.

Running ARMv6-M code on a CM7 is severely underutilizing it! You've only got 2-address instructions (except ADD), limited addressing modes and limited offsets and constants, limited use of R8-R12....

In contrast, running RV32EMC code on the bigger cores seems also sane :-) Or at least no worse.

RV32EMC is a very close match for the Cortex-M3's ISA and certainly much more powerful than CM0. You've got full access to 16 registers, both 2-byte 2-address and 4-byte 3-address arithmetic, large constants and offsets (12 bit).

On the RV32GC core (er ... RV32GC minus DP FP?) you'd be ignoring 16 of the 32 registers, and on the RV64GCV core you'd also be ignoring half of each register. You just have to be careful to write code that works with 32 bit values but doesn't assume that intermediate values are only 32 bits

e.g. a*b/b = a will always be true on the RV64 core but won't be on the 32 bit core if the multiply overflows. And shifts deliberately trying to drop bits off the end (rather than multiplying or dividing by a smallish power of 2) will need to be written the way I already showed.
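As a rough illustration of the `a*b/b` point, here is a small Python model of registers of a given width. It is only a sketch: `to_signed`, `trunc_div` and `mul_then_div` are names made up for this example, the operand values are arbitrary, and the model ignores the special cases of the real divide instruction (division by zero, signed overflow).

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def trunc_div(n, d):
    """Signed division rounding toward zero (Python's // rounds toward -inf)."""
    q = abs(n) // abs(d)
    return q if (n < 0) == (d < 0) else -q


def mul_then_div(a, b, xlen):
    """Compute (a*b)/b with each intermediate held in an xlen-bit register."""
    product = to_signed(a * b, xlen)  # the multiply keeps only the low xlen bits
    return to_signed(trunc_div(product, b), xlen)


# Both operands fit easily in 32 bits, but their product (10**10) does not.
a, b = 100_000, 100_000
result32 = mul_then_div(a, b, 32)  # product wraps modulo 2**32 before the divide
result64 = mul_then_div(a, b, 64)  # product fits, so the divide recovers a

print(result32, result64)
```

In this model the 64-bit result equals `a` while the 32-bit result does not, which is exactly the kind of divergence that has to be avoided if the same code is to behave identically on both cores.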

You'll probably want to occasionally canonicalize computed values before comparing them (especially for equality) by either storing them to RAM and reading them back (sw;lw) or else doing a "sign extend 32 to 64" operation "li tmp,-32;sll x,tmp;sra x,tmp" which will be a NOP on RV32. You could keep the -32 permanently in an unused register.
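The shift-pair trick relies on the register-operand shifts using only the low 5 bits of the shift amount on RV32 and the low 6 bits on RV64, so a shift "by -32" is a shift by 0 on RV32 and by 32 on RV64. A minimal Python model of that behaviour follows; `sll`, `sra` and `canonicalize` here are illustrative functions for this sketch, not a real emulator API.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sll(x, shamt, xlen):
    """Shift left logical; only the low log2(xlen) bits of shamt are used."""
    shamt &= xlen - 1
    return to_signed(x << shamt, xlen)


def sra(x, shamt, xlen):
    """Shift right arithmetic; Python's >> on a negative int is arithmetic."""
    shamt &= xlen - 1
    return to_signed(x, xlen) >> shamt


def canonicalize(x, xlen):
    """Model of: li tmp,-32 ; sll x,tmp ; sra x,tmp"""
    tmp = -32
    return sra(sll(x, tmp, xlen), tmp, xlen)


# On the 64-bit model, junk above bit 31 is discarded and bit 31 is
# sign-extended, so both values end up in canonical 32-bit form.
junk_upper = canonicalize(0x1_0000_0005, 64)
zero_extended_minus_one = canonicalize(0xFFFF_FFFF, 64)

# On the 32-bit model the shift amount masks to 0, so nothing changes.
unchanged = canonicalize(-7, 32)

print(junk_upper, zero_extended_minus_one, unchanged)
```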

You've also got 64 MB RAM and 4 or 16 MB of flash to run a full compiler or multiple different code generators if you want to.

tabemann · « **Reply #29 on:** April 18, 2023, 01:19:56 am »

Quote from: brucehoult on April 18, 2023, 12:41:23 am

Quote from: tabemann on April 17, 2023, 10:35:33 pm
zeptoforth includes an ARMv6-M assembler, which can generate code for a least common denominator so the same assembly source can be used on ARM Cortex-M0+, Cortex-M4, and Cortex-M7 targets.

Consequently, a zeptoforth port to a platform that involves multiple incompatible core ISA's

Ouch.

Running ARMv6-M code on a CM7 is severely underutilizing it! You've only got 2-address instructions (except ADD), limited addressing modes and limited offsets and constants, limited use of R8-R12....

The zeptoforth kernel has two main versions (aside from further variation to account for differences in initialization, flash, and the console UART): one for ARMv6-M and one for ARMv7-M. The code generator in the ARMv7-M version takes advantage of ARMv7-M instructions, whereas the ARMv6-M code generator has workarounds for things such as the limited range of conditional branches: it inverts the branch condition and emits a conditional branch around a longer unconditional branch, which is an ugly hack but at least allows adequately long conditionals. Only the user-level assembler is currently ARMv6-M only, and that is on purpose. If I were to write and actually use an ARMv7-M assembler, I would have to keep separate ARMv6-M and ARMv7-M versions of the same functionality in sync, and whereas the (complete) ARMv6-M assembler is quite small, a complete ARMv7-M assembler would be much larger. Also, in many cases ARMv7-M offers little code-size advantage over ARMv6-M, because many of the more powerful ARMv7-M instructions are 32-bit and can be replaced with pairs of 16-bit ARMv6-M instructions.
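The branch-inversion workaround can be sketched roughly as follows. This is not zeptoforth's actual code generator: `emit_cond_branch` is a hypothetical helper that returns symbolic instruction strings, and it assumes the 16-bit Thumb encodings, where a conditional branch carries a signed 8-bit halfword offset and an unconditional branch a signed 11-bit halfword offset. Offsets are taken as already computed relative to whichever branch instruction carries them; a real generator would also have to account for the extra instruction shifting the target by 2 bytes.

```python
# Each condition code and its logical inverse.
INVERSE = {
    "eq": "ne", "ne": "eq", "cs": "cc", "cc": "cs",
    "mi": "pl", "pl": "mi", "vs": "vc", "vc": "vs",
    "hi": "ls", "ls": "hi", "ge": "lt", "lt": "ge",
    "gt": "le", "le": "gt",
}

COND_RANGE = range(-256, 255, 2)      # reach of a 16-bit conditional branch
UNCOND_RANGE = range(-2048, 2047, 2)  # reach of a 16-bit unconditional branch


def emit_cond_branch(cond, offset):
    """Return symbolic instructions for 'branch by offset bytes if cond'."""
    if offset in COND_RANGE:
        # Target is close enough for a plain conditional branch.
        return [f"b{cond} {offset:+d}"]
    if offset in UNCOND_RANGE:
        # Too far: skip over an unconditional branch when cond is false.
        return [f"b{INVERSE[cond]} skip", f"b {offset:+d}", "skip:"]
    raise ValueError("target out of range for this sketch")


short_form = emit_cond_branch("eq", 100)
long_form = emit_cond_branch("eq", 1000)
print(short_form)
print(long_form)
```

The long form costs one extra instruction on the not-taken path, which is the price of staying within 16-bit encodings.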

Quote from: brucehoult on April 18, 2023, 12:41:23 am

In contrast, running RV32EMC code on the bigger cores seems also sane :-) Or at least no worse.

RV32EMC is a very close match for the Cortex-M3's ISA and certainly much more powerful than CM0. You've got full access to 16 registers, both 2-byte 2-address and 4-byte 3-address arithmetic, large constants and offsets (12 bit).

On the RV32GC core (er ... RV32GC minus DP FP?) you'd be ignoring 16 of the 32 registers, and on the RV64GCV core you'd also be ignoring half of each register. You just have to be careful to write code that works with 32 bit values but doesn't assume that intermediate values are only 32 bits

e.g. a*b/b = a will always be true on the RV64 core but won't be on the 32 bit core if the multiply overflows. And shifts deliberately trying to drop bits off the end (rather than multiplying or dividing by a smallish power of 2) will need to be written the way I already showed.

You'll probably want to occasionally canonicalize computed values before comparing them (especially for equality) by either storing them to RAM and reading them back (sw;lw) or else doing a "sign extend 32 to 64" operation "li tmp,-32;sll x,tmp;sra x,tmp" which will be a NOP on RV32. You could keep the -32 permanently in an unused register.

You've also got 64 MB RAM and 4 or 16 MB of flash to run a full compiler or multiple different code generators if you want to.

Really the main complication, then, is that on 64-bit cores I would have to preemptively make bit-shift and comparison operations, which invariably involve the top-of-stack register, behave as if they were 32-bit operations, transparently sign-extending so that signed operations work properly and the same code behaves identically on the 32-bit cores. The main concerns there, aside from the loss of efficiency from the extra operations, are making sure that every operation behaves exactly the same way on 64-bit and 32-bit cores from the user's perspective.
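To make the shift concern concrete, here is a small Python model of a logical right shift. It assumes 32-bit cells are kept sign-extended in 64-bit registers and that shift amounts are masked the way the hardware masks them; `srl` and `rshift_cell32` are hypothetical names for this sketch, not zeptoforth words.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def srl(x, shamt, xlen):
    """Logical right shift as an xlen-bit register would perform it."""
    shamt &= xlen - 1
    return to_signed((x & ((1 << xlen) - 1)) >> shamt, xlen)


def rshift_cell32(x, shamt):
    """32-bit-cell shift for a 64-bit core: zero-extend, shift, sign-extend."""
    return to_signed((x & 0xFFFF_FFFF) >> (shamt & 31), 32)


native32 = srl(-1, 1, 32)      # what the 32-bit cores produce
naive64 = srl(-1, 1, 64)       # plain 64-bit shift of a sign-extended -1
fixed64 = rshift_cell32(-1, 1)  # wrapped shift that matches the 32-bit cores

print(hex(native32), hex(naive64), hex(fixed64))
```

The extra masking and sign extension in `rshift_cell32` is the kind of added operation, and added cost, referred to above.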

tabemann · « **Reply #30 on:** April 18, 2023, 01:44:48 am »

Quote from: brucehoult on April 18, 2023, 12:41:23 am

Running ARMv6-M code on a CM7 is severely underutilizing it! You've only got 2-address instructions (except ADD), limited addressing modes and limited offsets and constants, limited use of R8-R12....

I should clarify that the zeptoforth kernel, which includes much of the zeptoforth compiler, is not built using said ARMv6-M assembler that zeptoforth includes. Rather it is built using the arm-none-eabi version of gas, for ARMv6-M on the ARM Cortex-M0+ and for ARMv7-M on the Cortex-M4 and Cortex-M7. The ARMv6-M assembler that is included is rather used for compiling inline assembly integrated with the Forth codebase that comprises the rest of zeptoforth outside the kernel.

Nominal Animal · « **Reply #31 on:** April 18, 2023, 03:06:56 pm »

Quote from: tabemann on April 17, 2023, 10:35:33 pm

Precisely. zeptoforth has an integral inlining native-code compiler that runs on the target

That means zeptoforth is only suited for single-core and symmetric multi-core architectures.

Considering how useful asymmetric multiprocessing is, shouldn't this thread be labeled "zeptoforth is considered harmful"?

tabemann · « **Reply #32 on:** April 18, 2023, 04:29:12 pm »

Quote from: Nominal Animal on April 18, 2023, 03:06:56 pm

Quote from: tabemann on April 17, 2023, 10:35:33 pm
Precisely. zeptoforth has an integral inlining native-code compiler that runs on the target
That means zeptoforth is only suited for single-core and symmetric multi-core architectures.

Considering how useful asymmetric multiprocessing is, shouldn't this thread be labeled "zeptoforth is considered harmful"?

I do see the point to separating concerns, as I have expressed above - but the Pine64 Ox64 seems to be overly asymmetric just for its own sake. I can see making the two smaller cores a different ISA from the bigger core if, say, lower power usage or tighter realtime performance (e.g. no MMU) can be accomplished with it. But making them all have different ISAs from one another is a bit much - are three different ISAs in one chip actually worthwhile? And aside from running blobbed stacks on them, which is most likely the manufacturer's intent, they could be more useful if put in a big.LITTLE-type arrangement - i.e. with identical ISAs but optimized for differing performance versus power usage - with pairs of physical cores mapped to single logical cores (I assume you know this, this is just for the benefit of others here who may not know what big.LITTLE is).

Nominal Animal · « **Reply #33 on:** April 18, 2023, 05:30:15 pm »

Quote from: tabemann on April 18, 2023, 04:29:12 pm

I do see the point to separating concerns, as I have expressed above - but the Pine64 Ox64 seems to be overly asymmetric just for its own sake.

I disagree.

It is true that heterogeneous/asymmetric hardware is not well suited for a traditional kernel model. This does not mean heterogeneous/asymmetric hardware is harmful or overly asymmetric for its own sake; it is simply not well suited for a traditional kernel model. Zeptoforth is designed on top of a traditional single-local-kernel model, and just isn't well suited for heterogeneous/asymmetric hardware.

Asymmetric hardware with heterogeneous instruction set architectures is best suited for multiple kernels, and therefore best suited for distributed designs, not parallel ones. (Note how I already mentioned this reminded me of an MPI discussion ages ago? It is exactly because of the difference between distributed computing and parallel computing.)

It does not really matter what the various different cores' kernels are, actual OS kernels or just subsystems. The key is the distributed approach, as opposed to parallel approach.

If you start looking at Ox64 as a distributed system, you should realize that to adapt zeptoforth to this system, you'd need to implement the core separation and core targeting in the language; perhaps even have a zeptoforth instance for each core, and a cross-core communication system – some kind of Message Passing Interface, perhaps.

Quote from: tabemann on April 18, 2023, 04:29:12 pm

(I assume you know this, this is just for the benefit of others here who may not know what big.LITTLE is).

Yup. To be specific, I have both big.LITTLE SoCs (Samsung Exynos octa-core series, 4+4, Odroid HC1), as well as a couple of the smaller Ox64's.

Interestingly, Intel's heterogeneous cores initially also had ISA differences: AVX512 support was available only on the P-cores, not on the E-cores. Intel Alder Lake processors' P-cores do have AVX512, only disabled (AIUI with MSI Z690 firmware being capable of re-enabling it for the P-cores). Thus, there are more people who agree with you, and fewer who agree with me, if that sort of thing is an important or useful data point to you. It is not for me: that sort of popularity is irrelevant. The corollary to that is the saying that a billion flies cannot be wrong: excrement tastes good.

As an example of what I'm experimenting with on the smaller Ox64s: a USB-controlled external display module using small displays (see e.g. BuyDisplay.com aka EastRising's IPS panels with ILI9341, ST7789, etc. controllers using parallel interfaces), with one core combining the output to the panel from at least two framebuffers ("background" and "overlay", the latter with alpha support), and another handling USB communications. Having completely separate cores for this makes things simpler (no need to worry about clock jitter due to interrupts and so on), and is basically perfect for this.
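For readers unfamiliar with this kind of compositing, the per-pixel work of the display core could look roughly like the sketch below. Everything in it is an assumption made for illustration: the 8-bit-per-channel RGB tuples, the 8-bit alpha, and the names `blend_pixel` and `compose_line`. The actual panels would more likely use a packed format such as RGB565.

```python
def blend_pixel(background, overlay, alpha):
    """Blend one overlay pixel over one background pixel.

    background and overlay are (r, g, b) tuples of 0..255 values;
    alpha is 0 (overlay invisible) to 255 (overlay opaque).
    """
    return tuple(
        (o * alpha + b * (255 - alpha) + 127) // 255  # rounded integer blend
        for b, o in zip(background, overlay)
    )


def compose_line(background_line, overlay_line, alpha_line):
    """Combine one scanline from the two framebuffers into panel output."""
    return [
        blend_pixel(b, o, a)
        for b, o, a in zip(background_line, overlay_line, alpha_line)
    ]


black, white = (0, 0, 0), (255, 255, 255)
line = compose_line([black, black, black], [white, white, white], [0, 128, 255])
print(line)
```

Because the loop touches every pixel of every frame at a fixed rate, it is a natural job for a core that does nothing else.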

Further options include emulating old games and systems at 320x240 (or perhaps 480x320) resolutions, again with each core having their own assigned emulation tasks. I have used Teensy LC, 3.2, 4.0, and MicroMod for the exact same purpose: many of my appliances would benefit a lot from such displays.

Now, consider implementing that with zeptoforth. You'd definitely want to control which core executes which code, and not, say, parallelize things willy-nilly. Even having their "own" zeptoforth kernels/interpreters/JIT-compilers would make more sense.

tabemann · « **Reply #34 on:** April 18, 2023, 07:06:13 pm »

Quote from: Nominal Animal on April 18, 2023, 05:30:15 pm

Quote from: tabemann on April 18, 2023, 04:29:12 pm
I do see the point to separating concerns, as I have expressed above - but the Pine64 Ox64 seems to be overly asymmetric just for its own sake.
I disagree.

It is true that heterogenous/asymmetric hardware is not well suited for a traditional kernel model. This does not mean heterogenous/asymmetric hardware is harmful or overly asymmetric for its own sake; it is simply not well suited for a traditional kernel model. Zeptoforth is designed on top of a traditional single-local-kernel model, and just isn't well suited for heterogenous/asymmetric hardware.

Asymmetric hardware with heterogenous instruction set architectures are best suited for multiple kernels, and therefore best suited for distributed designs; not parallel ones. (Note how I already mentioned this reminded me of an MPI discussion ages ago? It is exactly because of the difference between distributed computing and parallel computing.)

It does not really matter what the various different cores' kernels are, actual OS kernels or just subsystems. The key is the distributed approach, as opposed to parallel approach.

If you start looking at Ox64 as a distributed system, you should realize that to adapt zeptoforth to this system, you'd need to implement the core separation and core targeting in the language; perhaps even have a zeptoforth instance for each core, and a cross-core communication system – some kind of Message Passing Interface, perhaps.

I am now of the view that the simplest way to support the Ox64 would indeed be to have separate zeptoforth instances on each core (cores used for blobs aside), and then to implement a virtual serial link between cores. Code could then be transferred from the primary core, which is linked to the outside serial or USB link, to the secondary and tertiary cores, either as source code or as binary images saved remotely (to avoid having to jump through the hoops of rebuilding the system from source each time it is to be reimaged).
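One common way to build such a link over shared RAM is a single-producer, single-consumer ring buffer, sketched below in Python. This only shows the idea: `VirtualSerialLink` and `transfer` are hypothetical names, both ends are simulated in one thread, and a real implementation would need memory barriers and some notification mechanism between cores.

```python
class VirtualSerialLink:
    """Byte ring buffer; one core only writes head, the other only tail."""

    def __init__(self, capacity):
        # One slot always stays empty, so capacity must be at least 2.
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._head = 0  # next write index, advanced only by the sender
        self._tail = 0  # next read index, advanced only by the receiver

    def send_byte(self, byte):
        """Queue one byte; return False if the buffer is full."""
        next_head = (self._head + 1) % self._capacity
        if next_head == self._tail:
            return False
        self._buf[self._head] = byte
        self._head = next_head
        return True

    def recv_byte(self):
        """Return the next byte, or None if the buffer is empty."""
        if self._tail == self._head:
            return None
        byte = self._buf[self._tail]
        self._tail = (self._tail + 1) % self._capacity
        return byte


def transfer(link, data):
    """Simulate both cores: send until full, then drain, until done."""
    received = bytearray()
    pending = list(data)
    while pending:
        while pending and link.send_byte(pending[0]):
            pending.pop(0)
        while (byte := link.recv_byte()) is not None:
            received.append(byte)
    return bytes(received)


source = b": square dup * ;\n"  # a line of Forth source sent as text
echoed = transfer(VirtualSerialLink(8), source)
print(echoed)
```

Sending source text this way means each core's own compiler does the code generation, so the differing ISAs never have to share machine code.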

Quote from: Nominal Animal on April 18, 2023, 05:30:15 pm

Quote from: tabemann on April 18, 2023, 04:29:12 pm
(I assume you know this, this is just for the benefit of others here who may not know what big.LITTLE is).
Yup. To be specific, I have both big.LITTLE SoCs (Samsung Exynos octa-core series, 4+4, Odroid HC1), as well as a couple of the smaller Ox64's.

Interestingly, initially Intel's heterogenous cores also had ISA differences, namely AVX512 support was available only on the P-cores, not on the E-cores. Intel Alder Lake processors P-cores do have AVX512, only disabled (AIUI with MSI Z690 firmware being capable of re-enabling it for the P-cores). Thus, there are more people who agree with you, and fewer that agree with me, if that sort of thing is important or useful data point to you. It is not for me: that sort of popularity is irrelevant. The corollary to that is the saying that a billion flies cannot be wrong: excrement tastes good.

As an example of what I'm experimenting with the smaller Ox64's: An USB-controlled external display module using small displays (see e.g. BuyDisplay.com aka EastRising's IPS panels with ILI9341, ST7789, etc. controllers using parallel interfaces), with one core combining the output to the panel from at least two framebuffers ("background" and "overlay", latter with alpha support), and another handling USB communications. Having completely separate cores for this makes things simpler – no need to worry about clock jitter due to interrups and so on –, and is basically perfect for this.

Further options include emulating old games and systems at 320x240 (or perhaps 480x320) resolutions, again with each core having their own assigned emulation tasks. I have used Teensy LC, 3.2, 4.0, and MicroMod for the exact same purpose: many of my appliances would benefit a lot from such displays.

Now, consider implementing that with zeptoforth. You'd definitely want to control which core executed which code, and not say parallelize things willy-nilly. Even having their "own" zeptoforth kernels/interpreters/JIT-compilers would make more sense.

zeptoforth on the RP2040 already has support for explicitly executing code on specific cores; however, all its code exists in a single "world", sharing a single memory space and compiler, whereas on the Ox64 as I mentioned it would probably be best to have a separate zeptoforth "world" for each core.

trixy · « **Reply #35 on:** April 19, 2023, 02:10:16 pm »

Like it or not, this type of design seems to be the future. Cell phones, even full-scale PC processors, are getting cores of different designs packed in one package. This allows optimizing of the manufacturing, power and thermals, etc. Software will have to adapt.

I wonder if you could mash an Arm, RISC-V, and x86 all into one processor sharing memory, bus, etc. I don't know if it would be useful but who knows. It's really just the old typical co-processor design packed together.

tggzzz · « **Reply #36 on:** April 19, 2023, 05:56:30 pm »

Quote from: trixy on April 19, 2023, 02:10:16 pm

Like it or not, this type of design seems to be the future. Cell phones, even full-scale PC processors, are getting cores of different designs packed in one package. This allows optimizing of the manufacturing, power and thermals, etc. Software will have to adapt.

That will be "software adapting" to the same degree that compilers adapted to the Itanic's awkwardness.

Basically having closely coupled asymmetric processors is an old concept, and in the general case it has major shortcomings, e.g.:

placement can't be automatic and worked out by the compiler
the fast processor can often be held up waiting for the slow processor; that was a major issue with network cards that contained their own "cpu offload processor" dedicated to networking

In special cases where the designer places functionality, it can be an improvement - but sometimes it hasn't been!

Where the asymmetric processors aren't closely coupled, it can work effectively.

coppice · « **Reply #37 on:** April 19, 2023, 06:13:42 pm »

Quote from: tggzzz on April 19, 2023, 05:56:30 pm

Basically having closely coupled asymmetric processors is an old concept, and in the general case it has major shortcomings, e.g.:
placement can't be automatic and worked out by the compiler
the fast processor can often be held up waiting for the slow processor; that was a major issue with network cards that contained their own "cpu offload processor" dedicated to networking

There is one important case where things have worked well, perhaps because it has such broad and generalised application - General purpose processors + I/O processors. That's a bit like the network offload processors you mentioned, but done right.

tggzzz · « **Reply #38 on:** April 19, 2023, 06:25:53 pm »

Quote from: coppice on April 19, 2023, 06:13:42 pm

Quote from: tggzzz on April 19, 2023, 05:56:30 pm
Basically having closely coupled asymmetric processors is an old concept, and in the general case it has major shortcomings, e.g.:
placement can't be automatic and worked out by the compiler
the fast processor can often be held up waiting for the slow processor; that was a major issue with network cards that contained their own "cpu offload processor" dedicated to networking
There is one important case where things have worked well, perhaps because it has such broad and generalised application - General purpose processors + I/O processors. That's a bit like the network offload processors you mentioned, but done right.

My very limited understanding is that such an architecture considerably constrained how applications could be written. What I remember from the one time I used ~~IBM~~ Amdahl big iron (for a week in the early 80s) is that the text editor was weird. IIRC the screen accumulated changes until you sent the entire screen back to the I/O processor, and hopefully you hadn't done something to invisibly change the screen.

I've no idea whether such block oriented i/o could be usefully mutated to byte stream oriented i/o.

coppice · « **Reply #39 on:** April 19, 2023, 06:46:12 pm »

Quote from: tggzzz on April 19, 2023, 06:25:53 pm

Quote from: coppice on April 19, 2023, 06:13:42 pm
Quote from: tggzzz on April 19, 2023, 05:56:30 pm
Basically having closely coupled asymmetric processors is an old concept, and in the general case it has major shortcomings, e.g.:
placement can't be automatic and worked out by the compiler
the fast processor can often be held up waiting for the slow processor; that was a major issue with network cards that contained their own "cpu offload processor" dedicated to networking
There is one important case where things have worked well, perhaps because it has such broad and generalised application - General purpose processors + I/O processors. That's a bit like the network offload processors you mentioned, but done right.

My very limited understanding is that such architecture considerably constrained how applications could be written. The one time I used ~~IBM~~ Amdahl big iron (in the early 80s for a week) was that the text editor was weird. IIRC the screen accumulated changes until you sent the entire screen back to the i/o processor - and hopefully hadn't done something to invisibly change the screen.

I've no idea whether such block oriented i/o could be usefully mutated to byte stream oriented i/o.

Even DMA is an elementary form of I/O processor, and most systems would suck badly without that. Most processing works in chunks for efficiency. The gains can be huge, even for things working on an endless flow of data. How many DSP applications, other than control loops, work sample by sample?

tggzzz · « **Reply #40 on:** April 19, 2023, 08:29:51 pm »

Quote from: coppice on April 19, 2023, 06:46:12 pm

Even DMA is an elementary form of I/O processor, and most systems would suck badly without that. Most processing works in chunks for efficiency. The gains can be huge, even for things working on an endless flow of data. How many DSP applications, other than control loops, work sample by sample?

All applications where waiting for a fixed-length block to accumulate would cause unacceptable delays.

Start with text editors.

coppice · « **Reply #41 on:** April 19, 2023, 09:01:42 pm »

Quote from: tggzzz on April 19, 2023, 08:29:51 pm

All applications where waiting for a fixed length block to accumulate would cause unacceptable delays.

Start with text editors

I think an interactive text editor is an example of a control loop, so I had that one covered.

tggzzz · « **Reply #42 on:** April 19, 2023, 09:15:14 pm »

Quote from: coppice on April 19, 2023, 09:01:42 pm

I think an interactive text editor is an example of a control loop, so I had that one covered.

What is being controlled? What's the loop cycle period, if it has one? Why is so much noise injected into the loop? What's the loop's bandwidth and phase margin?

SiliconWizard · « **Reply #43 on:** April 19, 2023, 09:17:25 pm »

Buffering versus latency is always a compromise in any given system.
The usual approach is to buffer incoming/outgoing data, but 'flush' the buffer, even if it is not full, once the max acceptable latency has been reached.
So you 'flush' a buffer if 1/ it is full - or "almost" full if you don't have a double buffer - or 2/ the latency timer has expired, whichever comes first.

So yes, obviously that means that your system needs to be able to deal with blocks of any size rather than strictly a fixed size, but you can still benefit from buffering.
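
For what it's worth, the "flush when full or when the latency timer expires, whichever comes first" rule can be sketched in a few lines. This is only an illustration in Python, not code from any particular driver or RTOS: the class name `LatencyBoundedBuffer`, its parameters, and the `sink` callback are all made up for the example, time is passed in explicitly so the behaviour is easy to follow, and it assumes a single buffer (no double buffering, so it flushes exactly at "full").

```python
from typing import Callable, List, Optional


class LatencyBoundedBuffer:
    """Hypothetical sketch: accumulate bytes, flush on full or on timeout."""

    def __init__(self, capacity: int, max_latency: float,
                 sink: Callable[[bytes], None]) -> None:
        self.capacity = capacity        # flush condition 1: buffer is full
        self.max_latency = max_latency  # flush condition 2: oldest byte waited this long (seconds)
        self.sink = sink                # must accept blocks of 1..capacity bytes
        self._data = bytearray()
        self._deadline: Optional[float] = None  # set when the first byte of a block arrives

    def write(self, chunk: bytes, now: float) -> None:
        """Add bytes; every time the buffer fills, a full block is flushed."""
        for byte in chunk:
            if not self._data:
                # First byte of a new block starts the latency timer.
                self._deadline = now + self.max_latency
            self._data.append(byte)
            if len(self._data) >= self.capacity:
                self._flush()

    def tick(self, now: float) -> None:
        """Call periodically (e.g. from a timer); flushes a partial block on timeout."""
        if self._data and self._deadline is not None and now >= self._deadline:
            self._flush()

    def _flush(self) -> None:
        block = bytes(self._data)
        self._data.clear()
        self._deadline = None
        self.sink(block)


# Usage: capacity of 4 bytes, at most 10 ms of added latency.
blocks: List[bytes] = []
buf = LatencyBoundedBuffer(capacity=4, max_latency=0.010, sink=blocks.append)
buf.write(b"abcdef", now=0.000)  # the first four bytes go out as a full block
buf.tick(now=0.005)              # remaining two bytes have not timed out yet
buf.tick(now=0.010)              # timer expired: the partial block goes out
```

The point of the sketch is the one made above: the consumer (`sink` here) has to cope with blocks of any length up to the capacity, and in exchange the worst-case delay added by buffering is bounded by `max_latency` plus the tick interval.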

