Author Topic: The Imperium programming language - IPL (Read 71503 times)

DiTBho · « **Reply #625 on:** December 31, 2022, 02:09:31 am »

Quote from: Nominal Animal on December 31, 2022, 12:47:23 am

replacing the standard C library with something more suitable for my needs

That's my plan for 2023 for m68k.
Everything replaced from crt0 up.

We can do it

DiTBho · « **Reply #626 on:** December 31, 2022, 02:23:34 am »

Quote from: MIS42N on December 31, 2022, 01:00:37 am

Is this is going off topic. Xeon processors, MIPS, Athlon, power10?

We are talking about tr-mem, which I personally need as a "language feature", indeed I added specific support and basic-rule-set in my-c.

Unfortunately tr-mem seems very "implementation dependent", that's why we ended considering different platforms, from the success of IBM at implementing and supporting tr-mem in their POWER9-10 to the failure of Intel in making weak solutions that nobody is willing to give support, plus an experimental attempt made by a company that sent me a MIPS5++ prototype where tr-mem looks the simplest solution ever seen.

DiTBho · « **Reply #627 on:** December 31, 2022, 02:38:11 am »

Quote from: MK14 on December 31, 2022, 02:12:10 am

Replace it with my_c.lib

LOL

funny, just... well, my-c doesn't have any machine layer for m68k yet, only mips5++ is supported.

(one thing I learned...
adding machine-layer isn't as easy as expected, unless you leverage clang for your language... my-c was written from scratch from the lexer up, which is good in some respects and also a BIG mistake considering the effort required to support other architectures
... d'oh)

So, it will be libc-core-v1.c for m68k-elf-none-gcc-v9.3.0

Consider that my-c is 95% compatible with a subset of C89 (MisraC95&DO178B), so ... I will be probably able to re-use most of the code

(

)

SiliconWizard · « **Reply #628 on:** December 31, 2022, 02:46:30 am »

Quote from: DiTBho on December 31, 2022, 02:09:31 am

Quote from: Nominal Animal on December 31, 2022, 12:47:23 am
replacing the standard C library with something more suitable for my needs

That's my plan for 2023 for m68k.
Everything replaced from crt0 up.

We can do it

Of course.
Does it have to comply with the standard though? That would be quite limiting.

Nominal Animal · « **Reply #629 on:** December 31, 2022, 04:00:49 am »

Quote from: MIS42N on December 31, 2022, 01:00:37 am

I'm thinking a microcontroller has less CPU grunt but is augmented by a few onboard PWMs, extra timers, voltage references, comparators, ADC and DAC ability, USARTs, onboard USB, I2C, SPI, WDT, writable permanent memory, sleep modes and commonly low powered so it can be run off battery power. There may be some offloading of functions such as DMA and gated timers. And commonly just 1 CPU.

Yes.

Quote from: MIS42N on December 31, 2022, 01:00:37 am

How do you then optimize for different architectures - AVR with multiple registers but transfers to memory are load and store - PIC with one working register but instructions like INCFSZ (increment memory and skip if zero) that doesn't touch the register at all.

That's exactly why I said I would focuse on the multiple general purpose register architectures: AVR on 8-bit, ARM/RISC-V/x86-64 on 32- and 64-bit architectures.

You really need a different approach on the backend to target the accumulator-style 8-bitters like 8051 etc.; just compare SDCC, GCC, and Clang targets for C.

Transactional memory and atomic memory access is discussed here, because it is a very similar problem on larger, multi-core processors. There are three basic ways of implementing atomic memory access:

Transactional memory. Writes to transactional memory are grouped, and only actually committed to the underlying memory (atomically) if there are no conflicts.
CAS: Compare-exchange. The basic operation is a conditional compare-exchange, that atomically replaces the memory contents if the previous contents match a given value.
LL/SC: Load-linked, store-conditional. The basic operation is a linked load, which basically "claims" ownership of that memory address. Within a few instructions (often with restrictions like no jumps), a corresponding linked store to that address will succeed if the contents were not modified in between.

Higher-level primitives like mutexes, semaphores, condition variables are implemented in terms of the above atomic operations. However, there are also a large number of lockless data structures that rely on atomic access, and they tend to differ a bit between CAS and LL/SC.
I'm sure you can imagine how problematic it is to expose the low-level atomic operations in a low-level programming language: if you expose just CAS, the code will be suboptimal on LL/SC architectures, and vice versa. If you provide both, how is the developer going to decide which one to use?

As you can see, you really need to understand the hardware differences to be able to determine what kind of language approach is suitable on use with a MCU.

A good example of this is the subset of C++ used in freestanding C++ environments. Exceptions are usually not supported at all, because they're "designed" to be implemented via stack unwinding, and that's just too much work on microcontrollers for so little return.

Quote from: MIS42N on December 31, 2022, 01:00:37 am

Perhaps the best course is to wait for Sherlock to produce the language, and see if what benefits (if any) it brings. And with something to work with, then make comment to improve (if feasible).

My beef is that they're taking an already well-trodden path that has not thus far yielded anything practically useful. Why repeat that? I'm not saying this because I hope to benefit from their work, I am saying this because I've seen too many people get frustrated and fail, and I don't want Sherlock to do so too.

Why not instead try and approach it with the intent of producing something practically useful, with eyes and mind open to what it will end up looking like, and leverage all the massive amounts of practical knowledge we have on compilers and languages and microcontrollers?

And why go for just another object-oriented imperative language, when there is good reason to expect better returns from say an event-based language – simply because there are decades of research and a lot of money behind existing object-oriented imperative languages like Rust, but very few people considering event-based programming languages, even though a lot of code we write for microcontrollers revolves around dealing with events?

I for one am not willing to just stand by and look silently when someone sets themselves up for a failure. I will pipe up, even if they themselves would prefer me to stay quiet. It's a personality flaw.

Nominal Animal · « **Reply #630 on:** December 31, 2022, 04:40:34 am »

Quote from: DiTBho on December 31, 2022, 02:09:31 am

Quote from: Nominal Animal on December 31, 2022, 12:47:23 am
replacing the standard C library with something more suitable for my needs
That's my plan for 2023 for m68k.
Everything replaced from crt0 up.

It is less work than one might expect, actually, at least for hosted environments ("userspace" code, running under a kernel providing I/O).
The trick is to start with a static inline extended assembly syscall wrappers, and implement the exported functions on top of those.
If you keep to the kernel interfaces, as opposed to standard C library interfaces, you'll also see significant code size savings.

Environment variables are nearly trivial if used in read-only manner, values deleted, or values replaced with equal length or shorter values. It is modification and appending (setenv(), putenv() in C) that get hairy, especially if you want to avoid wasting RAM.

How you end up implementing formatted input and output – printf(), scanf() in C – is a more complicated question, but even there, it is more work to try and find out what kind of interfaces one really wants (again, if RAM and CPU use is to be minimized), than implementing them.

This is also why I believe a small but fundamental change in C would solve the most common bugs: replacing pointers with arrays as the base primitive, so that instead of working with pointers, one would work with contiguous memory ranges. At the machine code level, nothing much would change; it would affect the function signatures and overall code at the conceptual level. Things like string literals and other arrays wouldn't decay to a pointer, and instead would conceptually be a pointer, length tuple.
It is just that current C compilers are so darned complex already, that I don't know if that kind of fundamental change can be done as a modification, or whether one should start (a new front-end, for GCC or Clang, instead of modifying the existing C front-end) from scratch.
I do know that the GCC Fortran front-end heavily uses a similar concept for arrays and slices –– it is why certain linear algebra code written in Fortran runs faster than equivalent C code, both compiled using GCC.

brucehoult · « **Reply #631 on:** December 31, 2022, 08:41:57 am »

Quote from: Nominal Animal on December 31, 2022, 04:00:49 am

Transactional memory and atomic memory access is discussed here, because it is a very similar problem on larger, multi-core processors. There are three basic ways of implementing atomic memory access:
Transactional memory. Writes to transactional memory are grouped, and only actually committed to the underlying memory (atomically) if there are no conflicts.
CAS: Compare-exchange. The basic operation is a conditional compare-exchange, that atomically replaces the memory contents if the previous contents match a given value.
LL/SC: Load-linked, store-conditional. The basic operation is a linked load, which basically "claims" ownership of that memory address. Within a few instructions (often with restrictions like no jumps), a corresponding linked store to that address will succeed if the contents were not modified in between.
Higher-level primitives like mutexes, semaphores, condition variables are implemented in terms of the above atomic operations. However, there are also a large number of lockless data structures that rely on atomic access, and they tend to differ a bit between CAS and LL/SC.
I'm sure you can imagine how problematic it is to expose the low-level atomic operations in a low-level programming language: if you expose just CAS, the code will be suboptimal on LL/SC architectures, and vice versa. If you provide both, how is the developer going to decide which one to use?

You've missed a class: Atomic Memory Operations, as seen in a number of current ISAs.

Semantically, this involves fetching a value from RAM (or a peripheral register), performing an arithmetic operation such as add, sub, and, or, xor, min, max, swap between the fetched value and a value in a register, and writing the result back to RAM or IO register. The result of the instruction is the original value of the memory location. This is all guaranteed to be atomic, and can not fail.

RISC-V has such instructions, as does ARMv8.1.

This seems on the face expensive and not scalable, but Berkeley's "TileLink" bus (enhanced and used by SiFive) enables the memory address, a constant, and the operation to be performed to be sent out to a suitably intelligent node in the memory system -- often the L2 cache shared by a number of CPU cores, but it might even be an IO device -- and the ALU operation is performed right where the data is stored, and only the result sent back to the CPU (and only if it's not going to be ignored by being written to register X0).

This scales extremely well.

Arm added a similar capability in AMBA 5 CHI (Coherent Hub Interface)

Nominal Animal · « **Reply #632 on:** December 31, 2022, 11:00:51 am »

Quote from: brucehoult on December 31, 2022, 08:41:57 am

Quote from: Nominal Animal on December 31, 2022, 04:00:49 am
Transactional memory and atomic memory access is discussed here, because it is a very similar problem on larger, multi-core processors. There are three basic ways of implementing atomic memory access:
Transactional memory. Writes to transactional memory are grouped, and only actually committed to the underlying memory (atomically) if there are no conflicts.
CAS: Compare-exchange. The basic operation is a conditional compare-exchange, that atomically replaces the memory contents if the previous contents match a given value.
LL/SC: Load-linked, store-conditional. The basic operation is a linked load, which basically "claims" ownership of that memory address. Within a few instructions (often with restrictions like no jumps), a corresponding linked store to that address will succeed if the contents were not modified in between.
Higher-level primitives like mutexes, semaphores, condition variables are implemented in terms of the above atomic operations. However, there are also a large number of lockless data structures that rely on atomic access, and they tend to differ a bit between CAS and LL/SC.
I'm sure you can imagine how problematic it is to expose the low-level atomic operations in a low-level programming language: if you expose just CAS, the code will be suboptimal on LL/SC architectures, and vice versa. If you provide both, how is the developer going to decide which one to use?

You've missed a class: Atomic Memory Operations, as seen in a number of current ISAs.

You're right, as usual.

I did not consider them a separate class, because GCC, Clang, and Intel Compiler Collection expose them (implemented using the hardware bus lock prefix for select instructions – CAS as LOCK CMPXCHG etc.) via atomic built-ins, and I do use them when implementing lockless data structures on x86-64; but it is CAS, LL/SC, or transactional memory I'd need to implement the higher-level primitives like mutexes.

Quote from: brucehoult on December 31, 2022, 08:41:57 am

This seems on the face expensive and not scalable, but Berkeley's "TileLink" bus (enhanced and used by SiFive) enables the memory address, a constant, and the operation to be performed to be sent out to a suitably intelligent node in the memory system -- often the L2 cache shared by a number of CPU cores, but it might even be an IO device -- and the ALU operation is performed right where the data is stored, and only the result sent back to the CPU (and only if it's not going to be ignored by being written to register X0).

Extremely interesting!

(It also indicates that my expectation/hope of such tiny ALUs becoming more widely used may not be as crazy as it sounds. An ALU micro-core between a DMA engine and I/O pins only needs a few instructions and a couple of registers to be able to do I2C, SPI, UART, PWM, PDM, and multi-bit 8080-style parallel buses. The one Raspberry Pi folks created is too limited (no ALU), and the TI Sitara PRUs are too complex (full cores, with full access to shared RAM). Something in between would be truly useful.)

tggzzz · « **Reply #633 on:** December 31, 2022, 11:22:50 am »

Quote from: Nominal Animal on December 31, 2022, 12:47:23 am

Quote from: tggzzz on December 30, 2022, 10:21:09 am
Quote from: Nominal Animal on December 30, 2022, 04:10:13 am
There are 8-bit architectures that have very few registers and were designed to pass function arguments on the stack, but they're quite difficult to optimize code for. Most hardware today, from AVRs to ARMs to RISCV, have many general-purpose registers, so targeting those makes more sense to me anyway.
Not always true...
Well, I do think 12 general purpose registers (xCORE-200 XS2 ISA (PDF)) is plenty!

You are correct, of course

I'll claim, without proof (

), that another advantage of registers is that it increases instruction encoding density. That can be important for increasing speed and reducing program memory usage.

Nonetheless, matching memory speed to core speed helps a lot w.r.t. hardware complexity and MIPS/Watt.

Quote

But, sure, there are exceptions to any rule of thumb. XCore is definitely one.

Yup, and it is an excellent example of what can be achieved when people decide they aren't interested in making incremental changes to existing cancerous growths, or archeological layered ruins if you prefer that analogy

Quote

Quote from: tggzzz on December 30, 2022, 10:21:09 am
That was demonstrated very clearly with the Sun UltrasSPARC Niagara T series processors, which had up to 128 "cores" (and used them effectively!) in 2005-2011, when x86-64 has <8 cores.
Yep. I personally like the idea of asymmetric multiprocessing a lot, and would love to play with such hardware. Alas, the only ones I have are mixed Cortex-A cores (ARM big.LITTLE).

I'm not a fan of asymmetric processing for applications which aren't nailed down when the device ships. The asymmetry introduces a nasty question of whether a particular operation should be done here or there. That was a downfall of the CELL(?) architecture with one BIG and 8(?) LITTLEs a decade ago.

Where the only operations that will be in LITTLE are fixed at design time, that disadvantage is not a major problem.

Quote

We can see from the increasing complexity of peripherals on even cheap ARM Cortex-M microcontrollers, and things like Raspberry Pi Pico/RP2040 (and of course TI AM335x with its PRUs) that we're slowly going that way anyway.

Personally, I'd love to have tiny programmable cores with just a basic ALU – addition, subtraction, comparison, and bit operations, with just a couple of run-time registers – and access to reading and writing from the main memory, and the ability to raise an interrupt in the main core. Heck, all the buses I use could be implemented with one.

When I was young, creating processors/ISAs was just about possible using bitslices like the 2900 family.

Now it is easy using FPGAs. Go do it

Quote

I looked at XEF216-512-TQ128-C20A (23.16€ in singles at Digikey, in stock), and the problem I have with it is that it is too powerful!

They have smaller ones in the family

Quote

I fully understand why XMOS developed xC, a variant of C with additions for tile computing; something like this would be perfect for developing a new event-based programming language, since it already has the hardware support for it.

Neither hardware nor software on their own are sufficient - as demonstrated many times over the decades.

The xC+xCORE software+hardware ecosystem is the key achievement. XMOS definitely got that right, even if people's imagination is too limited to see it

Quote

For now, however, I have my own sights set much lower: replacing the standard C library with something more suitable for my needs, to be used in a mixed freestanding C/C++ environment. This is well defined in the standards, and there are more than one compiler I can use it with (specifically, GCC and Clang on different architectures, including 8-bit (AVR), 32-bit (ARM, x86), and 64-bit (x86-64, Aarch64/ARM64) ones).

I presume you are familiar with the Mill architecture. That's the only other non-incremental architecture worth looking at. The architects have a remarkable knowledge of what has worked well in the non-C non-UNIX world (yes, it exists) and have developed something that has DSP-like instruction parallelism and solid security while still working very well with C+UNIX and general-purpose computations.

tggzzz · « **Reply #634 on:** December 31, 2022, 11:29:46 am »

Quote from: Nominal Animal on December 31, 2022, 04:00:49 am

...
As you can see, you really need to understand the hardware differences to be able to determine what kind of language approach is suitable on use with a MCU.

Precisely.

Unfortunately that requires a "renaissance man", and they are dying out, and growing new ones is almost impossible due to the cliff of existing details they have to absorb. The cliff was much easier when it was being created, step by step

Quote

...
And why go for just another object-oriented imperative language, when there is good reason to expect better returns from say an event-based language – simply because there are decades of research and a lot of money behind existing object-oriented imperative languages like Rust, but very few people considering event-based programming languages, even though a lot of code we write for microcontrollers revolves around dealing with events?

I for one am not willing to just stand by and look silently when someone sets themselves up for a failure. I will pipe up, even if they themselves would prefer me to stay quiet. It's a personality flaw.

Precisely.

Computer scientists tend to ignore the embedded world as being mere hardware. Damn their eyes!

And I have that personality flaw too

tggzzz · « **Reply #635 on:** December 31, 2022, 11:33:09 am »

Quote from: brucehoult on December 31, 2022, 08:41:57 am

This seems on the face expensive and not scalable, but Berkeley's "TileLink" bus (enhanced and used by SiFive) enables the memory address, a constant, and the operation to be performed to be sent out to a suitably intelligent node in the memory system -- often the L2 cache shared by a number of CPU cores, but it might even be an IO device -- and the ALU operation is performed right where the data is stored, and only the result sent back to the CPU (and only if it's not going to be ignored by being written to register X0).

This scales extremely well.

Arm added a similar capability in AMBA 5 CHI (Coherent Hub Interface)

I don't know the TileLink bus, but at that level it sounds similar to the hypertransport comms in AMD's Athlon/Opteron/etc. That works nicely up to a certain point and the cache coherence traffic and latency becomes a limiting factor.

Nominal Animal · « **Reply #636 on:** December 31, 2022, 01:47:08 pm »

Quote from: tggzzz on December 31, 2022, 11:22:50 am

I'm not a fan of asymmetric processing for applications which aren't nailed down when the device ships. The asymmetry introduces a nasty question of whether a particular operation should be done here or there. That was a downfall of the CELL(?) architecture with one BIG and 8(?) LITTLEs a decade ago.

Ah, yes: IBM Cell architecture, used on the Sony Playstation 3.

Quote from: tggzzz on December 31, 2022, 11:22:50 am

Where the only operations that will be in LITTLE are fixed at design time, that disadvantage is not a major problem.

I'm specifically talking about small "accelerators" with tiny instruction sets, basically a simple ALU (I don't even need multiplication or division myself!), attached to e.g. a DMA engine, GPIO ports, that sort of a thing; not full asymmetric cores, and not with full or random memory access.

Stuff like CRC or hash calculation while doing memory-to-memory DMA, peripheral bus implementations, PWM, PDM, ADC with averaging and min-max recording, even wide LUT.

To circle back to the original topic, it would be very nice if we had a programming language that could embed such "miniprograms" better than what we have right now. I don't know how many of you are aware, but GPU pixel and vertex shaders are written in a C or C++ -like languages, for example OpenGL and OpenGL ES GLSL supplied as strings in source format, and compiled by the graphics drivers for its own hardware.

Perhaps some kind of augmented string format, or "#include" on steroids; perhaps a source file format where initial lines declared properties (accessible as compile-time constants), with the rest of the file available as an augmented string object? Or processable with an external compiler program, with the result provided as an augmented binary blob? The compile-time constant properties of that object are useful, if one needs to e.g. describe the resource use or expectations of that miniprogram.

In Linux systems programming, I occasionally use the seccomp BPF filters, to limit a process to a subset of (explicitly allowed) system calls. It mitigates the attack and bug surface when running e.g. dynamically compiled code –– consider something that evaluates an user-defined expression a few hundred million times. BPF itself is a mini-language, with a small binary instruction set. These are "installed" for a thread or process, and run by the kernel before internal syscall dispatch. Currently, one writes such programs by populating an array using lots of preprocessor macros, which is otherwise OK, but calculating jumps is annoying:

Code: [Select]

static const struct sock_filter  strict_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),

    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),  // Jump to RET_ALLOW if match
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read,         4, 0),  // Jump to RET_ALLOW if match
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write,        3, 0),  // Jump to RET_ALLOW if match
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit,         2, 0),  // Jump to RET_ALLOW if match
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group,   1, 0),  // Jump to RET_ALLOW if match

    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};

static const struct sock_fprog  strict = {
    .len = (unsigned short)( sizeof strict_filter / sizeof strict_filter[0] ),
    .filter = (struct sock_filter *)strict_filter
};

The RP2040 PIO and TI Sitara PRU code are examples of similar "miniprograms" in microcontroller environments.

Exactly how the "sub-language" embedding would work, I'm not sure –– I'd like the "give me this file as an augmented string object at compile time" and "run this file through an external compiler and give me the result as an augmented binary blob" ––; but looking at existing code and build machineries using such miniprograms/sub-languages might yield important ideas that reduces the barrier/learning-curve for using such sub-languages more often.

The way GCC, Clang, and Intel CC support GCC-style extended asm should not be overlooked, either. This is an extremely powerful macro assembler, that lets one write assembly where the C compiler chooses the exact registers used for the assembly snippet, unlike e.g. externally included assembly source files. (With inlined functions whose bodies are basically just an extended asm statement, it means the compiler can choose which registers it uses in the inlined copy.)
The syntax for the constraints is absolutely horrible, though; otherwise it is pretty nice.

tggzzz · « **Reply #637 on:** December 31, 2022, 03:07:50 pm »

Quote from: Nominal Animal on December 31, 2022, 01:47:08 pm

Quote from: tggzzz on December 31, 2022, 11:22:50 am
I'm not a fan of asymmetric processing for applications which aren't nailed down when the device ships. The asymmetry introduces a nasty question of whether a particular operation should be done here or there. That was a downfall of the CELL(?) architecture with one BIG and 8(?) LITTLEs a decade ago.
Ah, yes: IBM Cell architecture, used on the Sony Playstation 3.

Quote from: tggzzz on December 31, 2022, 11:22:50 am
Where the only operations that will be in LITTLE are fixed at design time, that disadvantage is not a major problem.
I'm specifically talking about small "accelerators" with tiny instruction sets, basically a simple ALU (I don't even need multiplication or division myself!), attached to e.g. a DMA engine, GPIO ports, that sort of a thing; not full asymmetric cores, and not with full or random memory access.

Stuff like CRC or hash calculation while doing memory-to-memory DMA, peripheral bus implementations, PWM, PDM, ADC with averaging and min-max recording, even wide LUT.

Sounds a bit like the IBM360 channel processors, or network i/o processors.

By and large trying to offload i/o onto a separate processor only works in some cases; in many (most?) it is a problem.

The canonical example is offloading network protocols into a separate processor. Two intractable problems are:

the network processor is slower than the main processor, and becomes the limiting factor in network throughput
presuming the main processor doesn't sit waiting for the (slow) network processor, the comms between the network processor and main processor is effectively yet another network hop. For example, the network processors buffer memory could become full before the main processor gets around to receiving a packet. That means the network processor shouldn't ACK a packet's reception until it is in the main processor's memory And that rather negates the purpose of a network processor

If all you are attempting is to limit the unpredictability of i/o operations' timing, then a separate i/o processor can work well. And with that we're back to xCORE!

brucehoult · « **Reply #638 on:** January 01, 2023, 12:53:12 am »

Quote from: tggzzz on December 31, 2022, 11:33:09 am

Quote from: brucehoult on December 31, 2022, 08:41:57 am
This seems on the face expensive and not scalable, but Berkeley's "TileLink" bus (enhanced and used by SiFive) enables the memory address, a constant, and the operation to be performed to be sent out to a suitably intelligent node in the memory system -- often the L2 cache shared by a number of CPU cores, but it might even be an IO device -- and the ALU operation is performed right where the data is stored, and only the result sent back to the CPU (and only if it's not going to be ignored by being written to register X0).

This scales extremely well.

Arm added a similar capability in AMBA 5 CHI (Coherent Hub Interface)

I don't know the TileLink bus, but at that level it sounds similar to the hypertransport comms in AMD's Athlon/Opteron/etc. That works nicely up to a certain point and the cache coherence traffic and latency becomes a limiting factor.

It's not cache-coherence. The whole point is to be able to avoid moving cache lines around.

Everything works up to a certain point. The question to be answered is whether that point is 4 CPUs, 64 CPUs, or 1024+ CPUs.

x86's (and SPARC's) TSO memory model is also a limiting factor for those systems. Arm and RISC-V have better memory model, certainly by the time the CPU count gets into three figures.

Note that there are already RISC-V chips in production with more than 1000 cores e.g. ET-SoC-1 with 1088 small in-order cores with vector units, plus 4 big OoO cores.

Nominal Animal · « **Reply #639 on:** January 01, 2023, 04:38:14 am »

Quote from: tggzzz on December 31, 2022, 03:07:50 pm

By and large trying to offload i/o onto a separate processor only works in some cases; in many (most?) it is a problem.
[...]
If all you are attempting is to limit the unpredictability of i/o operations' timing, then a separate i/o processor can work well. And with that we're back to xCORE!

No, actually neither. And I'm very often fighting buffer bloat, so I'm looking to ways of avoid that, too.

The idea with the small ALUs is not to have more buffers, but to do whatever small arithmetic and bit operations are necessary when the data is accessed anyway, and use as few buffers as is possible. Zero-copy operation, really, is what I'm after.

The common Ethernet interface I see on MCUs (like NXP i.MX RT1062 and STM32F417) is a set of registers and packet data fed via FIFO, usually using a DMA channel. BPF was designed as a micro-language to examine the packets as they were received, to early discard and route them. (Even on servers, I never liked Ethernet offload, because it does add significant buffer blots; I'm not talking about that kind of offloading.)

For Ethernet interface, internal "routing" –– to the code that is interested in specific packets –– can be done whenever the IP packet headers are read; there is no need to wait for the data to complete before doing that. One can even mark the incoming packet buffer to be immediately reused, if the packet is not something that needs handling. A send operation can be completely asynchronous, just setting up the DMA descriptor for the packet, with a suitable event generated when the packet has been transmitted (UDP or raw Ethernet frame) or a TCP ACK received.

Siwastaja · « **Reply #640 on:** January 01, 2023, 07:48:47 am »

Quote from: Nominal Animal on January 01, 2023, 04:38:14 am

The idea with the small ALUs is not to have more buffers, but to do whatever small arithmetic and bit operations are necessary when the data is accessed anyway, and use as few buffers as is possible. Zero-copy operation, really, is what I'm after.

Exactly. A simple example:

Many UART or SPI based protocols add a CRC16 for data integrity verification. Yet, microcontroller peripherals usually cannot calculate CRCs. So all practical code implementations either prepare the complete message in the buffer and calculate CRC in one go - allowing DMA to be used, but timing is awkward as the CRC calculation over the whole message takes so long. Another option is to calculate CRC word per word during RX / TX - which is not very high cost operation if you process the data in interrupt handler word per word anyway. But then you can't use DMA.

Higher end STM32 SPI peripherals include hardware CRC for this reason, but being fixed in silicon, it can only cover some cases (there is no standardized "CRC over SPI").

What you describe would do exactly that, but in much more flexible way. What's even better, such simple core could bit-bang the protocol itself, so the microcontroller would not need a separate UART / I2C / SPI peripherals at all.

Quote from: tggzzz on December 31, 2022, 03:07:50 pm

Two intractable problems are:
the network processor is slower...
...waiting for the (slow) network processor...

Really, the one problem is slow speed. The solution is to make them fast. There is no reason why the simple cores could not run at comparable speed to the main CPU. In order not to make them complicated with pipelining, branch prediction etc., it means giving up some of their capabilities, but that's OK. Even if you have just basic 32-bit add and bitwise operations plus branch instructions, running at say 200MHz while the more complex main CPU runs heavily pipelined at 400MHz, that's OK. If you look at current high-end microcontroller offerings, their IO and most peripherals already run at Fcpu/2 or lower. One could just replace the hardwired peripherals with tiny, simple CPUs.

Nominal Animal · « **Reply #641 on:** January 01, 2023, 10:36:58 am »

Quote from: Siwastaja on January 01, 2023, 07:48:47 am

Even if you have just basic 32-bit add and bitwise operations plus branch instructions, running at say 200MHz while the more complex main CPU runs heavily pipelined at 400MHz, that's OK. If you look at current high-end microcontroller offerings, their IO and most peripherals already run at Fcpu/2 or lower. One could just replace the hardwired peripherals with tiny, simple CPUs.

I agree. Furthermore, these tiny cores – really, just ALUs – don't even need arbitrary memory address access, only a couple of accumulator registers (although a dedicated read-only lookup access would be nice), and very limited loop/jump/conditional capabilities. They're truly simple.

Something like 100 MHz would suffice for my own peripheral bus needs quite nicely (even when the MCU core itself runs at say 400-1000 MHz).

Still, it is the programming of such tiny limited cores in a sub-programming-language within larger applications that currently keeps them in "niche" uses, I believe.
I know tggzzz and others somewhat disagree, but I've found them (as I've described, from BPF to shaders) extremely powerful, and only difficult to "teach" others to use efficiently and effectively. (Not the programming part itself; but the design and integration part.)

In my opinion, it is exactly the development of programming paradigms –– especially event-based programming languages –– that is keeping us software folks back, when compared to the progress made on the hardware side in the last three decades or so. I mean, compilers have become cleverer and better, and we have more choices when it comes to programming languages and especially libraries, but real forward development is rare!

It is also why I do not want to discourage people like OP who want to design new languages; I only want them to consider approaches that haven't already been trodden by scores of well-resourced developers without real significant results; to see approaches that my and others' experience indicates might yield much better results, something new.

(Okay, I am actually well aware that the OP does not read any of my posts, so I'm not writing these posts to OP. I'm writing these to the other people who are mulling new microcontroller programming language development now or in the future, in the hopes that the concepts I'm pushing leads them to discover and develop something truly better than the imperative languages we have now, without complicated layers of abstractions and runtimes that require more and more computing resources like we see on server and desktop architectures.)

Siwastaja · « **Reply #642 on:** January 01, 2023, 11:00:31 am »

Quote from: Nominal Animal on January 01, 2023, 10:36:58 am

these tiny cores
...
Still, it is the programming of such tiny limited cores in a sub-programming-language within larger applications that currently keeps them in "niche" uses, I believe.

These tiny cores could be register-only machines with say 16-32 general purpose registers* (to be able to hold at least some state) and no RAM/stack. Typical program would be tens, max hundreds of LoC. The instruction set could be easy enough that everyone would just program them in assembly. With no memory addressing or other such complicated stuff to parse, the assembler itself could be <1K of code (direct, simple parsing of mnemonics to opcodes), so maybe one could even write the assembly code within the main MCU code and have main CPU assemble it at runtime. (Maybe the machine code would be doable compile-time with some clever C preprocessor stuff, even.)

EDIT: *) with a mechanism to configure the reset values of the registers, for constants, think about CRC polynomial for example.

Nominal Animal · « **Reply #643 on:** January 01, 2023, 11:09:31 am »

Consider the common Unix command-line sort utility.

Regardless of what programming language you use, you typically implement it as reading the input into an array of lines, then sort that array, and finally output the sorted array.

This is very suboptimal use of computing resources.

Instead, one should use a self-organizing data structure, for example a binary heap, so that each line read gets sorted and "inserted" into the correct position. This way, the data is sorted immediately when it is received, and you have best cache locality given the situation. You also use the CPU time that otherwise would be mostly wasted in waiting for the I/O to complete, for computation on data that has already been touched recently.

Because the I/O bandwidth is the bottleneck here –– sorting/insertion does not require much computation, just comparisons ––, in real world time, this approach can start emitting the "sorted" data immediately after the last line of input has been read, when the traditional approach is just ready to start sorting the data.

Overall, self-organizing data structure insertion operations might use fractionally more CPU time than the traditional approach, because we have some very clever offline array sort algorithms. It depends on the instruction set, hardware architecture, and especially the cache architecture, which one actually consumes fewer resources overall.

It is this kind of algorithmic approach that will lead to true progress in software development. Compiler optimization is just details. Problem is, how do we make it easier for humans to implement algorithms that yield better solutions, when the imperative approach they find intuitive favors the less efficient algorithms?

Let's consider my suggestion of focusing on event-based programming, and what a sort implemented using one would be like.

For one, we'll need the input to produce lines, preferably with each line a separate event we can act on. Then, reading the input is a matter of setting up the inputs as input line event generators, and the event handler inserts the lines to a suitable data structure, for example an extensible binary min/max-heap stored in an array. The sort key and direction rules would be applied at that insertion phase. When the input has been completed – an end-of-input event? – we switch to the data structure, emptying it line-by-line in order to the output. Done.

It is examples like this, and stuff like button/encoder-controlled menus on gadgets, finite state machines, and other very event-oriented stuff, that makes me think event-based programming languages might give us a leap forwards. As I mentioned before, even server-side software tends to be very event-oriented, and their operation easier to describe in event terms – request, response especially – than in imperative terms.

Yet, I'm not particularly enamored by event-based programming. For things like the ALU microcores mentioned in previous messages – including BPF, pixel and vertex shader programming –, I definitely prefer imperative programming.
It interests me because I see it as a possible way out of the decades-long stuck phase in software development! (Okay, object oriented programming has made user interface programming much, much easier and arguably more effective than before, but it is just about the only field I've seen truly progress forward; and yet, even it hasn't made any big difference in software reliability, effectiveness –– and definitely has worsened memory hogging.)
In other words, I see us software folks stuck in abstractions and paradigms that haven't really evolved much in the last three decades or more, and am trying to get others interested in the largely untested approaches, because the imperative-object-oriented approaches don't seem to be able to produce anything clearly better, even with decades of person-hours of development work in them.

tggzzz · « **Reply #644 on:** January 01, 2023, 11:38:12 am »

Quote from: Nominal Animal on January 01, 2023, 10:36:58 am

...
Something like 100 MHz would suffice for my own peripheral bus needs quite nicely (even when the MCU core itself runs at say 400-1000 MHz).

Still, it is the programming of such tiny limited cores in a sub-programming-language within larger applications that currently keeps them in "niche" uses, I believe.
I know tggzzz and others somewhat disagree, but I've found them (as I've described, from BPF to shaders) extremely powerful, and only difficult to "teach" others to use efficiently and effectively. (Not the programming part itself; but the design and integration part.)

I'm not sure we do disagree there. Domain Specific Languages can be an excellent architectural choice, but usually aren't.

The "usually bad" examples are typically from softies that yearn to invent their own language (the hardware equivalent is inventing their own processor).

So what might I think are (potentially) good examples?

where the underlying hardware is very different to traditional computer sequential CPU plus global fully addressible memory. Anything related to GPUs is an excellent example
where the computation isn't one of the standard sequential programming paradigms. Pattern matching is one example

Having said that, teaching effective use of such languages will remain a problem. It may be interesting to consider why it is difficult.
Is it more difficult that teaching standard programming paradigms? I tend to doubt that, based on the messes people produce in standard paradigms!
Perhaps people's minds are crippled by learning only the standard paradigms? (cf Dijkstra on COBOL)

Quote

In my opinion, it is exactly the development of programming paradigms –– especially event-based programming languages –– that is keeping us software folks back, when compared to the progress made on the hardware side in the last three decades or so. I mean, compilers have become cleverer and better, and we have more choices when it comes to programming languages and especially libraries, but real forward development is rare!

Oh yes! Precisely!

Quote

It is also why I do not want to discourage people like OP who want to design new languages; I only want them to consider approaches that haven't already been trodden by scores of well-resourced developers without real significant results; to see approaches that my and others' experience indicates might yield much better results, something new.

(Okay, I am actually well aware that the OP does not read any of my posts, so I'm not writing these posts to OP. I'm writing these to the other people who are mulling new microcontroller programming language development now or in the future, in the hopes that the concepts I'm pushing leads them to discover and develop something truly better than the imperative languages we have now, without complicated layers of abstractions and runtimes that require more and more computing resources like we see on server and desktop architectures.)

Oh yes! Precisely!

tggzzz · « **Reply #645 on:** January 01, 2023, 11:58:03 am »

Quote from: brucehoult on January 01, 2023, 12:53:12 am

Quote from: tggzzz on December 31, 2022, 11:33:09 am
Quote from: brucehoult on December 31, 2022, 08:41:57 am
This seems on the face expensive and not scalable, but Berkeley's "TileLink" bus (enhanced and used by SiFive) enables the memory address, a constant, and the operation to be performed to be sent out to a suitably intelligent node in the memory system -- often the L2 cache shared by a number of CPU cores, but it might even be an IO device -- and the ALU operation is performed right where the data is stored, and only the result sent back to the CPU (and only if it's not going to be ignored by being written to register X0).

This scales extremely well.

Arm added a similar capability in AMBA 5 CHI (Coherent Hub Interface)

I don't know the TileLink bus, but at that level it sounds similar to the hypertransport comms in AMD's Athlon/Opteron/etc. That works nicely up to a certain point and the cache coherence traffic and latency becomes a limiting factor.

It's not cache-coherence. The whole point is to be able to avoid moving cache lines around.

That's possible and desirable except where memory is shared between cores, e.g. buffers and synchronisation mechanisms.

Quote

Everything works up to a certain point. The question to be answered is whether that point is 4 CPUs, 64 CPUs, or 1024+ CPUs.

x86's (and SPARC's) TSO memory model is also a limiting factor for those systems. Arm and RISC-V have better memory model, certainly by the time the CPU count gets into three figures.

Note that there are already RISC-V chips in production with more than 1000 cores e.g. ET-SoC-1 with 1088 small in-order cores with vector units, plus 4 big OoO cores.

I can't comment on the relative quality of memory models, since it has been too long since I was interested in them.

There's a general problem of deciding the best granularity for parallel computation. Too fine a granularity and little context has to be shared but the synchronisation times will dominate the computation times. Too coarse, and synchronisation is less of a problem but communicating more context becomes an issue.

Where exactly a cliff lies will depend on both the low-level architecture and the application. Some applications are "embarassingly parallel", and for those the cliff will be more remote and/or gentle. Others will hit Amdahl's law sooner.

I am skeptical about putting vast numbers of general purpose cores on a single piece of silicon; on chip memory size and off-chip bandwidth+latency rapidly become limiting factors. For some niche applications that is not a limitation, of course.

tggzzz · « **Reply #646 on:** January 01, 2023, 12:23:23 pm »

Quote from: Nominal Animal on January 01, 2023, 11:09:31 am

Yet, I'm not particularly enamored by event-based programming. For things like the ALU microcores mentioned in previous messages – including BPF, pixel and vertex shader programming –, I definitely prefer imperative programming.

Having the right granularity of event is the key; it should map directly onto the events in the application's definition and be understandable by someone not involved in the implementation.

There was a chip with, IIRC, 144 tiny cores+memory on it. I forget the manufacturer, but Greenpak/Greenpar springs to mind. I looked at it, but the only use case I could find used each core as a little more than a single logic gate - like a LUT in an FPGA.

Quote

It interests me because I see it as a possible way out of the decades-long stuck phase in software development! (Okay, object oriented programming has made user interface programming much, much easier and arguably more effective than before, but it is just about the only field I've seen truly progress forward; and yet, even it hasn't made any big difference in software reliability, effectiveness –– and definitely has worsened memory hogging.)

OOP has had one major effect: it strongly encourages encapsulation at multiple levels from small objects to large sub-systems accessed only via interfaces. That reduces the cognitive load by reducing the number of things that have to be known before something is used as a black box component. That in turn enables larger applications to be composed by average programmers. And applications have become more complex until the problems of poor structuring limit them.

Summary: the same old problems of programmer's brainpower still apply, but at a larger/higher/more complex level.

And yes, OOP isn't necessary to enable large complex well structured applications, which can be done in C/Pascal/etc. But in practice it seems to be a precondition.

And yes, poor OOP technology leads to poor unreliable applications; think C++ (if you think C++ is the answer, then you need to revisit the question!)

Quote

In other words, I see us software folks stuck in abstractions and paradigms that haven't really evolved much in the last three decades or more, and am trying to get others interested in the largely untested approaches, because the imperative-object-oriented approaches don't seem to be able to produce anything clearly better, even with decades of person-hours of development work in them.

Yup. Me too

DiTBho · « **Reply #647 on:** January 01, 2023, 12:52:44 pm »

Quote

if you think C++ is the answer,
then you need to revisit the question!

Loooooooool, I want a t-shirt with this

tggzzz · « **Reply #648 on:** January 01, 2023, 12:56:13 pm »

Quote from: DiTBho on January 01, 2023, 12:52:44 pm

Quote
if you think C++ is the answer,
then you need to revisit the question!

Loooooooool, I want a t-shirt with this

I decided that in the late 80s, and subsequent history has only reinforced my view!

Siwastaja · « **Reply #649 on:** January 01, 2023, 01:43:05 pm »

C++ is great, because C++ connects people: basically all experienced or knowledgeable people (people with a clue, or those who grok, so to speak) agree on if not hating, but at least not loving C++, no matter how much they disagree otherwise. It's truly a horrible and messy language.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: The Imperium programming language - IPL (Read 71503 times)

Share me