If anyone is interested, that segmented memory access model that itches my brain is still bubbling like a fetid pool of industrial waste.
It has now split into two, where one is fundamentally incompatible with standard C (with the low k bits of pointers used to identify the segment, and all composite objects and arrays aligned to 2^k bytes); in particular, wrt. the address-of operator, &, when applied to members or elements of objects (or basically anything smaller than 2^k bytes, or with smaller alignment requirement, particularly char and its signed/unsigned variants).
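A minimal sketch of that first model, assuming k = 4 (an arbitrary value chosen purely for illustration): the low k bits of a pointer name the segment, and the object base is the pointer with those bits cleared — which is exactly why every composite object must be aligned to 2^k bytes, and why & on a char member cannot produce a representable pointer.

```c
#include <assert.h>
#include <stdint.h>

enum { K = 4 };  /* hypothetical: low K bits of a pointer identify the segment */

/* Segment number: the low K bits of the pointer value. */
static unsigned segment_of(uintptr_t p)
{
    return (unsigned)(p & ((1u << K) - 1));
}

/* Base address of the object: the pointer with the segment bits cleared.
   Only works because objects are aligned to 2^K bytes. */
static uintptr_t object_base(uintptr_t p)
{
    return p & ~(uintptr_t)((1u << K) - 1);
}
```

Note that nothing in this scheme can address a sub-object at a finer granularity than 2^K bytes, which is the incompatibility with standard C's address-of semantics.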
The other boils down to having a paged MMU with 2^k page tables, each starting at a virtual address that is a multiple of 2^(N-k), or equivalently a paged MMU that has the same overhead regardless of the virtual address. Then, the virtual address space can be split into 2^k "segments", where the page table entries cover a contiguous virtual address range up to the "segment limit". Stacks need to grow upwards, though, and therefore you really need addressing modes where an offset is subtracted from the stack or stack frame pointer; especially with a small literal constant (the most common use, referring to a local variable).
It took me a couple of days of thinking to see how it really boils down to these two, even though they both provide some really nice features.
Furthermore, the first one can be emulated on any architecture, by using a composite reference, (base object address, offset), instead of scalar pointers. For arrays, a quad (base object address, step delta, first index, count) specifies a slice, just like in e.g. Fortran or Python, and is valid if and only if both step delta × first index and step delta × (first index + count − 1) are within the base object. (Of course, using first offset = step delta × first index instead of first index, and end offset = step delta × (first index + count − 1) instead of count, is better, but it is important that the human-readable code does not directly provide first offset and end offset, because any off-by-one errors would completely confuse it.)
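The slice quad and its validity rule can be sketched in C; the struct layout and names here are illustrative, not a fixed ABI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A slice: (base object address, step delta, first index, count),
   with the base object's size carried along for the validity check. */
struct slice {
    unsigned char *base;   /* start of the base object */
    size_t         size;   /* size of the base object, in bytes */
    ptrdiff_t      delta;  /* step delta in bytes; may be negative */
    ptrdiff_t      first;  /* first index */
    size_t         count;  /* number of elements */
};

/* Valid iff both delta*first and delta*(first + count - 1)
   fall within the base object. */
static bool slice_valid(const struct slice *s)
{
    if (s->count == 0)
        return true;
    ptrdiff_t lo = s->delta * s->first;
    ptrdiff_t hi = s->delta * (s->first + (ptrdiff_t)s->count - 1);
    if (lo > hi) {               /* negative delta: endpoints swap */
        ptrdiff_t t = lo; lo = hi; hi = t;
    }
    return lo >= 0 && (size_t)hi < s->size;
}
```

Checking only the two endpoints suffices because every intermediate offset lies between them.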
I'm finding it funny how different it is to think in terms of "composite objects" or "structures" – (base, offset) pairs – with direct pointers to small scalar members forbidden/impossible, compared to how one views the memory address space in C.
Having a hardware comparison instruction that checks if 0 ≤ offset < limit, with offset a register operand, and limit either a register or a memory operand, would help a lot, because then the bounds check would be a single instruction. Better yet, a three-operand "within bounds" instruction, where the offset being checked is always a register operand, the lower bound either a register or a (small) literal unsigned integer, and the upper bound either a register or a memory address containing the bound.
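As a software stand-in for that instruction: the 0 ≤ offset < limit check already collapses to a single unsigned comparison on existing hardware, because reinterpreting a signed offset as unsigned maps every negative value above any sane limit. A small sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 0 <= offset < limit as one unsigned compare: a negative offset,
   viewed as unsigned, wraps to a huge value and fails the < test. */
static bool in_bounds(int64_t offset, uint64_t limit)
{
    return (uint64_t)offset < limit;
}
```

The proposed three-operand form would extend this to nonzero lower bounds without the extra subtract that trick currently requires.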
Also, it would be useful to be able to swap the greater-than and less-than comparison status flags based on the highest bit of a named register; a sort of a dedicated XOR operation. Then, a slicing loop would not need to care whether step delta is positive or negative: just compare if current offset is less than end offset, and swap the greater/less flags if step delta is negative. Having this in one instruction would be even nicer: compare offset > limit if delta is negative (highest bit set), offset < limit otherwise. The result condition is a single bit, true or false, with the next instruction expected to be a branch based on this comparison.
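In portable C the flag swap has to become an explicit choice of comparison on the sign of the delta; the proposed instruction would fold the two loops below into one. A sketch, with delta and the offsets in element units (delta must be nonzero):

```c
#include <assert.h>
#include <stddef.h>

/* Sum a slice given as (first offset, end offset, step delta).
   The direction of the loop comparison depends on the sign of delta,
   which is exactly what the flag-swapping instruction would hide. */
static long slice_sum(const int *base, ptrdiff_t delta,
                      ptrdiff_t first_off, ptrdiff_t end_off)
{
    long sum = 0;
    if (delta > 0)
        for (ptrdiff_t off = first_off; off <= end_off; off += delta)
            sum += base[off];
    else
        for (ptrdiff_t off = first_off; off >= end_off; off += delta)
            sum += base[off];
    return sum;
}
```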
Similarly, an N×N = N multiplication instruction really needs a flag showing if the result is exact, or just the low N bits. Basically a sticky-carry flag, that is only cleared by a specific clear-sticky-flag instruction and by multiplication, and gets set if an add, sub, or shift overflowed even temporarily (say, an x<<6 shifted any nonzero bits out). This would make it trivial to detect size calculation overflow. (Making this into a C built-in or an inline assembly function that returns 0 if an overflow occurs works fine; something like ulfma(x, y, z) == x·y + z, or zero if the result overflows.)
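On current compilers that built-in can be approximated with the GCC/Clang checked-arithmetic intrinsics standing in for the sticky flag; the name ulfma is from the text above, not an existing API:

```c
#include <assert.h>
#include <stddef.h>

/* ulfma(x, y, z) == x*y + z, or 0 if either step overflows.
   __builtin_mul_overflow / __builtin_add_overflow are GCC/Clang
   extensions that report the overflow the sticky flag would latch. */
static size_t ulfma(size_t x, size_t y, size_t z)
{
    size_t p, r;
    if (__builtin_mul_overflow(x, y, &p))
        return 0;
    if (__builtin_add_overflow(p, z, &r))
        return 0;
    return r;
}
```

(The zero return is unambiguous as an error value in size calculations, since a zero-sized allocation needs no memory anyway.)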
A "sticky" overflow bit makes sense in general, because then one can omit the rarely-taken branch-if-overflow instructions, and just have one at the end of the calculation.
I'm rather dejected that after that much mental itching, it turns out to be this mundane.
On the other hand, it really clarified to me how much the ABI and the programming language affect the way the hardware gets used, and how relatively small things (like the paged MMU implementation details) can provide a completely different paradigm ("segmented" memory, through pointers with the k high bits indicating the "segment", but address space limited to objects at most 2^(N-k) bytes long).