One interesting VLIW CPU I worked with was the Philips TriMedia PNX1302. It had 128 32-bit registers, and by convention the top 64 were reserved for interrupts, which avoided the cost of saving and restoring registers. Also, an interrupt could only be taken on a branch instruction. This may sound crazy, but it probably made things easier for the designers. It also made it very easy to do atomic operations: everything between branches was atomic.
The TI Sitara, as used in the BeagleBone Black, has a main ARM processor and two PRUs (Programmable Real-time Units), which are microcontrollers that each run their own independent program.
I'm assuming you do want some kind of interrupts.
You *can* do everything by polling, but it places big constraints on software design if you want to be responsive to external events. In the limit you end up with your main program being some kind of interpreter, executing some kind of meta instruction set (or something like FORTH words) and polling for events between them.
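To make that concrete, here's a rough sketch (in C, with purely hypothetical flag and handler names) of what such a polling-only design tends to converge to: a dispatcher that executes one small step of the main program at a time and checks for events in between:

#include <stdbool.h>

/* Hypothetical memory-mapped status flags -- stand-ins for real hardware. */
volatile bool uart_rx_ready;
volatile bool timer_expired;

static void handle_uart(void)  { /* read the byte, push it onto a queue, ... */ }
static void handle_timer(void) { /* bump a tick counter, kick a state machine, ... */ }
static void run_one_step(void) { /* execute one short "meta instruction" of the main program */ }

int main(void)
{
    for (;;) {
        /* Poll for external events between every step of the main program.
           Worst-case responsiveness is bounded by the longest run_one_step(). */
        if (uart_rx_ready)  handle_uart();
        if (timer_expired)  handle_timer();
        run_one_step();
    }
}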
Perhaps these interrupt processor cores can also be simpler cores that can only execute a basic subset of the main CPU's full instruction set (there is no reason to have an FPU on these, for example).
Now that starts to sound like the PRUs on the TI Sitara (the BeagleBone Black processor).
Edit: on second thought, the additional, dedicated core approach alone isn't enough to get rid of interrupts. One additional dedicated core, without interrupts, would be OK if it does everything by polling. Problem is, if all you do is polling, even on a dedicated core, you can't efficiently handle priorities, specifically "preempting" what the core is currently servicing to service something else. So if you need to handle preemption, then you either need interrupts - again - or use more cores, each handling one or several tasks that don't require preemption. On a moderately complex system, that would require a lot of cores.
Yes, exactly: if you want to run an OS with pre-emptive (as opposed to co-operative) multitasking, you do need some type of interrupts that change the instruction pointer.
(Disclaimer: not sure this is the appropriate section for this, but I had a hard time finding one.)
I'm currently working on CPU design (with a working RISC-V core so far), and at this point, more specifically pondering the question of implementing alternatives to classic interrupts.
I know of the XMOS architecture, for instance, so that could give some starting ideas. I thought sharing ideas/discussing them could be interesting.
It is interesting to think about that for hardened designs, but there is another angle to FPGA-based implementations that people often miss.
In an FPGA you are free to implement time-sensitive logic in the fabric itself, and it can be changed for each project individually. So the whole core can be designed without interrupts at all, making the whole design much simpler.
Again, this is provided that you are implementing the core for practical use, and not just for entertainment on the assumption that it may one day be implemented in silicon. There you do have to think about this stuff, of course.
The whole point of classic interrupts is that they don't require many additional resources on top of an existing core, while basically giving access to all of said core's resources in ISRs. Dedicating one or more cores just for this can have a significant cost, unless 1/ these additional core(s) are simpler than the main processing core(s), or 2/ for a given range of applications, there is so much to be done in ISRs that dedicating one or more cores just for this is justified from a resource POV.
Have you considered using an interrupt mechanism that can also be used to implement tasks/threads and coroutines?
<other concepts snipped>
You are getting there, but are approaching it from the wrong angle and not being sufficiently radical
But this sort of efficient interrupt handling is probably more of an interesting topic for embedded real-time applications than for big OSes. Getting device drivers in a large OS to do something other than what they use now is probably a ton of work.
You are getting there, but are approaching it from the wrong angle and not being sufficiently radical
Do you have a manifesto? It is imperative that you have a manifesto.
we need more approaches to system design for future highly parallel systems
Whenever I advocate proper design techniques instead of ad-hoc hacking in this forum, it doesn't take long until people appear who quote phrases like "keep it simple" or who say "you are only adding complexity, good engineering is simple".
I hope you are luckier than me.
You are getting there, but are approaching it from the wrong angle and not being sufficiently radical :)
Well, perhaps! But what I described was what I'd like to see from a traditional processor, running machine code generated from traditional programming languages.
Having the core of the interrupt controller be a register file swapper basically implements in hardware what OS kernels end up doing in software, in a way that can be used to implement userspace task switching too (coroutines, signals).
The memory manager is really a separate feature, and I probably should have omitted it; it just completes the picture for me.
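As an aside, here's a minimal userspace illustration of the register save/restore that kernels and coroutine libraries do in software, using POSIX ucontext (still available on Linux/glibc, though obsolescent). A hardware register-file swapper would perform the equivalent state swap in one step; this is only an analogy, not a description of the proposed interrupt controller:

/* Each ucontext_t holds a saved register set plus a stack pointer;
   swapcontext() saves the current registers and loads another set. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;

static void task(void)
{
    printf("task: running with its own register context\n");
    swapcontext(&task_ctx, &main_ctx);   /* yield back to main */
    printf("task: resumed, finishing\n");
    /* falling off the end resumes uc_link (main_ctx) */
}

int main(void)
{
    static char stack[64 * 1024];

    getcontext(&task_ctx);               /* initialize with the current state */
    task_ctx.uc_stack.ss_sp   = stack;   /* give the task its own stack       */
    task_ctx.uc_stack.ss_size = sizeof stack;
    task_ctx.uc_link          = &main_ctx;
    makecontext(&task_ctx, task, 0);

    printf("main: switching to task\n");
    swapcontext(&main_ctx, &task_ctx);   /* save main's registers, load task's */
    printf("main: back, switching again\n");
    swapcontext(&main_ctx, &task_ctx);
    printf("main: done\n");
    return 0;
}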
The idea of dedicating one or several cores for servicing what is usually done in ISRs is a workable approach. This is how things are done on XMOS devices, for instance, as far as I've understood.
...
And yes, this has been done before: many old systems used a main CPU and then one or several secondary (and cheaper) processors dedicated to servicing peripherals or otherwise secondary tasks.
It is worse when you hear such statements at work. The best retort I have found is "Everything should be made as simple as possible, but not simpler", which is often attributed to Einstein.
Re: context switch
You can provide a separate set of registers for 3 modes: Kernel, Supervisor and User - like the later PDP-11s.
Also, it is not clear whether Rutherford, Einstein, and Feynman really said those things; the quotes are merely attributed to them.
pondering the question of implementing alternatives to classic interrupts.
Hasn't this been done already with CPU-internal event systems? Those enable me to create complex configurations and get around interrupts a lot of the time, so ... am I missing something?
Care to elaborate a bit? Especially how you can handle external interrupts with this scheme?
I don't know whether I understood correctly what you are trying to achieve. The event systems that I'm talking about enable me to configure the MCU like this:
"When you see a logic LOW on pin X please start counter Y, and as soon as counter Y has a compare match please trigger the ADC on channel 6, and when the ADC is done please trigger interrupt Z."
So, in this example there is no code execution or interrupt handling involved within the event chain, only at the end there is an interrupt. This means that the CPU has no code to execute while the event chain is running and is free to do anything else. In many cases I can create a configuration where there is no interrupt and no code execution at all, only event handling.
Some series of Microchip MCUs have this kind of event system. But as I said, I don't know whether I understood you correctly and whether this has anything to do with your question.
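For illustration, a configuration like the one described might look roughly like this in C. The register addresses and constant names below are invented for the sketch (they do not correspond to any real vendor's header); the point is just that the whole chain is set up once and then runs with no code executing until the final interrupt:

/* Hypothetical pseudo-registers sketching the event chain described above:
   pin X low -> start counter Y -> compare match -> ADC trigger -> IRQ at the end.
   These names do NOT match any real memory map; they only illustrate that the
   chain is wired up by configuration, with no code running in between. */
#include <stdint.h>

#define REG(addr) (*(volatile uint32_t *)(addr))

#define EVSYS_CH0_SRC   REG(0x40010000)  /* event channel 0 source select      */
#define EVSYS_CH0_DST   REG(0x40010004)  /* event channel 0 user/destination   */
#define EVSYS_CH1_SRC   REG(0x40010008)
#define EVSYS_CH1_DST   REG(0x4001000C)
#define ADC_IRQ_ENABLE  REG(0x40020010)

enum {
    EV_SRC_PINX_LOW      = 0x01,  /* generator: logic low on pin X       */
    EV_DST_TIMERY_START  = 0x11,  /* user: start counter Y               */
    EV_SRC_TIMERY_CMP    = 0x02,  /* generator: counter Y compare match  */
    EV_DST_ADC_START_CH6 = 0x16,  /* user: trigger ADC conversion, ch. 6 */
    ADC_IRQ_DONE         = 1u << 0,
};

void configure_event_chain(void)
{
    EVSYS_CH0_SRC  = EV_SRC_PINX_LOW;      /* pin X low ...                 */
    EVSYS_CH0_DST  = EV_DST_TIMERY_START;  /* ... starts counter Y          */
    EVSYS_CH1_SRC  = EV_SRC_TIMERY_CMP;    /* compare match on counter Y... */
    EVSYS_CH1_DST  = EV_DST_ADC_START_CH6; /* ... triggers ADC channel 6    */
    ADC_IRQ_ENABLE = ADC_IRQ_DONE;         /* only the final "ADC done" interrupts the CPU */
}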
The interrupts I engineered into my own RISC cores in my FPGA were all very simple to implement. I cannot see any issue with doing so in an FPGA. When an interrupt signal is sent, all I had to do in the instruction fetch pipeline was shift in a JSR/CALL to a fixed address. Yes, just a jump/call. Nothing else.
That's like MIPS :D
standard SPARC instruction set
As far as I understand, one of the guys on the RISC-V team is the same person who was behind SPARC years ago. So I learned a simple thing: if he recently changed his mind, then probably the SPARC-ISA was not a good idea :D
And, yes, Dave now believes register windows to be a mistake.
Do you know his exact rationale?
@brucehoult
With a common simple MIPS design (R2K/R3K), interrupts and exceptions are handled by COP0. On such an event, COP0 forces the CPU to move the current PC into another register, namely "EPC".
To return from a handler, the CPU only needs to move the contents of the EPC register back into a general-purpose register, in order to use a simple "jump" to load it into the PC.
What I mean is: it's simple, it takes just a few circuits, and it's efficient :D
Trimedia is a DSP architecture. They ALWAYS have interesting tradeoffs. :)
Interesting trade-off. Switch registers to minimize interrupt latency -- but then add to it by only switching on branches. Also 128 registers uses a ton of die space.
I note that MIPS hard wires R31 as the subroutine address register. So you could, in a MIPS-like dedicate, say, R30 to exception return. Or shadow R31.
I am limited by the hardware I can afford
Having a physical gadget to play with is more fun than complete abstraction...
Why not a software simulator?
Any recommendations, considering the only OS I have is Linux? I'd prefer open source if possible (speed is not really a concern for me), but it's not a hard requirement. Verilator? Icarus Verilog? Cascade?
I note that MIPS hard wires R31 as the subroutine address register. So you could, in a MIPS-like dedicate, say, R30 to exception return. Or shadow R31.
The only alternative (when it's implemented) is ERET(1).
This is the common way ERET works with MIPS32:
# Trap handler in the standard MIPS32 kernel text segment
...
move $v0,$k0 # Restore $v0
move $a0,$k1 # Restore $a0
mfc0 $k0,$14 # Coprocessor 0 register $14 has address of trapping instruction
addi $k0,$k0,4 # Add 4 to point to next instruction
mtc0 $k0,$14 # Store new address back into $14
eret # Error return; set PC to value in $14
nop
In my MIPS SoftCore I haven't implemented ERET, but rather reserved R30 as Exception Return, and R31 as Function Return. So I return from exception with a simple mtc0 R14 R30; J 30; nop
Could anyone suggest an FPGA board (the cheaper the better) that would suffice for experimenting with a RISC-V core? Or two? Or one that could access both external slow DRAM and a much smaller amount of static/fast SRAM (for storing the register file when not actively used by a core)? I'm pretty clearly pushing an idea I should try myself first, before suggesting others try anything like it, but am limited by the hardware I can afford. Would the Olimex ICE40-HX8K (https://www.olimex.com/Products/FPGA/iCE40/iCE40HX8K-EVB/open-source-hardware) with an iCE40HX8K-CT256 suffice?
How do you really compare different implementations? My first instinct is to get the core to execute real-life code, and measure the actual latencies; the good ol' benchmark-a-bunch-of-real-world-workloads. But that seems awfully crude. Are there metrics or resources for high-level comparisons for the implementations? Cycle-count statistics?
I mean, implementing something that is already existing can be a fun exercise, but I'd rather discover the benefits and downsides of my own implementation.
That's also why I'm simply not writing my own OS on xCORE, or trying to develop my own programming language; I'm much more of a low-level guy, trying to find ways to do real-world tasks better, to some odd definition of better focusing on efficiency and security. The reasoning behind my more complex ideas tend to be an order of magnitude harder to convey to others than their actual implementation... I've already wasted enough of my life pushing those. I'd rather be seen as helpful and useful than an obscure visionary.
For xCORE devices those might be "highly parallel fast MCU with FPGA-like i/o"
The big problem here is the internal documentation.
---
Even with MIPS-IV there is the same problem, and for instance I know there is a special instruction improperly named "AERET", which I can only suppose stands for "Automatic Exception RETurn", because it's not mentioned anywhere except in a document that is still not officially publicly available.
I am a bit worried about their lawyers, but I think you can "copy" the idea for your own hobby softcore as long as you declare it "not MIPS, but MIPS-like" (here you can understand why I am not respecting the MIPS ABI), but I don't think you can use it without some legal permission in anything done as a commercial product. Not sure, anyway; it's simply an enhanced version of the common MIPS "ERET" that directly copies EPC into PC without the need for any special COP0 instruction (MTC*) and without the need for any temporary register.
Cool, ain't it? I haven't implemented it because it's more complex than my current solution, but I like the idea.
---
With XMOS I think it's about the same: details about the xCORE internals are missing, and what is publicly available is rather hazy. Yes, it's a product, yes, you can buy it, yes, it's cheap, yes, it finally has a decent tool-chain, but there is still insufficient documentation.
A shame, but that's it, in my opinion :D
An interesting thread. An interrupt is used to service an asynchronous event when resources are limited. Someone mentioned that an old CDC processor had no interrupts because it had coprocessors to handle asynchronous events: a case of throwing enough resources at the problem that interrupts were unnecessary. Maybe the XMOS architecture is similar.
But if processor resources are limited, and there are asynchronous events that must be handled by the processor, the processor has to have the facility to break off one task and service another. An interrupt.
But none of this offers an alternative to interrupts. I don't think there is one. Either there is no interrupt in which case everything can be done by polling, or there are interrupts.
And thank goodness for that.
Well, when I had used it I found all the information I wanted and needed
I'm not sure why you are thanking goodness for interrupts. They are merely a hack to get around inadequate processors or lack of application-dependent hardware :) (FPGAs address the latter!)
That's a point of view rather than a logical argument. I'll give you $2 to purchase your "adequate processor" sans interrupts with whatever hardware you like to add. I'll stick to an off-the-shelf general-purpose processor with interrupts (which costs less than $2). I'll get to market before you, and you may not be able to lower your unit cost to be competitive regardless of the number of units sold.
Well, when I had used it I found all the information I wanted and needed
Sure, but as a "programmer", not as an "architecture designer": for instance, are you able to describe how the hardware scheduler works with enough detail to implement it in an FPGA?
Personally I am not able to; well, perhaps I can try to infer things, but I find it very, very difficult.
SiliconWizard, you haven't explicitly stated your design goals (or even roughly outlined them, really); do you already have an idea you can use as a metric? Features you'd like to see? I guess you already mentioned small latencies.
Of course, alternative ways of doing things can be interesting in themselves, but they are more interesting if they can achieve "better" according to some metric. Here of course, we can mention latency/overhead and hardware cost, for instance.
<snipped many sensible points>
The XMOS architecture is obviously interesting there, although it's not quite general-purpose.
In hard realtime systems, jitter can be important.
If you are considering processors that run traditional processes loaded at runtime, with an MMU etc., then there is a lot to be gained from the insight that there is very little difference between the various reasons protection boundaries are crossed. That specifically includes interrupts, program signals in the Unix sense, and programs calling the kernel. Make one efficient (or otherwise) and that happens to the others.
In hard realtime systems, jitter can be important.
Jitter is a critical factor in most hard real time systems. So important you don't leave it to the processor. You register the inputs and outputs, and let hardware control their timing.
As to XMOS, there was a project (xCORE-XA) combining xCORE tiles with an ARM Cortex core on the same device, as a way to make XMOS tech usable in a larger range of applications, but apparently this is now dead. Do you happen to know more about this, and why it didn't succeed?
I was vaguely aware of that. After a first glance I didn't understand
- how the interesting and unique xCORE/xC/tool processor features could be duplicated in an ARM
- what the unique benefit of an ARM would be
From what I've understood, if I'm not mistaken, this was just some xCORE tiles and an ARM core side by side - either on the same die or as a SoC, I don't know which. But it was basically like an XMOS chip plus an ARM MCU, only in a single chip. I don't know how they were connected, probably via one of the internal busses, so probably more efficiently than if they were separate chips connected externally.
The benefits would be like any other hybrid processor. You can think of this as a SoC like the Zynq, I guess, only the FPGA fabric here would be replaced with xCORE tiles.
And regarding windowed register files, I'd still like to know the rationale for avoiding them. According to the Wikipedia article on those, they mention criticism based on studies showing that better use of registers was overall more effective. I haven't read the related papers (if there are any), but I'm not sure I completely agree. That may be true if you consider the same overall number of registers (say using a file of 32 registers instead of 8 registers in 4 windows), but if you consider actually ADDING register sets, how can it ever be less effective? So IMO this rationale would all be related to not increasing the overall number of registers (for area reasons for instance), but if you can afford it, I can't really see what would be against that.
there is the question of whether such a large register file is the best way to spend those transistors, especially in the early days when the number of transistors you could fit on a chip was quite limited.
The 29k family might not have been the basis of a whole new direction for AMD, as they might have hoped, but it wasn't a terrible failure. For example, it was the dominant player in high performance printers.
- AMD answered "I don't know what we are doing, but let's try something weird", and tried to improve the Berkeley idea by adding 192 registers, so they made the AMD 29000, which was such a bad commercial failure that they had to sell the core to Honeywell
I seem to recall a lot of PostScript printers were running on the i860.
I think you mean the i960, which was the nearest competitor to the Am29k. Both had some success. Neither turned out to be the long-term path for high performance embedded solutions the makers hoped for.
I did some contracts writing telephone exchange add-on software on Stratus i860 machines with hot-swappable CPU modules (and later on PA-RISC, and then Sun).
I never understood why Stratus put serious effort into the i860. Intel were lukewarm about it, even at its launch. It had some interesting qualities, but without the wholehearted support of upper management a product like that was always going to crash and burn.
It is funny in the sense that all a high performance printer really needs is a processor that can run PostScript (which is a completely stack-based programming language), a rasterizer (nontrivial for color printers), and lots of RAM (for the raster buffer). I wonder if the emergence of "WinPrinters" (minimal processing, just a small pre-rasterized, pre-separated buffer; all processing done in host drivers) was related?
They put money into the research and development of superscalar 29k processors, like the 29050, but it turned out to be a very bad business affair (and technically a nightmare for devs like me), so they sold the 29050 to Honeywell and re-targeted their business to making embedded processors for high performance printers.
Still technically a nightmare for devs, but good business because a lot of people need high performance printers, and they were happy to sell their processors for making them :D
there is the question of whether such a large register file is the best way to spend those transistors, especially in the early days when the number of transistors you could fit on a chip was quite limited.
To this question there were several different answers.
- ...
- Acorn answered no! and made the first ARM RISC
- ...
The i860's compiler problem was mostly about getting good general-purpose performance from the i860. I didn't do very much with it, but it seemed to be able to do heavy computation very well with the compilers available, much like the Itanic. That was enough to give it a workable initial market. However, most of the people I knew who attended the initial launch events came away thinking this was an Intel tinkering-with-new-ideas project, and was not to be trusted. That was the real killer.
IIRC a major problem for the i860 was that creating compilers that could make use of the features was, for a while, a popular academic research topic. A topic that was never solved well enough.
Ditto the Itanic. That also had the major problem that performance was increased by computing results that weren't needed. That wasted power just as power consumption was becoming the limiting factor.
Register windows, or switching entire banks at function calls, can be examined/analysed as if they were an ABI, with the available register file split into "callee won't touch these" and "caller won't touch these".
Oh, I only meant it as an analogy, to help see why register windows won't help with general-purpose, non-privilege-boundary-crossing function calls, because the effect on code generation is the same (as a silly-restrictive ABI has). As an analogy, it helps to see the real-world effects (on code generated by compilers), and thus hopefully helps see why register windows aren't useful in the general function call case. And when crossing a privilege boundary, you need to basically swap register banks anyhow (or your kernel/privileged code must be really simple and restricted in its register use). If the core has local fast RAM to swap the register file to/from, that works. My suggestion expands that to unprivileged code, cases where the context is similarly switched (userspace task switch, coroutines), but does not necessarily involve a privilege boundary crossing. (These would be allocated by request from the kernel, much like memory.)
Personally, in this case, I would divide the code into "critical" and "not critical", and only for the "critical" parts (function calls under interrupt? some critical function calls in kernel space?) would I use fast static RAM, or BRAM as a simple "RAM", for storing stuff.
The reason I'd love to see return address stack separated from variable/data "stack" is of course buffer overrun bugs potentially allowing privilege escalation attacks. Heck, it would be interesting to see if making the stack immutable except through call/return (optionally with a "pop" that discards the parent return address) would be viable.
Umm, nowadays CPUs are starting to have fast RAM implemented inside the chip (Apple M1, Fujitsu A64FX), FPGAs are getting a lot of BRAM, and some companies are already selling fast synchronous static RAM, so I think in the near future we will not need to think about register banks in the way we have thought about them until now, and it will also probably remove the need for other devilry like caches, which will remove a lot of the troubles related to RISC design.
I for one am interested in trying dedicated return stacks - I'll have to think of a way of implementing them on a RISC-V core, which I'm not sure how to do. Especially for the toolchain support... GCC uses the same stack as data for saving the "ra" register, not sure how to instruct it to do otherwise. That probably requires modifying the back-end, and I'm not too keen on that.
That's definitely an incompatible change -- you wouldn't be able to run standard compiled code any more.
But RISC-V is a great base for doing that experimentation. You don't have to implement *everything* from scratch.
Yes, you'll definitely have to modify the back end but as compiler modifications go it's a fairly simple one. Just remove ra from the stack layout calculation and in the prolog/epilog generation (which is very localized code) substitute some custom "PUSHRA" and "POPRA" instructions instead of the store and load of RA.
For a first cut you don't even have to change the stack layout -- just leave that slot unused.
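As a software analogy of what that modified prologue/epilogue would be doing, here's a hand-instrumented "shadow return stack" in plain C (using the GCC/Clang __builtin_return_address builtin). The PUSHRA/POPRA instructions mentioned above are hypothetical; a compiler doing this for real would emit the equivalent of these macros automatically in every prologue and epilogue:

/* Return addresses live in a separate array instead of next to local data.
   Hand-instrumented here just to show the shape; with hypothetical
   PUSHRA/POPRA instructions each macro would be a single instruction. */
#include <assert.h>
#include <stdio.h>

#define SHADOW_DEPTH 64
static void *shadow_stack[SHADOW_DEPTH];
static int   shadow_top;

#define PROLOGUE()  (shadow_stack[shadow_top++] = __builtin_return_address(0))
#define EPILOGUE()  assert(shadow_stack[--shadow_top] == __builtin_return_address(0))

static int leaf(int x)
{
    PROLOGUE();
    int r = x * 2;          /* locals and buffers stay on the normal stack */
    EPILOGUE();             /* dies loudly if the saved return address was clobbered */
    return r;
}

int main(void)
{
    printf("%d\n", leaf(21));
    return 0;
}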
By the way, protection against ROP etc is more likely to come from address tagging proposals which are already under discussion, which in turn depend on the pointer masking proposal which is getting to a fairly advanced state now and I think might be ratified at the same time as V and B.
https://github.com/riscv/riscv-j-extension/blob/master/pointer-masking-proposal.adoc
Oh, and gcc won't put anything in upstream that isn't ratified, so all that stuff is in custom branches in other places e.g. the RISC-V github.
LLVM is happy to take in-progress stuff, so there is more support in upstream there.
The M1 and A64FX both have the RAM on different chips that just happen to be placed inside the same plastic package. CPU and DRAM processes are very very different.
I don't know MIPS in detail, but it looks pretty similar to RISC-V in this regard.
Now also as discussed above, the ability to use a separate stack to make things more secure is a start, but it's not enough. If said stack is not protected from being written except strictly for storing return addresses, then there still is a security hazard. In this regard, the totally dedicated return stack in earlier PIC MCUs was nice. (And what wasn't that nice was the fact it was pretty depth-limited.)
This kind of protection in this context is IMO not trivial to implement. Merely using the usual memory protection mechanisms is not quite adapted to this specific use-case, as I see it. I've looked at the extension Bruce mentioned, and I'm not sure at this point how we could implement sensible protection for return stacks with it, but I'll have to dig a little deeper.
I don't frankly see any hazard with the trick, and I don't see a reason to over-complicate things.
I find this remark highly amusing. As if everything else that has proven to have security flaws did so only because people could frankly see a hazard with the tricks they used. :)
the ability to use a separate stack to make things more secure is a start, but it's not enough.
To me it looks enough. How could it be practically exploited if the RA is stored in a separate stack? The initial problem arose due to a "stack overflow", which *is* a problem *only* if the RA is stored on the same stack where data are stored, because if the data stack overflows then the RA also gets overwritten with something wrong and potentially dangerous.
I don't frankly see any hazard with the trick, and I don't see a reason to over-complicate things.
As long as the stack you store return addresses in is accessible from any other part of the code, it can be overwritten just as easily as in the case of mixed data and return addresses stacks.
How? The only chance I see is by some "malicious code" that deliberately attempts to write into the dedicated RA-stack area. But if you have malicious code running ... Umm.
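For what it's worth, the concern raised a couple of posts up doesn't require deliberately malicious code: an ordinary, unchecked out-of-bounds write is enough to reach any writable memory, including a return-address stack kept in normal RAM. A small hypothetical sketch (names and layout invented for illustration):

/* No injected code here -- just a run-of-the-mill bug: an attacker-controlled
   index that is never bounds-checked.  If the return-address stack is ordinary
   writable memory somewhere in the address space, a write like this could reach
   it just as easily as it can reach the data stack. */
#include <stdint.h>
#include <stdio.h>

static uintptr_t return_stack[64];     /* "separate" RA stack, but still plain RAM */
static uint32_t  samples[16];

void store_sample(int idx, uint32_t value)
{
    /* BUG: idx comes from outside and is not validated; a negative or large
       idx turns this into a write to some other nearby object -- possibly
       right into return_stack[]. */
    samples[idx] = value;
}

int main(void)
{
    printf("return_stack at %p, samples at %p\n",
           (void *)return_stack, (void *)samples);
    store_sample(3, 42);                 /* fine */
    /* store_sample(-8, 0xDEADBEEF);        out-of-bounds write: could corrupt return_stack */
    return 0;
}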
Yes; I just found out the Motorola 6809 had a system stack (16-bit pointer register S) used for subroutine return addresses, but also directly accessible (push and pop, and addressing modes using S as a base address), and a separate user stack (16-bit pointer register U, also with its own push, pop, and addressing modes). Both stacks are in normal memory.
The reason I'd love to see return address stack separated from variable/data "stack" is of course buffer overrun bugs potentially allowing privilege escalation attacks. Heck, it would be interesting to see if making the stack immutable except through call/return (optionally with a "pop" that discards the parent return address) would be viable.
This has been done on a number of processors, including the Microchip 8-bit PIC architecture. At least I remember that on the PIC16F, maybe the 18F as well. Possibly others.
My subconscious is trying to bubble up something about a segmented memory model, where the segment identifier is part of a base address, and all memory accesses are relative to some base address, but it isn't clear yet. It has to do with how most memory accesses are relative to the beginning of some object, and how segmenting (or having multiple separate address spaces) avoids memory fragmentation issues that are so painful on flat memory models. It is not the 16:16 or 16:32 segment-descriptor:offset model used in 80286 and 80[345]86 architectures, however; the base address itself is a valid pointer also, just restricted to addresses aligned at 64/128 bytes. Something about using the low 3 bits to identify the segment in some addressing modes or instructions, blah blah.
My subconscious is trying to bubble up something about a segmented memory model
I'm not completely sure I understand what you have in mind.
Me neither. I'm not being coy when I say it's trying to bubble it up; it's very much like a persistent mental itch.
IMO a conventional memory protection system would work, as long as specific instructions can have certain privileges that can be enforced by memory protection. AFAIK, most memory protection systems do not have this feature - they usually implement access rights such as read, write, execute, and privilege levels (but not on a per-instruction basis AFAIK)...
Yes, I do agree on that.
As to segmented memory, once you have some kind of address translator (which you do on any system with a reasonable MMU), I can't really see the benefits.
Address space fragmentation is a real issue.
I was very interested in this in the early 80386 era, since it had four segment registers (CS for code, SS for stack, and DS and ES for data).
Unfortunately, C did not and still does not support named address spaces, so on '386, basically all code used a flat memory model.
It is notable that REP/REPZ/REPNZ MOVSB/MOVSW/MOVSD string instructions explicitly move from DS:[E]SI to ES:[E]DI, so definitely the '386 machine code instruction set was well suited for this.
There is also, for C, the issue that multiple segment:offset combinations refer to the same memory location, which can make pointer comparison unpleasant.
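In real mode, at least, the aliasing is easy to show: the linear address is simply segment * 16 + offset, so many segment:offset pairs name the same byte, and far pointers had to be normalized before they could be compared (part of why the 16-bit compilers' "huge" pointer model was so slow). A quick sketch:

/* Real-mode 8086 addressing: linear = segment * 16 + offset, so many
   segment:offset pairs alias the same byte, and raw comparison of the
   pairs is meaningless without normalization first. */
#include <stdint.h>
#include <stdio.h>

static uint32_t linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;   /* segment * 16 + offset */
}

int main(void)
{
    /* Two different representations of the same location: */
    printf("0x1234:0x0010 -> linear 0x%05lX\n", (unsigned long)linear(0x1234, 0x0010)); /* 0x12350 */
    printf("0x1235:0x0000 -> linear 0x%05lX\n", (unsigned long)linear(0x1235, 0x0000)); /* 0x12350 */
    return 0;
}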
There were three big problems with x86 segments:
- a segment could only be 64k in size, which severely limits its usefulness.
------
|    |
|32KB| page0
|    |
------          page latch
|    |           |   |
|32KB| page1    A17 A16 A15 ... A00   RAM ADDR
|    |                  A15 ... A00   CPU ADDR
------
|    |
|32KB| page2
|    |
------
|    |
|32KB| page3
|    |
------

      ------
 0k   |    |
      |32KB| page0..3
32k   |    |
      ------
      |16KB| not-paged slice
      ------
64K   |    |
      ------
These were the STRENGTHS of the 8086 segmentation scheme. The brief for the 8086 was "Build us a stopgap until the i432 works. Try stretching the 8080, offer some measure of code compatibility, and keep the transistor count under 30k". In 1980 few programs had a problem working with the segment size or quantity.
There were three big problems with x86 segments:
- a segment could only be 64k in size, which severely limits its usefulness.
- there are not enough segments available at the same time (CS, SS, DS, ES back then) and only one useful for objects on the heap. It would have been much much more useful to have, say, eight segment registers with five available for heap objects. Or 16 segment registers.
- it's slow to load a new value into a segment register and use it. I don't remember the details, but I think there must be a pipeline flush. That's fine if you only load the segment registers once at program startup and so get 64 KB for each of code, stack, globals, and heap (256 KB ought to be enough for anyone, and is certainly better than having 64 KB total like 8 bit micros). But it used to be very slow to use large heaps if you loaded ES every time you used a pointer.
No, the segment size was only limited to 64k in VM86 mode; in full 32-bit mode, the size can be anything between 0 and 1048576 bytes (one-byte granularity), or any multiple of 4096 between 0 and 2^32.
The problem with segment registers is that functions accessing objects in different segments need to be written separately for each possible segment register combination. Supplying the segment registers along with each pointer is not viable, because loading a new value took at least a dozen cycles, not counting cache effects.
I'm not sure what would be the benefit of segmented memory compared to address translation
Extra bits of virtual address space.
Let's talk about the "ugly hack"
I was happy to apply it as a hack on MPUs with only 16-bit addresses, because it allows them to go beyond the 64 Kbyte limit, and this allows you to have some tasks, in a way.
Stacks need to grow upwards, though, and therefore you really need addressing modes where an offset is subtracted from the stack or stack frame pointer; especially with a small literal constant (most common use, referring to a local variable).
Having a hardware comparison instruction that checks if 0 ≤ offset < limit, with offset a register operand, and limit either a register or a memory operand, would help a lot because then the bounds check would be a single instruction.
Local variables on the stack is very old-fashioned.
Needed only when you have more local variables than you can fit in registers, sure. Caller- or callee-saving is obviously part of an ABI, and I agree, having any caller-saved registers is old-fashioned; callee-saving is more efficient.
But I don't understand why stacks need to grow upwards?
That is due to the "segmented" paged memory model. When the 2k possible segments all grow upwards, it suffices to have page descriptors from zero offset up to the limit. The entire address range dedicated for the stack does not need to be fully populated by page descriptors. If the stack is only used for return addresses (or in general, grows at most by a page at a time), then a single-page initial stack suffices, and can be dynamically grown when needed (both at the virtual address level and the page table entry level) by the OS kernel without involving the userspace at all.
Such as "set-less-than-unsigned" or "branch-less-than-unsigned"?
Exactly. I'm an idiot, you see. :palm:
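In other words, the whole 0 ≤ offset < limit check collapses into one unsigned comparison, because a negative offset reinterpreted as unsigned becomes a huge value and fails the "< limit" test anyway. A small sketch (assuming the limit fits comfortably in the positive range):

/* The two-sided bounds check 0 <= offset < limit done with a single unsigned
   compare -- exactly what one "set less than unsigned" / "branch less than
   unsigned" instruction provides.  For limits below 2^63 this matches the
   signed check exactly. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool in_bounds(int64_t offset, uint64_t limit)
{
    return (uint64_t)offset < limit;     /* one compare covers both bounds */
}

int main(void)
{
    printf("%d\n", in_bounds(5,   16));  /* 1: 0 <= 5 < 16         */
    printf("%d\n", in_bounds(16,  16));  /* 0: not below the limit */
    printf("%d\n", in_bounds(-1,  16));  /* 0: wraps to 2^64 - 1   */
    return 0;
}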
TLB-Page-Table (https://en.wikipedia.org/wiki/Page_table) :o ?
Yes. I'm not sure how long the instruction pipeline would have to be to do it without any caching, since each memory access involves that extra page descriptor access.