It is very difficult to implement the Principle of Least Privilege at the function level with linear address spaces.
We'd really need hardware specifically oriented towards nonlinear address spaces, perhaps via segmented memory where each pointer is a (segment, offset) tuple and any region/area/array is a (segment, offset, length) tuple, with each segment having its own access protections. Virtual memory paging would be implemented on a per-segment basis, with even DMA using segment identifiers. This would also help with memory fragmentation, which is an issue both with linear address spaces and with address-randomization schemes intended to hinder exploits. (The corresponding scheme here would be randomized segment identifiers, as opposed to consecutive ones. Efficient hardware segment identifier lookup is a bit of a problem, though; we do need it to be O(1) and fast. The 80386 scheme of segment identifiers being indexes into a lookup table does not really work; they're too predictable.)
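To make the tuple idea concrete, here's a rough C sketch of what such references could look like in software. All names are mine, and the flat backing buffer just stands in for the segment's physical memory; on real hardware the bounds/segment check would happen on every access and raise a fault instead of returning NULL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "far pointer": every address is a (segment, offset) tuple. */
typedef struct {
    uint32_t segment;   /* randomized segment identifier */
    uint32_t offset;    /* offset within the segment */
} seg_ptr;

/* A region/area/array adds a length, so bounds travel with the reference. */
typedef struct {
    uint32_t segment;
    uint32_t offset;
    uint32_t length;    /* in bytes */
} seg_region;

/* What the hardware would check on every access: the accessed range must
 * fall inside the region. Returns a plain pointer into a backing buffer
 * that simulates the segment's memory, or NULL where hardware would fault. */
static void *seg_access(seg_region r, uint32_t off, uint32_t size,
                        uint8_t *backing)
{
    if (off > r.length || size > r.length - off)
        return NULL;                 /* hardware would raise a fault here */
    return backing + r.offset + off;
}
```

The point is that the length is part of the reference itself, so out-of-bounds accesses are detectable at the access site, not just at segment boundaries.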
One model I've been thinking about (just as a mental exercise) is having each stack frame be a separate segment in a chain of segments. Each subroutine call machine instruction would have a separate bit indicating whether the current stack frame is locked against writes for the duration of the call.
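The lock-on-call idea can be crudely approximated today with page protection, which gives a feel for what the hardware bit would do. This POSIX sketch (names are mine; siglongjmp out of a SIGSEGV handler is a common but somewhat platform-dependent trick) write-protects a page standing in for the caller's frame for the duration of the call, so a stray write from the callee faults:

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;
static volatile sig_atomic_t faulted;

static void on_segv(int sig)
{
    (void)sig;
    faulted = 1;
    siglongjmp(fault_jmp, 1);        /* escape the faulting instruction */
}

/* The "callee": tries to scribble on the caller's locked frame. */
static void callee(char *caller_frame)
{
    if (sigsetjmp(fault_jmp, 1) == 0)
        caller_frame[0] = 'X';       /* blocked: the frame is read-only */
}

/* Simulate the proposed per-call lock bit: protect the caller's frame,
 * make the call, then unlock. Returns 1 if the write was caught and the
 * frame contents survived intact. */
static int call_with_locked_frame(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    char *frame = mmap(NULL, (size_t)pg, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (frame == MAP_FAILED)
        return -1;
    frame[0] = 'A';

    struct sigaction sa = {0};
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(frame, (size_t)pg, PROT_READ);    /* the "lock" bit */
    faulted = 0;
    callee(frame);                             /* the protected call */
    mprotect(frame, (size_t)pg, PROT_READ | PROT_WRITE);

    int ok = (faulted == 1 && frame[0] == 'A');
    munmap(frame, (size_t)pg);
    return ok;
}
```

The mprotect round trip per call is exactly the kind of overhead that makes the hardware-segment version attractive but the software version impractical.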
Because of the overheads involved in that, I don't think it is as useful as a separate stack for return addresses only, combined with programming-language constructs that allow array bounds checking at compile time. For C, making arrays first-class objects (constructable from expressions specifying the pointer and the length), and allowing the array size variable in function parameters to be declared after the variably-modified array parameter itself (i.e. (int a[n], size_t n)), would suffice, without forcing it upon all C programmers. This would be backwards-compatible at the C level, too.
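A sketch of what such a first-class array could look like, spelled out by hand in today's C as a (pointer, length) pair; the type and macro names are mine, and the proposed (int a[n], size_t n) parameter ordering is shown only in a comment since it isn't valid C today:

```c
#include <stddef.h>

/* Under the proposal this would be a built-in type; today it has to be
 * a hand-rolled (pointer, length) pair. */
typedef struct {
    int    *ptr;
    size_t  len;
} int_array;

/* The proposal would allow the parameter list to read (int a[n], size_t n),
 * with the size declared after the variably-modified parameter it sizes.
 * With a first-class array object, one parameter carries both. */
static long sum(int_array a)
{
    long total = 0;
    for (size_t i = 0; i < a.len; i++)
        total += a.ptr[i];
    return total;
}

/* "Constructable from expressions specifying the pointer and the length": */
#define MAKE_ARRAY(p, n) ((int_array){ (p), (n) })
```

Because the length travels with the pointer, a compiler (or a checked build) always has the bound available at the point of access, which is what makes compile-time and cheap run-time checking feasible.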
Current operating systems using virtual memory already essentially provide many "segments" to userspace applications, just located at separate addresses, with the rest of the address space inaccessible (yielding segmentation violation errors on attempted accesses). While I've worked on huge datasets, I think it might actually be beneficial to limit segments to 32 bits even on 64-bit architectures, because of the compactness and the savings in the code and memory references themselves. A pointer would still be 64 bits, with the segment identifier in the upper 32 bits; and Very Large maps or allocations could simply use consecutive segment identifiers. If the instruction set had separate registers for default segment identifiers for loads and stores, then a zero segment identifier could be used as a shorthand for those, allowing 32-bit pointers into 64-bit memory to be used efficiently. That would also reduce the segment-lookup caching overhead to cases where a different segment is used. In most cases, even complex program code uses very few segments concurrently, so hardware support for just a few concurrently accessible segments, say 8, should work very well.
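The pointer layout described above is easy to sketch; the helper names here are mine, and default_seg stands in for the proposed default load/store segment register:

```c
#include <stdint.h>

/* 64-bit pointer layout under the proposal: segment identifier in the
 * upper 32 bits, offset in the lower 32. */
static inline uint64_t make_ptr(uint32_t seg, uint32_t off)
{
    return ((uint64_t)seg << 32) | off;
}

static inline uint32_t ptr_segment(uint64_t p) { return (uint32_t)(p >> 32); }
static inline uint32_t ptr_offset(uint64_t p)  { return (uint32_t)p; }

/* Resolve the effective segment: a zero identifier is shorthand for the
 * default segment register, so a bare 32-bit offset (upper half zero)
 * works as a compact pointer into 64-bit memory. */
static inline uint32_t effective_segment(uint64_t p, uint32_t default_seg)
{
    uint32_t s = ptr_segment(p);
    return s ? s : default_seg;
}
```

The zero-means-default rule is what lets code that stays within one segment keep using plain 32-bit offsets, only paying for the full 64-bit form (and the segment-lookup caching) when it actually crosses segments.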
As usual, the biggest hurdle in such schemes is us humans. Zero999's assertion that anything non-backwards-compatible is impossible to sell is a good example. I do not exactly agree with it here, because both ARM and RISC-V already have various instruction set extensions and variants; but it is very true that in general, customers do expect and require backwards compatibility. Industry and toolchain vendors/projects have a LOT invested in the current architectures, and any kind of fundamental change in the approach/paradigm will be fought against, because it always incurs additional cost. Perfect is the enemy of good enough, true, but when utter shite is considered good enough for customers, we do get locked into decades of poor solutions.