User-space shadow stacks (maybe) for 6.4
Shadow stacks are a defense against return-oriented programming (ROP) attacks, as well as others that target a process's call stack. The shadow stack itself is a hardware-maintained copy of the return addresses pushed onto the call stack with each function call. Any attack that corrupts the call stack will be unable to change the shadow stack to match; as a result, the corruption will be detected at function-return time and the process terminated before the attacker can take control. The above-linked 2022 article has more details on how x86 shadow stacks, in particular, work.
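As a rough illustration of the idea, the following C sketch models the two stacks in software. It is a toy, not a description of the hardware: the names toy_stacks, toy_call(), and toy_return() are invented for this example, and a real shadow stack is protected by the processor rather than sitting in an ordinary, writable array.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_DEPTH 16

/* Toy model: two parallel stacks of return addresses. */
struct toy_stacks {
    uintptr_t call_stack[TOY_DEPTH];   /* writable by the (buggy) program */
    uintptr_t shadow_stack[TOY_DEPTH]; /* stands in for the protected copy */
    size_t top;
};

/* Model a function call: the return address goes onto both stacks.
 * Returns 0 on success, -1 if the toy stacks are full. */
int toy_call(struct toy_stacks *s, uintptr_t return_address)
{
    if (s->top >= TOY_DEPTH)
        return -1;
    s->call_stack[s->top] = return_address;
    s->shadow_stack[s->top] = return_address;
    s->top++;
    return 0;
}

/* Model a function return: pop both stacks and compare the entries.
 * Returns 0 and stores the address if the copies agree; returns -1 on a
 * mismatch (where real hardware would raise a fault) or on an empty stack. */
int toy_return(struct toy_stacks *s, uintptr_t *return_address)
{
    if (s->top == 0)
        return -1;
    s->top--;
    if (s->call_stack[s->top] != s->shadow_stack[s->top])
        return -1;
    *return_address = s->call_stack[s->top];
    return 0;
}
```

Overwriting an entry in call_stack and then calling toy_return() yields -1, which stands in for the fault that is raised when the two copies disagree.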
The current version of the patch set is the eighth revision posted by Rick Edgecombe (who took it over after some 30 revisions posted by Yu-cheng Yu).
API changes
The user-space API for working with shadow stacks has not changed much in the last year. Most operations are done with arch_prctl() calls, specifically:
- ARCH_SHSTK_ENABLE turns on the shadow stack for the current thread; shadow stacks are not enabled by the kernel when a process starts.
- ARCH_SHSTK_DISABLE disables the use of the shadow stack for the current thread.
- ARCH_SHSTK_LOCK prevents any further changes to a thread's shadow-stack status. Among other things, this operation can keep an attacker from somehow disabling the shadow stack before corrupting the call stack.
- ARCH_SHSTK_UNLOCK undoes the effect of ARCH_SHSTK_LOCK. This option was added to version 4 of the patch set in December; it exists to support functionality like Checkpoint/Restore in User Space that needs to be able to change the shadow-stack status after a process has launched. This option is only available when invoked via ptrace(); a process cannot use it on itself directly.
- ARCH_SHSTK_STATUS returns the current shadow-stack status.
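A minimal sketch of querying the status from C might look like the following. The numeric values for ARCH_SHSTK_STATUS and the ARCH_SHSTK_SHSTK feature bit are assumptions taken from the x86 uapi header as it was eventually merged; they are not given in this article and could differ in a given revision of the patch set. The helper names shstk_status() and shstk_is_enabled() are made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, copied from the x86 uapi header as eventually merged;
 * the article itself does not list them. */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#endif
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#endif

/* Ask the kernel which shadow-stack features are active for this thread.
 * Returns 0 and fills *features on success; returns -1 with errno set on
 * kernels (or architectures) that lack the interface. */
int shstk_status(uint64_t *features)
{
#if defined(__x86_64__) && defined(SYS_arch_prctl)
    unsigned long value = 0;

    if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &value) != 0)
        return -1;
    *features = value;
    return 0;
#else
    (void)features;
    errno = ENOSYS;
    return -1;
#endif
}

/* True if the shadow-stack bit is set in a features word. */
int shstk_is_enabled(uint64_t features)
{
    return (features & ARCH_SHSTK_SHSTK) != 0;
}
```

The sketch sticks to the read-only status query on purpose. Enabling the shadow stack from an ordinary C function is more delicate, since the function doing the enabling would have no return address on the freshly created shadow stack.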
Normally, the kernel handles the allocation and placement of shadow stacks, but there are occasions where an application will need to manage its shadow stacks directly. The map_shadow_stack() system call exists for this purpose; its prototype has changed a bit over the course of the last year:
void *map_shadow_stack(unsigned long address, unsigned long size, unsigned int flags);
At one point, Andrew Morton complained about the "shstk" abbreviation, saying that it "sounds like me trying to swear in Russian while drunk". As a result, that term was pulled out of much of the generic code, but remains in the x86 portion.
There is one other subtle change to map_shadow_stack() that affects how shadow stacks are handled in general. The shadow-stack feature has incompatibilities with 32‑bit code, especially when signals are involved. The kernel will refuse to enable a shadow stack for a thread that is running in the 32-bit mode and, in version 4 of the patch set, code was added to simply disable any signal handlers if a process switched to 32-bit mode after the shadow stack was enabled.
Beyond seeming like a bit of a hack, this approach did not fully solve the problem. As it turns out, a 64-bit thread can switch to the 32-bit mode without the kernel's knowledge or permission — meaning that the disabling of signal handlers can be circumvented. After some deliberation on how to avoid subtle problems when this happens, the decision was made (for version 5) to just always map the shadow stack at a virtual address above 4GB, making it inaccessible to 32-bit code. As a result, any attempt to switch to the 32-bit mode when a shadow stack is enabled will cause an immediate crash.
This change resulted in a new mmap() flag, MAP_ABOVE4G, which forces the mapping to be created above the 4GB virtual-address boundary. The address passed to map_shadow_stack() (if not zero, indicating no preference) must also be above 4GB or the call will fail. Someday, somebody with sufficient motivation could perhaps find a way to make 32-bit code work with shadow stacks, but given how little interest there is in 32-bit code in general, that seems unlikely to happen.
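The address rule can be written down as a one-line predicate. The sketch below is only a model of the check described here, not the kernel's code; shstk_address_ok() is a hypothetical name, and treating an address exactly at the 4GB boundary as acceptable is an assumption.

```c
#include <stdint.h>

#define FOUR_GB (1ULL << 32)

/* Returns 1 if `address` would pass the placement rule described above:
 * zero means "no preference"; anything else must not lie below 4GB. */
int shstk_address_ok(uint64_t address)
{
    return address == 0 || address >= FOUR_GB;
}
```

A caller that wants a specific placement could run a check like this before invoking map_shadow_stack(), rather than waiting for the call to fail.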
The glibc problem
While it might be nice to run all programs with shadow stacks enabled, there are applications that would break in that environment. Anything that manipulates its own call stack — just-in-time compilers, for example — will find itself out of sync with the shadow stack and brought to an untimely end. So the enabling of the shadow stack must be limited to code that can handle it.
The scheme that was developed, some time ago, was to place a special note in the .note.gnu.property ELF section of the program's executable image. If that note exists (as the result of compiler options provided when the program was built), that indicates that it is safe to run the program with the shadow stack enabled. That note is not sufficient for the kernel to make the decision, though, so the enabling of the shadow stack is left to user space, and to the C library's program loader in particular.
Enthusiastic developers in the GNU C Library (glibc) community quickly wired up support for turning on the shadow stack when it seemed appropriate; current versions of glibc are poised to turn on the shadow stack as soon as the kernel supports the feature. There is only one little problem: the glibc support was written with an early version of the user-space API in mind. That API no longer exists; trying to use it would result in crashing programs and a failure to boot. That will indeed secure the system against ROP attacks, but users can be picky about just how that kind of security is achieved and may complain.
That problem was resolved early on by changing the API enough that glibc simply doesn't find it anymore and thinks that the shadow-stack functionality is not present. The glibc developers have said, though, that they intend to implement the new shadow-stack API once it is merged; thereafter, when an updated glibc shows up on a system, any program that indicates a readiness for a shadow stack will get one.
That leads to a new problem, as noted in the version-3 cover letter: not all applications that are marked as being ready really are.
But many application binaries with the bit marked exist today, and critically, it was applied widely and automatically by some popular distro builds without verification that the packages actually support shadow stack. So when glibc is updated, shadow stack will suddenly turn on very widely with some missing verification.
Applications that will break in this environment evidently include node.js and PyPy, so this seems like a real problem. A quick check on a Fedora 37 system shows that PyPy is indeed built with the shadow stack enabled:
    $ readelf -n /usr/bin/pypy

    Displaying notes found in: .note.gnu.property
      Owner                Data size        Description
      GNU                  0x00000040       NT_GNU_PROPERTY_TYPE_0
          Properties: x86 feature: IBT, SHSTK
    [...]
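The IBT and SHSTK names printed by readelf correspond to bits in a single 32-bit property word. As a sketch of how a tool might find that word, the following C code walks the descriptor of an NT_GNU_PROPERTY_TYPE_0 note. The constant values are the ones given by the x86-64 psABI, defined here under local names; the function names are invented for this example, and the code assumes a 64-bit little-endian object being read on a little-endian host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from the x86-64 psABI (also found in glibc's <elf.h>); local
 * names are used here to avoid clashing with that header. */
#define PROP_X86_FEATURE_1_AND   0xc0000002u
#define PROP_X86_FEATURE_1_IBT   (1u << 0)
#define PROP_X86_FEATURE_1_SHSTK (1u << 1)

/* Walk the property array in a note descriptor and return the x86 feature
 * word, or 0 if none is present. Each property is laid out as:
 * u32 type, u32 datasz, data bytes, padding up to an 8-byte boundary. */
uint32_t x86_feature_1(const unsigned char *desc, size_t len)
{
    size_t off = 0;

    while (len - off >= 8) {
        uint32_t type, datasz;

        memcpy(&type, desc + off, 4);
        memcpy(&datasz, desc + off + 4, 4);
        off += 8;
        if (datasz > len - off)
            break;                      /* truncated property */
        if (type == PROP_X86_FEATURE_1_AND && datasz >= 4) {
            uint32_t features;

            memcpy(&features, desc + off, 4);
            return features;
        }
        /* Skip the data plus padding up to the next 8-byte boundary. */
        size_t step = ((size_t)datasz + 7) & ~(size_t)7;
        if (step > len - off)
            break;
        off += step;
    }
    return 0;
}

/* True if the feature word marks the binary as shadow-stack ready. */
int marked_shstk(uint32_t features)
{
    return (features & PROP_X86_FEATURE_1_SHSTK) != 0;
}
```

Note that a set bit only says how the binary was marked at build time; as the discussion here shows, it does not prove that the program actually works with a shadow stack.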
Even if the root cause lies in user space, it can be provoked by upgrading to a new kernel, and thus looks like a kernel regression. Kernel developers generally prefer to avoid breaking systems, even if that breakage can be said to be somebody else's fault.
The ideal solution, according to Edgecombe, would be to simply move to a new ELF bit to identify real shadow-stack readiness and have glibc use that. Distributors could then be encouraged to be more careful about marking applications as being shadow-stack ready. But, he said, "it doesn’t seem like the glibc developers are interested in working on a solution", so something else is needed. In version 3, that something else was a patch disabling the shadow-stack API when the ELF bit is detected. The idea was that distributors would eventually disable that check once they had confirmed that all of the packages they ship included correctly marked binaries.
The patch was described as "a bit dirty" and included for the sake of discussion, which indeed resulted. H.J. Lu suggested that the right approach was just to avoid upgrading glibc until the system was ready for it. Florian Weimer added that most of the incompatible code is to be found in libraries that are loaded after a process starts; the kernel test would not detect those, and it may be too late to disable the shadow stack in any case.
After a while, Edgecombe asked Linus Torvalds what he thought should be done about this problem. Torvalds answered that he did not want to preemptively disable shadow-stack support without a reason:
Once [shadow-stack functionality] is enabled in the kernel, and it turns out that people complain that it breaks existing binaries, at that point I guess it gets disabled again. Possibly at that point using something like your suggested patch. But I'm not doing it until actual problems appear, and until we actually have this code in the kernel.
The patch disabling the shadow-stack API was duly taken out of the series. Weimer described a couple of plans for ensuring that shadow stacks could be safely enabled in distributions, claiming that adopting a new ELF bit would delay that process considerably. Shadow-stack support, he said, is not much different from supporting a new system call; that, too, can break existing applications, mostly as the result of seccomp() filters that do not understand the new call.
On to 6.4
The result of the discussion is that the kernel will take no special steps to avoid breaking binaries that were incorrectly marked as being ready for shadow stacks, at least not before a problem is demonstrated. Most of the other outstanding issues appear to be resolved, to the point that Edgecombe prefixed the current version with a remark that "we have a pretty good initial shadow stack implementation here". There are a number of desired enhancements, but those might be done better, he said, after there has been some real-world use of the code that exists now.
So, after all this work, the 40 shadow-stack patches have been added to the tip tree, which feeds them into linux-next. If no show-stopping problems turn up over the course of the next month or so, user-space shadow-stack support for x86 systems will, most likely, move upstream during the 6.4 merge window. Finally, after a long development period, the shadow (stack) will truly know what evil lies in the heart of ROP attackers.
Posted Mar 24, 2023 14:57 UTC (Fri)
by syrjala (subscriber, #47399)
Posted Mar 24, 2023 15:50 UTC (Fri)
by pm215 (subscriber, #98099)
Posted Mar 24, 2023 17:08 UTC (Fri)
by atnot (subscriber, #124910)
Posted Apr 1, 2023 21:49 UTC (Sat)
by anton (subscriber, #25547)
RISC-V follows in the footsteps of the MIPS heritage (Alpha and DLX). RISC-V does support 64-bit variants, not available in the 1980s, but yes, it does not have many novelties; that's not its purpose. What advancements of RISC since the 1980s are you thinking about?
Posted Mar 24, 2023 21:18 UTC (Fri)
by ballombe (subscriber, #9523)
Posted Mar 24, 2023 22:47 UTC (Fri)
by jrtc27 (subscriber, #107748)
Posted Mar 25, 2023 13:39 UTC (Sat)
by ballombe (subscriber, #9523)
Posted Mar 26, 2023 15:40 UTC (Sun)
by farnz (subscriber, #17727)
So how is it done on SPARC and S390?
The claim is that there are only two ways to do it, neither of which brings huge advantages:
What's the third mechanism?
Posted Mar 26, 2023 21:37 UTC (Sun)
by ballombe (subscriber, #9523)
Posted Mar 27, 2023 8:46 UTC (Mon)
by farnz (subscriber, #17727)
But you just said that SPARC and S/390 don't do it that way - this is claim version 2, where external data in the MMU determines whether a physical address is interpreted as a kernel address or a user address.
Posted Mar 29, 2023 20:54 UTC (Wed)
by ballombe (subscriber, #9523)
This is not the case, the kernel can use the MMU to discriminate between user and kernel addresses.
Posted Mar 30, 2023 8:59 UTC (Thu)
by farnz (subscriber, #17727)
OK, so how exactly, using the MMUs, do I determine if 0x1ffff is a kernel or a user address on SPARC? Take as read that I have valid page table mappings for 0x10000 to 0x80000 in both user and kernel ASIDs.
Your claim continues to be that with the SPARC setup, while in kernel mode, I can determine if the address 0x1ffff is meant to be a kernel or a user address, and I just don't see how you can do that without knowing which address space I'm meant to use.
The original claim is that using in-band signalling (top bit of address) for kernel versus user is valuable, since then I can do a trivial check on all addresses coming from userspace to confirm that they are not kernel addresses. This then means that I can fail very quickly if I attempt to use a user address as a kernel address, since I'm using different ASIDs for accesses, and when I use the kernel ASID to access a user address, it'll fault.
Posted Mar 30, 2023 10:15 UTC (Thu)
by paulj (subscriber, #341)
Posted Mar 30, 2023 10:22 UTC (Thu)
by farnz (subscriber, #17727)
Yes, but the original claim that ballombe made was that with just the VA, no ASID, you can distinguish SPARC kernel and user mode addresses.
He was responding to a comment that said that there were two ways of handling the kernel user split:
The assertion made was that S/390 and SPARC don't work in either of those two ways, and that you neither use bits in the VA space to distinguish the two addresses, nor do you have a possible overlap between the two address spaces. I'm asking how SPARC and S/390 make that work.
Posted Mar 30, 2023 11:04 UTC (Thu)
by paulj (subscriber, #341)
Having just skimmed the SPARCv9 Architecture Manual to look up the ASID stuff, the ASID appears to be intrinsically required to translate addresses correctly with the right context. You can't tell from the VA, you need the ASID - that's the point. The ASID is always there as part of the translation, given implicitly or explicitly.
Posted Mar 30, 2023 11:08 UTC (Thu)
by farnz (subscriber, #17727)
Yeah, that's what I'm familiar with from SPARCv8, where the CPU automatically uses different ASIDs for instruction and data fetches crossed with user or kernel, for 4 default ASIDs, plus has an override option for data fetches to use any ASID - but I was really hoping to hear about some clever trick in later SPARC definitions that gets me the benefits of both worlds, and the cost of neither.
Otherwise, with large enough VAs, what you end up wanting is to use in-band signalling (top bit like in x86-64, for example) to indicate kernel or user address, with the separate ASIDs ensuring that if I'm fetching with a kernel-mode instruction fetch, I can't fetch from user addresses at all, nor can I fetch kernel data, while if I'm fetching with a user-mode overridden data fetch, I can't fetch from kernel addresses at all.
Posted Mar 30, 2023 11:38 UTC (Thu)
by paulj (subscriber, #341)
I like the explicit tag of the ASID in SPARC. CPU can easily apply basic checks. If you're going to reserve bits to identify address space contexts, you might as well make it explicit. SPARC VAs can use the full address space, cause the ASID tag can be set in a separate register and left implicit for a stream of instructions (IIUC).
Posted Mar 30, 2023 12:38 UTC (Thu)
by farnz (subscriber, #17727)
That sounds similar to SPARCv8 - 256 ASIDs, 4 of which are predefined and used by default for all instructions that don't override ASID. The MMU is between CPU and L1 cache, and maps ASIDs to either context IDs for paging, or another memory map if there's a predefined mapping (e.g. if you follow SPARC recommendations, some ASIDs are used for 36 bits of direct map, others are used to access MMU register space). Caches track the context ID and virtual address, so that you don't have to flush caches when you change ASID to context ID mapping.
For 32 bit systems, where VA space is at a premium, I get not reserving one bit for kernel/user distinction. But in 64 bit systems, where you have a very large VA space, I don't see that reserving one bit for kernel/user is a huge price to pay for the debuggability and security check simplification it gives you (you know up-front that any top-bit-set address is a kernel address, and top-bit-clear is a user address, even without knowing the context ID that you're going to use with that address). And you can program SPARC hardware with contexts that fault if user accesses are used for kernel addresses or vice-versa.
Posted Mar 30, 2023 15:36 UTC (Thu)
by geert (subscriber, #98403)
I assume the SPARC and s390 feature is similar, or an extension to what m68k provides: separate function codes for user and kernel (and for program and data, but that doesn't matter here). M68k also has two sets of page tables: one for the kernel (supervisor), one for userspace.
Posted Mar 30, 2023 15:41 UTC (Thu)
by farnz (subscriber, #17727)
The kernel does not deliberately mix kernel and user pointers. However, it's not hard to find past bugs where the kernel has been tricked into reading from a pointer supplied by user space; being able to validate early on that this pointer is supposed to be an __user pointer, but has the VA format of a kernel pointer (or that this pointer has the VA format of a user pointer, but is not tagged as an __user pointer) is useful for actually finding such bugs.
It's a non-issue as long as the kernel is completely free of bugs, which is the case you've described. But that's, unfortunately, not the world I live in.
Posted Mar 28, 2023 11:22 UTC (Tue)
by renox (guest, #23785)
1) RISC-V is an evolution of MIPS, so it isn't "really new".
2) RISC-V creators targeted micro-controllers at the beginning, so if you expect any security improvement in RISC-V, you're going to be quite disappointed.
Posted Mar 30, 2023 14:59 UTC (Thu)
by ejr (subscriber, #51652)
Posted Apr 5, 2023 12:00 UTC (Wed)
by andy_shev (subscriber, #75870)
Posted Mar 24, 2023 16:23 UTC (Fri)
by dezgeg (subscriber, #92243)
Posted Mar 25, 2023 20:59 UTC (Sat)
by gerdesj (subscriber, #5446)
Page background color: #ffffff
Posted Mar 29, 2023 22:45 UTC (Wed)
by corbet (editor, #1)
Posted Mar 24, 2023 16:33 UTC (Fri)
by stop50 (subscriber, #154894)
Posted Mar 25, 2023 12:55 UTC (Sat)
by pbonzini (subscriber, #60935)
Posted Mar 24, 2023 16:59 UTC (Fri)
by old (subscriber, #154324)
Great to see the ARCH_SHSTK_UNLOCK thanks to CRIU. That should be enough to ease the pain.
Posted Apr 6, 2023 8:36 UTC (Thu)
by andrey.turkin (guest, #89915)
Posted Mar 24, 2023 19:28 UTC (Fri)
by fredex (subscriber, #11727)
Posted Mar 25, 2023 9:27 UTC (Sat)
by dottedmag (subscriber, #18590)
Posted Mar 26, 2023 11:03 UTC (Sun)
by geofft (subscriber, #59789)
It looks like this is an x86_64-specific extension. From arch/x86/kernel/ptrace.c:
That is, to call arch_prctl(code, addr) on a process you're tracing, run ptrace(PTRACE_ARCH_PRCTL, pid, addr, code). For the specific operation in this patch, it would be ptrace(PTRACE_ARCH_PRCTL, pid, features, ARCH_SHSTK_UNLOCK), I think.
Posted Apr 20, 2023 13:34 UTC (Thu)
by immibis (guest, #105511)
Posted Apr 29, 2023 20:36 UTC (Sat)
by foom (subscriber, #14868)
On Linux, code segments with both attributes are available for all processes, so you can flip back and forth with just:
The numbers correspond to __USER_CS and __USER32_CS from https://github.com/torvalds/linux/blob/master/arch/x86/in...
I don't know if there's any real non-exploit code which actually does this, though...
Posted Apr 29, 2023 21:49 UTC (Sat)
by dtlin (subscriber, #36537)
JALR is a venerable MIPS instruction (already present in the R2000 (the first MIPS CPU)), no novelty points there. And it does not do everything, it's an instruction for indirect calls (used, e.g., for calling a method in an object-oriented language).
What a new ABI needs to provide is separate kernel and user-space address spaces, as in SPARC and s390.
<https://lwn.net/Articles/742245/>.
The point is that userspace cannot create kernel pointers.
Hence userspace accesses are always translated by the user page tables.
Kernel space accesses are translated by the kernel page tables, except when using the special MOVES instruction, which will access based on a preset function code.
This mechanism supports having the full 4 GiB address space available to both kernel and user space ("4G/4G split"), without the need to split the address space in separate parts for kernel and user memory (e.g. "1G/3G" split) to let the kernel access user memory and kernel memory.
They even removed the "trap on overflow" integer arithmetic operations that the MIPS had :-(
Left column color: #ffcc99
Middle column background: #ffffff
Headline background: #ffcc99
Form/byline background: #eeeeee
Sidebar background: #ffcc99
Text color: black
Link color: DarkBlue
Visited link color: #444
Quoted text (in email) color: #990099
Old (seen) comment background color: #cccccc
Logo color: green
So the color messup was the result of a dumb typo in the definitions of those colors; I have fixed it now. You will probably have to go into the customization area and re-select the dark-mode colors to get the fix, unfortunately; apologies for that.
Dark mode colors
On the other hand, it is very handy to have a separate stack filled only with the actual call flow - it could allow for a very quick and reliable way to get a stack trace (without the stack frames but many tools don't actually need it).
[PATCH v4 38/39] x86/shstk: Add ARCH_SHSTK_UNLOCK mentions "the ptrace arch_prctl interface," which I hadn't heard of before and appears to be undocumented - the arch_prctl(2) and ptrace(2) manpages don't mention each other.
PTRACE_ARCH_PRCTL
#ifdef CONFIG_X86_64
    /* normal 64bit interface to access TLS data.
       Works just like arch_prctl, except that the arguments
       are reversed. */
    case PTRACE_ARCH_PRCTL:
        ret = do_arch_prctl_64(child, data, addr);
        break;
#endif
ljmp $0x33, label ; jump to label in 64bit mode
ljmp $0x23, label ; jump to label in 32bit mode