Anatomy of a system call, part 1
System calls are the primary mechanism by which user-space programs interact with the Linux kernel. Given their importance, it's not surprising to discover that the kernel includes a wide variety of mechanisms to ensure that system calls can be implemented generically across architectures, and can be made available to user space in an efficient and consistent way.
I've been working on getting FreeBSD's Capsicum security framework onto Linux and, as this involves the addition of several new system calls (including the slightly unusual execveat() system call), I found myself investigating the details of their implementation. As a result, this is the first of a pair of articles that explore the details of the kernel's implementation of system calls (or syscalls). In this article we'll focus on the mainstream case: the mechanics of a normal syscall (read()), together with the machinery that allows x86_64 user programs to invoke it. The second article will move off the mainstream case to cover more unusual syscalls, and other syscall invocation mechanisms.
System calls differ from regular function calls because the code being called is in the kernel. Special instructions are needed to make the processor perform a transition to ring 0 (privileged mode). In addition, the kernel code being invoked is identified by a syscall number, rather than by a function address.
Defining a syscall with SYSCALL_DEFINEn()
The read() system call provides a good initial example to explore the kernel's syscall machinery. It's implemented in fs/read_write.c, as a short function that passes most of the work to vfs_read(). From an invocation standpoint the most interesting aspect of this code is way the function is defined using the SYSCALL_DEFINE3() macro. Indeed, from the code, it's not even immediately clear what the function is called.
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; /* ... */
These SYSCALL_DEFINEn() macros are the standard way for kernel code to define a system call, where the n suffix indicates the argument count. The definition of these macros (in include/linux/syscalls.h) gives two distinct outputs for each system call.
SYSCALL_METADATA(_read, 3, unsigned int, fd, char __user *, buf, size_t, count) __SYSCALL_DEFINEx(3, _read, unsigned int, fd, char __user *, buf, size_t, count) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; /* ... */
The first of these, SYSCALL_METADATA(), builds a collection of metadata about the system call for tracing purposes. It's only expanded when CONFIG_FTRACE_SYSCALLS is defined for the kernel build, and its expansion gives boilerplate definitions of data that describes the syscall and its parameters. (A separate page describes these definitions in more detail.)
The __SYSCALL_DEFINEx() part is more interesting, as it holds the system call implementation. Once the various layers of macros and GCC type extensions are expanded, the resulting code includes some interesting features:
asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count) __attribute__((alias(__stringify(SyS_read)))); static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count); asmlinkage long SyS_read(long int fd, long int buf, long int count); asmlinkage long SyS_read(long int fd, long int buf, long int count) { long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count); asmlinkage_protect(3, ret, fd, buf, count); return ret; } static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; /* ... */
First, we notice that the system call implementation actually has the name SYSC_read(), but is static and so is inaccessible outside this module. Instead, a wrapper function, called SyS_read() and aliased as sys_read(), is visible externally. Looking closely at those aliases, we notice a difference in their parameter types — sys_read() expects the explicitly declared types (e.g. char __user * for the second argument), whereas SyS_read() just expects a bunch of (long) integers. Digging into the history of this, it turns out that the long version ensures that 32-bit values are correctly sign-extended for some 64-bit kernel platforms, preventing a historical vulnerability.
The last things we notice with the SyS_read() wrapper are the asmlinkage directive and asmlinkage_protect() call. The Kernel Newbies FAQ helpfully explains that asmlinkage means the function should expect its arguments on the stack rather than in registers, and the generic definition of asmlinkage_protect() explains that it's used to prevent the compiler from assuming that it can safely reuse those areas of the stack.
To accompany the definition of sys_read() (the variant with accurate types), there's also a declaration in include/linux/syscalls.h, and this allows other kernel code to call into the system call implementation directly (which happens in half a dozen places). Calling system calls directly from elsewhere in the kernel is generally discouraged and is not often seen.
Syscall table entries
Hunting for callers of sys_read() also points the way toward how user space reaches this function. For "generic" architectures that don't provide an override of their own, the include/uapi/asm-generic/unistd.h file includes an entry referencing sys_read:
#define __NR_read 63 __SYSCALL(__NR_read, sys_read)
This defines the generic syscall number __NR_read (63) for read(), and uses the __SYSCALL() macro to associate that number with sys_read(), in an architecture-specific way. For example, arm64 uses the asm-generic/unistd.h header file to fill out a table that maps syscall numbers to implementation function pointers.
However, we're going to concentrate on the x86_64 architecture, which does not use this generic table. Instead, x86_64 defines its own mappings in arch/x86/syscalls/syscall_64.tbl, which has an entry for sys_read():
0 common read sys_read
This indicates that read() on x86_64 has syscall number 0 (not 63), and has a common implementation for both of the ABIs for x86_64, namely sys_read(). (The different ABIs will be discussed in the second part of this series.) The syscalltbl.sh script generates arch/x86/include/generated/asm/syscalls_64.h from the syscall_64.tbl table, specifically generating an invocation of the __SYSCALL_COMMON() macro for sys_read(). This header file is used, in turn, to populate the syscall table, sys_call_table, which is the key data structure that maps syscall numbers to sys_name() functions.
x86_64 syscall invocation
Now we will look at how user-space programs invoke the system call. This is inherently architecture-specific, so for the rest of this article we'll concentrate on the x86_64 architecture (other x86 architectures will be examined in the second article of the series). The invocation process also involves a few steps, so a clickable diagram, seen at left, may help with the navigation.
In the previous section, we discovered a table of system call function pointers; the table for x86_64 looks something like the following (using a GCC extension for array initialization that ensures any missing entries point to sys_ni_syscall()):
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { [0 ... __NR_syscall_max] = &sys_ni_syscall, [0] = sys_read, [1] = sys_write, /*... */ };
For 64-bit code, this table is accessed from arch/x86/kernel/entry_64.S, from the system_call assembly entry point; it uses the RAX register to pick the relevant entry in the array and then calls it. Earlier in the function, the SAVE_ARGS macro pushes various registers onto the stack, to match the asmlinkage directive we saw earlier.
Moving outwards, the system_call entry point is itself referenced in syscall_init(), a function that is called early in the kernel's startup sequence:
void syscall_init(void) { /* * LSTAR and STAR live in a bit strange symbiosis. * They both write to the same internal register. STAR allows to * set CS/DS but only a 32bit target. LSTAR sets the 64bit rip. */ wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32); wrmsrl(MSR_LSTAR, system_call); wrmsrl(MSR_CSTAR, ignore_sysret); /* ... */
The wrmsrl instruction writes a value to a model-specific register; in this case, the address of the general system_call syscall handling function is written to register MSR_LSTAR (0xc0000082), which is the x86_64 model-specific register for handling the SYSCALL instruction.
And this gives us all we need to join the dots from user space to the kernel code. The standard ABI for how x86_64 user programs invoke a system call is to put the system call number (0 for read) into the RAX register, and the other parameters into specific registers (RDI, RSI, RDX for the first 3 parameters), then issue the SYSCALL instruction. This instruction causes the processor to transition to ring 0 and invoke the code referenced by the MSR_LSTAR model-specific register — namely system_call. The system_call code pushes the registers onto the kernel stack, and calls the function pointer at entry RAX in the sys_call_table table — namely sys_read(), which is a thin, asmlinkage wrapper for the real implementation in SYSC_read().
Now that we've seen the standard implementation of system calls
on the most common platform, we're in a better position to
understand what's going on with other architectures, and with less-common cases. That will be the subject of the second article in
the series.
Index entries for this article | |
---|---|
Kernel | System calls |
Kernel | User-space API |
GuestArticles | Drysdale, David |
Posted Jul 10, 2014 12:56 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link]
Posted Jul 10, 2014 22:21 UTC (Thu)
by bourbaki (guest, #84259)
[Link] (4 responses)
Thanks for this article, it looks like the start of a very promising series. I have question though, that you or other kernel hackers may answer :
> Calling system calls directly from elsewhere in the kernel is generally discouraged and is not often seen.
I have searched about this a bit but couldn't find a specific reason for why it's "generally discouraged". Could someone enlighten me ?
Thanks
Posted Jul 11, 2014 11:33 UTC (Fri)
by roblucid (guest, #48964)
[Link] (3 responses)
More simply, a system call is designed for user space interaction, in general something in kernel, ought to be using an in-kernel service, rather than re-purposing a user space interface.
[ As noone answered this question yet ]
Posted Jul 14, 2014 9:33 UTC (Mon)
by drysdale (guest, #95971)
[Link] (2 responses)
Posted Jul 23, 2014 14:32 UTC (Wed)
by YogeshC (guest, #49966)
[Link] (1 responses)
Posted Aug 11, 2014 12:45 UTC (Mon)
by rwmj (subscriber, #5474)
[Link]
The main user of sys_read at the moment is the code for unpacking the initramfs. You could imagine this "should" be done by a userspace process, because it's doing a lot of userspace-type stuff, like uncompressing a cpio file and then creating a tree of directories and files from it. It's nearly the equivalent of running:
So it's a layering violation, but for understandable reasons.
Posted Jul 11, 2014 9:01 UTC (Fri)
by geuder (subscriber, #62854)
[Link] (4 responses)
A little detail, but a nuisance for humans, especially if you debug an emebedded target and compare to your desktop known to work correctly: What is the technical reason to have partially architecture specific syscall number? Passing a number in a register does not look like anything inherently architecture specific.
If you quickly need to see the syscall numbers for a given architecture I have seen the hint to look at strace source. I'm just at my phone now, so I don't compare the source now, but I vaguely remember it was indeed easier to navigate than the kernel source proper. Was it so that strace distributes ready lists while for the kernel they need to be built for each architecture?
Posted Jul 11, 2014 9:53 UTC (Fri)
by sasha (guest, #16070)
[Link] (3 responses)
For "compatibility" with other, older OSes used on the architecture. No, I do not understand what kind of "compatibility" you can get in such a way, but it is the main reason.
Posted Jul 11, 2014 16:55 UTC (Fri)
by alonz (subscriber, #815)
[Link] (2 responses)
Originally (in the days of Linux 1.x / 2.0) Linux attempted to be binary compatible to existing Unices on common hardware (the personality(2) system call is also part of this). As time passed, compatibility with other Unices became mostly a non-issue – but now we do need to maintain binary compatibility with older Linux binaries…
Posted Jul 11, 2014 22:12 UTC (Fri)
by geuder (subscriber, #62854)
[Link] (1 responses)
That sounds like a reasonable explanation. Besides that back in those days what Unices were running on x86_64 and ARM? And these 2 differ. Yeah well, maybe the dependencies are not direct, but somehow indirect over other platforms?
Posted Jul 11, 2014 22:22 UTC (Fri)
by sfeam (subscriber, #2841)
[Link]
Posted Jul 13, 2014 0:00 UTC (Sun)
by gb (subscriber, #58328)
[Link] (2 responses)
Posted Jul 14, 2014 9:45 UTC (Mon)
by drysdale (guest, #95971)
[Link] (1 responses)
Having the arguments in registers for the ring transition means that there's no need for fancy footwork to get at the userspace stack memory (compare the innards of copy_from_user()).
Storing the registers on the kernel stack allows the state of the registers to be restored on the return to userspace. But once the parameters are available on the stack, there's no need to preserve them in the registers too – the syscall can get its arguments from the stack (i.e. be asmlinkage) and can immediately use (and clobber) the registers.
Posted Jul 16, 2014 17:03 UTC (Wed)
by nix (subscriber, #2304)
[Link]
But obviously the args have to end up on the stack -- or, rather, have to end up whereever the C ABI for the platform says they should (possibly optimized by asmlinkage, but still, something the compiler supports).
Thanks for this article: I too have wasted entirely too much time tracking this down in pieces now and then: it's nice to have a reference here for next time. Looking forward to the next one.
Posted Oct 2, 2019 2:34 UTC (Wed)
by emceeMC (guest, #134734)
[Link]
Posted Oct 2, 2019 2:35 UTC (Wed)
by emceeMC (guest, #134734)
[Link]
Posted Aug 27, 2020 12:55 UTC (Thu)
by vishwajith_k (guest, #140997)
[Link] (1 responses)
Posted Aug 27, 2020 21:14 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link]
SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, which_clock, int, flags,
Bravo
This is precisely the sort of well-written intermediate material that is needed to help
(a) the n00bs like me, and
(b) the ninjas who haven't reviewed it lately.
More of this!
Cheers,
Chris
Calling syscalls from the kernel
Calling syscalls from the kernel
Calling syscalls from the kernel
Calling syscalls from the kernel
No one answered your question so I'll have a go.
Calling syscalls from the kernel
zcat initramfs | cpio -id
However because the initramfs is needed before userspace is up -- because creating the initramfs is creating the initial userspace -- they have to do this userspace-type stuff in the kernel instead.
Anatomy of a system call, part 1
Anatomy of a system call, part 1
I believe by now the reason is mainly historical.Anatomy of a system call, part 1
Anatomy of a system call, part 1
As I recall, linux binaries for alpha would run under Digital Unix on alpha.
Anatomy of a system call, part 1
Anatomy of a system call, part 1
Anatomy of a system call, part 1
Anatomy of a system call, part 1
Your link to fs/read_write.c (http://lxr.free-electrons.com/source/fs/read_write.c?v=3.14#L511) is broken :(
Broken link
Consider replacing it with this.
Broken link
Anatomy of a system call, part 1
As per this article, name used to implement must be sys_clock_nanosleep, but I found some header files when searched for this identifier (used https://elixir.bootlin.com/ site); Am using arm64 machine and syscall number must be 115.
Can you help me?
Anatomy of a system call, part 1
const struct __kernel_timespec __user *, rqtp,
struct __kernel_timespec __user *, rmtp)
{
....