Anatomy of a system call, part 1

July 9, 2014

This article was contributed by David Drysdale

System calls are the primary mechanism by which user-space programs interact with the Linux kernel. Given their importance, it's not surprising to discover that the kernel includes a wide variety of mechanisms to ensure that system calls can be implemented generically across architectures, and can be made available to user space in an efficient and consistent way.

I've been working on getting FreeBSD's Capsicum security framework onto Linux and, as this involves the addition of several new system calls (including the slightly unusual execveat() system call), I found myself investigating the details of their implementation. As a result, this is the first of a pair of articles that explore the details of the kernel's implementation of system calls (or syscalls). In this article we'll focus on the mainstream case: the mechanics of a normal syscall (read()), together with the machinery that allows x86_64 user programs to invoke it. The second article will move off the mainstream case to cover more unusual syscalls, and other syscall invocation mechanisms.

System calls differ from regular function calls because the code being called is in the kernel. Special instructions are needed to make the processor perform a transition to ring 0 (privileged mode). In addition, the kernel code being invoked is identified by a syscall number, rather than by a function address.

Defining a syscall with `SYSCALL_DEFINEn()`

The read() system call provides a good initial example to explore the kernel's syscall machinery. It's implemented in fs/read_write.c, as a short function that passes most of the work to vfs_read(). From an invocation standpoint the most interesting aspect of this code is way the function is defined using the SYSCALL_DEFINE3() macro. Indeed, from the code, it's not even immediately clear what the function is called.

    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
    	struct fd f = fdget_pos(fd);
    	ssize_t ret = -EBADF;
    	/* ... */

These SYSCALL_DEFINEn() macros are the standard way for kernel code to define a system call, where the n suffix indicates the argument count. The definition of these macros (in include/linux/syscalls.h) gives two distinct outputs for each system call.

    SYSCALL_METADATA(_read, 3, unsigned int, fd, char __user *, buf, size_t, count)
    __SYSCALL_DEFINEx(3, _read, unsigned int, fd, char __user *, buf, size_t, count)
    {
    	struct fd f = fdget_pos(fd);
    	ssize_t ret = -EBADF;
    	/* ... */

The first of these, SYSCALL_METADATA(), builds a collection of metadata about the system call for tracing purposes. It's only expanded when CONFIG_FTRACE_SYSCALLS is defined for the kernel build, and its expansion gives boilerplate definitions of data that describes the syscall and its parameters. (A separate page describes these definitions in more detail.)

The __SYSCALL_DEFINEx() part is more interesting, as it holds the system call implementation. Once the various layers of macros and GCC type extensions are expanded, the resulting code includes some interesting features:

    asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count)
    	__attribute__((alias(__stringify(SyS_read))));

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count);
    asmlinkage long SyS_read(long int fd, long int buf, long int count);

    asmlinkage long SyS_read(long int fd, long int buf, long int count)
    {
    	long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count);
    	asmlinkage_protect(3, ret, fd, buf, count);
    	return ret;
    }

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count)
    {
    	struct fd f = fdget_pos(fd);
    	ssize_t ret = -EBADF;
    	/* ... */

First, we notice that the system call implementation actually has the name SYSC_read(), but is static and so is inaccessible outside this module. Instead, a wrapper function, called SyS_read() and aliased as sys_read(), is visible externally. Looking closely at those aliases, we notice a difference in their parameter types — sys_read() expects the explicitly declared types (e.g. char __user * for the second argument), whereas SyS_read() just expects a bunch of (long) integers. Digging into the history of this, it turns out that the long version ensures that 32-bit values are correctly sign-extended for some 64-bit kernel platforms, preventing a historical vulnerability.

The last things we notice with the SyS_read() wrapper are the asmlinkage directive and asmlinkage_protect() call. The Kernel Newbies FAQ helpfully explains that asmlinkage means the function should expect its arguments on the stack rather than in registers, and the generic definition of asmlinkage_protect() explains that it's used to prevent the compiler from assuming that it can safely reuse those areas of the stack.

To accompany the definition of sys_read() (the variant with accurate types), there's also a declaration in include/linux/syscalls.h, and this allows other kernel code to call into the system call implementation directly (which happens in half a dozen places). Calling system calls directly from elsewhere in the kernel is generally discouraged and is not often seen.

Syscall table entries

Hunting for callers of sys_read() also points the way toward how user space reaches this function. For "generic" architectures that don't provide an override of their own, the include/uapi/asm-generic/unistd.h file includes an entry referencing sys_read:

    #define __NR_read 63
    __SYSCALL(__NR_read, sys_read)

This defines the generic syscall number __NR_read (63) for read(), and uses the __SYSCALL() macro to associate that number with sys_read(), in an architecture-specific way. For example, arm64 uses the asm-generic/unistd.h header file to fill out a table that maps syscall numbers to implementation function pointers.

However, we're going to concentrate on the x86_64 architecture, which does not use this generic table. Instead, x86_64 defines its own mappings in arch/x86/syscalls/syscall_64.tbl, which has an entry for sys_read():

    0	common	read			sys_read

This indicates that read() on x86_64 has syscall number 0 (not 63), and has a common implementation for both of the ABIs for x86_64, namely sys_read(). (The different ABIs will be discussed in the second part of this series.) The syscalltbl.sh script generates arch/x86/include/generated/asm/syscalls_64.h from the syscall_64.tbl table, specifically generating an invocation of the __SYSCALL_COMMON() macro for sys_read(). This header file is used, in turn, to populate the syscall table, sys_call_table, which is the key data structure that maps syscall numbers to sys_name() functions.

x86_64 syscall invocation

Now we will look at how user-space programs invoke the system call. This is inherently architecture-specific, so for the rest of this article we'll concentrate on the x86_64 architecture (other x86 architectures will be examined in the second article of the series). The invocation process also involves a few steps, so a clickable diagram, seen at left, may help with the navigation.

In the previous section, we discovered a table of system call function pointers; the table for x86_64 looks something like the following (using a GCC extension for array initialization that ensures any missing entries point to sys_ni_syscall()):

    asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    	[0 ... __NR_syscall_max] = &sys_ni_syscall,
    	[0] = sys_read,
    	[1] = sys_write,
    	/*... */
    };

For 64-bit code, this table is accessed from arch/x86/kernel/entry_64.S, from the system_call assembly entry point; it uses the RAX register to pick the relevant entry in the array and then calls it. Earlier in the function, the SAVE_ARGS macro pushes various registers onto the stack, to match the asmlinkage directive we saw earlier.

Moving outwards, the system_call entry point is itself referenced in syscall_init(), a function that is called early in the kernel's startup sequence:

    void syscall_init(void)
    {
    	/*
    	 * LSTAR and STAR live in a bit strange symbiosis.
    	 * They both write to the same internal register. STAR allows to
    	 * set CS/DS but only a 32bit target. LSTAR sets the 64bit rip.
    	 */
    	wrmsrl(MSR_STAR,  ((u64)__USER32_CS)<<48  | ((u64)__KERNEL_CS)<<32);
    	wrmsrl(MSR_LSTAR, system_call);
    	wrmsrl(MSR_CSTAR, ignore_sysret);
    	/* ... */

The wrmsrl instruction writes a value to a model-specific register; in this case, the address of the general system_call syscall handling function is written to register MSR_LSTAR (0xc0000082), which is the x86_64 model-specific register for handling the SYSCALL instruction.

And this gives us all we need to join the dots from user space to the kernel code. The standard ABI for how x86_64 user programs invoke a system call is to put the system call number (0 for read) into the RAX register, and the other parameters into specific registers (RDI, RSI, RDX for the first 3 parameters), then issue the SYSCALL instruction. This instruction causes the processor to transition to ring 0 and invoke the code referenced by the MSR_LSTAR model-specific register — namely system_call. The system_call code pushes the registers onto the kernel stack, and calls the function pointer at entry RAX in the sys_call_table table — namely sys_read(), which is a thin, asmlinkage wrapper for the real implementation in SYSC_read().

Now that we've seen the standard implementation of system calls on the most common platform, we're in a better position to understand what's going on with other architectures, and with less-common cases. That will be the subject of the second article in the series.

Index entries for this article
Kernel	System calls
Kernel	User-space API
GuestArticles	Drysdale, David

Bravo

Posted Jul 10, 2014 12:56 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]

Bravo, thank you, and may Fortune smile on your day, Drysdale.
This is precisely the sort of well-written intermediate material that is needed to help
(a) the n00bs like me, and
(b) the ninjas who haven't reviewed it lately.
More of this!
Cheers,
Chris

Calling syscalls from the kernel

Posted Jul 10, 2014 22:21 UTC (Thu) by bourbaki (guest, #84259) [Link] (4 responses)

Hello David,

Thanks for this article, it looks like the start of a very promising series. I have question though, that you or other kernel hackers may answer :

> Calling system calls directly from elsewhere in the kernel is generally discouraged and is not often seen.

I have searched about this a bit but couldn't find a specific reason for why it's "generally discouraged". Could someone enlighten me ?

Thanks

Calling syscalls from the kernel

Posted Jul 11, 2014 11:33 UTC (Fri) by roblucid (guest, #48964) [Link] (3 responses)

One general principal reason would be, that lower layer software, breaking out and calling higher layer software, turns a "calling tree" into a "network graph", much harder to analyse and reason about it's performance.

More simply, a system call is designed for user space interaction, in general something in kernel, ought to be using an in-kernel service, rather than re-purposing a user space interface.

[ As noone answered this question yet ]

Calling syscalls from the kernel

Posted Jul 14, 2014 9:33 UTC (Mon) by drysdale (guest, #95971) [Link] (2 responses)

That's roughly my understanding too. System calls typically check all their (user-provided) arguments carefully, and also copy any needed chunks of memory from __user * pointers -- all of which is inefficient for internal usage. So it's more common to have an inner function that the rest of the kernel calls, and the syscall wraps.

Calling syscalls from the kernel

Posted Jul 23, 2014 14:32 UTC (Wed) by YogeshC (guest, #49966) [Link] (1 responses)

In that case, why has this particular syscall been used in half a dozen places inside the kernel? Is there any specific (or general) reason(s) for read/write syscalls to be called from within the kernel space on so many occasions?

Calling syscalls from the kernel

Posted Aug 11, 2014 12:45 UTC (Mon) by rwmj (subscriber, #5474) [Link]

No one answered your question so I'll have a go.

The main user of sys_read at the moment is the code for unpacking the initramfs. You could imagine this "should" be done by a userspace process, because it's doing a lot of userspace-type stuff, like uncompressing a cpio file and then creating a tree of directories and files from it. It's nearly the equivalent of running:

  zcat initramfs | cpio -id

However because the initramfs is needed before userspace is up -- because creating the initramfs is creating the initial userspace -- they have to do this userspace-type stuff in the kernel instead.

So it's a layering violation, but for understandable reasons.

Anatomy of a system call, part 1

Posted Jul 11, 2014 9:01 UTC (Fri) by geuder (subscriber, #62854) [Link] (4 responses)

Thank you for the very detailed description. Occasionnally I had to track down (parts of) this from the source for debugging some issue and it can be very tedious.

A little detail, but a nuisance for humans, especially if you debug an emebedded target and compare to your desktop known to work correctly: What is the technical reason to have partially architecture specific syscall number? Passing a number in a register does not look like anything inherently architecture specific.

If you quickly need to see the syscall numbers for a given architecture I have seen the hint to look at strace source. I'm just at my phone now, so I don't compare the source now, but I vaguely remember it was indeed easier to navigate than the kernel source proper. Was it so that strace distributes ready lists while for the kernel they need to be built for each architecture?

Anatomy of a system call, part 1

Posted Jul 11, 2014 9:53 UTC (Fri) by sasha (guest, #16070) [Link] (3 responses)

> What is the technical reason to have partially architecture specific syscall number?

For "compatibility" with other, older OSes used on the architecture. No, I do not understand what kind of "compatibility" you can get in such a way, but it is the main reason.

Anatomy of a system call, part 1

Posted Jul 11, 2014 16:55 UTC (Fri) by alonz (subscriber, #815) [Link] (2 responses)

I believe by now the reason is mainly historical.

Originally (in the days of Linux 1.x / 2.0) Linux attempted to be binary compatible to existing Unices on common hardware (the personality(2) system call is also part of this). As time passed, compatibility with other Unices became mostly a non-issue – but now we do need to maintain binary compatibility with older Linux binaries…

Anatomy of a system call, part 1

Posted Jul 11, 2014 22:12 UTC (Fri) by geuder (subscriber, #62854) [Link] (1 responses)

> Originally (in the days of Linux 1.x / 2.0) Linux attempted to be binary compatible to existing Unices on common hardware

That sounds like a reasonable explanation. Besides that back in those days what Unices were running on x86_64 and ARM? And these 2 differ. Yeah well, maybe the dependencies are not direct, but somehow indirect over other platforms?

Anatomy of a system call, part 1

Posted Jul 11, 2014 22:22 UTC (Fri) by sfeam (subscriber, #2841) [Link]

As I recall, linux binaries for alpha would run under Digital Unix on alpha.

Anatomy of a system call, part 1

Posted Jul 13, 2014 0:00 UTC (Sun) by gb (subscriber, #58328) [Link] (2 responses)

Why such a strange implementation taken - put values into registers, switch to ring 0, than put this values into stack? Isn't it make more sense to keep values in the registers till real syscall function which may, if it wants, push values to the stack?

Anatomy of a system call, part 1

Posted Jul 14, 2014 9:45 UTC (Mon) by drysdale (guest, #95971) [Link] (1 responses)

Having the arguments in registers for the ring transition means that there's no need for fancy footwork to get at the userspace stack memory (compare the innards of copy_from_user()).

Storing the registers on the kernel stack allows the state of the registers to be restored on the return to userspace. But once the parameters are available on the stack, there's no need to preserve them in the registers too – the syscall can get its arguments from the stack (i.e. be asmlinkage) and can immediately use (and clobber) the registers.

Anatomy of a system call, part 1

Posted Jul 16, 2014 17:03 UTC (Wed) by nix (subscriber, #2304) [Link]

Quite. Keeping the args on the stack is a non-starter: userspace stacks are swappable, and you *don't* want to have to go checking to see if the args have been swapped out in the instant of ring transition: it's the sort of terribly narrow race that leads to code that rots and then silently breaks in almost-impossible-to-debug ways, and for almost no gain.

But obviously the args have to end up on the stack -- or, rather, have to end up whereever the C ABI for the platform says they should (possibly optimized by asmlinkage, but still, something the compiler supports).

Thanks for this article: I too have wasted entirely too much time tracking this down in pieces now and then: it's nice to have a reference here for next time. Looking forward to the next one.

Broken link

Posted Oct 2, 2019 2:34 UTC (Wed) by emceeMC (guest, #134734) [Link]

Your link to fs/read_write.c (http://lxr.free-electrons.com/source/fs/read_write.c?v=3.14#L511) is broken :(

Broken link

Posted Oct 2, 2019 2:35 UTC (Wed) by emceeMC (guest, #134734) [Link]

Consider replacing it with this.

Anatomy of a system call, part 1

Posted Aug 27, 2020 12:55 UTC (Thu) by vishwajith_k (guest, #140997) [Link] (1 responses)

I could not find implementation for a syscall named clock_nanosleep;
As per this article, name used to implement must be sys_clock_nanosleep, but I found some header files when searched for this identifier (used https://elixir.bootlin.com/ site); Am using arm64 machine and syscall number must be 115.
Can you help me?

Anatomy of a system call, part 1

Posted Aug 27, 2020 21:14 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

It's in kernel/time/posix-stubs.c :

SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, which_clock, int, flags,
const struct __kernel_timespec __user *, rqtp,
struct __kernel_timespec __user *, rmtp)
{
....

Anatomy of a system call, part 1

Defining a syscall with SYSCALL_DEFINEn()

Syscall table entries

x86_64 syscall invocation

Bravo

Calling syscalls from the kernel

Calling syscalls from the kernel

Calling syscalls from the kernel

Calling syscalls from the kernel

Calling syscalls from the kernel

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Broken link

Broken link

Anatomy of a system call, part 1

Anatomy of a system call, part 1

Defining a syscall with `SYSCALL_DEFINEn()`