Anatomy of a system call, additional content
This page holds additional content associated with the "Anatomy of a system call" articles at LWN. The material is low-level and detailed enough that it sits better outside the narrative flow of those articles.
Step-by-step expansion of SYSCALL_DEFINEn()
As described in part 1, the SYSCALL_DEFINEn() macros initially give two distinct chunks of code:
    SYSCALL_METADATA(_read, 3, unsigned int, fd, char __user *, buf, size_t, count)
    __SYSCALL_DEFINEx(3, _read, unsigned int, fd, char __user *, buf, size_t, count)
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;
        /* ... */
The first part provides metadata about the syscall for tracing purposes, and the second gives the syscall implementation. Let's examine the details of each in turn.
Syscall Metadata
The first thing to notice about the SYSCALL_METADATA() part is that it's only used for a kernel build with CONFIG_FTRACE_SYSCALLS defined. The definition of this configuration option describes it as enabling a "Basic tracer to catch the syscall entry and exit events", which makes sense.
Expanding the SYSCALL_METADATA() macro gives a bunch of boilerplate code:
    static const char *types__read[] = {
        __MAP(3,__SC_STR_TDECL, unsigned int, fd, char __user *, buf, size_t, count)
    };
    static const char *args__read[] = {
        __MAP(3,__SC_STR_ADECL, unsigned int, fd, char __user *, buf, size_t, count)
    };

    static struct syscall_metadata __syscall_meta__read;
    static struct ftrace_event_call __used
      event_enter__read = {
        .name = "sys_enter""_read",
        .class = &event_class_syscall_enter,
        .event.funcs = &enter_syscall_print_funcs,
        .data = (void *)&__syscall_meta__read,
        .flags = TRACE_EVENT_FL_CAP_ANY,
    };
    static struct ftrace_event_call __used
      __attribute__((section("_ftrace_events")))
      *__event_enter__read = &event_enter__read;

    static struct syscall_metadata __syscall_meta__read;
    static struct ftrace_event_call __used
      event_exit__read = {
        .name = "sys_exit""_read",
        .class = &event_class_syscall_exit,
        .event.funcs = &exit_syscall_print_funcs,
        .data = (void *)&__syscall_meta__read,
        .flags = TRACE_EVENT_FL_CAP_ANY,
    };
    static struct ftrace_event_call __used
      __attribute__((section("_ftrace_events")))
      *__event_exit__read = &event_exit__read;

    static struct syscall_metadata __used
      __syscall_meta__read = {
        .name = "sys""_read",
        .syscall_nr = -1,
        .nb_args = 3,
        .types = 3 ? types__read : NULL,
        .args = 3 ? args__read : NULL,
        .enter_event = &event_enter__read,
        .exit_event = &event_exit__read,
        .enter_fields = LIST_HEAD_INIT(__syscall_meta__read.enter_fields),
    };
    static struct syscall_metadata __used
      __attribute__((section("__syscalls_metadata")))
      *__p_syscall_meta__read = &__syscall_meta__read;
There's one more macro of interest to expand here, which is the __MAP(n, m, ...) construct. This applies the macro passed as m to each of the n (type, name) argument pairs in turn; here it's used with the __SC_STR_TDECL() and __SC_STR_ADECL() macros to make strings from the type name and argument name respectively. This changes the types__read and args__read variable definitions to:
    static const char *types__read[] = {
        "unsigned int", "char __user *", "size_t"
    };
    static const char *args__read[] = {
        "fd", "buf", "count"
    };
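The mapping can be tried outside the kernel. What follows is a minimal user-space sketch, not the kernel header: the DEMO_-prefixed macros are hypothetical re-creations of __MAP(), __SC_STR_TDECL() and __SC_STR_ADECL(), limited to three pairs, and the kernel-only __user annotation is left out of the argument list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel's __MAP helpers.
 * Each level handles one (type, name) pair and hands the rest down. */
#define DEMO_MAP1(m, t, a)      m(t, a)
#define DEMO_MAP2(m, t, a, ...) m(t, a), DEMO_MAP1(m, __VA_ARGS__)
#define DEMO_MAP3(m, t, a, ...) m(t, a), DEMO_MAP2(m, __VA_ARGS__)
#define DEMO_MAP(n, ...)        DEMO_MAP##n(__VA_ARGS__)

/* Per-pair macros: turn the type or the argument name into a string. */
#define DEMO_STR_TDECL(t, a) #t
#define DEMO_STR_ADECL(t, a) #a

static const char *demo_types[] = {
    DEMO_MAP(3, DEMO_STR_TDECL, unsigned int, fd, char *, buf, size_t, count)
};
static const char *demo_args[] = {
    DEMO_MAP(3, DEMO_STR_ADECL, unsigned int, fd, char *, buf, size_t, count)
};

/* Number of strings the mapping produced. */
static size_t demo_count(void)
{
    return sizeof(demo_types) / sizeof(demo_types[0]);
}

/* Check that the first and last pairs were stringified as expected. */
static void demo_selftest(void)
{
    assert(strcmp(demo_types[0], "unsigned int") == 0);
    assert(strcmp(demo_types[2], "size_t") == 0);
    assert(strcmp(demo_args[0], "fd") == 0);
    assert(strcmp(demo_args[2], "count") == 0);
}
```

Calling demo_selftest() from a small main() is enough to exercise it. The recursion is unrolled by hand because the C preprocessor cannot loop, which is presumably why a separate helper exists for each argument count.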
We're not going to explore this code further, but it's easy to see that this provides a lot of metadata that would help when tracing syscall invocations.
Syscall Definition
Now let's expand the __SYSCALL_DEFINEx() part, using its definition.
    asmlinkage long sys_read(__MAP(3,__SC_DECL, unsigned int, fd, char __user *, buf, size_t, count))
        __attribute__((alias(__stringify(SyS_read))));

    static inline long SYSC_read(__MAP(3,__SC_DECL, unsigned int, fd, char __user *, buf, size_t, count));

    asmlinkage long SyS_read(__MAP(3,__SC_LONG, unsigned int, fd, char __user *, buf, size_t, count));

    asmlinkage long SyS_read(__MAP(3,__SC_LONG, unsigned int, fd, char __user *, buf, size_t, count))
    {
        long ret = SYSC_read(__MAP(3,__SC_CAST, unsigned int, fd, char __user *, buf, size_t, count));
        __MAP(3,__SC_TEST, unsigned int, fd, char __user *, buf, size_t, count);
        asmlinkage_protect(3, ret,__MAP(3,__SC_ARGS, unsigned int, fd, char __user *, buf, size_t, count));
        return ret;
    }

    static inline long SYSC_read(__MAP(3,__SC_DECL, unsigned int, fd, char __user *, buf, size_t, count))
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;
        /* ... */
We've not yet expanded the __MAP() macros, but we can already see the code structure:
The implementation of the syscall follows this macro expansion, so it becomes the body of SYSC_read(); that function is static and so inaccessible outside of this source file.
The code also defines a SyS_read() function to wrap SYSC_read() and do a couple of other things. The wrapper costs little because the SYSC_read() function being wrapped is also declared inline, so the compiler can fold it into the wrapper.
The name sys_read() is defined as an alias to the SyS_read() wrapper function.
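The three-part structure can be reproduced in user space. This is a sketch under assumptions: demo_impl, Demo_wrap and demo_entry are hypothetical names standing in for SYSC_read, SyS_read and sys_read, and it relies on the GCC alias attribute on an ELF target such as Linux. Unlike the kernel, the alias here is declared with the wrapper's own argument types to keep the sketch free of compiler warnings.

```c
#include <assert.h>

/* Forward declaration of the real implementation: file-local and inline. */
static inline long demo_impl(unsigned int fd, unsigned long count);

/* Wrapper that receives every argument as a long. */
long Demo_wrap(long fd, long count);

/* Public name: a second symbol that resolves to the wrapper's code. */
long demo_entry(long fd, long count) __attribute__((alias("Demo_wrap")));

/* The wrapper casts each argument back to its declared type. */
long Demo_wrap(long fd, long count)
{
    long ret = demo_impl((unsigned int)fd, (unsigned long)count);
    return ret;
}

/* The body that, in the kernel, the programmer writes after the macro. */
static inline long demo_impl(unsigned int fd, unsigned long count)
{
    return (long)fd + (long)count;
}

/* Both names reach the same implementation. */
static void demo_selftest(void)
{
    assert(demo_entry(3, 4) == 7);
    assert(Demo_wrap(3, 4) == 7);
}
```

Only Demo_wrap and demo_entry are visible to the linker; demo_impl disappears into the wrapper once inlined.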
Expanding the various __MAP instances gives:
    asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count)
        __attribute__((alias(__stringify(SyS_read))));

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count);

    asmlinkage long SyS_read(__SC_LONG(unsigned int,fd), __SC_LONG(char __user *,buf), __SC_LONG(size_t,count));

    asmlinkage long SyS_read(__SC_LONG(unsigned int,fd), __SC_LONG(char __user *,buf), __SC_LONG(size_t,count))
    {
        long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count);
        (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(unsigned int) && sizeof(unsigned int) > sizeof(long)),
        (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(char __user *) && sizeof(char __user *) > sizeof(long)),
        (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(size_t) && sizeof(size_t) > sizeof(long));
        asmlinkage_protect(3, ret,fd, buf, count);
        return ret;
    }

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count)
I've left a couple of unexpanded macros here. Firstly, something like __SC_LONG(unsigned int, fd) expands to:
__typeof(__builtin_choose_expr((__same_type((unsigned int)0, 0LL) || __same_type((unsigned int)0, 0ULL)), 0LL, 0L)) fd
This uses various gcc extensions to:
Determine if the provided argument type (here unsigned int) is actually the same type as either of the 0LL or 0ULL literals, i.e. whether it is a long long int or unsigned long long int. This is the __TYPE_IS_LL() macro.
If it is, use the type of 0LL, i.e. long long int.
If it isn't, use the type of 0L, i.e. long int.
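The type selection above can be checked with a small user-space sketch. The DEMO_-prefixed macros are hypothetical stand-ins for __same_type(), __TYPE_IS_LL() and the type part of __SC_LONG(); they rely on the same gcc builtins and so assume gcc or a compatible compiler.

```c
#include <assert.h>

/* Are the types of the two expressions compatible? */
#define DEMO_SAME_TYPE(a, b) \
    __builtin_types_compatible_p(__typeof__(a), __typeof__(b))

/* Is t one of the long long types? */
#define DEMO_TYPE_IS_LL(t) \
    (DEMO_SAME_TYPE((t)0, 0LL) || DEMO_SAME_TYPE((t)0, 0ULL))

/* Type used for a wrapper argument whose declared type is t:
 * long long for the long long types, plain long for everything else. */
#define DEMO_SC_LONG_T(t) \
    __typeof__(__builtin_choose_expr(DEMO_TYPE_IS_LL(t), 0LL, 0L))

/* Compile-time checks of the selection. */
_Static_assert(__builtin_types_compatible_p(DEMO_SC_LONG_T(unsigned int), long),
               "unsigned int is carried in a long");
_Static_assert(__builtin_types_compatible_p(DEMO_SC_LONG_T(char *), long),
               "pointers are carried in a long");
_Static_assert(__builtin_types_compatible_p(DEMO_SC_LONG_T(unsigned long long), long long),
               "unsigned long long is carried in a long long");

/* The same facts, checked at run time through sizeof. */
static void demo_selftest(void)
{
    assert(sizeof(DEMO_SC_LONG_T(unsigned int)) == sizeof(long));
    assert(sizeof(DEMO_SC_LONG_T(long long)) == sizeof(long long));
}
```

Because __builtin_choose_expr() picks one of its two expressions at compile time, __typeof__ sees only the selected literal, and the result can be used anywhere a type name is expected.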
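The other unexpanded macro is a combination of BUILD_BUG_ON_ZERO() with __TYPE_IS_LL(), which generates a compile-time error if any argument type is larger than a long without being one of the long long types. None of the arguments to read() is a long long, so each wrapper parameter becomes a long int and the checks produce no code. Putting these together we have: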
    asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count)
        __attribute__((alias(__stringify(SyS_read))));

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count);

    asmlinkage long SyS_read(long int fd, long int buf, long int count);

    asmlinkage long SyS_read(long int fd, long int buf, long int count)
    {
        long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count);
        asmlinkage_protect(3, ret, fd, buf, count);
        return ret;
    }

    static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count)
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;
        /* ... */
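The compile-time size check that vanished from the listing above can also be sketched in user space. The DEMO_-prefixed macros are hypothetical and only modelled on BUILD_BUG_ON_ZERO() and __SC_TEST(); the negative bit-field width trick and the struct with no named members both assume gcc or a compatible compiler.

```c
#include <assert.h>

/* Evaluates to 0 when e is false. When e is true the bit-field width
 * becomes -1, which the compiler rejects, so the build fails. */
#define DEMO_BUILD_BUG_ON_ZERO(e) ((int)sizeof(struct { int : -!!(e); }))

#define DEMO_SAME_TYPE(a, b) \
    __builtin_types_compatible_p(__typeof__(a), __typeof__(b))
#define DEMO_TYPE_IS_LL(t) \
    (DEMO_SAME_TYPE((t)0, 0LL) || DEMO_SAME_TYPE((t)0, 0ULL))

/* Reject types that are wider than long without being long long. */
#define DEMO_SC_TEST(t) \
    ((void)DEMO_BUILD_BUG_ON_ZERO(!DEMO_TYPE_IS_LL(t) && sizeof(t) > sizeof(long)))

/* Returns 1 if it compiles at all: every check below is resolved by
 * the compiler and leaves no code behind. */
static int demo_checks_pass(void)
{
    DEMO_SC_TEST(unsigned int);
    DEMO_SC_TEST(char *);
    DEMO_SC_TEST(long long);
    /* A type wider than long that is not long long would stop the build
     * here, e.g. DEMO_SC_TEST(long double) on typical x86_64 Linux. */
    return 1;
}

static void demo_selftest(void)
{
    assert(demo_checks_pass() == 1);
    assert(DEMO_BUILD_BUG_ON_ZERO(0) == 0);
}
```

The check is an expression rather than a statement, which is what lets the kernel chain several of them with commas as in the earlier listing.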
Expanding asmlinkage
The definitions of asmlinkage and asmlinkage_protect() are architecture-specific. For example, on x86_64 both expand to nothing, but on x86_32 asmlinkage expands to __attribute__((regparm(0))) (with an extern "C" prefix only when compiled as C++) and our asmlinkage_protect(3, ret, fd, buf, count); expands to:
__asm__ __volatile__ ("" : "=r" (ret) : "0" (ret), "m" (fd), "m" (buf), "m" (count));
The gcc docs for the regparm (number) attribute say "On the Intel 386, the regparm attribute causes the compiler to pass arguments number one to number if they are of integral type in registers EAX, EDX, and ECX instead of on the stack." So having regparm(0) makes the compiler expect arguments on the stack as desired.
The extended assembly for asmlinkage_protect is structured as (template : output operands : input operands). The template is empty, so no actual assembly is inserted, but the operands tell the compiler that ret and the argument stack slots are still in use at that point. This stops it from reusing those slots, which belong to the caller, for temporaries or from turning the final call into a tail call. In particular, the output operand ret is constrained to a register ("=r"), the first input operand ("0") ties ret to that same register, and the remaining input operands (fd, buf, count) are memory operands ("m").
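The same statement shape compiles in ordinary user-space code, which makes it easy to confirm that it emits no instructions and leaves the result untouched. This is a minimal sketch with a hypothetical demo_sum() function; it assumes gcc-style extended asm and does not reproduce the x86_32 calling convention.

```c
#include <assert.h>

/* Adds three values, then applies an asmlinkage_protect-style barrier. */
static long demo_sum(long a, long b, long c)
{
    long ret = a + b + c;

    /* Empty template: nothing is emitted. The operands make the compiler
     * treat ret as rewritten here and a, b, c as live in memory until
     * this point. */
    __asm__ __volatile__ ("" : "=r" (ret) : "0" (ret), "m" (a), "m" (b), "m" (c));

    return ret;
}

/* The barrier must not change the computed value. */
static void demo_selftest(void)
{
    assert(demo_sum(1, 2, 3) == 6);
}
```

Compiling this with gcc -S and reading the generated assembly shows how the "m" operands force the arguments to have memory locations before the function returns.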