The kernel's command-line commotion

By Jonathan Corbet
November 27, 2024

For the most part, the 6.13 merge window has gone smoothly, with relatively few problems or disagreements — other than this one, of course. There is one other exception, though, relating to the kernel's presentation of a process's command line to interested user-space observers when a relatively new system call is used. A pull request with a simple change to make that information more user-friendly ran afoul of Linus Torvalds, who has his own view of how it should be managed.

When one looks at a running process, often the first item of interest is which program that process is running. The kernel makes that information available in two different places, both found within a process's /proc directory:

comm holds the "command" that the process is running. When a program is launched with execve(), the base name of the executable file is placed in comm. So if the user runs /usr/bin/rogue, comm will contain "rogue".
cmdline contains the entire command line passed to the running program via the argv array given to execve(); it traditionally holds the program name, which is passed as the first argument in the command line.

The two files contain similar but different information, and have different access characteristics. comm will contain the actual file name that was passed to execve() (but keep on reading); it is also stored in a kernel data structure and can be accessed quickly. Instead, the command name in cmdline is whatever was passed to execve() as argv[0], which may or may not be the actual name of the command. It is stored in the relevant process's address space, meaning that the process itself can change it, and accessing it from another process is a relatively expensive operation. For these reasons, programs like ps or top will use comm instead of cmdline when possible.
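To make the difference concrete, here is a minimal C sketch in which a process reads its own comm and the argv[0] recorded in cmdline. The helper names read_self_comm() and read_self_argv0() are invented for this illustration; only the file formats (a newline-terminated name in comm, NUL-separated arguments in cmdline) come from the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read /proc/self/comm into buf, stripping the trailing newline.
 * Returns 0 on success, -1 on failure. */
int read_self_comm(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    FILE *f = fopen("/proc/self/comm", "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Read argv[0] as recorded in /proc/self/cmdline. The arguments are
 * separated by NUL bytes, so the first C string in the file is argv[0].
 * Returns 0 on success, -1 on failure. */
int read_self_argv0(char *buf, size_t len)
{
    assert(buf != NULL && len > 1);

    FILE *f = fopen("/proc/self/cmdline", "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, len - 1, f);
    fclose(f);
    if (n == 0)
        return -1;
    buf[n] = '\0';   /* guarantees termination if argv[0] was cut short */
    return 0;
}
```

For a program started as /usr/bin/rogue, the first helper would be expected to yield "rogue" and the second whatever string the caller of execve() chose to pass as argv[0].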

As it happens, execve() is not the only way to launch a new program within a process on Linux. There is a library function called fexecve() that takes an open file descriptor for the program to execute rather than its path name; under the hood, it is implemented with execveat(). There is interest in using fexecve() because it allows the target program to be opened and checked (looking for a signature, for example), then executed in a race-free way. Tools like systemd have support for running programs this way, and its developers would prefer to use that mode.
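The open-check-execute pattern described here might look roughly like the following sketch. The names run_checked() and looks_acceptable() are hypothetical, and the "check" is only a stand-in (it accepts any regular file) for whatever signature or checksum verification a real launcher would perform.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical placeholder for a real check (signature, checksum, ...):
 * this version only requires that the descriptor is a regular file. */
static int looks_acceptable(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return 0;
    return S_ISREG(st.st_mode);
}

/* Fork; in the child, open path, check the opened file, and execute that
 * same open file with fexecve(). Returns the child's exit status, or -1
 * if the child could not be created or did not exit normally. */
int run_checked(const char *path, char *const argv[], char *const envp[])
{
    assert(path != NULL && argv != NULL && envp != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* O_CLOEXEC keeps the descriptor out of the new program; note
         * that this combination does not work for interpreter scripts. */
        int fd = open(path, O_RDONLY | O_CLOEXEC);
        if (fd >= 0 && looks_acceptable(fd))
            fexecve(fd, argv, envp);   /* returns only on failure */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the check and the execution both operate on the same descriptor, replacing the file at that path between the two steps has no effect on what is run.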

There is just one little problem. While execve() can initialize comm from the name of the file passed to it, fexecve() only has an open file descriptor that no longer has any path-name information associated with it. That file descriptor may be marked "close on exec", meaning that even any information that may have been found in /proc/PID/fd will be lost. The result, in current kernels, is that, when a program is run with fexecve(), the comm is simply set to the file-descriptor number of the program. If rogue is run with fexecve() from file descriptor five, comm will contain "5" rather than "rogue".

Users, being the irascible creatures that they are, have expressed the unreasonable opinion that replacing the command names of processes in their system-management tools with small integers is an unwelcome change. They have been spoiled by being able to see which program each process is running and feel entitled to that ability in the future. Kernel programming would be so much easier without users, but that is not the world we live in. So the search for a better way to set comm when fexecve() is used was begun.

In a mid-November pull request, Kees Cook included a patch from Tycho Andersen that tried to restore some useful information to /proc/PID/comm in the fexecve() case. In the absence of the file name, the kernel would simply use the information from argv[0] instead, causing the information from comm and cmdline to be essentially the same. That patch had been through a few iterations, and seemed like a good solution to everybody involved.

Torvalds, though, disagreed, saying that "this horrible hack is too broken to live". Over the course of an extended and not always courteous discussion, he argued that argv[0] is under user-space control and can contain any sort of information; the kernel uses comm for its own purposes, and letting user space control it could help attackers to hide the actual executable being run. Copying argv[0] into comm will slow program start, he said. The right solution, according to Torvalds, is to use the file name stored in the directory entry ("dentry") associated with the file to be executed. That information is always present and is reliably under the kernel's control.

The problem with the dentry-based approach, as explained by Eric Biederman, Zbigniew Jędrzejewski-Szmek, and Cook, is that it would change the name of the executable as seen by user space. As an example, a user may run rogue, but some helpful distributor may have long since turned /usr/bin/rogue into a symbolic link to /usr/bin/nethack. On current systems, a tool like ps will show this user, busily at work, running rogue, but a comm value derived from the dentry would use the actual file being executed, so it would show as nethack instead. On some systems, like Debian with its "alternatives" mechanism, the visible names of quite a few commands would change. That could break programs that are looking for specific command names. Setting comm from argv[0], instead, would preserve the original name.

Torvalds, though, was unmoved by this argument. The dentry-based name, he said, is "THE TRUTH", and any program that wants to see argv[0] should just be using cmdline instead. Anybody who wants the behavior of execve() should just not use fexecve(), he added. The patch, as written, would not be pulled into the mainline. Cook tried one more time to explain why using argv[0] was desirable. It is the "friendlier" choice, he said, but if Torvalds was adamant that the dentry-based name must be used, that was going to be the result of the discussion. Torvalds responded: "no. THERE IS NO WAY I WILL ACCEPT THE GARBAGE THAT IS ARGV[0]".

This response seemed to indeed be somewhat adamant, so Cook has subsequently resent his pull request without the controversial patch, saying that the dentry-based approach would be implemented for the 6.14 merge window. Jędrzejewski-Szmek said that this approach could be worked with, "but we'll need to make an effort to warn users and do it much more visibly". There will, he said, inevitably be complaints from users whose scripts have broken.

In the end, this disagreement comes down to a small piece of the kernel's user-space interface that has existed almost since the beginning, but which has never been precisely specified. As with any user-visible behavior, programs have developed a reliance on the way things have traditionally worked, making newer approaches (such as execve() from a file descriptor) hard to implement without breaking things. There may be no ideal solution in this case, but it would have been nice if a workable solution could have been found with less shouting.


Sad outcome

Posted Nov 27, 2024 15:10 UTC (Wed) by mezcalero (subscriber, #45103) [Link] (31 responses)

Frankly this means it's going to be quite hard for systemd to ever switch to fexecve() for things, because if we cannot control comm reasonably then any symlinked binary will show up with the wrong comm[], and there are many of those unfortunately. There are so many multicall binaries after all... tpm2-tss for example is pretty much a single multi-call binary, so are binaries in systemd, util-linux and elsewhere. Sure most of these hopefully check argv[0] rather than comm[], but this is not universally true, and "top" would still be very confusing...

If we ever wanted to support fexecve() for this, we'd have to manually follow any symlinks and then revert back to old execve() support in case we see symlinked binaries, and accept racefulness there again. But uh, that really sucks...

I am kinda looking forward to a future where we pin our files by fds when we are about to use them, and remove all races around TOCTTOU these ways, but it really sucks that this is not a possibility for the most fundamental operation systemd actually does: executing binaries...

Maybe a way out is if the kernel would learn a new fexecve2() syscall or so which takes some additional structure or so with additional parameters for the execution, and some flags, and then one field could be for explicit comm[] control. (And another one could be for explicit selinux execution label control, in place of this super ugly /proc/self/attr/exec interface^Hhack...)

Lennart

Sad outcome

Posted Nov 27, 2024 15:53 UTC (Wed) by Wol (subscriber, #4433) [Link] (12 responses)

Do symbolic links have dentries? Is there any way to store that in eg comm2, so what the user typed and the system looked up can be retrieved?

Cheers,
Wol

Sad outcome

Posted Nov 27, 2024 19:55 UTC (Wed) by adobriyan (subscriber, #30858) [Link] (11 responses)

Yes, but all lookup info gets tossed once final inode has been located.

Sad outcome

Posted Nov 27, 2024 20:26 UTC (Wed) by Wol (subscriber, #4433) [Link]

Then don't toss it?

Cheers,
Wol

Sad outcome

Posted Nov 28, 2024 14:42 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (9 responses)

When exactly does it get tossed? Suppose we construct a file descriptor with open(..., O_NOFOLLOW|O_PATH) (or openat or what have you). Then it probably can't get tossed until after we're in execveat, at which point the kernel should be able to preserve it if it so desires.

Sad outcome

Posted Nov 28, 2024 16:40 UTC (Thu) by adobriyan (subscriber, #30858) [Link] (8 responses)

See terminate_walk() in path_lookupat().

Descriptor pins "struct file" which pins dentry which pins inode.
You can walk upwards to the root and get _some_ name, that's what readlink(/proc/*/fd/*) does.

Now _some_ history must be kept for loop detection and too-deep-recursion detection, but it surely won't exist once the system call exits.

In theory, the name of the first symlink which started the last pathname resolution chain could be kept to use as argv[0] but I don't want to be the one sending such a patch. :-)
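For readers who have not used it, the readlink() on /proc/*/fd/* mentioned above can be sketched in C as follows. The helper names fd_to_path() and path_of() are made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the kernel for a path name describing an open descriptor by
 * reading the /proc/self/fd/N link. Returns 0 on success, -1 on failure. */
int fd_to_path(int fd, char *buf, size_t len)
{
    assert(buf != NULL && len > 1);

    char link[64];
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);

    ssize_t n = readlink(link, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* readlink() does not NUL-terminate its result */
    return 0;
}

/* Convenience wrapper: open path, then report the name the kernel
 * associates with the resulting descriptor. */
int path_of(const char *path, char *buf, size_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ret = fd_to_path(fd, buf, len);
    close(fd);
    return ret;
}
```

The name returned is built from the pinned dentry, so any symbolic links that were followed on the way to the file do not appear in it.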

Sad outcome

Posted Nov 28, 2024 16:54 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (7 responses)

But if we pass O_NOFOLLOW, then we should end up with a path fd that refers to the symlink, not an fd that refers to the executable. Given that (as you say) the fd pins the dentry of the requested file, and the requested file is a symlink and not its target, I don't understand how the kernel would avoid pinning the symlink dentry in that case.

Is the problem that execveat fails to dereference the symlink afterwards?

Sad outcome

Posted Nov 28, 2024 18:50 UTC (Thu) by adobriyan (subscriber, #30858) [Link] (6 responses)

Apparently you can't start lookup from a symlink.

readlink("symlink", "/bin/false", 4096) = 10
openat(AT_FDCWD, "symlink", O_RDONLY|O_NOFOLLOW|O_PATH) = 3
execveat(3, "", NULL, NULL, AT_EMPTY_PATH) = -1 ELOOP
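The system-call trace above can be reproduced with a small C sketch along these lines. The function name exec_via_symlink_fd() is invented here, and the raw syscall() interface is used only so that the example does not depend on a C library new enough to provide an execveat() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open linkpath itself (not its target) as an O_PATH descriptor, then try
 * to execute through that descriptor with AT_EMPTY_PATH. On failure the
 * errno value is returned; if linkpath is not a symbolic link and the
 * exec succeeds, this function does not return at all. */
int exec_via_symlink_fd(const char *linkpath)
{
    assert(linkpath != NULL);

    char *const argv[] = { (char *)linkpath, NULL };
    char *const envp[] = { NULL };

    int fd = open(linkpath, O_RDONLY | O_NOFOLLOW | O_PATH | O_CLOEXEC);
    if (fd < 0)
        return errno;

    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
    int err = errno;   /* only reached if the exec failed */

    close(fd);
    return err;
}
```

When linkpath is a symbolic link, the result matches the trace: the exec is refused with ELOOP rather than following the link.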

Sad outcome

Posted Nov 28, 2024 19:54 UTC (Thu) by dskoll (subscriber, #1630) [Link] (2 responses)

The first argument of execveat needs to be a directory file descriptor. Unless /bin/false is a directory on your system, this shouldn't work even without O_NOFOLLOW.

As to why you get an ELOOP error return, I guess that's just a strange detail of the implementation. I would have thought ENOTDIR would be the appropriate error return.

Sad outcome

Posted Nov 28, 2024 20:15 UTC (Thu) by intelfx (subscriber, #130118) [Link] (1 responses)

> The first argument of execveat needs to be a directory file descriptor. Unless /bin/false is a directory on your system, this shouldn't work even without O_NOFOLLOW.

AT_EMPTY_PATH was used, though.

Sad outcome

Posted Nov 28, 2024 22:24 UTC (Thu) by dskoll (subscriber, #1630) [Link]

Ah, ok, missed that... sorry.

Sad outcome

Posted Nov 29, 2024 15:46 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (2 responses)

I would tend to assume that there is no existing userspace reliance on this behavior (it would be quite insane to use execveat to test whether a path is a symlink, when fstat is right there). Perhaps there are security issues, but OTOH execveat already supports AT_SYMLINK_NOFOLLOW for non-null pathname, and it would (presumably) be trivial to extend that to the AT_EMPTY_PATH case.

Unfortunately, this whole discussion is probably moot for fexecve(3), considering this passage in the man page:

> The idea behind fexecve() is to allow the caller to verify (checksum) the contents of an executable before executing it. Simply opening the file, checksumming the contents, and then doing an execve(2) would not suffice, since, between the two steps, the filename, or a directory prefix of the pathname, could have been exchanged (by, for example, modifying the target of a symbolic link).

If the whole point of the function is to disallow symlink shenanigans, then it is obviously a non-starter to deliberately reintroduce those semantics, so libc would presumably just start passing AT_SYMLINK_NOFOLLOW (if it does not already), and we would be right back where we started.

Sad outcome

Posted Dec 3, 2024 15:14 UTC (Tue) by stevie-oh (subscriber, #130795) [Link] (1 responses)

> If the whole point of the function is to disallow symlink shenanigans

Note that symlinks aren't necessary to this. The operative word here is _shenanigans_: fexecve is designed to prevent this sort of scenario:

1. Guard launcher opens path "/bin/foo"
2. Guard launcher proceeds to read the file contents and verifies the checksum. (It probably also verifies the checksum on all SOs that /bin/foo is linked to.)
3. While #2 is happening, rogue user with sufficient access deletes "/bin/foo" and replaces it with a modified version.
4. Guard launcher finishes verifying checksum on /bin/foo and the checksum passes, so it execve's "/bin/foo". Since that path now refers to a different file/inode, the guard launcher executes the wrong program. Whoops!

By using fexecve instead of execve in step 4, the guard launcher can guarantee that the executable it launches is the _exact same file that it originally opened_.

I see three primary goals here, which currently don't work well together:

1. Some people want/need to be able to prevent certain kinds of shenanigans, which can only be done by using fexecve
2. Other people who aren't as concerned about those shenanigans want the utility of /proc/PID/comm
3. Developers writing launchers such as systemd don't want to have to write separate code paths to satisfy both groups.

Sad outcome

Posted Dec 3, 2024 15:29 UTC (Tue) by intelfx (subscriber, #130118) [Link]

> 1. Some people want/need to be able to prevent certain kinds of shenanigans, which can only be done by using fexecve
> 2. Other people who aren't as concerned about those shenanigans want the utility of /proc/PID/comm
> 3. Developers writing launchers such as systemd don't want to have to write separate code paths to satisfy both groups.

I'd have rather said that everyone wants the utility of /proc/PID/comm. However, while people that specifically have a goal of preventing shenanigans (i.e. those operating secure environments) are probably willing to pay the cost of reduced convenience for security, the people in charge of systemd have a goal of "security by default". And security by default only works if it does not inflict misery elsewhere.

Sad outcome

Posted Nov 27, 2024 16:05 UTC (Wed) by mgedmin (subscriber, #34497) [Link] (4 responses)

What if you use hardlinks instead of symlinks?

Sad outcome

Posted Nov 27, 2024 19:00 UTC (Wed) by nowster (subscriber, #67) [Link]

For the Debian alternatives scenario there are two symlinks: one into /etc/alternatives and another to the actual executable in /bin. There are sometimes good reasons for local configuration in /etc to be on a different filesystem to /bin.

For example:

/bin/editor → /etc/alternatives/editor
/etc/alternatives/editor → /bin/nano

Sad outcome

Posted Nov 27, 2024 19:46 UTC (Wed) by tych0 (subscriber, #105844) [Link] (1 responses)

It's not just systemd here. Things like Alpine's default docker image use a symlinked busybox. They could switch to hard links of course, but asking everyone to switch (ignoring the sibling's point r.e. people who can't switch because they're on different fses) is annoying.

Sad outcome

Posted Nov 27, 2024 21:43 UTC (Wed) by fman (subscriber, #121579) [Link]

Uh oh. That essentially means, when using fexecve() that is, everything running on an embedded system would essentially be "busybox" :-O

Sad outcome

Posted Nov 28, 2024 8:28 UTC (Thu) by mezcalero (subscriber, #45103) [Link]

It's not really up to us to decide what people use. People use what people use. Some people use hardlinks, some people use symlinks, some people look at comm[], others at argv[0] to implement multi-call binaries.

As a systemd maintainer I certainly can give people guidelines, and I can maybe be strict on not supporting completely broken behaviour, but frankly in this case, we don't have that luxury: I don't think using hardlinks or softlinks or comm[] or argv[0] could constitute "clearly broken behaviour", I think all of it is fine in a world where execve() is the law of the land. The whole mess just starts because fexecve() became a thing and it's such an incomplete (to not use the word "broken") interface. And I seriously doubt the right approach is to tell a myriad of projects and distributions to rearrange their stuff to make fexecve() workable, but instead maybe it is to just fix that broken interface.

Sad outcome

Posted Nov 27, 2024 16:14 UTC (Wed) by josh (subscriber, #17465) [Link]

I'm currently working with someone to revive io_uring_spawn; perhaps you'll be able to use that instead.

Sad outcome

Posted Nov 27, 2024 17:54 UTC (Wed) by cschaufler (subscriber, #126555) [Link] (1 responses)

The SELinux (Smack, AppArmor, ...) /proc/self/attr interfaces have been addressed with the lsm_[gs]et_self_attr() system calls.

Sad outcome

Posted Nov 28, 2024 8:21 UTC (Thu) by mezcalero (subscriber, #45103) [Link]

One of the reasons I dislike /proc/self/attr/exec is that it maintains process-wide state: if somebody uses this they first have to set up the label, and then do the execve() in a 2nd step. If for some reason they abort the whole thing, and they don't end up doing an execve() (or the execve() fails, maybe due to ENOENT) they have to undo the label preparation, and that takes care to do right.

I really hate logic like that that establishes some hidden state, for a secondary operation, that if for some reason fails or is not executed for some reason needs to be manually rolled back. A much better interface would be if this data would be passed to the actual execve() so that it's very clear what this is intended for and that it has no other lifecycle, cannot be applied to the wrong execve(), isn't sticky and so on.

Hence lsm_set_self_attr() might be slightly better as it doesn't require procfs anymore, but the fundamental ugliness doesn't really go away, in my PoV.

Lennart

Sad outcome

Posted Nov 27, 2024 18:05 UTC (Wed) by rcampos (subscriber, #59737) [Link]

The comm can't be set just before exec, for example?

Although if it uses a syscall for setting it, you need to do it before the seccomp policies and all.

It doesn't make sense to worry about multicall binaries.

Posted Nov 27, 2024 18:25 UTC (Wed) by ebiederm (subscriber, #35028) [Link] (2 responses)

For a multicall binary to check anything other than argv (to decide its behavior) is against unix convention, it is impossible on other unices, and reading task comm is slower than reading argv0.

AKA that would be a stupid bug.

Plus, a multicall binary can reasonably be hardlinked instead of symlinked, which would use fewer resources in the filesystem and be faster to start up.

The only cases worth worrying about are process management things like ps that naturally read task->comm.

It doesn't make sense to worry about multicall binaries.

Posted Dec 4, 2024 9:57 UTC (Wed) by maxfragg (subscriber, #122266) [Link] (1 responses)

all true, but the output of ps and co suddenly becomes a lot less useful, when half of your system shows up as toybox/busybox instead of sh, sleep, cat, ....

It doesn't make sense to worry about multicall binaries.

Posted Dec 13, 2024 12:17 UTC (Fri) by roblucid (guest, #48964) [Link]

Hmmmm, less useful unless you're interested in the truth of it ..
execve(2) behaviour was not changing. In the fexecve(2) case, if you're not willing to pay some cost because you want to see a file with a verified signature, why are you bothering with the file descriptor? If, say, you have written a shell with fexecve(2) support as a feature, surely you can set up an environment variable and do more smoke & mirrors processing on ps(1)/top(1) via a builtin to protect users from their illusions being shattered.
Scripts have trace features to help debugging, couldn't you just turn off the use of fexecve when developing if necessary?
As somebody said, allowing obfuscation of what you are really running seems to be to the benefit of the "shenanigans" use case.

Sad outcome

Posted Nov 28, 2024 4:15 UTC (Thu) by neilbrown (subscriber, #359) [Link] (2 responses)

I agree that an fexecve2() that takes a "comm" argument would be a good thing. But it isn't necessary.

Just find some private directory, create a symlink from my_comm to /proc/self/fd/NN, make sure NN is CLOSE-ON-EXEC,
and
execveat(private-dir-fd, "my_comm", argv, envp, 0);

As close-on-exec is processed after the target file is opened, this gives you all you need.

Having to find a private directory isn't ideal, but shouldn't be too hard. /run/fexec/$UID/$PID/ ??
Cleaning up might be awkward.

Sad outcome

Posted Nov 28, 2024 8:32 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (1 responses)

Sure I can also copy the binary and rename it, and then glue a floppy disk to my forehead. It's a bit ridiculous to suggest this was a suitable method of operation for *all* service binary invocations while at the same time saying that initializing comm[] from argv[0] was not an option because doing such a memcpy() would be "too slow".

I mean, come on.

(Yes, I know it wasn't you who said copying argv[0] → comm[] was too slow, that was Linus.)

Sad outcome

Posted Dec 13, 2024 12:40 UTC (Fri) by roblucid (guest, #48964) [Link]

The difference is that the cost is only paid by those debugging, rather than exposing every user unwittingly to potentially hostile shenanigans by relying on mutable user-space variables. The people changing to use fexecve(2) for the security benefits should perhaps add a logging option saying what they're calling and what the child pid was.

I developed for many years, then later ran a lot of server machines distributed over many sites including network centers, and kernel-level smoke & mirrors undermines the whole point of switching to the fd-based call. Developers have a tendency to pick the easy option and if you're worried about exploits to race conditions, giving them shell access gets your hosts remotely cracked.

Systemd has bug

Posted Nov 29, 2024 20:01 UTC (Fri) by ebiederm (subscriber, #35028) [Link] (1 responses)

I just read through the systemd code to see why people desire to use fexecve in the systemd code.

Once the file descriptor for the binary is open systemd makes some sanity checks, that are redundant with the checks execve makes in the kernel.

Other than those sanity checks, there is only one user of executable_fd in systemd: setup_smack.

As I read setup_smack, it reads the xattr that holds the label, smack will apply during exec. If the xattr is present systemd applies the label before smack does. Which is silly, but fine except the systemd code skips the checks the smack kernel code makes before applying the label.

Once setup_smack is fixed to not do unnecessary and buggy work there is no reason for systemd to open the file before exec, and thus no reason to call fexecve.

So all of that work can be removed from systemd and the code can become faster and more reliable. As well as making the entire issue of symlinks to binaries a nonissue, because fexecve is unnecessary.

Systemd has bug

Posted Dec 2, 2024 9:25 UTC (Mon) by mezcalero (subscriber, #45103) [Link]

You are mixing up things: we are not making much use of the executable fd so far, because we don't actually use execveat() unless you set the ENABLE_FEXECVE macro, which nobody does. The code to use this was added a while back, in hopeful preparation that some day we could use execveat() properly, but that future never came, so nothing else was moved over to using only the executable fd, because that would be dead code.

It's like arguing: we don't need washing machines, because everyone washes their clothes by hand. Of course they do, if they have no washing machine!

In systemd we are moving the codebase bit by bit over to reference things by fds rather than by paths, i.e. for new stuff we generally only use O_PATH, openat() and friends. For old code we port things over, but we'll never be able to do that properly for execveat(), since it's so unusable right now.

I am not going to comment on the SMACK stuff, it's contributed code by SMACK folks, I have no comprehensive understanding of that.

Lennart

Sad outcome

Posted Nov 30, 2024 21:08 UTC (Sat) by geuder (subscriber, #62854) [Link]

What would systemd have offered us if they could have switched to fexecve()?

The proposed patch was agreed during session at LPC 2024

Posted Nov 27, 2024 15:25 UTC (Wed) by bluca (subscriber, #118303) [Link] (3 responses)

Note that the rejected patch came out of a discussion at LPC 2024, where this was discussed at length: https://www.youtube.com/watch?v=hA2UJ5C_UGw

At this point I do have to wonder why we bother with taking the time and effort to organize these MCs and sessions at the conference, if they are effectively pointless?

The proposed patch was agreed during session at LPC 2024

Posted Nov 27, 2024 16:50 UTC (Wed) by butlerm (subscriber, #13312) [Link] (1 responses)

It seems to me the answer is that the Linux kernel is not governed by a committee; it is ultimately governed by Linus Torvalds. In a reversal of the usual democratic scheme it is a committee that proposes and Linus disposes. And if there are any objections the answer of course is to fork the kernel and rename it after yourself instead.

The proposed patch was agreed during session at LPC 2024

Posted Dec 13, 2024 21:18 UTC (Fri) by Wol (subscriber, #4433) [Link]

Like it or not, the BDFL model is very successful.

As is "he who pays the piper calls the tune" or, in FLOSS terms, "he who puts in the work makes the rules".

I'm not saying all of them are, but many calls for "democracy" in FLOSS projects are "we want to control the committee that tells you where to direct your efforts". That usually is a dead flop as far as volunteers are concerned. And as we've seen - with Firefox amongst others - all too often a foundation intended to support a project has great difficulty paying developers, for whatever reason ...

Cheers,
Wol

The proposed patch was agreed during session at LPC 2024

Posted Nov 27, 2024 20:27 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

It has never been the case in the linux kernel that you have to be in the one specific physical room at one specific time to get your opinion heard.

With a community as large as the linux kernel it is unreasonable to expect that.

At best, what you can conclude is that you didn't have everyone whose opinion mattered in the room.

I honestly find it scary someone would expect that being in a physical room would do more than help get the empathy and attention of the people who care.

only for fexecve?

Posted Nov 27, 2024 15:33 UTC (Wed) by shironeko (subscriber, #159952) [Link] (2 responses)

This change is only for fexecve right? since the current behavior is just a number (not even unique), I'm not sure how going with either of the approaches will break any users, or subvert user expectation?

only for fexecve?

Posted Nov 27, 2024 20:17 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

Yes, only for the case where no filename is supplied to execveat.

There are cases like login shells where argv[0] must take a value that is not really appropriate for use as task->comm.

So no. Userspace won't be broken. There are just cases where slowing down user space using open followed by extra followed by execveat will have a different task->comm.

The silliness of using the ascii string representing the file descriptor number is going away, and that is good.

only for fexecve?

Posted Nov 27, 2024 21:25 UTC (Wed) by jkingweb (subscriber, #113039) [Link]

In short, systemd wants to use fexecve to cleanly avoid data races, but the lack of reliable comm data makes this impractical. The change in the kernel doesn't break anything, but systemd switching to fexecve even with this change would break things.

The security concern

Posted Nov 27, 2024 18:51 UTC (Wed) by carlosrodfern (subscriber, #166486) [Link] (9 responses)

Linus brought up a security concern that is valid. What do others supporting the `argv[0]` approach think about that?

The security concern

Posted Nov 27, 2024 19:18 UTC (Wed) by roc (subscriber, #30627) [Link] (8 responses)

What security concern?

Linus went on about "comm" being "THE TRUTH" but it's actually not that useful as a source of truth because a process can call prctl(PR_SET_NAME) to set its "comm" to whatever it wants. I hope someone pointed that out in the email discussion; it was tangentially alluded to in a later message, at least.

It's also a bit rich for Linus to rant about other developers being idiots in the same messages where he was completely wrong about the default behaviour of "ps". He ought to have apologised for that.

The security concern

Posted Nov 27, 2024 19:24 UTC (Wed) by carlosrodfern (subscriber, #166486) [Link] (4 responses)

I was referring to this:

> the kernel uses comm for its own purposes, and letting user space control it could help attackers to hide the actual executable being run. Copying argv[0] into comm will slow program start, he said. The right solution, according to Torvalds, is to use the file name stored in the directory entry ("dentry") associated with the file to be executed. That information is always present and is reliably under the kernel's control.

The security concern

Posted Nov 27, 2024 20:22 UTC (Wed) by Wol (subscriber, #4433) [Link]

> That information is always present and is reliably under the kernel's control.

But it's "lying" to the user - it's not the executable they "asked for" ...

That said, I'm a bit surprised that systemd doesn't want to use the secure fexecv or whatever it was - the usual attitude is "do it right and if buggy code breaks, tough". Probably what they should do is implement it as an option, off by default to start with, then on, then only choice. If buggy code isn't fixed, it'll just have to deal with the consequences.

Cheers,
Wol

The security concern

Posted Nov 27, 2024 21:20 UTC (Wed) by roc (subscriber, #30627) [Link] (2 responses)

Yes, I think that's incorrect. Userspace can already control "comm" via PR_SET_NAME. Seems like Linus forgot about that... while ranting about what idiots other developers are.
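As an illustration of that point, a process can rename itself with a few lines of C. This is a minimal sketch: set_comm() and get_comm() are invented wrapper names, while PR_SET_NAME and PR_GET_NAME are the real prctl() operations, which act on the calling thread and limit the name to 16 bytes including the terminating NUL.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

/* Set the comm value of the calling thread. Names longer than 15 bytes
 * are silently truncated by the kernel. Returns 0 on success. */
int set_comm(const char *name)
{
    assert(name != NULL);
    return prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0);
}

/* Read the comm value of the calling thread into buf, which must hold
 * at least 16 bytes. Returns 0 on success. */
int get_comm(char buf[16])
{
    assert(buf != NULL);
    memset(buf, 0, 16);
    return prctl(PR_GET_NAME, (unsigned long)buf, 0, 0, 0);
}
```

After set_comm("rogue") in a single-threaded program, tools that read /proc/PID/comm would show "rogue" regardless of which executable is actually running.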

The security concern

Posted Nov 27, 2024 21:35 UTC (Wed) by carlosrodfern (subscriber, #166486) [Link] (1 responses)

Perhaps he was referring to hiding processes the attacker didn't provide the binary for? For example, using some existing sftp, httpd, etc... program and hiding it with some `comm` looking like something else?

The security concern

Posted Nov 27, 2024 23:20 UTC (Wed) by roc (subscriber, #30627) [Link]

The attacker can create a hard link to get the same effect. Or they can use ptrace to inject a prctl call after exec.

There are some situations where a restricted attacker could manipulate argv[0] but not comm. But they're very narrow. Just ranting that "comm is THE TRUTH" is totally misleading.

The security concern

Posted Nov 27, 2024 21:40 UTC (Wed) by carlosrodfern (subscriber, #166486) [Link]

He may not be referring to programs that change their name programmatically, but attackers reusing existing binaries and hiding their attack with links.

For example,

$ ln -s /usr/bin/sleep ./bash
$ ./bash 60

$ ps -e -o comm,cmd | grep bash
bash ./bash 60

The security concern

Posted Nov 27, 2024 22:44 UTC (Wed) by kees (subscriber, #27264) [Link] (1 responses)

> He ought to have apologised for that

Yes, he should have. Especially after directly insulting all the other participants in the discussion. I thought we were supposed to have a sane CoC?

The security concern

Posted Nov 28, 2024 2:22 UTC (Thu) by npws (subscriber, #168248) [Link]

The entire discussion is a complete train wreck. Linus keeps ranting on and on, making incorrect claims left and right, while shouting and insulting people. Apparently neither CoC nor userspace matter when he doesn't like it.

Irascible

Posted Nov 27, 2024 18:55 UTC (Wed) by jheiss (subscriber, #62556) [Link] (2 responses)

I read that paragraph to my wife and my eyes are watering. Great writeup.

Irascible

Posted Nov 27, 2024 20:15 UTC (Wed) by MrWim (subscriber, #47432) [Link] (1 responses)

Which paragraph made your eyes water?

Irascible

Posted Nov 27, 2024 20:45 UTC (Wed) by carlosrodfern (subscriber, #166486) [Link]

The hilarious one, lol:

> Users, being the irascible creatures that they are, have expressed the unreasonable opinion that replacing the command names of processes in their system-management tools with small integers is an unwelcome change. They have been spoiled by being able to see which program each process is running and feel entitled to that ability in the future. Kernel programming would be so much easier without users, but that is not the world we live in.

Add an execveat flag?

Posted Nov 27, 2024 22:15 UTC (Wed) by aszs (subscriber, #50252) [Link] (1 responses)

Why not just add a flag to execveat so that it uses pathname to set comm (and treats dirfd as the executable's fd)?

Add an execveat flag?

Posted Nov 28, 2024 8:35 UTC (Thu) by mezcalero (subscriber, #45103) [Link]

That would make sense to me.

Simple solution

Posted Nov 28, 2024 8:04 UTC (Thu) by rrolls (subscriber, #151126) [Link] (3 responses)

Add a new version of fexecve/execveat which takes an arbitrary string to be placed in `comm` in addition to the file descriptor.

Programs wishing to use this instead of execve, when the original path is a symlink, can get the basename of the original path themselves, do whatever opening and checking they like of the contents of the file, then pass that basename to be stored in `comm`.

Everyone wins.

Simple solution

Posted Nov 28, 2024 11:27 UTC (Thu) by lkundrak (subscriber, #43452) [Link] (1 responses)

Yes. Or even use the existing call, with a flag not to touch the comm altogether. That way the calling process could just: fork(); prctl(PR_SET_NAME, "lalala"); execveat(..., AT_KEEP_PR_NAME); and be done with it.
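
For context on why a new flag would be needed for this: today, a name set with PR_SET_NAME between fork() and exec does not survive the exec, because the kernel resets comm when the new program starts. A rough Python sketch of that existing behaviour follows. AT_KEEP_PR_NAME above is a proposed flag and does not appear here; the helper names are hypothetical, and the example assumes a `sleep` binary on PATH.

```python
import ctypes
import subprocess

PR_SET_NAME = 15  # value from <linux/prctl.h>

# Assumption: the C library's prctl() can be called through ctypes.
_libc = ctypes.CDLL(None, use_errno=True)


def _rename_before_exec() -> None:
    # Runs in the child, after fork() and before exec.
    _libc.prctl(PR_SET_NAME, ctypes.c_char_p(b"lalala"), 0, 0, 0)


def comm_after_exec(argv: list) -> str:
    """Start argv with comm preset to "lalala"; return comm after the exec."""
    # Popen returns only once the child's exec has succeeded.
    child = subprocess.Popen(argv, preexec_fn=_rename_before_exec)
    try:
        with open(f"/proc/{child.pid}/comm") as f:
            return f.read().rstrip("\n")
    finally:
        child.kill()
        child.wait()
```

comm_after_exec(["sleep", "60"]) reports "sleep" rather than "lalala": the exec replaced the name chosen by the parent, which is exactly the step the suggested flag would skip.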

Simple solution

Posted Nov 29, 2024 14:04 UTC (Fri) by vbabka (subscriber, #91706) [Link]

Maybe even the pathname argument could be repurposed to become comm with AT_EMPTY_PATH (and/or another new flag to control this new behavior), because normally it's an empty string with AT_EMPTY_PATH? That would avoid the need for prctl().

Simple solution

Posted Nov 28, 2024 20:22 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link]

That's basically the same idea as copying *argv, just a bit more complicated. /proc/self/comm is one of two things:

1. The (first 16 characters of the) file which was actually executed by the kernel.
2. A string the program which was executed passed as argument to PR_SET_NAME.

This means that it's not under the control of the code which executed the exec system call. In contrast to this, *argv is the first string of the argument vector. By convention, this is also the filename of the executed file, but that's really just a convention. It can be any string the executing process desired to use as first argument and it may even not exist at all, i.e. *argv may be NULL.

Copying *argv (or, for that matter, any other string the executing process can either choose freely or omit altogether) thus doesn't solve the problem that, for programs executed via file descriptor, the correct comm value is useless for determining information about the actually running program.

There's no correct solution for setting comm to the value it would have had if execve with a filename had been used instead of execveat/fexecve, because the name which was used to open the file descriptor may no longer refer to the same file by the time it's executed. Using the name from the dentry is probably the best approximation, as that's at least a name referring to the file which is being executed.

execv is weird for not resolving symlinks for comm

Posted Nov 28, 2024 11:34 UTC (Thu) by jengelh (subscriber, #33263) [Link]

Given { symlink("/bin/sleep", "slp"); execl("./slp", "sla", "3600", NULL); } or e.g. `less -S slp`, and ignoring base filesystem symlinks like /bin -> /usr/bin under some distros, tell me who the oddball is without telling me who the oddball is:

* Sun lsof: /bin/sleep
* Sun ps -o args: sla 3600
* Sun ps -o comm: sla
* FreeBSD lsof: /bin/sleep
* FreeBSD ps -o args: sla 3600
* FreeBSD ps -o comm: sleep
* Linux lsof (/proc/N/fd/,/proc/N/maps): /bin/sleep
* Linux ps -o args: sla 3600
* Linux ps -o comm: slp
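
The Linux rows can be reproduced without ps by reading /proc directly. The following Python sketch mirrors the symlink()/execl() pair above: it runs `sleep` through a symlink named "slp" with argv[0] set to "sla". The function name is invented for this example, and it assumes that `sleep` is a standalone binary that ignores argv[0] (a multicall busybox-style binary would refuse to run as "sla").

```python
import os
import shutil
import subprocess
import tempfile


def three_views(link_name: str = "slp", argv0: str = "sla") -> dict:
    """Run sleep via a symlink and report exe, cmdline and comm for it."""
    target = shutil.which("sleep")
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, link_name)
        os.symlink(target, link)
        # executable= is the path handed to execve(); the list is argv.
        child = subprocess.Popen([argv0, "3600"], executable=link)
        try:
            proc = f"/proc/{child.pid}"
            exe = os.readlink(f"{proc}/exe")          # what lsof reports
            with open(f"{proc}/cmdline", "rb") as f:  # what ps -o args reports
                args = f.read().rstrip(b"\0").split(b"\0")
            with open(f"{proc}/comm") as f:           # what ps -o comm reports
                comm = f.read().rstrip("\n")
        finally:
            child.kill()
            child.wait()
    return {
        "exe": exe,
        "args": " ".join(a.decode() for a in args),
        "comm": comm,
    }
```

On Linux this yields the fully resolved path of the sleep binary for "exe", "sla 3600" for "args", and "slp" for "comm": three different names for one process, none of them resolved the same way.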

Neither the dentry nor the argv[0] seems a good solution

Posted Nov 28, 2024 12:26 UTC (Thu) by THALES (subscriber, #134787) [Link]

There is a reason the user cannot access the comm when a process has been executed with fexecve(). That is because the kernel has no guarantee that the file being executed is the same as the one on the disk. A process could open an executable file, tamper with it, then execute it. The truly safe behavior is the existing one; neither the dentry nor argv[0] seems a good solution.

What if there is no dentry?

Posted Nov 28, 2024 14:54 UTC (Thu) by epa (subscriber, #39769) [Link]

What happens if you open an executable file, unlink it from the directory, and then run it with fexecve()?
Or if the file has two links?
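
The unlink case is easy to try. Below is a small Python sketch, offered as an experiment rather than an answer, since the result depends on the running kernel. It copies `sleep` into a temporary directory, opens the copy, unlinks it, and then executes it through the descriptor; Python's os.execve() accepts a file descriptor on Linux and uses fexecve() in that case. The function name and the "sleep-copy" file name are invented here, and the temporary directory is assumed not to be mounted noexec.

```python
import os
import shutil
import signal
import tempfile


def exec_unlinked_by_fd() -> dict:
    """Execute an already-unlinked copy of sleep by fd; report comm and exe."""
    with tempfile.TemporaryDirectory() as tmp:
        copy = os.path.join(tmp, "sleep-copy")
        shutil.copy(shutil.which("sleep"), copy)  # keeps the execute bits
        fd = os.open(copy, os.O_RDONLY)
        os.unlink(copy)  # no directory entry points at the file any more

        # Both pipe ends are close-on-exec: the parent sees EOF once the
        # child has either completed its exec or exited.
        gate_r, gate_w = os.pipe()
        pid = os.fork()
        if pid == 0:
            try:
                os.execve(fd, ["sleep", "60"], dict(os.environ))
            finally:
                os._exit(127)  # reached only if the exec failed

        os.close(gate_w)
        os.read(gate_r, 1)
        os.close(gate_r)
        os.close(fd)
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().rstrip("\n")
            try:
                exe = os.readlink(f"/proc/{pid}/exe")
            except OSError:
                exe = None  # e.g. the exec failed and the child is gone
        finally:
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
    return {"fd": fd, "comm": comm, "exe": exe}
```

On a kernel with the behaviour the article complains about, "comm" should come back as the descriptor number; with the dentry-based approach it would presumably be the name the file had before it was unlinked. Either way, "exe" is expected to show the old path with a " (deleted)" suffix.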

That will be very confusing when using extreme multicall binaries like busybox and gzip, combined with debian style alternatives!

Posted Nov 30, 2024 8:24 UTC (Sat) by koenkooi (subscriber, #71861) [Link] (1 responses)

$ ps
busybox
busybox
busybox
gzip
busybox

This seems to be one of the rare cases where you can, in good faith, ask "which truth"? The user types in ls, tar, wget, zgrep but under the hood it's either busybox or gzip.

That will be very confusing when using extreme multicall binaries like busybox and gzip, combined with debian style alternatives!

Posted Dec 1, 2024 14:19 UTC (Sun) by cplaplante (subscriber, #107196) [Link]

It's a shame we can't enhance `ps` to report the symlinks, like `ls`. E.g.:

$ ps
ls -> /bin/busybox
tar -> /bin/busybox

etc.

Can we record comm in fdtable at open()?

Posted Dec 15, 2024 7:35 UTC (Sun) by consend (guest, #132320) [Link] (2 responses)

I understand that the only accurate information fexecve() can use is the file descriptor (fd). So how about selectively storing comm in the fdtable during open? This would allow fexecve() to read it. However, this might bring some issues, such as increasing the burden of open(), requiring more memory, etc., or there might be some obvious problems that I haven't thought of?

Can we record comm in fdtable at open()?

Posted Dec 15, 2024 16:48 UTC (Sun) by viro (subscriber, #7872) [Link] (1 responses)

First of all, fdtable is obviously the wrong place for anything of that sort - you'd have to copy that on dup(), fork(), etc.; if anything, it's a property of an open file, not of a descriptor. Putting the last component of the pathname that had been used to open a file into the resulting struct file... Theoretically doable, but that would cost quite a bit, with no clear benefit - you'd have to do that to all files, just for the sake of the vanishingly small subset that would be involved in fexecve(), and the usefulness of fexecve() itself is not obvious.

Can we record comm in fdtable at open()?

Posted Dec 18, 2024 12:22 UTC (Wed) by consend (guest, #132320) [Link]

Thank you for your reply! I realized that putting it in the fdtable was indeed a wrong idea. Intuitively, the same file instance should share the same name, and placing it in the fdtable is semantically unclear and violates the lightweight design of descriptors. Just like you said, it's a property of an open file, not of a descriptor. Additionally, you mentioned saving filenames for all files. I had thought about simply adding a flag to open() to control whether the name is saved, but indeed, adding a flag for an extremely small number of special cases may not be a good choice.