Rethinking race-free process signaling
The fundamental problem being addressed by the pidfd concept is process-ID reuse. Most Linux systems have the maximum PID set to 32768; if lots of processes (and threads) are created, it doesn't take a long time to use all of the available PIDs, at which point the kernel will cycle back to the beginning and start reusing the ones that have since become free. That reuse can happen quickly, and processes that work with PIDs might not notice immediately that a PID they hold referred to a process that has exited. In such conditions, a stale PID could be used to send a signal to the wrong process. As Jann Horn pointed out, real vulnerabilities have resulted from this problem.
A pidfd is a file descriptor that is obtained by opening a process's directory in the /proc virtual filesystem; it functions as a reference to the process of interest. If that process exits, its PID might be reused by the kernel, but any pidfds referring to that process will continue to refer to it. Passing a pidfd to pidfd_send_signal() will either signal the correct process (if it still exists), or return an error if the process has exited; it is guaranteed not to signal the wrong process. So it would seem that this problem has been solved.
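As a rough illustration of how that works in 5.1, the following sketch (not taken from any of the patches) opens a process's /proc directory and passes the resulting descriptor to pidfd_send_signal(); the fallback syscall number is an assumption for systems whose headers do not yet define it:

/* Sketch only: signal a process via a pidfd obtained from /proc.
 * Assumes a 5.1+ kernel; the fallback syscall number (424) is the
 * x86-64 value and is an assumption for older header sets. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

int signal_via_pidfd(pid_t pid, int sig)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d", (int)pid);

    int pidfd = open(path, O_DIRECTORY | O_CLOEXEC);
    if (pidfd < 0)
        return -1;                /* the process may already be gone */

    /* Signals the process the pidfd refers to, or fails with ESRCH if
     * it has exited; it can never hit a recycled PID. */
    int ret = syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
    close(pidfd);
    return ret;
}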
Not so fast
In late March, Christian Brauner posted a patch set adding another new system call:
int pidfd_open(pid_t pid, unsigned int flags);
This system call will look up the given pid in the current namespace, then return a pidfd referring to it. This call was proposed to address cases where /proc is not mounted in a given namespace. For cases where /proc is available, though, the patch set also implements a new PIDFD_GET_PROCFD ioctl() call that takes a pidfd, opens the associated process's /proc directory, and returns a file descriptor referring to it. That descriptor, which functions as a pidfd as well, could then be used to read other information of interest out of the /proc directory.
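A minimal sketch of how the proposed call might be used from user space follows; since the patch had not been merged at this point, the syscall number below is a placeholder, not an established part of the ABI:

/* Sketch only: a wrapper for the proposed pidfd_open(); the syscall
 * number here is a placeholder, since no number had been assigned in
 * a released kernel when this was written. */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434       /* placeholder, not an official ABI */
#endif

static int pidfd_open(pid_t pid, unsigned int flags)
{
    /* Looks up pid in the caller's PID namespace and returns a pidfd,
     * or -1 with errno set (for example ESRCH if the process is gone). */
    return syscall(__NR_pidfd_open, pid, flags);
}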
Linus Torvalds had no fundamental problem with pidfd_open(), but he was rather less pleased with the ioctl() command. The core of his disagreement had to do with the creation of a second type of pidfd: one created by pidfd_open() would have different semantics than one created by opening the /proc directory or by calling ioctl(). In his view, either creation path should yield the same result on systems where /proc is mounted; there should be no need to convert between two types of pidfd.
Brauner was not immediately accepting of that idea. He worried that the equivalence would force a dependency on having /proc enabled (a concern that Torvalds dismissed), and that it could expose information in /proc that might otherwise be hidden from a pidfd_open() caller. Torvalds suggested tightening the security checks in that latter case. Even then, Andy Lutomirski worried, "/proc has too much baggage" to be made secure in this setting. It might be necessary, he said, to create a separate view of /proc that would be provided with pidfds.
clone()
As the conversation went on, though, it became increasingly clear that pidfd_open() was not the end goal. That call is still racy — a PID could be reused in the time between when a caller learns about it and when the pidfd_open() call actually completes. There are ways of mitigating this problem, but it does still exist. The only truly race-free way of getting a reference to a process, it was agreed, is to create that reference as part of the work of creating the process itself. That means it should be created as part of a clone() call.
That could be made possible by adding a new flag (called something like CLONE_PIDFD) to clone() that would return a pidfd to the parent rather than a PID. There were some worries that clone() has run out of space for new flags, necessitating a new system call, but Torvalds indicated that there is still at least one bit available. As a result of the discussion, it seems likely that a patch implementing the new clone() behavior will be posted in the near future.
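One plausible shape for that interface is sketched below; the flag value, the convention of returning the pidfd through the parent_tid argument, and the x86-64 raw-clone argument order are all assumptions drawn from the discussion rather than a released ABI:

/* Sketch only: a fork()-style raw clone() that asks for a pidfd.
 * The CLONE_PIDFD value, the use of the parent_tid slot to return
 * the pidfd, and the x86-64 argument order are assumptions. */
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000    /* assumed value, not a released ABI */
#endif

int main(void)
{
    int pidfd = -1;

    /* No new stack, so this behaves like fork(); the kernel would
     * write the new pidfd into &pidfd before returning in the parent. */
    long child = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, NULL,
                         &pidfd, NULL, NULL);
    if (child == 0)
        _exit(0);                 /* child: nothing to do */
    if (child > 0) {
        printf("child %ld, pidfd %d\n", child, pidfd);
        waitpid((pid_t)child, NULL, 0);
    }
    return 0;
}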
That, however, leaves open the question of pidfd_open() and how pidfds should work in general. At one point, Brauner suggested breaking the connection with /proc entirely: a pidfd could be used for sending signals (or, in the future, waiting for a process), but its creation would not be tied to a /proc directory in any way. That would involve disabling the functionality in 5.1, something that can still be done since it is not yet part of an official kernel release. The problem of opening the correct /proc directory (to read information about the process) could be addressed by adding a field to the fdinfo file for the pidfd; the information there could be used to verify that a given /proc directory refers to the same process as the pidfd.
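A sketch of that verification scheme might look like the following; the "Pid:" field name is hypothetical, since the article only says that some such field could be added to fdinfo:

/* Sketch only: check which PID a pidfd refers to by parsing its fdinfo
 * entry. The "Pid:" field name is hypothetical; the actual format had
 * not been settled at the time. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

pid_t pid_from_fdinfo(int pidfd)
{
    char path[64], line[128];
    int pid = -1;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        /* Look for the (hypothetical) "Pid:" line and parse its value. */
        if (strncmp(line, "Pid:", 4) == 0) {
            sscanf(line + 4, "%d", &pid);
            break;
        }
    }
    fclose(f);
    return (pid_t)pid;
}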
It eventually became clear, though, that Torvalds instead favored retaining the tie between a pidfd and the /proc directory; he called it "the most flexible option". So, one day later, Brauner came back with another plan: the connection with /proc would remain, but the pidfd_open() system call would be dropped since there would no longer be any real need for it. Should this plan be followed, which seems to be the most likely outcome, the existing 5.1 pidfd work could remain, since it is still a part of the final vision.
If things play out this way, the new clone() option will likely appear in 5.2 or 5.3. Process-management systems that are concerned about races will then be able to use pidfds for safe process signaling. If nothing else, this discussion shows the value of having many developers looking at proposed API additions. In a setting where mistakes are hard to correct once they get out into the world, one wants to get things right from the outset if at all possible.
A postscript
A contributing factor to the problem of PID reuse is the fact that the PID space is so small; for compatibility with ancient Unix systems (and the programs that ran on them), it's limited to what can be stored in a signed 16-bit value. That was a hard limit until the 2.6.10 release in 2004, when Ingo Molnar added a flexible limit capped at 4,194,304; the default limit remained (and remains) 32768, but it can be changed with the kernel/pid_max sysctl knob.
At the time, Molnar placed a comment reading "a maximum of 4 million PIDs should be enough for a while" that endures to this day. Fifteen years later, it's clear that he was right. But as part of this discussion, Torvalds said that perhaps the time has come to raise both the default and the limit. Setting the maximum PID to MAXINT would, he said, make a lot of the attacks harder. Whether such a change would break any existing software remains to be seen; it seems unlikely in 2019, but one never knows.
Posted Apr 4, 2019 22:33 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Apr 4, 2019 22:52 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Apr 4, 2019 22:57 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I'm not saying that such a design is a good idea, it's just that I've seen it used.
Posted Apr 5, 2019 0:24 UTC (Fri)
by Fowl (subscriber, #65667)
[Link] (27 responses)
Posted Apr 5, 2019 1:45 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
Posted Apr 5, 2019 16:40 UTC (Fri)
by smurf (subscriber, #17840)
[Link] (8 responses)
Posted Apr 7, 2019 1:09 UTC (Sun)
by stephen.pollei (subscriber, #125364)
[Link] (7 responses)
4901 processes ought to be enough for most people.
Larger systems might need to increase pid_max further, and have bigger rlimits.
In either case it seems like it can mostly be solved with saner configuration.
Posted Apr 8, 2019 1:21 UTC (Mon)
by dvdeug (subscriber, #10998)
[Link] (6 responses)
Arbitrary limits are a pain in the ass, and increasing the number of them and the odds you're going to hit them is not user-friendly.
Posted Apr 8, 2019 1:59 UTC (Mon)
by ebiederm (subscriber, #35028)
[Link] (5 responses)
There is the other issue with more pids that if they get too large they get ungainly and difficult
Posted Apr 8, 2019 5:54 UTC (Mon)
by eru (subscriber, #2753)
[Link] (4 responses)
Posted Apr 8, 2019 7:27 UTC (Mon)
by rbanffy (guest, #103898)
[Link] (3 responses)
Posted Apr 10, 2019 5:15 UTC (Wed)
by eru (subscriber, #2753)
[Link] (2 responses)
Posted Apr 10, 2019 18:07 UTC (Wed)
by rbanffy (guest, #103898)
[Link] (1 responses)
Posted Apr 12, 2019 6:19 UTC (Fri)
by massimiliano (subscriber, #3048)
[Link]
You have to let your imagination fly higher, eru. 290,000 years is a blink of an eye in cosmic terms...
If I let my imagination fly just a bit higher, in such a system this issue will be solved just like the current 2038 problem.
At some point the system will do a live migration to a 128 bit architecture, with conversion of the persistent state to appropriately sized values, and the "actor IDs" in the distributed systems will get a bit of fresh air with a wraparound time of 2^32*290k years, whatever that means...
Posted Apr 5, 2019 4:24 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (10 responses)
But wait! If you want to prevent the PID from being reused, and you're the one calling clone() in the first place, all you have to do is *not* call wait(), and then the PID will zombify in an entirely race-free fashion. That doesn't require the use of a pidfd at all. So the only "interesting" functionality we would be adding via a pidfd-that-holds-a-PID-open is the ability to transfer stewardship of a possibly-zombie process to a new "owner" (via SCM_RIGHTS over an AF_UNIX socket, or via fd inheritance), but without changing its PPID. That is such a niche usage that I'm not certain it's actually helpful, especially since you can create a subreaper via prctl(), and thereby actually reparent the process as needed.
(TL;DR: Because then it wouldn't be complicated and/or "interesting" enough to be worth doing.)
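For reference, the subreaper mechanism mentioned above is a one-line prctl() call; a minimal sketch:

/* Sketch only: mark the calling process as a child subreaper, so that
 * orphaned descendants are reparented to it instead of to init and can
 * then be reaped with wait(). */
#include <sys/prctl.h>

int become_subreaper(void)
{
    return prctl(PR_SET_CHILD_SUBREAPER, 1L, 0L, 0L, 0L);
}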
Posted Apr 5, 2019 12:56 UTC (Fri)
by jgg (subscriber, #55211)
[Link] (2 responses)
Posted Apr 5, 2019 18:32 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
Posted Apr 19, 2019 18:12 UTC (Fri)
by mixi (guest, #131542)
[Link]
It could exit in the grandchild, which would cause the grandchild to be reaped by init, but won't have any effect on the child's PID.
Posted Apr 5, 2019 16:27 UTC (Fri)
by thiago (guest, #85680)
[Link] (2 responses)
The problem we faced then was exactly the problem of reparenting when the file descriptor is passed onwards via AF_UNIX. Especially when that is coupled with ptrace(), which appears to do a fake-reparenting to the tracer process. No one could explain to us back then what the issue was, so we couldn't fix it, and the matter was dropped.
Here's to hoping this feature is useful for userspace.
Posted Apr 7, 2019 10:40 UTC (Sun)
by meuh (guest, #22042)
[Link] (1 responses)
Posted Apr 7, 2019 18:19 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 5, 2019 20:24 UTC (Fri)
by jkowalski (guest, #131304)
[Link] (3 responses)
You don't do anything like this really, you can actually set the process cloned using CLONE_PIDFD to deliver no signal to the parent on termination (by setting it to null), which means everyone having a copy of the same file descriptor can poll on it to know when it dies. That also means the process cannot be waited upon, which is what you want when you use this API from inside libraries.
It happens as a side effect of using descriptors: since the parent will get a readable instance (though it is still not clear to me whether this is something the patch author will support; here's to hoping it is), it can pass a copy of its descriptor to others, then perhaps close it, and allow the other process to essentially poll on it, know when the process is gone, and get back the exit status.
The same could be done using references to external processes using pidfd_open, you pass a flag that requests the kernel to give you a readable instance, and if you're a real_parent or parent (as in ptrace terms), you get one. It also has the nice property of the mount namespace not being the one where you acquire pidfds, but limited to the scope of your PID namespace (which I think is a very important point that has been overlooked thus far).
Scoping the opening and adding a system call with extendable flags would allow you to lift checks in pidfd_send_signal to signal across namespaces, exactly because without userspace doing it on its own, a process can only open a pidfd to something it can address inside its PID namespace. It is otherwise a layering violation (literally) that I can use the mount namespace to circumvent this, if it is opened up in the future. I also object to being able to peek into process state through such descriptors; that capability should be orthogonal, not bound to the pidfd, even if I have the authority to read through. *Therefore, using /proc dir fds comes with a big downside to all of this.*
The nice delegation model allows you to extend pidfd_open with, say, PRIV_KILL that allows you to bind CAP_KILL privs to a pidfd, assuming you have CAP_KILL in the owning userns, which would allow you to pass this pidfd and let the receiver signal across namespace boundaries without restrictions (it has to be opt-in as this is not what you want by default).
You could add a similar flag to bind ptrace privs of the opener, though that is a lot more involved and I have not mentioned it anywhere thus far.
Thus, you can think of the pidfd as a stable reference to the process, and such flags depending on the authority of the opener (if parent, readable, if CAP_KILL, killable, if CAP_SYS_PTRACE, ptraceable, etc) allow you to open up methods to operate on it, and since they are bound to the descriptor, it is limited in scope to the said process only. Such intent cannot be expressed when using /proc. It also does not play well with hidepid=2 (invisible dirs mean you cannot take a reference), and hidepid=1 (dirs you cannot enter mean you cannot reference threads you can see).
The whole reparent on fd-passing however is broken. There can be multiple processes keeping it open at a time.
Posted Apr 7, 2019 10:42 UTC (Sun)
by meuh (guest, #22042)
[Link]
Posted Apr 7, 2019 16:12 UTC (Sun)
by luto (subscriber, #39314)
[Link] (1 responses)
Posted Apr 7, 2019 19:31 UTC (Sun)
by jkowalski (guest, #131304)
[Link]
You could also make it available to things with NNP set, and when cloning children bind the PRIV_KILL, then pass it around and send signals. All these checks happen when the flag is used during pidfd_open or clonefd or whatever.
Do you see other cases where it could be a problem?
Posted Apr 5, 2019 21:49 UTC (Fri)
by roc (subscriber, #30627)
[Link] (5 responses)
When a traced process is killed by signal or `exit_group`, or does an `execve`, its threads exit in some unknown order. rr wants to clean them all up at once by `wait`ing for those specific threads but NOT receiving notifications for other threads it may be tracing. We currently have no way to do that. AIUI with pidfds we would be able to use, say, `poll` to wait for the exit notifications of a specific set of threads.
You might think we could handle `SIGCHLD` to know when notifications are pending for a traced thread, and use `waitpid` to grab the results for each traced thread after it exits, but that isn't reliable because multiple pending `SIGCHLD`s can be coalesced. Even with `signalfd` :-(.
This is a problem for libraries in general. If you have a library that wants to manage some child processes without interfering with other code (including on other threads) managing other child processes, that's really hard without pidfds.
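Assuming pidfds do become pollable and report readability when the process exits (which was the direction of the discussion, not yet a guarantee), a library could wait for a specific set of processes with something like this sketch:

/* Sketch only: wait for the first of a set of processes, each
 * represented by a pidfd, to exit. Assumes pidfds become pollable
 * and report POLLIN when the process exits, as discussed above. */
#include <poll.h>

/* Returns the index of a pidfd whose process has exited, or -1. */
int wait_for_exit(const int *pidfds, int n)
{
    struct pollfd fds[64];

    if (n > 64)
        n = 64;
    for (int i = 0; i < n; i++) {
        fds[i].fd = pidfds[i];
        fds[i].events = POLLIN;   /* assumed: readable on exit */
    }
    if (poll(fds, (nfds_t)n, -1) < 0)
        return -1;
    for (int i = 0; i < n; i++)
        if (fds[i].revents & POLLIN)
            return i;
    return -1;
}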
Posted Apr 8, 2019 19:59 UTC (Mon)
by nix (subscriber, #2304)
[Link] (4 responses)
I revived and unbroke them as part of DTrace for Linux (see e.g. https://oss.oracle.com/git/gitweb.cgi?p=dtrace-linux-kern...) because this argument only applies to process-directed waitpid() results: ptrace() is thread-directed, so if you kick off a new thread to handle the requests from other threads it cannot see any of the waitpid() results, and the thread that can see the waitpid() results isn't able to poll on the requests because it's too busy doing a waitpid() :( you'd have to busywait, or poll with pauses, and thus add huge latencies to signal handling in all your traced children: no thanks.
Posted Apr 8, 2019 20:41 UTC (Mon)
by jkowalski (guest, #131304)
[Link] (2 responses)
That's not very problematic (from an API perspective it is very weird, but not internally), but my idea was pidfd_open could take some wait flags as waitfd did (and check if you're one of parent or real_parent, and only return a readable descriptor), among other things, and that intent cannot be described easily when opening a directory in /proc. This means everyone can poll but perhaps only parents (real or tracing) would be able to open a readable instance.
I can only hope this is taken into consideration (I raised this exact problem as I ran into it as well), and /proc descriptor stuff be removed, as it brings more problems for future extensions.
You could of course resurrect a waitfd too, again, but I see no point when you could return a pollable/readable descriptor from clone and pidfd_open (and possibly even disable termination signals and autoreap them - the whole act of waiting is asynchronous). It also has a nice touch of consistency to it (and the fact that resources in a different namespace suddenly aren't addressable and can't have a reference taken to them from a different namespace - the filesystem); in that sense pidfd_open isn't very different from open, but it works like open only on PIDs you can *see* (unlike /proc, which is leaky across shared mount namespaces but separate PID namespaces).
Posted Apr 9, 2019 3:54 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 9, 2019 15:12 UTC (Tue)
by nix (subscriber, #2304)
[Link]
(If pidfds also wake you up when non-termination waitpid() results would be returned iff you did a waitpid(.., WNOHANG), then they do seem like a complete replacement for waitfd, which is great because it means I can drop another annoying invasive patch!)
btw, this does mean that you'd need to be able to get a pidfd not only at clone time but also at PTRACE_SEIZE time, or tracers could never get hold of a useful pidfd at all...
Posted Apr 9, 2019 3:52 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 5, 2019 11:33 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (6 responses)
Lennart
Posted Apr 5, 2019 13:03 UTC (Fri)
by corbet (editor, #1)
[Link] (4 responses)
Posted Apr 5, 2019 13:44 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (3 responses)
Posted Apr 11, 2019 14:35 UTC (Thu)
by kmweber (guest, #114635)
[Link] (2 responses)
So on an unrelated topic, but similar issue...years ago, the Madden series of American football games had a hard limit of 255 points that could be scored in a game and 1023 rushing yards that could be gained (I assume other stats were similarly capped, but these were the ones I noticed) (these limits may or may not still be there, it's been years since I played). For those of you who aren't familiar with American football, those numbers are ridiculously unachievable in any real game, but when playing the video game on easy settings it's quite plausible to reach both of those limits by halftime (particularly with skillful clock management--in short, American football rules specify that whether or not the game clock keeps running between plays is generally contingent upon the outcome of the just-completed play, with some complications for timeouts, penalties, and injuries, and when the end of the half or game is near--manipulating these rules with play-calling is a crucial part of late-game strategy). The 255-point limit was annoying but straightforward enough to understand, but for the life of me I couldn't understand why accumulated rushing yards was limited at 1023. This was in the mid-to-late 2000s; it wasn't *that* long ago, and memory certainly wasn't scarce enough that it was worth the extra trouble of using bitfields or implementing an explicit limit in code that was less than the max value of the data type used.
Now that I think of it, maybe this is where those extra ten bits from the PID cap went :)
Posted May 6, 2019 18:25 UTC (Mon)
by mgedmin (subscriber, #34497)
[Link] (1 responses)
16-bit fixed point numbers maybe, with 6 bits reserved for the fractional value?
Posted Sep 5, 2019 21:43 UTC (Thu)
by kmweber (guest, #114635)
[Link]
Essentially, the number of yards you've gained is equal to the number of yard lines you've crossed. So if you start from barely past the one yard line and get to just short of the four yard line, you've only officially gained two yards for statistics purposes even though you've actually gained very nearly three. And on the flip side, if you start from just short of the two yard line and end just past the three yard line, you're credited with a two-yard gain even though you've really covered barely more than one.
Posted Apr 5, 2019 13:15 UTC (Fri)
by Villemoes (subscriber, #91911)
[Link]
Though uapi/linux/futex.h seems to imply that the actual limit is 30 bits.
Posted Apr 5, 2019 13:05 UTC (Fri)
by mjthayer (guest, #39183)
[Link] (23 responses)
Posted Apr 8, 2019 1:06 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (22 responses)
I agree that having two-factor authentication would be a good approach, but I'm not sure that a timestamp is best.
Posted Apr 8, 2019 17:58 UTC (Mon)
by perennialmind (guest, #45817)
[Link]
Posted Apr 8, 2019 18:17 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (20 responses)
Process management has been the most fucked-up part of Unix since forever. Adding FD-based interface will go a long way towards unfucking it. Adding more ad-hoc kludges won't.
Posted Apr 8, 2019 19:55 UTC (Mon)
by roc (subscriber, #30627)
[Link] (11 responses)
Though there are some other nasty areas of kernel API that might be worse than process management. Signals themselves, for example. https://ldpreload.com/blog/signalfd-is-useless
Posted Apr 8, 2019 20:04 UTC (Mon)
by nix (subscriber, #2304)
[Link]
Posted Apr 8, 2019 22:04 UTC (Mon)
by rweikusat2 (subscriber, #117920)
[Link] (8 responses)
There's only one case where the behaviour of POSIX calls is undefined wrt signals: if a signal handler which interrupted an unsafe function calls an unsafe function. The I/O-multiplexing calls are async-signal-safe; hence, the usually sensible way to marry signal handling and an I/O multiplexing loop is to block all handled signals, install signal handlers for them, and then use pselect/ppoll/epoll_pwait to unblock handled signals only while the code is waiting for I/O. Signal handlers are then free to call whatever function they want, as the sole interruptible function will be async-signal-safe. And the only function which can terminate with an EINTR condition will be the multiplexing call. Problem solved[*].
[*] Signal masks are inherited; hence, code creating new processes has to deal with that. Just like it has to deal with all other inheritable things. BFD.
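A minimal sketch of that pattern, with SIGCHLD standing in for the handled signals, might look like this:

/* Sketch only: keep SIGCHLD blocked everywhere except inside ppoll(),
 * so a handler can only ever interrupt the multiplexing call itself. */
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/wait.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int sig) { (void)sig; got_sigchld = 1; }

int event_loop(struct pollfd *fds, nfds_t nfds)
{
    sigset_t block, during_wait;
    struct sigaction sa;

    /* Block SIGCHLD for normal execution, saving the previous mask... */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &during_wait);
    /* ...and build the mask ppoll() installs while sleeping, which has
     * SIGCHLD unblocked so the handler can run only there. */
    sigdelset(&during_wait, SIGCHLD);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigaction(SIGCHLD, &sa, NULL);

    for (;;) {
        int ready = ppoll(fds, nfds, NULL, &during_wait);
        if (got_sigchld) {
            got_sigchld = 0;
            while (waitpid(-1, NULL, WNOHANG) > 0)
                ;                 /* reap any exited children */
        }
        if (ready >= 0)
            return ready;         /* descriptors are ready to handle */
        if (errno != EINTR)
            return -1;            /* real error */
    }
}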
Posted Apr 8, 2019 22:08 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Signals are basically impossible to use correctly.
Posted Apr 9, 2019 14:18 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link]
With Linux, one can even build programs which are entirely signal-driven by using realtime signals for I/O readiness notification (see fcntl(2)) and sigwaitinfo instead of one of the file descriptor based I/O multiplexing calls. I've used that once for a moderately complex Perl program (about 25 kloc) because it was the easiest to use in this environment.
Posted Apr 9, 2019 15:04 UTC (Tue)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Apr 9, 2019 17:42 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link] (2 responses)
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04
The three usual approaches would be
IMHO, thread cancellation is a braindead misfeature but that's a completely different discussion.
Posted Apr 11, 2019 7:04 UTC (Thu)
by wahern (subscriber, #37304)
[Link]
Once all the boilerplate code is in place it seems like a lot of unnecessary complexity compared to having started with something like a signalfd or eventfd. OTOH, for some things, such as catching SIGSEGV to extend mmap'd stack data structures, or to interrupt and switch program flow without tight, language-level integration of components, the old semantics are indispensable.
Threading semantics compound the headaches. But as for cancellations in particular, I don't think anybody defends the idea any longer. It's almost a strawman at this point, as few if any applications actually use them and implementations effectively disclaim liability. I'm surprised there's no movement to remove cancellations from implementations, POSIX, or both.
Posted Apr 13, 2019 14:59 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Apr 8, 2019 22:23 UTC (Mon)
by roc (subscriber, #30627)
[Link]
Sometimes it's a problem to delay all signal handling until you return to the event loop; sometimes you really want signal handlers to be able to run during some long synchronous operation.
It doesn't solve the signal coalescing problem at all, making the data in siginfo impossible to use reliably in many situations.
It doesn't solve the problem that there are a finite and rather small set of signals available and you basically have to allocate them statically with no protocol for avoiding or handling conflicting uses of a signal.
Posted Apr 8, 2019 22:39 UTC (Mon)
by roc (subscriber, #30627)
[Link]
Posted Apr 20, 2019 7:16 UTC (Sat)
by njs (guest, #40338)
[Link]
For process management, all the terrible problems that make the APIs impossible to use safely are totally self-inflicted. And probably the worst of those is the choice to use signals!
If we're kvetching about kernel API misdesigns, "non-blocking read from stdin" should also be on the list, probably just below SIGCHLD. The problem is: how do you do a non-blocking read from stdin, like you might want to in an async system like node? You might think "well, just use fcntl on fd 0 to set O_NONBLOCK", but since the O_NONBLOCK flag is stored on the file descript*ion*, this also affects any other processes that might have copies of that fd. Obviously O_NONBLOCK should have been a file-descriptor flag, like O_CLOEXEC, but file-descriptor flags didn't exist when O_NONBLOCK was created, so that's not how it works. Therefore, you can't safely set O_NONBLOCK on stdin without possibly breaking other random programs. djb has some cogent commentary: https://cr.yp.to/unix/nonblock.html
There are some obscure hacks for specific cases: https://github.com/python-trio/trio/issues/174#issuecomme...
But fundamentally this is an obvious, common problem that simply can't be solved on popular Unixes.
(Probably the obvious solution for Linux at this point would be to add a RWF_NONBLOCK flag to preadv2/pwritev2, as per djb's suggestion.)
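For illustration, here is the problematic operation itself; the hazard is that O_NONBLOCK is a property of the shared open file description rather than of the descriptor:

/* Illustration of the hazard described above: O_NONBLOCK lives on the
 * open file description, so flipping it on fd 0 also changes blocking
 * behaviour for every other process sharing that description. */
#include <fcntl.h>
#include <unistd.h>

int set_stdin_nonblocking(void)
{
    int flags = fcntl(STDIN_FILENO, F_GETFL);

    if (flags < 0)
        return -1;
    /* Unlike FD_CLOEXEC, this is a per-description flag, not a
     * per-descriptor one; the shell (or any other holder of this
     * description) will now see non-blocking reads too. */
    return fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);
}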
Posted Apr 8, 2019 21:51 UTC (Mon)
by rweikusat2 (subscriber, #117920)
[Link] (7 responses)
Posted Apr 8, 2019 22:03 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
For example, a general-purpose library can't even do something as simple as create a process and wait for it to finish. This is just ridiculous.
Posted Apr 9, 2019 14:21 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link] (5 responses)
"A general purpose library", IOW, some random, binary only code with undefined behaviour (in the sense that no specific behaviour is ever defined when this 'argument' shows up) is a situation which cannot be handled.
Posted Apr 9, 2019 15:08 UTC (Tue)
by nix (subscriber, #2304)
[Link] (4 responses)
Posted Apr 9, 2019 16:39 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (3 responses)
I have zero qualms about breaking programs that are so obviously broken themselves. You don't free resources you didn't create: that's a rule for any system, not just Unix. Just no. Let's not enable programs that break this rule.
Posted Apr 9, 2019 19:50 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Apr 13, 2019 15:01 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Apr 13, 2019 15:58 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 5, 2019 14:54 UTC (Fri)
by amarao (subscriber, #87073)
[Link]
The proper way would be to assure userspace that pids (process uuids, process atoms, you name it) are never reused for the lifetime of the system. If you have an ID, it's guaranteed to be the ID.
The whole 'pid' thing sounds more like hotel room numbers than guest identities.
Posted Apr 5, 2019 14:58 UTC (Fri)
by walters (subscriber, #7396)
[Link]
However there's an interesting thing here...you still really want a "human readable" and persistent identifier. With regular files, fds are obviously retrieved via...file names.
But this proposal is adding no such equivalent - /proc/<pid> doesn't solve the problem because it uses integers.
Now one of the core innovations of systemd was using cgroups as a way to just *group* processes. That model has further extended into modern container systems - e.g. in the model popularized by Docker, a "container" is a grouping of an image with cgroups, namespaces etc. Put another way...no one using e.g. Kubernetes really ends up thinking in terms of pids - you `kubectl delete pod/X`.
Back to cgroups...I think we can push admins and userspace more to thinking in terms of the cgroups. What if our identifiers for processes were more like (cgroup ID, u64)? And we had a system call that took that pair? It'd greatly ameliorate issues with pid reuse - at least you'd never end up killing a process in another cgroup.
We'd have /usr/bin/kill httpd.service/42 or so.
Posted Apr 6, 2019 20:05 UTC (Sat)
by zlynx (guest, #2285)
[Link]
There's no purpose in using "short" to store PIDs when we have "pid_t" and using "char[6]" as string storage for a PID has always been a pretty bad idea. The print formatting that assumes six columns is ugly but not dangerous, and apparently people just don't care, as "vmstat" has had broken column widths for at least ten years (block IO per second and memory sizes are much larger than vmstat originally designed for).
Posted Apr 7, 2019 13:24 UTC (Sun)
by geuder (subscriber, #62854)
[Link]
Somewhat reluctantly, I changed pid_max to 999999 some months ago. With the 15-bit limit in use for decades, I expected something to break. But luckily I have not observed any breakage (although I'm ready to believe those who say they have seen broken software). Only (command-line) user friendliness has suffered a bit: most of the PIDs are now six digits to remember. But hey, that's what I wanted...
It's indeed strange that the 15-bit limit has lived for so long. On the other hand, in desktop usage, and probably also in many server cases, the process-creation rate has not increased as drastically as many other performance parameters have during the last 20 years.
It can't hurt if the kernel hackers fix this issue in any case. When pid namespaces and mount namespaces get involved, we can just hope it doesn't get too complicated to be usable in real life...
Posted Apr 11, 2019 14:30 UTC (Thu)
by flussence (guest, #85566)
[Link] (5 responses)
You might think I'm not serious (and you'd be right), but I can imagine a lot of flamewars wouldn't have happened in a world with this as standard.
Posted Apr 12, 2019 3:23 UTC (Fri)
by faramir (subscriber, #2327)
[Link] (1 responses)
Posted Apr 13, 2019 10:43 UTC (Sat)
by farnz (subscriber, #17727)
[Link]
You could use identifier-locator addressing to handle that - the upper 64 bits can be the SIR, and the lower 64 bits can be the identifier. Then, the existing ILA mechanisms will convert SIR to locator whenever you need to talk to a process.
Posted Apr 25, 2019 22:36 UTC (Thu)
by fest3er (guest, #60379)
[Link] (2 responses)
Posted May 2, 2019 21:18 UTC (Thu)
by flussence (guest, #85566)
[Link]
A scheme that encodes the CPU ID and TSC might be pretty efficient and would work, if we were to abandon the guarantee that IDs have any correlation to time. But, this being Linux, someone out there almost certainly depends on that implementation detail.
Posted May 9, 2019 18:38 UTC (Thu)
by mcortese (guest, #52099)
[Link]
70% of people would be ok with max login of 3
95% of people would be ok with max login of 8
99% of people would be ok with max login of 13
70% of people should be ok with a max of 89 processes per login
95% of people should be ok with max of 144 processes per login
99% of people should be ok with max of 377 processes per login
If you raise pid_max to 99999 on a small system and set these limits, does it strongly reduce the issues?
not properly running programs that consume a few more resources than normal.
to use. Which argues against making 4 million the default. But otherwise something like 4 million
would probably be a fine default for a limit like that.
If the pid limit is 4 million, problems due to wraparound are rare, but they may occasionally happen, causing hard to trace bugs. Same with MAXINT. But if pid were a 64-bit number, and the limit the maximum of that, wraparound would never happen, so software could safely assume that pids are always unique.
> But if pid were a 64-bit number, and the limit the maximum of that, wraparound would never happen
Cue to a meeting room with a dozen people dressed like characters from Things to Come trying to figure out why The Google stopped answering their questions.
Fine. It'll be a looooong time.
The problem we faced then was exactly the problem of reparenting when the file descriptor is passed onwards via AF_UNIX.
I think a file descriptor sent through SCM_RIGHTS should not imply reparenting the process to the reader, just as file descriptors inherited through fork() don't imply reparenting the process to the newly created one. The semantics of those operations is "copy" ... but "copying" a parent relationship doesn't make sense to me: in Unix a process can only have a single parent.
The whole reparent on fd-passing however is broken. There can be multiple processes keeping it open at a time.
I agree: through fork(), exec(), and SCM_RIGHTS, a file descriptor can be duplicated into many processes.
That's the four million mentioned in the text; it is indeed explicitly capped at that size.
/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
	(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

/*
 * The rest of the robust-futex field is for the TID:
 */
#define FUTEX_TID_MASK		0x3fffffff
We could add a "pid generation" counter which was incremented whenever the next-pid number restarted from the bottom. Then you need some way to get the pid generation for a given process, and some way to include it in the signal sent.
Two new syscalls would do it.
So add a /proc/$pid/pid-generation symlink or file and some kind of alias to the current /proc/$pid, like /proc/pid-gen/$pid-$gen? PID remains a unique candidate key, subject to a version check. Sounds sane to me. I don't have any particular need for 4 million processes running at once.
Nope. And you haven't even gotten to the interesting stuff - multithreading and realtime signals.
Signals don't cause "horrible bugs when innocent functions like malloc are being called". Broken code for handling signals might. As I already wrote: a signal handler must not call a function which isn't async-signal-safe if it interrupted another unsafe function. That's easy enough to guarantee (see other postings).
In the presence of signals, all functions defined by this volume of POSIX.1-2017 shall behave as defined when called from or interrupted by a signal-catching function, with the exception that when a signal interrupts an unsafe function or equivalent (such as the processing equivalent to exit() performed after a return from the initial call to main()) and the signal-catching function calls an unsafe function, the behavior is undefined.
That's part of section 2.4.3 of the following page:
A UNIX signal is nothing but a level-triggered interrupt emulated in software. A lot of hardware (eg, PCI hardware) has used this model for event notification for an eternity and it's perfectly workable.
That's easy enough to guarantee
It's so easy to guarantee that when I looked, every single program that used nontrivial signal handlers (that did more than assigning to a variable or writing to a one-byte pipe) got it wrong, including glibc itself. The latter case is since fixed, but this is clearly not easy to guarantee, given that most uses are faulty and even the implementation gets it wrong. I suspect lack of experience on your part masquerading as arrogance, frankly.
Or really *really* obscure hacks: https://gist.github.com/njsmith/235d0355f0e3d647beb858765...