Rethinking race-free process signaling
The fundamental problem being addressed by the pidfd concept is process-ID reuse. Most Linux systems have the maximum PID set to 32768; if lots of processes (and threads) are created, it doesn't take a long time to use all of the available PIDs, at which point the kernel will cycle back to the beginning and start reusing the ones that have since become free. That reuse can happen quickly, and processes that work with PIDs might not notice immediately that a PID they hold referred to a process that has exited. In such conditions, a stale PID could be used to send a signal to the wrong process. As Jann Horn pointed out, real vulnerabilities have resulted from this problem.
A pidfd is a file descriptor that is obtained by opening a process's directory in the /proc virtual filesystem; it functions as a reference to the process of interest. If that process exits, its PID might be reused by the kernel, but any pidfds referring to that process will continue to refer to it. Passing a pidfd to pidfd_send_signal() will either signal the correct process (if it still exists), or return an error if the process has exited; it is guaranteed not to signal the wrong process. So it would seem that this problem has been solved.
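As a rough illustration of how that works in 5.1, the following sketch (not taken from any of the patches) opens a process's /proc directory and passes the resulting descriptor to pidfd_send_signal(); the fallback syscall number is an assumption for systems whose headers do not yet define it:

/* Sketch only: signal a process via a pidfd obtained from /proc.
 * Assumes a 5.1+ kernel; the fallback syscall number (424) is the
 * x86-64 value and is an assumption for older header sets. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

int signal_via_pidfd(pid_t pid, int sig)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d", (int)pid);

    int pidfd = open(path, O_DIRECTORY | O_CLOEXEC);
    if (pidfd < 0)
        return -1;                /* the process may already be gone */

    /* Signals the process the pidfd refers to, or fails with ESRCH if
     * it has exited; it can never hit a recycled PID. */
    int ret = syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
    close(pidfd);
    return ret;
}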
Not so fast
In late March, Christian Brauner posted a patch set adding another new system call:
int pidfd_open(pid_t pid, unsigned int flags);
This system call will look up the given pid in the current namespace, then return a pidfd referring to it. This call was proposed to address cases where /proc is not mounted in a given namespace. For cases where /proc is available, though, the patch set also implements a new PIDFD_GET_PROCFD ioctl() call that takes a pidfd, opens the associated process's /proc directory, and returns a file descriptor referring to it. That descriptor, which functions as a pidfd as well, could then be used to read other information of interest out of the /proc directory.
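A minimal sketch of how the proposed call might be used from user space follows; since the patch had not been merged at this point, the syscall number below is a placeholder, not an established part of the ABI:

/* Sketch only: a wrapper for the proposed pidfd_open(); the syscall
 * number here is a placeholder, since no number had been assigned in
 * a released kernel when this was written. */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434       /* placeholder, not an official ABI */
#endif

static int pidfd_open(pid_t pid, unsigned int flags)
{
    /* Looks up pid in the caller's PID namespace and returns a pidfd,
     * or -1 with errno set (for example ESRCH if the process is gone). */
    return syscall(__NR_pidfd_open, pid, flags);
}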
Linus Torvalds had no fundamental problem with pidfd_open(), but he was rather less pleased with the ioctl() command. The core of his disagreement had to do with the creation of a second type of pidfd: one created by pidfd_open() would have different semantics than one created by opening the /proc directory or by calling ioctl(). In his view, either creation path should yield the same result on systems where /proc is mounted; there should be no need to convert between two types of pidfd.
Brauner was not immediately accepting of that idea. He worried that the equivalence would force a dependency on having /proc enabled (a concern that Torvalds dismissed), and that it could expose information in /proc that might otherwise be hidden from a pidfd_open() caller. Torvalds suggested tightening the security checks in that latter case. Even then, Andy Lutomirski worried, "/proc has too much baggage" to be made secure in this setting. It might be necessary, he said, to create a separate view of /proc that would be provided with pidfds.
clone()
As the conversation went on, though, it became increasingly clear that pidfd_open() was not the end goal. That call is still racy — a PID could be reused in the time between when a caller learns about it and when the pidfd_open() call actually completes. There are ways of mitigating this problem, but it does still exist. The only truly race-free way of getting a reference to a process, it was agreed, is to create that reference as part of the work of creating the process itself. That means it should be created as part of a clone() call.
That could be made possible by adding a new flag (called something like CLONE_PIDFD) to clone() that would return a pidfd to the parent rather than a PID. There were some worries that clone() has run out of space for new flags, necessitating a new system call, but Torvalds indicated that there is still at least one bit available. As a result of the discussion, it seems likely that a patch implementing the new clone() behavior will be posted in the near future.
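One plausible shape for that interface is sketched below; the flag value, the convention of returning the pidfd through the parent_tid argument, and the x86-64 raw-clone argument order are all assumptions drawn from the discussion rather than a released ABI:

/* Sketch only: a fork()-style raw clone() that asks for a pidfd.
 * The CLONE_PIDFD value, the use of the parent_tid slot to return
 * the pidfd, and the x86-64 argument order are assumptions. */
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000    /* assumed value, not a released ABI */
#endif

int main(void)
{
    int pidfd = -1;

    /* No new stack, so this behaves like fork(); the kernel would
     * write the new pidfd into &pidfd before returning in the parent. */
    long child = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, NULL,
                         &pidfd, NULL, NULL);
    if (child == 0)
        _exit(0);                 /* child: nothing to do */
    if (child > 0) {
        printf("child %ld, pidfd %d\n", child, pidfd);
        waitpid((pid_t)child, NULL, 0);
    }
    return 0;
}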
That, however, leaves open the question of pidfd_open() and how pidfds should work in general. At one point, Brauner suggested breaking the connection with /proc entirely: a pidfd could be used for sending signals (or, in the future, waiting for a process), but its creation would not be tied to a /proc directory in any way. That would involve disabling the functionality in 5.1, something that can still be done since it is not yet part of an official kernel release. The problem of opening the correct /proc directory (to read information about the process) could be addressed by adding a field to the fdinfo file for the pidfd; the information there could be used to verify that a given /proc directory refers to the same process as the pidfd.
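A sketch of that verification scheme might look like the following; the "Pid:" field name is hypothetical, since the article only says that some such field could be added to fdinfo:

/* Sketch only: check which PID a pidfd refers to by parsing its fdinfo
 * entry. The "Pid:" field name is hypothetical; the actual format had
 * not been settled at the time. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

pid_t pid_from_fdinfo(int pidfd)
{
    char path[64], line[128];
    int pid = -1;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        /* Look for the (hypothetical) "Pid:" line and parse its value. */
        if (strncmp(line, "Pid:", 4) == 0) {
            sscanf(line + 4, "%d", &pid);
            break;
        }
    }
    fclose(f);
    return (pid_t)pid;
}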
It eventually became clear, though, that Torvalds instead favored retaining the tie between a pidfd and the /proc directory; he called it "the most flexible option". So, one day later, Brauner came back with another plan: the connection with /proc would remain, but the pidfd_open() system call would be dropped since there would no longer be any real need for it. Should this plan be followed, which seems to be the most likely outcome, the existing 5.1 pidfd work could remain, since it is still a part of the final vision.
If things play out this way, the new clone() option will likely appear in 5.2 or 5.3. Process-management systems that are concerned about races will then be able to use pidfds for safe process signaling. If nothing else, this discussion shows the value of having many developers looking at proposed API additions. In a setting where mistakes are hard to correct once they get out into the world, one wants to get things right from the outset if at all possible.
A postscript
A contributing factor to the problem of PID reuse is the fact that the PID space is so small; for compatibility with ancient Unix systems (and the programs that ran on them), it's limited to what can be stored in a signed 16-bit value. That was a hard limit until the 2.6.10 release in 2004, when Ingo Molnar added a flexible limit capped at 4,194,304; the default limit remained (and remains) 32768, but it can be changed with the kernel/pid_max sysctl knob.
At the time, Molnar placed a comment reading "a maximum of 4 million PIDs should be enough for a while" that endures to this day. Fifteen years later, it's clear that he was right. But as part of this discussion, Torvalds said that perhaps the time has come to raise both the default and the limit. Setting the maximum PID to MAXINT would, he said, make a lot of the attacks harder. Whether such a change would break any existing software remains to be seen; it seems unlikely in 2019, but one never knows.
Posted Apr 4, 2019 22:33 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Apr 4, 2019 22:52 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Apr 4, 2019 22:57 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I'm not saying that such a design is a good idea, it's just that I've seen it used.
Posted Apr 5, 2019 0:24 UTC (Fri)
by Fowl (subscriber, #65667)
[Link] (27 responses)
Posted Apr 5, 2019 1:45 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
Posted Apr 5, 2019 16:40 UTC (Fri)
by smurf (subscriber, #17840)
[Link] (8 responses)
Posted Apr 7, 2019 1:09 UTC (Sun)
by stephen.pollei (subscriber, #125364)
[Link] (7 responses)
4901 processes ought to be enough for most people.
Larger systems might need to increase pid_max further, and have bigger rlimits.
In either case it seems like it can mostly be solved with saner configuration.
Posted Apr 8, 2019 1:21 UTC (Mon)
by dvdeug (subscriber, #10998)
[Link] (6 responses)
Arbitrary limits are a pain in the ass, and increasing the number of them and the odds you're going to hit them is not user-friendly.
Posted Apr 8, 2019 1:59 UTC (Mon)
by ebiederm (subscriber, #35028)
[Link] (5 responses)
There is the other issue with more pids that if they get too large they get ungainly and difficult
Posted Apr 8, 2019 5:54 UTC (Mon)
by eru (subscriber, #2753)
[Link] (4 responses)
Posted Apr 8, 2019 7:27 UTC (Mon)
by rbanffy (guest, #103898)
[Link] (3 responses)
Posted Apr 10, 2019 5:15 UTC (Wed)
by eru (subscriber, #2753)
[Link] (2 responses)
Posted Apr 10, 2019 18:07 UTC (Wed)
by rbanffy (guest, #103898)
[Link] (1 responses)
Posted Apr 12, 2019 6:19 UTC (Fri)
by massimiliano (subscriber, #3048)
[Link]
You have to let your imagination fly higher, eru. 290,000 years is a blink of an eye in cosmic terms...
If I let my imagination fly just a bit higher, in such a system this issue will be solved just like the current 2038 problem.
At some point the system will do a live migration to a 128 bit architecture, with conversion of the persistent state to appropriately sized values, and the "actor IDs" in the distributed systems will get a bit of fresh air with a wraparound time of 2^32*290k years, whatever that means...
Posted Apr 5, 2019 4:24 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (10 responses)
But wait! If you want to prevent the PID from being reused, and you're the one calling clone() in the first place, all you have to do is *not* call wait(), and then the PID will zombify in an entirely race-free fashion. That doesn't require the use of a pidfd at all. So the only "interesting" functionality we would be adding via a pidfd-that-holds-a-PID-open is the ability to transfer stewardship of a possibly-zombie process to a new "owner" (via SCM_RIGHTS over an AF_UNIX socket, or via fd inheritance), but without changing its PPID. That is such a niche usage that I'm not certain it's actually helpful, especially since you can create a subreaper via prctl(), and thereby actually reparent the process as needed.
(TL;DR: Because then it wouldn't be complicated and/or "interesting" enough to be worth doing.)
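For reference, the subreaper mechanism mentioned above is a one-line prctl() call; a minimal sketch:

/* Sketch only: mark the calling process as a child subreaper, so that
 * orphaned descendants are reparented to it instead of to init and can
 * then be reaped with wait(). */
#include <sys/prctl.h>

int become_subreaper(void)
{
    return prctl(PR_SET_CHILD_SUBREAPER, 1L, 0L, 0L, 0L);
}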
Posted Apr 5, 2019 12:56 UTC (Fri)
by jgg (subscriber, #55211)
[Link] (2 responses)
Posted Apr 5, 2019 18:32 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
Posted Apr 19, 2019 18:12 UTC (Fri)
by mixi (guest, #131542)
[Link]
It could exit in the grandchild, which would cause the grandchild to be reaped by init, but won't have any effect on the child's PID.
Posted Apr 5, 2019 16:27 UTC (Fri)
by thiago (guest, #85680)
[Link] (2 responses)
The problem we faced then was exactly the problem of reparenting when the file descriptor is passed onwards via AF_UNIX. Especially when that is coupled with ptrace(), which appears to do a fake-reparenting to the tracer process. No one could explain to us back then what the issue was, so we couldn't fix it, and the matter was dropped.
Here's to hoping this feature is useful for userspace.
Posted Apr 7, 2019 10:40 UTC (Sun)
by meuh (guest, #22042)
[Link] (1 responses)
Posted Apr 7, 2019 18:19 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 5, 2019 20:24 UTC (Fri)
by jkowalski (guest, #131304)
[Link] (3 responses)
You don't do anything like this really, you can actually set the process cloned using CLONE_PIDFD to deliver no signal to the parent on termination (by setting it to null), which means everyone having a copy of the same file descriptor can poll on it to know when it dies. That also means the process cannot be waited upon, which is what you want when you use this API from inside libraries.
It happens as a side effect of using descriptors: since the parent will get a readable instance (though it is still not clear to me whether this is something the patch author will support; here's to hoping it is), it can pass a copy of its descriptor to others, then perhaps close it, and allow the other process to essentially poll on it, know when the process is gone, and get back the exit status.
The same could be done using references to external processes using pidfd_open, you pass a flag that requests the kernel to give you a readable instance, and if you're a real_parent or parent (as in ptrace terms), you get one. It also has the nice property of the mount namespace not being the one where you acquire pidfds, but limited to the scope of your PID namespace (which I think is a very important point that has been overlooked thus far).
Scoping the opening and adding a system call with extendable flags would allow you to lift checks in pidfd_send_signal to signal across namespaces, exactly because without userspace doing it on its own, a process can only open a pidfd to something it can address inside its PID namespace. It is otherwise a layering violation (literally) that I can use the mount namespace to circumvent this, if it is opened up in the future. I also object to being able to peek into process state through such descriptors; that capability should be orthogonal, not bound to the pidfd, even if I have the authority to read through. *Therefore, using /proc dir fds comes with a big downside to all of this.*
The nice delegation model allows you to extend pidfd_open with, say, PRIV_KILL that allows you to bind CAP_KILL privs to a pidfd, assuming you have CAP_KILL in the owning userns, which would allow you to pass this pidfd and let the receiver signal across namespace boundaries without restrictions (it has to be opt-in as this is not what you want by default).
You could add a similar flag to bind ptrace privs of the opener, though that is a lot more involved and I have not mentioned it anywhere thus far.
Thus, you can think of the pidfd as a stable reference to the process, and such flags depending on the authority of the opener (if parent, readable, if CAP_KILL, killable, if CAP_SYS_PTRACE, ptraceable, etc) allow you to open up methods to operate on it, and since they are bound to the descriptor, it is limited in scope to the said process only. Such intent cannot be expressed when using /proc. It also does not play well with hidepid=2 (invisible dirs mean you cannot take a reference), and hidepid=1 (dirs you cannot enter mean you cannot reference threads you can see).
The whole reparent on fd-passing however is broken. There can be multiple processes keeping it open at a time.
Posted Apr 7, 2019 10:42 UTC (Sun)
by meuh (guest, #22042)
[Link]
Posted Apr 7, 2019 16:12 UTC (Sun)
by luto (subscriber, #39314)
[Link] (1 responses)
Posted Apr 7, 2019 19:31 UTC (Sun)
by jkowalski (guest, #131304)
[Link]
You could also make it available to things with NNP set, and when cloning children bind the PRIV_KILL, then pass it around and send signals. All these checks happen when the flag is used during pidfd_open or clonefd or whatever.
Do you see other cases where it could be a problem?
Posted Apr 5, 2019 21:49 UTC (Fri)
by roc (subscriber, #30627)
[Link] (5 responses)
When a traced process is killed by signal or `exit_group`, or does an `execve`, its threads exit in some unknown order. rr wants to clean them all up at once by `wait`ing for those specific threads but NOT receiving notifications for other threads it may be tracing. We currently have no way to do that. AIUI with pidfds we would be able to use, say, `poll` to wait for the exit notifications of a specific set of threads.
You might think we could handle `SIGCHLD` to know when notifications are pending for a traced thread, and use `waitpid` to grab the results for each traced thread after it exits, but that isn't reliable because multiple pending `SIGCHLD`s can be coalesced. Even with `signalfd` :-(.
This is a problem for libraries in general. If you have a library that wants to manage some child processes without interfering with other code (including on other threads) managing other child processes, that's really hard without pidfds.
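Assuming pidfds do become pollable and report readability when the process exits (which was the direction of the discussion, not yet a guarantee), a library could wait for a specific set of processes with something like this sketch:

/* Sketch only: wait for the first of a set of processes, each
 * represented by a pidfd, to exit. Assumes pidfds become pollable
 * and report POLLIN when the process exits, as discussed above. */
#include <poll.h>

/* Returns the index of a pidfd whose process has exited, or -1. */
int wait_for_exit(const int *pidfds, int n)
{
    struct pollfd fds[64];

    if (n > 64)
        n = 64;
    for (int i = 0; i < n; i++) {
        fds[i].fd = pidfds[i];
        fds[i].events = POLLIN;   /* assumed: readable on exit */
    }
    if (poll(fds, (nfds_t)n, -1) < 0)
        return -1;
    for (int i = 0; i < n; i++)
        if (fds[i].revents & POLLIN)
            return i;
    return -1;
}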
Posted Apr 8, 2019 19:59 UTC (Mon)
by nix (subscriber, #2304)
[Link] (4 responses)
I revived and unbroke them as part of DTrace for Linux (see e.g. https://oss.oracle.com/git/gitweb.cgi?p=dtrace-linux-kern...) because this argument only applies to process-directed waitpid() results: ptrace() is thread-directed, so if you kick off a new thread to handle the requests from other threads it cannot see any of the waitpid() results, and the thread that can see the waitpid() results isn't able to poll on the requests because it's too busy doing a waitpid() :( you'd have to busywait, or poll with pauses, and thus add huge latencies to signal handling in all your traced children: no thanks.
Posted Apr 8, 2019 20:41 UTC (Mon)
by jkowalski (guest, #131304)
[Link] (2 responses)
That's not very problematic (from an API perspective it is very weird, but not internally), but my idea was pidfd_open could take some wait flags as waitfd did (and check if you're one of parent or real_parent, and only return a readable descriptor), among other things, and that intent cannot be described easily when opening a directory in /proc. This means everyone can poll but perhaps only parents (real or tracing) would be able to open a readable instance.
I can only hope this is taken into consideration (I raised this exact problem as I ran into it as well), and /proc descriptor stuff be removed, as it brings more problems for future extensions.
You could of course resurrect a waitfd too, again, but I see no point when you could return a pollable/readable descriptor from clone and pidfd_open (and possibly even disable termination signals and autoreap them - the whole act of waiting is asynchronous). It also has a nice touch of consistency to it (and the fact that resources in a different namespace suddenly aren't addressable and can't have a reference taken to them from a different namespace - the filesystem); in that sense pidfd_open isn't very different from open, but it works like open only on PIDs you can *see* (unlike /proc, which is leaky across shared mount namespaces but separate PID namespaces).
Posted Apr 9, 2019 3:54 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 9, 2019 15:12 UTC (Tue)
by nix (subscriber, #2304)
[Link]
(If pidfds also wake you up when non-termination waitpid() results would be returned iff you did a waitpid(.., WNOHANG), then they do seem like a complete replacement for waitfd, which is great because it means I can drop another annoying invasive patch!)
btw, this does mean that you'd need to be able to get a pidfd not only at clone time but also at PTRACE_SEIZE time, or tracers could never get hold of a useful pidfd at all...
Posted Apr 9, 2019 3:52 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 5, 2019 11:33 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (6 responses)
Lennart
Posted Apr 5, 2019 13:03 UTC (Fri)
by corbet (editor, #1)
[Link] (4 responses)
Posted Apr 5, 2019 13:44 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (3 responses)
Posted Apr 11, 2019 14:35 UTC (Thu)
by kmweber (guest, #114635)
[Link] (2 responses)
So on an unrelated topic, but similar issue...years ago, the Madden series of American football games had a hard limit of 255 points that could be scored in a game and 1023 rushing yards that could be gained (I assume other stats were similarly capped, but these were the ones I noticed) (these limits may or may not still be there, it's been years since I played). For those of you who aren't familiar with American football, those numbers are ridiculously unachievable in any real game, but when playing the video game on easy settings it's quite plausible to reach both of those limits by halftime (particularly with skillful clock management--in short, American football rules specify that whether or not the game clock keeps running between plays is generally contingent upon the outcome of the just-completed play, with some complications for timeouts, penalties, and injuries, and when the end of the half or game is near--manipulating these rules with play-calling is a crucial part of late-game strategy). The 255-point limit was annoying but straightforward enough to understand, but for the life of me I couldn't understand why accumulated rushing yards was limited at 1023. This was in the mid-to-late 2000s; it wasn't *that* long ago, and memory certainly wasn't scarce enough that it was worth the extra trouble of using bitfields or implementing an explicit limit in code that was less than the max value of the data type used.
Now that I think of it, maybe this is where those extra ten bits from the PID cap went :)
Posted May 6, 2019 18:25 UTC (Mon)
by mgedmin (subscriber, #34497)
[Link] (1 responses)
16-bit fixed point numbers maybe, with 6 bits reserved for the fractional value?
Posted Sep 5, 2019 21:43 UTC (Thu)
by kmweber (guest, #114635)
[Link]
Essentially, the number of yards you've gained is equal to the number of yard lines you've crossed. So if you start from barely past the one yard line and get to just short of the four yard line, you've only officially gained two yards for statistics purposes even though you've actually gained very nearly three. And on the flip side, if you start from just short of the two yard line and end just past the three yard line, you're credited with a two-yard gain even though you've really covered barely more than one.
Posted Apr 5, 2019 13:15 UTC (Fri)
by Villemoes (subscriber, #91911)
[Link]
Though uapi/linux/futex.h seems to imply that the actual limit is 30 bits.
Posted Apr 5, 2019 13:05 UTC (Fri)
by mjthayer (guest, #39183)
[Link] (23 responses)
Posted Apr 8, 2019 1:06 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (22 responses)
I agree that having two-factor authentication would be a good approach, but I'm not sure that a timestamp is best.
Posted Apr 8, 2019 17:58 UTC (Mon)
by perennialmind (guest, #45817)
[Link]
Posted Apr 8, 2019 18:17 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (20 responses)
Process management has been the most fucked-up part of Unix since forever. Adding FD-based interface will go a long way towards unfucking it. Adding more ad-hoc kludges won't.
Posted Apr 8, 2019 19:55 UTC (Mon)
by roc (subscriber, #30627)
[Link] (11 responses)
Though there are some other nasty areas of kernel API that might be worse than process management. Signals themselves, for example. https://ldpreload.com/blog/signalfd-is-useless
Posted Apr 8, 2019 20:04 UTC (Mon)
by nix (subscriber, #2304)
[Link]
Posted Apr 8, 2019 22:04 UTC (Mon)
by rweikusat2 (subscriber, #117920)
[Link] (8 responses)
There's only one case where the behaviour of POSIX calls is undefined wrt signals: if a signal handler which interrupted an unsafe function calls an unsafe function. The I/O-multiplexing calls are async-signal-safe; hence, the usually sensible way to marry signal handling and an I/O multiplexing loop is to block all handled signals, install signal handlers for them, and then use pselect/ppoll/epoll_pwait to unblock handled signals only while the code is waiting for I/O. Signal handlers are then free to call whatever function they want, as the sole interruptible function will be async-signal-safe. And the only function which can terminate with an EINTR condition will be the multiplexing call. Problem solved[*].
[*] Signal masks are inherited; hence, code creating new processes has to deal with that. Just like it has to deal with all other inheritable things. BFD.
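A minimal sketch of that pattern, with SIGCHLD standing in for the handled signals, might look like this:

/* Sketch only: keep SIGCHLD blocked everywhere except inside ppoll(),
 * so a handler can only ever interrupt the multiplexing call itself. */
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/wait.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int sig) { (void)sig; got_sigchld = 1; }

int event_loop(struct pollfd *fds, nfds_t nfds)
{
    sigset_t block, during_wait;
    struct sigaction sa;

    /* Block SIGCHLD for normal execution, saving the previous mask... */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &during_wait);
    /* ...and build the mask ppoll() installs while sleeping, which has
     * SIGCHLD unblocked so the handler can run only there. */
    sigdelset(&during_wait, SIGCHLD);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigaction(SIGCHLD, &sa, NULL);

    for (;;) {
        int ready = ppoll(fds, nfds, NULL, &during_wait);
        if (got_sigchld) {
            got_sigchld = 0;
            while (waitpid(-1, NULL, WNOHANG) > 0)
                ;                 /* reap any exited children */
        }
        if (ready >= 0)
            return ready;         /* descriptors are ready to handle */
        if (errno != EINTR)
            return -1;            /* real error */
    }
}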
Posted Apr 8, 2019 22:08 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Signals are basically impossible to use correctly.
Posted Apr 9, 2019 14:18 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link]
With Linux, one can even build programs which are entirely signal-driven by using realtime signals for I/O readiness notification (see fcntl(2)) and sigwaitinfo instead of one of the file descriptor based I/O multiplexing calls. I've used that once for a moderately complex Perl program (about 25 kloc) because it was the easiest to use in this environment.
Posted Apr 9, 2019 15:04 UTC (Tue)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Apr 9, 2019 17:42 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link] (2 responses)
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04
The three usual approaches would be
IMHO, thread cancellation is a braindead misfeature but that's a completely different discussion.
Posted Apr 11, 2019 7:04 UTC (Thu)
by wahern (subscriber, #37304)
[Link]
Once all the boilerplate code is in place it seems like a lot of unnecessary complexity compared to having started with something like a signalfd or eventfd. OTOH, for some things, such as catching SIGSEGV to extend mmap'd stack data structures, or to interrupt and switch program flow without tight, language-level integration of components, the old semantics are indispensable.
Threading semantics compound the headaches. But as for cancellations in particular, I don't think anybody defends the idea any longer. It's almost a strawman at this point, as few if any applications actually use them and implementations effectively disclaim liability. I'm surprised there's no movement to remove cancellations from implementations, POSIX, or both.
Posted Apr 13, 2019 14:59 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Apr 8, 2019 22:23 UTC (Mon)
by roc (subscriber, #30627)
[Link]
Sometimes it's a problem to delay all signal handling until you return to the event loop; sometimes you really want signal handlers to be able to run during some long synchronous operation.
It doesn't solve the signal coalescing problem at all, making the data in siginfo impossible to use reliably in many situations.
It doesn't solve the problem that there are a finite and rather small set of signals available and you basically have to allocate them statically with no protocol for avoiding or handling conflicting uses of a signal.
Posted Apr 8, 2019 22:39 UTC (Mon)
by roc (subscriber, #30627)
[Link]
Posted Apr 20, 2019 7:16 UTC (Sat)
by njs (guest, #40338)
[Link]
For process management, all the terrible problems that make the APIs impossible to use safely are totally self-inflicted. And probably the worst of those is the choice to use signals!
If we're kvetching about kernel API misdesigns, "non-blocking read from stdin" should also be on the list, probably just below SIGCHLD. The problem is: how do you do a non-blocking read from stdin, like you might want to in an async system like node? You might think "well, just use fcntl on fd 0 to set O_NONBLOCK", but since the O_NONBLOCK flag is stored on the file descript*ion*, this also affects any other processes that might have copies of that fd. Obviously O_NONBLOCK should have been a file-descriptor flag, like O_CLOEXEC, but file-descriptor flags didn't exist when O_NONBLOCK was created, so that's not how it works. Therefore, you can't safely set O_NONBLOCK on stdin without possibly breaking other random programs. djb has some cogent commentary: https://cr.yp.to/unix/nonblock.html
There are some obscure hacks for specific cases: https://github.com/python-trio/trio/issues/174#issuecomme...
But fundamentally this is an obvious, common problem that simply can't be solved on popular Unixes.
(Probably the obvious solution for Linux at this point would be to add a RWF_NONBLOCK flag to preadv2/pwritev2, as per djb's suggestion.)
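For illustration, here is the problematic operation itself; the hazard is that O_NONBLOCK is a property of the shared open file description rather than of the descriptor:

/* Illustration of the hazard described above: O_NONBLOCK lives on the
 * open file description, so flipping it on fd 0 also changes blocking
 * behaviour for every other process sharing that description. */
#include <fcntl.h>
#include <unistd.h>

int set_stdin_nonblocking(void)
{
    int flags = fcntl(STDIN_FILENO, F_GETFL);

    if (flags < 0)
        return -1;
    /* Unlike FD_CLOEXEC, this is a per-description flag, not a
     * per-descriptor one; the shell (or any other holder of this
     * description) will now see non-blocking reads too. */
    return fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);
}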
Posted Apr 8, 2019 21:51 UTC (Mon)
by rweikusat2 (subscriber, #117920)
[Link] (7 responses)
Posted Apr 8, 2019 22:03 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
For example, a general-purpose library can't even do something as simple as create a process and wait for it to finish. This is just ridiculous.
Posted Apr 9, 2019 14:21 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link] (5 responses)
"A general purpose library", IOW, some random, binary only code with undefined behaviour (in the sense that no specific behaviour is ever defined when this 'argument' shows up) is a situation which cannot be handled.
Posted Apr 9, 2019 15:08 UTC (Tue)
by nix (subscriber, #2304)
[Link] (4 responses)
Posted Apr 9, 2019 16:39 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (3 responses)
I have zero qualms about breaking programs that are so obviously broken themselves. You don't free resources you didn't create: that's a rule for any system, not just Unix. Just no. Let's not enable programs that break this rule.
Posted Apr 9, 2019 19:50 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Apr 13, 2019 15:01 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Apr 13, 2019 15:58 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 5, 2019 14:54 UTC (Fri)
by amarao (subscriber, #87073)
[Link]
The proper way would be to assure userspace that pids (process uuids, process atoms, you name it) are never reused for the lifetime of the system. If you have an ID, it's guaranteed to be the ID.
The whole 'pid' thing sounds more like hotel room numbers than guest identities.
Posted Apr 5, 2019 14:58 UTC (Fri)
by walters (subscriber, #7396)
[Link]
However there's an interesting thing here...you still really want a "human readable" and persistent identifier. With regular files, fds are obviously retrieved via...file names.
But this proposal is adding no such equivalent - /proc/<pid> doesn't solve the problem because it uses integers.
Now one of the core innovations of systemd was using cgroups as a way to just *group* processes. That model has further extended into modern container systems - e.g. in the model popularized by Docker, a "container" is a grouping of an image with cgroups, namespaces etc. Put another way...no one using e.g. Kubernetes really ends up thinking in terms of pids - you `kubectl delete pod/X`.
Back to cgroups...I think we can push admins and userspace more to thinking in terms of the cgroups. What if our identifiers for processes were more like (cgroup ID, u64)? And we had a system call that took that pair? It'd greatly ameliorate issues with pid reuse - at least you'd never end up killing a process in another cgroup.
We'd have /usr/bin/kill httpd.service/42 or so.
Posted Apr 6, 2019 20:05 UTC (Sat)
by zlynx (guest, #2285)
[Link]
There's no purpose in using "short" to store PIDs when we have "pid_t" and using "char[6]" as string storage for a PID has always been a pretty bad idea. The print formatting that assumes six columns is ugly but not dangerous, and apparently people just don't care, as "vmstat" has had broken column widths for at least ten years (block IO per second and memory sizes are much larger than vmstat originally designed for).
Posted Apr 7, 2019 13:24 UTC (Sun)
by geuder (subscriber, #62854)
[Link]
Somewhat reluctantly, I changed pid_max to 999999 some months ago. With the 15-bit limit in use for decades, I expected something to break. But luckily I have not observed any breakage (although I'm ready to believe those who say they have seen broken software). Only (command-line) user friendliness has suffered a bit: most of the PIDs are now six digits to remember. But hey, that's what I wanted...
It's indeed strange that the 15-bit limit has lived for so long. On the other hand, in desktop usage, and probably also in many server cases, the process-creation rate has not increased as drastically as many other performance parameters have during the last 20 years.
It can't hurt if the kernel hackers fix this issue in any case. When pid namespaces and mount namespaces get involved, we can just hope it doesn't get too complicated to be usable in real life...
Posted Apr 11, 2019 14:30 UTC (Thu)
by flussence (guest, #85566)
[Link] (5 responses)
You might think I'm not serious (and you'd be right), but I can imagine a lot of flamewars wouldn't have happened in a world with this as standard.
Posted Apr 12, 2019 3:23 UTC (Fri)
by faramir (subscriber, #2327)
[Link] (1 responses)
Posted Apr 13, 2019 10:43 UTC (Sat)
by farnz (subscriber, #17727)
[Link]
You could use identifier-locator addressing to handle that - the upper 64 bits can be the SIR, and the lower 64 bits can be the identifier. Then, the existing ILA mechanisms will convert SIR to locator whenever you need to talk to a process.
Posted Apr 25, 2019 22:36 UTC (Thu)
by fest3er (guest, #60379)
[Link] (2 responses)
Posted May 2, 2019 21:18 UTC (Thu)
by flussence (guest, #85566)
[Link]
A scheme that encodes the CPU ID and TSC might be pretty efficient and would work, if we were to abandon the guarantee that IDs have any correlation to time. But, this being Linux, someone out there almost certainly depends on that implementation detail.
Posted May 9, 2019 18:38 UTC (Thu)
by mcortese (guest, #52099)
[Link]
70% of people would be ok with max login of 3
95% of people would be ok with max login of 8
99% of people would be ok with max login of 13
70% of people should be ok with a max of 89 processes per login
95% of people should be ok with max of 144 processes per login
99% of people should be ok with max of 377 processes per login
If you raise pid_max to 99999 on a small system and set these limits, does it strongly reduce the issues?
not properly running programs that consume a few more resources than normal.
to use. Which argues against making 4 million the default. But otherwise something like 4 million
would probably be a fine default for a limit like that.
If the pid limit is 4 million, problems due to wraparound are rare, but they may occasionally happen, causing hard to trace bugs. Same with MAXINT. But if pid were a 64-bit number, and the limit the maximum of that, wraparound would never happen, so software could safely assume that pids are always unique.
> But if pid were a 64-bit number, and the limit the maximum of that, wraparound would never happen
Cue to a meeting room with a dozen people dressed like characters from Things to Come trying to figure out why The Google stopped answering their questions.
Fine. It'll be a looooong time.
The problem we faced then was exactly the problem of reparenting when the file descriptor is passed onwards via AF_UNIX.
I think a file descriptor sent through SCM_RIGHTS should not imply reparenting the process to the reader, just as file descriptors inherited through fork() don't imply reparenting the process to the newly created one. The semantics of those operations is "copy" ... but "copying" a parent relationship doesn't make sense to me: in Unix a process can only have a single parent.
The whole reparent on fd-passing however is broken. There can be multiple processes keeping it open at a time.
I agree: through fork(), exec(), and SCM_RIGHTS, a file descriptor can be duplicated into many processes.
That's the four million mentioned in the text; it is indeed explicitly capped at that size.
/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
	(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

/*
 * The rest of the robust-futex field is for the TID:
 */
#define FUTEX_TID_MASK		0x3fffffff
We could add a "pid generation" counter which was incremented whenever the next-pid number restarted from the bottom. Then you need some way to get the pid generation for a given process, and some way to include it in the signal sent.
Two new syscalls would do it.
So add a /proc/$pid/pid-generation symlink or file and some kind of alias to the current /proc/$pid, like /proc/pid-gen/$pid-$gen? PID remains a unique candidate key, subject to a version check. Sounds sane to me. I don't have any particular need for 4 million processes running at once.
Nope. And you haven't even gotten to the interesting stuff - multithreading and realtime signals.
Signals don't cause "horrible bugs when innocent functions like malloc are being called". Broken code for handling signals might. As I already wrote: a signal handler must not call a function which isn't async-signal-safe if it interrupted another unsafe function. That's easy enough to guarantee (see other postings).
In the presence of signals, all functions defined by this volume of POSIX.1-2017 shall behave as defined when called from or interrupted by a signal-catching function, with the exception that when a signal interrupts an unsafe function or equivalent (such as the processing equivalent to exit() performed after a return from the initial call to main()) and the signal-catching function calls an unsafe function, the behavior is undefined.
That's part of section 2.4.3 of the following page:
A UNIX signal is nothing but a level-triggered interrupt emulated in software. A lot of hardware (eg, PCI hardware) has used this model for event notification for an eternity and it's perfectly workable.
That's easy enough to guarantee
It's so easy to guarantee that when I looked, every single program that used nontrivial signal handlers (that did more than assigning to a variable or writing to a one-byte pipe) got it wrong, including glibc itself. The latter case is since fixed, but this is clearly not easy to guarantee, given that most uses are faulty and even the implementation gets it wrong. I suspect lack of experience on your part masquerading as arrogance, frankly.
Or really *really* obscure hacks: https://gist.github.com/njsmith/235d0355f0e3d647beb858765...