PID namespaces in the 2.6.24 kernel
One of the new features in the upcoming 2.6.24 kernel will be the PID namespace support developed by the OpenVZ team with the help of IBM. A PID namespace allows for creating sets of tasks, with each such set looking like a standalone machine with respect to process IDs. In other words, tasks in different namespaces can have the same IDs.
This feature is the major prerequisite for the migration of containers between hosts; having a namespace, one may move it to another host while keeping the PID values -- and this is a requirement since a task is not expected to change its PID. Without this feature, the migration will very likely fail, as processes with the same IDs may already exist on the destination node, which will cause conflicts when addressing tasks by their IDs.
PID namespaces are hierarchical: once a new PID namespace is created, all the tasks in the current PID namespace will see the tasks in the new namespace (i.e. will be able to address them by their PIDs). However, tasks from the new namespace will not see the ones from the current one. This means that a task may now have more than one PID -- one for each namespace in which it is visible.
User-space API
To create a new namespace, one should just call the clone(2) system call with the CLONE_NEWPID flag set.
After this, it is useful to change the root directory and mount a new procfs instance on /proc to make common utilities like ps work.
Note that since the parent knows the PID of its child, it may wait() in the usual way for it to exit.
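The creation and wait steps described above can be sketched in user space roughly as follows. This is a minimal illustration rather than code from the patch set: the function name first_task_is_pid_one(), the stack size, and the return convention are assumptions made for the example, and the clone() call is expected to fail with EPERM unless the caller holds CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (256 * 1024)  /* arbitrary size chosen for this sketch */

/* Runs as the first task of the new namespace; exits 0 if it sees PID 1. */
static int child_main(void *arg)
{
    (void)arg;
    /* Ask the kernel directly so that no cached value can be returned. */
    return syscall(SYS_getpid) == 1 ? 0 : 1;
}

/*
 * Hypothetical helper: returns 1 if the child saw itself as PID 1,
 * 0 if it did not, and -1 on failure (errno is set by the failing call).
 */
int first_task_is_pid_one(void)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    pid_t child, waited;
    int status, saved_errno;

    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so pass its top. */
    child = clone(child_main, stack + CHILD_STACK_SIZE,
                  CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        saved_errno = errno;
        free(stack);
        errno = saved_errno;
        return -1;
    }

    /* The parent knows the child by an ordinary PID, so waitpid() works. */
    do {
        waited = waitpid(child, &status, 0);
    } while (waited == -1 && errno == EINTR);

    saved_errno = errno;
    free(stack);
    errno = saved_errno;

    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

When called from a sufficiently privileged process, the helper should return 1; an unprivileged caller gets -1 with errno describing why clone() was refused.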
The first task in a new namespace will have a PID of 1. Thus, it will be this namespace's init and child reaper, so all the orphaned tasks will be re-parented to it. Unlike on a standalone machine, this "init" can die, and in this case, the whole namespace will be terminated.
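Because orphaned tasks are re-parented to the namespace's init, a program running as PID 1 there will usually want to collect their exit statuses. A minimal sketch of such a reaping loop is shown here; the function name reap_all_children() and its return convention are assumptions made for the illustration, not part of any kernel or libc interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: wait for every child of the calling task,
 * including orphans that were re-parented to it. Returns the number
 * of tasks reaped, or -1 on an unexpected waitpid() error.
 */
int reap_all_children(void)
{
    int reaped = 0;

    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);

        if (pid > 0) {
            reaped++;           /* one more exited task collected */
            continue;
        }
        if (errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        if (errno == ECHILD)
            return reaped;      /* no children left to wait for */
        return -1;
    }
}
```

The loop blocks until all current children have exited, so a real init would more likely call something like it from a SIGCHLD-driven path instead of once at shutdown.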
Since we will now have isolated sets of tasks, we should make procfs show only the set of PIDs which is visible to a particular task. To achieve this goal, procfs should be mounted multiple times -- once for each namespace. After this, the PIDs that are shown in the mounted instance will be from the namespace which created that mount.
For example, a user may create a new proc_2 directory, spawn a PID namespace and mount procfs on it. After this, the user will be able to see the PIDs as they appear inside this new namespace. PID number 1 will be there, which is the namespace's init, and all the other PIDs may coincide with some PIDs from the current namespace, but refer to some other task.
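The mount step of this example might look like the following when performed by a task inside the new namespace. This is only a sketch: mount_procfs_at() is a hypothetical helper name, the directory mode is an arbitrary choice, and the call needs the privileges that mount(2) normally requires.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: create the directory if needed and mount a fresh
 * procfs instance on it. Intended to be called by a task that already
 * lives in the new PID namespace, so that the instance shows that
 * namespace's PIDs. Returns 0 on success, -1 on failure with errno set.
 */
int mount_procfs_at(const char *dir)
{
    /* An already existing directory is fine; any other error is not. */
    if (mkdir(dir, 0555) == -1 && errno != EEXIST)
        return -1;

    /* The source string is ignored by procfs; "proc" is conventional. */
    return mount("proc", dir, "proc", 0, NULL);
}
```

A task in the new namespace could call mount_procfs_at("proc_2") and then list that directory to see PID 1 and its descendants.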
No other changes in the user API are necessary. Tasks can still get their PIDs, PGIDs, etc. with the usual system calls. They can also work with sessions and groups. Tasks may create threads and work with futexes.
Internal API
All the PIDs that a task may have are described in the struct pid. This structure contains the ID value, the list of tasks having this ID, the reference counter and the hashed list node to be stored in the hash table for a faster search.
A few more words about the lists of tasks. Basically a task has three PIDs: the process ID (PID), the process group ID (PGID), and the session ID (SID). The PGID and the SID may be shared between the tasks, for example, when two or more tasks belong to the same group, so each group ID addresses more than one task.
With the PID namespaces this structure becomes elastic. Now, each PID may have several values, with each one being valid in one namespace. That is, a task may have a PID of 1024 in one namespace, and 256 in another. So, the former struct pid changes.
Here is what the struct pid looked like before the introduction of PID namespaces:
    struct pid {
        atomic_t count;                         /* reference counter */
        int nr;                                 /* the pid value */
        struct hlist_node pid_chain;            /* hash chain */
        struct hlist_head tasks[PIDTYPE_MAX];   /* lists of tasks */
        struct rcu_head rcu;                    /* RCU helper */
    };
And this is how it looks now:
    struct upid {
        int nr;                         /* moved from struct pid */
        struct pid_namespace *ns;       /* the namespace this value
                                         * is visible in */
        struct hlist_node pid_chain;    /* moved from struct pid */
    };

    struct pid {
        atomic_t count;
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        int level;                      /* the number of upids */
        struct upid numbers[0];
    };
As you can see, the struct upid now represents the PID value -- it is stored in the hash and has the PID value.
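To make the variable-length layout more concrete, here is a simplified user-space model of how such a structure could be filled in when a task is created in a nested namespace. It is not kernel code: the *_model types, the last_nr counter standing in for the real PID allocator, and the convention that level is the namespace depth (so that level + 1 entries exist) are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures; names are hypothetical. */
struct pid_namespace_model {
    int level;                           /* 0 for the initial namespace */
    int last_nr;                         /* last PID value handed out here */
    struct pid_namespace_model *parent;  /* NULL for the initial namespace */
};

struct upid_model {
    int nr;                              /* value visible in 'ns' */
    struct pid_namespace_model *ns;
};

struct pid_model {
    int level;                           /* depth of the deepest namespace */
    struct upid_model numbers[];         /* level + 1 entries */
};

/*
 * Allocate one value per namespace, from the task's own namespace up to
 * the initial one. Returns NULL if memory cannot be allocated.
 */
struct pid_model *alloc_pid_model(struct pid_namespace_model *ns)
{
    struct pid_model *pid;
    int i;

    assert(ns != NULL);

    pid = malloc(sizeof(*pid) +
                 (size_t)(ns->level + 1) * sizeof(pid->numbers[0]));
    if (pid == NULL)
        return NULL;

    pid->level = ns->level;
    for (i = ns->level; i >= 0; i--) {
        pid->numbers[i].nr = ++ns->last_nr;  /* next free value in this ns */
        pid->numbers[i].ns = ns;
        ns = ns->parent;                     /* move one level up */
    }
    return pid;
}
```

In this model, the first task created in a fresh child namespace gets the value 1 at its own level and the next free value in each ancestor, which matches the behavior described for the namespace's init.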
To convert the struct pid to the PID or vice versa one may use a set of helpers like task_pid_nr(), pid_nr_ns(), find_task_by_vpid(), etc.
All of these calls carry some information in their names:
..._nr() - These operate on the so-called "global" PIDs. Global PIDs are the numbers that are unique in the whole system, just like the old PIDs were. E.g. pid_nr(pid) will tell you the global PID of the given struct pid. These are only useful when the PID value is not going to leave the kernel. For example, some code needs to save the PID and then find the task by it. However, in this case saving a direct pointer to the struct pid is preferable, as global PIDs are going to be used in kernel logs only.
..._vnr() - These helpers work with the "virtual" PID, i.e. with the ID as seen by a process. For example, task_pid_vnr(tsk) will tell you the PID of a task, as this task sees it (with sys_getpid()). Note that this value will most likely be useless if you're working in another namespace, so these helpers are used when working with the current task, since every task always sees its own virtual PIDs.
..._nr_ns() - These work with the PIDs as seen from the specified namespace. If you want to get some task's PID (for example, to report it to user space and find this task later), you may call task_pid_nr_ns(tsk, current->nsproxy->pid_ns) to get the number, and then find the task using find_task_by_pid_ns(pid, current->nsproxy->pid_ns). These are used in system calls, when the PID comes from user space. In this case one task may address another which exists in another namespace.
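The difference between the three families can be illustrated with a small user-space model. It is not the kernel implementation: the *_model types, the fixed MODEL_MAX_LEVEL bound, the explicit current_ns argument (the kernel helpers take the namespace from the current task instead), and the use of 0 to mean "not visible from this namespace" are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 4  /* arbitrary nesting bound for this sketch */

/* Simplified stand-ins for the kernel structures; names are hypothetical. */
struct ns_model {
    int level;                      /* 0 for the initial namespace */
};

struct upid_model {
    int nr;                         /* value visible in 'ns' */
    const struct ns_model *ns;
};

struct pid_model {
    int level;                      /* depth of the deepest namespace */
    struct upid_model numbers[MODEL_MAX_LEVEL];
};

/* ..._nr_ns(): the value as seen from 'ns', or 0 if not visible there. */
int model_pid_nr_ns(const struct pid_model *pid, const struct ns_model *ns)
{
    assert(ns != NULL);

    if (pid != NULL && ns->level <= pid->level) {
        const struct upid_model *upid = &pid->numbers[ns->level];

        if (upid->ns == ns)
            return upid->nr;
    }
    return 0;
}

/* ..._nr(): the global value, i.e. the one in the initial namespace. */
int model_pid_nr(const struct pid_model *pid)
{
    return pid != NULL ? pid->numbers[0].nr : 0;
}

/* ..._vnr(): the value as seen from the caller's own namespace. */
int model_pid_vnr(const struct pid_model *pid,
                  const struct ns_model *current_ns)
{
    return model_pid_nr_ns(pid, current_ns);
}
```

With a task that is 1024 globally and 256 in a child namespace, model_pid_nr() yields 1024, model_pid_nr_ns() for the child namespace yields 256, and a lookup from an unrelated namespace at the same depth yields 0.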
Conclusion
The interface as described here has been merged for the 2.6.24 kernel
release. It has, however, been marked as "experimental" to prevent its
wide deployment by distributors while some remaining issues are worked
out. Few, if any, changes to this API are expected between now and when
the "experimental" tag is removed in a later kernel release.
Index entries for this article | |
---|---|
GuestArticles | Emelyanov, Pavel |
Posted Nov 6, 2010 20:05 UTC (Sat)
by arekm (subscriber, #4846)
How can one find the mapping between PIDs in the parent PID namespace and PIDs in the child PID namespace?
I need to know the PID in the parent namespace that corresponds to a process running in the child PID namespace, and I only know the PID it has in the child namespace.