PID namespaces in the 2.6.24 kernel

November 19, 2007

This article was contributed by Pavel Emelyanov and Kir Kolyshkin

One of the new features in the upcoming 2.6.24 kernel will be PID namespace support, developed by the OpenVZ team with the help of IBM. A PID namespace allows for creating sets of tasks, with each such set looking like a standalone machine with respect to process IDs. In other words, tasks in different namespaces can have the same IDs.

This feature is the major prerequisite for migrating containers between hosts: a namespace can be moved to another host with its PID values intact, which is a requirement since a task is not expected to change its PID. Without this feature, migration would very likely fail, because processes with the same IDs may already exist on the destination node, causing conflicts when tasks are addressed by their IDs.

PID namespaces are hierarchical: once a new PID namespace is created, all the tasks in the current PID namespace can see the tasks in the new namespace (i.e. can address them by their PIDs). Tasks in the new namespace, however, cannot see those in the current one. As a result, a task now has more than one PID -- one for each namespace in which it is visible.

User-space API

To create a new namespace, one calls the clone(2) system call with the CLONE_NEWPID flag set. After this, it is useful to change the root directory and mount a new procfs instance on /proc so that common utilities like ps work. Note that since the parent knows the PID of its child, it may wait() in the usual way for it to exit.

The first task in a new namespace will have a PID of 1. Thus, it will be this namespace's init and child reaper, so all the orphaned tasks will be re-parented to it. Unlike on a standalone machine, this "init" can die; when it does, the whole namespace is terminated.
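The clone() call described above can be sketched in user space as follows. This is a minimal illustration, not code from the kernel patches: the names ns_init(), run_in_new_pid_ns() and CHILD_STACK_SIZE are hypothetical, and the sketch assumes the caller has the privileges (CAP_SYS_ADMIN) that CLONE_NEWPID requires.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)	/* hypothetical size for the child stack */

/* Runs as the first task of the new namespace; it should see itself as PID 1. */
static int ns_init(void *arg)
{
	(void)arg;
	return getpid() == 1 ? 0 : 1;
}

/*
 * Hypothetical helper: start ns_init() in a new PID namespace and wait for it.
 * Returns the child's exit status (0 if the child saw itself as PID 1), or -1
 * if clone() or waitpid() failed, e.g. with EPERM for an unprivileged caller.
 */
int run_in_new_pid_ns(void)
{
	char *stack = malloc(CHILD_STACK_SIZE);
	int status;
	pid_t child;

	if (!stack)
		return -1;

	/* The stack grows down on most architectures, so pass its top. */
	child = clone(ns_init, stack + CHILD_STACK_SIZE,
		      CLONE_NEWPID | SIGCHLD, NULL);
	if (child == -1) {
		free(stack);
		return -1;
	}

	/* The parent sees the child under an ordinary PID and can wait for it. */
	if (waitpid(child, &status, 0) == -1) {
		free(stack);
		return -1;
	}
	free(stack);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The value returned by clone() in the parent is the child's PID as seen from the parent's namespace, which is why the ordinary waitpid() works here.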

Since we now have isolated sets of tasks, procfs should show only the set of PIDs that is visible to a particular task. To achieve this goal, procfs should be mounted multiple times -- once for each namespace. The PIDs shown in a mounted instance are then those of the namespace which created that mount.

For example, a user may create a new proc_2 directory, spawn a PID namespace and mount procfs on that directory. After this, the user will be able to see the PIDs as they appear inside the new namespace. There will be a PID 1, which is the namespace's init, and the other PIDs may coincide with PIDs from the current namespace while referring to different tasks.
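The mount step in this example might look like the sketch below. The helper name mount_procfs_at() is hypothetical; the sketch assumes it is called by a sufficiently privileged task running inside the new PID namespace, since the mounted instance shows the PIDs of the namespace that created the mount.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: mount a fresh procfs instance on dir, creating the
 * directory first if it does not exist yet.
 * Returns 0 on success, -1 with errno set on failure.
 */
int mount_procfs_at(const char *dir)
{
	if (mkdir(dir, 0555) == -1 && errno != EEXIST)
		return -1;

	/* The source name "proc" is conventional; the type selects procfs. */
	return mount("proc", dir, "proc", 0, NULL);
}
```

A task inside the new namespace could call mount_procfs_at("proc_2") and then list proc_2 to see its own view of the PIDs.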

No other changes in the user API are necessary. Tasks still have the ability to get their PIDs, PGIDs, etc. with the known system calls. They can also work with sessions and groups. Tasks may create threads and work with futexes.

Internal API

All the PIDs that a task may have are described by struct pid. This structure contains the ID value, the lists of tasks having this ID, a reference counter, and a hash list node used to store it in the hash table for faster lookup.

A few more words about the lists of tasks. Basically a task has three PIDs: the process ID (PID), the process group ID (PGID), and the session ID (SID). The PGID and the SID may be shared between the tasks, for example, when two or more tasks belong to the same group, so each group ID addresses more than one task.

With PID namespaces this structure becomes elastic. Now, each PID may have several values, with each one being valid in one namespace. That is, a task may have a PID of 1024 in one namespace, and 256 in another. So, the former struct pid changes.

Here is what struct pid looked like before PID namespaces were introduced:

    struct pid {
	atomic_t count;				/* reference counter */
	int nr;					/* the pid value */
	struct hlist_node pid_chain;		/* hash chain */
	struct hlist_head tasks[PIDTYPE_MAX];	/* lists of tasks */
	struct rcu_head rcu;			/* RCU helper */
    };

And this is how it looks now:

    struct upid {
	int nr;					/* moved from struct pid */
	struct pid_namespace *ns;		/* the namespace this value
						 * is visible in
						 */
	struct hlist_node pid_chain;		/* moved from struct pid */
    };

    struct pid {
	atomic_t count;
	struct hlist_head tasks[PIDTYPE_MAX];
	struct rcu_head rcu;
	int level;				/* the number of upids */
	struct upid numbers[0];
    };

As you can see, struct upid now represents a PID value: it holds the number and is the object stored in the hash table. To convert a struct pid to a PID or vice versa, one may use a set of helpers like task_pid_nr(), pid_nr_ns(), find_task_by_vpid(), etc.
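To make the layout more concrete, here is a small user-space model of how one number per namespace level could be filled in when a task is created in a nested namespace. It is a sketch only: the model_ names, the last_nr counter and the allocation strategy are assumptions made for illustration, and the model treats level as the depth of the deepest namespace, so that numbers[] holds level + 1 entries.

```c
#include <stdlib.h>

/* Simplified model; the real kernel structures carry more fields. */
struct model_pid_ns {
	int level;			/* 0 for the initial namespace */
	struct model_pid_ns *parent;	/* NULL for the initial namespace */
	int last_nr;			/* model-only: last number handed out */
};

struct model_upid {
	int nr;				/* the value seen in ns */
	struct model_pid_ns *ns;
};

struct model_pid {
	int level;			/* index of the deepest entry */
	struct model_upid numbers[];	/* one entry per namespace level */
};

/*
 * Allocate a model pid for a task created in ns: walk from ns up to the
 * initial namespace and take the next number at every level on the way.
 * Returns NULL if memory allocation fails.
 */
struct model_pid *model_alloc_pid(struct model_pid_ns *ns)
{
	struct model_pid *pid;
	int i;

	pid = malloc(sizeof(*pid) + (ns->level + 1) * sizeof(struct model_upid));
	if (!pid)
		return NULL;

	pid->level = ns->level;
	for (i = ns->level; i >= 0; i--) {
		pid->numbers[i].nr = ++ns->last_nr;
		pid->numbers[i].ns = ns;
		ns = ns->parent;
	}
	return pid;
}
```

In this model, a task created in a first-level namespace ends up with two entries: numbers[1] holds the value seen inside its own namespace and numbers[0] the value seen from the initial one.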

All these calls have some information encoded in their names:

..._nr(): These operate on the so-called "global" PIDs. Global PIDs are the numbers that are unique in the whole system, just like the old PIDs were. E.g. pid_nr(pid) will tell you the global PID of the given struct pid. These are only useful when the PID value is not going to leave the kernel, for example when some code needs to save the PID and later find the task by it. In that case, however, saving a direct pointer to the struct pid is preferable, as global PIDs are going to be used in kernel logs only.
..._vnr(): These helpers work with the "virtual" PID, i.e. with the ID as seen by a process. For example, task_pid_vnr(tsk) will tell you the PID of a task as that task sees it (with sys_getpid()). Note that this value will most likely be useless if you're working in another namespace, so these helpers are used when working with the current task, since every task always sees its own virtual PIDs.
..._nr_ns(): These work with the PIDs as seen from the specified namespace. If you want to get some task's PID (for example, to report it to user space and find this task later), you may call task_pid_nr_ns(tsk, current->nsproxy->pid_ns) to get the number, and then find the task using find_task_by_pid_ns(pid, current->nsproxy->pid_ns). These are used in system calls, when the PID comes from user space. In this case one task may address another which exists in another namespace.
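The difference between the three families can be illustrated with a user-space model. The sketch below is not the kernel implementation: the model_ names and the fixed-size numbers[] array are assumptions made to keep the example short, and a return value of 0 is used here to mean "not visible from that namespace".

```c
#include <stddef.h>

#define MODEL_MAX_LEVEL 4	/* model-only limit on namespace nesting */

struct model_pid_ns {
	int level;			/* 0 for the initial namespace */
	struct model_pid_ns *parent;
};

struct model_upid {
	int nr;
	struct model_pid_ns *ns;
};

struct model_pid {
	int level;			/* index of the deepest entry */
	struct model_upid numbers[MODEL_MAX_LEVEL];
};

/* Like the ..._nr() family: the global value, from the initial namespace. */
int model_pid_nr(const struct model_pid *pid)
{
	return pid ? pid->numbers[0].nr : 0;
}

/* Like the ..._vnr() family: the value the task itself sees. */
int model_pid_vnr(const struct model_pid *pid)
{
	return pid ? pid->numbers[pid->level].nr : 0;
}

/* Like the ..._nr_ns() family: the value seen from ns, or 0 if invisible. */
int model_pid_nr_ns(const struct model_pid *pid, const struct model_pid_ns *ns)
{
	if (pid && ns->level <= pid->level &&
	    pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}
```

For a task numbered 1024 in the initial namespace and 256 in a child namespace, model_pid_nr() yields 1024 and model_pid_vnr() yields 256. model_pid_nr_ns() yields 1024 or 256 depending on which of the two namespaces is passed in, and 0 for an unrelated sibling namespace.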

Conclusion

The interface as described here has been merged for the 2.6.24 kernel release. It has, however, been marked as "experimental" to prevent its wide deployment by distributors while some remaining issues are worked out. Few, if any, changes to this API are expected between now and when the "experimental" tag is removed in a later kernel release.



Posted Nov 6, 2010 20:05 UTC (Sat) by arekm (subscriber, #4846)

Suppose there is a child PID namespace with some processes running inside it.

How can I find the mapping between PIDs in the parent PID namespace and PIDs in the child PID namespace?

I need the parent-namespace PID of a process running in the child PID namespace, but I only know its PID inside the child.