
Optimizing CPU hotplug locking

By Jonathan Corbet
October 9, 2013
The 3.12 development cycle has seen an increased level of activity around scalability and, in particular, the reduction of locking overhead. Traffic on the linux-kernel mailing list suggests that this work will extend into 3.13, if not beyond. One of several patch sets currently under development relates to CPU hotplugging — the process of adding CPUs to (or removing them from) a running system.

CPU hotplugging adds complications to a number of kernel subsystems; the fact that processors can come and go at arbitrary times must always be taken into account. Needless to say, hotplug operations must be restricted to times when the kernel is prepared for them; to that end, the kernel provides a reference count mechanism to allow any thread to block CPU hotplugging. The reference count is raised with get_online_cpus() to indicate that the set of online CPUs should not be changed; the reference count is decremented with put_online_cpus().

The implementation of get_online_cpus() in current kernels is relatively straightforward:

    mutex_lock(&cpu_hotplug.lock);
    cpu_hotplug.refcount++;
    mutex_unlock(&cpu_hotplug.lock);

Code that is managing an actual hotplug operation will acquire cpu_hotplug.lock (after waiting for the reference count to drop to zero if need be) and hold it for the duration of the operation. This mechanism ensures that no thread will see a change in the set of active CPUs while it holds a reference, but there is a bit of a problem: each reference-count change causes the cache line containing the lock and the count to bounce around the system. Since calls to get_online_cpus() and put_online_cpus() can happen frequently in the core kernel, this bouncing can be hard on performance.
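
For reference, put_online_cpus() performs the matching decrement under the same mutex and, if a writer is waiting on the count, wakes it. The writer side, in simplified form, waits for the count to drop and then keeps the mutex held for the duration of the operation; a rough sketch of the existing cpu_hotplug_begin() (with details and error handling omitted) looks like:

    void cpu_hotplug_begin(void)
    {
        cpu_hotplug.active_writer = current;
        for (;;) {
            mutex_lock(&cpu_hotplug.lock);
            /* No readers hold references: keep the lock for the operation */
            if (likely(!cpu_hotplug.refcount))
                break;
            /* Otherwise sleep and retry once a reader wakes us */
            __set_current_state(TASK_UNINTERRUPTIBLE);
            mutex_unlock(&cpu_hotplug.lock);
            schedule();
        }
    }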

The really sad fact in this case, though, is that CPU hotplug events are exceedingly rare; chances are that, in most running systems, there will never be a hotplug event until the system shuts down. This kind of pattern argues for a different approach to locking, where the common case is made as fast as possible. That is exactly what Peter Zijlstra's CPU hotplug locking patch set sets out to do. To reach that goal, Peter has had to create a custom locking mechanism — a practice that is frowned upon whenever it can be avoided — and incorporate a new RCU-based synchronization mechanism as well. The patch series shows the evolution of this approach; this article will follow the same path.

The new locking scheme

Peter's patch adds a couple of new variables related to CPU hotplugging:

  • __cpuhp_refcount is the new form of the reference count controlling hotplug operations. Unlike its predecessor, though, it is a per-CPU variable, so each CPU can tweak its own count without causing cache-line contention.

  • __cpuhp_state is an enum with three values: readers_fast, readers_slow, and readers_block.
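
Sketched as declarations, these two variables amount to something like the following (a simplified rendering; the exact types and initializers used in the patch may differ):

    /* Per-CPU reader count: the fast path touches no shared cache line */
    DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);

    /* Global state observed by readers; starts out as readers_fast */
    enum { readers_fast = 0, readers_slow, readers_block } __cpuhp_state;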

"Readers," in the context of this locking mechanism, are threads that call get_online_cpus(); they need the set of online CPUs to stay stable but make no changes to it. A "writer," instead, is a thread executing an actual CPU hotplug operation.

The state starts out as readers_fast, an indication that no CPU hotplugging activity is going on and that, thus, readers can take the fast path through the locking code. With that in mind, here is a simplified form of the core of the new get_online_cpus():

    if (likely(__cpuhp_state == readers_fast))
	__this_cpu_inc(__cpuhp_refcount);
    else
	__get_online_cpus();

So, when things are in the readers_fast state, get_online_cpus() turns into a simple, per-CPU increment operation, with no cache-line contention. Otherwise the slow-path code (found in __get_online_cpus()) must be run. The put_online_cpus() code looks similar; when no CPUs are coming or going, all that is needed is a per-CPU decrement operation.
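
The matching fast path in put_online_cpus() would look like this (a sketch mirroring the fragment above):

    if (likely(__cpuhp_state == readers_fast))
        __this_cpu_dec(__cpuhp_refcount);
    else
        __put_online_cpus();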

When it is time to add or remove a CPU, the hotplug code will make a call to cpu_hotplug_begin(). This function begins with these three lines of code:

    __cpuhp_state = readers_slow;
    synchronize_sched();
    __cpuhp_state = readers_block;

The assignment to __cpuhp_state puts an end to the fast-path reference count operations. A call to synchronize_sched() (a read-copy-update primitive that waits for each CPU to schedule at least once) is necessary to ensure that no thread is still running in the fast-path code in either get_online_cpus() or put_online_cpus(). Once that condition is assured, the state is changed again to readers_block. That will cause new readers to block (as described below), but there may still be old readers running, so the cpu_hotplug_begin() call will block until all of the per-CPU reference counts fall to zero.
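
That wait amounts to summing the per-CPU counts until the total reaches zero; conceptually, something like this (a sketch: the helper and wait-queue names are illustrative, and the memory-ordering details are omitted):

    static bool cpuhp_readers_active_check(void)
    {
        unsigned int sum = 0;
        int cpu;

        /* Add up every CPU's reader count; zero means no readers remain */
        for_each_possible_cpu(cpu)
            sum += per_cpu(__cpuhp_refcount, cpu);

        return sum == 0;
    }

    /* In cpu_hotplug_begin(), after the state changes shown above */
    wait_event(cpuhp_writer, cpuhp_readers_active_check());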

At this point, it is worth looking at what happens in the __get_online_cpus() slow path. If that code sees __cpuhp_state as readers_slow, it will simply increment the per-CPU reference count and return in the usual manner; it is still possible to obtain a reference in this state. If, instead, it sees readers_block, it will increment an (atomic) count of waiting threads, then block on a wait queue without raising the reference count. The __put_online_cpus() slow path is simpler: it decrements the reference count as usual, then calls wake_up() to wake any thread that might be waiting in cpu_hotplug_begin().
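
Put into rough code form, the two slow paths behave as follows (a sketch based on the description above; cpuhp_waitcount, cpuhp_readers, and cpuhp_writer are illustrative names):

    void __get_online_cpus(void)
    {
        for (;;) {
            if (__cpuhp_state != readers_block) {
                /* readers_slow: a reference can still be taken */
                __this_cpu_inc(__cpuhp_refcount);
                return;
            }
            /* A writer is active: register as waiting and block */
            atomic_inc(&cpuhp_waitcount);
            wait_event(cpuhp_readers, __cpuhp_state != readers_block);
            if (atomic_dec_and_test(&cpuhp_waitcount))
                wake_up(&cpuhp_writer);
        }
    }

    void __put_online_cpus(void)
    {
        __this_cpu_dec(__cpuhp_refcount);
        /* The writer may be waiting for the reference counts to drain */
        wake_up(&cpuhp_writer);
    }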

Returning to that function: cpu_hotplug_begin() will return to its caller once all references have been returned (all of the per-CPU reference counts have dropped to zero). At that point, it is safe to carry out the CPU hotplug event, changing the set of online CPUs; afterward, a call is made to cpu_hotplug_done(). That function reverses what was done in cpu_hotplug_begin() in the following way:

    __cpuhp_state = readers_slow;
    wake_up_all(&cpuhp_readers);
    synchronize_sched();
    __cpuhp_state = readers_fast;

It will then wait until the count of waiting readers drops to zero before returning. This wait (like the entire hotplug operation) is done holding the global hotplug mutex, so, while that wait is happening, no other CPU hotplug operations can begin.
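
In the same illustrative terms used above, that final wait is simply (a sketch):

    /* Let every reader that was blocked get through before the next writer */
    wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));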

This code raises some interesting questions, starting with: why does cpu_hotplug_done() have to set the state to readers_slow, rather than re-enabling the fast paths immediately? The purpose here is to ensure that any new readers that come along will see all of the changes made by the writer while readers were blocked. The extra memory barriers in the slow path will ensure that all CPUs see the new state of the world correctly. The synchronize_sched() call is needed to ensure that any thread that might try to block will have done so; that means, among other things, that the count of waiting readers will be complete.

Why does cpu_hotplug_begin() explicitly block all readers? This behavior turns the CPU hotplug locking mechanism into one that is biased toward writers; the moment a writer comes along, new readers are blocked almost immediately. Things are done this way because there could be a lot of readers in a large and busy system; if they cannot be blocked, writers could be starved indefinitely. Given that CPU hotplug operations are so rare, there should be no real performance issues resulting from blocking readers and allowing hotplug operations to proceed as soon as possible.

What is the purpose of the count of waiting readers? A single writer can put readers on hold, but those readers should be allowed to proceed before a second hotplug operation can be carried out. By waiting for the count to drop to zero, cpu_hotplug_done() ensures that every reader that was blocked will be able to proceed before the next writer clogs up the works again.

The end result of all this work is that, most of the time, the locking overhead associated with get_online_cpus() will be replaced by a fast, per-CPU increment operation. There is a cost paid in the form of more complex locking code and, perhaps, more expensive hotplug operations, but a CPU hotplug event is not something that needs to be optimized for. So it seems like a net win.

rcu_sync

Interestingly enough, though, Peter's patch still wasn't fast enough for some people. In particular, the synchronize_sched() calls were seen as being too expensive. To address this problem, Oleg Nesterov put together a patch adding a new "rcu_sync" mechanism. In brief, the API looks like:

	struct rcu_sync_struct;

	void rcu_sync_enter(struct rcu_sync_struct *rss);
	void rcu_sync_exit(struct rcu_sync_struct *rss);

	bool rcu_sync_is_idle(struct rcu_sync_struct *rss);

An rcu_sync structure starts out in the "idle" state; it can be moved out of that state with one or more rcu_sync_enter() calls. When an equal number of rcu_sync_exit() calls have been made, the structure will test as idle again. The state changes are made using RCU so that, in particular, rcu_sync_exit() works via an ordinary RCU callback rather than calling synchronize_sched().

To use this infrastructure with CPU hotplugging, Peter defined the "idle" state as meaning that no hotplug operations are underway; then, calls to rcu_sync_is_idle() can replace tests against the readers_fast state described above — and the synchronize_sched() calls as well. That should make things faster — though the extent of the speedup is not entirely clear.
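
In terms of the fragments shown earlier, the substitution looks roughly like this (a sketch assuming a single global rcu_sync structure for hotplug; cpuhp_rss is an illustrative name):

    static struct rcu_sync_struct cpuhp_rss;

    /* Reader fast path: "idle" means no hotplug operation is in progress */
    if (rcu_sync_is_idle(&cpuhp_rss))
        __this_cpu_inc(__cpuhp_refcount);
    else
        __get_online_cpus();

    /* Writer side: enter/exit bracket the hotplug operation; rcu_sync_exit()
       returns the structure to idle via an RCU callback rather than a call
       to synchronize_sched() */
    rcu_sync_enter(&cpuhp_rss);
    /* ... carry out the CPU hotplug operation ... */
    rcu_sync_exit(&cpuhp_rss);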

After all this work is done, a simple mutex-protected reference count has been replaced by a few hundred lines of complex, one-off locking code. In the process, the kernel has gotten a little bit harder to understand. This new complexity is unfortunate, but it seems to be an unavoidable by-product of the push for increased scalability. Getting the best performance out of a highly concurrent system can only be made so simple.

Index entries for this article
Kernel/Hotplug
Kernel/Read-copy-update/rcu_sync
Kernel/Scalability



Optimizing CPU hotplug locking

Posted Oct 9, 2013 19:18 UTC (Wed) by mikemol (guest, #83507) (5 responses)

I would love to see CPU hotplugging as part of being a virtualization guest. If I could take a KVM guest and permit the hot-add of CPU cores during load spikes (call it an ephemeral core policy, if you like), or shift core allotments around as load patterns change around the clock, well, that'd be pretty nice.

Optimizing CPU hotplug locking

Posted Oct 9, 2013 20:09 UTC (Wed) by busterb (subscriber, #560) (1 responses)

Can't KVM and other virtualization layers emulate CPUs in the first place? If so, would the problem be simplified by simply giving the VM, say, 32 virtual CPUs, backed by one real CPU? Then when load increases, the number of real CPUs is increased. The virtualized OS would not need to know about hotplug at all, since it always thinks it has 32 CPUs.

Optimizing CPU hotplug locking

Posted Oct 9, 2013 20:56 UTC (Wed) by mikemol (guest, #83507)

That results in increased inefficiency in the form of context switching on the VM host, exactly as if you were to run thirty-two CPU-intensive threads on a fewer-core machine.

Optimizing CPU hotplug locking

Posted Oct 9, 2013 21:30 UTC (Wed) by pbonzini (subscriber, #60935) (2 responses)

That's available in KVM indeed, starting with QEMU 1.5 (it's all in userspace so any kernel version will do).

Optimizing CPU hotplug locking

Posted Oct 9, 2013 21:35 UTC (Wed) by mikemol (guest, #83507) (1 responses)

Any kernel version for host, I presume. How does it register with the guest? (I'll admit to not knowing how CPU hotplug events logically propagate through x86 systems...)

Optimizing CPU hotplug locking

Posted Oct 10, 2013 7:49 UTC (Thu) by pbonzini (subscriber, #60935)

It's all ACPI magic.

Optimizing CPU hotplug locking

Posted Oct 9, 2013 19:58 UTC (Wed) by peterzzz (guest, #70147)

You forgot to point out that the fast-path test of __cpuhp_state and the inc/dec of __cpuhp_refcount needs to be done with preemption disabled; otherwise the sync_sched()s on the write side could be satisfied by scheduling in between those two statements and things would still go haywire.

Optimizing CPU hotplug locking

Posted Oct 9, 2013 20:20 UTC (Wed) by peterzzz (guest, #70147) (4 responses)

So the point of the rcu_sync primitive is to optimise the write side of things; in particular it can do away with one or both. In the case where there's a single usage of rcu_sync_begin()...rcu_sync_exit() the user on the exit side will not have to wait for sync_sched() to complete.

What happens is that we use call_rcu_sched() to schedule a callback after the appropriate GP which sets gp_state == GP_IDLE; ie. enables the fast path again.

When a new rcu_sync_begin() happens before that callback fires, we can avoid the sync_sched() because we never enabled the fast path, so we don't have to wait for all those to have happened either; in which case we've avoided waiting on both sync_sched() calls.

Optimizing CPU hotplug locking

Posted Oct 9, 2013 20:22 UTC (Wed) by peterzzz (guest, #70147) (3 responses)

Obviously I meant to say: one or both .. sync_sched() calls.

rcu_sync

Posted Oct 9, 2013 20:31 UTC (Wed) by corbet (editor, #1) (2 responses)

The article did say "In particular, the synchronize_sched() calls were seen as being too expensive." Was there something that I said wrong there?

rcu_sync

Posted Oct 9, 2013 20:45 UTC (Wed) by peterzzz (guest, #70147) (1 responses)

I mostly replied to: "That should make things faster — though the extent of the speedup is not entirely clear." and tried to explain.

rcu_sync

Posted Oct 11, 2013 21:02 UTC (Fri) by giraffedata (guest, #1954)

Since the premise of the design, according to the article, is that CPU hotplug events are exceedingly rare, how could anyone see the synchronize_sched() calls as being too expensive? They happen twice per CPU hotplug event, right? How is anyone supposed to notice the improvement made by using rcu_sync?

Optimizing CPU hotplug locking

Posted Oct 10, 2013 18:03 UTC (Thu) by pwsan (subscriber, #56604) (2 responses)

Minor nit. The article states:

> CPU hotplug events are exceedingly rare; chances are that, in most running systems, there will never be a hotplug event until the system shuts down.

Considering that the majority of Linux devices are ARM-based Android systems, and that many of these use CPU hotplug to bring CPU cores up and down while the system is running, the above statement is probably inaccurate.

However it's probably true to write that calls to {get,put}_online_cpus() are frequent with respect to actual hotplug operations, so even for most Linux systems in use, the patch set certainly seems to be an improvement.

Optimizing CPU hotplug locking

Posted Oct 11, 2013 16:05 UTC (Fri) by hamjudo (guest, #363) (1 responses)

Performance on bringing up a CPU matters a lot, because something is waiting to use that CPU. Likewise, the rest of the system is probably doing important stuff, so the locking interaction should minimize performance hit on other processes.

Conversely, when shutting down a CPU because it isn't needed, the rest of the system is probably relatively idle. All that matters is how many milliwatt hours get used.

It's more complex than that because, once a processor shutdown is started, you have to wait until it is entirely shut down before you can bring it back up. If shutdowns take so long that it becomes common to choose to bring processors back up before the shutdown has completed, then shutdown time becomes a fraction of startup time. Also, processors may get shut down for thermal or power management reasons, in which case performance matters.

From a practical point of view, this code will be running on my cellphone years before it gets to my employer's data center.

Optimizing CPU hotplug locking

Posted Oct 11, 2013 16:40 UTC (Fri) by pwsan (subscriber, #56604)

What you write is true. It's worth noting though that the hotplug lock overhead of CPU hot adds or removes is a tiny fraction of the total duration of the operation. It's dwarfed by the time spent migrating tasks and timers, calling the notifier chains, etc.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds