Kernel development
Brief items
Kernel release status
The current development kernel is 3.12-rc4, released on October 6. Linus's comments were: "Hmm. rc4 has more new commits than rc3, which doesn't make me feel all warm and fuzzy, but nothing major really stands out. More filesystem updates than normal at this stage, perhaps, but I suspect that is just happenstance. We have cifs, xfs, btrfs, fuse and nilfs2 fixes here."
Stable updates: 3.11.4, 3.10.15, 3.4.65, and 3.0.99 were released on October 5.
Greg has included a warning that the long-lived 3.0 series will be coming to a close "within a few weeks", so users of that kernel should be thinking about moving on.
Quotes of the week
Then, wondering if you can blame someone else for not fixing it up since then, you run 'scripts/get_maintainer.pl' and realize that you are the maintainer for it as well.
Time to just back away slowly from the keyboard and forget I ever even opened those files...
Kernel development news
Kernel address space layout randomization
Address-space layout randomization (ASLR) is a well-known technique to make exploits harder by placing various objects at random, rather than fixed, addresses. Linux has long had ASLR for user-space programs, but Kees Cook would like to see it applied to the kernel itself as well. He outlined the reasons why, along with how his patches work, in a Linux Security Summit talk. We looked at Cook's patches back in April, but things have changed since then; the code was based on the original proposal from Dan Rosenberg back in 2011.
Attacks
There is a classic structure to many attacks against the kernel, Cook said. An attacker needs to find a bug by inspecting kernel code, noticing something in the patch stream, or following CVEs. The attacker can then use that bug to insert malicious code into the kernel address space by various means and redirect the kernel's execution to that code. One of the easiest ways to get root privileges is to execute two simple functions as follows:
    commit_creds(prepare_creds());
The existence of those functions has made things "infinitely easier for an attacker", he said. Once the malicious code has been run, the exploit will then clean up after itself. As an example, he pointed to Rosenberg's RDS protocol local privilege escalation exploit.
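To see why fixed symbol locations matter, consider a hedged sketch of such a payload; the function-pointer addresses below are invented for illustration, but in a real exploit they would be recovered for the specific target kernel (from System.map, /proc/kallsyms, or similar leaks):

    /*
     * Illustrative sketch only: injected code calls the two credential
     * functions at their (here made-up) known addresses, giving the
     * current process root credentials.
     */
    struct cred;
    typedef struct cred *(*prepare_creds_fn)(void);
    typedef int (*commit_creds_fn)(struct cred *);

    static void payload(void)
    {
        /* Addresses known in advance for a specific, non-randomized kernel build. */
        prepare_creds_fn prepare_creds = (prepare_creds_fn)0xffffffff81095f10;
        commit_creds_fn  commit_creds  = (commit_creds_fn)0xffffffff81095c30;

        commit_creds(prepare_creds());    /* current task now runs with root credentials */
    }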
These kinds of attacks rely on knowing where symbols of interest live in the kernel's address space. Those locations change between kernel versions and distribution builds, but are known (or can be figured out) for a particular kernel. ASLR disrupts that process and adds another layer of difficulty to an attack.
ASLR in user space randomizes the location of various parts of an executable: stack, mmap region, heap, and the program text itself. Attacks have to rely on information leaks to get around ASLR. By exploiting some other bug (the leak), the attack can find where the code of interest has been loaded.
Randomizing the kernel's location
Cook's kernel ASLR (KASLR) currently only randomizes where the kernel code (i.e. text) is placed at boot time. KASLR "has to start somewhere", he said. In the future, randomizing additional regions is possible as well.
There are a number of benefits to KASLR. One side effect has been moving the interrupt descriptor table (IDT) away from the rest of the kernel to a location in read-only memory. The unprivileged SIDT instruction can be used to get the location of the IDT, which could formerly have been used to figure out where the kernel code was located. Now it can't be used that way because the IDT is elsewhere, but it is also protected from overwrite because it is read-only.
ASLR is a "statistical defense", because brute force methods can generally be used to overcome it. If there are 1000 locations where the item of interest could reside, brute force will find it once and fail 999 times. In user space, that failure will lead to a crash of the program, but that may not raise the kind of red flags that crashing 999 machines would. The latter is the likely outcome from a wrong brute force guess against KASLR.
On the other hand, KASLR is not compatible with hibernation (i.e. suspend to disk). That is a solvable problem, Cook said, but is not interesting to him. The amount of space available for the kernel text to move around in is another problem. The code must be 2M aligned because of page table restrictions, and the space available is 2G. In a "perfect world", that would give 1024 slots for the location. In the real world, it turns out to be a fair amount less.
There are also some steps that need to be taken to protect against information leaks that can be used to determine where the kernel was loaded. The kptr_restrict sysctl should be enabled so that kernel pointers are not leaked to user space. Similarly, dmesg_restrict should be used, since dmesg often contains addresses or other information that an attacker could use. Also, log files (like /var/log/messages) should be readable only by root.
The last source of leaks he mentioned is conceptually easy to fix, but has run into resistance from the network subsystem maintainers. The INET_DIAG socket API uses the address of a kernel object as a handle. That address is opaque to user space, but it is a real kernel pointer, so it can be used to determine the kernel location. Changing it to some obfuscated value would fix the problem, but the network maintainers are not willing to do so, he said.
In a completely unconfined system, especially one with local untrusted users, KASLR is not going to be very useful, Cook said. But, on systems that use containers or have heavily contained processes, KASLR can help. For example, the renderer process in the Chrome browser is contained using the seccomp-BPF sandbox, which restricts an exploit to the point where it shouldn't be able to get the information needed. It is also useful to protect against attacks via remote services since there are "many fewer leaks" available remotely.
Implementation
KASLR has been added to Chrome OS, Cook said. It is in the Git tree for the distribution's kernel and will be rolled out in the stable release soon. He has a reputation for "bringing disruptive security changes to people who did not necessarily want them", he said with a smile, but KASLR was actually the "least problematic" of his proposed changes. Part of the reason for that is that "several other very smart people" have helped, including Rosenberg, other Google developers, and folks on the kernel mailing list.
Cook's patches change the boot process so that it determines the lowest safe address where the kernel could be placed. It then walks the e820 regions counting kernel-sized slots. From those, it chooses a slot randomly using the best random number source available. Depending on the system, that would be from the RDRAND instruction, the low bits from a RDTSC (time stamp counter), or bits from the timer I/O ports. After that, it decompresses the kernel, handles the relocation, and starts the kernel.
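The slot-selection step can be pictured with a simplified sketch (a reconstruction from the description above, not the actual patch; the region table and the random value stand in for the real e820 walk and the RDRAND/RDTSC/I/O-port entropy sources):

    #define SLOT_ALIGN  0x200000UL   /* kernel text must be 2MB aligned */
    #define MAX_SLOTS   1024

    struct mem_region { unsigned long start, end; };   /* stand-in for an e820 entry */

    static unsigned long choose_kaslr_base(const struct mem_region *regions, int nr_regions,
                                           unsigned long min_addr, unsigned long kernel_size,
                                           unsigned long random)
    {
        unsigned long slots[MAX_SLOTS];
        int i, nr_slots = 0;

        for (i = 0; i < nr_regions; i++) {
            unsigned long addr = regions[i].start;

            if (addr < min_addr)
                addr = min_addr;
            /* round up to the next 2MB boundary */
            addr = (addr + SLOT_ALIGN - 1) & ~(SLOT_ALIGN - 1);

            /* record every kernel-sized slot in this region */
            while (addr + kernel_size <= regions[i].end && nr_slots < MAX_SLOTS) {
                slots[nr_slots++] = addr;
                addr += SLOT_ALIGN;
            }
        }
        /* pick one slot using the best available entropy */
        return nr_slots ? slots[random % nr_slots] : min_addr;
    }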
The patches are currently only for 64-bit x86, though Cook plans to look at ARM next. He knows a "lot less" about ARM, though, so he is hoping that he can "trick someone into helping me", he said.
The current layout of the kernel's virtual address space only leaves 512M for the kernel code—and 1.5G for modules. Since there is no need for that much module space, his patches reduce that to 1G, leaving 1G for the kernel, thus 512 possible slots (as it needs to be 2M aligned). The number of slots may increase when the modules' location is added to KASLR.
A demonstration of three virtual machines, with one running a "stock" kernel and two running the KASLR code, was up next. Looking at /proc/kallsyms and /sys/kernel/debug/kernel_page_tables on each showed different addresses. Cook said that he was unable to find a measurable performance impact from KASLR.
The difference in addresses makes panics harder to decode, so the offset of the slot used to locate the kernel has been added to that output. He emphasized that information leaks are going to be more of a problem for KASLR-enabled systems, noting that it is somewhat similar to Secure Boot now making a distinction between root and kernel ring 0. For the most part, developers didn't care about kernel information leaks, but that needs to change.
There are some simple steps developers can take to avoid leaking kernel addresses, he said. Using the "%pK" format for printing addresses will show regular users 0, while root still sees the real address (if kptr_restrict is enabled, otherwise everyone sees the real addresses). The contents of dmesg need to be protected using dmesg_restrict and the kernel should not be using addresses as handles. All of those things will make KASLR an effective technique for thwarting exploits—at least in restricted environments.
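As a minimal example of the first suggestion (the buffer name here is arbitrary), a driver that wants to log an address without leaking it to unprivileged users would do something like:

    /* Prefer %pK over %p for kernel addresses: with kptr_restrict
     * enabled, unprivileged readers see zeros while root still sees
     * the real pointer value. */
    pr_info("widget buffer at %pK\n", buf);
    /* pr_info("widget buffer at %p\n", buf);  -- would leak the address to everyone */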
[I would like to thank LWN subscribers for travel assistance to New Orleans for LSS.]
Optimizing CPU hotplug locking
The 3.12 development cycle has seen an increased level of activity around scalability and, in particular, the reduction of locking overhead. Traffic on the linux-kernel mailing list suggests that this work will extend into 3.13, if not beyond. One of several patch sets currently under development relates to CPU hotplugging — the process of adding CPUs to (or removing them from) a running system.
CPU hotplugging adds complications to a number of kernel subsystems; the fact that processors can come and go at arbitrary times must always be taken into account. Needless to say, hotplug operations must be restricted to times when the kernel is prepared for them; to that end, the kernel provides a reference count mechanism to allow any thread to block CPU hotplugging. The reference count is raised with get_online_cpus() to indicate that the set of online CPUs should not be changed; the reference count is decremented with put_online_cpus().
The implementation of get_online_cpus() in current kernels is relatively straightforward:
    mutex_lock(&cpu_hotplug.lock);
    cpu_hotplug.refcount++;
    mutex_unlock(&cpu_hotplug.lock);
Code that is managing an actual hotplug operation will acquire cpu_hotplug.lock (after waiting for the reference count to drop to zero if need be) and hold it for the duration of the operation. This mechanism ensures that no thread will see a change in the set of active CPUs while it holds a reference, but there is a bit of a problem: each reference-count change causes the cache line containing the lock and the count to bounce around the system. Since calls to get_online_cpus() and put_online_cpus() can happen frequently in the core kernel, this bouncing can be hard on performance.
The really sad fact in this case, though, is that CPU hotplug events are exceedingly rare; chances are that, in most running systems, there will never be a hotplug event until the system shuts down. This kind of pattern argues for a different approach to locking, where the common case is as fast as it can be made to be. That is exactly what Peter Zijlstra's CPU hotplug locking patch set sets out to do. To reach that goal, Peter has had to create a custom locking mechanism — a practice which is frowned upon whenever it can be avoided — and incorporate a new RCU-based synchronization mechanism as well. The patch series shows the evolution of this approach; this article will follow in the same path.
The new locking scheme
Peter's patch adds a couple of new variables related to CPU hotplugging:
- __cpuhp_refcount is the new form of the reference count
controlling hotplug operations. Unlike its predecessor, though, it is
a per-CPU variable, so each CPU can tweak its own count without
causing cache-line contention.
- __cpuhp_state is an enum with three values: readers_fast, readers_slow, and readers_block.
"Readers," in the context of this locking mechanism, are threads that call get_online_cpus(); they need the set of online CPUs to stay stable but make no changes to it. A "writer," instead, is a thread executing an actual CPU hotplug operation.
The state starts out as readers_fast, an indication that no CPU hotplugging activity is going on and that, thus, readers can take the fast path through the locking code. With that in mind, here is a simplified form of the core of the new get_online_cpus():
    if (likely(__cpuhp_state == readers_fast))
        __this_cpu_inc(__cpuhp_refcount);
    else
        __get_online_cpus();
So, when things are in the readers_fast state, get_online_cpus() turns into a simple, per-CPU increment operation, with no cache-line contention. Otherwise the slow-path code (found in __get_online_cpus()) must be run. The put_online_cpus() code looks similar; when no CPUs are coming or going, all that is needed is a per-CPU decrement operation.
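A correspondingly simplified sketch of that put_online_cpus() fast path (reconstructed from the description, not taken from the patch) would be:

    if (likely(__cpuhp_state == readers_fast))
        __this_cpu_dec(__cpuhp_refcount);
    else
        __put_online_cpus();    /* slow path: drop the reference and wake any waiting writer */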
When it is time to add or remove a CPU, the hotplug code will make a call to cpu_hotplug_begin(). This function begins with these three lines of code:
    __cpuhp_state = readers_slow;
    synchronize_sched();
    __cpuhp_state = readers_block;
The assignment to __cpuhp_state puts an end to the fast-path reference count operations. A call to synchronize_sched() (a read-copy-update primitive that waits for each CPU to schedule at least once) is necessary to ensure that no thread is still running in the hot-path code in either get_online_cpus() or put_online_cpus(). Once that condition is assured, the state is changed again to readers_block. That will cause new readers to block (as described below), but there may still be old readers running, so the cpu_hotplug_begin() call will block until all of the per-CPU reference counts fall to zero.
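That final wait might look, in a much-simplified form, like the following (the wait-queue name is a guess; the point is simply that the writer sums the per-CPU counts and sleeps until they reach zero):

    /* Simplified sketch: are any readers still holding references? */
    static bool cpuhp_readers_active(void)
    {
        unsigned int sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
            sum += per_cpu(__cpuhp_refcount, cpu);
        return sum != 0;
    }

    /* ... in cpu_hotplug_begin(), after setting readers_block: */
    wait_event(cpuhp_writer_wq, !cpuhp_readers_active());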
At this point, it is worth looking at what happens in the __get_online_cpus() slow path. If that code sees __cpuhp_state as readers_slow, it will simply increment the per-CPU reference count and return in the usual manner; it is still possible to obtain a reference in this state. If, instead, it sees readers_block, it will increment an (atomic) count of waiting threads, then block on a wait queue without raising the reference count. The __put_online_cpus() slow path is simpler: it decrements the reference count as usual, then calls wake_up() to wake any thread that might be waiting in cpu_hotplug_begin().
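Putting those pieces together, a rough reconstruction of the reader slow path (again simplified from the description rather than quoted from the patch; the waiting-reader counter name is invented here) might look like:

    void __get_online_cpus(void)
    {
        for (;;) {
            if (__cpuhp_state == readers_slow) {
                __this_cpu_inc(__cpuhp_refcount);    /* references may still be taken */
                return;
            }
            /* readers_block: register as a waiter and sleep without taking a reference */
            atomic_inc(&cpuhp_waitcount);
            wait_event(cpuhp_readers, __cpuhp_state != readers_block);
            atomic_dec(&cpuhp_waitcount);
        }
    }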
Returning to that function: cpu_hotplug_begin() will return to its caller once all references have been returned (all of the per-CPU reference counts have dropped to zero). At that point, it is safe to carry out the CPU hotplug event, changing the set of online CPUs; afterward, a call is made to cpu_hotplug_done(). That function reverses what was done in cpu_hotplug_begin() in the following way:
    __cpuhp_state = readers_slow;
    wake_up_all(&cpuhp_readers);
    synchronize_sched();
    __cpuhp_state = readers_fast;
It will then wait until the count of waiting readers drops to zero before returning. This wait (like the entire hotplug operation) is done holding the global hotplug mutex, so, while that wait is happening, no other CPU hotplug operations can begin.
This code raises some interesting questions, starting with: why does cpu_hotplug_done() have to set the state to readers_slow, rather than re-enabling the fast paths immediately? The purpose here is to ensure that any new readers that come along will see all of the changes made by the writer while readers were blocked. The extra memory barriers in the slow path will ensure that all CPUs see the new state of the world correctly. The synchronize_sched() call is needed to ensure that any thread that might try to block will have done so; that means, among other things, that the count of waiting readers will be complete.
Why does cpu_hotplug_begin() explicitly block all readers? This behavior turns the CPU hotplug locking mechanism into one that is biased toward writers; the moment a writer comes along, new readers are blocked almost immediately. Things are done this way because there could be a lot of readers in a large and busy system; if they cannot be blocked, writers could be starved indefinitely. Given that CPU hotplug operations are so rare, there should be no real performance issues resulting from blocking readers and allowing hotplug operations to proceed as soon as possible.
What is the purpose of the count of waiting readers? A single writer can put readers on hold, but those readers should be allowed to proceed before a second hotplug operation can be carried out. By waiting for the count to drop to zero, cpu_hotplug_done() ensures that every reader that was blocked will be able to proceed before the next writer clogs up the works again.
The end result of all this work is that, most of the time, the locking overhead associated with get_online_cpus() will be replaced by a fast, per-CPU increment operation. There is a cost paid in the form of more complex locking code and, perhaps, more expensive hotplug operations, but a CPU hotplug event is not something that needs to be optimized for. So it seems like a net win.
rcu_sync
Interestingly enough, though, Peter's patch still wasn't fast enough for some people. In particular, the synchronize_sched() calls were seen as being too expensive. To address this problem, Oleg Nesterov put together a patch adding a new "rcu_sync" mechanism. In brief, the API looks like:
    struct rcu_sync_struct;

    void rcu_sync_enter(struct rcu_sync_struct *rss);
    void rcu_sync_exit(struct rcu_sync_struct *rss);
    bool rcu_sync_is_idle(struct rcu_sync_struct *rss);
An rcu_sync structure starts out in the "idle" state; it can be moved out of that state with one or more rcu_sync_enter() calls. When an equal number of rcu_sync_exit() calls have been made, the structure will test as idle again. The state changes are made using RCU so that, in particular, rcu_sync_exit() works via an ordinary RCU callback rather than calling synchronize_sched().
To use this infrastructure with CPU hotplugging, Peter defined the "idle" state as meaning that no hotplug operations are underway; then, calls to rcu_sync_is_idle() can replace tests against the readers_fast state described above — and the synchronize_sched() calls as well. That should make things faster — though the extent of the speedup is not entirely clear.
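As a hedged illustration (simplified from the description above, not taken from the actual patches; the rcu_sync structure name is invented here), the reader fast path might then look something like:

    static struct rcu_sync_struct cpuhp_rss;    /* "idle" means no hotplug operation is under way */

    void get_online_cpus(void)
    {
        rcu_read_lock_sched();
        if (rcu_sync_is_idle(&cpuhp_rss))
            __this_cpu_inc(__cpuhp_refcount);   /* fast path: plain per-CPU increment */
        else
            __get_online_cpus();                /* slow path, as before */
        rcu_read_unlock_sched();
    }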
After all this work is done, a simple mutex-protected reference count has been replaced by a few hundred lines of complex, one-off locking code. In the process, the kernel has gotten a little bit harder to understand. This new complexity is unfortunate, but it seems to be an unavoidable by-product of the push for increased scalability. Getting the best performance out of a highly concurrent system can only be made so simple.
The Android Graphics microconference
At the 2013 Linux Plumbers Conference, a number of upstream community developers and Google Android developers got together for the Android + Graphics micro-conference to discuss several aspects of the Android graphics stack, how collaboration could be improved, and how this functionality could be merged upstream.
Sync
Erik Gilling of Google's Android team started things off by covering some background on the Android Sync infrastructure, why it was introduced and why it's important to Android.
Sync is important because it allows for better exploitation of parallelism in graphics and media pipelines. Normally you can think of a pipeline as a collection of different devices that are manipulating buffers in series. Each step would have to completely finish before passing the output buffer to the next step. However, in many cases, there is some overhead work that is required in each step that doesn't strictly need the data that is going to be processed. So you can try to do some of those overhead steps, like setting up the buffer to display, in parallel while the buffer is still being filled. However, there needs to be some interlocking so that one step can signal to the next that it is actually finished. This agreement between steps is the "synchronization contract."
Android through version 3.0 (Honeycomb) didn't use any sort of explicit fencing. Instead, the synchronization contract was implicit and not necessarily well defined. This caused problems as driver writers often misunderstood or mis-implemented the synchronization contract, leading to very difficult-to-debug issues. Additionally, because the contract was implicit and its implementation was spread across a number of drivers, some of them proprietary, it was very difficult to make changes to that contract in order to improve performance.
To address these issues, Android's explicit synchronization mechanism was implemented. Sync is a synchronization framework that allows SurfaceFlinger (Android's display manager) to establish a timeline and set sync points on that timeline. Other threads and drivers can then block on a sync point and will wait until the timeline counter has crossed that point. There can be multiple timelines, managed by various drivers; the Sync interface allows for merging sync points from different timelines. This is how the SurfaceFlinger and BufferQueue processes manage the synchronization contract across a number of drivers and processes.
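The timeline/sync-point idea can be restated as a purely illustrative model (this is not the Android kernel API, just the concept in code): a timeline is a counter, a sync point is a value on one timeline, and a merged point is signaled only when all of its component points are.

    /* Illustrative model only, not the Android sync API. */
    struct timeline { unsigned int counter; };                        /* advanced by the producing driver */
    struct sync_pt  { const struct timeline *tl; unsigned int value; };

    static int pt_signaled(const struct sync_pt *pt)
    {
        /* has the timeline's counter passed this point yet? */
        return pt->tl->counter >= pt->value;
    }

    static int merged_signaled(const struct sync_pt *a, const struct sync_pt *b)
    {
        /* a point merged from two timelines waits on both of them */
        return pt_signaled(a) && pt_signaled(b);
    }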
In describing the various sync timelines and the ability to merge sync points on different timelines, Maarten Lankhorst, author of dmabuf fences and wait/wound mutexes, got into a discussion about whether circular deadlocks were possible with Android's sync framework. Erik believed they were not, and made a convincing case, but he admitted that he has not had to deal with systems that have video RAM (which has complicated locking requirements that led to the need for wait/wound mutexes), so the result of the discussion wasn't entirely clear.
Tom Cooksey from ARM mentioned that, in graphics development, trying to debug issues related to why things didn't happen in the order they were expected is really hard, and that the Android Sync debugging infrastructure makes this much easier. Maarten noted that, for dma-buf fences, the in-kernel lockdep infrastructure can also be used to prove locking correctness. But, it was pointed out, that only works because fences are not held across system calls.
There was also some question of how to handle the unwinding of hardware fences and other error conditions, which Erik thought should be resolved by resetting the GPU. Rob Clark thought that wasn't a very good solution; he worries that resetting the GPU can take a long time and might interfere with concurrent users of the GPU.
In trying to figure out what the next steps would be, Rob said he didn't have any objection to adding sync point arguments to the various functions, as long as they were optional. He thought that the explicit sync points could either be built on top of dma-buf fences, or otherwise fall back to dma-buf fences. Erik mentioned that while the Android sync points aren't tied to anything specific in the kernel, they are really only used for graphics buffers, so he thought tying sync points to dma-bufs might be doable. There didn't seem to be any objections to this approach, but it also wasn't clear that all sides were in agreement, so folks promised to continue the discussion on unifying the lower-level primitives on the mailing list.
The atomic display framework
Greg Hackmann from the Android team then discussed the atomic display framework (ADF) which was developed while he was trying to develop a version of HWComposer based on the kernel mode setting (KMS) interface. During that development, Greg ran into a number of limitations and issues with KMS, so he developed ADF as a simple display framework built on top of dma-buf and Android Sync. Thus ADF represents somewhat of an ideal interface for Android, and Greg wanted to see whether folks thought the KMS interface could be extended to provide the same functionality.
One of the limitations discussed was the absence of atomic screen updates. There is the out-of-tree atomic mode setting and nuclear pageflip patch set [PDF], but in that implementation updates to the screen are done by deltas, updating just a portion of the screen. Android, instead, prefers to update the entire screen to reduce the amount of state that needs to be kept as well as to avoid problems with some system-on-chip (SoC) hardware.
There is also the problem that KMS does not handle ganged CRTCs (CRT controllers that generate output streams to displays), split-out planes, or custom pixel formats well, and that the modeling primitives KMS uses to manage the hardware don't map very well to some of the new SoCs. Further, there wasn't a good way to allow KMS to exchange sync-point data.
In addition, Greg sees the KMS interface as being fairly complex, requiring drivers to implement quite a bit of boilerplate code and to handle many cases that aren't very common. The concern is that if the API is large, SoC driver writers will only write the parts they immediately need and will likely make mistakes on the edge cases. There was some discussion that maybe KMS could use some helper functions, like the fbdev (Linux framebuffer device) helpers in the DRM layer which automatically provide an fbdev interface for DRM drivers.
As a result, ADF's design is a bit simplified, representing displays as a collection of overlay engines and interfaces which can be interconnected in any configuration the hardware supports. ADF uses a structure that wraps dma-buf handles with Sync data and formatting metadata. ADF then does sanity checks on buffer lengths and pixel formatting, deferring to driver-specific validation if custom formats are in use. ADF also manages any waiting that is required on the sync handle before flipping the page, and provides driver hooks for mode-setting and events like DPMS changes and vsync signals.
Rob noted that issues like split-out planes or custom pixel formats are solvable in the KMS API, and in many cases he has specific plans to do so. For others, like ganged CRTCs, he is hesitant and wants more information about the hardware before deciding how best to add the requisite functionality.
There was some minor debate about how ADF tends to allow blobs of data to be passed through it from user space to drivers, requiring hardware-specific user-space code. This functionality makes it harder to support other display managers — Wayland, for example — that depend on a hardware-independent interface. Rob noted that, for the most part, Wayland is very similar to SurfaceFlinger, but maybe just a few years behind when it comes to things like ganged CRTCs, and that improvements are needed. He was also mindful of the desire for KMS to be generic and for Weston's user space to be hardware-independent; there may need to be some cases with hardware-specific plugins, but those will need to fall back to a slower, generic implementation.
Folks from the Android team pointed out that it really is hard to anticipate all the constraints and how weird the hardware ends up being. So the issue of where to draw the line between generic interfaces and hardware-specific ones seemed unresolved. However, ADF does allow for a simple non-accelerated recovery console, which would be generic.
There was also further discussion of how the atomic mode setting work does partial updates or deltas while ADF focuses on full updates. Since Wayland is very similar to SurfaceFlinger, partial updates are not of much use there; it is mostly X that benefits from them. There was some question of whether X should really be targeted for atomic mode setting, but Rob said that, while X isn't a target for some things like overlays, it likely will use atomic mode setting. There was also some question as to what a "full frame update" entails, and whether it means updating things like gamma tables as well, as that can be slow on some hardware.
Other KMS extensions
Laurent Pinchart walked through a number of other proposed extensions to KMS. The first was non-memory-backed pipeline sources. Here the issue is that there can be complicated video pipelines where a capture device can write both to memory and directly to the display at the same time. KMS doesn't really model this well, and Laurent would like to see some sort of API to handle this case. There was some back and forth with Rob as to whether the DRM framebuffer objects would mostly suffice.
The next issue was memory writeback, where the composited frames are written to memory instead of being sent to a display, and what the right API is for this. On one hand this looks very similar to video capture, so the proposal was to use Video4Linux (V4L) device nodes. Mirroring earlier issues raised, Greg noted that in many situations it's just a lot simpler to write a custom driver that does the bare minimum of what is needed than to try to implement the entire V4L interface. Driver writers are often under pressure, so they're unlikely to do the right thing if it requires writing lots of extra code. Hans Verkuil, the maintainer of V4L, expressed his exasperation with people who go off and do their own thing, needlessly reinventing the wheel, and said that he is very open to addressing problems and improving things. Rob again suggested that V4L may need some internal refactoring and helper functions to make it easier to write a driver.
There were also discussions on chained compositors, non-linear pipelines and root planes that don't span the whole display, but it didn't seem like there was much resolution to the issues discussed. Hans finally asked that folks mail the linux-media mailing list, as the V4L developers would be interested in working and collaborating to resolve some of these issues.
Deprecating fbdev
The next, short, topic was deprecating the fbdev interface and whether doing so would impact Android, which is a common user of fbdev. For Android, fbdev is particularly useful for very early hardware bringup and recovery consoles. Greg pointed out that Google was able to bring up a Nexus 10 without fbdev using ADF, so this wouldn't be a problem for them, assuming the issues in KMS that ADF worked around were resolved.
ION
The discussion then turned to the ION memory allocator. With much of the background for the topic being covered on LWN, I summarized some of the recent responses from the community and the current thinking from the Android developers and asked what would be reasonable next steps to upstreaming ION. The upstream developers were suggesting that the dma-buf delayed allocation method be used, where user space would attach the dma-buf to all the various devices and allow the allocation to happen at map time.
One problem with this approach that the Android developers saw was that it can have permissions issues. It would require one process that has permissions to all the various devices to do the attach; the Android developers would prefer to avoid having one process with permissions to everything, and instead minimize the permissions needed. The way ION currently works is by having gralloc just know the memory constraints for all the devices, but it doesn't need permissions to all of those devices to allocate buffers. Those buffers can then be passed between various processes that have only the limited device permissions for their specific needs.
With respect to the approach of trying to do the constraints solving in kernel space, Colin Cross, an Android developer, brought up that the SoC constraints are insane, and the Android team sees dynamic constraint solving as an impossible problem. One just cannot generally enumerate the various constraints in a sane way, and trying to then map to specific allocators which satisfy those constraints will always be messy. Additionally, since the desktop use case has a more dynamic environment, there may not exist a solution to the given constraints, and in that case one needs to fall back to a slow path where buffers are copied between device-compatible memory regions. The point was made that for Android, slow paths are not an option, and because of that they expect a level of control made possible by customizing the entire stack to each device.
The solution Android took with ION was to provide known allocators (heaps) and then create a device-specific, custom mapping from device to allocator in user space via the custom gralloc implementations. Colin admitted there would always be some messiness and that the Android developers prefer that messiness to exist in user space.
As for next steps, it was proposed that, instead of trying to enumerate generic constraints in the dma-buf parameters structure, we could have a table of allocators and have each device link to its compatible allocators at attach time. That way we could use the same heap allocators that ION uses and just have two different ways of triggering that allocation. This would allow some shared infrastructure if there couldn't be an agreed-upon top-level interface for the allocations. Rob seemed to be agreeable, but Colin brought up the downside that, by enumerating allocators per device, when new heaps are added, we would have to go through all the drivers and update their heap compatibility. This is a fair point, but it is the cost of doing constraint solving in the kernel, rather than via custom user-space code, and as long as there was still a direct ION-like interface, it wouldn't have any negative consequences for Android.
Another interesting result was that Colin has been busy addressing some of the other issues with ION that were brought up in the LWN summary and in other discussions. It seems likely that ION will be submitted to staging so that the transition to using shared heaps can be more centrally coordinated.
I'd like to thank all of the discussion participants, along with Laurent Pinchart and Jesse Barker for co-organizing the micro-conference, and Zach Pfeffer for his diligent note taking.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Miscellaneous