
Kernel development

Brief items

Kernel release status

The 4.2 merge window is still open and will probably remain open until July 5.

Stable updates: 4.1.1, 4.0.7, 3.14.46, and 3.10.82 were released on June 29. The 3.14.47 and 3.10.83 updates are in the review process as of this writing; they can be expected on or after July 3.


Quote of the week

Alternatively, RCU -is- abuse. Anyone who tries to tell you otherwise simply lacks proper respect for and adoration of traditional synchronization mechanisms.
Paul McKenney


Kernel development news

4.2 Merge window part 2

By Jonathan Corbet
July 1, 2015
Last week's merge window summary noted that "things are just getting started." Since then, Linus has pulled nearly 8,000 non-merge changesets into the mainline repository, making the total for the 4.2 merge window just short of 11,000. It is fair to say that things have truly started now.

Whether 4.2 will, as Linus recently predicted, be the busiest development cycle ever remains to be seen. That record is currently held by 3.15, which saw just over 12,000 changesets pulled during its merge window. Things seem to be winding down on the 4.2 front, so it may not exceed 3.15's totals. There is no doubt, though, that this is a busy development cycle.

Some of the more interesting changes merged since last week's summary include:

  • The Linux security module stacking patches have been merged, finally giving the kernel the ability to combine security modules in a generic manner.

  • A new packet classifier called "Flower" has been added. With Flower, "you will be able to classify packets based on a configurable combination of packet keys and masks." This classifier appears to be entirely lacking in documentation, unfortunately.

  • A driver for GENEVE (Generic Network Virtualization Encapsulation) tunnels has been added to the networking subsystem.

  • The netfilter subsystem has gained support for ingress-time packet classification.

  • Unix-domain sockets now support the splice() system call.

  • Support for the delay-gradient congestion-control algorithm has been merged.

  • The F2FS filesystem has gained support for a number of features including per-file encryption and the FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE fallocate() operations.

  • The ext4 filesystem now supports the FALLOC_FL_INSERT_RANGE operation, which inserts a range of zero-filled space into an existing file.

  • Support for the Renesas H8/300 architecture was removed a couple of years ago due to lack of maintenance; that code has now been rewritten, fixed up, and merged back into the kernel.

  • The control group writeback patches have been merged. This work allows for better control of data writeback within control groups, fixing an area that has not worked well for a long time.

  • The device mapper's dm-cache module has gained support for stochastic multi-queue caching. See the commit message for details.

  • The thermal control subsystem has a new power-allocator governor, designed to divide power among heat sources while keeping the overall temperature of the system within bounds. See Documentation/thermal/power_allocator.txt for information on how it works.

  • The XFS filesystem has gained the ability to directly access persistent-memory devices via the DAX interface.

  • The CIFS filesystem can now handle (in an experimental mode) version 3.1.1 of the SMB protocol.

  • New hardware support includes:

    • Systems and processors: Socionext UniPhier SoCs.

    • Audio: Texas Instruments TAS5711/TAS5717/TAS5719 power amplifiers, ZTE zx296702 I2S digital audio interfaces, and Mediatek AFE PCM controllers.

    • Block: Broadcom STB AHCI SATA controllers and CEVA AHCI SATA controllers.

    • Graphics: Samsung Exynos SoC mobile image compressors. Support has been merged for new AMD GPUs; it currently supports AMD "Tonga," "Iceland," and "Carrizo" systems. The driver is not small, clocking in at well over 400,000 lines of code. There is also a new "virtio-gpu" driver that implements a kernel-mode-setting virtual GPU for use within virtual machines.

    • I2C: APM X-Gene SoC SLIMpro I2C controllers, MediaTek I2C adapters, and Broadcom Settop I2C controllers.

    • IIO: Sensortek STK3310 ambient-light and proximity sensors, Sensortek STK8312 and STK8BA50 3-axis accelerometers, Bosch BMC150 magnetometers, MEMSIC MMC35240 3-axis magnetic sensors, ROHM BH1750 ambient-light sensors, ACPI ambient-light sensors, Mitsubishi M62332 digital-to-analog converters, and Marvell Berlin2 analog-to-digital converters.

    • Media: DataTranslation DT3155 frame grabbers (staging graduation), Conexant CX24120-based tuners, Cisco Cobalt video cards, and STMicroelectronics BDISP 2D blitters.

    • Miscellaneous: APM X-Gene SoC error-detection and correction units, TI DRV2665 haptic controllers, Dialog DA9063 power buttons, Broadcom BCM2835 mailboxes, Mediatek MT6397 realtime clocks, Cortina Gemini SoC realtime clocks, NVIDIA Tegra124 external memory controllers, ARM CCI500 performance-monitoring units, Qualcomm power-management units, Allwinner SoC SRAM controllers, CoreSight Embedded Trace Macrocell 4.x tracer modules, Unisys visorbus devices, and Himax HX8357D LCD controllers.

    • Networking: Cavium ThunderX network controllers, Mellanox Technologies ConnectX-4 interfaces, ATUSB transceivers, Texas Instruments DP83867 Gigabit PHYs, Mediatek MT7601U-based wireless USB dongles, Cavium LiquidIO Intelligent Server adapters, Renesas Electronics Ethernet AVB controllers, EZchip Ethernet interfaces, Atmel WILC1000 wireless interfaces, and Unisys visornic interfaces.

    • Pin control: NXP LPC18xx/43xx system control units, Freescale IMX7D pin controllers, IMG Pistachio SoC pin controllers, Mediatek MT6397 pin controllers, CSR SiRFatlas7 pinmux controllers, Allwinner a33 SoC pin controllers, Qualcomm 8660 pin controllers, and Renesas R8A7794 SoC pin controllers.

    • Thermal: Intel Quark digital temperature sensors, Qualcomm SPMI PMIC temperature alarms, and Hisilicon thermal sensors.

    • TTY: UniPhier on-chip UARTs, NXP LPC18xx/43xx serial ports, and STMicroelectronics STM32 serial ports.

    • USB: TI TUSB1210 ULPI PHY modules, Broadcom STB SATA PHYs, Marvell USB 2.0 28nm PHYs, Marvell USB HSIC 28nm PHYs, and IMG Pistachio USB2.0 PHYs.

Changes visible to kernel developers include:

  • The networking subsystem's "page fragment" allocator has been moved into the memory-management subsystem. Page fragments are arbitrarily sized chunks of memory, allocated for short periods of time. They are allocated and freed with:

        void *__alloc_page_frag(struct page_frag_cache *nc, unsigned int fragsz,
                                gfp_t gfp_mask);
        void __free_page_frag(void *addr);

  • The kernel self-tests subsystem has gained tests for the futex and secure computing ("seccomp") subsystems.

  • There is a new kernel subsystem called "libnvdimm," which provides various types of access to non-volatile memory (NVM) arrays. At the lowest level, it is able to enumerate arrays found on the system and present them as devices to the rest of the system. There are some add-on layers as well:

    • Some new functions have been introduced for the safe writing of data to NVM devices: memremap_pmem(), memcpy_to_pmem(), and wmb_pmem(). Their purpose is to ensure that data has actually been written to the device and is not lurking in a processor or bus cache somewhere — that it's actually persistent, in other words.

    • "Regions" are ranges of NVM that are separated out into separate virtual devices; they may span more than one physical device.

    • The "BLK" driver provides a mode of access where only a small "window" into an NVM device is mapped into the kernel's address space at any given time. The primary purpose for this mode appears to be to minimize the chances of persistent-memory corruption via writes to random pointers.

    • The "BTT" (block translation table) module adds an indirection layer to provide for atomic, sector-sized access to NVM arrays. The idea is to prevent the creation of corrupted filesystem blocks should the system go down in the middle of a block-write operation.

    See Documentation/nvdimm/nvdimm.txt for lots of details.

By the usual schedule, the merge window should stay open until July 5. Check back next week for a summary of the final commits merged for this development cycle.


RCU-walk: faster pathname lookup in Linux

July 1, 2015

This article was contributed by Neil Brown

Previously in this series we explored the pathname-lookup procedures in Linux (and the REF-walk mechanism in particular), which are complex because there are numerous details and special cases. This time we are looking at a different part of the process which is complex for a different reason. The read-copy-update-based RCU-walk mechanism avoids some details and cases by simply refusing to handle them; instead, it just falls back to REF-walk when it runs into something it cannot deal with. It remains hard to understand, though, because it is so unfamiliar. In an earlier article discussing this unfamiliarity I suggested that:

In human relationships a friendship can blossom more quickly if a mutual friend acts to introduce two parties and start them out on a sound footing.

A couple of months ago, Al Viro — the maintainer of the Linux VFS layer — provided that introduction when he took the time, during a discussion of some possible changes, to put together a brief overview of the goals and mechanisms of RCU-walk. He also suggested that it "probably needs to be turned into coherent text" and be placed in Documentation/filesystems/. Not being one to turn down the opportunity to translate brief notes and C code into English, I took on the challenge. The first part of this series provided the context against which RCU-walk can make sense. The next part will detail the changes to symlink handling that were the concrete outcome of that discussion. This is the part where we make friends with RCU-walk.

RCU-walk is an algorithm for performing pathname lookup in Linux. It is in many ways similar to REF-walk, which we met last time, and the two share quite a bit of code. The significant difference in RCU-walk is how it allows for the possibility of concurrent access.

Clear demarcation of roles

The easiest way to manage concurrency is to forcibly stop any other thread from changing the data structures that a given thread is looking at. In cases where no other thread would even think of changing the data and lots of different threads want to read at the same time, this can be very costly. Even when using locks that permit multiple concurrent readers, the simple act of updating the count of the number of current readers can impose an unwanted cost. So the goal when reading a shared data structure that no other process is changing is to avoid writing anything to memory at all. Take no locks, increment no counts, leave no footprints.

The REF-walk mechanism already described certainly doesn't follow this principle, but then it is really designed to work when there may well be other threads modifying the data. RCU-walk, in contrast, is designed for the common situation where there are lots of frequent readers and only occasional writers. This may not be common in all parts of the filesystem tree, but in many parts it will be. For the other parts it is important that RCU-walk can quickly fall back to using REF-walk.

Pathname lookup always starts in RCU-walk mode but only remains there as long as what it is looking for is in the cache and is stable. It dances lightly down the cached filesystem image, leaving no footprints and carefully watching where it is, to be sure it doesn't trip. If it notices that something has changed or is changing, or if something isn't in the cache, then it tries to stop gracefully and switch to REF-walk.

This stopping requires getting a counted reference on the current vfsmount and dentry, and ensuring that these are still valid — that a path walk with REF-walk would have found the same entries. This is an invariant that RCU-walk must guarantee. It can only make decisions, such as selecting the next step, that are decisions which REF-walk could also have made if it were walking down the tree at the same time. If the graceful stop succeeds, the rest of the path is processed with the reliable, if slightly sluggish, REF-walk. If RCU-walk finds it cannot stop gracefully, it simply gives up and restarts from the top with REF-walk.

This pattern of "try RCU-walk, if that fails try REF-walk" can be clearly seen in functions like filename_lookup(), filename_parentat(), filename_mountpoint(), do_filp_open(), and do_file_open_root(). These five correspond roughly to the four path_* functions we met last time, each of which calls link_path_walk(). The path_* functions are called using different mode flags until a mode is found which works. They are first called with LOOKUP_RCU set to request "RCU-walk". If that fails with the error ECHILD they are called again with no special flag to request "REF-walk". If either of those reports the error ESTALE a final attempt is made with LOOKUP_REVAL set (and no LOOKUP_RCU) to ensure that entries found in the cache are forcibly revalidated — normally entries are only revalidated if the filesystem determines that they are too old to trust.

The LOOKUP_RCU attempt may drop that flag internally and switch to REF-walk, but will never then try to switch back to RCU-walk. Places that trip up RCU-walk are much more likely to be near the leaves and so it is very unlikely that there will be much, if any, benefit from switching back.

RCU and seqlocks: fast and light

RCU is, unsurprisingly, critical to RCU-walk mode. The rcu_read_lock() is held for the entire time that RCU-walk is walking down a path. The particular guarantee it provides is that the key data structures — dentries, inodes, super_blocks, and mounts — will not be freed while the lock is held. They might be unlinked or invalidated in one way or another, but the memory will not be repurposed, so values in various fields will still be meaningful. This is the only guarantee that RCU provides; everything else is done using seqlocks.

As we saw last time, REF-walk holds a counted reference to the current dentry and the current vfsmount, and does not release those references before taking references to the "next" dentry or vfsmount. It also sometimes takes the d_lock spinlock. These references and locks are taken to prevent certain changes from happening. RCU-walk must not take those references or locks and so cannot prevent such changes. Instead, it checks to see if a change has been made, and aborts or retries if it has.

To preserve the invariant mentioned above (that RCU-walk may only make decisions that REF-walk could have made), it must make the checks at or near the same places that REF-walk holds the references. So, when REF-walk increments a reference count or takes a spinlock, RCU-walk samples the status of a seqlock using read_seqcount_begin() or a similar function. When REF-walk decrements the count or drops the lock, RCU-walk checks if the sampled status is still valid using read_seqcount_retry() or similar.

However, there is a little bit more to seqlocks than that. If RCU-walk accesses two different fields in a seqlock-protected structure, or accesses the same field twice, there is no a-priori guarantee of any consistency between those accesses. When consistency is needed — which it usually is — RCU-walk must take a copy and then use read_seqcount_retry() to validate that copy.

read_seqcount_retry() not only checks the sequence number, but also imposes a memory barrier so that no memory-read instruction from before the call can be delayed until after the call, either by the CPU or by the compiler. A simple example of this can be seen in slow_dentry_cmp() which, for filesystems which do not use simple byte-wise name equality, calls into the filesystem to compare a name against a dentry. The length and name pointer are copied into local variables, then read_seqcount_retry() is called to confirm the two are consistent, and only then is ->d_compare() called. When standard filename comparison is used, dentry_cmp() is called instead. Notably it does not use read_seqcount_retry(), but instead has a large comment explaining why the consistency guarantee isn't necessary. A subsequent read_seqcount_retry() will be sufficient to catch any problem that could occur at this point.

With that little refresher on seqlocks out of the way we can look at the bigger picture of how RCU-walk uses seqlocks.

mount_lock and nd->m_seq

We already met the mount_lock seqlock when REF-walk used it to ensure that crossing a mount point is performed safely. RCU-walk uses it for that too, but for quite a bit more.

Instead of taking a counted reference to each vfsmount as it descends the tree, RCU-walk samples the state of mount_lock at the start of the walk and stores this initial sequence number in the struct nameidata in the m_seq field. This one lock and one sequence number are used to validate all accesses to all vfsmounts and all mount point crossings. As changes to the mount table are relatively rare, it is reasonable to fall back on REF-walk any time that any "mount" or "unmount" happens.

m_seq is checked (using read_seqretry()) at the end of an RCU-walk sequence, whether switching to REF-walk for the rest of the path or when the end of the path is reached. It is also checked when stepping down over a mount point (in __follow_mount_rcu()) or up (in follow_dotdot_rcu()). If it is ever found to have changed, the whole RCU-walk sequence is aborted and the path is processed again by REF-walk.

If RCU-walk finds that mount_lock hasn't changed then it can be sure that, had REF-walk taken counted references on each vfsmount, the results would have been the same. This ensures the invariant holds, at least for vfsmount structures.

dentry->d_seq and nd->seq

In place of taking a count or lock on d_lockref, RCU-walk samples the per-dentry d_seq seqlock, and stores the sequence number in the seq field of the nameidata structure, so nd->seq should always be the current sequence number of nd->dentry. This number needs to be revalidated after copying, and before using, the name, parent, or inode of the dentry.

The handling of the name we have already looked at, and the parent is only accessed in follow_dotdot_rcu() which fairly trivially follows the required pattern, though it does so for three different cases.

When not at a mount point, d_parent is followed and its d_seq is collected. When we are at a mount point, we instead follow the mnt->mnt_mountpoint link to get a new dentry and collect its d_seq. Then, after finally finding a d_parent to follow, we must check if we have landed on a mount point and, if so, must find that mount point and follow the mnt->mnt_root link. This would imply a somewhat unusual, but certainly possible, circumstance where the starting point of the path lookup was in part of the filesystem that was mounted on, and so not visible from the root.

The inode pointer, stored in ->d_inode, is a little more interesting. The inode will always need to be accessed at least twice, once to determine if it is NULL and once to verify access permissions. Symlink handling requires a validated inode pointer too. Rather than revalidating on each access, a copy is made on the first access and it is stored in the inode field of nameidata from where it can be safely accessed without further validation.

lookup_fast() is the only lookup routine that is used in RCU-mode, lookup_slow() being too slow and requiring locks. It is in lookup_fast() that we find the important "hand over hand" tracking of the current dentry.

The current dentry and current seq number are passed to __d_lookup_rcu() which, on success, returns a new dentry and a new seq number. lookup_fast() then copies the inode pointer and revalidates the new seq number. It then validates the old dentry with the old seq number one last time and only then continues. This process of getting the seq number of the new dentry and then checking the seq number of the old exactly mirrors the process of getting a counted reference to the new dentry before dropping that for the old dentry which we saw in REF-walk.

No inode->i_mutex or even rename_lock

A mutex is a fairly heavyweight lock that can only be taken when it is permissible to sleep. As rcu_read_lock() forbids sleeping, inode->i_mutex plays no role in RCU-walk. If some other thread does take i_mutex and modifies the directory in a way that RCU-walk needs to notice, the result will be either that RCU-walk fails to find the dentry that it is looking for, or it will find a dentry which read_seqretry() won't validate. In either case it will drop down to REF-walk mode which can take whatever locks are needed.

Though rename_lock could be used by RCU-walk as it doesn't require any sleeping, RCU-walk doesn't bother. REF-walk uses rename_lock to protect against the possibility of hash chains in the dcache changing while they are being searched. This can result in failing to find something that actually is there. When RCU-walk fails to find something in the dentry cache, whether it is really there or not, it already drops down to REF-walk and tries again with appropriate locking. This neatly handles all cases, so adding extra checks on rename_lock would bring no significant value.

unlazy_walk() and complete_walk()

That "dropping down to REF-walk" typically involves a call to unlazy_walk(), so named because "RCU-walk" is also sometimes referred to as "lazy walk". unlazy_walk() is called when following the path down to the current vfsmount/dentry pair seems to have proceeded successfully, but the next step is problematic. This can happen if the next name cannot be found in the dcache, if permission checking or name revalidation couldn't be achieved while the rcu_read_lock() is held (which forbids sleeping), if an automount point is found, or in a couple of cases involving symlinks. It is also called from complete_walk() when the lookup has reached the final component, or the very end of the path, depending on which particular flavor of lookup is used.

Other reasons for dropping out of RCU-walk that do not trigger a call to unlazy_walk() are when some inconsistency is found that cannot be handled immediately, such as mount_lock or one of the d_seq seqlocks reporting a change. In these cases the relevant function will return -ECHILD which will percolate up until it triggers a new attempt from the top using REF-walk.

For those cases where unlazy_walk() is an option, it essentially takes a reference on each of the pointers that it holds (vfsmount, dentry, and possibly some symbolic links) and then verifies that the relevant seqlocks have not been changed. If there have been changes, it, too, aborts with -ECHILD, otherwise the transition to REF-walk has been a success and the lookup process continues.

Taking a reference on those pointers is not quite as simple as just incrementing a counter. That works to take a second reference if you already have one (often indirectly through another object), but it isn't sufficient if you don't actually have a counted reference at all. For dentry->d_lockref, it is safe to increment the reference counter to get a reference unless it has been explicitly marked as "dead" which involves setting the counter to -128. lockref_get_not_dead() achieves this.

For mnt->mnt_count it is safe to take a reference as long as mount_lock is then used to validate the reference. If that validation fails, it may not be safe to just drop that reference in the standard way of calling mntput() — an unmount may have progressed too far. So the code in legitimize_mnt(), when it finds that the reference it got might not be safe, checks the MNT_SYNC_UMOUNT flag to determine if a simple mntput() is correct, or if it should just decrement the count and pretend none of this ever happened.

Taking care in filesystems

RCU-walk depends almost entirely on cached information and often will not call into the filesystem at all. However there are two places, besides the already-mentioned component-name comparison, where the filesystem might be called into during RCU-walk, and it must know to be careful.

If the filesystem has non-standard permission-checking requirements — such as a networked filesystem which may need to check with the server — the i_op->permission interface might be called during RCU-walk. In this case an extra "MAY_NOT_BLOCK" flag is passed so that it knows not to sleep, but to return -ECHILD if it cannot complete promptly. i_op->permission is given the inode pointer, not the dentry, so it doesn't need to worry about further consistency checks. However if it accesses any other filesystem data structures, it must ensure they are safe to be accessed with only the rcu_read_lock() held. This typically means they must be freed using kfree_rcu() or similar.

If the filesystem may need to revalidate dcache entries, then d_op->d_revalidate may be called in RCU-walk too. This interface is passed the dentry but does not have access to the inode or the seq number from the nameidata, so it needs to be extra careful when accessing fields in the dentry. This extra care typically involves using ACCESS_ONCE() or the newer READ_ONCE() to access fields, and verifying the result is not NULL before using it. This pattern can be seen in nfs_lookup_revalidate().

A pair of patterns

In various places in the details of REF-walk and RCU-walk, and also in the big picture, there are a couple of related patterns that are worth being aware of.

The first is "try quickly and check, if that fails, try slowly". We can see that in the high-level approach of first trying RCU-walk and then trying REF-walk, and in places where unlazy_walk() is used to switch to REF-walk for the rest of the path. We also saw it last time in dget_parent() when following a ".." link. It tries a quick way to get a reference, then falls back to taking locks if needed.

The second pattern is "try quickly and check, if that fails, try again — repeatedly". This is seen with the use of rename_lock and mount_lock in REF-walk. RCU-walk doesn't make use of this pattern; if anything goes wrong it is much safer to just abort and try a more sedate approach.

The emphasis here is "try quickly and check". It should probably be "try quickly and carefully, then check". The fact that checking is needed is a reminder that the system is dynamic and only a limited number of things are safe at all. The most likely cause of errors in this whole process is assuming something is safe when in reality it isn't. Careful consideration of what exactly guarantees the safety of each access is sometimes necessary.

Next: symlinks

We have now covered nearly all of pathname lookup with the major missing part being symbolic links. The handling of symbolic links received a major rewrite recently so it deserves a thorough treatment. That treatment will form the bulk of the final article in this series which should appear in the next week or so.


Adding Processor Trace support to Linux

July 1, 2015

This article was contributed by Andi Kleen

Tracing is a technique that is used for both performance analysis and debugging. A tracer writes its data into a log as the program runs; these tracing events can later be analyzed to understand the program's execution. Linux has ftrace, which can log function calls and tracepoint data for the kernel. CPUs also support tracing mechanisms for different events. Processor Trace (PT) is a new tracing mechanism for Intel CPUs that traces branches executing on the CPU, which allows the reconstruction of the control flow of all executed code.

Intel CPUs have an older mechanism called BTS (Branch Trace Store) that also allowed branch tracing, but it was slow and not widely used. PT allows tracing with better performance and gathers additional information, such as timing. Intel Processor Trace is supported on Broadwell CPUs. More information can be found in the Processor Trace specification in the Intel Architecture Software Developer Manuals, volume 3, chapter 36.

Hardware tracing mechanisms were traditionally used from hardware debuggers. But it is also possible to integrate them into the operating system, which allows tight integration with its profiling and debugging mechanisms. This makes it possible to use different trace buffers for different processes, and to make the facility available for non-root users. More importantly, it also allows tracing on every system, without needing to attach special debugging hardware.

The 4.1 kernel has an implementation of Processor Trace in the kernel's perf_events subsystem that allows the use of PT through the perf interface. The user-tools support is not quite merged yet, but is expected for 4.2. The code was developed by Alex Shishkin and Adrian Hunter and was tested by Emmanuel Lucet.

Other architectures, such as ARM and MIPS, also have branch tracing mechanisms, but these were not supported by operating system drivers before. ARM is currently merging a separate driver for its CoreSight tracing, but that is using a different, non-perf interface.

Implementation

Tracing branches on a modern CPU that runs multiple cores at multi-GHz speeds can generate huge amounts of data. While the CPU can do the actual tracing with little overhead — hardware can do it in parallel — maintaining the bandwidth to a log buffer can be a challenge. PT uses aggressive compression to make this problem manageable: unconditional direct branches are not logged at all, conditional branches are compressed to a single bit (taken or not-taken), and CALL/RET instructions can be elided by maintaining a call stack in the decoder.

A special PT decoder later reads the compressed trace stream. It decodes the instructions in the executable. Every time it reaches a conditional or indirect branch it consults the trace information to decide on the branch target. Asynchronous events, such as interrupts or transaction aborts, are also reported, in addition to time and processor frequency. The information is sufficient to reconstruct the full control flow of the original program. The performance overhead is shifted from tracing to decoding time. Decoding is several orders of magnitude slower than tracing.

Standard perf PMU (Performance Monitoring Unit) drivers generate "perf events" in the kernel in a standardized format and provide them through a ring buffer to the "perf record" command in user space. Using this approach directly is not possible with high-volume hardware tracing, as a PT decoder running in the kernel would never be able to keep up with the realtime instruction stream.

The perf PT driver instead dumps the tracing data directly to a separate ring buffer and uses a decoder running later in user space to generate perf_events, which are then processed by the standard perf tool chain. The decoder also needs side-band information to identify the executables and associate different processes and threads with the right binaries. Perf already had events for the metadata information needed, as standard perf sampling also relies on user-space post-processing to associate samples with executables. When multiple threads are traced on the same CPU, it also needs context switch information, which is also available. PT decoding uses both the tracing buffers per logical CPU, and the side-band stream generated by perf.

User interface

The merging of the PT code was a long, drawn-out exercise that took about two years. The main sticking point was how to represent the extra trace buffer in the perf_events API. After multiple iterations, the final version uses an auxiliary data (AUX data) buffer associated with the main perf ring buffer, which is shared between kernel and user space via mmap(). The AUX buffer is allocated by mmap()ing an area above the original perf ring buffer. The perf driver then configures the tracing hardware to dump data into this buffer. When the buffer is filled up the perf_events kernel code sends an AUX data event to the main perf event ring buffer, which signals the amount of data available to "perf record".

In addition, normal perf events are used to collect the side-band data needed for decoding. The AUX data infrastructure is generic and can be used for other kinds of high-volume trace data.

After recording, the compressed data is decoded into standard perf events. This is done at "perf report" (or perf script/inject) time. Time stamps are used to synchronize the side-band data with the hardware trace stream. The side-band data determines the executables and threads, which the decoder needs to reconstruct the branch stream. The trace stream reports events in Time Stamp Counter (TSC) format, while perf events by default use the Linux trace clock; the kernel reports the necessary conversion factors and offsets. The decoder interpolates timing for branch events between time packets and uses the resulting estimated time stamps to synchronize with the tracepoint events for context switches and executable loads.
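The TSC-to-perf-clock conversion follows the scheme documented for the perf_event_mmap_page structure, where the kernel exports time_shift, time_mult, and time_zero calibration fields. A minimal sketch of that arithmetic (the calibration values below are made up for illustration):

```python
def tsc_to_perf_time(tsc, time_shift, time_mult, time_zero):
    """Convert a raw TSC value to perf-clock nanoseconds.

    The quotient and remainder are scaled separately, mirroring the
    conversion described in the perf_event_mmap_page documentation,
    which avoids overflowing 64-bit intermediates in C.
    """
    quot = tsc >> time_shift
    rem = tsc & ((1 << time_shift) - 1)
    return time_zero + quot * time_mult + ((rem * time_mult) >> time_shift)

# Hypothetical calibration values, for illustration only.
ns = tsc_to_perf_time(tsc=1 << 40, time_shift=10, time_mult=408, time_zero=0)
# → 438086664192
```

For 64-bit Python integers this is equivalent to `time_zero + ((tsc * time_mult) >> time_shift)`; the split form matches what a C implementation has to do.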

To walk the executables and find the static and conditional branches, the decoder uses Masami Hiramatsu's x86 instruction decoder from the kernel tree.

The PT decoder then generates perf "sample" events from the trace stream. It supports different sampling options, controlled by the new --itrace option:

    --itrace={letter}period
    i       synthesize "instructions" events
    b       synthesize "branches" events
    x       synthesize "transactions" events
    c       synthesize branches events (calls only)
    r       synthesize branches events (returns only)
    e       synthesize tracing error events
    d       create a debug log
    g[size] synthesize a call chain (use with i or x)
    period can be NUMBER unit (e.g. 10us)

By default instructions are sampled approximately every 100µs (with the default --itrace=i100us option).

Here is an example that records the execution of a simple program

    % perf record -e intel_pt//u ls
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.190 MB perf.data ]

After the recording, the trace needs to be decoded. perf report will run the decoder and generate a histogram of the samples.

    % perf report --stdio
    ...
    # Samples: 36  of event 'instructions:u'
    # Event count (approx.): 8613072
    #
    # Overhead  Command  Shared Object     Symbol                          
    # ........  .......  ................  ................................
    #
	27.78%  ls       libc-2.17.so      [.] get_next_seq                
		     |
		     ---get_next_seq
			__strcoll_l
    ...

The tracer recorded 8,613,072 instructions, and the decoder sampled them every 100µs. This is a somewhat faster sampling frequency than normal cycle sampling with perf record, which can give better results.

The tracing supports the usual perf filters, like /u for user code only, and /k for kernel code. There are also more tunables for the PMU that can be optionally specified between the "/" delimiters:

    tsc=0        Disable time stamps
    noretcomp    Disable return compression

In the example above, we used u to only record user space. Recording kernel space is also possible with the right permissions, but requires running the decoder as root.

Alternatively, we can look at the exact trace, for example for debugging or to find specific patterns for performance analysis. This can be done with perf script. Since the default perf script output is quite verbose, we limit the fields, but also add the source line where available. This traces the startup of the ls command, which begins by executing the dynamic linker (ld-2.17.so).

    % perf script -F time,comm,cpu,sym,dso,ip,srcline
		  ls [001] 454279.326516:                 0 [unknown] ([unknown])
		  ls [001] 454279.326516:      7fdeb58b1630 _start (/lib/x86_64-linux-gnu/ld-2.17.so)
      .:0
		  ls [001] 454279.326527:                 0 [unknown] ([unknown])
		  ls [001] 454279.326527:      7fdeb58b1633 _start (/lib/x86_64-linux-gnu/ld-2.17.so)
      .:0
		  ls [001] 454279.326527:      7fdeb58b4fbf _dl_start (/lib/x86_64-linux-gnu/ld-2.17.so)
      get-dynamic-info.h:44
		  ls [001] 454279.326532:                 0 [unknown] ([unknown])
		  ls [001] 454279.326532:      7fdeb58b4fc6 _dl_start (/lib/x86_64-linux-gnu/ld-2.17.so)
      rtld.c:385 
		  ls [001] 454279.326539:                 0 [unknown] ([unknown])
		  ls [001] 454279.326539:      7fdeb58b4fe1 _dl_start (/lib/x86_64-linux-gnu/ld-2.17.so)
      rtld.c:414
		  ls [001] 454279.326546:                 0 [unknown] ([unknown])
		  ls [001] 454279.326546:      7fdeb58b500e _dl_start (/lib/x86_64-linux-gnu/ld-2.17.so)
      get-dynamic-info.h:52
		  ls [001] 454279.326546:      7fdeb58b502f _dl_start (/lib/x86_64-linux-gnu/ld-2.17.so)
      get-dynamic-info.h:46
		  ls [001] 454279.326546:      7fdeb58b502f _dl_start (/lib/x86_64-linux-gnu/ld-2.17.so)
      get-dynamic-info.h:46
		  ls [001] 454279.326546:      7fdeb58b502f _dl_start (/lib/x86_64-linux-gnu/ld-2.17.so)
      get-dynamic-info.h:46
perf script outputs branches instead of instructions by default.

The format is the standard perf format, so existing scripts that process samples will work. For example, we could generate a Flame Graph from the PT data.

perf script also supports running Perl or Python scripts. We can use that to directly analyze a trace. This is important because instruction traces are often too long to be read by a human. It is also possible to use other tools that can process perf output to view the information.
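For samples like these, perf's Python interface invokes hook functions in the user's script; a minimal sketch that tallies samples per symbol might look like the following (the process_event() hook and the "symbol" key reflect the dictionaries perf passes to Python scripts, but treat the exact field names as an assumption to check against your perf version):

```python
from collections import Counter

# Tally of samples per resolved symbol, printed when the trace ends.
symbol_counts = Counter()

def tally(param_dict, counts=symbol_counts):
    """Record one sample, keyed by its resolved symbol name."""
    counts[param_dict.get("symbol", "[unknown]")] += 1

# Hooks invoked by "perf script -s thisscript.py":
def process_event(param_dict):
    tally(param_dict)

def trace_end():
    for sym, n in symbol_counts.most_common(10):
        print("%8d %s" % (n, sym))
```

Run with something like "perf script -s tally.py"; the counting logic itself is plain Python and independent of perf.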

Perf also has the ability to store trace data in a database. The export-to-postgresql.py script that is included with perf stores branch information in a PostgreSQL database for later analysis with standard database tools. The database pairs up calls and returns, making it possible to determine the lifetime and call stack of every function call.

Limitations for non-root users

Perf supports recording as non-root, but with some limitations. When tracing multiple threads on a single CPU, perf uses the same tracing buffer for all of them. To reconstruct the trace later, a context-switch side-band event is needed to find the correct executables; the PT decoder uses the sched:sched_switch tracepoint for this. Tracepoints traditionally require root privileges (unless perf security is disabled with kernel.perf_event_paranoid=-1). If the tracepoint is not available, newly created threads cannot be decoded, which implies that only a single-threaded program started by perf, or already-existing threads selected with --per-thread --uid/--pid/--tid, can be traced. In the future, a non-root context-switch event, such as the one proposed in this patch, would fix this problem.

Another issue is that the kernel code is self-modifying. To handle kernel code, the decoder has to see the patched "live" image through /proc/kcore, which is only available to root. The new perf-with-kcore.sh wrapper script allows saving kcore as part of a perf session, which allows the decoder to see the same state of the kernel code as during tracing.

Snapshot mode

The record example above attempts to store a complete trace of the program to the disk. Storing complete traces typically only works for programs with limited computation. For longer workloads, which do significant computation with many branches, the disk I/O bandwidth is eventually unable to keep up with the CPU, so there can be data loss.

An alternative is to store the trace as a ring buffer to memory only, and only stop and save the trace when an "event of interest" happens. Memory has enough bandwidth to keep up with the traces in most cases.

This is especially useful for debugging: we want the trace leading up to a bug to see how it happened. It is also useful for performance analysis; a common problem in both server and user-interface tuning, for example, is analyzing occasional latency spikes that cause visible delays to the user.

To handle this situation with PT, perf supports the new snapshot mode. Tracing is started in ring-buffer mode, continuously overwriting an in-memory buffer, while the perf metadata is saved to disk.

When the event of interest happens, we send a SIGUSR2 signal to the perf record process, which then dumps the in-memory trace buffer to the perf.data file.

So we need a program that sends the signal when something interesting happens. For example, we can use a simple REST key-value store that retrieves values from a database. Assume the database has an occasional latency glitch that makes a transaction take too long; we want to understand why. So we instrument the program to send the signal to perf when a database operation takes too long. We can then look at the trace later with perf script to find out what caused the latency spikes.

To avoid having to run the test program as root, we disable perf security and run the perf record as the same user ID as the test. Otherwise the test program could not send signals to perf.

    % sudo sysctl -w kernel.perf_event_paranoid=-1

Start the tracing, run the example server (simple-kv.py) in the background, and trigger a request:

    % perf record --snapshot -a -e intel_pt//u sleep 100 &
    % export PERFPID=$!
    % python simple-kv.py &
    % curl http://localhost:4000/foo
    % perf script -F time,comm,cpu,sym,dso,ip,srcline
    ... tracing output ...
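Inside the server, the instrumentation that triggers the snapshot can be quite small. A hedged sketch follows; the 100ms threshold is an illustrative assumption, and the send_signal parameter exists only to make the logic testable (in the real server it would default to os.kill() aimed at the pid from $PERFPID):

```python
import os
import signal
import time

THRESHOLD = 0.1  # seconds; illustrative latency budget

def timed_query(run_query, perf_pid, threshold=THRESHOLD,
                send_signal=os.kill):
    """Run a database operation; if it exceeds the latency budget,
    tell the waiting "perf record --snapshot" to dump its buffer."""
    start = time.monotonic()
    result = run_query()
    elapsed = time.monotonic() - start
    if elapsed > threshold:
        # Dump the in-memory PT ring buffer to perf.data.
        send_signal(perf_pid, signal.SIGUSR2)
    return result
```

Each slow operation then leaves a snapshot covering the code that ran just before the latency spike, which perf script can decode afterward.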

Debugging

When we use tracing for debugging, the obvious point of interest is a crash or abort. With tracing information we gain the ability to do limited backward debugging. The original PT patch kit had built-in support for core dumps: a new rlimit allowed enabling tracing in ring-buffer mode for a program, and the trace would then be automatically written to the core dump file. The PT code merged in 4.1 does not yet have this capability, but it will hopefully be re-added eventually.

This raises the question of what to do with the core dump. The original PT patch kit API was supported by GDB, which could both read PT core dumps and do live backward debugging of control flow. The debugger lets you virtually move within the trace and examine the program state (as far as it is available) at different points in the past. GDB could already do this using record-and-replay and the older Branch Trace Store mechanism, but both were extremely slow; PT will be a faster and more functional alternative.

A version of GDB supporting the current auxtrace kernel API is currently in development by Markus Metzger. For decoding PT, GDB uses the libipt PT decoder, which provides a C API for decoding that can be also used for other PT tools.

Conclusion

Hardware-based branch tracing is a powerful technique for debugging and performance analysis. It can generate vast amounts of data that allow programs to be analyzed in new ways. In fact, the challenge will no longer be collecting data on performance or debugging problems, but finding the right information in everything that can now be collected. As of the 4.2 kernel, perf and other tools will provide a range of powerful capabilities for generating branch traces, along with basic capabilities for displaying the data. It will be interesting to see what new visualization and analysis tools are developed on top of that to make better use of this rich source of data.


Patches and updates

Kernel trees

Greg KH: Linux 4.1.1
Greg KH: Linux 4.0.7
Luis Henriques: Linux 3.16.7-ckt14
Greg KH: Linux 3.14.46
Greg KH: Linux 3.10.82


Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds