DAX and fsync: the cost of forgoing page structures
DAX, the support library that can help Linux filesystems provide direct access to persistent memory (PMEM), has seen substantial ongoing development since we covered it nearly 18 months ago. Its main goal is to bypass the page cache, allowing reads and writes to become memory copies directly to and from the PMEM, and to support mapping that PMEM directly into a process's address space with mmap().
Consequently, it was a little surprising to find that one of the challenges in recent months was the correct implementation of fsync() and related functions that are primarily responsible for synchronizing the page cache with permanent storage.
While that primary responsibility of fsync() is obviated by not caching any data in volatile memory, there is a secondary responsibility that is just as important: ensuring that all writes that have been sent to the device have landed safely and are not still in the pipeline. For devices attached using SATA or SCSI, this involves sending (and waiting for) a particular command; the Linux block layer provides the blkdev_issue_flush() API (among a few others) for achieving this. For PMEM we need something a little different.
There are actually two "flush" stages needed to ensure that CPU writes have made it to persistent storage. One stage is a very close parallel to the commands sent by blkdev_issue_flush(). There is a subtle distinction between PMEM "accepting" a write and "committing" a write. If power fails between these events, data could be lost. The necessary "flush" can be performed transparently by a memory controller using Asynchronous DRAM Refresh (ADR) [PDF], or explicitly by the CPU using, for example, the new x86_64 instruction PCOMMIT. This can be seen in the wmb_pmem() calls sprinkled throughout the DAX and PMEM code in Linux; handling this stage is no great burden.
The burden is imposed by the other requirement: the need to flush CPU caches to ensure that the PMEM has "accepted" the writes. This can be avoided by performing "non-temporal writes" to bypass the cache, but that cannot be ensured when the PMEM is mapped directly into applications. Currently, on x86_64 hardware, this requires explicitly flushing each cache line that might be dirty by invoking the CLFLUSH (Cache Line Flush) instruction, or possibly a newer variant if available (CLFLUSHOPT, CLWB).
An easy approach, referred to in discussions as the "Big Hammer", is to implement the blkdev_issue_flush() API by calling CLFLUSH on every cache line of the entire persistent memory. While CLFLUSH is not a particularly expensive operation, performing it over potentially terabytes of memory was seen as worrisome.
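To make the cost concrete, here is a minimal user-space-style sketch of the per-cache-line flush, assuming a 64-byte line size; flush_pmem_range() is an invented name, and the final SFENCE merely stands in for the "commit" stage that wmb_pmem() handles in the kernel.

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed x86_64 cache-line size */

    /* Flush every cache line in [addr, addr + len) out toward the PMEM.
     * CLFLUSH evicts one line per invocation, so flushing a multi-terabyte
     * device this way means billions of iterations. */
    static void flush_pmem_range(void *addr, uint64_t len)
    {
            uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
            uintptr_t end = (uintptr_t)addr + len;

            for (; p < end; p += CACHE_LINE)
                    asm volatile("clflush %0" : "+m" (*(volatile char *)p));

            /* Second stage: push the flushed data into the persistence
             * domain; in the kernel this is wmb_pmem() (SFENCE, plus
             * PCOMMIT on hardware that needs it). */
            asm volatile("sfence" ::: "memory");
    }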
The alternative is to keep track of which regions of memory might have
been written recently and to only flush those. This can be expected to
bring the amount of memory being flushed down from terabytes to gigabytes
at the very most, and hence to reduce run time by several orders of magnitude.
Keeping track of dirty memory is easy when the page cache is in use: a flag in struct page records it. Since DAX bypasses the page cache, there are no page structures for most of PMEM, so an alternative is needed. Finding that alternative was the focus of most of the discussions and of the implementation of fsync() support for DAX, culminating in patch sets posted by Ross Zwisler (original and fix-ups) that landed upstream for 4.5-rc1.
Is it worth the effort?
There was a subthread running through the discussion that wondered whether it might be best to avoid the problem rather than fix it. A filesystem does not have to use DAX simply because it is mounted from a PMEM device. It can selectively choose to use DAX or not based on usage patterns or policy settings (and, for example, would never use DAX on directories, as metadata generally needs to be staged out to storage in a controlled fashion). Normal page-cache access could be the default and write-out to PMEM would use non-temporal writes. DAX would only be enabled while a file is memory mapped with a new MMAP_DAX flag. In that case, the application would be explicitly requesting DAX access (probably using the nvml library) and it would take on the responsibility of calling CLFLUSH as appropriate. It is even conceivable that future processors could make cache flushing for a physical address range much more direct, so keeping track of addresses to flush would become pointless.
Dan Williams championed this position, putting his case quite succinctly:
DAX in my opinion is not a transparent accelerator of all existing apps, it's a targeted mechanism for applications ready to take advantage of byte addressable persistent memory.
He also expressed a concern that fsync() would end up being painful for large amounts of data.
Dave Chinner didn't agree. He provided a demonstration suggesting that the overhead of the proposed fsync() support would be negligible. He asserted instead:
DAX is a method of allowing POSIX compliant applications [to] get the best of both worlds - portability with existing storage and filesystems, yet with the speed and byte [addressability] of persistent storage through the use of mmap.
Williams' position resurfaced from time to time as it became clear that there were real and ongoing challenges in making fsync() work, but he didn't seem able to rally much support.
Shape of the solution
In general, the solution chosen is to still use the page cache data structures, but not to store struct page pointers in them. The page cache uses a radix tree that can store a pointer and a few tags (single bits of extra information) at every page-aligned offset in a file. The space reserved for the page pointer can be used for anything else by setting the least significant bit to mark it as an exception. For example, the tmpfs filesystem uses exception entries to keep track of file pages that have been written out to swap.
Keeping track of dirty regions of a file can be done by allocating entries in this radix tree, storing a blank exception entry in place of the page pointer, and setting the PAGECACHE_TAG_DIRTY tag. Finding all entries with a tag set is quite efficient, so flushing all the cache lines in each dirty page to perform fsync() should be quite straightforward.
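Below is a sketch of how the dirty-tracking side might look with the radix-tree API of that era; dax_dirty_entry() is an illustrative helper rather than the actual fs/dax.c code, and details such as radix-tree node preallocation and truncate races are glossed over.

    #include <linux/fs.h>           /* struct address_space, PAGECACHE_TAG_DIRTY */
    #include <linux/pagemap.h>
    #include <linux/radix-tree.h>

    /*
     * Illustrative only: record that the page-sized region at 'index' of
     * 'mapping' has been written through a DAX mapping.  A bare exceptional
     * entry stands in for the struct page pointer; later sections pack a
     * size flag and location information into the same word.
     */
    static int dax_dirty_entry(struct address_space *mapping, pgoff_t index)
    {
            void *entry = (void *)RADIX_TREE_EXCEPTIONAL_ENTRY;
            int err;

            spin_lock_irq(&mapping->tree_lock);
            err = radix_tree_insert(&mapping->page_tree, index, entry);
            if (err == -EEXIST)
                    err = 0;        /* already tracked; just (re)tag it */
            if (!err)
                    radix_tree_tag_set(&mapping->page_tree, index,
                                       PAGECACHE_TAG_DIRTY);
            spin_unlock_irq(&mapping->tree_lock);
            return err;
    }

At fsync() time, the tagged entries can be found efficiently with radix_tree_gang_lookup_tag() or the tagged iteration helpers, so only the dirty pages need to be visited.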
As this solution was further explored, it was repeatedly found that some of the fields in struct page really are useful, so alternatives to them needed to be found.
Page size: PG_head
To flush "all the cache lines in each dirty page" you need to know how big the page is — it could be a regular page (4K on x86) or it could be a huge page (2M on x86). Huge pages are particularly important for PMEM, which is expected to sometimes be huge. If the filesystem creates files with the required alignment, DAX will automatically use huge pages to map them. There are even patches from Matthew Wilcox that aim to support the direct mapping for extra-huge 1GB pages — referred to as "PUD pages" after the Page Upper Directory level in the four-level page tables from which they are indexed.
With a struct page, the PG_head flag can be used to determine the page size. Without that, something else is needed.
Storing 512 entries in the radix tree for each huge page would be an
option, but not an elegant option. Instead, one bit in the otherwise
unused pointer field is used to flag a huge-page entry, which is also known as a
"PMD" entry because it is linked from the Page Middle Directory.
Locking: PG_locked
The page lock is central to handling concurrency within filesystems and memory management. With no struct page there is no page lock. One place where this has caused a problem is in managing races between one thread trying to sync a page and mark it as clean and another thread dirtying that page. Ideally, clean pages should be removed from the radix tree completely as they are not needed there, but attempts to do that have, so far, failed to avoid the race.
Jan Kara suggested that another bit in the pointer field could be used as a bit-spin-lock, effectively duplicating the functionality of PG_locked. That seems a likely approach, but it has not yet been attempted.
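Since that approach has not been tried yet, the following is only speculation about what it might look like, built on the kernel's existing bit_spinlock helpers; DAX_ENTRY_LOCK reuses the spare bit from the earlier illustrative encoding, and the problem of keeping the slot itself stable is ignored here.

    #include <linux/bit_spinlock.h>
    #include <linux/radix-tree.h>

    #define DAX_ENTRY_LOCK  2       /* the spare bit from the sketch above */

    /*
     * Speculative sketch of Jan Kara's suggestion: treat one bit of the
     * radix-tree slot as a lock, much as PG_locked is used on struct page.
     * 'slot' would come from radix_tree_lookup_slot() and must remain
     * valid for as long as the lock is held.
     */
    static void dax_lock_entry(void **slot)
    {
            bit_spin_lock(DAX_ENTRY_LOCK, (unsigned long *)slot);
    }

    static void dax_unlock_entry(void **slot)
    {
            bit_spin_unlock(DAX_ENTRY_LOCK, (unsigned long *)slot);
    }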
Physical memory address
Once we have enough information in the radix tree to reliably track
which pages are dirty and how big they are, we just need to know where each
page is in PMEM so it can be flushed. This information is generally of
little interest to common code so handling it is left up to the filesystem.
Filesystems will normally attach something to the struct page using the private pointer. In filesystems that use the buffer_head library, the private pointer links to a buffer_head that contains a b_blocknr field identifying the location of the stored data.
Without a struct page, the address needs to be found some other way. There are a number of options, several of which have been explored.
The filesystem could be asked to perform the lookup from file offset to
physical address using its internal indexing tables. This is an
indirect approach and may require the filesystem to reload some indexing
data from the PMEM (it wouldn't use direct-access for that). While the
first patch set used this approach, it did not survive long.
Alternately, the physical address could be stored in the radix tree when the page is marked as dirty; the physical address will already be available at that time as it is just about to be accessed for write. This leads to another question: exactly how is the physical address represented? We could use the address where the PMEM is mapped into the kernel address space, but that leads to awkward races when a PMEM device is disabled and unmapped. Instead, we could use a sector offset into the block device that represents the PMEM. That is what the current implementation does, but it implicitly assumes there is just one block device, or at least just one per file. For a filesystem that integrates volume management (as Btrfs does), this may not be the case.
Finally, we could use the page frame number (PFN), which is a stable index that is assigned by the BIOS when the memory is discovered. Wilcox has patches to move in this direction, but the work is "70% ... maybe 50%" done. Assuming that the PFN can be reliably mapped to the kernel address that is needed for CLFLUSH, this seems like the best solution.
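Putting the pieces together, the writeback side of fsync() under the sector-based scheme might look roughly like the sketch below. It reuses the illustrative entry encoding from above; pmem_sector_to_kaddr() is a made-up stand-in for the block layer's direct-access lookup, and clflush_cache_range() is the x86 helper that flushes a virtual address range one cache line at a time.

    #include <linux/fs.h>
    #include <linux/mm.h>           /* PAGE_SIZE, PMD_SIZE */
    #include <linux/pagemap.h>
    #include <linux/radix-tree.h>
    #include <asm/cacheflush.h>     /* clflush_cache_range() */

    /* Hypothetical helper: translate a sector of the PMEM block device
     * into the kernel virtual address where it is directly mapped. */
    void *pmem_sector_to_kaddr(struct block_device *bdev, sector_t sector);

    /* Flush one entry found during the fsync()/msync() walk of entries
     * tagged PAGECACHE_TAG_DIRTY, then mark it clean.  Uses the
     * illustrative DAX_ENTRY_SHIFT/dax_entry_is_pmd() encoding above. */
    static void dax_writeback_one(struct address_space *mapping,
                                  struct block_device *bdev,
                                  pgoff_t index, void *entry)
    {
            sector_t sector = (unsigned long)entry >> DAX_ENTRY_SHIFT;
            void *kaddr = pmem_sector_to_kaddr(bdev, sector);
            size_t size = dax_entry_is_pmd(entry) ? PMD_SIZE : PAGE_SIZE;

            /* Write the CPU cache lines covering this page back to PMEM. */
            clflush_cache_range(kaddr, size);

            spin_lock_irq(&mapping->tree_lock);
            radix_tree_tag_clear(&mapping->page_tree, index,
                                 PAGECACHE_TAG_DIRTY);
            spin_unlock_irq(&mapping->tree_lock);
    }

A single wmb_pmem() once the whole walk has finished would then provide the second, "commit" stage described at the start of the article.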
Is this miniature struct page enough?
One way to look at this development is that a 64-bit miniature struct page has been created for the DAX use case to avoid the cost of a full struct page. It currently contains a "huge page" flag and a physical sector number. It may yet gain a lock bit and have a PFN in place of the sector number. It seems prudent to ask if there is anything else that might be needed before DAX functionality is complete.
As quoted above, Chinner appears to think that transparent support for full POSIX semantics should be the goal. He went on to opine that:
This is just another example of how yet another new-fangled storage technology maps precisely to a well known, long serving storage architecture that we already have many, many experts out there that know to build reliable, performant storage from... :)
Taking that position to its logical extreme would suggest that anything that can be done in the existing storage architecture should work with PMEM and DAX. One such item of functionality that springs to mind is the pvmove tool.
When a filesystem is built on an LVM2 volume, it is possible to use pvmove to move some of the data from one device to another, to balance the load, decommission old hardware, or start using new hardware. Similar functionality could well be useful with PMEM.
There would be a number of challenges to making this work with DAX, but possibly the biggest would be tearing down memory mappings of a section of the old memory before moving data across to the new. The Linux kernel has some infrastructure for memory migration that would be a perfect fit — if only the PMEM had a table of struct page as regular memory does. Without those page structures, moving memory that is currently mapped becomes a much more interesting task, though likely not an insurmountable one.
On the whole, it seems like DAX is showing a lot of promise but is still in its infancy. Currently, it can only be used on ext2, ext4, and XFS, and only where they are directly mounted on a PMEM device (i.e. there is no LVM support). Given the recent rate of change, it is unlikely to stay this way. Bugs will be fixed, performance will be improved, coverage and features will likely be added. When inexpensive persistent memory starts appearing on our motherboards it seems that Linux will be ready to make good use of it.
Posted Feb 25, 2016 10:20 UTC (Thu) by willy (subscriber, #9762) (3 responses)
Another issue is that each bit consumed by a feature reduces the amount of physical memory supportable. Right now I have six bits consumed; two for the radix tree, two for PFN_MAP and PFN_DEV and two for the size (PTE, PMD or PUD). That limits us to 256GB on 32-bit systems with a 4k page size. Quite a lot of memory, but a mere laptop drive for storage.
Maybe PFN_DEV is implicit for DAX and that bit can be reused for locking.
Posted Feb 25, 2016 11:09 UTC (Thu) by neilbrown (subscriber, #359) (2 responses)
Does it? radix_tree_node.count could be used to count externally held references as well as the internal ones (non-trivial change, but quite practical). That could be used to stabilize the entry while spinning on the lock.
Making this credible on 32bit does seem .... challenging. There is one tag bit that isn't used I think but at best that would get you to 1TB. Maybe 32bit systems don't deserve any more...
Hmmm.. You don't really need two bits for PMD and PUD. Once the PMD bit is set you have 9 bits in the PFN that you expect to be zero. One of those could distinguish between PMD and PUD.
Posted Feb 25, 2016 12:44 UTC (Thu) by roblucid (guest, #48964) (1 response)
You'd face similar program limitations to UNIX Version 6 on a PDP-11, where address space was smaller than physical memory and data; you then want things to reside in files processed record by record. But PMEM sounds like it would make an ideal swap/hibernate device. The recent 32-bit ARM CPUs launched for ultra-low-powered applications forgo an MMU, so they are moot.
The whole idea sounds like good material for one of Linus's colourful statements; IIRC he dislikes the 32-bit PAE kernel extensions. So why compromise the 64-bit-and-up design for something that will never really be useful on 32-bit systems?
Posted Mar 11, 2016 19:35 UTC (Fri) by dlang (guest, #313)
Posted Feb 25, 2016 21:07 UTC (Thu) by iabervon (subscriber, #722) (2 responses)

Posted Feb 25, 2016 22:31 UTC (Thu) by neilbrown (subscriber, #359)

It took me a little while to convince myself of why there really is something new. When you write to a traditional storage device, the device gets the data using DMA. On x86 at least, the DMA controller sees memory that is consistent with what the CPU sees. So the data doesn't need to be in "main memory" for the DMA controller to copy it to the target device. Details might be different on non-x86 hardware. Documentation/DMA-API-HOWTO.txt might be helpful.

> the application should be expecting to call msync()

I probably should have been more explicit, but when I wrote "fsync() and related functions", that includes msync(). msync() does the same thing as fsync() and has the same difficulties; it just identifies the target file differently.

I am not able to address your other questions.

Posted Feb 25, 2016 23:51 UTC (Thu) by dgc (subscriber, #6611)

> caches when implementing fsync() for DAX. Why don't you have the same
> problems with CPU caches and the page cache?

We do have the same problems - it's just that they were solved a long time ago, and it's assumed that filesystem developers understand the need for these cache flushes and where to locate them. E.g. go have a look at all the flush_dcache_page() calls in the filesystem and IO code...

-Dave.