Two paths to a better readdir()

By Jonathan Corbet
July 30, 2014

A common filesystem workload follows a simple pattern: work through a list of files in a directory, and use stat() to obtain information about each of those files. The "ls -l" command is a classic example of this type of workload, but there are many others. This workload has always run more slowly on Linux systems than developers would like, but getting a solution into the kernel has happened even more slowly. Recently, a pair of possible solutions was posted by Abhi Das; perhaps this time this issue will be resolved — in a surprising way.

The problem with the "ls -l" workload is simple enough: two system calls are required for each file of interest. A call to getdents() (usually via the readdir() function in the C library) obtains the name of a file in the directory; then stat() is called to get the information about that file. The stat() call, in particular, can be expensive, with each call forcing the underlying filesystem to perform I/O to obtain the desired information. In some cases, that information may be spread across multiple on-disk data structures, requiring even more I/O, even if the calling application does not actually use everything that stat() returns. Doing all this work is inefficient; it would be nice if there were a way to limit the information gathered to what the application actually needs and to get that information in batches.

This issue is not new; it was, in fact, already somewhat old when it was discussed at the 2009 Linux Storage and Filesystem Workshop. A proposed solution, in the form of the xstat() system call, was posted in 2010 but did not get very far. At this point, well into 2014, some filesystems have code to try to optimize for this kind of workload, but there is still no general solution in the kernel. For the last few years, there has appeared to be little interest among developers in working on this problem.

In that setting, Abhi has come forward with two independent solutions demonstrating two separate approaches to the problem. His hope is to get feedback on both and, once one of them emerges as the preferred solution, get it into the mainline kernel.

xgetdents()

The first approach builds on the 2010 xstat() system call by David Howells. It adds two new system calls:

    int xstat(int dirfd, const char *filename, unsigned int flags,
    	      unsigned int mask, struct xstat *info);
    int fxstat(int fd, unsigned int flags, unsigned int mask, struct xstat *info);

The first form looks up a given file by name, while the second returns information for an open file identified by its descriptor. The flags field is there to change the operation of the system call; there is little use of it in this patch set. Of more interest is mask, which tells the kernel which information is being requested by the application. There are quite a few bits that can be set here; examples include XSTAT_MODE (for the file protection bits), XSTAT_UID (file owner), XSTAT_RDEV (underlying storage device), XSTAT_ATIME (last access time), or XSTAT_INO (inode number). XSTAT_ALL_STATS can be used to request all available information. On a successful return, the info structure will be filled in with the requested data.

On top of this work, Abhi has added another system call:

    int xgetdents(unsigned int fd, unsigned int flags, unsigned int mask,
		  void *buf, unsigned int count);

Here, fd is a file descriptor for the directory of interest, while flags and mask are as above (though mask has been extended to allow the application to request various types of extended attribute data). Information is returned in buf, which is a simple byte array, count bytes in length. The xgetdents() call will attempt to retrieve information about multiple files in the given directory until buf fills.

The actual data returned in buf is somewhat complex. The top-level structures defining this information are:

    struct xdirent_blob {
	unsigned int    xb_xattr_count;
	char            xb_blob[1]; /* contains variable length data like
				     * NULL-terminated name, xattrs etc */
    };

    struct linux_xdirent {
	unsigned long        xd_ino;
	char                 xd_type;
	unsigned long        xd_off;
	struct xstat         xd_stat;
	unsigned long        xd_reclen;
	struct xdirent_blob  xd_blob;
    };

The documentation of the return format is somewhat sparse. Actually, it does not exist at all, so one is forced to reverse-engineer it from the code. It appears that information for each file will be returned in one variable-length linux_xdirent structure. The name of the file is the first thing stored in xd_blob, followed by extended attribute information if that has been requested. This structure clearly requires a bit of work to understand and pick apart on the user-space side, but it does have the advantage of allowing all of that information to be collected and returned in a single system call.

dirreadahead()

The alternative approach is rather simpler. It adds a single system call:

	int dirreadahead(unsigned int fd, loff_t *offset, unsigned int count);

This call will attempt to initiate the reading of file information for count files in the directory represented by fd, starting at the given offset within the directory. The offset value will be updated to reflect the number of files whose information was actually read. One can thus use multiple dirreadahead() calls to work through a directory with the kernel maintaining the offset value as things progress.

In this case, it is still necessary to call getdents() and stat() to get the needed information. What changes is that, with luck, the filesystem will have already pulled that information into an internal cache, so the calls should be handled quickly. Reading information for multiple files at once allows batching to be done; even if the information is dispersed on physical media, the necessary I/O operations can be reordered for optimal execution.

The introductory message to the two patch sets included some benchmark results on the GFS2 filesystem. Both approaches performed better than mainline kernels when presented with a workload heavy with getdents() and stat() system calls. Perhaps surprisingly, dirreadahead() consistently performed far better than xgetdents(). That result may be an artifact of the xgetdents() implementation or of the GFS2 filesystem, but it shows that the far simpler readahead-based approach is worthy of consideration.

The readahead idea quickly led to questions of whether the kernel could somehow perform this readahead automatically, as it does with basic file I/O. Trond Myklebust noted that the NFS client tries to detect workloads where this kind of readahead might be of value. In the general case, though, this detection is hard to do; there is no obvious connection within the kernel between the getdents() and stat() calls. So, for now at least, it may be up to user space to communicate that information directly. Either of the two interfaces described here could be used for that communication, but it seems that the relative simplicity of the dirreadahead() approach would argue strongly in its favor, even in the absence of better benchmark results.

Index entries for this article
Kernel	dirreadahead()
Kernel	Filesystems
Kernel	xgetdents()

Two paths to a better readdir()

Posted Jul 31, 2014 10:36 UTC (Thu) by niner (subscriber, #26151) [Link] (1 responses)

Maybe dirreadahead is faster because it allows the kernel to do I/O while the user space program is still processing results, i.e. it's an asynchronous interface.

While dirreadahead certainly has its appeal due to its simplicity and maybe even better performance, but xstat could be useful in other use cases as well. For example we do lots of stat calls to compare mtimes of a source and a preprocessed cache file to detect if we have to re-process the source. In this case we know both file names and just need the mtime.

Two paths to a better readdir()

Posted Aug 1, 2014 2:36 UTC (Fri) by adas (guest, #98126) [Link]

Yes, dirreadahead is faster because it does the readahead asynchronously and using multiple workqueues. On the other hand xgetdents is single-threaded and there is a lot of processing involved in collating all the data into the result __user buffer. And yes, the user space program needs to deconstruct the buffer too.

Two paths to a better readdir()

Posted Jul 31, 2014 19:12 UTC (Thu) by rwmj (subscriber, #5474) [Link]

libguestfs's FUSE driver also performs readahead as soon as a client begins a readdir of a directory. It massively improved performance of simple commands like 'ls' as well as things like the Nautilus file browser.

https://github.com/libguestfs/libguestfs/blob/stable-1.26...

Two paths to a better readdir()

Posted Aug 1, 2014 2:16 UTC (Fri) by dlang (guest, #313) [Link]

dirreadahead also has the advantage that if the kernel does learn to do this automatically it can just be turned into a noop so there is less long term baggage

Two paths to a better readdir()

Posted Aug 1, 2014 9:06 UTC (Fri) by kugel (subscriber, #70540) [Link]

dirreadahead() should have a flags parameter.

Two paths to a better readdir()

Posted Aug 5, 2014 19:42 UTC (Tue) by reubenhwk (guest, #75803) [Link]

I'm wondering why pick just one??? Wouldn't this be all around better...

while (dirent = readdir(...)) {
  xstat(..., dirent->d_name);
  dirreadahead(...)

  /* Do something useful */
}