Swap prefetching
A number of attempts have been made at prefetching swapped data in the past. It has proved hard, however, to repopulate memory from swap in a way which does not adversely affect the performance of the system as a whole. A well-intended interactivity optimization can easily turn into a performance hit in real use. Con Kolivas has been making another try at it, however, with a series of prefetch patches based on code originally written by Thomas Schlichter. Version 11 of the swap prefetch patch was posted on September 23.
This patch creates two new data structures to track pages which have been evicted to swap. Each swapped page is represented by a swapped_entry_t structure; this structure is added to a linked list and a radix tree. The list enables the prefetch code to find the most recently swapped pages, with the idea that those pages are more likely to be useful in the near future than others which have been languishing in swap for longer. The radix tree, instead, allows the quick removal of entries without having to search the entire (possibly very long) list to find them.
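The article does not reproduce the structure itself; as a rough sketch of the arrangement described (the field and variable names below, apart from swapped_entry_t, are guesses rather than the patch's actual code), the bookkeeping might look something like:

    /*
     * Illustrative sketch only: apart from swapped_entry_t itself, the
     * field and variable names here are guesses, not the patch's code.
     */
    typedef struct swapped_entry {
        struct list_head list;      /* position in the recency-ordered list */
        swp_entry_t swp;            /* which swap slot the page went to     */
    } swapped_entry_t;

    static LIST_HEAD(swapped_list);               /* most recently swapped at the head  */
    static RADIX_TREE(swapped_root, GFP_ATOMIC);  /* quick lookup/removal by swap slot  */
    static DEFINE_SPINLOCK(swapped_lock);         /* protects both structures           */
    static unsigned long nr_swapped, max_swapped; /* current count and ~5%-of-RAM cap   */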
Whenever a page is pushed out to swap, it is also added to the list and radix tree. There is a limit on how many pages will be remembered; it is currently set to a relatively high value which keeps the swapped page entries from occupying more than 5% of RAM. If that limit is exceeded, an older entry will be recycled. The add_to_swapped_list() code also refuses to wait for any locks; if there is a conflict with another processor, it will simply forget a page rather than spin on the lock. The consequence of forgetting a page (it will never be prefetched) is relatively small, so holding up the swap process for contention is not worth it in this case.
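A minimal sketch of that behavior, reusing the names from the sketch above (the helpers here are hypothetical, and the real add_to_swapped_list() differs in detail), might read:

    /*
     * Sketch only.  The point is the trylock: on contention the page is
     * simply not remembered, so the swap-out path never waits on this
     * bookkeeping.
     */
    static void add_to_swapped_list(swp_entry_t swp)
    {
        swapped_entry_t *entry;

        if (!spin_trylock(&swapped_lock))
            return;                          /* contended: just forget this page */

        if (nr_swapped >= max_swapped)
            entry = recycle_oldest_entry();  /* hypothetical: reuses the oldest entry  */
        else
            entry = alloc_swapped_entry();   /* hypothetical: allocates and counts one */

        if (entry) {
            entry->swp = swp;
            if (!radix_tree_insert(&swapped_root, swp.val, entry))
                list_add(&entry->list, &swapped_list);
        }
        spin_unlock(&swapped_lock);
    }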
The code which actually performs prefetching is even more timid; every effort has been made to make the process of swap prefetching as close to free as possible. The prefetch code only runs once every five seconds - and that gets pushed back any time there is VM activity. The number of available free pages must be substantially above the minimum desired number, or prefetching will not happen. The code also checks that no writeback is happening, that the number of dirty pages in the system is relatively small, that the number of mapped pages is not too high, that the swap cache is not too large, and that the available pages are outside of the DMA zone. When all of those conditions are met, a few pages will be read from swap into the swap cache; they remain on the swap device so that they can be immediately reclaimed should a sudden shortage of memory develop.
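Condensed into pseudo-kernel-C, with invented thresholds and helper names standing in for the patch's actual counters and tunables, the gating logic amounts to roughly:

    /*
     * Condensed sketch of the gating described above; the thresholds and
     * helpers are invented for illustration, not taken from the patch.
     */
    static int prefetch_suitable(void)
    {
        if (time_before(jiffies, last_vm_activity + 5 * HZ))
            return 0;               /* recent VM activity pushes prefetch back     */
        if (nr_free_pages() < free_watermark_with_headroom())
            return 0;               /* free memory must be well above the minimum  */
        if (vm_writeback_active() || too_many_dirty_pages() ||
            too_many_mapped_pages() || swap_cache_too_large())
            return 0;               /* the system is busy doing real work          */
        return 1;                   /* safe to read a few pages into the swap cache */
    }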
Con claims that the end result is worthwhile.
That seems like a benefit worth having, if the cost of the prefetch code is truly low. Discussion on the list has been limited, suggesting that developers are unconcerned about the impacts of prefetching - or simply uninterested at this point.
Posted Sep 29, 2005 5:56 UTC (Thu)
by hisdad (subscriber, #5375)
[Link] (9 responses)
I saw it the other day, riel or molnar IIRC. If a process has been dirtying pages, it is required to write those pages out to disk. A small patch, apparently, with no downside, apparently.

Apparently huge improvements to interactivity while doing huge file copies. That might make this irrelevant, because things never get swapped out in the first place.

--John
Posted Sep 29, 2005 7:08 UTC (Thu)
by ncm (guest, #165)
[Link] (2 responses)
Would somebody please explain the above comment, comprehensibly?

For my part, I would prefer a much more aggressive prefetcher. Any page that's unused is wasted -- providing it can be reclaimed quickly because there's also a copy somewhere on disk. Similarly, any page of swap that doesn't mirror an otherwise-unbacked page in RAM is wasted, and slows down reclaiming that page for some other use.
Throughput's nice for benchmarks and kernel compiles, but most of us suffer far more from abysmal latency than from marginally-reduced throughput.
Posted Sep 29, 2005 10:34 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (1 responses)
I believe he's referring to Andrea Arcangeli's per-task-predictive-write-throttling.

If I've understood it properly, the patch measures the rate at which each task is writing to disk; if maintaining that rate would cause the kernel to start flushing buffers via pdflush in the near future, the task's timeslice is used to flush instead.

The result is that tasks doing the odd write here and there aren't affected, since they don't cause enough dirty pages. Tasks like cp, which dirty lots of pages, get paused to write out these pages, cleaning them and making them eligible for eviction. This reduces the memory pressure cp-type tasks can induce without killing their performance; cp would eventually pause for the writeout anyway, once it had dirtied as much RAM as it could. The patch just brings this pause forward, so that it doesn't dirty too much RAM before it writes out.
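(As a purely hypothetical sketch of that idea, not Andrea's actual code, the per-task accounting might look something like the following; every name in it is invented.)

    /*
     * Hypothetical illustration: account each page a task dirties, and
     * once its dirtying rate would soon trigger background writeback,
     * make the task spend its own timeslice writing pages back instead
     * of dirtying more.
     */
    static void task_dirtied_page(struct task_struct *tsk)
    {
        unsigned long rate;

        tsk->nr_dirtied++;                           /* hypothetical per-task counter */
        rate = tsk->nr_dirtied * HZ /
               max(1UL, jiffies - tsk->dirty_start); /* pages dirtied per second      */

        if (rate > background_writeout_threshold())  /* hypothetical tunable */
            writeback_from_task(tsk);                /* hypothetical helper  */
    }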
Posted Sep 29, 2005 10:41 UTC (Thu)
by nix (subscriber, #2304)
[Link]
It's a good idea, that patch, definitely. For a pathological example, my backup process involves multi-hundred-Mb copies to a packet-written CD-RW. This generally floods almost all of memory with dirtied pages and then flushes them to this rather slow device, making the machine slow and swappy for a quarter-hour or so until the flush is done; with predictive write throttling, I'd expect to see a steady trickle instead, or something close to it.

(Even the current behaviour is much better than Linux-2.4, where the fact that block devices didn't have their own queues led to the entirety of X and everything I was doing freezing solid within a second or two of the flush beginning; the system was more swappy than normal because memory was filled with all those dirty pages, and a tiny and otherwise-unnoticeable bit of swapping or paging was stuck behind the vast heap of stuff destined for the CD-RW.)
Posted Sep 29, 2005 10:36 UTC (Thu)
by nix (subscriber, #2304)
[Link] (5 responses)
The patch you're discussing is intended to stop processes that would dirty many pages from filling up memory with dirty pages far faster than the device can emit them, and only then blocking; instead it forces them to block sooner, before so much of memory is filled up by their dirtied pages.

This is orthogonal to the patch under discussion, which is arranging that when something *has* filled memory and then freed it all again, useful stuff gets put back in there sooner rather than later: after all, memory can be filled by many things other than dirtied pages awaiting flushing (e.g., large ld(1) runs ;) )
Posted Sep 29, 2005 13:50 UTC (Thu)
by liljencrantz (guest, #28458)
[Link] (3 responses)
The writeout on dirty pages patch fixes filter-type programs that read/write huge amounts of data but never use the same data for more than a short period of time. This is somewhat related to the new instructions in the next-generation consoles that support writing out data without touching the caches, except that the kernel autodetects such uses, so the program doesn't have to explicitly tell the OS what data not to put in the cache. This patch should in theory work very well for backup programs, indexers, media players and many other types of cache killers.

The swap prefetching patch would also solve the issue of filter-type programs, though significantly less efficiently, since a filter program would first force the entire system to swap out and then slowly 'swap in' over half a minute or so. But swap prefetching also fixes a slightly different type of problem, namely when an application uses a huge amount of memory and then exits (or at least free()s the memory). This includes applications that do some rather complicated things during initialization, like OpenOffice, as well as memory-hungry 'one-shot' programs, like yum.

These algorithms may be orthogonal in what they do, but the problems they solve have a strong overlap.
Posted Sep 29, 2005 21:08 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
It doesn't fix reading-cache-killers: they'll still push other data out of the cache. But it handles the other half of the job. :)
Posted Sep 29, 2005 21:35 UTC (Thu)
by giraffedata (guest, #1954)
[Link] (1 responses)
The early writing of dirty pages patch doesn't actually address, except in an incidental way, the problem of single-use pages, either read or write. If you read or write a gigabyte of virtual memory on a .5 gigabyte system, you will sweep physical memory with the patch just like without. To solve that problem, you need to change the cache replacement policy, not the prewriting policy.

Actually, as the VM changes so frequently, I don't know just what the present cache replacement policy is; maybe it's already sweep-resistant. My point is that early writing isn't about that.

Early writing makes it so all those page frames wasted with pages that will never be used again are at least clean, so when they do get reclaimed, the reclaimer doesn't have to wait. The wait for page laundry is shifted to the guy dirtying pages, away from the innocent bystander competing for real memory.
Posted Sep 29, 2005 23:24 UTC (Thu)
by farnz (subscriber, #17727)
[Link]

AIUI, the big gain of Andrea's early write-out patch is that the VM has a strong preference for evicting clean pages (nearly free) over dirty ones (expensive I/O needed). Because Andrea's code stops streaming writes like cp from dirtying lots of one-use pages, it makes the pages that cp is dirtying much more attractive to the VM when it's hunting for freeable pages.
Posted Sep 29, 2005 20:15 UTC (Thu)
by hisdad (subscriber, #5375)
[Link]
Ah yes, Arcangeli. So I didn't recall correctly after all!

I had mostly thought of swap in the case of large copies and not considered these other cases. It will be interesting to see what they are like in practice.
--John
Posted Sep 30, 2005 9:33 UTC (Fri)
by zooko (guest, #2589)
[Link] (2 responses)
A better solution (for me at least): turn off swap.

A few months ago I was frequently running into situations where swap thrash would drag my system to a standstill. I have 1 GB of physical RAM, so this was happening when a single process was attempting a very large computation. After minutes or hours of waiting, the process would grow beyond swap and the out-of-memory killer would start killing random (*) things. If the OOM killer chose well then the system would become usable again; if not, not.

So one day, as an experiment, I turned off swap. Now when a process grows beyond my 1 GB of physical RAM, it quickly dies. (Err, wait a second, shouldn't the OOM killer do its horrible random slaying? Yet as I recall, it seemed to work better in this situation.)
Swap -- an idea whose time has come and gone.
Regards,
Zooko
(*) I know it's not random. Whatever. It isn't predictable to me, the user.
Posted Sep 30, 2005 22:53 UTC (Fri)
by job (guest, #670)
[Link]

I second that. If I ever need more than a gigabyte of RAM, I'll simply buy another. With 64-bit machines becoming mainstream, who cares?
Posted Oct 6, 2005 13:19 UTC (Thu)
by efexis (guest, #26355)
[Link]
Agreed on the swap front. If a runaway process has a huge amount of swap space to eat through before dying, it can bring the system to a halt for an extended length of time. Trying to ssh in (if I don't already have an open connection), and then battling for enough memory and disk IO just to run the kill/killall command, is often impossible. So I tend to set a tiny swap, just for stuff that really doesn't need to be in memory, like 128MB or something. Now a runaway process won't cause a runaway system.

Not that it's a regular occurrence, but with today's amounts of memory, this "1.5 x RAM" swap rule that people won't let go of causes more harm than good.