Swap prefetching
A number of attempts have been made at prefetching swapped data in the past. It has proved hard, however, to repopulate memory from swap in a way which does not adversely affect the performance of the system as a whole. A well-intended interactivity optimization can easily turn into a performance hit in real use. Con Kolivas has been making another try at it, however, with a series of prefetch patches based on code originally written by Thomas Schlichter. Version 11 of the swap prefetch patch was posted on September 23.
This patch creates two new data structures to track pages which have been evicted to swap. Each swapped page is represented by a swapped_entry_t structure; this structure is added to a linked list and a radix tree. The list enables the prefetch code to find the most recently swapped pages, with the idea that those pages are more likely to be useful in the near future than others which have been languishing in swap for longer. The radix tree, instead, allows the quick removal of entries without having to search the entire (possibly very long) list to find them.
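The article does not reproduce the structure itself; as a rough sketch of the arrangement described (the field and variable names below, apart from swapped_entry_t, are guesses rather than the patch's actual code), the bookkeeping might look something like:

    /*
     * Illustrative sketch only: apart from swapped_entry_t itself, the
     * field and variable names here are guesses, not the patch's code.
     */
    typedef struct swapped_entry {
        struct list_head list;      /* position in the recency-ordered list */
        swp_entry_t swp;            /* which swap slot the page went to     */
    } swapped_entry_t;

    static LIST_HEAD(swapped_list);               /* most recently swapped at the head  */
    static RADIX_TREE(swapped_root, GFP_ATOMIC);  /* quick lookup/removal by swap slot  */
    static DEFINE_SPINLOCK(swapped_lock);         /* protects both structures           */
    static unsigned long nr_swapped, max_swapped; /* current count and ~5%-of-RAM cap   */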
Whenever a page is pushed out to swap, it is also added to the list and radix tree. There is a limit on how many pages will be remembered; it is currently set to a relatively high value which keeps the swapped page entries from occupying more than 5% of RAM. If that limit is exceeded, an older entry will be recycled. The add_to_swapped_list() code also refuses to wait for any locks; if there is a conflict with another processor, it will simply forget a page rather than spin on the lock. The consequence of forgetting a page (it will never be prefetched) is relatively small, so holding up the swap process for contention is not worth it in this case.
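A minimal sketch of that behavior, reusing the names from the sketch above (the helpers here are hypothetical, and the real add_to_swapped_list() differs in detail), might read:

    /*
     * Sketch only.  The point is the trylock: on contention the page is
     * simply not remembered, so the swap-out path never waits on this
     * bookkeeping.
     */
    static void add_to_swapped_list(swp_entry_t swp)
    {
        swapped_entry_t *entry;

        if (!spin_trylock(&swapped_lock))
            return;                          /* contended: just forget this page */

        if (nr_swapped >= max_swapped)
            entry = recycle_oldest_entry();  /* hypothetical: reuses the oldest entry  */
        else
            entry = alloc_swapped_entry();   /* hypothetical: allocates and counts one */

        if (entry) {
            entry->swp = swp;
            if (!radix_tree_insert(&swapped_root, swp.val, entry))
                list_add(&entry->list, &swapped_list);
        }
        spin_unlock(&swapped_lock);
    }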
The code which actually performs prefetching is even more timid; every effort has been made to make the process of swap prefetching as close to free as possible. The prefetch code only runs once every five seconds - and that gets pushed back any time there is VM activity. The number of available free pages must be substantially above the minimum desired number, or prefetching will not happen. The code also checks that no writeback is happening, that the number of dirty pages in the system is relatively small, that the number of mapped pages is not too high, that the swap cache is not too large, and that the available pages are outside of the DMA zone. When all of those conditions are met, a few pages will be read from swap into the swap cache; they remain on the swap device so that they can be immediately reclaimed should a sudden shortage of memory develop.
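Condensed into pseudo-kernel-C, with invented thresholds and helper names standing in for the patch's actual counters and tunables, the gating logic amounts to roughly:

    /*
     * Condensed sketch of the gating described above; the thresholds and
     * helpers are invented for illustration, not taken from the patch.
     */
    static int prefetch_suitable(void)
    {
        if (time_before(jiffies, last_vm_activity + 5 * HZ))
            return 0;               /* recent VM activity pushes prefetch back     */
        if (nr_free_pages() < free_watermark_with_headroom())
            return 0;               /* free memory must be well above the minimum  */
        if (vm_writeback_active() || too_many_dirty_pages() ||
            too_many_mapped_pages() || swap_cache_too_large())
            return 0;               /* the system is busy doing real work          */
        return 1;                   /* safe to read a few pages into the swap cache */
    }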
Con claims that the end result is worthwhile.
That seems like a benefit worth having, if the cost of the prefetch code is truly low. Discussion on the list has been limited, suggesting that developers are unconcerned about the impacts of prefetching - or simply uninterested at this point.
Posted Sep 29, 2005 5:56 UTC (Thu)
by hisdad (subscriber, #5375)
[Link] (9 responses)
I saw it the other day, riel or molnar IIRC. If a process has been dirtying pages, it is required to write those pages out to disk. A small patch, apparently, with no downside, apparently.

Apparently huge improvements to interactivity while doing huge file copies. That might make this irrelevant, because things never get swapped out in the first place.

--John
Posted Sep 29, 2005 7:08 UTC (Thu)
by ncm (guest, #165)
[Link] (2 responses)
Would somebody please explain the above comment, comprehensibly?

For my part, I would prefer a much more aggressive prefetcher. Any page that's unused is wasted -- providing it can be reclaimed quickly because there's also a copy somewhere on disk. Similarly, any page of swap that doesn't mirror an otherwise-unbacked page in RAM is wasted, and slows down reclaiming that page for some other use.
Throughput's nice for benchmarks and kernel compiles, but most of us suffer far more from abysmal latency than from marginally-reduced throughput.
Posted Sep 29, 2005 10:34 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (1 responses)
I believe he's referring to Andrea Arcangeli's per-task-predictive-write-throttling.

If I've understood it properly, the patch measures the rate at which each task is writing to disk; if maintaining that rate would cause the kernel to start flushing buffers via pdflush in the near future, the task's timeslice is used to flush instead.

The result is that tasks doing the odd write here and there aren't affected, since they don't cause enough dirty pages. Tasks like cp, which dirty lots of pages, get paused to write out these pages, cleaning them and making them eligible for eviction. This reduces the memory pressure cp-type tasks can induce without killing their performance; cp would eventually pause for the writeout anyway, once it had dirtied as much RAM as it could. The patch just brings this pause forward, so that it doesn't dirty too much RAM before it writes out.
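(As a purely hypothetical sketch of that idea, not Andrea's actual code, the per-task accounting might look something like the following; every name in it is invented.)

    /*
     * Hypothetical illustration: account each page a task dirties, and
     * once its dirtying rate would soon trigger background writeback,
     * make the task spend its own timeslice writing pages back instead
     * of dirtying more.
     */
    static void task_dirtied_page(struct task_struct *tsk)
    {
        unsigned long rate;

        tsk->nr_dirtied++;                           /* hypothetical per-task counter */
        rate = tsk->nr_dirtied * HZ /
               max(1UL, jiffies - tsk->dirty_start); /* pages dirtied per second      */

        if (rate > background_writeout_threshold())  /* hypothetical tunable */
            writeback_from_task(tsk);                /* hypothetical helper  */
    }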
Posted Sep 29, 2005 10:41 UTC (Thu)
by nix (subscriber, #2304)
[Link]
It's a good idea, that patch, definitely. For a pathological example, my backup process involves multi-hundred-Mb copies to a packet-written CD-RW. This generally floods almost all of memory with dirtied pages and then flushes them to this rather slow device, making the machine slow and swappy for a quarter-hour or so until the flush is done; with predictive write throttling, I'd expect to see a steady trickle instead, or something close to it.

(Even the current behaviour is much better than Linux-2.4, where the fact that block devices didn't have their own queues led to the entirety of X and everything I was doing freezing solid within a second or two of the flush beginning; the system was more swappy than normal because memory was filled with all those dirty pages, and a tiny and otherwise-unnoticeable bit of swapping or paging was stuck behind the vast heap of stuff destined for the CD-RW.)
Posted Sep 29, 2005 10:36 UTC (Thu)
by nix (subscriber, #2304)
[Link] (5 responses)
The patch you're discussing is intended to stop processes that would dirty many pages from filling up memory with dirty pages far faster than the device can emit them, and only then blocking; instead it forces them to block sooner, before so much of memory is filled up by their dirtied pages.

This is orthogonal to the patch under discussion, which is arranging that when something *has* filled memory and then freed it all again, useful stuff gets put back in there sooner rather than later: after all, memory can be filled by many things other than dirtied pages awaiting flushing (e.g., large ld(1) runs ;) )
Posted Sep 29, 2005 13:50 UTC (Thu)
by liljencrantz (guest, #28458)
[Link] (3 responses)
The writeout on dirty pages patch fixes filter-type programs that read/write huge amounts of data but never use the same data for more than a short period of time. This is somewhat related to the new instructions in the next-generation consoles that support writing out data without touching the caches, except that the kernel autodetects such uses, so the program doesn't have to explicitly tell the OS what data not to put in the cache. This patch should in theory work very well for backup programs, indexers, media players and many other types of cache killers.

The swap prefetching patch would also solve the issue of filter-type programs, though significantly less efficiently, since a filter program would first force the entire system to swap out and then slowly 'swap in' over half a minute or so. But swap prefetching also fixes a slightly different type of problem, namely when an application uses a huge amount of memory and then exits (or at least free()s the memory). This includes applications that do some rather complicated things during initialization, like OpenOffice, as well as memory-hungry 'one-shot' programs, like yum.

These algorithms may be orthogonal in what they do, but the problems they solve have a strong overlap.
Posted Sep 29, 2005 21:08 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
It doesn't fix reading-cache-killers: they'll still push other data out of the cache. But it handles the other half of the job. :)
Posted Sep 29, 2005 21:35 UTC (Thu)
by giraffedata (guest, #1954)
[Link] (1 responses)
The early writing of dirty pages patch doesn't actually address, except in an incidental way, the problem of single-use pages, either read or write. If you read or write a gigabyte of virtual memory on a .5 gigabyte system, you will sweep physical memory with the patch just like without. To solve that problem, you need to change the cache replacement policy, not the prewriting policy.

Actually, as the VM changes so frequently, I don't know just what the present cache replacement policy is; maybe it's already sweep-resistant. My point is that early writing isn't about that.

Early writing makes it so all those page frames wasted with pages that will never be used again are at least clean, so when they do get reclaimed, the reclaimer doesn't have to wait. The wait for page laundry is shifted to the guy dirtying pages, away from the innocent bystander competing for real memory.
Posted Sep 29, 2005 23:24 UTC (Thu)
by farnz (subscriber, #17727)
[Link]

AIUI, the big gain of Andrea's early write-out patch is that the VM has a strong preference for evicting clean pages (nearly free) over dirty ones (expensive I/O needed). Because Andrea's code stops streaming writes like cp from dirtying lots of one-use pages, it makes the pages that cp is dirtying much more attractive to the VM when it's hunting for freeable pages.
Posted Sep 29, 2005 20:15 UTC (Thu)
by hisdad (subscriber, #5375)
[Link]
Ah yes, Arcangeli. So I didn't recall correctly after all!

I had mostly thought of swap in the case of large copies and not considered these other cases. It will be interesting to see what they are like in practice.
--John
Posted Sep 30, 2005 9:33 UTC (Fri)
by zooko (guest, #2589)
[Link] (2 responses)
A better solution (for me at least): turn off swap.

A few months ago I was frequently running into situations where swap thrash would drag my system to a standstill. I have 1 GB of physical RAM, so this was happening when a single process was attempting a very large computation. After minutes or hours of waiting, the process would grow beyond swap and the out-of-memory killer would start killing random (*) things. If the OOM killer chose well then the system would become usable again; if not, not.

So one day, as an experiment, I turned off swap. Now when a process grows beyond my 1 GB of physical RAM, it quickly dies. (Err, wait a second, shouldn't the OOM killer do its horrible random slaying? Yet as I recall, it seemed to work better in this situation.)
Swap -- an idea whose time has come and gone.
Regards,
Zooko
(*) I know it's not random. Whatever. It isn't predictable to me, the user.
Posted Sep 30, 2005 22:53 UTC (Fri)
by job (guest, #670)
[Link]

I second that. If I ever need more than a gigabyte of RAM, I'll simply buy another. With 64-bit machines becoming mainstream, who cares?
Posted Oct 6, 2005 13:19 UTC (Thu)
by efexis (guest, #26355)
[Link]
Agreed on the swap front. If a runaway process has a huge amount of swap space to eat through before dying, it can bring the system to a halt for an extended length of time. Trying to ssh in (if I don't already have an open connection), and then battling for enough memory and disk IO just to run the kill/killall command, is often impossible. So I tend to set a tiny swap, just for stuff that really doesn't need to be in memory, like 128MB or something. Now a runaway process won't cause a runaway system.

Not that it's a regular occurrence, but with today's amounts of memory, this "1.5 x RAM" swap rule that people won't let go of causes more harm than good.