Token-based thrashing control

[Posted August 4, 2004 by corbet]

A system which is in the throes of VM thrashing is no fun to work with. The kernel is forever throwing out pages which it will need in the near future in favor of pages needed right now, and little work actually gets done. It seems like there has to be a better way.

Rik van Riel has put together a patch based on the work of Song Jiang which might help. The basic idea is that a process which is currently bringing in pages should, for a short period, not have its other pages booted out to swap. With luck, that process will actually make some progress during that grace period before the VM grim reaper swoops down and consigns it, once again, to the swap ghetto.

Clearly, not all processes which are bringing in pages can be sheltered from page reclamation at the same time; if they could, the system would not be thrashing in the first place. This problem is addressed through the creation of a "swap token." A process holding the swap token will be allowed to bring in pages without having its current working set molested for a period of time. After a while, the token is passed on to the next needy process.

In Rik's patch, the (single, system-wide) token is implemented through swap_token_mm, a pointer to the mm structure of the process holding the token. If the kernel, on behalf of a process incurring a page fault, decides that the token is available, swap_token_mm will be set and the faulting process will get its breathing space for a while. The token is deemed to be available if (1) it has been held for longer than the maximum period, which is set to a surprisingly long 300 seconds, or (2) the process holding the token has not incurred any page faults recently. Once the token becomes available, the first process which comes looking for it will grab it - unless it has held the token in the recent past.

Rik's tests show some performance improvements with this patch applied. There are potential improvements which could be made, such as trying to add some intelligence to the decision of which process gets the token. A huge process may hold the token for some time, grow to fill much of memory, and still not have enough to get any real work done. Meanwhile, small processes which could have benefited from a few extra pages continue to thrash. Some tweaks could be made to the patch to address this issue, but there are limits to how much code and complexity should be added to the kernel to deal with a (hopefully) rare situation.

Index entries for this article
Kernel	Memory management/Token-based thrashing control
Kernel	Thrashing

Token-based thrashing control

Posted Aug 5, 2004 19:46 UTC (Thu) by eskild (guest, #1556) [Link]

"... before the VM grim reaper swoops down and consigns it, once again, to the swap ghetto."

I don't think I've ever seen complex kernel operations described quite this colourful... Very amusing!

long term scheduling

Posted Aug 6, 2004 3:05 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

Much older operating systems do a similar thing, but simpler, called long term scheduling. Above the regular CPU dispatch queue, you have a list of processes waiting to get into the dispatch queue. While a time slice on the CPU might be .1 second, a time slice in the dispatch queue might be 5 seconds. You don't let the dispatch queue ever contain so many processes that their working sets won't all fit in memory at the same time.

So a process gets into the inner circle, quickly faults in his working set, gets a lot of work done against that memory across many CPU time slices, then steps out and his page frames get quickly stolen by the next guy.

The fact that a process can spend a long time waiting in the outer circle is just recognition of the fact that you can't get blood from a stone. If you want an interactive system, you have to have enough memory for everyone.

long term scheduling

Posted Aug 6, 2004 17:46 UTC (Fri) by riel (subscriber, #3142) [Link]

The thing is, long term sheduling isn't simpler. It works fine when you have to deal with one kind of workload on one type of system, but the variety of hardware and workloads that Linux has to deal with is way bigger than the variety any of the commercial Unixen has had to deal with.

I've tried implementing good long term scheduling (aka load control) strategies for Linux in the past, but those efforts have always ran into trouble when it comes to the magic knobs needed for such a system. The 64MB RAM system with a few apps behaves very differently from the 1GB RAM system with many apps, or the 1GB RAM system with a few scientific calculations, or ...

The token based thrashing prevention, OTOH, seems to just work. No magic involved. It's turned out to be a lot simpler than any long term scheduling algorithm I've ever seen.