Memory power management
One technology which is finding its way into some systems is called "partial array self refresh" or PASR. On a PASR-enabled system, memory is divided into banks, each of which can be powered down independently. If (say) half of memory is not needed, that memory (and its self-refresh mechanism) can be turned off; the result is a reduction in power use, but also the loss of any data stored in the affected banks. The amount of power actually saved is a bit unclear; estimates seem to run in the range of 5-15% of the total power used by the memory subsystem.
The key to powering down a bank of memory, naturally, is to be sure that there is no important data stored therein first. That means that the system must either evacuate a bank to be powered down, or it must take care not to allocate memory there in the first place. So the memory management subsystem will have to become aware of the power topology of main memory and take that information into account when satisfying allocation requests. It will also have to understand the desired power management policy and make decisions to power banks up or down depending on the current level of memory pressure. This is going to be fun: memory management is already a complicated set of heuristics which attempt to provide reasonable results for any workload; adding power management into the mix can only complicate things further.
A recent patch set from Ankita Garg does not attempt to solve the whole problem; instead, it creates an initial infrastructure which can be used for future power management decisions. Before looking at that patch, though, a bit of background will be helpful.
The memory management subsystem already splits available memory at two different levels. On non-uniform memory access (NUMA) systems, memory which is local to a specific processor will be faster to access than memory attached to a different processor. The kernel's memory management code takes NUMA nodes into account to implement specific allocation policies. In many cases, the system will try to keep a process and all of its memory on the same NUMA node in the hope of maximizing the number of local accesses; at other times, it is better to spread allocations evenly across the system. Either way, the NUMA node must be taken into account for all allocation and reclaim decisions.
The other important concept is that of a "zone"; zones are present on all systems. The primary use of zones is to categorize memory by accessibility; 32-bit systems, for example, will have "low memory" and "high memory" zones to contain memory which can and cannot (respectively) be directly accessed by the kernel. Systems may have a zone for memory accessible with a 32-bit address; many devices can only perform DMA to such addresses. Zones are also used to separate memory which can readily be relocated (user-space pages accessed through page tables, for example) from memory which is hard to move (kernel memory for which there may be an arbitrary number of pointers). Every NUMA node has a full set of zones.
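The resulting hierarchy can be pictured roughly as follows. This is a deliberately simplified sketch (the kernel's real struct pglist_data and struct zone carry far more state), but the containment is the same: each node holds its own array of zones, and allocation and reclaim decisions consult both levels.

    /* Simplified sketch of the existing node/zone hierarchy. */
    #define MAX_NR_ZONES 4               /* e.g. DMA, DMA32, Normal, Movable */

    struct zone {
        unsigned long start_pfn;         /* first page frame in the zone */
        unsigned long present_pages;     /* pages backed by real memory */
        unsigned long min_free_pages;    /* per-zone watermark for reclaim */
        /* ... free lists, LRU lists, locks ... */
    };

    struct node {
        int node_id;
        struct zone zones[MAX_NR_ZONES]; /* every node has a full set of zones */
    };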
PASR has been on the horizon for a little while, so a few people have been thinking about how to support it; one of the early works would appear to be this paper by Henrik Kjellberg, though that work didn't result in code submitted upstream. Henrik pointed out that the kernel already has a couple of mechanisms which could be used to support PASR. One of those is memory hotplug, wherein memory can be physically removed from the system. Turning off a bank of memory can be thought of as being something close to removing that memory, so it makes sense to consider hotplug. Hotplug is a heavyweight operation, though; it is not well suited to power management, where decisions to power banks of memory up or down may be made fairly often.
Another approach would be to use zones; the system could set up a separate zone for each memory bank which could be powered down independently. Powering down a bank would then be a matter of moving needed data out of the associated zone and marking that zone so that no further allocations would be made from it. The problem with this approach is that a number of important memory management operations happen at the zone level; in particular, each zone has a set of limits (watermarks) on how many free pages must be kept available. Adding more zones would increase memory management overhead and create balancing problems which don't need to exist.
That is essentially the approach that Ankita has taken, though: the patch set adds another level of description, called "regions," interposed between nodes and zones, so each bank of memory gets not just one new zone, but a complete set of zones of its own. The page allocator will always try to obtain pages from the lowest-numbered region it can, in the hope that the higher regions will remain vacant. Over time, of course, this simple approach will not suffice, and it will become necessary to migrate pages out of regions before they can be powered down. The initial patch does not address that issue, though, nor any of the associated policy issues that come up.
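In the same simplified style (and with invented names; these are not the structures used in the patch set), the layout just described would look something like this: each node is divided into regions, one per independently powerable unit, and each region carries its own complete set of zones.

    /* Illustrative only: one region per power-manageable memory bank,
     * each containing a full set of zones (struct zone as sketched above). */
    #define MAX_NR_REGIONS 8             /* hypothetical banks per node */

    struct mem_region {
        unsigned long start_pfn;
        unsigned long spanned_pages;
        int powered_on;
        struct zone zones[MAX_NR_ZONES]; /* zones are multiplied per region */
    };

    struct node {
        int node_id;
        int nr_regions;
        struct mem_region regions[MAX_NR_REGIONS];
        /* the allocator prefers the lowest-numbered powered-on region */
    };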
Your editor is not a memory management hacker, but ignorance has never kept him from having an opinion on things. To a naive point of view, it would almost seem like this design has been done backward - that regions should really be contained within zones. That would avoid multiplying the number of zones in the system and the associated balancing costs. Also, importantly, it would allow regions to be controlled by the policy of a single enclosing zone. In particular, regions inside a zone used for movable allocations would be vacated with relative ease, allowing them to be powered down when memory pressure is light. Placing multiple zones within each region, instead, would make clearing a region harder.
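For comparison, the alternative being suggested here might be sketched as follows (again, invented names only): the number of zones per node stays fixed, and each zone simply tracks which power-manageable regions it spans, so a region inherits the policy of its enclosing zone.

    /* Illustrative only: regions contained within zones. */
    #define MAX_REGIONS_PER_ZONE 8       /* hypothetical */

    struct mem_region {
        unsigned long start_pfn;
        unsigned long spanned_pages;
        int powered_on;
    };

    struct zone {
        unsigned long start_pfn;
        unsigned long present_pages;
        unsigned long min_free_pages;    /* one set of watermarks per zone */
        int nr_regions;
        struct mem_region regions[MAX_REGIONS_PER_ZONE];
    };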
The patch set has not gotten a lot of review attention; the people who know what they are talking about in this area have mostly kept silent. There are numerous memory management patches circulating at the moment, so time for review is probably scarce. Andrew Morton did ask about the overhead of this work on machines which lack the PASR capability and about how much power might actually be saved; answers to those questions don't seem to be available at the moment. So one might conclude that this patch set, while demonstrating an approach to memory power management, will not be ready for mainline inclusion in the near future. But, then, adding power management to such a tricky subsystem was never going to be done in a hurry.
Posted Jun 9, 2011 13:33 UTC (Thu) by ejr (subscriber, #51652)
Posted Jun 10, 2011 10:19 UTC (Fri) by Ankita (guest, #39147)
Posted Jun 10, 2011 13:40 UTC (Fri) by ejr (subscriber, #51652)
Unfortunately, there is always memory pressure in my areas (massive graph analysis, numerics, etc.). I'm working on a memory concurrency v. performance model for some graph tasks with a thought towards powering down unneeded cores (think SCC) when they cannot contribute. Now I'm wondering if there's some way to consider the memory side, again assuming everything fits (which it doesn't).
There also is much, *much* work going into dropping NAND flash into DRAM slots (phase change, plus DRAM cache). That will change the power usage characteristics drastically. If you can turn off the DRAM cache without having to flush out all the data...
Posted Jun 11, 2011 0:29 UTC (Sat) by giraffedata (guest, #1954)
I think it's usually the case that essentially all the memory is "used," so I don't get how the proposed policy can be effective. Memory is "used" to varying degrees, i.e. how important the contents are. Some is actually indispensable because there's no practical way to recreate its contents. But other memory is a cache of file contents, cache of dentries, memory that could be moved to swap space, and the like.
There's also internal fragmentation in memory allocation pools -- memory that's used just to anticipate future allocations and save time.
So I don't see any policy that just says power down totally free memory as being terribly useful. We need a policy that weighs for each page the value of having that data in memory vs the cost of powering that page.
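To make that kind of weighing concrete, a policy along these lines might look something like the following sketch; all of the names and weights are invented for illustration, and none of this comes from the patch set.

    /* Invented example: power a bank down only when the estimated cost of
     * dealing with its contents is lower than the benefit of turning it off. */
    struct bank_contents {
        unsigned long free_pages;          /* nothing to do for these */
        unsigned long clean_cache_pages;   /* droppable, refaulted later */
        unsigned long dirty_or_anon_pages; /* need writeback or migration */
        unsigned long pinned_kernel_pages; /* cannot be moved at all */
    };

    /* Returns nonzero if powering the bank down looks worthwhile. */
    static int worth_powering_down(const struct bank_contents *b,
                                   unsigned long power_benefit)
    {
        unsigned long cost;

        if (b->pinned_kernel_pages)
            return 0;                      /* unmovable data keeps it powered */

        /* Arbitrary weights: refaulting clean page cache is cheap;
         * writing back or migrating dirty/anonymous pages is not. */
        cost = b->clean_cache_pages + 10 * b->dirty_or_anon_pages;
        return cost < power_benefit;
    }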
Posted Jun 11, 2011 2:21 UTC (Sat) by Ankita (guest, #39147)
Posted Jun 9, 2011 22:07 UTC (Thu) by djm1021 (guest, #31130)
Posted Jun 10, 2011 8:13 UTC (Fri) by Ankita (guest, #39147)
You are right that creating a set of zones inside each region results in a bloat in the number of zones. But the difficulty we faced is that zones already encapsulate some boundary information that may be at a level lower than regions. A region could span multiple zones, in which case we would need another mechanism to group these sub-regions into one region that maps to an independently managed power unit.
For instance, a single NUMA node with 8GB of RAM will come up with two zones, ZONE_DMA and ZONE_NORMAL. But if this node supports power management at a different granularity, say 2GB, then we would create four regions, spanning the two zones. Targeted allocation and reclaim would then depend on another piece of information to unite the sub-regions that form a single power-manageable unit. Further, zone policies like movable allocations can still be leveraged when zones sit under regions. However, as Dave pointed out, it is important to understand the performance impact of this change.
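To put numbers on that example (a sketch only, not code from the patch set): with a 2GB power-management granularity, mapping a physical address to its region is a simple division, and at least one of the resulting regions straddles the boundary between the two zones, which is exactly the case that makes nesting regions inside zones awkward.

    #define REGION_BYTES (2UL << 30)   /* 2GB power-management granularity */

    /* For the 8GB node in the example this yields regions 0 through 3; the
     * ZONE_DMA/ZONE_NORMAL boundary falls inside one of them, so that region
     * spans both zones and targeted allocation/reclaim must consult both. */
    static inline unsigned int phys_to_region(unsigned long phys_addr)
    {
        return phys_addr / REGION_BYTES;
    }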
Also, besides PASR, there are other mechanisms by which memory power can be conserved. The Samsung Exynos 4210, for instance, has support for automatic power-down of memory: if there are no references to certain areas of memory for a certain threshold of time, the hardware will automatically put that area into a lower power state without losing its contents. A basic infrastructure making the VM aware of the hardware topology would aid the hardware in placing memory into lower power states.
-Ankita
Posted Jun 10, 2011 14:24 UTC (Fri) by ccurtis (guest, #49713)
Posted Jun 10, 2011 14:30 UTC (Fri) by mjg59 (subscriber, #23239)
Posted Jun 10, 2011 15:05 UTC (Fri) by ccurtis (guest, #49713)
Perhaps I was a bit too terse. Before creating this extensive infrastructure, perhaps it would be better to get an idea of what kind of power savings this actually provides. Code would still need to be written to power down the memory bank, and code would also likely need to be written to isolate the RAM excluded by the boot parameter, but this seems like a relatively easy way to answer the question before embarking on the endeavor. Of course, this may be a done deal and it's just a matter of time before the code gets written, but it would still be interesting to see how much power is actually going to be saved. A patch like this would also allow individuals to measure the power savings on their own systems, in case they wanted to weigh that saving against any overhead the new memory management changes might impose.
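For reference, a crude version of this experiment is already possible without new code: the kernel's mem= boot parameter caps the amount of memory the kernel will use, so an 8GB machine could be booted with, for example:

    mem=4G

and its power draw compared at the wall. Whether the firmware or the memory controller then actually places the untouched banks into a lower-power state is, of course, hardware-dependent.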
Posted Jun 10, 2011 16:20 UTC (Fri) by etienne (guest, #25256)
Posted Jun 11, 2011 3:19 UTC (Sat) by willy (subscriber, #9762)
As I understand PASR, one would not power down an entire DIMM, but rather sections of each DIMM, thus preserving the performance benefits of interleaving.
Posted Jun 16, 2011 4:54 UTC (Thu) by Ankita (guest, #39147)
Posted Jun 13, 2011 14:11 UTC (Mon) by nye (subscriber, #51576)
Sadly, the existing power supply proved inadequate and the system would crash shortly after booting, if it even booted at all. The memory was a single DIMM so there wasn't anything I could remove to reduce power consumption, however it turned out that telling the kernel to use only about 800M allowed the system to remain completely stable for the few days it took to get a new power supply.
Thus I conclude that the most likely explanation is that RAM which the OS doesn't believe even exists actually does use less power than RAM which is simply not in use at the time.
Posted Jun 16, 2011 18:26 UTC (Thu) by Pc5Y9sbv (guest, #41328)
It could even have to do with certain combinations of address and data bits that required more power to configure the addressing logic and route the data signals, so that the memory didn't stabilize within the configured access timings.
The end result is that certain memory addresses in a given module will tend to show corruption before others as the power supply sags or the timings get too tight. This is why people advocate long runs with a dedicated memory test program to try to validate parts in situ. Just running an OS may not exercise combinations of address and data bits with sufficient testing coverage, at least not for many hours (or weeks!) of operation.
Posted Jun 12, 2011 17:45 UTC (Sun) by pipipen (guest, #56099)
>> After years of effort to improve the kernel's power behavior, add instrumentation to track wakeups, and fix misbehaving applications,
Posted Jul 7, 2011 6:56 UTC (Thu) by henrik.kjellberg (guest, #61640)
The whole thesis can be read here: http://sam.cs.lth.se/ExjobGetFile?id=239 and contains more background information about the system structure. It may be good reading for those who are curious about the technology but lack knowledge of the memory structure.
Best regards,
Henrik Kjellberg