research-article

Efficient virtual memory for big memory servers

Authors:

Arkaprava Basu,

Jayneel Gandhi,

Michael M. Swift

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

Pages 237 - 248

https://doi.org/10.1145/2485922.2485943

Published: 23 June 2013

Abstract

Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory.

To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space. Direct segments use minimal hardware (base, limit, and offset registers per core) to map contiguous virtual memory regions directly to contiguous physical memory. They eliminate the possibility of TLB misses for key data structures such as database buffer pools and in-memory key-value stores. Memory mapped by a direct segment may be converted back to paging when needed.
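The translation rule described above can be illustrated with a minimal Python sketch. It is not the authors' hardware or Linux prototype. The half-open range check `base <= vaddr < limit`, the 4 KiB page size, the dictionary standing in for the page table, and the names `DirectSegment` and `translate` are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed 4 KiB base pages for the paged fallback path


@dataclass
class DirectSegment:
    """Hypothetical model of the per-core base, limit, and offset registers."""
    base: int    # first virtual address covered by the segment
    limit: int   # first virtual address past the end of the segment (assumed exclusive)
    offset: int  # added to a covered virtual address to form the physical address

    def covers(self, vaddr: int) -> bool:
        return self.base <= vaddr < self.limit


def translate(vaddr: int, segment: DirectSegment, page_table: dict) -> int:
    """Translate a virtual address to a physical address.

    Addresses inside the direct segment need only a range check and an
    addition, so they never consult the page table (or a TLB). Every other
    address falls back to ordinary page-based translation, modelled here
    as a dict from virtual page number to physical frame number.
    """
    if segment.covers(vaddr):
        return vaddr + segment.offset
    vpn, page_offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return page_table[vpn] * PAGE_SIZE + page_offset


# Example: a 256 MiB segment plus a single conventionally paged page.
segment = DirectSegment(base=0x1000_0000, limit=0x2000_0000, offset=0x4000_0000)
page_table = {0x5: 0x9}  # virtual page 5 maps to physical frame 9

in_segment = translate(0x1000_0010, segment, page_table)
paged = translate(0x5123, segment, page_table)
```

In this sketch, the address inside the segment is translated without any page-table lookup, while the address outside it takes the paged path, which is the split between direct-segment and page-mapped memory that the abstract describes.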

We prototype direct-segment software support for x86-64 in Linux and emulate direct-segment hardware. For our workloads, direct segments eliminate almost all TLB misses and reduce the execution time wasted on TLB misses to less than 0.5%.

Index Terms

Efficient virtual memory for big memory servers
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information & Contributors

Information

Published In

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

June 2013

686 pages

ISBN:9781450320795

DOI:10.1145/2485922

General Chair:
Avi Mendelson
Technion

ACM SIGARCH Computer Architecture News Volume 41, Issue 3
ISCA '13
June 2013
666 pages
ISSN:0163-5964
DOI:10.1145/2508148

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

IEEE CS

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ISCA'13

Sponsor:

ISCA'13: The 40th Annual International Symposium on Computer Architecture

June 23 - 27, 2013

Tel-Aviv, Israel

Acceptance Rates

ISCA '13 Paper Acceptance Rate 56 of 288 submissions, 19%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Article Metrics

Total Citations: 264
Total Downloads: 3,188
Downloads (Last 12 months): 326
Downloads (Last 6 weeks): 49

Reflects downloads up to 18 Dec 2024

Citations

Cited By

Qu H, Yu Z, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). WASP: Workload-Aware Self-Replicating Page-Tables for NUMA Servers. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 1233-1249. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620665.3640369
Zhang J, Jia W, Chai S, Liu P, Kim J, Xu T, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). Direct Memory Translation for Virtualized Clouds. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 287-304. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620665.3640358
Maas M, Andersen D, Isard M, Javanmard M, McKinley K, Raffel C (2024). Combining Machine Learning and Lifetime-Based Resource Management for Memory Allocation and Beyond. Communications of the ACM 67:4, pp. 87-96. Online publication date: 25-Mar-2024.
https://dl.acm.org/doi/10.1145/3611018
Han J, Gosakan K, Kuszmaul W, Mubarek I, Mukherjee N, Sriram K, Tagliavini G, West E, Bender M, Bhattacharjee A, Conway A, Farach-Colton M, Gandhi J, Johnson R, Kannan S, Porter D (2024). Mosaic Pages: Big TLB Reach With Small Pages. IEEE Micro 44:4, pp. 52-59. Online publication date: 6-Jun-2024.
https://dl.acm.org/doi/10.1109/MM.2024.3409181
Zhao K, Xue K, Wang Z, Schatzberg D, Yang L, Manousis A, Weiner J, Riel R, Sharma B, Tang C, Skarlatos D (2024). Contiguitas: The Pursuit of Physical Memory Contiguity in Data Centers. IEEE Micro 44:4, pp. 44-51. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1109/MM.2024.3406933
Li B, Wang Y, Wang T, Eeckhout L, Yang J, Jaleel A, Tang X (2024). STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 309-323. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00031
Kwon O, Lee Y, Park J, Jang S, Tak B, Hong S (2024). Distributed Page Table: Harnessing Physical Memory as an Unbounded Hashed Page Table. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 36-49. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00013
Psomadakis S, Alverti C, Karakostas V, Katsakioris C, Siakavaras D, Nikas K, Goumas G, Koziris N (2024). Elastic Translations: Fast Virtual Memory with Multiple Translation Sizes. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 17-35. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00012
Yao Y, Wang X, Zhou D, Li L, Wu J, Zhu L, Wang Z, Luo Y (2024). EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis. Euro-Par 2024: Parallel Processing, pp. 166-179. Online publication date: 26-Aug-2024.
https://dl.acm.org/doi/10.1007/978-3-031-69577-3_12
Manocha A, Yan Z, Tureci E, Aragón J, Nellans D, Martonosi M (2023). Architectural Support for Optimizing Huge Page Selection Within the OS. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1213-1226. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3614296
