
An Empirical Evaluation of PTE Coalescing

Authors:

Eliot H. Solomon,

Yufeng Zhou,

Alan L. Cox

MEMSYS '23: Proceedings of the International Symposium on Memory Systems

Article No.: 20, Pages 1 - 16

https://doi.org/10.1145/3631882.3631902

Published: 08 April 2024


Abstract

Superpages (also known as huge pages) are an effective technique for reducing the latency of virtual-to-physical address translation on modern processors. However, the large size of the 2 MB and 1 GB superpages supported by x86-64 processors continues to present a challenge to the operating system’s ability to form superpages, given the mandatory contiguity, alignment, and attribute requirements of a superpage. Recent work proposes medium-sized superpages as a potential solution, by allowing the creation of smaller superpages where 2 MB and larger superpages have not formed or will not be possible to form. Notably, AMD processors starting with the Zen microarchitecture have offered a “PTE Coalescing” feature where the hardware opportunistically and transparently creates, from underlying consecutive and aligned 4 KB mappings in the page table, 16 KB or 32 KB mappings to be cached in the TLB. On the surface, this feature requires no modifications to the operating system or the compiler toolchain, exploiting only coincidental contiguity and alignment. Nonetheless, there are ways that either the operating system or the toolchain can be made coalescing-aware and hence make better use of PTE Coalescing. This paper first investigates undocumented aspects of PTE Coalescing, and then evaluates some operating system and toolchain optimizations which explicitly take advantage of it. We find that an operating system that is coalescing-friendly reduces L1 ITLB misses by 50%-80% compared to an operating system that is coalescing-unaware. For a Clang compilation workload, a coalescing-friendly operating system coupled with PTE Coalescing all but eliminates L2 ITLB misses. Last but not least, we evaluate the impact of granularity (16 KB vs 32 KB) on the effectiveness of PTE Coalescing. We find that reducing the coalescing granularity from 32 KB to 16 KB leads to a 1.3x-20.5x reduction in 4 KB L2 DTLB misses in a wide variety of workloads.
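To make the notion of "consecutive and aligned 4 KB mappings" concrete, the following is a minimal sketch of a software check for groups of page-table entries that could be candidates for coalescing. It is not the hardware's actual logic, whose details the abstract describes as undocumented. The function name `coalescible_groups`, the dictionary-based page table, and the requirements that the physical base frame be aligned and that all entries share identical attributes are assumptions made for illustration. A group size of 4 pages corresponds to 16 KB and 8 pages to 32 KB.

```python
PAGE_SIZE = 4096  # 4 KB base page


def coalescible_groups(page_table, group_pages=4):
    """Return the base virtual page numbers of candidate groups.

    page_table maps a virtual page number (VPN) to a tuple of
    (physical frame number, attributes). This layout is hypothetical.

    Assumed conditions for a group of `group_pages` entries:
      * the first VPN is a multiple of group_pages (virtual alignment),
      * the first frame is a multiple of group_pages (physical alignment,
        an assumption),
      * frames are consecutive and attributes identical (also assumed).
    """
    groups = []
    for vpn in sorted(page_table):
        if vpn % group_pages != 0:
            continue  # not the start of an aligned virtual group
        base_pfn, base_attrs = page_table[vpn]
        if base_pfn % group_pages != 0:
            continue  # physical base is not aligned
        # Every following entry must exist, be physically consecutive,
        # and carry the same attributes as the first entry.
        if all(
            page_table.get(vpn + i) == (base_pfn + i, base_attrs)
            for i in range(1, group_pages)
        ):
            groups.append(vpn)
    return groups


# Example: VPNs 0-3 map to frames 8-11; VPNs 4-7 have a gap at frame 22.
example = {
    0: (8, "rw"), 1: (9, "rw"), 2: (10, "rw"), 3: (11, "rw"),
    4: (20, "rw"), 5: (21, "rw"), 6: (23, "rw"), 7: (24, "rw"),
}
print(coalescible_groups(example, group_pages=4))
print(coalescible_groups(example, group_pages=8))
```

In this example only the first four pages qualify at the 16 KB granularity, and nothing qualifies at 32 KB because the physical frames are not consecutive across all eight pages. This illustrates why a smaller coalescing granularity can find more opportunities in a fragmented mapping.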

