research-article

TLB Shootdown Mitigation for Low-Power Many-Core Servers with L1 Virtual Caches

Authors:

Abhishek Bhattacharjee,

Trey Cain

IEEE Computer Architecture Letters, Volume 17, Issue 1

Pages 17 - 20

https://doi.org/10.1109/LCA.2017.2712140

Published: 01 January 2018

Abstract

Power efficiency has become one of the most important design constraints for high-performance systems. In this paper, we revisit the design of low-power virtually-addressed caches. While virtually-addressed caches enable significant power savings by obviating the need for Translation Lookaside Buffer (TLB) lookups, they suffer from several challenging design issues that curtail their widespread commercial adoption. We focus on one of these challenges: cache flushes due to virtual page remappings. We use detailed studies on an ARM many-core server to show that this problem degrades performance by up to 25 percent for a mix of multi-programmed and multi-threaded workloads. Interestingly, we observe that many of these flushes are spurious, caused by indiscriminate invalidation broadcasts on the ARM architecture. In response, we propose a low-overhead and readily implementable hardware mechanism that uses Bloom filters to reduce spurious invalidations and mitigate their ill effects.
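
The abstract does not spell out how the Bloom filters are organized, so the following Python sketch is only an illustration of the general idea, not the authors' hardware design. It assumes each core keeps a small Bloom filter over the virtual page numbers it has brought into its virtual L1 cache, and consults that filter when an invalidation broadcast arrives: a negative answer means the page is definitely not cached, so the flush can be skipped. The names `PageBloomFilter` and `Core`, the filter size, the hash count, and the SHA-256-based hashing are all hypothetical choices made for readability.

```python
import hashlib


class PageBloomFilter:
    """Bloom filter over virtual page numbers (illustrative assumption only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, vpn):
        # Derive num_hashes bit positions from salted SHA-256 digests of the VPN.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{vpn}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, vpn):
        for pos in self._positions(vpn):
            self.bits[pos] = True

    def may_contain(self, vpn):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(vpn))

    def clear(self):
        self.bits = [False] * self.num_bits


class Core:
    """Hypothetical core with a virtually-addressed L1 cache."""

    def __init__(self):
        self.cached_pages = set()        # stand-in for the L1 contents
        self.filter = PageBloomFilter()
        self.flushes = 0                 # flushes actually performed
        self.filtered = 0                # broadcasts ignored as spurious

    def access(self, vpn):
        # Record that lines from this virtual page now live in the L1.
        self.cached_pages.add(vpn)
        self.filter.insert(vpn)

    def on_invalidation_broadcast(self, vpn):
        # Skip the flush when the filter proves the page is not cached here.
        if not self.filter.may_contain(vpn):
            self.filtered += 1
            return False
        # Otherwise flush conservatively (assumed policy) and reset the filter.
        self.cached_pages.clear()
        self.filter.clear()
        self.flushes += 1
        return True
```

Because a Bloom filter never reports a false negative, skipping a flush on a negative lookup is safe in this model; a false positive only costs an unnecessary flush, which is the behavior the unfiltered broadcast would have had anyway.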

Information & Contributors

Information

Published In

IEEE Computer Architecture Letters Volume 17, Issue 1

January 2018

99 pages

ISSN: 1556-6056

Copyright © 2018.

Publisher

IEEE Computer Society

United States

Cited By

Gugale H, Gulur N, Marathe Y, John L, Sarkar V, Kim H (2020). ATTC (@C). Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 481-492. DOI: 10.1145/3410463.3414653. Online publication date: 30-Sep-2020.
https://dl.acm.org/doi/10.1145/3410463.3414653
Tang X, Zhang Z, Xu W, Kandemir M, Melhem R, Yang J, Sarkar V, Kim H (2020). Enhancing Address Translations in Throughput Processors via Compression. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 191-204. DOI: 10.1145/3410463.3414633. Online publication date: 30-Sep-2020.
https://dl.acm.org/doi/10.1145/3410463.3414633
Amit N, Tai A, Wei M, Bilas A, Magoutis K, Markatos E, Kostic D, Seltzer M (2020). Don't shoot down TLB shootdowns! Proceedings of the Fifteenth European Conference on Computer Systems, pp. 1-14. DOI: 10.1145/3342195.3387518. Online publication date: 15-Apr-2020.
https://dl.acm.org/doi/10.1145/3342195.3387518
Bharadwaj S, Cox G, Krishna T, Bhattacharjee A, Oskin M, Inoue K (2018). Scalable distributed last-level TLBs using low-latency interconnects. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 271-284. DOI: 10.1109/MICRO.2018.00030. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00030
Parasar M, Bhattacharjee A, Krishna T (2018). SEESAW. Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 193-206. DOI: 10.1109/ISCA.2018.00026. Online publication date: 2-Jun-2018.
https://dl.acm.org/doi/10.1109/ISCA.2018.00026
