Article

Enabling scalability and performance in a large scale CMP environment

Authors:

Ali-Reza Adl-Tabatabai,

Mohan Rajagopalan,

Richard L. Hudson,

Tatiana Shpeisman,

Anwar Rohillah,

Jesse Fang

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

Pages 73 - 86

https://doi.org/10.1145/1272996.1273006

Published: 21 March 2007

Abstract

Hardware trends suggest that large-scale CMP architectures, with tens to hundreds of processing cores on a single piece of silicon, are imminent within the next decade. While existing CMP machines have traditionally been handled in the same way as SMPs, this magnitude of parallelism introduces several fundamental challenges at the architectural level, which in turn translate to novel challenges in the design of the software stack for these platforms. This paper presents the "Many Core Run Time" (McRT), a software prototype of an integrated language runtime that was designed to explore configurations of the software stack for enabling performance and scalability on large-scale CMP platforms. It describes the architecture of McRT and discusses our experiences with the system, including an experimental evaluation that led to several interesting, non-intuitive findings, providing key insights about the structure of the system stack at this scale. A key contribution of this paper is to demonstrate how McRT enables near-linear improvements in performance and scalability for desktop workloads such as the popular XviD encoder and a set of RMS (recognition, mining, and synthesis) applications. Another key contribution of this work is its use of McRT to explore non-traditional system configurations such as a light-weight executive in which McRT runs on "bare metal" and replaces the traditional OS. Such configurations are becoming an increasingly attractive alternative for leveraging heterogeneous computing units, as seen in today's CPU-GPU configurations.

Index Terms

Enabling scalability and performance in a large scale CMP environment
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        1. Concurrent programming languages
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems

Information

Published In

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

March 2007

431 pages

ISBN:9781595936363

DOI:10.1145/1272996

ACM SIGOPS Operating Systems Review Volume 41, Issue 3
EuroSys'07 Conference Proceedings
June 2007
386 pages
ISSN:0163-5980
DOI:10.1145/1272998

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

EuroSys07

Sponsor:

SIGOPS

EuroSys07: Eurosys 2007 Conference

March 21 - 23, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
