research-article

Shuffling: a framework for lock contention aware thread scheduling for multicore multiprocessor systems

Authors:

Kishore Kumar Pusukuri,

Laxmi N. Bhuyan

PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation

Pages 289 - 300

https://doi.org/10.1145/2628071.2628074

Published: 24 August 2014

Abstract

On a cache-coherent multicore multiprocessor system, the performance of a multithreaded application with high lock contention is very sensitive to the distribution of application threads across multiple processors (or Sockets). This is because the distribution of threads affects the frequency of lock transfers between Sockets, which in turn affects the frequency of last-level cache (LLC) misses that lie on the critical path of execution. Since the latency of an LLC miss is high, an increase in LLC misses on the critical path increases both lock acquisition latency and critical section processing time. However, thread schedulers for operating systems such as Solaris and Linux are oblivious to the lock contention among multiple threads belonging to an application and therefore fail to deliver high performance for multithreaded applications.
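The link between thread placement and lock transfers can be illustrated with a toy model. The sketch below is not from the paper: it assumes a hypothetical trace of which thread acquired a single lock, in order, and counts how often the lock passes between threads on different Sockets. The function name `cross_socket_transfers` and both placements are illustrative only.

```python
from typing import Dict, Sequence


def cross_socket_transfers(acquisition_order: Sequence[int],
                           socket_of: Dict[int, int]) -> int:
    """Count lock handoffs whose consecutive holders run on different sockets."""
    transfers = 0
    for prev, curr in zip(acquisition_order, acquisition_order[1:]):
        if socket_of[prev] != socket_of[curr]:
            transfers += 1
    return transfers


# Hypothetical trace: threads 0 and 1 contend heavily, threads 2 and 3 rarely.
order = [0, 1, 0, 1, 0, 1, 2, 0, 1, 3]

# Hypothetical placements (thread id -> socket id).
spread = {0: 0, 1: 1, 2: 0, 3: 1}   # the two contending threads are split
grouped = {0: 0, 1: 0, 2: 1, 3: 1}  # the two contending threads share socket 0

print(cross_socket_transfers(order, spread))   # 7
print(cross_socket_transfers(order, grouped))  # 3
```

In this toy trace, co-locating the two heavily contending threads cuts the number of cross-Socket handoffs from 7 to 3. Under the abstract's argument, each avoided handoff is one fewer opportunity for an LLC miss on the critical path.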

To alleviate this problem, we propose a scheduling framework called Shuffling, which migrates threads of a multithreaded program across Sockets so that threads seeking locks are more likely to find the locks on the same Socket. Shuffling reduces the time threads spend acquiring locks and speeds up the execution of shared data accesses in the critical section, ultimately reducing the execution time of the application. We have implemented Shuffling on a 64-core Supermicro server running Oracle Solaris 11™ and evaluated it using a wide variety of 20 multithreaded programs with high lock contention. Our experiments show that Shuffling achieves up to 54% reduction in execution time and an average reduction of 13%. Moreover, it does not require any changes to the application source code or the OS kernel.
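The abstract does not say how Shuffling decides which threads to migrate. As a minimal sketch of one plausible policy, assuming a per-thread lock-time measurement is available, the code below ranks threads by that measurement and assigns neighbours in the ranking to the same Socket. The `lock_time` input, the `plan_placement` name, and the even-split rule are all assumptions made for illustration; actually moving threads would require an OS-specific binding call that is not shown.

```python
from typing import Dict


def plan_placement(lock_time: Dict[int, float], num_sockets: int) -> Dict[int, int]:
    """Return thread id -> socket id, grouping threads with similar lock times.

    lock_time is a hypothetical measurement (e.g. fraction of time each
    thread spent waiting for locks in the last sampling interval).
    """
    if num_sockets <= 0:
        raise ValueError("num_sockets must be positive")
    # Rank threads from most to least lock-bound; break ties by thread id.
    ranked = sorted(lock_time, key=lambda tid: (-lock_time[tid], tid))
    # Ceiling division: how many threads each socket receives at most.
    per_socket = -(-len(ranked) // num_sockets)
    return {tid: rank // per_socket for rank, tid in enumerate(ranked)}


# Hypothetical measurements for four threads on a two-socket machine.
sample = {0: 0.40, 1: 0.05, 2: 0.35, 3: 0.02}
placement = plan_placement(sample, num_sockets=2)
print(placement)  # {0: 0, 2: 0, 1: 1, 3: 1}
```

Here the two most lock-bound threads (0 and 2) land on Socket 0, so a lock they fight over is more likely to stay within one Socket. A real scheduler would also have to weigh migration cost and load balance, which this sketch ignores.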

Index Terms

Shuffling: a framework for lock contention aware thread scheduling for multicore multiprocessor systems
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Multithreading
        Process synchronization
        Scheduling

Published In

PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation

August 2014

514 pages

ISBN:9781450328098

DOI:10.1145/2628071

General Chair: J. Nelson Amaral, University of Alberta, Canada

Program Chair: Josep Torrellas, University of Illinois, USA

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing
IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '14: International Conference on Parallel Architectures and Compilation

August 24 - 27, 2014

Edmonton, AB, Canada

Acceptance Rates

PACT '14 Paper Acceptance Rate 54 of 144 submissions, 38%;

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics

Article Metrics

Total Citations: 9
Total Downloads: 275
Downloads (Last 12 months): 16
Downloads (Last 6 weeks): 3

Reflects downloads up to 15 Jan 2025

Cited By

Maqbool F, Malik A, Raza Naqvi S, Ahmed N, D’Angelo G, Mahmood I (2019). SEECSSim: A toolkit for parallel and distributed simulations for mobile devices. Journal of Simulation 15:3 (235-260). Online publication date: 20-Dec-2019.
https://doi.org/10.1080/17477778.2019.1701958

Jeong B, Khan A, Park S (2019). Async-LCAM. Cluster Computing 22:2 (373-384). Online publication date: 1-Jun-2019.
https://dl.acm.org/doi/10.1007/s10586-018-2832-5

Zhao P, Shen Z (2018). TSP: A Threads Scheduling Policy for Hierarchical Locks in Multiple Applications Scenario. 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS) (861-864). Online publication date: Nov-2018.
https://doi.org/10.1109/ICSESS.2018.8663839

Maqbool F, Naqvi S, Malik A (2017). Why to redesign PDES framework for smart devices. Proceedings of the Summer Simulation Multi-Conference (1-11). Online publication date: 9-Jul-2017.
https://dl.acm.org/doi/10.5555/3140065.3140085

Cai M, Liu S, Huang H (2017). tScale: A Contention-Aware Multithreaded Framework for Multicore Multiprocessor Systems. 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS) (334-343). Online publication date: Dec-2017.
https://doi.org/10.1109/ICPADS.2017.00052

Chabbi M, Mellor-Crummey J (2016). Contention-conscious, locality-preserving locks. ACM SIGPLAN Notices 51:8 (1-14). Online publication date: 27-Feb-2016.
https://dl.acm.org/doi/10.1145/3016078.2851166

Chabbi M, Mellor-Crummey J, Asenjo R, Harris T (2016). Contention-conscious, locality-preserving locks. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (1-14). Online publication date: 27-Feb-2016.
https://dl.acm.org/doi/10.1145/2851141.2851166

Massari G, Fornaciari W, Libutti S (2016). Co-scheduling tasks on multi-core heterogeneous systems: An energy-aware perspective. IET Computers & Digital Techniques 10:2 (77-84). Online publication date: 1-Mar-2016.
https://doi.org/10.1049/iet-cdt.2015.0053

Pusukuri K, Gupta R, Bhuyan L (2015). Tumbler. ACM Transactions on Architecture and Code Optimization 12:4 (1-24). Online publication date: 16-Nov-2015.
https://dl.acm.org/doi/10.1145/2827698
