[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2628071.2628074acmconferencesArticle/Chapter ViewAbstractPublication PagespactConference Proceedingsconference-collections
research-article

Shuffling: a framework for lock contention aware thread scheduling for multicore multiprocessor systems

Published: 24 August 2014 Publication History

Abstract

On a cache-coherent multicore multiprocessor system, the performance of a multithreaded application with high lock contention is very sensitive to the distribution of application threads across multiple processors (or Sockets). This is because the distribution of threads impacts the frequency of lock transfers between Sockets, which in turn impacts the frequency of last-level cache (LLC) misses that lie on the critical path of execution. Since the latency of a LLC miss is high, an increase of LLC misses on the critical path increases both lock acquisition latency and critical section processing time. However, thread schedulers for operating systems, such as Solaris and Linux, are oblivious of the lock contention among multiple threads belonging to an application and therefore fail to deliver high performance for multithreaded applications.
To alleviate the above problem, in this paper, we propose a scheduling framework called Shuffling, which migrates threads of a multithreaded program across Sockets so that threads seeking locks are more likely to find the locks on the same Socket. Shuffling reduces the time threads spend on acquiring locks and speeds up the execution of shared data accesses in the critical section, ultimately reducing the execution time of the application. We have implemented Shuffling on a 64-core Supermicro server running Oracle Solaris 11™ and evaluated it using a wide variety of 20 multithreaded programs with high lock contention. Our experiments show that Shuffling achieves up to 54% reduction in execution time and an average reduction of 13%. Moreover it does not require any changes to the application source code or the OS kernel.

References

[1]
M. Bhadauria and S. A. McKee. An Approach to Resource-Aware Co-Scheduling for CMPs. In ICS, 2010.
[2]
C. Bienia, S. Kumar, J.P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In PACT, 2008.
[3]
S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova. A Case for NUMA-aware Contention Management on Multicore Systems. In USENIX ATC, 2011.
[4]
S. Boyd-Wickizer, R. Morris, and M. F. Kaashoek. Reinventing Scheduling for Multicore Systems. In HotOS, 2009.
[5]
S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R .Morris, A. Pesterev, L. Stein, M. Wu, Y. D. Y. Zhang, and Z. Zhang. Corey: An operating system for many cores. In OSDI, 2008.
[6]
S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, F. Kaashoek, R.Morris, and N. Zeldovich. An Analysis of Linux Scalability to Many Cores. In OSDI, 2010.
[7]
T. Brecht. On the Importance of Parallel Application Placement in NUMA Multiprocessors. In SEDMS, 1993.
[8]
B. Cantrill, M. Shapiro, and A. Leventhal. Dynamic instrumentation of production systems. In USENIX ATC, 2004.
[9]
R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum. Scheduling and page migration for multiprocessor compute servers. In ASPLOS 1994.
[10]
J. Chen, W. Watson, and W. Mao. Multi-Threading Performance on Commodity Multi-core Processors. In HPC-Asia, 2007.
[11]
J. Corbalan, X. Martorell, and J. Labarta. Evaluation of the Memory Page Migration Influence in the System Performance: the Case of the SGI O2000. In SC, 2003.
[12]
J. Corbalan, X. Martorell, and J. Labarta. Performance-driven processor allocation. In OSDI, 2000.
[13]
Y. Cui, Y. Wang, Y. Chen, and Y.Shi. Lock-contention-aware scheduler: A scalable and energy-efficient method for addressing scalability collapse on multicore systems. In ACM TACO, 4, Article 44, Jan. 2013.
[14]
D. Dice, V. Marathe, and N. Shavit. Flat Combining NUMA Locks. In SPAA, 2011.
[15]
D. Dice, V. Marathe, and N. Shavit. Lock Cohorting: A General Technique for Designing NUMA Locks. In PPoPP, 2012.
[16]
X. Ding, K. Wang, P. B. Gibbons, and X. Zhang. BWS: balanced work stealing for time-sharing multicores. In Eurosys, 2012.
[17]
E. Frachtenberg, D. G. Feitelson, F. Petrini, and J. Fernandez. Adaptive parallel job scheduling with flexible coscheduling. In IEEE TPDS, (2005), 16(11).
[18]
A. Gupta, A. Tucker, and S. Urushibara. The Impact Of Operating System Scheduling Policies And Synchronization Methods On Performance Of Parallel Applications. In SIGMETRICS, 1991.
[19]
L. Jean-Pierre, D. Florian, T. Gaël, L. Julia and M. Gilles. Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications. In USENIX ATC, 2012.
[20]
J. A. Joao, M. A. Suleman, O. Mutlu, and Y. N. Patt. Bottleneck Identification and Scheduling in Multithreaded Applications. In ASPLOS, 2012.
[21]
R. Johnson, R. Stoica, A. Ailamaki, and T. C. Mowry. Decoupling contention management from scheduling. In ASPLOS, 2010.
[22]
R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS Observations to Improve Performance in Multicore Systems. In IEEE Micro, 2008
[23]
R.P. Larowe, C. S. Ellis, and M. A. Holliday. Evaluation of NUMA Memory Management Through Modeling and Measurements. In IEEE TPDS, (1991), 688 -- 701.
[24]
Z. Majo and T. R. Gross. Memory management in NUMA multicore systems: Trapped between cache contention and interconnect overhead. In ISMM, 2011.
[25]
J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations. In MICRO, 2011.
[26]
R. McDougall and J. Mauro. Solaris Internals. Prentice Hall Publications, Second Edition, 2006.
[27]
R. McGregor, C. Antonopoulos, and D. Nikolopoulos. Scheduling algorithms for effective thread pairing on hybrid multiprocessors. In IPDPS, 2005.
[28]
A. Mendelson and F. Gabbay. 2001. The effect of seance communication on multiprocessing systems. In ACM Trans. Comput. Syst. 19, 2 (May 2001), 252--281.
[29]
A. Merkel, J. Stoess, and F. Bellosa, Resource-conscious scheduling for energy efficiency on multicore processors. In Eurosys, 2010.
[30]
PBZIP2. http://compression.ca/pbzip2/
[31]
K. K. Pusukuri, D. Vengerov, A. Fedorova, and V .Kalogeraki. FACT: a framework for adaptive contention-aware thread migrations. In CF, 2011.
[32]
K. K. Pusukuri, R. Gupta, L. N. Bhuyan. Thread Reinforcer: Dynamically Determining Number of Threads via OS Level Monitoring. In IISWC, 2011.
[33]
K. K. Pusukuri, R. Gupta, L. N. Bhuyan. No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs. In PACT, 2011.
[34]
K.K. Pusukuri and D. Johnson. Has one-thread-per-core binding model become obsolete for multithreaded programs running on multicore systems. In USENIX HotPar, 2013.
[35]
K. K. Pusukuri, R. Gupta, L. N. Bhuyan. An Effective OS Load Balancing Technique for Multicore Multiprocessor Systems. Technical Report, Sept. 2012. University of California, Riverside.
[36]
Z. Radovic and E. Hagersten. RH Lock: A Scalable Hierarchical Spin Lock. In WMPI, 2012.
[37]
H. Sasaki, T. Tanimoto, K. Inoue, and H. Nakamura. Scalability-based manycore partitioning. In PACT, 2012.
[38]
C. Severance and R. Enbody. Comparing gang scheduling with dynamic space sharing on symmetric multiprocessors using automatic self-allocating threads. In IPPS, 1997.
[39]
A. Snavely, D.M. Tullsen, G. Voelker. Symbiotic Job scheduling For A Simultaneous Multithreading Processor. In ASPLOS, 2000.
[40]
S. Sridharan, B. Keck, R. Murphy, S. Chandra, and P. Kogge. Thread migration to improve synchronization performance. In Workshop on Operating System Interference in High Performance Applications, 2006.
[41]
SPEC and the benchmark names SPEC OMP2001, SPEC jbb2005 are registered trademarks of the Standard Performance Evaluation Corporation. For more information, see www.spec.org.
[42]
P. Sweazey and A. J. Smith. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. In ISCA, 1986.
[43]
D. Tam, R. Azimi, and M. Stumm. Thread Clustering: Sharing-Aware Scheduling on SMP-CMP-SMT Multiprocessors. In Eurosys, 2007.
[44]
L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The Impact of Memory Subsystem Resource Sharing on Datacenter Applications. In ISCA, 2011.
[45]
L. Tang, J. Mars, and M. L. Soffa. Compiling For Niceness: Mitigating Contention for QOS in Warehouse Scale Computers. In CGO, 2012.
[46]
L. Tang, J. Mars, X. Zhang, R. Hagmann, R. Hundt, and E. Tune. Optimizing Google's Warehouse Scale Computers: The NUMA Experience. In HPCA, 2013.
[47]
R. Thekkath and S. J. Eggers. Impact of Sharing-Based Thread Placement on Multithreaded Architectures. In ISCA, 1994.
[48]
VMware ESX Server 2 NUMA Support. White paper. http://www.vmware.com/pdf/esx2_NUMA.pdf.
[49]
S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. In ISCA, 1995.
[50]
F. Xian, W. Srisa-an, and H. Jiang. Contention-aware scheduler: unlocking execution parallelism in multithreaded java programs. In OOPSLA, 2008.
[51]
X. Xiang, B. Bao, C. Ding, K. Shen: Cache Conscious Task Regrouping on Multicore Processors. In CCGRID, 2012.
[52]
S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing Shared Resource Contention in Multicore Processors via Scheduling. In ASPLOS, 2010.

Cited By

View all
  • (2019)SEECSSim: A toolkit for parallel and distributed simulations for mobile devicesJournal of Simulation10.1080/17477778.2019.170195815:3(235-260)Online publication date: 20-Dec-2019
  • (2019)Async-LCAMCluster Computing10.1007/s10586-018-2832-522:2(373-384)Online publication date: 1-Jun-2019
  • (2018)TSP: A Threads Scheduling Policy for Hierarchical Locks in Multiple Applications Scenario2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS)10.1109/ICSESS.2018.8663839(861-864)Online publication date: Nov-2018
  • Show More Cited By

Index Terms

  1. Shuffling: a framework for lock contention aware thread scheduling for multicore multiprocessor systems

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
August 2014
514 pages
ISBN:9781450328098
DOI:10.1145/2628071
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2014

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. last-level cache misses
  2. lock contention
  3. multicore
  4. scheduling
  5. thread migration

Qualifiers

  • Research-article

Funding Sources

Conference

PACT '14
Sponsor:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

PACT '14 Paper Acceptance Rate 54 of 144 submissions, 38%;
Overall Acceptance Rate 121 of 471 submissions, 26%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)16
  • Downloads (Last 6 weeks)3
Reflects downloads up to 15 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2019)SEECSSim: A toolkit for parallel and distributed simulations for mobile devicesJournal of Simulation10.1080/17477778.2019.170195815:3(235-260)Online publication date: 20-Dec-2019
  • (2019)Async-LCAMCluster Computing10.1007/s10586-018-2832-522:2(373-384)Online publication date: 1-Jun-2019
  • (2018)TSP: A Threads Scheduling Policy for Hierarchical Locks in Multiple Applications Scenario2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS)10.1109/ICSESS.2018.8663839(861-864)Online publication date: Nov-2018
  • (2017)Why to redesign PDES framework for smart devicesProceedings of the Summer Simulation Multi-Conference10.5555/3140065.3140085(1-11)Online publication date: 9-Jul-2017
  • (2017)tScale: A Contention-Aware Multithreaded Framework for Multicore Multiprocessor Systems2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS.2017.00052(334-343)Online publication date: Dec-2017
  • (2016)Contention-conscious, locality-preserving locksACM SIGPLAN Notices10.1145/3016078.285116651:8(1-14)Online publication date: 27-Feb-2016
  • (2016)Contention-conscious, locality-preserving locksProceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/2851141.2851166(1-14)Online publication date: 27-Feb-2016
  • (2016)Co-scheduling tasks on multi-core heterogeneous systems: An energy-aware perspectiveIET Computers & Digital Techniques10.1049/iet-cdt.2015.005310:2(77-84)Online publication date: 1-Mar-2016
  • (2015)TumblerACM Transactions on Architecture and Code Optimization10.1145/282769812:4(1-24)Online publication date: 16-Nov-2015

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media