More Web Proxy on the site http://driver.im/

research-article

Friendly barriers: efficient work-stealing with return barriers

Authors:

Stephen M. Blackburn,

David GroveAuthors Info & Claims

VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

Pages 165 - 176

https://doi.org/10.1145/2576195.2576207

Published: 01 March 2014 Publication History

Abstract

This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%.

In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim's stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.

References

[1]

U. A. Acar, A. Chargueraud, and M. Rainey. Scheduling parallel programs by work stealing with private deques. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 219--228, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1922-5. 10.1145/2442516.2442538.

Digital Library

[2]

B. Alpern, C. Attanasio, J. Barton, M. Burke, P. Cheng, J. Choi, A. Cocchi, S. Fink, D. Grove, M. Hind, et al. The Jalape no virtual machine. IBM Systems Journal, 39 (1): 211--238, 2010. ISSN 0018-8670. 10.1147/sj.391.0211.

Digital Library

[3]

M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalape no JVM. In Proceedings of the 15th ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '00, pages 47--65, New York, NY, USA, 2000. ACM. ISBN 1-58113-200-X. 10.1145/353171.353175.

Digital Library

[4]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 119--129, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277678.

Digital Library

[5]

P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32 (5): 1260--1279, May 2003. ISSN 0097-5397. 10.1137/S0097539701399551.

Digital Library

[6]

R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46 (5): 720--748, Sept. 1999. ISSN 0004-5411. 10.1145/324133.324234.

Digital Library

[7]

Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0935-6. 10.1145/2093157.2093165.

Digital Library

[8]

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1-59593-031-0. 10.1145/1094811.1094852.

Digital Library

[9]

G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3374-2. 10.1109/ICPP.2008.88.

Digital Library

[10]

J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-744-8. 10.1145/1654059.1654113.

Digital Library

[11]

S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1913-X.

Digital Library

[12]

M. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.

[13]

Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, pages 1--12, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3751-1. 10.1109/IPDPS.2009.5161079.

Digital Library

[14]

Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 341--342, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-877-3. 10.1145/1693453.1693504.

Digital Library

[15]

D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the Twenty-first Annual Symposium on Principles of Distributed Computing, PODC '02, pages 280--289, New York, NY, USA, 2002. ACM. ISBN 1-58113-485-1. 10.1145/571825.571876.

Digital Library

[16]

R. Hoffmann and T. Rauber. Adaptive task pools: efficiently balancing large number of tasks on shared-address spaces. International Journal of Parallel Programming, 39 (5): 553--581, 2011.

[17]

U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0-89791-475-9. 10.1145/143095.143114.

Digital Library

[18]

Intel Corporation. Using the RDTSC instruction for performance monitoring, 1997. URL http://www.intel.com.au/content/dam/www/public/us/en/documents/white-pa%pers/ia-32-ia-64-benchmark-code-execution-paper.pdf.

[19]

G. Kliot, E. Petrank, and B. Steensgaard. A lock-free, concurrent, and incremental stack scanning for garbage collectors. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, pages 11--20, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-375-4. 10.1145/1508293.1508296.

Digital Library

[20]

M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 65--76, 2009. 10.1109/ISPASS.2009.4919639.

[21]

V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu. Work-stealing without the baggage. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 297--314, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1561-6. 10.1145/2384616.2384639.

Digital Library

[22]

D. Lea. A Java Fork/Join framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1-58113-288-3. 10.1145/337449.337465.

Digital Library

[23]

R. Lublinerman, J. Zhao, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Delegated isolation. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 885--902, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0940-0. 10.1145/2048066.2048133.

Digital Library

[24]

R. Lüling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '93, pages 164--172, New York, NY, USA, 1993. ACM. ISBN 0-89791-599-2. 10.1145/165231.165252.

Digital Library

[25]

M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 212--221, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277687.

Digital Library

[26]

E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming, LFP '90, pages 185--197, New York, NY, USA, 1990. ACM. ISBN 0-89791-368-X. 10.1145/91556.91631.

Digital Library

[27]

S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72520-6. URL http://dl.acm.org/citation.cfm?id=1757112.1757137.

Digital Library

[28]

J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2010.

Digital Library

[29]

L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '91, pages 237--245, New York, NY, USA, 1991. ACM. ISBN 0-89791-438-4. 10.1145/113379.113401.

Digital Library

[30]

H. Saiki, Y. Konaka, T. Komiya, M. Yasugi, and T. Yuasa. Real-time GC in JeRTy#8482; VM using the return-barrier method. In Proceedings of the Eighth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC '05, pages 140--148, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2356-0. 10.1109/ISORC.2005.45.

Digital Library

[31]

D. Sanchez, R. M. Yoo, and C. Kozyrakis. Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 311--322, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-839-1. 10.1145/1736020.1736055.

Digital Library

[32]

V. Sundaresan, D. Maier, P. Ramarao, and M. Stoodley. Experiences with multi-threading and dynamic class loading in a Java Just-In-Time compiler. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, pages 87--97, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2499-0. 10.1109/CGO.2006.16.

Digital Library

[33]

S. Umatani, M. Yasugi, T. Komiya, and T. Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In A. Veidenbaum, K. Joe, H. Amano, and H. Aiso, editors, High Performance Computing, volume 2858 of Lecture Notes in Computer Science, pages 174--188. Springer Berlin Heidelberg, 2003. ISBN 978-3-540-20359-9. 10.1007/978-3-540-39707-6_13.

[34]

E. Westbrook, J. Zhao, Z. Budimlić, and V. Sarkar. Practical permissions for race-free parallelism. In Proceedings of the 26th European Conference on Object-Oriented Programming, ECOOP'12, pages 614--639, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-31056-0. 10.1007/978-3-642-31057-7_27.

Digital Library

[35]

T. Yuasa. Real-time garbage collection on general-purpose machines. J. Syst. Softw., 11 (3): 181--198, Mar. 1990. ISSN 0164-1212. 10.1016/0164-1212(90)90084-Y.

Digital Library

[36]

T. Yuasa, Y. Nakagawa, T. Komiyay, and M. Yasugiy. Return barrier. In Proceedings of the International Lisp Conference, 2002.

[37]

J. Zhao, R. Lublinerman, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Isolation for nested task parallelism. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '13, pages 571--588, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1. 10.1145/2509136.2509534.

Digital Library

Cited By

Peters AThywissen JRossbach CHosking AFinocchi I(2019)PorcE: a deparallelizing compilerProceedings of the 16th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3357390.3361023(117-130)Online publication date: 21-Oct-2019
https://dl.acm.org/doi/10.1145/3357390.3361023
Kumar V(2019)Featherlight Speculative Task ParallelismEuro-Par 2019: Parallel Processing10.1007/978-3-030-29400-7_28(391-404)Online publication date: 26-Aug-2019
https://dl.acm.org/doi/10.1007/978-3-030-29400-7_28
Kumar VDolby JBlackburn SZheng YBinder WTůma P(2016)Integrating Asynchronous Task Parallelism and Data-centric AtomicityProceedings of the 13th International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools10.1145/2972206.2972214(1-10)Online publication date: 29-Aug-2016
https://dl.acm.org/doi/10.1145/2972206.2972214
Show More Cited By

Index Terms

Friendly barriers: efficient work-stealing with return barriers
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Work-stealing without the baggage
OOPSLA '12: Proceedings of the ACM international conference on Object oriented programming systems languages and applications

Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle ...
Friendly barriers: efficient work-stealing with return barriers
VEE '14

This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential ...
Work-stealing without the baggage
OOPSLA '12

Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

March 2014

236 pages

ISBN:9781450327640

DOI:10.1145/2576195

General Chair:
Martin Hirzel
IBM Research
,
Program Chairs:
Erez Petrank
Technion
,
Dan Tsafrir
Technion

ACM SIGPLAN Notices Volume 49, Issue 7
VEE '14
July 2014
222 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2674025
Editors:
Mark W. Bailey
Hamilton College, Clinton, NY
,
Rajeev Balasubramonian
University of Utah
,
Al Davis
University of Utah
,
Sarita Adve
University of Illinois at Urbana-Champ
Issue’s Table of Contents

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

VEE '14

Sponsor:

VEE '14: 10th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments

March 1 - 2, 2014

Utah, Salt Lake City, USA

Acceptance Rates

VEE '14 Paper Acceptance Rate 18 of 56 submissions, 32%;

Overall Acceptance Rate 80 of 235 submissions, 34%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
260
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 10 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Peters AThywissen JRossbach CHosking AFinocchi I(2019)PorcE: a deparallelizing compilerProceedings of the 16th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3357390.3361023(117-130)Online publication date: 21-Oct-2019
https://dl.acm.org/doi/10.1145/3357390.3361023
Kumar V(2019)Featherlight Speculative Task ParallelismEuro-Par 2019: Parallel Processing10.1007/978-3-030-29400-7_28(391-404)Online publication date: 26-Aug-2019
https://dl.acm.org/doi/10.1007/978-3-030-29400-7_28
Kumar VDolby JBlackburn SZheng YBinder WTůma P(2016)Integrating Asynchronous Task Parallelism and Data-centric AtomicityProceedings of the 13th International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools10.1145/2972206.2972214(1-10)Online publication date: 29-Aug-2016
https://dl.acm.org/doi/10.1145/2972206.2972214
Ribic HLiu Y(2016)AEQUITASProceedings of the 2016 International Conference on Supercomputing10.1145/2925426.2926260(1-12)Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1145/2925426.2926260

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents