[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2576195.2576207acmconferencesArticle/Chapter ViewAbstractPublication PagesveeConference Proceedingsconference-collections
research-article

Friendly barriers: efficient work-stealing with return barriers

Published: 01 March 2014 Publication History

Abstract

This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%.
In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim's stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.

References

[1]
U. A. Acar, A. Chargueraud, and M. Rainey. Scheduling parallel programs by work stealing with private deques. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 219--228, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1922-5. 10.1145/2442516.2442538.
[2]
B. Alpern, C. Attanasio, J. Barton, M. Burke, P. Cheng, J. Choi, A. Cocchi, S. Fink, D. Grove, M. Hind, et al. The Jalape no virtual machine. IBM Systems Journal, 39 (1): 211--238, 2010. ISSN 0018-8670. 10.1147/sj.391.0211.
[3]
M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalape no JVM. In Proceedings of the 15th ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '00, pages 47--65, New York, NY, USA, 2000. ACM. ISBN 1-58113-200-X. 10.1145/353171.353175.
[4]
N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 119--129, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277678.
[5]
P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32 (5): 1260--1279, May 2003. ISSN 0097-5397. 10.1137/S0097539701399551.
[6]
R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46 (5): 720--748, Sept. 1999. ISSN 0004-5411. 10.1145/324133.324234.
[7]
Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0935-6. 10.1145/2093157.2093165.
[8]
P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1-59593-031-0. 10.1145/1094811.1094852.
[9]
G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3374-2. 10.1109/ICPP.2008.88.
[10]
J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-744-8. 10.1145/1654059.1654113.
[11]
S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1913-X.
[12]
M. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.
[13]
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, pages 1--12, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3751-1. 10.1109/IPDPS.2009.5161079.
[14]
Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 341--342, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-877-3. 10.1145/1693453.1693504.
[15]
D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the Twenty-first Annual Symposium on Principles of Distributed Computing, PODC '02, pages 280--289, New York, NY, USA, 2002. ACM. ISBN 1-58113-485-1. 10.1145/571825.571876.
[16]
R. Hoffmann and T. Rauber. Adaptive task pools: efficiently balancing large number of tasks on shared-address spaces. International Journal of Parallel Programming, 39 (5): 553--581, 2011.
[17]
U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0-89791-475-9. 10.1145/143095.143114.
[18]
Intel Corporation. Using the RDTSC instruction for performance monitoring, 1997. URL http://www.intel.com.au/content/dam/www/public/us/en/documents/white-pa%pers/ia-32-ia-64-benchmark-code-execution-paper.pdf.
[19]
G. Kliot, E. Petrank, and B. Steensgaard. A lock-free, concurrent, and incremental stack scanning for garbage collectors. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, pages 11--20, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-375-4. 10.1145/1508293.1508296.
[20]
M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 65--76, 2009. 10.1109/ISPASS.2009.4919639.
[21]
V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu. Work-stealing without the baggage. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 297--314, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1561-6. 10.1145/2384616.2384639.
[22]
D. Lea. A Java Fork/Join framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1-58113-288-3. 10.1145/337449.337465.
[23]
R. Lublinerman, J. Zhao, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Delegated isolation. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 885--902, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0940-0. 10.1145/2048066.2048133.
[24]
R. Lüling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '93, pages 164--172, New York, NY, USA, 1993. ACM. ISBN 0-89791-599-2. 10.1145/165231.165252.
[25]
M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 212--221, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277687.
[26]
E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming, LFP '90, pages 185--197, New York, NY, USA, 1990. ACM. ISBN 0-89791-368-X. 10.1145/91556.91631.
[27]
S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72520-6. URL http://dl.acm.org/citation.cfm?id=1757112.1757137.
[28]
J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2010.
[29]
L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '91, pages 237--245, New York, NY, USA, 1991. ACM. ISBN 0-89791-438-4. 10.1145/113379.113401.
[30]
H. Saiki, Y. Konaka, T. Komiya, M. Yasugi, and T. Yuasa. Real-time GC in JeRTy#8482; VM using the return-barrier method. In Proceedings of the Eighth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC '05, pages 140--148, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2356-0. 10.1109/ISORC.2005.45.
[31]
D. Sanchez, R. M. Yoo, and C. Kozyrakis. Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 311--322, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-839-1. 10.1145/1736020.1736055.
[32]
V. Sundaresan, D. Maier, P. Ramarao, and M. Stoodley. Experiences with multi-threading and dynamic class loading in a Java Just-In-Time compiler. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, pages 87--97, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2499-0. 10.1109/CGO.2006.16.
[33]
S. Umatani, M. Yasugi, T. Komiya, and T. Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In A. Veidenbaum, K. Joe, H. Amano, and H. Aiso, editors, High Performance Computing, volume 2858 of Lecture Notes in Computer Science, pages 174--188. Springer Berlin Heidelberg, 2003. ISBN 978-3-540-20359-9. 10.1007/978-3-540-39707-6_13.
[34]
E. Westbrook, J. Zhao, Z. Budimlić, and V. Sarkar. Practical permissions for race-free parallelism. In Proceedings of the 26th European Conference on Object-Oriented Programming, ECOOP'12, pages 614--639, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-31056-0. 10.1007/978-3-642-31057-7_27.
[35]
T. Yuasa. Real-time garbage collection on general-purpose machines. J. Syst. Softw., 11 (3): 181--198, Mar. 1990. ISSN 0164-1212. 10.1016/0164-1212(90)90084-Y.
[36]
T. Yuasa, Y. Nakagawa, T. Komiyay, and M. Yasugiy. Return barrier. In Proceedings of the International Lisp Conference, 2002.
[37]
J. Zhao, R. Lublinerman, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Isolation for nested task parallelism. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '13, pages 571--588, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1. 10.1145/2509136.2509534.

Cited By

View all
  • (2019)PorcE: a deparallelizing compilerProceedings of the 16th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3357390.3361023(117-130)Online publication date: 21-Oct-2019
  • (2019)Featherlight Speculative Task ParallelismEuro-Par 2019: Parallel Processing10.1007/978-3-030-29400-7_28(391-404)Online publication date: 26-Aug-2019
  • (2016)Integrating Asynchronous Task Parallelism and Data-centric AtomicityProceedings of the 13th International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools10.1145/2972206.2972214(1-10)Online publication date: 29-Aug-2016
  • Show More Cited By

Index Terms

  1. Friendly barriers: efficient work-stealing with return barriers

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
    March 2014
    236 pages
    ISBN:9781450327640
    DOI:10.1145/2576195
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 March 2014

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. managed languages
    2. scheduling
    3. task parallelism
    4. work-stealing
    5. x10

    Qualifiers

    • Research-article

    Conference

    VEE '14

    Acceptance Rates

    VEE '14 Paper Acceptance Rate 18 of 56 submissions, 32%;
    Overall Acceptance Rate 80 of 235 submissions, 34%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)2
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 10 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2019)PorcE: a deparallelizing compilerProceedings of the 16th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3357390.3361023(117-130)Online publication date: 21-Oct-2019
    • (2019)Featherlight Speculative Task ParallelismEuro-Par 2019: Parallel Processing10.1007/978-3-030-29400-7_28(391-404)Online publication date: 26-Aug-2019
    • (2016)Integrating Asynchronous Task Parallelism and Data-centric AtomicityProceedings of the 13th International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools10.1145/2972206.2972214(1-10)Online publication date: 29-Aug-2016
    • (2016)AEQUITASProceedings of the 2016 International Conference on Supercomputing10.1145/2925426.2926260(1-12)Online publication date: 1-Jun-2016

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media