Research article
DOI: 10.1145/2676870.2676879

HabaneroUPC++: a Compiler-free PGAS Library

Published: 06 October 2014

Abstract

Partitioned Global Address Space (PGAS) programming models combine shared- and distributed-memory features, providing a basis for high-performance, high-productivity parallel programming environments. UPC++ [39] is a very recent PGAS implementation that takes a library-based approach and avoids the complexities associated with compiler transformations. However, this implementation does not support dynamic task parallelism; it relies only on other threading models (e.g., OpenMP or pthreads) to exploit parallelism within a PGAS place.
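The "partitioned" part of PGAS can be made concrete with a toy C++ sketch that models a global address space inside a single process. The names `GlobalPtr` and `PartitionedArray`, the block distribution, and the `is_local` affinity check are illustrative assumptions, not part of the UPC++ or HabaneroUPC++ APIs. A real PGAS runtime would map each partition to a separate process and turn non-local accesses into one-sided communication (e.g., over GASNet).

```cpp
#include <cstddef>
#include <vector>

// Toy, single-process model of a partitioned global address space
// (hypothetical names; not the UPC++ API). Each "place" owns one
// partition of a logically global array.
struct GlobalPtr {
  int place;          // owning place (rank)
  std::size_t index;  // offset within that place's partition
};

class PartitionedArray {
 public:
  PartitionedArray(int places, std::size_t per_place)
      : per_place_(per_place),
        parts_(static_cast<std::size_t>(places),
               std::vector<double>(per_place, 0.0)) {}

  // Block distribution: global element i lives on place i / per_place.
  GlobalPtr locate(std::size_t i) const {
    return GlobalPtr{static_cast<int>(i / per_place_), i % per_place_};
  }

  // Affinity check: a PGAS runtime would use a test like this to choose
  // between a plain local load/store and a remote (one-sided) access.
  bool is_local(std::size_t i, int my_place) const {
    return locate(i).place == my_place;
  }

  // In this toy all partitions share one process, so a "remote" access
  // is just indexing into another place's vector.
  double& at(std::size_t i) {
    GlobalPtr p = locate(i);
    return parts_[static_cast<std::size_t>(p.place)][p.index];
  }

 private:
  std::size_t per_place_;
  std::vector<std::vector<double>> parts_;
};
```

For instance, with 4 places of 10 elements each, `locate(25)` yields place 2, offset 5, so only place 2 may touch that element with a local access.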
In this paper, we introduce a compiler-free PGAS library called HabaneroUPC++, which supports a tighter integration of intra-place and inter-place parallelism than standard hybrid programming approaches. The library makes heavy use of C++11 lambda functions in its APIs. C++11 lambdas avoid the need for compiler support while still retaining the syntactic convenience of language-based approaches. The HabaneroUPC++ library implementation is based on a tight integration of the UPC++ library and the Habanero-C++ library, with new extensions to support the integration. The UPC++ library is used to provide PGAS communication and function shipping support using GASNet, and the Habanero-C++ library is used to provide support for intra-place work-stealing integrated with function shipping. We demonstrate the programmability and performance of our implementation using two benchmarks, scaled up to 6K cores. The insights developed in this paper promise to further enhance the usability and popularity of PGAS programming models.
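The claim that C++11 lambdas can replace compiler-supported task constructs can be illustrated with a minimal, hypothetical `finish`/`async` pair written in plain C++11. This is not the HabaneroUPC++ API, and it omits the runtime features described above: each `async` here simply gets its own `std::thread`, whereas the library schedules tasks with intra-place work-stealing and can ship functions to remote places via UPC++/GASNet. `FinishScope`, `finish`, and `parallel_sum` are names introduced only for this sketch.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical finish scope (not the HabaneroUPC++ API): collects every task
// spawned with async() so that wait() can block until all of them complete.
class FinishScope {
 public:
  ~FinishScope() { wait(); }  // never destroy joinable threads

  // Spawn `body` as a new task. Here each task is a dedicated std::thread;
  // a work-stealing runtime would push it onto a worker's deque instead.
  void async(std::function<void()> body) {
    std::lock_guard<std::mutex> lock(mu_);
    threads_.emplace_back(std::move(body));
  }

  // Join tasks until none remain. A task holding a reference to this scope
  // may call async() before it returns; the loop picks up such new threads.
  void wait() {
    for (;;) {
      std::thread t;
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (threads_.empty()) return;
        t = std::move(threads_.back());
        threads_.pop_back();
      }
      t.join();
    }
  }

 private:
  std::mutex mu_;
  std::vector<std::thread> threads_;
};

// finish(body): run body, then block until every task it spawned has completed.
inline void finish(const std::function<void(FinishScope&)>& body) {
  FinishScope scope;
  body(scope);
  scope.wait();
}

// Example client code: sum 0..n-1 in `chunks` parallel tasks whose bodies
// are ordinary lambdas, with no compiler support beyond standard C++11.
inline long parallel_sum(long n, int chunks) {
  std::vector<long> partial(static_cast<std::size_t>(chunks), 0);
  finish([&](FinishScope& s) {
    for (int c = 0; c < chunks; ++c) {
      s.async([&partial, c, n, chunks] {
        long lo = n * c / chunks;
        long hi = n * (c + 1) / chunks;
        long acc = 0;
        for (long k = lo; k < hi; ++k) acc += k;
        partial[static_cast<std::size_t>(c)] = acc;
      });
    }
  });
  long total = 0;
  for (long v : partial) total += v;
  return total;
}
```

Compiled with `-std=c++11 -pthread`, `parallel_sum(1000, 4)` returns 499500. Spawning one OS thread per task is deliberately naive; keeping the same lambda-based interface while replacing the thread-per-task body with per-worker deques and stealing is what makes such an API practical for fine-grained tasks.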

References

[1] K. Bergman et al. Exascale computing study: Technology challenges in achieving exascale systems. DARPA IPTO, Tech. Rep, 15, 2008.
[2] B. Carlson, T. El-Ghazawi, B. Numrich, and K. Yelick. Programming in the partitioned global address space model. Tutorial at Supercomputing, 2003.
[3] V. Cavé. HClib: a library implementation of the Habanero-C language. http://habanero-rice.github.io/hclib/, 2013.
[4] V. Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In PPPJ, pages 51--61, 2011.
[5] B. Chamberlain, D. Callahan, and H. Zima. Parallel programmability and the Chapel language. Int. J. High Perform. Comput. Appl., 21(3):291--312, Aug. 2007.
[6] P. Charles et al. X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA, pages 519--538, 2005.
[7] S. Chatterjee et al. Integrating asynchronous task parallelism with MPI. In IPDPS, pages 712--725, 2013.
[8] G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In ICPP, pages 536--545, 2008.
[9] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In SC, pages 53:1--53:11, 2009.
[10] T. El-Ghazawi and L. Smith. UPC: unified parallel C. In SC, 2006.
[11] B. B. Fraguela, J. Guo, G. Bikshandi, M. J. Garzarán, G. Almási, J. Moreira, and D. Padua. The hierarchically tiled arrays programming approach. In LCR, pages 1--12, 2004.
[12] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212--223, 1998.
[13] M. Garland, M. Kudlur, and Y. Zheng. Designing a unified programming model for heterogeneous machines. In SC, 2012.
[14] M. Grossman, A. S. Sbîrlea, Z. Budimlić, and V. Sarkar. CnC-CUDA: Declarative programming for GPUs. In LCPC, pages 230--245, 2011.
[15] S. Imam and V. Sarkar. Habanero-Java library: A Java 8 framework for multicore programming. In PPPJ, pages 75--86, 2014.
[16] J. Järvi and J. Freeman. C++ lambda expressions and closures. Science of Computer Programming, 75(9):762--772, 2010.
[17] V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu. Work-stealing without the baggage. In OOPSLA, pages 297--314, 2012.
[18] D. Lea. A Java Fork/Join framework. In JAVA, pages 36--43, 2000.
[19] D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. In OOPSLA, pages 227--242, 2009.
[20] B. Meister, N. Vasilache, D. Wohlford, M. Baskaran, A. Leung, and R. Lethin. R-Stream Compiler. Springer US, 2011.
[21] S. Min, C. Iancu, and K. Yelick. Hierarchical work stealing on manycore clusters. In PGAS, 2011.
[22] E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. In LFP, pages 185--197, 1990.
[23] R. W. Numrich and J. Reid. Co-array Fortran for parallel programming. SIGPLAN Fortran Forum, 17(2):1--31, Aug. 1998.
[24] Compute unified device architecture programming guide. NVIDIA, 2007.
[25] Open Community Runtime. https://01.org/open-community-runtime/, Intel Open Source Technology Center, 2014.
[26] I. Patel and J. Gilbert. An empirical study of the performance and productivity of two parallel programming models. In IPDPS, pages 1--7, April 2008.
[27] J. Reinders. Intel Threading Building Blocks. O'Reilly & Associates, Inc., first edition, 2007.
[28] Habanero-C Overview. https://wiki.rice.edu/confluence/display/HABANERO/Habanero-C, Rice University, 2013.
[29] J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer. Phasers: A unified deadlock-free construct for collective and point-to-point synchronization. In ICS, pages 277--288, 2008.
[30] J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer. Phaser accumulators: A new reduction construct for dynamic parallelism. In IPDPS, pages 1--12, 2009.
[31] O. Tardieu et al. X10 and APGAS at petascale. In PPoPP, pages 53--66, 2014.
[32] O. Tardieu, H. Wang, and H. Lin. A work-stealing scheduler for X10's task parallelism with suspension. In PPoPP, pages 267--276, 2012.
[33] S. Tasirlar and V. Sarkar. Data-driven tasks and their implementation. In ICPP, pages 652--661, 2011.
[34] K. Taura, K. Tabata, and A. Yonezawa. StackThreads/MP: Integrating futures into calling standards. In PPoPP, pages 60--71, 1999.
[35] Y. Yan, J. Zhao, Y. Guo, and V. Sarkar. Hierarchical place trees: A portable abstraction for task parallelism and data movement. In LCPC, pages 172--187, 2010.
[36] C. Yang, K. Murthy, and J. Mellor-Crummey. Managing asynchronous operations in Coarray Fortran 2.0. In IPDPS, pages 1321--1332, May 2013.
[37] K. Yelick et al. Titanium: A high-performance Java dialect. In ACM, pages 10--11, 1998.
[38] W. Zhang, O. Tardieu, D. Grove, B. Herta, T. Kamada, V. Saraswat, and M. Takeuchi. GLB: Lifeline-based global load balancing library in X10. In PPAA, pages 31--40, 2014.
[39] Y. Zheng, A. Kamil, M. B. Driscoll, H. Shan, and K. Yelick. UPC++: a PGAS extension for C++. In IPDPS, 2014.

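Cited By

  • (2024) Energy efficient permanence-based community detection algorithm. Concurrency and Computation: Practice and Experience 36(28). DOI: 10.1002/cpe.8297. Online publication date: 13-Oct-2024.
  • (2023) Memory Transfer Decomposition: Exploring Smart Data Movement Through Architecture-Aware Strategies. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1958-1967. DOI: 10.1145/3624062.3624609. Online publication date: 12-Nov-2023.
  • (2023) Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3581784.3607049. Online publication date: 12-Nov-2023.
  • (2021) Clamor. Proceedings of the ACM Symposium on Cloud Computing, pp. 654-669. DOI: 10.1145/3472883.3486996. Online publication date: 1-Nov-2021.
  • (2021) Cuttlefish. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3458817.3476163. Online publication date: 14-Nov-2021.
  • (2021) Operators for Data Redistribution: Applications to the STL Library and RayTracing Algorithm. IEEE Access 9, pp. 38557-38570. DOI: 10.1109/ACCESS.2021.3063628. Online publication date: 2021.
  • (2021) Enhancing Load-Balancing of MPI Applications with Workshare. Euro-Par 2021: Parallel Processing, pp. 466-481. DOI: 10.1007/978-3-030-85665-6_29. Online publication date: 25-Aug-2021.
  • (2021) A software cache autotuning strategy for dataflow computing with UPC++ DepSpawn. Computational and Mathematical Methods. DOI: 10.1002/cmm4.1148. Online publication date: 22-Feb-2021.
  • (2020) DASH: Distributed Data Structures and Parallel Algorithms in a Global Address Space. Software for Exascale Computing - SPPEXA 2016-2019, pp. 103-142. DOI: 10.1007/978-3-030-47956-5_6. Online publication date: 31-Jul-2020.
  • (2019) Failure Recovery in Resilient X10. ACM Transactions on Programming Languages and Systems 41(3), pp. 1-30. DOI: 10.1145/3332372. Online publication date: 2-Jul-2019.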

Published In

ACM Other conferences
PGAS '14: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models
October 2014
199 pages
ISBN:9781450332477
DOI:10.1145/2676870

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Habanero
  2. PGAS
  3. UPC++
  4. scheduling
  5. work-stealing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PGAS '14

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1
Reflects downloads up to 12 Dec 2024.