DOI: 10.1145/1851476.1851529
Research article

Pwrake: a parallel and distributed flexible workflow management tool for wide-area data intensive computing

Published: 21 June 2010

Abstract

This paper proposes Pwrake, a parallel and distributed flexible workflow management tool based on Rake, a domain-specific language for building applications in the Ruby programming language. Rake is a build tool similar to make and ant. It uses a Rakefile, which is equivalent to a Makefile in make but is written in Ruby. Owing to its flexible and extensible language features, Rake can serve as a powerful workflow management language. Pwrake extends Rake to manage distributed and parallel workflow execution, including remote job submission and management of parallel executions. This paper discusses the design and implementation of Pwrake and demonstrates the expressive power of its language and the extensibility of the system, using a practical data-intensive e-Science workflow for astronomical data analysis on the Gfarm file system as a case study. When the scheduling algorithm is extended to be aware of file locations, a 20% speedup is observed using 8 nodes (32 cores) in a PC cluster. Using two PC clusters located in different institutions, the file-location-aware scheduling shows scalable speedup. The extensible Pwrake is a promising workflow management tool even for wide-area data analysis.
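
The abstract says that scheduling is extended to be aware of file locations but does not describe the algorithm. The following is only a minimal sketch of the general idea, under the assumption that each task reads one input file and that a map from file names to the nodes holding a replica is available. All names here (`Task`, `assign_locality_aware`, the node names and the `.fits` file names) are hypothetical and are not taken from Pwrake.

```ruby
# Hypothetical sketch of file-location-aware task assignment.
# This is NOT Pwrake's scheduler; it only illustrates the idea of
# preferring a node that already stores a task's input file.

Task = Struct.new(:name, :input)

# tasks:     Array of Task
# locations: Hash mapping file name => Array of node names holding a replica
# nodes:     Array of all node names
# Returns a Hash mapping node name => Array of assigned task names.
def assign_locality_aware(tasks, locations, nodes)
  queues = nodes.map { |n| [n, []] }.to_h
  tasks.each do |task|
    # Nodes that store the input locally (unknown nodes are ignored).
    local = locations.fetch(task.input, []) & nodes
    # Fall back to every node when no replica location is known.
    candidates = local.empty? ? nodes : local
    # Among the candidates, pick the node with the shortest queue so far.
    target = candidates.min_by { |n| queues[n].size }
    queues[target] << task.name
  end
  queues
end

tasks = [
  Task.new("project_a", "a.fits"),
  Task.new("project_b", "b.fits"),
  Task.new("project_c", "c.fits"),
  Task.new("project_d", "d.fits")
]
locations = {
  "a.fits" => ["node1"],
  "b.fits" => ["node2"],
  "c.fits" => ["node2"]
  # d.fits has no known replica, so any node may take it.
}
plan = assign_locality_aware(tasks, locations, ["node1", "node2"])
```

In this sketch, tasks whose input has a known location go to a node holding that file, and the remaining task goes to the least-loaded node. A real scheduler would also have to account for multiple inputs per task, cores per node, and tasks that become ready only after their prerequisites finish.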

      Information & Contributors

      Information

      Published In

      HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
      June 2010
      911 pages
      ISBN:9781605589428
      DOI:10.1145/1851476
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. file system
      2. performance evaluation
      3. workflow

      Qualifiers

      • Research-article

      Conference

      HPDC '10
      Acceptance Rates

      Overall Acceptance Rate 166 of 966 submissions, 17%

      Citations

      Cited By

      • (2021) Sustainable data analysis with Snakemake. F1000Research 10(33). DOI: 10.12688/f1000research.29032.2. Online publication date: 19-Apr-2021
      • (2021) Sustainable data analysis with Snakemake. F1000Research 10(33). DOI: 10.12688/f1000research.29032.1. Online publication date: 18-Jan-2021
      • (2019) Data Jockey: Automatic Data Management for HPC Multi-tiered Storage Systems. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 511-522. DOI: 10.1109/IPDPS.2019.00061. Online publication date: May-2019
      • (2019) GHOSTZ PW/GF: Distributed Parallel Homology Search System for Large-scale Metagenomic Analysis. 2019 IEEE International Conference on Big Data (Big Data), pp. 3692-3700. DOI: 10.1109/BigData47090.2019.9006499. Online publication date: Dec-2019
      • (2018) Applying Pwrake Workflow System and Gfarm File System to Telescope Data Processing. 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 124-133. DOI: 10.1109/CLUSTER.2018.00024. Online publication date: Sep-2018
      • (2017) MaDaTS. Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp. 41-52. DOI: 10.1145/3078597.3078611. Online publication date: 26-Jun-2017
      • (2017) A web-based real-time and full-resolution data visualization for Himawari-8 satellite sensed images. Earth Science Informatics 11:2, pp. 217-237. DOI: 10.1007/s12145-017-0316-4. Online publication date: 21-Sep-2017
      • (2016) Design of fault tolerant pwrake workflow system supported by gfarm file system. Proceedings of the 9th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers, pp. 7-12. DOI: 10.5555/3019078.3019080. Online publication date: 13-Nov-2016
      • (2016) Implementation of a deduplication cache mechanism using content-defined chunking. International Journal of High Performance Computing and Networking 9:3, pp. 190-205. DOI: 10.1504/IJHPCN.2016.076251. Online publication date: 1-Apr-2016
