research-article

Dynamic Provisioning of Data Intensive Computing Middleware Frameworks: A Case Study

Authors:

Michael E. Payne,

Flavio Villanustre,

Richard Taylor,

Amy W. Apon

SCREAM '15: Proceedings of the 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models

Pages 3 - 10

https://doi.org/10.1145/2753524.2753528

Published: 16 June 2015

Abstract

Big data has become an important asset for industry, and academic disciplines now utilize large-scale data in their research. This fourth paradigm of scientific research has led to the inclusion of data management, processing, and analytic tools into the traditional high performance computing software libraries. This integration is facilitated through a collection of supporting software components that comprise a data intensive computing middleware framework. From a shared campus cyberinfrastructure perspective, this represents a new challenge to the system administrators in balancing between the traditional high performance computing software stacks and the new data-intensive middleware on the same physical computing resource. In turn, this limits researchers from having access to the new middleware tools while administrators determine how to overcome the challenge. In this paper, we present our experience in configuring dynamic provisioning of two different data-intensive middleware frameworks from a user perspective. We describe the configuration process from setting up dependencies to deploying the middleware, and how this experience can be applied by other researchers and administrators.

Cited By

C J, U A, De Hilster D, Watanuki H, G S, Shetty J (2024). Scalable Analysis of English Dictionary Files on HPCC Systems Big Data Platform. 2024 9th International Conference on Big Data Analytics (ICBDA), pages 328-333. Online publication date: 16-Mar-2024.
https://doi.org/10.1109/ICBDA61153.2024.10607199
Khan S, Ngo L, Morris E, Dey K, Zhou Y (2017). Social Media Data in Transportation. Data Analytics for Intelligent Transportation Systems, pages 263-281. Online publication date: 2017.
https://doi.org/10.1016/B978-0-12-809715-1.00011-0

Index Terms

Dynamic Provisioning of Data Intensive Computing Middleware Frameworks: A Case Study
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems

Published In

SCREAM '15: Proceedings of the 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models

June 2015

82 pages

ISBN:9781450335669

DOI:10.1145/2753524

General Chairs:
Shantenu Jha (Rutgers University, USA),
Daniel S. Katz (University of Chicago & Argonne National Laboratory, USA),
Jon Weissman (University of Minnesota, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing

June 16, 2015

Portland, Oregon, USA

Acceptance Rates

SCREAM '15 Paper Acceptance Rate 8 of 12 submissions, 67%;

Overall Acceptance Rate 8 of 12 submissions, 67%

Bibliometrics

Total Citations: 2
Total Downloads: 138
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Mar 2025
