DOI: 10.1145/2753524.2753528
research-article

Dynamic Provisioning of Data Intensive Computing Middleware Frameworks: A Case Study

Published: 16 June 2015

Abstract

Big data has become an important asset for industry, and academic disciplines now use large-scale data in their research. This fourth paradigm of scientific research has led to the inclusion of data management, processing, and analytic tools in the traditional high performance computing software libraries. This integration is facilitated by a collection of supporting software components that together form a data-intensive computing middleware framework. From a shared campus cyberinfrastructure perspective, this presents a new challenge to system administrators, who must balance the traditional high performance computing software stacks against the new data-intensive middleware on the same physical computing resource. In turn, this limits researchers' access to the new middleware tools while administrators determine how to overcome the challenge. In this paper, we present our experience in configuring dynamic provisioning of two different data-intensive middleware frameworks from a user perspective. We describe the configuration process, from setting up dependencies to deploying the middleware, and how this experience can be applied by other researchers and administrators.

Cited By

  • (2024) Scalable Analysis of English Dictionary Files on HPCC Systems Big Data Platform. 2024 9th International Conference on Big Data Analytics (ICBDA), pages 328-333. DOI: 10.1109/ICBDA61153.2024.10607199. Online publication date: 16-Mar-2024
  • (2017) Social Media Data in Transportation. Data Analytics for Intelligent Transportation Systems, pages 263-281. DOI: 10.1016/B978-0-12-809715-1.00011-0. Online publication date: 2017

Published In

SCREAM '15: Proceedings of the 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models
June 2015
82 pages
ISBN:9781450335669
DOI:10.1145/2753524

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic provisioning
  2. hadoop
  3. hpcc systems
  4. shared computing resources

Qualifiers

  • Research-article

Conference

HPDC'15
Acceptance Rates

SCREAM '15 Paper Acceptance Rate 8 of 12 submissions, 67%;
Overall Acceptance Rate 8 of 12 submissions, 67%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0

Reflects downloads up to 01 Mar 2025
