DOI: 10.1109/CCGrid.2015.161

Triple-H: a hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture

Published: 04 May 2015

Abstract

HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Although the data locality offered by HDFS is important for Big Data applications, HDFS suffers from severe I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the storage devices available in an HPC (High Performance Computing) cluster. Moreover, the limited local storage space makes it challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that minimizes the I/O bottlenecks in HDFS and ensures efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), leads to significant storage space savings and guarantees fault tolerance. Performance evaluations show that Triple-H improves the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst [20] application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.
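To make the tiered-placement idea in the abstract concrete, below is a minimal, hypothetical sketch of a heterogeneous replica-placement policy in Java (Hadoop's implementation language). The tier names, the placeReplicas helper, and the hot/cold heuristic are illustrative assumptions for this sketch, not the paper's actual design or an HDFS API.

```java
// Hypothetical sketch of a tiered replica-placement policy in the spirit of
// Triple-H's data placement ideas. Tier names and the hot/cold heuristic are
// assumptions for illustration only, not the paper's implementation.
import java.util.ArrayList;
import java.util.List;

public class TieredPlacementSketch {

    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

    // Choose a storage tier for each replica of a block.
    // Assumed policy: the first replica of "hot" data goes to the fastest
    // local tier, middle replicas fall back to slower local tiers, and the
    // last replica may be offloaded to a parallel file system such as Lustre
    // for space savings and fault tolerance.
    static List<Tier> placeReplicas(boolean hotData, int replicationFactor,
                                    boolean lustreAvailable) {
        List<Tier> placement = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            if (i == 0) {
                placement.add(hotData ? Tier.RAM_DISK : Tier.SSD);
            } else if (i == replicationFactor - 1 && lustreAvailable) {
                placement.add(Tier.LUSTRE);
            } else {
                placement.add(Tier.HDD);
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        // Example: a hot block with the default HDFS replication factor of 3.
        System.out.println(placeReplicas(true, 3, true));   // [RAM_DISK, HDD, LUSTRE]
        System.out.println(placeReplicas(false, 3, false)); // [SSD, HDD, HDD]
    }
}
```

The sketch mirrors the abstract's theme: fast local tiers (RAM disk, SSD) absorb primary writes, while slower local storage and a parallel file system hold the remaining replicas to save space and preserve fault tolerance.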

References

[1]
G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica, "PACMan: Coordinated Memory Caching for Parallel Jobs," in 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012.
[2]
Apache Software Foundation, "Centralized Cache Management in HDFS," http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html.
[3]
CSC, "Big Data Universe Beginning to Explode," http://www.csc.com/insights/flxwd/78931-big_data_universe_beginning_to_explode.
[4]
C. Engelmann, H. Ong, and S. L. Scott, "Middleware in Modern High Performance Computing System Architectures," in 7th International Conference on Computational Science (ICCS), 2007.
[5]
Gordon at San Diego Supercomputer Center, http://www.sdsc.edu/us/resources/gordon/.
[6]
Hadoop 2.6 Storage Policies, https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html.
[7]
T. Harter, D. Borthakur, S. Dong, A. Aiyer, L. Tang, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "Analysis of HDFS Under HBase: A Facebook Messages Case Study," in 12th USENIX Conference on File and Storage Technologies (FAST), 2014.
[8]
IDC, "IDC: Hadoop Commonly Used with Other Big Data Analytics Systems," http://www.computerworlduk.com/news/applications/3476789/idc-hadoop-commonly-used-with-other-big-data-analytics-systems/.
[9]
N. S. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar, and D. K. Panda, "In-Memory I/O and Replication for HDFS with Memcached: Early Experiences," in 2014 IEEE International Conference on Big Data (IEEE BigData), 2014.
[10]
N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, "SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS," in 23rd International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2014.
[11]
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, "High Performance RDMA-based Design of HDFS over InfiniBand," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[12]
N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, "Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?" in IEEE 21st Annual Symposium on High-Performance Interconnects (HOTI), 2013.
[13]
H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks," in ACM Symposium on Cloud Computing, New York, 2014.
[14]
OSU NBC Lab, "High-Performance Big Data (HiBD)," http://hibd.cse.ohio-state.edu.
[15]
Purdue MapReduce Benchmarks Suite (PUMA), https://sites.google.com/site/farazahmad/pumabenchmarks.
[16]
K. R, A. Anwar, and A. Butt, "hatS: A Heterogeneity-Aware Tiered Storage for Hadoop," in 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2014.
[17]
K. R, S. Iqbal, and A. Butt, "VENU: Orchestrating SSDs in Hadoop Storage," in 2014 IEEE International Conference on Big Data (IEEE BigData), 2014.
[18]
K. R, A. Khasymski, A. Butt, S. Tiwari, and M. Bhandarkar, "Apt-Store: Dynamic Storage Management for Hadoop," in International Conference on Cluster Computing (CLUSTER), 2013.
[19]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, "High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA," in 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2015.
[20]
M. C. Schatz, "CloudBurst: Highly Sensitive Read Mapping with MapReduce," Bioinformatics, 2009.
[21]
K. Shvachko, "HDFS Scalability: The Limits to Growth," vol. 35, no. 2, 2010, pp. 6--16. {Online}. Available: http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
[22]
Stampede at TACC, http://www.tacc.utexas.edu/resources/hpc/stampede.
[23]
T. Sterling, E. Lusk, and W. Gropp, Beowulf Cluster Computing with Linux. MIT Press, 2003.
[24]
T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. MIT Press, 1999.
[25]
W. Tantisiriroj, S. Patil, G. Gibson, S. Son, S. Lang, and R. Ross, "On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[26]
The Apache Software Foundation, "The Apache Hadoop Project," http://hadoop.apache.org/.
[27]
M. Welsh, D. Culler, and E. Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services," in 18th ACM Symposium on Operating Systems Principles (SOSP), 2001.
[28]
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing," in 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[29]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2010.

Published In

CCGRID '15: Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing
May 2015
1277 pages
ISBN:9781479980062

Publisher

IEEE Press

Publication History

Published: 04 May 2015

Qualifiers

  • Research-article

Conference

CCGrid '15
