V for vicissitude: the challenge of scaling complex big data workflows
Pages 927 - 932
Abstract
In this paper we present the scaling of BTWorld, our MapReduce-based approach to observing and analyzing the global BitTorrent network, which we have been monitoring for the past 4 years. BTWorld currently provides a comprehensive and complex set of queries implemented in Pig Latin, with data dependencies between them. These queries translate to several MapReduce jobs whose execution times and input sizes both follow heavy-tailed distributions. Processing more than 1 TB of BitTorrent data with our BTWorld workflow required an in-depth analysis of the entire software stack and the design of a complete optimization cycle. We analyze our system from both theoretical and experimental perspectives, and we show how we scaled our data processing to 15 times the size of our previous results.
Publisher
IEEE Press
Publication History
Published: 26 May 2014
Qualifiers
- Research-article
Conference
CCGrid '14: 2014 IEEE International Symposium on Cluster Computing and the Grid
May 26 - 29, 2014
Chicago, Illinois