DOI: 10.1145/1851476.1851540
research-article

Reshaping text data for efficient processing on Amazon EC2

Published: 21 June 2010

Abstract

Text analysis tools are now required to process increasingly large corpora, which are often organized as many small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go computing environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user's perspective, aiming for a scheduling strategy that is both timely and cost-effective. To construct an execution plan, we rely on the empirical performance of the application of interest on smaller subsets of the data. The first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files so that they match the desired file size as closely as possible. Because the output is then less fragmented, this also speeds up retrieval of the application's results. Using predictions of the application's performance based on measurements on small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
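The reshaping step mentioned in the abstract can be illustrated with a minimal first-fit sketch. This is not the authors' implementation: the function name `first_fit_merge`, the `(name, size)` input format, and the choice to process files in input order are all assumptions made for illustration. Each file is placed into the first group that still has room under the target size, and a new group is opened when none fits.

```python
from typing import List, Tuple


def first_fit_merge(file_sizes: List[Tuple[str, int]], target_size: int) -> List[List[str]]:
    """Group files so that each group's total size stays at or below target_size.

    Hypothetical helper: files are visited in input order and placed in the
    first group with enough room (plain first fit, no sorting).
    """
    groups: List[List[str]] = []  # file names per merged group
    totals: List[int] = []        # running size per merged group
    for name, size in file_sizes:
        for i, total in enumerate(totals):
            if total + size <= target_size:
                groups[i].append(name)
                totals[i] += size
                break
        else:
            # No existing group has room; a file larger than the target
            # also ends up alone in its own group.
            groups.append([name])
            totals.append(size)
    return groups


if __name__ == "__main__":
    sizes = [("a.txt", 40), ("b.txt", 70), ("c.txt", 30), ("d.txt", 60), ("e.txt", 30)]
    for group in first_fit_merge(sizes, target_size=100):
        print(group)
```

Each returned group would then be concatenated into one merged input file. Sorting the files by decreasing size before the loop gives the first-fit-decreasing variant, which often packs more tightly; whether the paper does this is not stated in the abstract.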

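The deadline-and-cost trade-off described in the abstract can be sketched under strong simplifying assumptions, none of which come from the paper: the work divides evenly across instances, throughput scales linearly with the number of instances, and each instance is billed per started hour. The function name `cheapest_plan` and its parameters are hypothetical. The sketch searches the instance counts that meet the deadline and returns the cheapest one.

```python
import math
from typing import Tuple


def cheapest_plan(total_bytes: float, bytes_per_second: float,
                  deadline_seconds: float, price_per_hour: float) -> Tuple[int, float]:
    """Return (instances, cost) for the cheapest fleet that meets the deadline.

    Assumptions (illustrative only): perfectly divisible work, measured
    per-instance throughput in bytes_per_second, and billing per started hour.
    """
    if min(total_bytes, bytes_per_second, deadline_seconds) <= 0:
        raise ValueError("inputs must be positive")
    work_seconds = total_bytes / bytes_per_second
    # Fewest instances that can finish before the deadline.
    min_instances = max(1, math.ceil(work_seconds / deadline_seconds))
    # Beyond this count every instance runs for at most one billed hour,
    # so adding more instances can only raise the cost.
    max_instances = max(min_instances, math.ceil(work_seconds / 3600))
    best = None
    for n in range(min_instances, max_instances + 1):
        billed_hours = math.ceil(work_seconds / n / 3600)
        cost = n * billed_hours * price_per_hour
        if best is None or cost < best[1]:
            best = (n, cost)
    return best


if __name__ == "__main__":
    # 2.5 hours of single-instance work, 1.5-hour deadline, 0.5 per hour.
    print(cheapest_plan(9000, 1, 5400, 0.5))
```

With per-hour billing the smallest fleet that meets the deadline is not always the cheapest: in the usage example, two instances each run 1.25 hours and are billed two hours apiece, while three instances each finish within one billed hour.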


Published In

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010
911 pages
ISBN: 9781605589428
DOI: 10.1145/1851476

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Amazon EC2
  2. cloud computing
  3. provisioning
  4. text processing

Qualifiers

  • Research-article

Conference

HPDC '10

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%


Article Metrics

  • Downloads (last 12 months): 3
  • Downloads (last 6 weeks): 2
Reflects downloads up to 24 Dec 2024.


Cited By

  • (2018) Crayons: Empowering CyberGIS by Employing Cloud Infrastructure. CyberGIS for Geospatial Discovery and Innovation. DOI: 10.1007/978-94-024-1531-5_7 (115-141). Online publication date: 27-Jun-2018
  • (2015) Cloud Query Manager. Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing. DOI: 10.1109/CLOUD.2015.98 (702-709). Online publication date: 27-Jun-2015
  • (2014) Application of Mobile Cloud-Based Technologies in News Reporting. Mobile Networks and Cloud Computing Convergence for Progressive Services and Applications. DOI: 10.4018/978-1-4666-4781-7.ch017 (320-343). Online publication date: 2014
  • (2014) Towards an MPI-like framework for the Azure cloud platform. Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. DOI: 10.1109/CCGrid.2014.100 (176-185). Online publication date: 26-May-2014
  • (2013) Compilation of References. Mobile Networks and Cloud Computing Convergence for Progressive Services and Applications. DOI: 10.4018/978-1-4666-4781-7.chcrf (0-0). Online publication date: 30-Nov-2013
  • (2013) AzureBOT. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. DOI: 10.1109/IPDPSW.2013.261 (2139-2146). Online publication date: 20-May-2013
  • (2013) Benchmarking Joyent Smartdatacenter for Hadoop Mapreduce and Mpi Operations. 2013 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM). DOI: 10.1109/CCEM.2013.6684429 (1-6). Online publication date: Oct-2013
  • (2012) AzureBench. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. DOI: 10.1109/IPDPSW.2012.128 (1048-1057). Online publication date: 21-May-2012
  • (2012) Lessons Learnt from the Development of GIS Application on Azure Cloud Platform. Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing. DOI: 10.1109/CLOUD.2012.140 (352-359). Online publication date: 24-Jun-2012
