Scientific Data Management in the Cloud: A Survey of Technologies, Approaches and Challenges

Sangmi Lee Pallickara,
Shrideep Pallickara &
Marlon Pierce

Abstract

Experimental sciences create vast amounts of data. In astronomy, the Pan-STARRS project (Pan-STARRS project, 2010; Jedicke, Magnier, Kaiser, & Chambers, 2006) is expected to produce more than a petabyte of images every year. In high-energy physics, the Large Hadron Collider will generate 50–100 petabytes of data each year, with about 20 PB of that data being stored and processed on a worldwide federation of national grids linking 100,000 CPUs (Large Hadron Collider project, 2010; Lammana, 2004).

Cloud computing is immensely appealing to the scientific community, which increasingly sees it as part of the solution for coping with burgeoning data volumes. Cloud computing enables economies of scale in facility design and hardware construction, and it lets groups of users host, process, and analyze large volumes of data from various sources. Several vendors offer cloud computing platforms, including Amazon Web Services (2010), Google’s App Engine (2010), AT&T’s Synaptic Hosting (2010), Rackspace (2010), GoGrid (2010) and AppNexus (2010). These vendors promise seemingly infinite amounts of computing power and storage, available on demand under a pay-only-for-what-you-use pricing model.

References

Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Engineering Bulletin, 32(1), 4–12.
Agrawal, R., Kiernan, J., Srikant, R., & Xu, Y. (2004). Order preserving encryption for numeric data. Proceedings of SIGMOD, 563–574.
Amazon Elastic MapReduce (2010). http://aws.amazon.com/elasticmapreduce/. Accessed on February 20, 2010.
Amazon EBS (2010). http://aws.amazon.com/ebs/. Accessed on February 20, 2010.
Amazon Public Datasets (2010). http://aws.amazon.com/publicdatasets/. Accessed on February 20, 2010.
Amazon RDS (2010). http://aws.amazon.com/rds/. Accessed on February 20, 2010.
Amazon SimpleDB (2010). http://aws.amazon.com/simpledb/. Accessed on February 20, 2010.
Amazon Web Services (2010). http://aws.amazon.com/. Accessed on February 20, 2010.
Antonioletti, M., Krause, A., Paton, N. W., Eisenberg, A., Laws, S., Malaika, S., et al. (2006). The WS-DAI family of specifications for web service data access and integration. ACM SIGMOD Record, 35(1), 48–55.
AppNexus (2010). http://www.appnexus.com/. Accessed on February 20, 2010.
AT&T Synaptic Hosting (2010). http://www.business.att.com/enterprise/Family/application-hosting-enterprise/synaptic-hosting-enterprise/. Accessed on February 20, 2010.
Baru, C. K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Padmanabhan, S., et al. (1995). DB2 parallel edition. IBM Systems Journal, 34(2), 292–322.
Budavari, T., Malik, T., Szalay, A. S., Thakar, A., & Gray, J. (2003). SkyQuery – A prototype distributed query web service for the virtual observatory. In H. Payne, R. I. Jedrzejewski, & R. N. Hook (Eds.), Proceedings of ADASS XII, Astronomical Society of the Pacific, ASP Conference Series (Vol. 295, p. 31).
Cary, A., Sun, Z., Hristidis, V., & Rishe, N. (2009). Experiences on processing spatial data with MapReduce. Proceedings of the 21st SSDBM Conference. Lecture Notes in Computer Science, Vol. 5566, 302–319.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (November 2006). Bigtable: A distributed storage system for structured data. OSDI’06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 205–218.
Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The data grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23, 187–200.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Dean, J., & Ghemawat, S. (December 2004). MapReduce: Simplified data processing on large clusters. Proceedings of OSDI, San Francisco, CA, 137–150.
Gannon, D., & Reed, D. (2009). Parallelism and the cloud. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery, Microsoft Research (pp. 131–136), ISBN-10: 0982544200.
FITS (2010). http://fits.gsfc.nasa.gov/. Accessed on February 20, 2010.
Gardner, J. (2007). Enabling knowledge discovery in a virtual universe. Proceedings of TeraGrid ’07: Broadening Participation in the TeraGrid, ACM Press.
Gardner, J. P., Connolly, A., & McBride, C. (2007). Enabling rapid development of parallel tree search applications. Proceedings of the 2007 Symposium on Challenges of Large Applications in Distributed Environments (CLADE 2007), ACM Press, 1–10.
Ge, T., & Zdonik, S. (2007). Answering aggregation queries in a secure system model. Proceedings of VLDB, 519–530.
GenBank (2010). http://www.ncbi.nlm.nih.gov/Genbank/. Accessed on February 20, 2010.
Ghemawat, S., Gobioff, H., & Leung, S.-T. (October 2003). The Google file system. Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, NY, 29–43.
GoGrid (2010). http://www.gogrid.com. Accessed on February 20, 2010.
Google App Engine (2010). http://code.google.com/appengine/. Accessed on February 20, 2010.
Gray, J. (2009). Jim Gray on eScience: A transformed scientific method. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery, Microsoft Research (pp. xvii–xxxi), ISBN-10: 0982544200.
Gray, J., Liu, D. T., Nieto-Santisteban, M. A., Szalay, A. S., Heber, G., & DeWitt, D. (December 2005). Scientific data management in the coming decade. SIGMOD Record, 34(4), 34–41.
Hacigumus, H., Iyer, B., Li, C., & Mehrotra, S. (2002). Executing SQL over encrypted data in the database-service-provider model. Proceedings of SIGMOD, 216–227.
HDF (2010). http://www.hdfgroup.org/HDF5/. Accessed on February 20, 2010.
Isard, M., & Yu, Y. (July 2009). Distributed data-parallel computing using a high-level programming language. Proceedings of the International Conference on Management of Data (SIGMOD), 987–994.
Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (March 2007). Dryad: Distributed data-parallel programs from sequential building blocks. Proceedings of European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21–23, 59–72.
IDL (2010). Interactive Data Language. http://www.ittvis.com/ProductServices/IDL.aspx. Accessed on February 20, 2010.
Jaeger-Frank, E., Crosby, C. J., Memon, A., Nandigam, V., Conner, J., Arrowsmith, J. R., et al. (December 2006). A domain independent three tier architecture applied to Lidar processing and monitoring. In the Special Issue of the Scientific Programming Journal devoted to WORKS06 and WSES06, 185–194.
Jedicke, R., Magnier, E. A., Kaiser, N., & Chambers, K. C. (2006). The next decade of solar system discovery with Pan-STARRS. Proceedings of IAU Symposium 236, 341–352.
Kantarcioglu, M., & Clifton, C. (2004). Security issues in querying encrypted data. 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, 325–337.
Lakshman, A., Malik, P., & Ranganathan, K. (2008). Cassandra, structured storage system over a P2P network. Keynote Presentation, SIGMOD, Calgary, Canada, 5–5.
Lammana, M. (November 2004). Nuclear instruments and methods in physics research section A: Accelerators, spectrometers, detectors and associated equipment. In the Proceedings of the 9th international Workshop on Advanced Computing and Analysis Techniques in Physics Research (Vol. 534, No. 1–2, pp. 1–6).
Large Hadron Collider project (2010). http://public.web.cern.ch/public/en/LHC/LHC-en.html. Accessed on February 20, 2010.
Li, Y., Perlman, E., Wan, M., Yang, Y., Meneveau, C., Burns, R., et al. (2008). A public turbulence database and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Computational Physics, 9(31), 1468–5248.
Loebman, S., Nunley, D., Kwon, Y. C., Howe, B., Balazinska, M., & Gardner, J. P. (2009). Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? Proceedings of the Workshop on Interfaces and Architecture for Scientific Data Storage (IASDS), 1–10.
LSST Science Collaborations and LSST Project (2009). LSST Science Book, Version 2.0, arXiv:0912.0201, http://www.lsst.org/lsst/scibook.
MacCormick, J., Murphy, N., Najork, M., Thekkath, C. A., & Zhou, L. (December 2004). Boxwood: Abstractions as the foundation for storage infrastructure. Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), San Francisco, CA, USA, 105–120.
Microsoft, SQL Azure (2010). http://www.microsoft.com/windowsazure/sqlazure/. Accessed on February 20, 2010.
Microsoft, Windows Azure (2010). http://www.microsoft.com/windowsazure/. Accessed on February 20, 2010.
Moore, R. W., Jagatheesan, A., Rajasekar, A., et al. (April 2004). Data grid management systems. Proceedings of the 21st IEEE/NASA Conference on Mass Storage Systems and Technologies (MSST), College Park, Maryland, USA, April 13–16, 2004.
Mykletun, E., & Tsudik, G. (2006). Aggregation queries in the database-as-a-service model. IFIP WG 11.3 on Data and Application Security, 89–103.
NCBI (2010). http://www.ncbi.nlm.nih.gov/guide/. Accessed on February 20, 2010.
NetCDF (2010). http://www.unidata.ucar.edu/software/netcdf/. Accessed on July 16, 2010.
Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (June 2008). Pig Latin: A not-so-foreign language for data processing. ACM SIGMOD 2008 International Conference on Management of Data, Vancouver, Canada, 1099–1110.
OpenMPI (2010). http://www.open-mpi.org/. Accessed on February 20, 2010.
OpenPBS (2010). http://www.pbsgridworks.com. Accessed on February 20, 2010.
Oracle Database 11g (2010). http://www.oracledatabase11g.com/. Accessed on February 20, 2010.
Oracle Real Application Cluster (2010). http://www.oracle.com/technology/products/database/clustering/index.html. Accessed on February 20, 2010.
Ozone (2010). http://www.ozone-db.org/frames/home/what.html. Accessed on February 20, 2010.
Palankar, M. R., Iamnitchi, A., Ripeanu, M., & Garfinkel, S. (2008). Amazon S3 for science grids: A viable solution? DADC ’08: Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, 55–64.
Pan-STARRS project (2010). http://pan-starrs.ifa.hawaii.edu/public/. Accessed on February 20, 2010.
Peng, J., & Law, K. H. Reference NEESgrid data model (Tech. Rep. NEESgrid-2004-40).
Pike, R., Dorward, S., Griesemer, R., & Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal Special Issues on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4), 227–298.
Plale, B., Gannon, D., Alameda, J., Wilhelmson, B., Hampton, S., Rossi, A., et al. (2005). Active management of scientific data. IEEE Internet Computing Special Issue on Internet Access to Scientific Data, 9(1), 27–34.
PubChem (2010). http://pubchem.ncbi.nlm.nih.gov/. Accessed on February 20, 2010.
PubMed (2010). http://www.ncbi.nlm.nih.gov/pubmed/. Accessed on February 20, 2010.
Rackspace (2010). http://www.rackspace.com/index.php. Accessed on February 20, 2010.
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (August 2001). A scalable content-addressable network. Proceedings of SIGCOMM, 161–172.
Rowstron, A., & Druschel, P. (November 2001). Pastry: Scalable, distributed object location and routing for large scale peer-to-peer systems. Proceedings of Middleware 2001, 329–350.
San Diego Supercomputing Center (2010). http://www.sdsc.edu/. Accessed on February 20, 2010.
SciDB (2010). http://scidb.org/. Accessed on February 20, 2010.
Simmhan, Y., Barga, R., van Ingen, C., Nieto-Santisteban, M., Dobos, L., Li, N., et al. (2009). GrayWulf: Scalable software architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.
Singh, G., Bharathi, S., Chervenak, A., Deelman, E., Kesselman, C., Manohar, M., et al. (2003). A metadata catalog service for data intensive applications. IEEE, ACM, Super Computing the international conference for High Performance Computing, Networking, Storage and Analysis, 33–50.
Stadel, J. G. (2001). Cosmological N-Body simulations and their analysis. (Doctoral dissertation, University of Washington, 2001).
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (August 2001). Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of SIGCOMM, 149–160.
Stonebraker, M. (1986). The case for shared nothing architecture. Database Engineering, 9(1), 4–9.
Szalay, A., Bell, G., Vandenberg, J., Wonders, A., Burns, R., Fay, D., et al. (2009). GrayWulf: Scalable clustered architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.
TeraGrid (2010). http://www.teragrid.org/. Accessed on February 20, 2010.
Thain, D., Tannenbaum, T., & Livny, M. (February–April 2005). Distributed computing in practice: The condor experience. Concurrency and Computation: Practice and Experience, 17(2–4), 323–356.
TIPSY (2010). http://hpcc.astro.washington.edu/tools/tipsy/tipsy.html. Accessed on February 20, 2010.
The Academic Cluster Computing Initiative (ACCI 2007). Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges, Google Official Press Center, http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html.
The Globus Toolkit (2010). Data replication service. http://www-unix.globus.org/toolkit/docs/4.0/techpreview/datarep/. Accessed on February 20, 2010.
Unidata (2010). http://www.unidata.ucar.edu/. Accessed on February 20, 2010.
Yu, Y., Gunda, P. K., & Isard, M. (October 2009). Distributed aggregation for data-parallel computing: Interfaces and implementations. Proceedings of the Symposium on Operating Systems Principles (SOSP).
Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (April 2001). Tapestry: An infrastructure for fault-tolerant wide-area location and routing (Tech. Rep. UCB/CSD-01-1141, CS Division, UC Berkeley).
Zverina, J. (2010). San Diego Supercomputing Center begins cloud computing research using the Google IBM CluE cluster. http://www.sdsc.edu/News%20Items/PR021309_clue.html. Accessed on February 20, 2010.
ZODB (2010). http://wiki.zope.org/481zope2/ZODBZopeObjectDatabase. Accessed on February 20, 2010.
Zookeeper (2010). http://wiki.apache.org/hadoop/ZooKeeper. Accessed on February 20, 2010.

Author information

Authors and Affiliations

Department of Computer Science, Colorado State University, Fort Collins, CO, USA
Sangmi Lee Pallickara & Shrideep Pallickara
Community Grids Lab, Indiana University, Bloomington, IN, USA
Marlon Pierce

Corresponding author

Correspondence to Sangmi Lee Pallickara.

Editor information

Editors and Affiliations

Dept. of Comp. & Elect. Engin., Florida Atlantic University, 777 Glades Road, Boca Raton, Florida 33431, USA
Borko Furht
LexisNexis, 6601 Park of Commerce Blvd, Boca Raton, Florida 33487, USA
Armando Escalante

About this chapter

Cite this chapter

Pallickara, S.L., Pallickara, S., Pierce, M. (2010). Scientific Data Management in the Cloud: A Survey of Technologies, Approaches and Challenges. In: Furht, B., Escalante, A. (eds) Handbook of Cloud Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6524-0_22

DOI: https://doi.org/10.1007/978-1-4419-6524-0_22
Published: 27 August 2010
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-6523-3
Online ISBN: 978-1-4419-6524-0
eBook Packages: Computer Science, Computer Science (R0)
