Scientific Data Management in the Cloud: A Survey of Technologies, Approaches and Challenges

Handbook of Cloud Computing

Abstract

Experimental sciences create vast amounts of data. In astronomy, the Pan-STARRS project (Pan-STARRS project, 2010; Jedicke, Magnier, Kaiser, & Chambers, 2006) is expected to produce more than a petabyte of images every year. In high-energy physics, the Large Hadron Collider will generate 50–100 petabytes of data each year, with about 20 PB of that data being stored and processed on a worldwide federation of national grids linking 100,000 CPUs (Large Hadron Collider project, 2010; Lammana, 2004).

Cloud computing is immensely appealing to the scientific community, which increasingly sees it as part of the solution for coping with burgeoning data volumes. It enables economies of scale in facility design and hardware construction, and it lets groups of users host, process, and analyze large volumes of data from various sources. Several vendors offer cloud computing platforms, including Amazon Web Services (2010), Google App Engine (2010), AT&T Synaptic Hosting (2010), Rackspace (2010), GoGrid (2010) and AppNexus (2010). These vendors promise seemingly infinite amounts of computing power and storage, available on demand under a pay-only-for-what-you-use pricing model.

References

  • Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Engineering Bulletin, 32(1), 4–12.

  • Agrawal, R., Kiernan, J., Srikant, R., & Xu, Y. (2004). Order preserving encryption for numeric data. Proceedings of SIGMOD, 563–574.

  • Amazon Elastic MapReduce (2010). http://aws.amazon.com/elasticmapreduce/. Accessed on February 20, 2010.

  • Amazon EBS (2010). http://aws.amazon.com/ebs/. Accessed on February 20, 2010.

  • Amazon Public Datasets (2010). http://aws.amazon.com/publicdatasets/. Accessed on February 20, 2010.

  • Amazon RDS (2010). http://aws.amazon.com/rds/. Accessed on February 20, 2010.

  • Amazon SimpleDB (2010). http://aws.amazon.com/simpledb/. Accessed on February 20, 2010.

  • Amazon Web Services (2010). http://aws.amazon.com/ Accessed on February 20, 2010.

  • Antonioletti, M., Krause, A., Paton, N. W., Eisenberg, A., Laws, S., Malaika, S., et al. (2006). The WS-DAI family of specifications for web service data access and integration. ACM SIGMOD Record, 35(1), 48–55.

  • AppNexus (2010). http://www.appnexus.com/ Accessed on February 20, 2010.

  • AT&T Synaptic Hosting (2010). http://www.business.att.com/enterprise/Family/application-hosting-enterprise/synaptic-hosting-enterprise/ Accessed on February 20, 2010.

  • Baru, C. K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Padmanabhan, S., et al. (1995). DB2 Parallel Edition. IBM Systems Journal, 34(2), 292–322.

  • Budavari, T., Malik, T., Szalay, A. S., Thakar, A., & Gray, J. (2003). SkyQuery – A prototype distributed query web service for the virtual observatory. In H. Payne, R. I. Jedrzejewski, & R. N. Hook (Eds.), Proceedings of ADASS XII, Astronomical Society of the Pacific, ASP Conference Series (Vol. 295, p. 31).

  • Cary, A., Sun, Z., Hristidis, V., & Rishe, N. (2009). Experiences on processing spatial data with MapReduce. Proceedings of the 21st SSDBM Conference. Lecture Notes in Computer Science, Vol. 5566, 302–319.

  • Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (November 2006). Bigtable: A distributed storage system for structured data. OSDI’06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 205–218.

  • Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The data grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23, 187–200.

  • Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.

  • Dean, J., & Ghemawat, S. (December 2004). MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, San Francisco, CA, 137–150.

  • Gannon, D., & Reed, D. (2009). Parallelism and the cloud. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery, Microsoft Research (pp. 131–136), ISBN-10:0982544200.

  • FITS (2010). http://fits.gsfc.nasa.gov/. Accessed on February 20, 2010.

  • Gardner, J. (2007). Enabling knowledge discovery in a virtual universe. Proceedings of TeraGrid ’07: Broadening Participation in the TeraGrid, ACM Press.

  • Gardner, J. P., Connolly, A., & McBride, C. (2007). Enabling rapid development of parallel tree search applications. Proceedings of the 2007 Symposium on Challenges of Large Applications in Distributed Environments (CLADE 2007), ACM Press, 1–10.

  • Ge, T., & Zdonik, S. (2007). Answering aggregation queries in a secure system model. Proceedings of VLDB, 519–530.

  • GenBank (2010). http://www.ncbi.nlm.nih.gov/Genbank/ Accessed on February 20, 2010.

  • Ghemawat, S., Gobioff, H., & Leung, S.-T. (October 2003). The Google file system. Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, NY, 29–43.

  • GoGrid (2010). http://www.gogrid.com Accessed on February 20, 2010.

  • Google App Engine (2010). http://code.google.com/appengine/ Accessed on February 20, 2010.

  • Gray, J. (2009). Jim Gray on eScience: A transformed scientific method. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery, Microsoft Research (pp. xvii–xxxi), ISBN-10:0982544200.

  • Gray, J., Liu, D. T., Nieto-Santisteban, M. A., Szalay, A. S., Heber, G., & DeWitt, D. (December 2005). Scientific data management in the coming decade. SIGMOD Record, 34(4), 34–41.

  • Hacigumus, H., Iyer, B., Li, C., & Mehrotra, S. (2002). Executing SQL over encrypted data in the database-service-provider model. Proceedings of SIGMOD, 216–227.

  • HDF (2010). http://www.hdfgroup.org/HDF5/. Accessed on February 20, 2010.

  • Isard, M., & Yu, Y. (July 2009). Distributed data-parallel computing using a high-level programming language. Proceedings of the International Conference on Management of Data (SIGMOD), 987–994.

  • Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (March 2007). Dryad: Distributed data-parallel programs from sequential building blocks. Proceedings of European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21–23, 59–72.

  • IDL (2010). Interactive Data Language. http://www.ittvis.com/ProductServices/IDL.aspx. Accessed on February 20, 2010.

  • Jaeger-Frank, E., Crosby, C. J., Memon, A., Nandigam, V., Conner, J., Arrowsmith, J. R., et al. (December 2006). A domain independent three tier architecture applied to Lidar processing and monitoring. In the Special Issue of the Scientific Programming Journal devoted to WORKS06 and WSES06, 185–194.

  • Jedicke, R., Magnier, E. A., Kaiser, N., & Chambers, K. C. (2006). The next decade of solar system discovery with Pan-STARRS. Proceedings of IAU Symposium 236, 341–352.

  • Kantarcioglu, M., & Clifton, C. (2004). Security issues in querying encrypted data. 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, 325–337.

  • Lakshman, A., Malik, P., & Ranganathan, K. (2008). Cassandra, structured storage system over a P2P network. Keynote Presentation, SIGMOD, Calgary, Canada, 5–5.

  • Lammana, M. (November 2004). Nuclear instruments and methods in physics research section A: Accelerators, spectrometers, detectors and associated equipment. In the Proceedings of the 9th international Workshop on Advanced Computing and Analysis Techniques in Physics Research (Vol. 534, No. 1–2, pp. 1–6).

  • Large Hadron Collider project (2010). http://public.web.cern.ch/public/en/LHC/LHC-en.html Accessed on February 20, 2010.

  • Li, Y., Perlman, E., Wan, M., Yang, Y., Meneveau, C., Burns, R., et al. (2008). A public turbulence database and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Computational Physics, 9(31), 1468–5248.

  • Loebman, S., Nunley, D., Kwon, Y. C., Howe, B., Balazinska, M., & Gardner, J. P. (2009). Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? Proceedings of the Workshop on Interfaces and Architecture for Scientific Data Storage (IASDS), 1–10.

  • LSST Science Collaborations and LSST Project (2009). LSST Science Book, Version 2.0, arXiv:0912.0201, http://www.lsst.org/lsst/scibook.

  • MacCormick, J., Murphy, N., Najork, M., Thekkath, C. A., & Zhou, L. (December 2004). Boxwood: Abstractions as the foundation for storage infrastructure. Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), San Francisco, CA, USA, 105–120.

  • Microsoft, SQL Azure (2010). http://www.microsoft.com/windowsazure/sqlazure/ Accessed on February 20, 2010.

  • Microsoft, Windows Azure (2010). http://www.microsoft.com/windowsazure/ Accessed on February 20, 2010.

  • Moore, R. W., Jagatheesan, A., Rajasekar, A., et al. (April 2004). Data grid management systems. Proceedings of the 21st IEEE/NASA Conference on Mass Storage Systems and Technologies (MSST), College Park, Maryland, USA, April 13–16, 2004.

  • Mykletun, E., & Tsudik, G. (2006). Aggregation queries in the database-as-a-service model. IFIP WG 11.3 on Data and Application Security, 89–103.

  • NCBI (2010). http://www.ncbi.nlm.nih.gov/guide/ Accessed on February 20, 2010.

  • NetCDF (2010). http://www.unidata.ucar.edu/software/netcdf/. Accessed on July 16, 2010.

  • Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (June 2008). Pig Latin: A not-so-foreign language for data processing. ACM SIGMOD 2008 International Conference on Management of Data, Vancouver, Canada, 1099–1110.

  • OpenMPI (2010). http://www.open-mpi.org/. Accessed on February 20, 2010.

  • OpenPBS (2010). http://www.pbsgridworks.com. Accessed on February 20, 2010.

  • Oracle Database 11g (2010). http://www.oracledatabase11g.com/. Accessed on February 20, 2010.

  • Oracle Real Application Cluster (2010). http://www.oracle.com/technology/products/database/clustering/index.html. Accessed on February 20, 2010.

  • Ozone (2010). http://www.ozone-db.org/frames/home/what.html. Accessed on February 20, 2010.

  • Palankar, M. R., Iamnitchi, A., Ripeanu, M., & Garfinkel, S. (2008). Amazon S3 for science grids: A viable solution? DADC ’08: Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, 55–64.

  • Pan-STARRS project (2010). http://pan-starrs.ifa.hawaii.edu/public/ Accessed on February 20, 2010.

  • Peng, J., & Law, K. H. Reference NEESgrid data model (Tech. Rep. NEESgrid-2004-40).

  • Pike, R., Dorward, S., Griesemer, R., & Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal Special Issues on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4), 227–298.

  • Plale, B., Gannon, D., Alameda, J., Wilhelmson, B., Hampton, S., Rossi, A., et al. (2005). Active management of scientific data. IEEE Internet Computing Special Issue on Internet Access to Scientific Data, 9(1), 27–34.

  • PubChem (2010). http://pubchem.ncbi.nlm.nih.gov/ Accessed on February 20, 2010.

  • PubMed (2010). http://www.ncbi.nlm.nih.gov/pubmed/ Accessed on February 20, 2010.

  • Rackspace (2010). http://www.rackspace.com/index.php Accessed on February 20, 2010.

  • Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (August 2001). A scalable content-addressable network. Proceedings of SIGCOMM, 161–172.

  • Rowstron, A., & Druschel, P. (November 2001). Pastry: Scalable, distributed object location and routing for large scale peer-to-peer systems. Proceedings of Middleware 2001, 329–350.

  • San Diego Supercomputing Center (2010). http://www.sdsc.edu/. Accessed on February 20, 2010.

  • SciDB (2010). http://scidb.org/ Accessed on February 20, 2010.

  • Simmhan, Y., Barge, R., van Ingen, C., Nieto-Santisteban, M., Dobos, L., Li, N., et al. (2009). GrayWulf: Scalable software architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.

  • Singh, G., Bharathi, S., Chervenak, A., Deelman, E., Kesselman, C., Manohar, M., et al. (2003). A metadata catalog service for data intensive applications. IEEE, ACM, Super Computing the international conference for High Performance Computing, Networking, Storage and Analysis, 33–50.

  • Stadel, J. G. (2001). Cosmological N-Body simulations and their analysis. (Doctoral dissertation, University of Washington, 2001).

  • Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (August 2001). Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of SIGCOMM, 149–160.

  • Stonebraker, M. (1986). The case for shared nothing architecture. Database Engineering, 9(1), 4–9.

  • Szalay, A., Bell, G., Vandenberg, J., Wonders, A., Burns, R., Fay, D., et al. (2009). GrayWulf: Scalable clustered architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.

  • Teragrid (2010). http://www.teragrid.org/. Accessed on February 20, 2010.

  • Thain, D., Tannenbaum, T., & Livny, M. (February–April 2005). Distributed computing in practice: The condor experience. Concurrency and Computation: Practice and Experience, 17(2–4), 323–356.

  • TIPSY (2010). http://hpcc.astro.washington.edu/tools/tipsy/tipsy.html. Accessed on February 20, 2010.

  • The Academic Cluster Computing Initiative (ACCI 2007). Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges, Google Official Press Center, http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html.

  • The Globus Toolkit (2010). Data replication service. http://www-unix.globus.org/toolkit/docs/4.0/techpreview/datarep/ Accessed on February 20, 2010.

  • Unidata (2010). http://www.unidata.ucar.edu/ Accessed on February 20, 2010.

  • Yu, Y., Gunda, P. K., & Isard, M. (October 2009). Distributed aggregation for data-parallel computing: Interfaces and implementations. Proceedings of the Symposium on Operating Systems Principles (SOSP).

  • Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (April 2001). Tapestry: An infrastructure for fault-tolerant wide-area location and routing (Tech. Rep. UCB/CSD-01-1141, CS Division, UC Berkeley).

  • Zverina, J. (2010). San Diego Supercomputing Center begins cloud computing research using the Google-IBM CluE cluster. http://www.sdsc.edu/News%20Items/PR021309_clue.html. Accessed on February 20, 2010.

  • ZODB (2010). http://wiki.zope.org/481zope2/ZODBZopeObjectDatabase. Accessed on February 20, 2010.

  • Zookeeper (2010). http://wiki.apache.org/hadoop/ZooKeeper. Accessed on February 20, 2010.

Author information

Correspondence to Sangmi Lee Pallickara.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Pallickara, S.L., Pallickara, S., Pierce, M. (2010). Scientific Data Management in the Cloud: A Survey of Technologies, Approaches and Challenges. In: Furht, B., Escalante, A. (eds) Handbook of Cloud Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6524-0_22

  • DOI: https://doi.org/10.1007/978-1-4419-6524-0_22

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-6523-3

  • Online ISBN: 978-1-4419-6524-0

  • eBook Packages: Computer Science, Computer Science (R0)
