Abstract
Experimental sciences create vast amounts of data. In astronomy, the Pan-STARRS project (Pan-STARRS project, 2010; Jedicke, Magnier, Kaiser, & Chambers, 2006) is expected to produce more than a petabyte of images every year. In high-energy physics, the Large Hadron Collider will generate 50–100 petabytes of data each year, with about 20 petabytes of that data being stored and processed on a worldwide federation of national grids linking 100,000 CPUs (Large Hadron Collider project, 2010; Lammana, 2004).
Cloud computing is immensely appealing to the scientific community, which increasingly sees it as part of the solution for coping with burgeoning data volumes. It enables economies of scale in facility design and hardware construction, and lets groups of users host, process, and analyze large volumes of data from various sources. Several vendors offer cloud computing platforms, including Amazon Web Services (2010), Google’s App Engine (2010), AT&T’s Synaptic Hosting (2010), Rackspace (2010), GoGrid (2010), and AppNexus (2010). These vendors promise seemingly infinite amounts of computing power and storage, available on demand under a pay-only-for-what-you-use pricing model.
References
Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Engineering Bulletin, 32(1), 4–12.
Agrawal, R., Kiernan, J., Srikant, R., & Xu, Y. (2004). Order preserving encryption for numeric data. Proceedings of SIGMOD, 563–574.
Amazon Elastic MapReduce (2010). http://aws.amazon.com/elasticmapreduce/. Accessed on February 20, 2010.
Amazon EBS (2010). http://aws.amazon.com/ebs/. Accessed on February 20, 2010.
Amazon Public Datasets (2010). http://aws.amazon.com/publicdatasets/. Accessed on February 20, 2010.
Amazon RDS (2010). http://aws.amazon.com/rds/. Accessed on February 20, 2010.
Amazon SimpleDB (2010). http://aws.amazon.com/simpledb/. Accessed on February 20, 2010.
Amazon Web Services (2010). http://aws.amazon.com/ Accessed on February 20, 2010.
Antonioletti, M., Krause, A., Paton, N. W., Eisenberg, A., Laws, S., Malaika, S., et al. (2006). The WS-DAI family of specifications for web service data access and integration. ACM SIGMOD Record, 35(1), 48–55.
AppNexus (2010). http://www.appnexus.com/ Accessed on February 20, 2010.
AT&T Synaptic Hosting (2010). http://www.business.att.com/enterprise/Family/application-hosting-enterprise/synaptic-hosting-enterprise/ Accessed on February 20, 2010.
Baru, C. K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Padmanabhan, S., et al. (1995). DB2 parallel edition. IBM Systems Journal, 34(2), 292–322.
Budavari, T., Malik, T., Szalay, A. S., Thakar, A., & Gray, J. (2003). SkyQuery – A prototype distributed query web service for the virtual observatory. In H. Payne, R. I. Jedrzejewski, & R. N. Hook (Eds.), Proceedings of ADASS XII, Astronomical Society of the Pacific, ASP Conference Series (Vol. 295, p. 31).
Cary, A., Sun, Z., Hristidis, V., & Rishe, N. (2009). Experiences on processing spatial data with MapReduce. Proceedings of the 21st SSDBM Conference. Lecture Notes in Computer Science, Vol. 5566, 302–319.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (November 2006). Bigtable: A distributed storage system for structured data. OSDI’06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 205–218.
Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The data grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23, 187–200.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Dean, J., & Ghemawat, S. (December 2004). MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, San Francisco, CA, 137–150.
Gannon, D., & Reed, D. (2009). Parallelism and the cloud. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery, Microsoft research (pp 131–136), ISBN-10:0982544200.
FITS (2010). http://fits.gsfc.nasa.gov/. Accessed on February 20, 2010.
Gardner, J. (2007). Enabling knowledge discovery in a virtual universe. Proceedings of TeraGrid ’07: Broadening Participation in the TeraGrid, ACM Press.
Gardner, J. P., Connolly, A., & McBride, C. (2007). Enabling rapid development of parallel tree search applications. Proceedings of the 2007 Symposium on Challenges of Large Applications in Distributed Environments (CLADE 2007), ACM Press, 1–10.
Ge, T., & Zdonik, S. (2007). Answering aggregation queries in a secure system model. Proceedings of VLDB, 519–530.
GenBank (2010). http://www.ncbi.nlm.nih.gov/Genbank/ Accessed on February 20, 2010.
Ghemawat, S., Gobioff, H., & Leung, S.-T. (October 2003). The Google file system. Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, NY, 29–43.
GoGrid (2010). http://www.gogrid.com Accessed on February 20, 2010.
Google App Engine (2010). http://code.google.com/appengine/ Accessed on February 20, 2010.
Gray, J. (2009). Jim Gray on eScience: A transformed scientific method. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery, Microsoft research (pp xvii–xxxi), ISBN-10:0982544200.
Gray, J., Liu, D. T., Nieto-Santisteban, M. A., Szalay, A. S., Heber, G., & DeWitt, D. (December 2005). Scientific data management in the coming decade. SIGMOD Record, 34(4), 34–41.
Hacigumus, H., Iyer, B., Li, C., & Mehrotra, S. (2002). Executing SQL over encrypted data in the database-service-provider model. Proceedings of SIGMOD, 216–227.
HDF (2010). http://www.hdfgroup.org/HDF5/. Accessed on February 20, 2010.
Isard, M., & Yu, Y. (July 2009). Distributed data-parallel computing using a high-level programming language. Proceedings of the International Conference on Management of Data (SIGMOD), 987–994.
Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (March 2007). Dryad: Distributed data-parallel programs from sequential building blocks. Proceedings of European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21–23, 59–72.
IDL (2010). Interactive Data Language. http://www.ittvis.com/ProductServices/IDL.aspx. Accessed on February 20, 2010.
Jaeger-Frank, E., Crosby, C. J., Memon, A., Nandigam, V., Conner, J., Arrowsmith, J. R., et al. (December 2006). A domain independent three tier architecture applied to Lidar processing and monitoring. In the Special Issue of the Scientific Programming Journal devoted to WORKS06 and WSES06, 185–194.
Jedicke, R., Magnier, E. A., Kaiser, N., & Chambers, K. C. (2006). The next decade of solar system discovery with Pan-STARRS. Proceedings of IAU Symposium 236, 341–352.
Kantarcioglu, M., & Clifton, C. (2004). Security issues in querying encrypted data. 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, 325–337.
Lakshman, A., Malik, P., & Ranganathan, K. (2008). Cassandra, structured storage system over a P2P network. Keynote Presentation, SIGMOD, Calgary, Canada, 5–5.
Lammana, M. (November 2004). Nuclear instruments and methods in physics research section A: Accelerators, spectrometers, detectors and associated equipment. In the Proceedings of the 9th international Workshop on Advanced Computing and Analysis Techniques in Physics Research (Vol. 534, No. 1–2, pp. 1–6).
Large Hadron Collider project (2010). http://public.web.cern.ch/public/en/LHC/LHC-en.html Accessed on February 20, 2010.
Li, Y., Perlman, E., Wan, M., Yang, Y., Meneveau, C., Burns, R., et al. (2008). A public turbulence database and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Computational Physics, 9(31), 1468–5248.
Loebman, S., Nunley, D., Kwon, Y. C., Howe, B., Balazinska, M., & Gardner, J. P. (2009). Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? Proceedings of the Workshop on Interfaces and Architecture for Scientific Data Storage (IASDS), 1–10.
LSST Science Collaborations and LSST Project (2009). LSST Science Book, Version 2.0, arXiv:0912.0201, http://www.lsst.org/lsst/scibook.
MacCormick, J., Murphy, N., Najork, M., Thekkath, C. A., & Zhou, L. (December 2004). Boxwood: Abstractions as the foundation for storage infrastructure. Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), San Francisco, CA, USA, 105–120.
Microsoft, SQL Azure (2010). http://www.microsoft.com/windowsazure/sqlazure/ Accessed on February 20, 2010.
Microsoft, Windows Azure (2010). http://www.microsoft.com/windowsazure/ Accessed on February 20, 2010.
Moore, R. W., Jagatheesan, A., Rajasekar, A., et al. (April 2004). Data grid management systems. Proceedings of the 21st IEEE/NASA Conference on Mass Storage Systems and Technologies (MSST), College Park, Maryland, USA, April 13–16, 2004.
Mykletun, E., & Tsudik, G. (2006). Aggregation queries in the database-as-a-service model. IFIP WG 11.3 on Data and Application Security, 89–103.
NCBI (2010). http://www.ncbi.nlm.nih.gov/guide/ Accessed on February 20, 2010.
NetCDF (2010). http://www.unidata.ucar.edu/software/netcdf/. Accessed on July 16, 2010.
Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (June 2008). Pig latin: A not-so-foreign language for data processing. ACM SIGMOD 2008 International Conference on Management of Data, Vancouver, Canada, 1099–1110.
OpenMPI (2010). http://www.open-mpi.org/. Accessed on February 20, 2010.
OpenPBS (2010). http://www.pbsgridworks.com. Accessed on February 20, 2010.
Oracle Database 11g (2010). http://www.oracledatabase11g.com/. Accessed on February 20, 2010.
Oracle Real Application Cluster (2010). http://www.oracle.com/technology/products/database/clustering/index.html. Accessed on February 20, 2010.
Ozone (2010). http://www.ozone-db.org/frames/home/what.html. Accessed on February 20, 2010.
Palankar, M. R., Iamnitchi, A., Ripeanu, M., & Garfinkel, S. (2008). Amazon S3 for science grids: A viable solution? DADC ’08: Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, 55–64.
Pan-STARRS project (2010). http://pan-starrs.ifa.hawaii.edu/public/ Accessed on February 20, 2010.
Peng, J., & Law, K. H. Reference NEESgrid data model (Tech. Rep. NEESgrid-2004-40).
Pike, R., Dorward, S., Griesemer, R., & Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal Special Issues on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4), 227–298.
Plale, B., Gannon, D., Alameda, J., Wilhelmson, B., Hampton, S., Rossi, A., et al. (2005). Active management of scientific data. IEEE Internet Computing Special Issue on Internet Access to Scientific Data, 9(1), 27–34.
PubChem (2010). http://pubchem.ncbi.nlm.nih.gov/ Accessed on February 20, 2010.
PubMed (2010). http://www.ncbi.nlm.nih.gov/pubmed/ Accessed on February 20, 2010.
Rackspace (2010). http://www.rackspace.com/index.php Accessed on February 20, 2010.
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (August 2001). A scalable content-addressable network. Proceedings of SIGCOMM, 161–172.
Rowstron, A., & Druschel, P. (November 2001). Pastry: Scalable, distributed object location and routing for large scale peer-to-peer systems. Proceedings of Middleware 2001, 329–350.
San Diego Supercomputing Center (2010). http://www.sdsc.edu/. Accessed on February 20, 2010.
SciDB (2010). http://scidb.org/ Accessed on February 20, 2010.
Simmhan, Y., Barge, R., van Ingen, C., Nieto-Santisteban, M., Dobos, L., Li, N., et al. (2009). GrayWulf: Scalable software architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.
Singh, G., Bharathi, S., Chervenak, A., Deelman, E., Kesselman, C., Manohar, M., et al. (2003). A metadata catalog service for data intensive applications. IEEE, ACM, Super Computing the international conference for High Performance Computing, Networking, Storage and Analysis, 33–50.
Stadel, J. G. (2001). Cosmological N-Body simulations and their analysis. (Doctoral dissertation, University of Washington, 2001).
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (August 2001). Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of SIGCOMM, 149–160.
Stonebraker, M. (1986). The case for shared nothing architecture. Database Engineering, 9(1), 4–9.
Szalay, A., Bell, G., Vandenberg, J., Wonders, A., Burns, R., Fay, D., et al. (2009). GrayWulf: Scalable clustered architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.
Teragrid (2010). http://www.teragrid.org/. Accessed on February 20, 2010.
Thain, D., Tannenbaum, T., & Livny, M. (February–April 2005). Distributed computing in practice: The condor experience. Concurrency and Computation: Practice and Experience, 17(2–4), 323–356.
TIPSY (2010). http://hpcc.astro.washington.edu/tools/tipsy/tipsy.html. Accessed on February 20, 2010.
The Academic Cluster Computing Initiative (ACCI 2007). Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges, Google Official Press Center, http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html.
The Globus Toolkit (2010). Data replication service. http://www-unix.globus.org/toolkit/docs/4.0/techpreview/datarep/ Accessed on February 20, 2010.
Unidata (2010). http://www.unidata.ucar.edu/ Accessed on February 20, 2010.
Yu, Y., Gunda, P. K., & Isard, M. (October 2009). Distributed aggregation for data-parallel computing: Interfaces and implementations. Proceedings of the Symposium on Operating Systems Principles (SOSP).
Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (April 2001). Tapestry: An infrastructure for fault-tolerant wide-area location and routing (Tech. Rep. UCB/CSD-01-1141, CS Division, UC Berkeley).
Zverina, J. (2010). San Diego supercomputing center begins cloud computing research using the Google IBM CluE cluster. http://www.sdsc.edu/News%20Items/PR021309_clue.html. Accessed on February 20, 2010.
ZODB (2010). http://wiki.zope.org/481zope2/ZODBZopeObjectDatabase. Accessed on February 20, 2010.
Zookeeper (2010). http://wiki.apache.org/hadoop/ZooKeeper. Accessed on February 20, 2010.
Copyright information
© 2010 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Pallickara, S.L., Pallickara, S., Pierce, M. (2010). Scientific Data Management in the Cloud: A Survey of Technologies, Approaches and Challenges. In: Furht, B., Escalante, A. (eds) Handbook of Cloud Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6524-0_22
DOI: https://doi.org/10.1007/978-1-4419-6524-0_22
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-6523-3
Online ISBN: 978-1-4419-6524-0
eBook Packages: Computer Science, Computer Science (R0)