  • Review Article

Computational solutions to large-scale data management and analysis

Key Points

  • Biological research is becoming ever more information-driven, with individual laboratories now capable of generating terabytes of data in a matter of days. Supercomputing resources will increasingly be needed to get the most from the big data sets that researchers generate or analyse.

  • The big data revolution in biology is matched by a revolution in high-performance computing that is making supercomputing resources available to anyone with an internet connection.

  • A number of challenges are posed by large-scale data analysis, including data transfer (bringing the data and computational resources together), controlling access to the data, managing the data, standardizing data formats and integrating data of multiple different types to accurately model biological systems.

  • New computational solutions that are readily available to all can aid in addressing these challenges. These solutions include cloud-based computing and high-speed, low-cost heterogeneous computational environments. Taking advantage of these resources requires a thorough understanding of the data and the computational problem.

  • Knowing how an analysis algorithm can be parallelized enables a computational problem to be solved more efficiently by distributing tasks over many computer processors. Parallelism falls into two broad categories: loosely coupled (or coarse-grained) parallelism and tightly coupled (or fine-grained) parallelism, each benefiting from a different type of computational platform, depending on the problem of interest.

  • Clusters of computers can be optimized for many different classes of computationally intense applications, such as sequence alignment, genome-wide association tests and reconstruction of Bayesian networks. Cloud computing makes cluster-based computing more accessible and affordable for all. The distributed computing paradigm MapReduce was designed for cloud-based computing to solve problems that have loosely coupled parallelism, such as mapping raw DNA sequence reads to a reference genome (a minimal sketch of this style of parallelism follows this list).

  • Cloud computing provides a highly flexible, low-cost computational environment. However, the costs of cloud computing include sacrificing control of the underlying hardware and requiring that big data sets be transferred into the cloud for processing.

  • Heterogeneous multi-core computational systems, such as graphics processing units (GPUs), are complementary to cloud-based computing and operate as low-cost, specialized accelerators that can increase peak arithmetic throughput by 10-fold to 100-fold. These systems are specifically tuned to efficiently solve problems involving massive tightly coupled parallelism.

  • Heterogeneous computing provides a low-cost, flexible computational environment that improves performance and efficiency by exposing architectural features to programmers. However, programming applications to run in these environments requires significant informatics expertise.

  • Cloud providers such as Microsoft make advanced cloud computing resources freely available to individual researchers through a competitive, peer-reviewed granting process. Other providers, such as Amazon, provide advanced cloud storage and computational resources via an intuitive and simple web interface. Users of Amazon Web Services can now not only upload big data sets and analysis tools to Amazon S3 but also solve problems using MapReduce via a point-and-click interface.
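
To make the distinction concrete, the following minimal Python sketch illustrates loosely coupled, MapReduce-style parallelism: each worker process independently tallies aligned reads per chromosome (the map step) and the partial tallies are then merged (the reduce step). The input file name, its tab-delimited layout and the choice of the standard multiprocessing module are illustrative assumptions, not details from the article.

```python
# Sketch of loosely coupled (MapReduce-style) parallelism: count aligned
# reads per chromosome. Assumes a hypothetical tab-delimited alignment
# file with the chromosome name in the first column.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: one worker independently tallies reads per chromosome."""
    counts = Counter()
    for line in lines:
        counts[line.split("\t", 1)[0]] += 1
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-worker tallies into a single result."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

if __name__ == "__main__":
    with open("alignments.tsv") as handle:          # hypothetical input file
        lines = handle.readlines()
    chunks = [lines[i::4] for i in range(4)]        # four independent work units
    with Pool(processes=4) as pool:
        partials = pool.map(map_chunk, chunks)      # map tasks run in parallel
    print(reduce_counts(partials))
```

Because the chunks share no state, the same pattern scales from a single multi-core machine to a cluster or a cloud-based MapReduce service.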

Abstract

Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle our big data problems.

Figure 1: Generating and integrating large-scale, diverse types of data.
Figure 2: Cluster, cloud, grid and heterogeneous computing hardware and software stacks.
Figure 3: Amazon Web Services.

Author information

Corresponding author

Correspondence to Eric E. Schadt.

Ethics declarations

Competing interests

Eric E. Schadt, Jon Sorenson and Lawrence Lee are all employed by Pacific Biosciences and own stock in the company.

Related links

FURTHER INFORMATION

1000 Genomes Project

3Tera Application Store

Amazon Machine Images

Amazon Web Services Management Console (MC)

CLC Bioinformatics Cube

Collaboration between the National Science Foundation and Microsoft Research for access to cloud computing

Condor Project

Database of Genotypes and Phenotypes (dbGAP)

Ensembl

GenBank

Gene Expression Omnibus (GEO)

Nature Reviews Genetics audio slide show on 'Computational solutions to large-scale data management'

NIST Technical Report

NVIDIA Bio WorkBench

Pacific Biosciences Developers Network

Protein Data Bank (PDB)

Public data sets available through Amazon Web Services

UniGene

VMware Virtual Appliances

Glossary

Petabyte

Refers to 10^15 bytes. Many large computer systems now hold multiple petabytes of storage.

Cloud-based computing

The abstraction of the underlying hardware architectures (for example, servers, storage and networking) that enables convenient, on-demand network access to a shared pool of computing resources that can be readily provisioned and released.

Heterogeneous computational environments

Computers that integrate specialized accelerators, for example, graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), alongside general purpose processors (GPPs).

High-performance computing

A catch-all term for hardware and software systems that are used to solve 'advanced' computational problems.

Bayesian network

A network that captures causal relationships between variables or nodes of interest (for example, transcription levels of a gene, protein states, and so on). Bayesian networks enable the incorporation of prior information in establishing relationships between nodes.
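
As a minimal illustration of the factorization that a Bayesian network encodes, the sketch below defines a hypothetical three-node network, SNP -> expression -> trait, and computes probabilities as a product of conditional distributions. The structure and all numbers are invented for illustration and are not taken from the article.

```python
# Hypothetical three-node Bayesian network: S (SNP) -> E (expression) -> T (trait).
# The network structure implies the factorization P(S, E, T) = P(S) P(E|S) P(T|E).
p_s = {0: 0.7, 1: 0.3}                                        # P(S)
p_e_given_s = {0: {"low": 0.8, "high": 0.2},
               1: {"low": 0.3, "high": 0.7}}                  # P(E | S)
p_t_given_e = {"low": {"unaffected": 0.9, "affected": 0.1},
               "high": {"unaffected": 0.4, "affected": 0.6}}  # P(T | E)

def joint(s, e, t):
    """Joint probability implied by the network structure."""
    return p_s[s] * p_e_given_s[s][e] * p_t_given_e[e][t]

# Marginal probability of an affected trait, summing over SNP and expression states.
p_affected = sum(joint(s, e, "affected") for s in p_s for e in ("low", "high"))
print(p_affected)
```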

NP hard

For the purposes of this paper, NP hard problems are some of the most difficult computational problems; as such, they are typically not solved exactly, but with heuristics and high-performance computing.

Algorithm

A well-defined method or list of instructions for solving a problem.

Parallelization

Parallelizing an algorithm enables different tasks that are carried out by its implementation to be distributed across multiple processors, so that multiple tasks can be carried out simultaneously.

Markov chain Monte Carlo

A general method for integrating over probability distributions so that inferences can be made about model parameters or predictions can be made from a model of interest. The required samples are drawn from a specially constructed Markov chain: a discrete-time random process in which the distribution of the random variable at a given time point, conditional on all previous time points, depends only on the value at the immediately preceding time point.
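
The sketch below is a minimal Metropolis sampler, one common Markov chain Monte Carlo scheme, drawing samples from a standard normal target with a symmetric random-walk proposal; it is an illustration of the general idea, not a method described in the article.

```python
# Minimal Metropolis sampler: the next state depends only on the current one.
import math
import random

def log_target(x):
    """Log-density (up to a constant) of a standard normal target."""
    return -0.5 * x * x

def metropolis(n_samples, step=1.0, x0=0.0, seed=42):
    random.seed(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + random.uniform(-step, step)            # symmetric proposal
        log_ratio = log_target(proposal) - log_target(x)
        if random.random() < math.exp(min(0.0, log_ratio)):   # accept or reject
            x = proposal
        samples.append(x)
    return samples

draws = metropolis(10_000)
print(sum(draws) / len(draws))   # should be close to 0, the target mean
```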

General purpose processor

A microprocessor designed for many purposes. It is typified by the x86 processors made by Intel and AMD and used in most desktop, laptop and server computers.

OPs/byte

A technical metric that describes how many computational operations (OPs) are performed per byte of data accessed, and where those bytes originate.
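
As a worked example (assuming double-precision operands streamed from main memory), accumulating the element-wise product of two vectors performs roughly two operations per element (a multiply and an add) while reading 16 bytes, or about 0.125 OPs/byte; the ratio rises substantially when operands can be reused from on-chip caches or registers.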

Random access memory

Computer memory that can be accessed in any order. It typically refers to the computer system's main memory and is implemented with large-capacity, volatile DRAM modules.

Cluster

Multiple computers linked together, typically through a fast local area network, that effectively function as a single computer.

Cluster-based computing

An inexpensive and scalable approach to large-scale computing that lowers costs by networking hundreds to thousands of conventional desktop central processing units together to form a supercomputer.

Computational node

The unit of replication in a computer cluster. Typically it consists of a complete computer comprising one or more processors, dynamic random access memory (DRAM) and one or more hard disks.

Central processing unit

(CPU). A term often used interchangeably with the term 'processor', the CPU is the component in the computer system that executes the instructions in the program.

Virtualization

Refers to software that abstracts the details of the underlying physical computational architecture and allows a virtual machine to be instantiated.

Operating system

Software that manages the different applications that can access a computer's hardware, as well as the ways in which a user can manipulate the hardware.

Health Insurance Portability and Accountability Act

(HIPAA). United States legislation that regulates, among many things, the secure handling of health information.

Distributed file system, distributed query language and distributed database

A file system, query language or database that allows access to files, queries and databases, respectively, from many different hosts that are networked together and that enable sharing via the network. In this way, many different processes (or users) running on many different computers can share data, share database and storage resources and execute queries in a large grid of computers.

Core

A single processing unit within a processor. The term is typically used in the context of multi-core processors, which integrate multiple cores into a single processor.

Graphics processing unit

(GPU). A specialized processor that is designed to accelerate real-time graphics. Previously narrowly tailored for that application, these chips have evolved so that they can now be used for many forms of general purpose computing. GPUs can offer tenfold higher throughput than traditional general purpose processors (GPPs).
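
The sketch below illustrates the fine-grained data parallelism that GPUs exploit: the same arithmetic is applied independently to every element of a large array. It uses NumPy on a general purpose processor purely as an illustration; on a GPU the identical pattern would be spread across thousands of lightweight threads.

```python
# Data-parallel, element-wise arithmetic over a large array (illustrative only).
import numpy as np

intensities = np.random.rand(10_000_000)     # hypothetical per-probe signal values
normalized = np.log2(intensities + 1.0)      # one independent operation per element
print(normalized.mean())
```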

Field-programmable gate array

(FPGA). Digital logic that can be reconfigured for different tasks. It is typically used for prototyping custom digital integrated circuits during the design process. Modern FPGAs include many embedded memory blocks and digital signal-processing units, making them suitable for some general purpose computing tasks.

Floating point operations

(FLOPS). The count of floating point arithmetic operations (an approximation of operations on real numbers) in an application.

Single-molecule, real-time sequencing

(SMRT sequencing). Pacific Biosciences' proprietary sequencing platform in which DNA polymerization is monitored in real time using zero-mode waveguide technology. SMRT sequencing produces much longer reads than do current second-generation technologies (averaging 1,000 bp or more versus 150–400 bp). It also produces kinetic information that can be used to detect base modifications such as methyl-cytosine.

Bucket

The fundamental storage unit provided to Amazon S3 users to store files. Buckets are containers for your files, conceptually similar to a root folder on your personal hard drive, but with the file storage hosted on Amazon S3.
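
As an illustration of working with buckets programmatically rather than through the web console, the sketch below uploads a hypothetical sequencing file to an S3 bucket and lists its contents using the present-day boto3 client; the bucket name, file names and the availability of valid AWS credentials are all assumptions.

```python
# Upload a data file to a hypothetical S3 bucket and list the stored objects.
# Requires the boto3 package and configured AWS credentials.
import boto3

s3 = boto3.client("s3")
s3.upload_file("reads.fastq", "my-genomics-bucket", "runs/2010-06/reads.fastq")

listing = s3.list_objects_v2(Bucket="my-genomics-bucket", Prefix="runs/2010-06/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])
```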

Exabyte

Refers to 10^18 bytes. For context, Cisco estimates that the monthly global internet traffic in the spring of 2010 was 21 exabytes.

About this article

Cite this article

Schadt, E., Linderman, M., Sorenson, J. et al. Computational solutions to large-scale data management and analysis. Nat Rev Genet 11, 647–657 (2010). https://doi.org/10.1038/nrg2857
