Introduction

Substantial progress has been made in human genetics and genomics research over the past 10 years since the publication of the draft sequence of the human genome [1, 2]. The Human Genome Project (HGP) provided the basic raw DNA sequence that spawned a plethora of secondary studies which together greatly improved our knowledge of the architecture and function of the genome, yielding new insights with respect to (i) gene number and density, (ii) non-protein-coding RNA genes (or RNA genes), (iii) pervasive transcription, (iv) high copy number repeat sequences and (v) evolutionary conservation. These developments also have challenged the classical definition of the gene (see below).

In parallel, the design of studies investigating complex diseases and traits has gradually shifted from candidate-gene association and linkage studies to genome-wide association studies (GWASs). The first proper GWAS was published in 2005. It succeeded in identifying a common risk variant with a large effect size in the complement factor H (CFH) gene, which was associated with age-related macular degeneration [3]. By 2007, approximately 100 new GWASs had been published, relating to various complex diseases and traits [4]. There has, however, been some criticism of the inability of GWASs to identify many of the presumed disease-associated variants. Indeed, the validity of the common-disease common-variant (CD/CV) model has recently been challenged by virtue of the perceived 'missing heritability' [5-7]. This notwithstanding, the GWAS approach has dramatically changed the field of human disease genetics, from identifying mostly irreproducible disease associations in the pre-GWAS era to revealing thousands of statistically robust single nucleotide polymorphism (SNP) associations today [8-11]. The focus has also gradually shifted back to Mendelian disorders, with the advent of high-throughput sequence capture and sequencing technologies which have potentiated exome and whole-genome (re)sequencing (WGS) [12-16].

The rapid advances made in genotyping technologies over the past decade, from the arrival of the first 'whole-genome' SNP genotyping array (the Affymetrix GeneChip 10K [Affymetrix; Santa Clara, CA] in 2003) to the current capacity to genotype five million SNPs per array (Illumina Omni5.0 Beadchip [Illumina; San Diego, CA]),[17, 18] have contributed substantially to GWASs. A total of 874 publications and 4,327 SNP associations with p-values < 1.0 × 10^-5 for approximately 500 complex diseases and traits had been included in the catalogue of published GWASs (http://www.genome.gov/gwastudies) as of 13th May 2011.

The genotyping arrays have also contributed significantly to population genetics studies [19-21]. These arrays have been used to identify and characterise copy number variations (CNVs) [22, 23] and regions of homozygosity (ROHs) [24, 25]. Research on CNVs and ROHs has also progressed rapidly since CNVs were first reported to be widespread in the human genome,[26, 27] and ROHs have been found to be common in outbred populations [28]. In recognition of the progress achieved in the context of both GWASs and CNVs, 'human genetic variation' was considered the 'Breakthrough of the Year' in 2007 by Science [4].

Advances have also been made in sequencing technologies, with the advent of the first next-generation sequencer in 2004 (Roche GS 20 System [Roche 454; Branford, CT]) and, later, third-generation sequencing (TGS) technologies such as true single molecule sequencing (Helicos Biosciences; Cambridge, MA) and single molecule real-time sequencing (SMRT) (Pacific Biosciences; Menlo Park, CA) [29-33]. Other promising TGS or single-molecule sequencing technologies, such as nanopore sequencing and sequencing using transmission electron microscopy, are on the horizon [32, 34-37]. These developments have also marked the end of the era of the Sanger dideoxynucleotide or chain termination sequencing method, which has dominated the field since its introduction in 1977 [38].

The arrival of next-generation sequencing (NGS) technologies has also significantly changed the approaches applied in structural and functional genomics studies. Several microarray-based methods have been swiftly supplanted by sequencing-based approaches such as ChIP-Seq, RNA-Seq, Methyl-Seq and CNV-Seq (paired-end mapping [PEM] and depth-of-coverage approaches). Studies using these sequencing approaches have contributed significantly to both fields [39-41]. In addition to a variety of different applications in functional genomic studies, these sequencing technologies have also made it feasible, both technically and in terms of cost, to sequence a whole human genome within weeks, for tens rather than hundreds of thousands of US dollars [42, 43]. Currently, the cost of WGS at a sequencing coverage depth of several tens of fold has been reduced to less than 5,000 US dollars [44]. The number of WGS studies for both normal and cancer genomes has grown rapidly over the past three years [45]. These studies have led to important discoveries in the context of both heritable genetic variation [42, 43, 46] and somatic mutations in cancer genomes [47-49].

Such progress would not have been possible without the reference genome generated by the HGP. High-throughput genotyping and sequencing technologies have also made possible several large-scale international projects, such as the International HapMap Project, the Encyclopedia of DNA Elements (ENCODE) Project, the 1000 Genomes Project, the International Cancer Genome Consortium, the National Institutes of Health (NIH) Roadmap Epigenomics Program and the Human Microbiome Project. These projects have contributed substantially to our understanding and knowledge of human genetics and genomics.

This paper aims to review these major developments in human genetics and genomics over the past decade. Major developments and landmarks in human genetics and genomics are summarised in Table 1.

Table 1 Major developments and landmarks in human genetics and genomics, 1977 to date

The HGP

Rapid progress has been made since the completion of the HGP, with the provision of a 'finished' reference DNA sequence for the human genome [64]. The project was initiated in 1990 and, upon its completion in 2003, it yielded important new insights into the architecture and function of the human genome. The sequencing performed for the HGP relied almost entirely upon the Sanger sequencing method.

The draft sequences of the HGP were imperfect because of the incomplete coverage of the euchromatic regions (euchromatin); approximately 10 per cent of these regions were missing. In reality, the coverage was even less complete when the whole genome was considered (ie when the heterochromatic regions were included). Thus, in all, some 30 per cent of the genome was not initially covered. Furthermore, there were numerous gaps between contigs, which rendered the genome sequence discontinuous [1, 2]. The International Human Genome Sequencing Consortium (IHGSC) subsequently published an improved version of the human genome sequence in 2004 and the HGP was then deemed to be 'complete'. This 'finished' version of the genome had achieved an almost complete coverage of all the euchromatic regions (ie approximately 99 per cent) and also significantly reduced the number of gaps between contigs to 341 from the initial hundreds of thousands [64].

Significant further progress toward the total completion of the human genome sequence continued until 2006; the complete euchromatic sequences of all individual human chromosomes, including the annotation of genes and other features, have now been published (summarised in Table S1 (Table 2)). Since November 2005, the National Center for Biotechnology Information (NCBI) Build 36 assembly of the human genome sequence has been available in public databases. The data comprise a reference assembly of the complete genome sequence plus the Celera WGS (Celera; Alameda, CA) and a number of alternative assemblies of individual haplotypic chromosomes or regions. The full list of assemblies in NCBI 36, as well as the genome sequences, is available through the following genome browsers:

Table S1 Special features of human autosomes 1-22 and the sex chromosomes, including respective lengths, gene number and density

Although both HGP and Celera Genomics had only sequenced the human haploid genome, the availability of the reference DNA sequence initiated a new era in the study of genetic variation and the functional characterisation of the human genome. The two global projects that subsequently ensued were the International HapMap Project and the ENCODE project [63, 65]. The aim of the HapMap initiative was to validate several million SNPs that were identified during and after the completion of the HGP, and then to characterise the extent of their linkage disequilibrium (LD) patterns in populations of European, Asian and African ancestry. The ENCODE project was conceived to identify all the functional and regulatory elements in the human genome.

Architecture and function of the human genome

To coincide with the tenth anniversary of the release of the draft human genome sequences, the key findings from the HGP and their importance for the results of subsequent studies will now be recalled briefly. The findings emanating from the HGP and follow-on studies have had an enormous impact on the understanding of the architecture and function of the human genome.

Gene number and density

Initial annotation data indicated that the human genome encodes at least 20,000-25,000 protein-coding genes, with an indeterminate number of additional 'computationally derived genes' supported by somewhat weaker in silico evidence [2, 64]. Many genes are now known to encode RNAs rather than proteins as their final products [117] but many still remain unannotated [75]. In the latest assembly of the human genome (Genome Reference Consortium, release GRCh37, February 2009), the Genebuild published by Ensembl (database version 56.37a) includes 23,616 protein-coding genes, 6,407 putative RNA genes and 12,346 pseudogenes (http://www.ensembl.org/Homo_sapiens/Info/StatsTable). The HUGO Human Gene Nomenclature Committee (http://www.genenames.org/index.html) has so far approved more than 28,000 human gene symbols, although some of these may yet turn out to correspond to functionally meaningless open reading frames [118]. It is nevertheless encouraging that at least 17,052 human genes have been shown to have orthologous counterparts in the mouse genome, suggesting that they do indeed correspond to real proteins [119]. The definition of what constitutes a gene is still fairly fluid, and hence, depending upon the precise definition adopted, it may be that many additional human 'genes' still remain to be described and annotated.

To appreciate why definition is an issue here, one need only be aware of the many exceptions to genes being contiguous (as well as functionally and spatially distinct) entities, as classically envisaged. Thus, some genes are known to occur within the introns of other genes [120-122]. Some genes can overlap with each other either on the same or on different DNA strands,[123] resulting in the sharing of some of their coding and/or regulatory elements [124, 125]. In addition, the vast majority of human genes are now known to undergo alternative splicing,[84] leading in some cases to quite different proteins being encoded by the same gene. For example, the human cyclin-dependent kinase inhibitor 2A gene (CDKN2A) (MIM# 600160) encodes an alternatively spliced variant (p14ARF) which, through the inclusion of an alternative first exon, acquires an altered reading frame so as to specify a protein product that is structurally unrelated to the other p16 isoforms encoded by this gene.

Gene density varies between the human chromosomes and the gene distribution within chromosomes is also rather uneven. Strikingly gene-poor regions have been identified and are known as 'gene deserts' [126]. These are regions that are devoid of protein-coding genes over distances of several Mb but which may nevertheless contain regulatory sequences (Box 1).

RNA genes or non-protein-coding RNAs

A large proportion of the human transcriptome still remains to be annotated [136]. Although some of the overall transcriptional activity may simply be 'transcriptional noise',[137, 138] at least a portion of it is likely to be associated with functional non-coding RNA genes, many of which are located in regions previously regarded as intergenic and/or non-coding [71]. Non-coding RNA genes are as widespread as they are diverse,[139] are transcribed from both strands of the genome and may well exceed protein-coding genes in terms of their number [140, 141].

Non-coding RNAs of known function include structural RNAs such as transfer RNAs, ribosomal RNAs and small nuclear RNAs, but also putative regulatory RNAs (microRNAs, small interfering RNAs [siRNAs], piwi-interacting RNAs, transcription initiation RNAs [tiRNAs], transcription start site-associated RNAs [TSSa-RNAs], promoter upstream transcripts [PROMPTs], promoter-associated sRNAs [PASRs and PALRs] and longer non-coding RNAs such as XIST), which are involved in sequence-specific transcriptional and post-transcriptional modulation of gene expression [142-148]. Thus, more than 1,000 microRNA genes have already been identified in the human genome, with many more probably awaiting discovery (Box 2). In total, at least 1,500 non-coding RNA genes have already been annotated in the human genome reference sequence, with up to 5,000 more predicted by homology-based methods [117] (see Ensembl, database version 56.37a).

Indeed, large intergenic non-coding RNAs (lincRNAs) have recently been found to represent a novel category of evolutionarily conserved RNAs, with a diverse array of functions ranging from stem cell pluripotency to cellular proliferation;[93, 94] lincRNAs appear to number at least 3,000 in the human genome [155-158]. Some lincRNAs guide chromatin-modifying complexes to specific genomic loci, to regulate gene expression [94]. LincRNAs also play an important role in the derivation of human-induced pluripotent stem cells [156]. Collectively, non-coding RNAs have been intensively studied over the past several years [159, 160].

Pervasive transcription: Transcripts of unknown function and unannotated transcripts

The ENCODE project, designed to analyse 30 Mb of DNA from 44 genomic regions to characterise the functional elements present, has identified complex patterns of regulation and 'pervasive transcription' of the human genome [71]. Although > 90 per cent of the human genome appears to be represented in nuclear primary transcripts, it has become clear that only 35-50 per cent of processed transcripts have so far been annotated as genes, implying that many genes may not yet have been recognised as such [71, 85, 161, 162]. Thus, large numbers of hitherto unannotated transcripts may well yet turn out to be of functional significance [161]. Such transcripts have been collectively classified as transcripts of unknown function (TUFs) and are thought to include (i) antisense transcripts of protein-coding genes, (ii) isoforms of protein-coding genes and (iii) transcripts that either overlap introns of annotated gene transcripts (on the same strand) or which are derived entirely from intergenic regions. Although both the complexity and abundance of TUFs are remarkable, it should be realised that there is often no firm evidence for these transcripts being of functional significance. Indeed, unannotated non-polyadenylated transcripts originating from intergenic regions have been found to represent the bulk of the > 90 per cent of the human genome that now appears to be transcribed [161, 163, 164]. Although the functional significance of pervasive transcription remains unclear, it is much more extensive than had previously been realised [165].

In both humans and mice, up to 70 per cent of genomic loci exhibit evidence of transcription from the antisense strand, as well as the sense strand [166-168]. These naturally occurring antisense transcripts may modulate the level of expression of their associated sense transcripts (or otherwise influence their processing), thereby adding another level of complexity to the regulation of gene expression [169, 170]. Although there is, as yet, no suggestion that the genomic sources of such antisense transcripts should be regarded as genes in their own right, their prevalence clearly renders our task of defining the gene that much more difficult.

High copy number repeat sequences

The HGP revealed that repeat sequences account for at least 50 per cent of the human genome sequence. These repeats may be classified as (i) transposon-derived repeats, (ii) partially retroposed copies of genes (referred to as processed pseudogenes), (iii) simple sequence repeats, (iv) blocks of tandemly repeated sequences at centromeres, telomeres and the short arms of acrocentric chromosomes and (v) segmental duplications (SDs) or low copy number repeats.

Segmental duplications

Both the number and the breadth of the distribution of SDs in the human genome (5 per cent) were surprising. SDs represent extensive inter- and intra-chromosomal duplications of genomic regions that contain genes as well as intergenic sequences [1, 2]. She et al. extended the initial analyses of these low copy number repeats or SDs and initiated the characterisation of the duplicational landscape of the human genome [171]. SDs may be viewed as mutational hotspots, since they are prone to aberrant recombination events occurring between highly homologous paralogous SDs, and give rise to large deletions or duplications of the intervening sequences resulting in human genomic disorders [172]. Indeed, SDs have been shown to represent frequent sites of CNV between individuals, thereby contributing considerably to human genomic diversity [173]. The mechanism that generates CNVs in SDs is known as non-allelic homologous recombination [174]. These interspersed SDs confer susceptibility to recurrent microdeletions and microduplications upon approximately 10 per cent of the human genome through unequal crossing over. Furthermore, data have accumulated showing that specific recurrent rearrangements within these genomic hotspots are associated with both syndromic and non-syndromic diseases. Studies of common complex diseases have shown that these recurrent events play an important role in autism, schizophrenia and epilepsy [175-177].

The above notwithstanding, the duplicated genomic regions have remained largely intractable, owing to difficulties in accurately resolving their structure, copy number and sequence content. New algorithms have been developed to map comprehensively next-generation sequence reads, allowing the prediction of absolute CNVs of duplicated segments and genes. On average, 73-87 genes vary in copy number between any two individuals and these differences overwhelmingly correspond to segmental duplications [178].
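The depth-of-coverage idea behind such copy number predictions can be illustrated with a minimal sketch. It is not the algorithm of the study cited above: it assumes that read depth scales linearly with copy number, that a set of control regions is known to be diploid, and it ignores corrections (for example, for GC content or mappability) that a real method would need. The function name and the example depths are hypothetical.

```python
from statistics import median

def estimate_copy_number(region_depth, control_depths, control_ploidy=2):
    """Estimate absolute copy number from read depth.

    region_depth   -- mean read depth over the segment of interest
    control_depths -- mean read depths over regions assumed to be diploid
    control_ploidy -- copy number assumed for the control regions
    """
    baseline = median(control_depths)
    if baseline <= 0:
        raise ValueError("control regions must have positive read depth")
    # Depth is assumed to be proportional to the number of copies present.
    return control_ploidy * region_depth / baseline

# Hypothetical depths: controls sit at about 30x coverage.
controls = [29, 30, 31, 30, 28]
duplicated = estimate_copy_number(60, controls)  # twice the baseline depth
deleted = estimate_copy_number(15, controls)     # half the baseline depth
print(duplicated, deleted)
```

Using the median of the control depths rather than the mean makes the baseline less sensitive to a few control regions that are themselves copy number variable.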

Pseudogenes

Whether processed or non-processed (duplicational), it has become clear that pseudogenes are almost as abundant as genes ('classical' or otherwise) in the human genome, with ~20 per cent of known pseudogenes being transcribed [179-181]. By means of a comparison of cytochrome P450 genes (CYP) from the mouse and human genomes, Nelson et al. (2004) demonstrated that the complete identification of all human pseudogene sequences is likely to be clinically important and proposed a naming procedure for CYP pseudogenes [182].

It should, however, be appreciated that, although some pseudogenes may well be readily identifiable as lacking protein-coding potential by virtue of the interruption of their open reading frames by premature stop codons or frameshift mutations, others will be less easily recognisable, especially if they are transcribed. The recent identification of short (≤ 300 bp) human pseudogenes generated via the retrotransposition of mRNAs,[183] however, suggests that pseudogenes may be even more common in the human genome than previously appreciated. Intriguingly, some of these pseudogenes are polymorphic, in that they have functional as well as non-functional alleles segregating in the extant human population [184].
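The simplest of these criteria, an open reading frame interrupted by a premature stop codon or by a length that is no longer a multiple of three, can be sketched as follows. This is an illustrative check only, assuming that the candidate coding sequence has already been extracted in the expected reading frame; the function names and toy sequences are hypothetical, and real pseudogene annotation relies on alignment to the parent gene rather than on such a test alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_premature_stop(cds):
    """Return the 0-based codon index of the first in-frame stop codon
    that occurs before the final codon, or None if there is none."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The final codon is skipped because a stop codon is expected there.
    for i in range(n_codons - 1):
        if seq[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None

def length_suggests_frameshift(cds):
    """A coding sequence whose length is not a multiple of three cannot be
    read as whole codons, which is consistent with a frameshifting indel."""
    return len(cds) % 3 != 0

intact = "ATGGCCAAATGA"     # ATG GCC AAA TGA
disrupted = "ATGTAGAAATGA"  # ATG TAG AAA TGA
shifted = "ATGGCAAATGA"     # one base deleted from the intact sequence

print(first_premature_stop(intact))
print(first_premature_stop(disrupted))
print(length_suggests_frameshift(shifted))
```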

With the realisation that pseudogene-derived RNA transcripts may harbour functional elements,[181, 185] the distinction between genes and pseudogenes has become somewhat blurred [186]. Indeed, some 'pseudogenes' appear to have a regulatory role,[187, 188] providing additional examples of the potential functional significance of non-coding RNAs. At present it is unclear what proportion of the pseudogenes identified to date have either retained or acquired a function via their non-coding RNAs.

Transposable elements

Transposable elements, including Long INterspersed Elements (LINE-1), Alu and SINE-VNTR-Alu (SVA) elements (SVA is an unusual composite element derived from three other repeats: Short INterspersed Elements [SINE]-R, variable number tandem repeats [VNTR] and Alu), make up ~40 per cent of the human genome [189] and constitute a major source of inter-individual structural variability [190]. Some of these transposable elements have contributed gene-coding sequences to the human genome via 'exonisation' [191]. Other transposable elements have contributed functional non-coding sequence -- for example, as regulatory elements,[192, 193] microRNAs [194] or naturally occurring antisense transcripts [195]. Many more are likely to have functional significance, as suggested by their evolutionary conservation [196, 197].

Evolutionary conservation

Extensive evolutionary conservation of non-coding DNA sequences is evident in the human genome because only ~40 per cent of the evolutionarily constrained sequence occurs within protein-coding exons or their associated untranslated regions [71]. Studies of evolutionarily conserved non-coding sequences [198-201] have suggested that 5-20 per cent of the genome may be of functional importance, rather than just the ~2 per cent associated with the protein-coding portion [202, 203]. Some non-coding regions (the genomic 'dark matter') contain 'ultra-conserved elements' which not only exhibit enhancer function, but are also transcribed and often appear to have been subject to selection to the same extent as protein-coding regions [204-206]. Some non-coding regions contain CpG islands, which, although located far from the transcriptional initiation sites of genes, may nevertheless have some regulatory significance [207]. It should be appreciated, however, that the absence of evolutionary conservation does not necessarily denote lack of function. Indeed, human-specific functional elements have been shown to be present within rapidly evolving non-coding sequences [208, 209].

Towards a new definition of the gene

It is clear from the above that precisely what constitutes a gene has become somewhat contentious. The unanticipated scale of the extent of transcription in the genome, coupled with the widespread occurrence of overlapping genes and shared functional elements, hampers attempts to demarcate precisely and unambiguously where one gene ends and another one begins. As a consequence, the notion of the gene has become diffuse [161, 210]. Indeed, as Kapranov et al.[211] opined, 'it is not unusual that a single base-pair can be part of an intricate network of multiple isoforms of overlapping sense and antisense transcripts, the majority of which are unannotated'. Gene regulatory elements that are often distant from the genes they regulate,[212] the existence of trans- as well as cis-regulatory elements [213] and the formation of non-co-linear transcripts through trans-splicing,[214] taken together with the abundance of non-coding RNA genes [215] and evolutionarily conserved non-coding regions,[199, 201] have combined to challenge the classical notion of the gene.

On the basis of the findings of the ENCODE project, Gerstein et al.[210] proposed an updated definition of the gene as 'a union of genomic sequences encoding a coherent set of potentially overlapping functional products'. An alternative definition of the gene as: 'A discrete genomic region whose transcription is regulated by one or more promoters and distal regulatory elements and which contains the information for the synthesis of functional proteins or non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate products (proteins or RNAs)' has been proposed by Pesole [216]. Irrespective of its precise definition, it is clear that the concept of the gene is inadequate to the task of building a lexicon of those functional genomic sequences that could harbour mutations causing human inherited disease. It is likely in the context of mutation detection, that we shall eventually have to consider the universe of functional genetic elements in the human genome as our hunting ground, rather than simply genes per se.

Development of the GWAS approach to complex diseases and traits

In this section, developments in cataloguing genetic variation (SNP and CNV), initiation and completion of the International HapMap Project, and advances in genotyping technologies are discussed. These developments are important prerequisites for the use of GWASs in the investigation of complex diseases and traits.

SNP discovery after the HGP

While the HGP was being completed, genetic variants, in particular SNPs, were also being discovered. By 2001, the International SNP Map Working Group had identified 1.42 million SNPs in the human genome [58]. Currently, more than 17 million SNPs in the human genome have been catalogued in the SNP Database (dbSNP; http://www.ncbi.nlm.nih.gov/projects/SNP/). It is, however, likely that at least some of the entries in the database are errors or artefacts rather than 'genuine' variants. A false-positive rate for the dbSNP of 15-17 per cent has been estimated [101]. Therefore, large-scale validation in population-based studies is necessary. The HapMap Project was conceived in 2003 with the aim of validating several million SNPs in order to obtain SNP and genotype frequency information, as well as to study their LD patterns in different populations.

SNPs are the most abundant type of genetic variation in the human genome. They occur at intervals of approximately one SNP to every kb of DNA sequence throughout the genome when the DNA sequences of any two unrelated individuals are compared. This is approximately equivalent to three million SNPs being carried by each individual genome. Therefore, the DNA sequences of any two unrelated genomes are estimated to be about 99.9 per cent identical; the 0.1 per cent comprises mainly SNPs, and these are believed to be responsible for many of the phenotypic differences noted among individuals in populations -- for example, disease susceptibility, drug responses and physical traits such as height [217].
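The arithmetic behind these figures can be checked with a short sketch. The SNP spacing of one per kb is taken from the paragraph above; the round haploid genome size of three billion bp is an assumption introduced here for illustration, as are the function names.

```python
# Back-of-the-envelope check of the SNP density figures quoted above.
GENOME_SIZE_BP = 3_000_000_000  # assumed round haploid genome size
SNP_SPACING_BP = 1_000          # one SNP per kb between two unrelated genomes

def expected_snp_differences(genome_size_bp, spacing_bp):
    """Number of SNP differences expected between two unrelated genomes."""
    return genome_size_bp // spacing_bp

def percent_identity(genome_size_bp, n_differences):
    """Percentage of positions at which the two sequences agree."""
    return 100.0 * (1 - n_differences / genome_size_bp)

snps = expected_snp_differences(GENOME_SIZE_BP, SNP_SPACING_BP)
identity = percent_identity(GENOME_SIZE_BP, snps)
print(f"{snps:,} SNP differences, {identity:.1f} per cent identity")
```

Under these assumptions the sketch gives three million SNP differences and 99.9 per cent identity, matching the estimates in the text; the identity figure counts single-base differences only.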

The discovery of thousands of CNVs that collectively encompass hundreds of Mb of the genome [22, 23, 105] and the several hundred thousand short indels identified by WGS studies,[42, 43] however, have cast doubt upon the initial estimate of '99.9 per cent similarity' between any two genomes. Indeed, the DNA sequences of individuals within and between populations are genetically rather more diverse and varied than previously thought. This has been corroborated by a recent study demonstrating that the Craig Venter genome differs from the consensus reference sequence by approximately 1.2 per cent when indels and CNVs are considered, a further 0.1 per cent when SNPs are considered and ~0.3 per cent when inversions are considered -- a grand total of ~1.6 per cent [218].

Linkage disequilibrium and the International HapMap Project

Most SNPs are predicted to be neutral, without any functional effects. Owing to their abundance in the human genome, they may serve as useful genetic markers in GWASs, by comparison with other genetic variations, such as microsatellites, which in any case exhibit a mutation rate that is too high to be useful in this context. Early reports documented LD patterns between SNPs in parts of the human genome;[61, 62, 219] however, no large-scale effort had been undertaken to study the LD patterns in the whole genome until the initiation of the International HapMap Project. A total of more than three million SNPs were genotyped and validated in Phase I and Phase II of the project in four populations [66, 69]. These populations were the US Utah population of Northern and Western European ancestry (CEU), Han Chinese from Beijing (CHB), Japanese from Tokyo (JPT) and the Yoruba from Ibadan, Nigeria (YRI).

One novel finding has been that 10-30 per cent of pairs of individuals within a population share at least one region of extended genetic identity arising from recent common ancestry. An additional discovery was that up to 1 per cent of all common variants are not tagged by SNPs, primarily because they are located within recombination hotspots [69]. Importantly, increased population differentiation with respect to non-synonymous SNPs was noted, by comparison with synonymous SNPs. These observations have also indicated systematic differences in the strength or efficacy of natural selection between populations from different geographical areas involving genes linked to the Lassa virus in West Africa, skin pigmentation in Europe and hair follicle development in Asia [70].

The discovery of millions of SNPs has created a significant challenge in genotyping. It is neither technically feasible nor cost-effective to genotype all the SNPs in a GWAS, even with the latest genotyping technologies; however, the existence of LD significantly reduces the number of SNPs that need to be genotyped. The indirect association approach of GWASs is dependent on surrogate markers ('tag' SNPs) to locate disease variants through LD. As shown by the HapMap Project [69] and other published work,[220-222] approximately half a million SNPs are adequate to capture most of the SNPs that have been genotyped in the HapMap Phase I and II projects. However, the genome coverage of commercial genotyping arrays is population dependent (Box 3).
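How LD allows a small set of tag SNPs to stand in for many others can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1 alleles, measures pairwise LD with the r^2 statistic, and picks tags greedily; the threshold of 0.8, the greedy strategy, the function names and the toy data are all assumptions made for illustration, not the procedure used by the HapMap Project or by any array manufacturer.

```python
from itertools import combinations

def r_squared(a, b):
    """r^2 between two biallelic SNPs, each given as a list of 0/1 alleles
    with one entry per phased haplotype (both lists in the same order)."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b  # classical disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom

def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tag SNPs greedily: repeatedly choose the untagged SNP that
    captures the most untagged SNPs at r^2 >= threshold.

    haplotypes -- dict mapping SNP id to its list of 0/1 alleles
    Returns a dict mapping each tag SNP to the set of SNPs it captures.
    """
    ids = list(haplotypes)
    partners = {s: {s} for s in ids}
    for s, t in combinations(ids, 2):
        if r_squared(haplotypes[s], haplotypes[t]) >= threshold:
            partners[s].add(t)
            partners[t].add(s)
    untagged = set(ids)
    tags = {}
    while untagged:
        # sorted() makes the choice deterministic when there are ties.
        best = max(sorted(untagged), key=lambda s: len(partners[s] & untagged))
        captured = partners[best] & untagged
        tags[best] = captured
        untagged -= captured
    return tags

# Hypothetical data: four SNPs typed on eight phased haplotypes.
haps = {
    "snp1": [0, 0, 1, 1, 0, 1, 0, 1],
    "snp2": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to snp1
    "snp3": [1, 1, 0, 0, 1, 0, 1, 0],  # alleles flipped relative to snp1
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],  # only weakly correlated with snp1
}
tags = greedy_tag_snps(haps)
print(tags)
```

In this toy example, genotyping snp1 and snp4 alone would capture all four SNPs, which is the sense in which LD reduces the number of markers that need to be typed.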

The HapMap project has created a useful and valuable resource for GWASs. In parallel, the public availability of the HapMap resource has driven the rapid development of genotyping arrays, in which the data are used to guide the selection of tag SNPs. Once the HapMap Phase I and II projects were completed, a number of genotyping arrays were designed and introduced onto the market [223, 224]. The newer arrays (eg the Illumina Human 1M Beadchip and Affymetrix SNP Array 6.0) have significantly improved genome coverage and are also designed for CNV detection [225]. The HapMap Phase I and II projects led to the development of higher resolution genotyping arrays, which in turn were used in the HapMap Phase III project to investigate genetic variations (both SNPs and CNVs) in additional populations of diverse ancestry [21].

The Phase III project, building on the success of the HapMap Phase I and II projects, included an additional seven populations and has recently been completed [21]. These additional populations involved people of African ancestry in the south-western USA (ASW), the Chinese community in Metropolitan Denver, CO (CHD), Gujarati Indians in Houston, TX (GIH), the Luhya in Webuye, Kenya (LWK), people of Mexican ancestry in Los Angeles, CA (MEX), the Maasai in Kinyawa, Kenya (MKK) and Tuscans in Italy (TSI). The ethos behind the HapMap Phase III project was that, in order to obtain a more complete understanding of human genetic variation, populations with a wider geographical/ancestral range needed to be studied. In total, the HapMap Phase III project genotyped approximately 1.6 million SNPs (using both the Illumina Human 1M Beadchip and Affymetrix SNP Array 6.0) in 1,184 individuals from 11 populations (four original and seven additional populations). The population-specific differences among low-frequency variants were characterised in addition to SNPs and common CNVs or copy number polymorphisms (CNPs). More importantly, it also demonstrated the feasibility of imputing newly discovered CNPs and SNPs, which are important for future GWASs and meta-analyses [21].

Whole-genome SNP genotyping technologies

The paradigm shift from candidate-gene association and family linkage studies to GWASs has been attributed to several important developments, most notably the rapid advances in high-throughput SNP genotyping technologies, which have enabled researchers to interrogate up to one million SNPs simultaneously in a microarray [18]. GWASs employ an 'agnostic' approach in the search for unknown disease variants, and hence the ability to interrogate a large number of SNPs covering the entire human genome is a prerequisite for this study design. In parallel with the decreasing cost of genotyping, it has recently become technically feasible to genotype thousands of samples in GWASs. As a result, more than 800 GWASs have been published since 2005 (http://www.genome.gov/gwastudies/), of which almost all have used the commercially available whole-genome SNP genotyping arrays from Illumina or Affymetrix.

A series of whole-genome genotyping arrays have been introduced since 2005, such as the Affymetrix Human Mapping 100K and 500K sets, and the Illumina HumanHap300 and HumanHap550 BeadChips [223, 224]. These genotyping arrays provide different degrees of genome coverage in different populations; lower coverage was achieved in African populations because of the greater genetic diversity in these populations. For example, the Illumina HumanHap550 BeadChip, which contains approximately 550,000 tag SNPs selected from the HapMap Phase I and II projects, achieved genome coverage of 87 per cent and 83 per cent in CEU and CHB + JPT populations, respectively, but only 50 per cent in YRI [220-222]. Whole-genome genotyping arrays such as the Illumina Human 1M BeadChip and Affymetrix SNP Array 6.0 offer almost complete genome coverage (> 90 per cent) for HapMap CEU and CHB + JPT populations (Box 3).

The more recent genotyping arrays, such as the Illumina Human 1M BeadChip and Affymetrix SNP Array 6.0, have enabled genotyping of up to one million SNPs and increased the sensitivity to detect CNVs because of higher marker density and more uniform marker distribution [225]. For example, the Affymetrix SNP Array 6.0 contains more than 1.8 million markers, half of which are SNPs, the remainder being non-polymorphic or copy number probes to enhance the power of detection of CNVs. Copy-number probes were deliberately selected so as to cover regions lacking SNPs or regions where SNPs are difficult to assay, such as repetitive sequences within segmental duplications [226]. In addition, markers were also chosen to target known copy number variable regions as reported in the Database of Genomic Variants (http://projects.tcag.ca/variation/). Employing such a design, these genotyping arrays have enabled researchers to discover novel CNVs, as well as to validate previously known CNVs. These more recent arrays were designed for the application of GWASs and CNV detection.

The first wave of GWASs utilised first-generation SNP genotyping arrays and focused mainly on common SNPs with MAF > 5 per cent [132]. Thus, expanding the coverage to include less common or rarer SNPs (MAF 1-5 per cent) is essential for new discoveries to be made in future GWASs. This step is now technically feasible and practically achievable with the arrival of second-generation SNP genotyping arrays (Illumina HumanOmni2.5 and Omni5.0) in 2010; these are capable of genotyping 2.5 to 5.0 million SNPs (Illumina Whole-Genome Genotyping Product Roadmap; http://www.illumina.com/applications/gwas.ilmn). These arrays were designed to increase the coverage of SNPs down to a MAF of 1 per cent. In contrast to the first-generation arrays, the SNP selection in these latest genotyping arrays leverages the data from the 1000 Genomes Project [102]. However, the promise of second-generation genotyping arrays for new discoveries in GWASs is conditional upon the adequacy of the statistical power of the studies to identify the associations of rarer SNPs with complex traits. This suggests that larger sample sizes will be needed in future GWASs.
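The dependence of sample size on allele frequency can be illustrated with a rough calculation. The sketch below is not taken from any of the cited studies; it assumes a simple two-sided allelic case-control test with equal numbers of cases and controls, the conventional genome-wide significance threshold of 5 × 10^-8, 80 per cent power and a per-allele odds ratio of 1.2. All function names and default values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def cases_needed(control_freq: float, odds_ratio: float,
                 alpha: float = 5e-8, power: float = 0.8) -> int:
    """Approximate number of cases (with as many controls) for an allelic test.

    Uses the normal approximation for comparing two proportions; alpha and
    power are assumed values, not figures from the text.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * math.sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    # Each individual contributes two alleles.
    return math.ceil(alleles_per_group / 2.0)


if __name__ == "__main__":
    for maf in (0.30, 0.05, 0.01):
        print(f"MAF {maf:.2f}: ~{cases_needed(maf, 1.2)} cases and as many controls")
```

Under these assumptions the required number of cases rises steeply as the MAF falls from 30 per cent to 1 per cent, which is the sense in which rarer SNPs demand larger studies.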

The era of GWASs

More than 4,000 SNPs have been reported to be associated with various human complex diseases and traits with varying degrees of replication and success (http://www.genome.gov/gwastudies/).

Despite some notable successes in revealing numerous novel SNPs and loci associated with complex phenotypes, the results from GWASs have been disappointing, in that all the GWAS-SNPs collectively account for only a small proportion of the heritability of complex phenotypes. This is due mainly to the small effect sizes of most GWAS-SNPs (odds ratio < 1.5) [5, 10, 89]. The small effect sizes of the GWAS-SNPs have also limited their applications in disease risk prediction [227].

Although several diseases have been investigated by GWASs and meta-analyses with sample sizes claimed to be sufficiently large, most of their heritability still remains unaccounted for. This missing heritability has stimulated much discussion on future strategies for detecting the remaining genetic variants associated with complex phenotypes. The proposed strategies range from increasing the sample sizes by combining several GWASs through meta-analysis in order to attain a higher statistical power, to more complicated experiments such as epigenetic studies [5, 228]. The methodologies for meta-analysis and for the merging of SNP genotype data from multiple GWASs employing different genotyping arrays are now well developed and rely upon newly developed genotype imputation methods [229-231]. By contrast, there are still many experimental and analytical uncertainties and challenges to be faced in the context of epigenetic studies of complex phenotypes [232, 233]. Other approaches are summarised in Figure 1.

Figure 1. Summary of the approaches to identifying disease-associated variants.

Figure 1 summarises a variety of approaches to the further identification of disease-associated variants: (1) GWASs of various complex diseases and traits ideally should be performed in different populations representing European, Asian and African ancestries, as most published studies have focused primarily on populations of European ancestry [234, 235]. (2) Most GWASs have done fast-track replication by selecting the top few or top tens of SNPs with the most significant p-values in stage 1 and then proceeded to replicate them in stage 2 or stage 3 with larger sample sizes. Therefore, the next step should be to conduct a second tier of replication, where more SNPs from stage 1 are tested to assess their associations [236]. (3) The role of CNVs is increasingly recognised as being associated with complex diseases and traits; thus, it is important to investigate their associations with these complex phenotypes [111]. (4) Resequencing of the GWAS loci will be needed to uncover additional rarer variants. The success of this approach has been demonstrated in the discoveries of multiple rare variants for type 1 diabetes and hypertriglyceridaemia [237, 238]. (5) Integrating GWAS results with other sources of genomic data, such as expression quantitative trait loci (eQTL) and ChIP-Seq, has led to the discovery of novel SNP associations [239, 240]. (6) Subgroup analysis of disease phenotypes is a powerful approach to identifying genetic variants that are specific to certain subtypes. For instance, differences in SNP associations for oestrogen receptor-positive and -negative breast cancer have been shown [241]. (7) Pathway-based approaches have been developed using prior biological knowledge of gene function to facilitate more powerful analysis of GWAS datasets [242]. (8) Most studies have not taken epistasis and gene-environment interactions into account, which could account for a proportion of the missing heritability of complex phenotypes; however, challenges associated with studying these interactions should also be noted [243, 244].

Genetic architecture of complex diseases

The genetic architecture of complex diseases has been the subject of intense debate over the past decade [59, 60] and has been polarised by the emergence of two opposing models: the CD/CV hypothesis and the multiple rare variant or common-disease rare-variant (CD/RV) hypothesis [245]. The CD/CV model formed the basis of the HapMap Project and largely influenced the development of commercial genotyping arrays with respect to SNP selection. Therefore, the published GWAS using the HapMap data mainly involved the interrogation of the association of common SNPs (MAF > 5 per cent) with complex diseases and traits.

One of the reasons that the CD/CV model became favoured was because of the sequencing technologies available at that time. Sanger sequencing did not allow the survey of rare variants in the whole genome. By contrast, the convenient high-throughput genotyping platforms have enabled efficient interrogation of up to one million SNPs throughout the genome, which eventually indirectly leads to the capture of almost all the SNPs in the HapMap Project. Furthermore, it is more affordable to genotype (rather than to sequence) the entire genomes of several thousand cases and controls as part of an adequately powered association study.

Currently, the results from the GWASs focus on common SNPs and explain only a small fraction of the heritability of complex phenotypes [5]. The missing heritability has challenged the validity of the CD/CV hypothesis, and has also diverted research endeavours toward rare variants;[109, 237, 238, 246, 247] however, published data have revealed the contributions of both common and rare variants to complex phenotypes. The results from GWASs have strongly supported the involvement of common variants, especially common SNPs, in complex phenotypes [132]. Moreover, recent studies have shown that common SNPs can explain a greater proportion of the heritability than has been accounted for by recent GWASs. These SNPs, however, are often 'hidden' within the GWAS data, and will require larger sample sizes to be uncovered [248, 249].

The data supporting the roles of rare variants have also been accumulating from an increasing number of studies of less-common SNPs [109, 237, 238, 246] and rare CNVs [250-253]. This suggests that the genetic architecture of complex phenotypes is likely to comprise both common and rare variants. The relative proportions of these variants remain to be determined and will remain unclear until all the genetic variants for most complex phenotypes are found; furthermore, the relative proportions are likely to vary between different complex phenotypes, with genetic susceptibility to some phenotypes being influenced more by common variants, whereas other phenotypes may be more affected by rare variants. Being able to predict the genetic architecture of complex phenotypes is critical, however, as it will determine the future strategies to be adopted in seeking disease variants.

Homozygosity mapping

Homozygosity mapping has been shown to be useful in the identification of disease susceptibility genes in complex diseases [254, 255]. A run of homozygosity (ROH) is an uninterrupted stretch of DNA sequence lacking heterozygosity in the diploid state (ie in the presence of both copies of the homologous DNA segment). Thus, all the genetic variants within the homologous DNA segments are represented by two identical alleles that contribute to the homozygosity [28]. Currently, there are no standardised criteria to define an ROH. Previous studies focused on regions ≥ 1 Mb; hence, the true extent of homozygosity in the human genome could have been underestimated because shorter regions were not considered [28, 256, 257]. More recent studies have defined ROHs as having a minimum length of 500 kb,[258] the intention being to avoid underestimation of the number of such regions in the human genome.
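The effect of the length threshold can be seen in a deliberately simplified scan for ROHs. The sketch below is an illustration only and is not the method of any cited study: it assumes SNP calls for a single chromosome, sorted by position and coded as the number of alternative alleles (0, 1 or 2, so that 1 is heterozygous), and it tolerates no heterozygous or missing calls within a run, whereas practical tools usually allow a small number of both. The function name and the `min_snps` parameter are hypothetical.

```python
from typing import List, Tuple

Call = Tuple[int, int]  # (position in bp, genotype coded 0/1/2)


def find_rohs(calls: List[Call],
              min_length_bp: int = 500_000,
              min_snps: int = 2) -> List[Tuple[int, int, int]]:
    """Return (start_bp, end_bp, n_snps) for each run free of heterozygous calls."""
    runs = []
    start = None  # index of the first homozygous SNP in the current run
    for i, (_, genotype) in enumerate(calls):
        if genotype != 1:          # homozygous call extends or opens a run
            if start is None:
                start = i
        elif start is not None:    # heterozygous call closes the current run
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(calls) - 1))

    rohs = []
    for first, last in runs:
        length = calls[last][0] - calls[first][0]
        n_snps = last - first + 1
        if length >= min_length_bp and n_snps >= min_snps:
            rohs.append((calls[first][0], calls[last][0], n_snps))
    return rohs


if __name__ == "__main__":
    toy_calls = [(100_000, 0), (300_000, 2), (700_000, 0),
                 (800_000, 1), (900_000, 0), (1_000_000, 2)]
    print(find_rohs(toy_calls))                           # 500 kb threshold
    print(find_rohs(toy_calls, min_length_bp=1_000_000))  # 1 Mb threshold
```

In the toy data, the 600 kb run is reported under the 500 kb threshold but is missed under the 1 Mb threshold, mirroring the underestimation described above.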

Although long continuous ROHs were first documented a decade ago, until recently no large-scale population-based studies had been performed to assess the extent of ROHs in the human genome [259]. The recent advances in the genome-wide detection and characterisation of ROHs have been driven mainly by the availability of highly accurate SNP databases such as the HapMap project [28] and advanced genotyping technologies [24, 25]. Genotyping a large number of SNPs on a microarray platform presents a powerful tool for detecting ROHs comprehensively across the whole genome, thereby enabling investigation of the number, length, location and distribution of the ROHs in the human genome in a more unbiased manner, as compared with microsatellite markers. It was not previously expected that the genomes of outbred populations would contain ROHs of several Mb in length until the early reports appeared in 2006/2007 [28, 256, 257].

Many novel causal genes or mutations underlying autosomal recessive disorders have been identified through homozygosity mapping. This approach is particularly useful for investigating these disorders in populations with a high prevalence of consanguinity, as is evident from the many recent studies that have identified causal mutations [260-265].

The effects of consanguinity and recessive variants or heterozygosity levels on the risk of complex phenotypes (diseases and quantitative traits) are well established [266-268]. Higher levels of relative heterozygosity have been shown to be associated with lower blood pressure and total and low-density lipoprotein (LDL) cholesterol by measuring genome-wide heterozygosity [268]. In addition to quantitative traits, inbreeding has also been found to be a significant positive predictor for a number of late-onset complex diseases, such as coronary heart disease, stroke, cancer and asthma [266]. These studies have strongly supported the hypothesis that the genetics of complex phenotypes include a component which corresponds to recessively acting variants. The importance of ROHs to complex phenotypes remains largely unexplored; however, several studies have shown significant differences in ROHs between cases and controls in genome-wide investigations for schizophrenia [269] and late-onset Alzheimer's disease [270]. Success was also achieved for complex quantitative traits such as height, where strong statistical evidence for an association of a particular ROH with height was obtained in a total sample size of > 10,000. Individuals with this ROH were significantly taller (by 3.5 cm) than individuals lacking the region [258]. Cataloguing ROHs in human genomes and investigating their associations with complex phenotypes by building on existing GWAS data should be fruitful areas for future research.

Beyond SNPs: CNVs

A new era of CNV discovery began when two separate studies, published concurrently in 2004, identified several hundred deletions and duplications in the human genome [26, 27]. Such genetic abnormalities had actually been documented decades before, however, in clinical cytogenetics studies that found them to be a cause of various genomic or cytogenetic disorders [271]. The distinguishing feature of the recent studies was that these CNVs were found to be much more prevalent in the human genome than previously expected. These changes in copy number did not result in any clinical disorder or pathological phenotype and were found in the genomes of phenotypically normal individuals. As these submicroscopic (< 5 Mb) deletions and duplications were below the detection limit of traditional cytogenetics tools such as fluorescence in situ hybridisation (FISH), these recent discoveries were credited to the use of whole-genome microarray technologies [272].

Although these early whole-genome microarray studies discovered several hundred new CNVs, it was clear from the outset that this would be a gross underestimate of the true total. These studies used 'low-resolution' microarrays such as representational oligonucleotide microarray analysis (ROMA) containing 85,000 probes with a resolution of approximately one probe per 35 kb [26] or the bacterial artificial chromosome-comparative genomic hybridisation (BAC-CGH) array with a resolution of approximately one probe per 1 Mb [27]. Further, these studies investigated a small sample size, which limited the efficiency of detection of less common CNVs. CNVs smaller than 50-100 kb would not have been detected because their size was below the resolution limit for these microarrays. Thus, both the sample size and the resolution of the microarray are critical factors that contribute to the discovery of less common and/or smaller CNVs.
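The relationship between probe number and resolution can be checked with simple arithmetic. The sketch below assumes a haploid genome size of roughly 3 Gb and evenly spaced probes, and it assumes that a CNV call needs support from two or three consecutive probes; these are illustrative assumptions rather than parameters reported by the cited studies.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed approximate haploid genome size


def mean_probe_spacing_kb(n_probes: int, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Average distance between probes, in kb, if probes were evenly spaced."""
    return genome_size_bp / n_probes / 1_000


def approx_min_cnv_kb(spacing_kb: float, consecutive_probes: int) -> float:
    """Rough size a CNV must reach to cover the required number of probes."""
    return spacing_kb * consecutive_probes


if __name__ == "__main__":
    spacing = mean_probe_spacing_kb(85_000)  # ROMA probe count from the text
    print(f"mean spacing: {spacing:.1f} kb")
    for k in (2, 3):  # assumed numbers of supporting probes
        print(f"{k} probes: ~{approx_min_cnv_kb(spacing, k):.0f} kb")
```

With 85,000 probes the mean spacing comes out close to the one probe per 35 kb quoted above, and requiring a few consecutive probes places the practical detection limit in the range of tens to about one hundred kb.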

The contribution of CNVs as a major source of genetic variation in human populations has become appreciated despite the limitations of the microarrays. The first comprehensive mapping of CNVs in 270 samples from the HapMap Phase I project identified a total of 1,447 copy number variable regions, covering 360 Mb. These regions contained hundreds of genes, disease loci, functional elements and segmental duplications [22]. The limitations of ROMA and the BAC-CGH arrays have been overcome in later studies by the use of higher-resolution microarrays and larger sample sizes comprising several hundred samples [23, 105, 273-276]. High-resolution tiling oligonucleotide microarrays, comprising 42 million probes, were used to generate a comprehensive map of 11,700 CNVs [105]. Yim et al.[275] screened CNVs in 3,578 healthy, unrelated Korean individuals, using the Affymetrix SNP Array 5.0.

Other types of chromosomal rearrangement, particularly inversions and balanced translocations, have received considerably less attention [277-279]. Inversions and translocations are also known as 'copy-neutral variations' or 'balanced chromosomal rearrangements', since they do not involve changes in copy number. These copy-neutral variations have also been found to be associated with disease [279]. Collectively, these copy number and copy-neutral variations are broadly classified as 'structural variations'. As discussed, the genome-wide mapping and detection of CNVs in different populations has advanced considerably since 2004, being driven mainly by microarray technologies such as oligonucleotide-CGH and SNP microarrays. By contrast, the pace in identifying inversions and translocations in the human genome has been slower because more powerful and effective methods were not available until the advent of NGS technologies [76] (Boxes 4 and 5).

The discovery of a 20 kb deletion located immediately upstream of the immunity-related GTPase family M gene (IRGM) underlying Crohn's disease, and the identification of a 45 kb deletion that is in perfect LD with body mass index-associated SNPs near the neuronal growth regulator 1 gene (NEGR1),[287, 288] together with other studies reporting evidence for LD of CNVs with GWAS-SNPs at r2 > 0.5, suggest possible associations of CNVs with a variety of different human complex diseases and traits [105]. The genome-wide study performed by the Wellcome Trust Case Control Consortium (WTCCC) investigating the association between ~3,400 common CNVs and eight complex diseases in 19,000 samples did not yield any novel discoveries;[111] however, rare CNVs associated with various complex phenotypes have been identified in studies of schizophrenia,[250, 289, 290] epilepsy [251] and severe early-onset obesity [252, 253]. The studies on schizophrenia found that rare structural variations that disrupt multiple genes in neurodevelopmental pathways are over-represented in cases, as compared with controls [250, 289].
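The LD statistic r2 referred to here can be computed for a biallelic CNV and a SNP in the same way as for two SNPs. The sketch below is a minimal illustration, assuming phased haplotypes in which each locus is coded 0 or 1 (for example, 1 for the deletion allele and 1 for the SNP minor allele); the function name and input format are hypothetical.

```python
from typing import List, Tuple


def ld_r2(haplotypes: List[Tuple[int, int]]) -> float:
    """Squared correlation (r2) between two biallelic loci from phased haplotypes.

    Each haplotype is (cnv_allele, snp_allele), with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n                 # CNV allele frequency
    p_b = sum(b for _, b in haplotypes) / n                 # SNP allele frequency
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denominator = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denominator == 0.0:
        raise ValueError("r2 is undefined when a locus is monomorphic")
    d = p_ab - p_a * p_b                                    # LD coefficient D
    return d * d / denominator


if __name__ == "__main__":
    perfect = [(1, 1)] * 3 + [(0, 0)] * 7
    unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
    print(ld_r2(perfect), ld_r2(unlinked))
```

A deletion in perfect LD with a SNP, as described for the NEGR1 region, corresponds to an r2 of 1 in this calculation, whereas loci whose alleles occur independently give an r2 of 0.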

High-throughput sequencing technologies and their impact on genomic studies

The advent of high-throughput sequencing technologies has initiated the 'personal genome sequencing' era for both normal and cancer genomes, and large-scale genome sequencing studies such as the 1000 Genomes Project and the International Cancer Genome Consortium. The high-throughput sequencing technologies also provide new opportunities to study Mendelian disorders through exome sequencing and WGS. Several international projects have also been launched to explore functional genomics.

High-throughput sequencing technologies

NGS technologies have only been on the market since 2004, but have now largely replaced Sanger sequencing technologies owing to their ultra-high-throughput production capacity, which is a thousand times greater than that of traditional sequencing. One of the major differences is the ability of next-generation sequencers to simultaneously sequence millions of DNA fragments; hence, they are also referred to as massively parallel sequencing technologies. This feature has considerably increased the number of nucleotides that can be sequenced per instrument run when compared with Sanger sequencing. The sequencing chemistry of NGS technologies, together with their ultra-high-throughput production capacity, has also reduced sequencing costs significantly, making large-scale or WGS studies much more affordable [29-31]. The sequencing technologies currently available can be broadly grouped into NGS technologies such as the Roche 454 Genome Sequencer FLX (GS FLX) System, Illumina Genome Analyzer (GA) and HiSeq and Life Technologies Supported Oligonucleotide Ligation Detection System (SOLiD), and TGS (or single-molecule sequencing) technologies such as the HeliScope Single Molecule Sequencer (Helicos Biosciences) [32].

One of the more laborious steps in WGS using the Sanger method was the in vivo amplification step using bacterial cloning. This has now been substituted by the in vitro amplification of millions of DNA fragments by NGS technologies using emulsion polymerase chain reaction (PCR) (Roche GS FLX and Life Technologies SOLiD) or bridge amplification on a solid surface (Illumina GA and HiSeq). The sequencing approach for NGS technologies broadly can be divided into: (1) sequencing-by-synthesis mediated by DNA polymerase (ie pyrosequencing for Roche GS FLX and sequencing by reversible terminator chemistry for the Illumina sequencing platform); and (2) sequencing-by-synthesis mediated by DNA ligase for Life Technologies SOLiD [29-31].

Whole-genome resequencing can now be accomplished relatively rapidly because of the availability of the HGP template for alignment of the billions of short sequence reads produced by next-generation sequencers. This is necessary because the NGS technologies are characterised by short sequence read lengths of approximately 50-125 bp for both Illumina and Life Technologies sequencing platforms [29-31]. This feature makes de novo sequencing, or the assembly of billions of short sequence reads into large contigs, challenging, especially for large and complex genomes like the human genome [291]. A longer read length is key to obtaining larger contigs with fewer gaps between them during the assembly steps. Although the latest improvements in sequencing chemistry and systems allow the Roche GS FLX to achieve a sequence read length of 500 bp on average, this is still markedly lower than the 800 bp to 1 kb length achieved by Sanger sequencing (http://www.454.com/) [292]. In addition to a short read length, NGS technologies have higher sequence error rates, although this gradually has been improving [293].

A relatively new addition in the NGS market is the Ion Torrent Personal Genome Machine (PGM) produced by Life Technologies (http://www.iontorrent.com/). The earlier NGS technologies relied on emission of either fluorescent (Illumina and Life Technologies SOLiD sequencing platforms) or chemiluminescent (Roche GS FLX) light to detect and distinguish the nucleotides incorporated during sequencing. However, the Ion Torrent PGM uses proprietary semiconductor sensors to perform direct real-time measurement of the hydrogen ions released upon incorporation of nucleotides during sequencing. Several ion semiconductor sequencing chips will be available, with throughputs ranging from > 10 Mb to > 1 gigabase (Gb) per instrument run, but these are many-fold lower than the several hundred Gb of sequencing data generated by the latest Illumina HiSeq and Life Technologies SOLiD machines. The Ion Torrent PGM is therefore more suitable for smaller-scale targeted sequencing.

The first TGS instrument, the HeliScope Single Molecule Sequencer, is now commercially marketed by Helicos Biosciences. The HeliScope Single Molecule Sequencer, or true single-molecule sequencing (tSMS), is only loosely classified as a TGS technology because it has features of both NGS and TGS technologies. It is considered to be a TGS platform because of its ability to perform single DNA molecule sequencing without the need for whole-genome amplification, but the sequencing is still based on 'cyclic sequencing' (repeated cycles of sequencing) comprising several steps in each cycle, such as the flow of fluorescent-labelled nucleotides and reagents, nucleotide incorporation, washing and imaging [32]. Therefore, one of the major distinctions between NGS and TGS is that TGS does not require whole-genome amplification steps.

Numerous other TGS technologies, such as SMRT sequencing, are on the horizon and will soon be marketed commercially,[294] whereas others, such as nanopore sequencing, may take several years to become a mature technology [34, 35]. SMRT sequencing is performed by synthesising complementary strands of the single DNA molecules by DNA polymerase through incorporation of four different fluorescent colour-labelled nucleotides. The incorporation of each nucleotide into the synthesising DNA strands is monitored in real time by visualisation of 'pulses' of coloured light emitted from each zero-mode waveguide. Each waveguide corresponds to a single molecule of DNA fragment and the incorporation of nucleotides is distinguished by emission of four different colours of light. Similarly, nanopore sequencing requires no cyclic sequencing steps [32]. By comparison, companies such as Complete Genomics (Mountain View, CA) provide a sequencing service, rather than selling their sequencing machines to end-users. The sequencing platform achieves efficient imaging and low reagent consumption with combinatorial probe anchor ligation chemistry independently to assay each base from patterned nanoarrays of self-assembling DNA nanoballs [44]. As TGS is characterised by single DNA molecule sequencing, it has the potential further to increase the number of sequence reads or throughput per instrument run above the current capacity.

Whole-genome (re)sequencing

NGS and TGS technologies have now made possible the sequencing of the entire human genome within a few days. The first human WGS study using a next-generation sequencer was completed in 2008;[46] this marked the beginning of a new era in personalised genome sequencing. To date, more than 20 WGS studies have been completed using NGS and TGS technologies [45]. The number of genomes being sequenced is expected to increase in the coming years, as sequencing technologies and analytical and bioinformatics tools become more advanced and affordable [295]. The reference genome sequence from the HGP is needed for alignments of the large amount of sequence reads produced by the high-throughput sequencers. Clearly, these studies do not involve the de novo assembly of human genome sequences, but rather constitute genome resequencing studies.

The first human diploid genome sequence -- Craig Venter's genome -- appeared in 2007 and was sequenced using the Sanger sequencing method [68]. A year later, the genome of James Watson, who discovered the double-helical structure of the DNA molecule half a century ago, was also sequenced [46]. In contrast to Venter's genome, Watson's genome was sequenced using NGS technologies. A number of additional human genomes have now also been fully sequenced. For example, a single Caucasian/European;[92] a single African (ie NA18507 from the HapMap project, sequenced using two different NGS technologies);[42, 296] two Koreans;[297, 298] a single Han Chinese;[43] a single Japanese;[299] a single Irish individual [300] and a single Gujarati Indian [301] have been sequenced.

Two whole genomes of the indigenous hunter-gatherer peoples of southern Africa (Khoisan and Bantu) have also been sequenced, together with the protein-coding regions from an additional three hunter-gatherers from the Kalahari. This study has been important for understanding human diversity, as these genomes represent the oldest known lineage of modern humans. A better understanding of genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle [302]. In addition, the genome of an extinct Palaeo-Eskimo (~4,000-years old)[108] and a Neanderthal genome [107] have been sequenced. The sequencing work of most of these individual genomes was accomplished using NGS technologies.

These WGS studies have identified several hundred thousand new SNPs that had not been previously catalogued in the dbSNP database. For example, Bentley et al. (2008)[42] found about one million new SNPs in the African genome (NA18507), and several hundred thousand new SNPs for other genomes. Most of the common SNPs in human populations have already been captured; thus, the new SNPs identified in these studies are probably representative of those from the lower-frequency spectrum. Data on population frequencies of the new SNPs are not available, since they were derived from individual genome-sequencing studies; however, these data should be available upon completion of the 1000 Genomes Project. In addition to SNPs, several hundred thousand short indels and several thousand structural variants have also been identified.

Schuster et al. characterised the extent of whole-genome and exome diversity among five individuals (two whole genomes and three exomes were sequenced) and identified 1.3 million novel DNA differences genome-wide [302]. Interestingly, in terms of nucleotide substitutions, the Bushmen would appear to be genetically more different from each other than Europeans and Asians are to each other. This is consistent with the view that the genetic diversity between African individuals is greater than between individuals from other ethnogeographic origins [302]. A total of 353,151 high-confidence SNPs were identified in the genome of the extinct Palaeo-Eskimo [108]. By comparing the high-confidence SNPs in this extinct human genome with contemporary populations to identify the populations most closely related to this individual, this study provided evidence for a migration from Siberia into the New World some 5,500 years ago. Comparisons of the Neanderthal genome with the genomes of five extant humans from different parts of the world identified a number of genomic regions that may have been affected by positive selection in ancestral modern humans, regions that include genes involved in metabolism and in cognitive and skeletal development [107].

The WGS studies also identified a portion of the sequence reads that could not be mapped to the NCBI human reference genome, indicating that some sequences are 'missing' from the reference genome. For example, Wheeler et al. found that 1.5 million reads (approximately 1.4 per cent of the total sequence data) did not map to the reference genome [46]. These 'unmappable' sequence reads were then assembled into ~170,000 contigs spanning 48 Mb. Even after the removal of contigs that were < 100 bp in size, there were still ~110,000 contigs spanning 29 Mb. This concurs with the estimated 25 Mb of euchromatic sequence that is absent from the reference genome. More recent studies using sequencing data have also identified new sequences that are absent in the human reference genome [303, 304].

1000 Genomes Project

The 1000 Genomes Project was initiated in 2008 with the aim of sequencing the genomes of at least 1,000 individuals from different populations around the world (http://www.1000genomes.org/). The main aim of this international collaborative project has been to provide a comprehensive map of human genetic variation for future disease association studies and population genetics. As with the HapMap project, the data from this project also will be made available publicly.

Owing to the ease of high-throughput genotyping technologies, SNPs have been widely used as genetic markers in GWASs to search for disease variants. Evidence has been accumulating to suggest that (common) SNPs alone are unlikely to account for all the heritable risk of complex disease, however [5]. Concurrently, the amount of data supporting associations of CNVs with complex diseases has been growing [305]. Similarly, the importance of rare variants in complex diseases is also increasingly being recognised [306, 307]. This indicates that future disease association studies need to interrogate non-SNP and rare genetic variants, requiring a comprehensive catalogue of human genetic variants. Common SNPs have been well documented in the dbSNP, but rarer (or lower frequency) SNPs are still under-represented in the database and information on indels and structural variations is still incomplete.

The completion of the pilot phase of the 1000 Genomes Project identified approximately 15 million SNPs, one million short indels and 20,000 structural variations, most of which were previously unreported [102]. In addition, the location, allele frequency and local haplotype structure of these genetic variants were described. The sequencing data also enabled characterisation of CNVs within heavily duplicated and near-identical regions [308]. Recently, a map of CNVs was constructed based on WGS data from 185 human genomes in the pilot phase of the project; this encompasses 22,025 deletions and 6,000 additional structural variations, including insertions and tandem duplications. More importantly, approximately half of the structural variations were mapped to single nucleotide resolution, thereby facilitating analysis of their origin and functional impact [112]. Precision in terms of the breakpoint delineation of structural variations is a prerequisite to obtain insights into their underlying mutational mechanisms [286]. The nucleotide resolution analysis of the breakpoints was hampered by the low resolution of the microarrays used in previous studies.

A recent study also identified approximately two million small indels, ranging from 1 bp to 10,000 bp in length, in the genomes of 79 humans. Interestingly, approximately half of these variants (ie 819,363 small indels) mapped to human genes. These small indels were frequently found in the coding exons of these genes, and several lines of evidence indicate that such variation is a major determinant of human biological diversity [309]. This study also found that many of the small indels had high levels of LD with both HapMap-SNPs and GWAS-SNPs, suggesting that a proportion of these indels have already been interrogated indirectly for their associations with complex phenotypes in GWASs through LD with the SNPs as surrogate markers. This also indicates that, in addition to SNPs and larger CNVs, small indel variation is likely to be a key factor underlying the genetics of human complex diseases and traits.

By comparison with WGS, which relies on a reference genome for aligning the sequence reads, de novo genome assembly will enable the more thorough and comprehensive detection of various genetic variations in the human genome ranging from single nucleotide variants and small indels, to large structural variations. Currently, de novo genome assembly is challenging and less practical because of the short sequence reads generated by NGS technologies, especially the Illumina and Life Technologies sequencing platforms. Recent studies have attempted to perform de novo human genome assembly using short sequence reads, with limited success [291, 310, 311]. One such study showed that de novo assemblies were 16.2 per cent shorter than the reference genome, with thousands of coding exons being completely absent [312]. De novo genome assembly and haplotype phasing will eventually become more feasible with longer sequence read lengths of up to tens of kb being generated by future sequencing technologies [33].

Cancer genome sequencing and somatic mutations

Cancers differ from other complex diseases in several aspects. The involvement of somatic mutations in cancer initiation and progression, in addition to germline variations, is well recognised. Sporadic cancer is considered to be an 'acquired disease' caused by the accumulation of somatic mutations in the genome of the original cancer cell type over the lifespan of a patient. Direct sequencing of the cancer genome, and comparison with the genome sequence from constitutional DNA from the same individual as a reference, is required for the proper assessment of somatic mutations [313, 314].

Recent advances in the understanding of the somatic mutational profile of cancer genomes have been driven by NGS technologies, which have enabled numerous whole cancer genomes to be sequenced for the first time [47-49]. Nevertheless, many large-scale targeted resequencing studies of collections of cancer-relevant candidate genes, gene families or the RefSeq genes had previously been performed using traditional PCR isolation and Sanger sequencing methods. The scale of these targeted studies was limited by the lack of high-throughput sequence capture and sequencing methods [315, 316]. By contrast, sequencing of the entire collection of exons in acute monocytic leukaemia was completed without PCR isolation and Sanger sequencing methods [317].

Although somatic mutations have been found in many genes, only a few genes have been found to be frequently mutated across the tumour samples screened (ie mutated in a significant proportion of cancer samples). These genes have been referred to as 'mountains' -- as opposed to the 'hills', which correspond to genes that are infrequently mutated or mutated at low frequency [316, 318-320]. For example, the gene encoding V-erb-a erythroblastic leukaemia viral oncogene homolog 4 (avian) (ERBB4) was found to be the most highly mutated gene in melanoma and hence may be considered to be a 'mountain'; a considerable proportion of samples (19 per cent) were found to have somatic mutations in this gene, with some samples containing more than one mutation. The role of ERBB4 was also supported by extensive functional studies showing that various missense mutations increased kinase activity and transformation ability, and the demonstration of reduced cell growth after knockdown of the gene in melanoma cells expressing mutant ERBB4 [320]. Targeted cancer genome sequencing has thus demonstrated its capacity to identify potential therapeutic targets for melanoma.
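The distinction between 'mountains' and 'hills' comes down to the fraction of screened samples in which a gene carries at least one somatic mutation. The following is a minimal sketch of that arithmetic, not a method taken from the cited studies: the gene names (GENE_A, GENE_B), the sample identifiers and the 10 per cent cut-off are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def mutation_frequencies(sample_mutations):
    """Fraction of samples in which each gene carries >= 1 somatic mutation.

    sample_mutations maps a sample identifier to a list of mutated gene
    names; a gene may appear more than once for the same sample.
    """
    n_samples = len(sample_mutations)
    counts = defaultdict(int)
    for genes in sample_mutations.values():
        # A sample with several mutations in one gene is counted once.
        for gene in set(genes):
            counts[gene] += 1
    return {gene: count / n_samples for gene, count in counts.items()}


def classify(frequencies, mountain_threshold=0.10):
    """Label genes using an assumed (hypothetical) frequency cut-off."""
    return {
        gene: "mountain" if freq >= mountain_threshold else "hill"
        for gene, freq in frequencies.items()
    }


# Hypothetical screen of 20 tumour samples.
samples = {f"s{i}": [] for i in range(1, 21)}
for i in range(1, 5):
    samples[f"s{i}"].append("GENE_A")   # mutated in 4 of 20 samples
samples["s1"].append("GENE_A")          # second mutation in the same sample
samples["s5"].append("GENE_B")          # mutated in 1 of 20 samples

freqs = mutation_frequencies(samples)
labels = classify(freqs)
print(freqs, labels)
```

With only one or a few genomes sequenced, every mutated gene would have the same apparent frequency, which is why a larger sample collection is needed before such labels become meaningful.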

Despite cost constraints, the number of WGS studies performed on different cancers has been increasing [47-49] since the milestone first study that sequenced the cancer genome of an AML patient;[82] however, these WGS studies have generally sequenced only a few samples [321-323]. The ability of WGS to detect somatic mutations in abundance requires us to be able to identify the 'driver' mutations from among the myriad 'passenger' mutations. It has been predicted that approximately ten functional driver mutations are required to cause most cancers, yet up to tens of thousands of mutations may be identified in an analysis of a cancer genome [313, 314]. Effective methods for identifying driver mutations in cancer genomes are not well developed, and the criteria for distinguishing driver mutations are not well defined. In addition, the set of driver mutations can be very different for different cancer types.

Although frequently mutated genes and recurrent mutations are of particular interest,[324] all of the current studies have interrogated only one or a few cancer genomes. Thus, these studies are unable to distinguish 'mountains' from 'hills', and recurrent mutations from other mutations that occur only once in the samples. Therefore, testing a subset of somatic mutations identified in the cancer genome in a larger number of cancer samples is required to identify this subset of genes or mutations [325] before the application of WGS in larger samples can be regarded as not only technically feasible, but also affordable (Box 6).

At present, somatic mutations in non-coding regions have received relatively scant attention and should be given more importance, since pervasive transcription beyond the protein-coding regions has now been demonstrated,[165, 333, 334] suggesting a regulatory role for the non-coding regions. These somatic mutations, and possibly driver mutations in the non-coding regions, can only be revealed by sequencing the whole cancer genome, as opposed to a targeted approach.

Revisiting Mendelian disorders

Mendelian or monogenic disorders make up approximately 7,000 known or suspected disorders and contribute significantly to the disease burden in society [335-338]. Over the past two decades, much progress has been made in identifying the causal mutations and candidate genes for Mendelian disorders through mainly traditional linkage studies [339]. Currently, causal mutations for > 4,000 Mendelian disorders have been identified [99]. Indeed, a total of 112,864 different disease-causing and disease-associated mutations in 4,078 human genes are currently (as of May 2011) catalogued in the HGMD (http://www.hgmd.org/) (Box 7).

Although classical linkage studies have been the main tool for elucidating the genetics of Mendelian disorders, not all of these disorders are amenable to this study design. Homozygosity mapping is a more powerful and effective approach to studying recessive disorders in consanguineous families. For those disorders that are not amenable to these two conventional approaches, their causal mutations remain elusive. These disorders include: (a) extremely rare Mendelian disorders where only a small number of cases are available; (b) unrelated cases from different families; and (c) sporadic cases due to de novo mutations. Exome sequencing now offers new opportunities to study extremely rare disorders and sporadic cases caused by de novo mutations, such as Kabuki syndrome and Schinzel-Giedion syndrome [14, 340].

High-throughput sequence capture methods are able to isolate the universe of exons (the 'exome') in a more efficient and cost-effective way than traditional PCR-based methods. These methods are commercially marketed -- for example, the NimbleGen Sequence Capture technology (NimbleGen, Madison, WI: http://www.nimblegen.com/) and the Agilent SureSelect Target Enrichment technology (Agilent; Santa Clara, CA: http://www.home.agilent.com). They allow researchers to target custom genomic regions of interest in the human genome of up to tens of Mb in length, and also enable isolation of the exome in a single experiment. This development, coupled with the high-throughput sequencing data produced by NGS technologies, ensures an adequate depth of sequencing coverage accurately to detect the genetic variations in the exome or targeted regions [295, 341, 342].

Causal mutations have been identified for a number of previously unexplained rare disorders, such as Miller syndrome,[13] Sensenbrenner syndrome,[343] Perrault syndrome [344] and Fowler syndrome [345]. Exome sequencing is also a useful tool for diagnostic application and is anticipated to be used increasingly in molecular diagnosis [90, 346-348]. The genetic diagnosis of congenital chloride diarrhoea in a patient with suspected Bartter syndrome was made through exome sequencing, which revealed a homozygous missense variant in the solute carrier family 26, member 3 gene (SLC26A3) [90]. The position of this variant is completely conserved from invertebrates to humans. The diagnostic application was further illustrated by Lupski et al. through WGS of a proband with Charcot-Marie-Tooth disease [16]. One missense variant and one nonsense variant were detected in SH3TC2, and all affected individuals in the family of the proband were found to be compound heterozygotes for these variants.

Studying Mendelian disorders can, paradoxically, reveal genes for complex diseases and traits. For example, numerous GWAS-identified common SNPs which are associated with triglyceride, high-density lipoprotein (HDL) cholesterol and LDL cholesterol levels were also found in the candidate genes causing the monogenic form of these lipid metabolism disorders [349, 350]. The discovery of causal mutations in the disease genes responsible for Mendelian disorders should help in acquiring an understanding of the underlying pathophysiology. For example, the identification of causal mutations in the gene encoding dihydroorotate dehydrogenase (DHODH) for Miller syndrome has provided new insights into the role of pyrimidine metabolism in craniofacial and limb development [13]. The potential discovery of new drug targets through study of the genetics of Mendelian disorders should also be emphasised. Thus, statins, the most commonly used drugs to lower cholesterol levels by inhibiting the enzyme 3-hydroxy-3-methyl-glutaryl-CoA (HMG-CoA) reductase, were discovered by studying familial hypercholesterolaemia [351].

Currently, the return to Mendelian disorder research has been mainly due to the 'attraction' of the exome sequencing approach, coupled with the disappointment engendered by GWAS results that have served to explain only a small fraction of the heritability of complex diseases and traits. Nevertheless, studying complex diseases should not be abandoned, as GWASs have also revealed new biological insights, such as unravelling the autophagy and interleukin (IL)-23 receptor pathways for Crohn's disease [352-354]. The knowledge gained from studying Mendelian disorders and complex diseases will eventually complement each other and come together synergistically to enhance our understanding of genotype-phenotype relationships.

New efforts to identify further causal mutations underlying Mendelian disorders include a recent initiative by the National Human Genome Research Institute (USA) to establish 'A Center for Mendelian Disorders' whose mission will be to take on the sequencing of Mendelian disorders. This centre will be expected to explain the molecular basis of 40-50 disorders per year (NHGRI Large-Scale Sequencing Program May 2010, http://www.genome.gov/).

Sequencing-based approaches to the study of functional genomics

The NGS technologies, since their introduction in 2004, have been increasingly applied in studies of protein-DNA interactions and histone modifications (ChIP-Seq), transcriptomic profiling of mRNAs and non-coding RNAs (RNA-Seq), and bisulphite sequencing of DNA methylation (Methyl-Seq) [39-41].

ChIP-Seq

Previous studies of protein-DNA interactions -- such as the identification of transcription factor binding sites -- have relied on several low-throughput methods and have been focused on a few specific genomic regions. In the era of microarrays, genome-wide studies of protein-DNA interactions and histone modifications were performed using a method known as ChIP-chip [355]. Undeniably, microarray development has enabled interrogation on a genome-wide scale, but the detection of the immunoprecipitated DNA sequences is still dependent upon the availability of probes to capture them. Although the development of high-density tiling arrays,[356] where oligonucleotide probes are placed in high density throughout the whole genome, has improved the sensitivity of ChIP-chip, such tiling arrays are expensive, especially for large genomes like the human genome [357]. By contrast, for ChIP-Seq, the immunoprecipitated DNA sequences are not hybridised on microarrays (thereby avoiding the problems inherent in probe hybridisation experiments) but instead are directly sequenced to detect their presence and measure their abundance. This allows detection of all the DNA fragments or sequences that are immunoprecipitated without any bias in relation to probe selection [357]. This is a key advantage of ChIP-Seq over microarrays.

ChIP-Seq or chromatin immunoprecipitation with the paired-end ditag sequencing (ChIP-PET) methods have led to major advances in the genome-wide mapping of binding sites for transcription factors (eg p53 transcription factor binding sites),[358] and for DNA binding proteins such as neurone restrictive silencer factor (NRSF) and signal transducer and activator of transcription (STAT1) [77, 359]. Studies of histone modifications have also been revolutionised by means of ChIP-Seq methodology;[78] this has expanded our knowledge of how this epigenetic mechanism regulates gene expression in the human genome. ChIP-Seq has made an important contribution to the studies of protein-DNA interactions and histone modifications [360, 361].

RNA-Seq

Studies of gene expression are important because they constitute immediate molecular traits that are directly affected by variation in DNA sequences and epigenetics. The term 'gene expression' usually refers to the expression of protein-coding genes. Previous studies were focused on mRNA expression, as mRNAs serve as the templates for protein synthesis; however, this perception changed after the completion of the pilot phase of the ENCODE project. This project and other studies revealed 'pervasive transcription' in the human genome [165, 333, 334]. Previously it had been thought that only the protein-coding regions or sequences (ie genes) would undergo transcription followed by translation; however, accumulating data are compatible with the view that transcription also occurs in non-protein-coding regions, indicating the importance of studying non-coding RNAs.

The advent of NGS technologies has spawned new approaches to exploring the transcriptome (eg RNA-Seq) [362, 363]. This method allows the study of the expression of mRNAs and non-coding RNAs, and is also able to detect and identify new transcripts (coding and non-coding) that have not been formally annotated. The applications of sequencing-based approaches in transcriptomic studies have included genome annotation and the discovery of new transcripts,[364] the investigation of the alternative splicing patterns,[84, 365] detection of gene fusions in cancer [366] and allele-specific expression analysis,[367] as well as the discovery and measurement of non-coding RNA expression.

Methyl-Seq

Substantial progress has also been achieved in the context of DNA methylation analysis with the advent of NGS technologies allowing the determination of the DNA methylome at a single-base resolution [96, 368-371]. The 'gold standard' for the detection of DNA methylation (or cytosine methylation) is sodium bisulphite conversion of DNA followed by sequencing. The sodium bisulphite treatment will convert the unmethylated cytosine to uracil (subsequently read as thymine during sequencing), whereas methylated cytosine remains unchanged. One of the limitations of this method, however, is that it cannot distinguish between 5-methylcytosine and 5-hydroxymethylcytosine. The importance of studying 5-hydroxymethylcytosine for its biological roles will become clearer when more powerful methods to distinguish it from 5-methylcytosine become available [372]. The SMRT sequencer produced by Pacific Biosciences holds great promise for directly sequencing (and distinguishing) 5-methylcytosine and 5-hydroxymethylcytosine [373]. Nanopore sequencing technologies have also demonstrated the ability to directly detect methylated cytosines [35]. The revolution in sequencing approaches to exploring functional genomics in the human genome has also led to the initiation of several international projects (Box 8).
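The logic of bisulphite sequencing described above can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions (a single strand, complete conversion, no sequencing errors, a read that aligns to the reference without gaps); the sequence and the methylated positions are invented for illustration, and the function names are hypothetical.

```python
def bisulphite_convert(seq, methylated_positions):
    """Simulate bisulphite treatment followed by sequencing of one strand.

    Unmethylated C is converted to U and read as T; methylated C is
    protected and is still read as C.
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Compare a converted read with the reference at each cytosine.

    Returns {position: True if methylated, False if unmethylated}.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True      # C retained: cytosine was protected
        elif read_base == "T":
            calls[i] = False     # C read as T: cytosine was unmethylated
    return calls


reference = "ACGTCCGATC"         # invented example sequence
methylated = {1, 5}              # 0-based positions of methylated cytosines
read = bisulphite_convert(reference, methylated)
calls = call_methylation(reference, read)
print(read)    # ACGTTCGATT
print(calls)   # {1: True, 4: False, 5: True, 9: False}
```

Because the call rests solely on whether a cytosine survives conversion, any modification that protects cytosine in the same way yields the same read, which is the limitation noted above for 5-methylcytosine versus 5-hydroxymethylcytosine.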

Personalised genomic medicine

The translation of genomic information to the clinical setting has shown great promise. In the field of pharmacogenetics, the US Food and Drug Administration (FDA) has approved genotyping tests for the screening of genetic variants in candidate genes that influence the responses and adverse effects of several commonly used anticancer drugs (eg the genes encoding thiopurine S-methyltransferase [TPMT] and UDP-glucuronosyltransferase 1A1 [UGT1A1] for thiopurine drugs and irinotecan, respectively). Pharmacogenetic information is important to guide the optimal dose prescription [384]. Similarly, the FDA has also approved genotyping tests for two genes (CYP2C9 and the vitamin K epoxide reductase complex, subunit 1 gene [VKORC1]) in the prescription of warfarin, a drug of low therapeutic index [385].

The over-expression status of human epidermal growth factor receptor 2 (HER-2) receptors in breast cancer patients is clinically informative in deciding whether a given patient would benefit from trastuzumab treatment. Similarly, the deletion of CYP2D6 predicts whether a patient would benefit from tamoxifen treatment, as this prodrug requires bioactivation into its active metabolite, 4-hydroxytamoxifen, which is catalysed by the CYP2D6 enzyme. Thus, breast cancer patients who would not benefit from trastuzumab and tamoxifen treatments should be prescribed alternative drugs, such as aromatase inhibitors. In terms of prognosis, breast cancer prognostic gene expression arrays such as MammaPrint and Oncotype DX are informative and relevant to clinical management, as they help to determine which patients should receive adjuvant therapy after surgery [386-388]. These examples highlight the potential clinical utility of genomic information in prescribing and optimising treatments.

Genomic information has also been used to develop molecular-targeted cancer therapies. The discovery of the breakpoint cluster region-c-abl oncogene 1, non-receptor tyrosine kinase (BCR-ABL) genomic translocation ultimately led to the development of a molecular-targeted drug as a treatment for chronic myeloid leukaemia (CML), namely imatinib -- a tyrosine kinase inhibitor targeting the tyrosine kinase domain of the fusion protein [389]. The identification of somatic mutations in the epidermal growth factor receptor (EGFR) in non-small-cell lung cancer led to the development of gefitinib. Further, somatic mutations in EGFR have also been found to be informative in predicting sensitivity to gefitinib and in explaining inter-ethnic variability in drug responses [390]. Advances in epigenetics have led to drug developments such as inhibitors of DNA methyltransferases (DNMTs); indeed, 5-azacytidine and 5-aza-2'-deoxycytidine have been approved by the US FDA for the treatment of AML and myelodysplastic syndromes [391, 392]. These examples show that genomic discoveries can be directly translated into clinical applications.

Given the advances in the field, more discoveries will eventually translate into clinical applications and management of patients. For example, GWASs have led to several promising discoveries, such as the identification of genetic variants in IL28B that influence the spontaneous clearance of hepatitis C virus and affect the individual response of chronic hepatitis C to interferon-α plus ribavirin therapy [393, 394]. Similarly, cancer genome sequencing has identified promising somatic mutations in candidate genes (eg the isocitrate dehydrogenase 1 gene [IDH1]) as potential targets for drug interventions. Recurrent mutations in IDH1 have been found in 12 per cent of glioblastoma multiforme patients [318]. The importance of this gene is not confined to glioblastoma multiforme, as mutations in IDH1 were also found in 16 per cent of AML patients [325].

In the era of GWASs and WGS, the great challenge lies in data interpretation and how genomic information can be used to discover new drugs or molecular biomarkers for clinical applications that will eventually translate into patient benefit. The ultimate goal of these studies is to improve the clinical management of patients and to bring about personalised medicine [395, 396] through the development of new therapeutic agents tailored to the individual, based upon their genetic information. Although progress made towards achieving these goals has been promising, many challenges in the translational phase remain. Hence, it is still unclear how long it will take for personalised genomic medicine to become an everyday reality.

Summary

The analysis of the sequence of the human genome has had a major impact on biomedical research over the past few years. The HGP has made possible a multitude of genome-wide scale analyses and has thus provided a wealth of information about the architecture of the human genome. In many ways, the HGP has paved the way for what is coming to be called individualised or personalised genome medicine. The development of new (genotyping and sequencing) technologies for improved, less cost-intensive and more precise genome sequencing and assembly has been driven by the overwhelming success of the HGP.

In summary, the advances discussed in this review would not have been possible without the reference genome sequence produced a decade ago by the HGP. These advances have greatly improved our understanding of human genetic diversity, disease genetics and functional genomics. The development of powerful analytical and bioinformatics tools is crucially important in the era of genome sequencing (Box 9). The ongoing large-scale international projects will further contribute to the fields of human genetics, as well as human genomics, transcriptomics, epigenomics and metagenomics upon their completion. These projects will provide vital resources for future studies. Continued progress over the next ten years will bring us closer to the final goal of personalised genomic medicine.

Box 1. Gene deserts and their potential relevance to human inherited disease

A functional role(s) for gene deserts [127] has been supported by results from GWASs. Thus, multiple SNPs on chromosome 5p13.1 have been shown to be strongly associated with Crohn's disease, even though the region is located within a 1.2 Mb gene desert and the nearest annotated gene, that encoding prostaglandin E receptor EP4 (PTGER4), is about 270 kb away from the association signals [128-131]. Although the SNPs were consistently associated with the disease, their functional effect is not easy to infer because these SNPs could exert an effect either on the nearest gene or on other genes that are located further away. However, Libioulle et al. (2007)[128] integrated the GWAS results with gene expression data and found that the associated SNPs influenced the level of expression of PTGER4.

The majority of GWAS-SNPs are located in either intronic, intergenic or gene desert regions rather than within gene-coding or promoter sequences. These SNPs could nevertheless be of direct functional significance if their locations coincide with regulatory elements, either already known or yet to be characterised, such as enhancers, transcription factor binding sites and sequences encoding for microRNAs [132].

The association of the SNP rs6983267 at 8q24 with colorectal and prostate cancer has been a mystery since its discovery because the risk allele is located in a gene desert > 300 kb away from the nearest annotated gene, MYC. Recent studies have, however, found that the region containing the risk allele is a transcriptional enhancer that interacts with the MYC proto-oncogene [133, 134]. In a similar vein, GWAS-SNPs in a 9p21-located gene desert (associated with coronary artery disease) have been found to impair the interferon-γ signalling response [135].

Box 2. MicroRNAs

MicroRNA has been the most intensively studied non-coding RNA in the human genome. MicroRNA gene loci may be fairly numerous: already more than 15,000 microRNA gene loci have been identified in various species (miRBase, Release 16.0: September 2010; http://www.mirbase.org/), with 1,048 microRNAs being found in the human genome.

Biogenesis and function

The synthesis of microRNAs starts with the transcription of primary microRNAs by RNA polymerase II. The primary microRNAs will be processed further to become precursor microRNAs and then mature microRNAs. The mature microRNAs are short sequences of 18-25 nucleotides; they are incorporated into RNA-induced silencing complex (RISC) to exert their post-transcriptional regulatory roles through binding to the 3' untranslated region (UTR) of target mRNAs. The binding of microRNA to target mRNAs can lead to two possible outcomes; either degradation or cleavage of the mRNAs or suppression of the translation of mRNAs into protein [149].

Relevance to diseases

The importance of microRNAs as functional regulators has increasingly been interrogated by microarray and sequencing studies. Deregulation in the expression patterns of microRNAs has commonly been associated with various cancers [150-152]. SNPs in the (i) sequences encoding microRNAs and (ii) 3' UTR of mRNAs have also been found to be associated with various cancers [153, 154].

Box 3. Genome coverage

High genome coverage is important, since the underlying principle of this approach is the use of LD to detect disease variants. In SNP-scarce regions, bona fide disease variants could be missed because they are not in strong LD with any of the SNPs genotyped on the array. Genome coverage is an estimate of the proportion of SNPs (using the International HapMap data as a reference) that can be captured by the SNPs which are directly genotyped in an array with a preset r² threshold. Usually, a threshold of 0.8 is used to estimate genome coverage. These first-generation genotyping arrays used the International HapMap database for SNP selection and have poor coverage for SNPs with minor allele frequency (MAF) < 5 per cent [220-222].
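The coverage estimate defined above can be written as a short calculation. The sketch below is illustrative only: it assumes that pairwise r² values between reference SNPs and array SNPs have already been computed, that a SNP genotyped directly on the array counts as captured, and that any pair absent from the table has r² = 0. The SNP identifiers and r² values are invented.

```python
def genome_coverage(reference_snps, array_snps, r2, threshold=0.8):
    """Proportion of reference SNPs captured by an array.

    A reference SNP is captured if it is on the array itself or if its
    best r² with any array SNP meets the threshold. r2 maps
    frozenset({snp_a, snp_b}) to the pairwise r² value.
    """
    array_snps = set(array_snps)
    captured = 0
    for snp in reference_snps:
        if snp in array_snps:
            captured += 1
            continue
        best = max(
            (r2.get(frozenset((snp, tag)), 0.0) for tag in array_snps),
            default=0.0,
        )
        if best >= threshold:
            captured += 1
    return captured / len(reference_snps)


# Invented example: five reference SNPs, two of them on the array.
reference = ["snpA", "snpB", "snpC", "snpD", "snpE"]
array = ["snpA", "snpD"]
r2 = {
    frozenset(("snpA", "snpB")): 0.95,
    frozenset(("snpA", "snpC")): 0.40,
    frozenset(("snpD", "snpC")): 0.85,
    frozenset(("snpD", "snpE")): 0.30,
}

coverage = genome_coverage(reference, array, r2)
print(coverage)   # 0.8 (snpE is not captured)
```

In this toy case snpE is missed because its strongest LD with any array SNP falls below the threshold, which is the situation described above for SNP-scarce regions and low-frequency variants.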

Box 4. Characterising structural variation by means of sequencing

The discovery of copy-neutral variations has been attributed to the development of the PEM method and concurrent advances in NGS technologies. The PEM method has also contributed greatly to the discovery of CNVs in the human genome [76, 81, 280]. Further studies have also taken advantage of an important feature of sequencing data generated by NGS technologies, where several hundred million short sequence reads are produced per instrument run to detect CNVs based on the abundance or density of the sequence reads aligned to the reference genome. This approach is known as depth-of-coverage (DOC) and is similar to microarray-based methods, in that it is also unable to detect copy-neutral variations [281].

PEM

In the PEM method, a library of DNA fragments with a fixed insert size is prepared and both ends of the DNA fragments are sequenced to generate 'paired-end sequences' (the sequences at both ends of the DNA fragments). This sequence information is then aligned against the reference genome. The underlying principle of PEM in detecting structural variations is reliance upon the discordance in insert size and orientation of the paired-end sequences being aligned to the reference genome to infer 'simple' deletion, insertion and inversion. Thus, when paired-end sequences aligned to the reference sequence display discordance from the expected insert size or distance, this is indicative of either a deletion or an insertion, whereas discordance in orientation suggests the presence of an inversion (ie paired-end sequences are incorrectly oriented by comparison with the reference genome). Hence, the paired-end sequences are usually classified as 'concordant pairs' or 'discordant pairs'; only the discordant pairs are informative for inferring structural variants. Other, more complex, rearrangements -- such as 'everted duplications', 'linked insertions' and 'hanging insertions' -- can also be detected [282].
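The classification rule described above can be sketched as follows. This is a simplified illustration rather than the algorithm of any published PEM tool: the expected insert size, the tolerance and the coordinates are hypothetical, the library is assumed to yield a forward left read and a reverse right read when concordant, and the more complex classes (such as everted duplications) are lumped together as 'complex'.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    left_pos: int       # leftmost mapped coordinate on the reference
    right_pos: int      # rightmost mapped coordinate on the reference
    left_strand: str    # '+' or '-'
    right_strand: str   # '+' or '-'


def classify_pair(pair, expected_insert, tolerance):
    """Classify one mapped read pair as concordant or discordant."""
    strands = (pair.left_strand, pair.right_strand)
    if strands != ("+", "-"):
        # Both reads on the same strand suggests an inversion; other
        # orientations are not modelled in this sketch.
        return "inversion" if strands[0] == strands[1] else "complex"

    span = pair.right_pos - pair.left_pos
    if span > expected_insert + tolerance:
        # Ends map further apart than the fragment length: sequence
        # present in the reference is missing from the sample.
        return "deletion"
    if span < expected_insert - tolerance:
        # Ends map closer together: the sample carries extra sequence.
        return "insertion"
    return "concordant"


# Hypothetical library: 3 kb inserts, +/- 300 bp tolerated.
pairs = [
    ReadPair(1000, 4000, "+", "-"),
    ReadPair(1000, 9000, "+", "-"),
    ReadPair(1000, 2000, "+", "-"),
    ReadPair(1000, 4000, "+", "+"),
]
results = [classify_pair(p, expected_insert=3000, tolerance=300) for p in pairs]
print(results)   # ['concordant', 'deletion', 'insertion', 'inversion']
```

Only insertions shorter than the library insert size can be inferred in this way, which is one reason why libraries with different insert sizes are complementary.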

DOC

The DOC method utilises NGS data for CNV detection. This method is based on the DOC of the sequence reads to infer deletions and duplications. The DOC method is made possible by the production of several hundred million short sequence reads per instrument-run by NGS technologies. The principle underlying the DOC approach is based on the assumption that the sequencing process is uniform, so that the number of sequence reads mapping to a region follows a Poisson distribution. As such, the number of sequence reads should be proportional to the number of times that a particular region appears in the genome. Therefore, it is expected that a duplicated region will have more reads aligned with it, with the converse being true for deletions [281, 282]. The assumption that the sequencing process is uniform may not be valid, however, because of the sequencing bias of the NGS technologies, which leads to certain regions of the genome being over- or under-sampled, resulting in spurious signals [283]. Despite their shortcomings, the PEM and DOC methods will continue to play a role in the discovery of structural variations until de novo genome assembly becomes more feasible.
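The reasoning behind the DOC approach can be made concrete with a small calculation. The sketch below is illustrative only and makes several assumptions: uniform sampling with Poisson-distributed window counts, a diploid baseline, a genome size of roughly 3 Gb, a normal approximation to the Poisson (z-score) with an arbitrary cut-off of 3, and invented read counts. It ignores the sequencing biases discussed above.

```python
import math


def expected_reads_per_window(total_reads, genome_size, window_size):
    """Expected read count per window under uniform sampling."""
    return total_reads * window_size / genome_size


def call_windows(counts, expected, z_cutoff=3.0):
    """Label each window and estimate its copy number (diploid baseline).

    For a Poisson count the variance equals the mean, so the z-score is
    (observed - expected) / sqrt(expected).
    """
    calls = []
    for observed in counts:
        z = (observed - expected) / math.sqrt(expected)
        copy_number = round(2 * observed / expected)
        if z > z_cutoff:
            label = "duplication"
        elif z < -z_cutoff:
            label = "deletion"
        else:
            label = "normal"
        calls.append((observed, copy_number, label))
    return calls


# 300 million reads, ~3 Gb genome, 1 kb windows: 100 reads expected.
expected = expected_reads_per_window(
    total_reads=300_000_000, genome_size=3_000_000_000, window_size=1_000
)
counts = [98, 105, 52, 148, 101]          # invented window counts
calls = call_windows(counts, expected)
for call in calls:
    print(call)
```

Here the windows with 52 and 148 reads are flagged as a heterozygous deletion (copy number 1) and a duplication (copy number 3), respectively. A copy-neutral event such as an inversion leaves the counts unchanged and so cannot be detected by this approach.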

Application in cancer studies

Both PEM and DOC have also proven useful in dissecting somatically acquired rearrangements in cancer genomes [87, 284]. For example, paired-end sequencing of the genomes of two individuals with lung cancer identified 306 germline structural variations and 103 somatic rearrangements at single-nucleotide resolution [87].

Box 5. International effort to characterise structural variants using PEM

Proof-of-concept studies

The PEM method for detecting structural variants was first demonstrated by Tuzun et al., who mapped paired-end sequence data from a human fosmid genomic DNA library [285]. The average insert size of a fosmid library is approximately 40 kb. This study identified 297 structural variants (139 insertions, 102 deletions and 56 inversions); however, Sanger sequencing of fosmid clones is laborious and costly [285]. These limitations have been overcome by NGS technologies, which sequence paired-end or mate-pair libraries directly without the need for cloning steps [76]. Both the fosmid-based study [285] and the NGS-based study of Korbel et al. [76] applied the PEM approach to investigate structural variants in the same sample (NA15510) from the International HapMap Project. Their library insert sizes differed, however, which has enabled the sensitivity of the two studies to be compared. Korbel et al. [76] were able to confirm 41 per cent of all deletion and inversion events detected by fosmid paired-end sequencing. Moreover, they identified an additional 407 structural variants in NA15510 that had not previously been detected by fosmid paired-end sequencing. This suggests that several libraries with different insert sizes are needed to increase the sensitivity of PEM.

Human Genome Structural Variation Working Group

In addition to individual studies, the Human Genome Structural Variation Working Group is undertaking a large-scale effort to map structural variants comprehensively in phenotypically normal individuals using the PEM approach [79]. More specifically, the objective is to characterise the pattern of human structural variants at the nucleotide sequence level in a collection of 48 individuals of European, Asian and African ancestry. The project plans to construct fosmid clone libraries of approximately 40 kb insert size from the genomic DNA of 48 unrelated females who have already been genotyped by the HapMap Project. BAC clone libraries with a larger insert size of approximately 150 kb will also be constructed from 14 unrelated HapMap males. These are intended to provide sequence information on structural variants that are too large to be captured in the fosmid libraries, such as those associated with segmental duplications. Together, the fosmid and BAC libraries should allow structural variants of varying sizes to be captured comprehensively across the human genome.

Structural variation is biased toward complex duplicated and repetitive regions. Hence, developing clone libraries for a modest number of human genomes should serve as a valuable resource for characterising complex and difficult-to-assay regions of genome structural variation. Since the underlying clones can be retrieved, the complete sequence context of the discovered structural variant can also be obtained [79]. This is crucial for precise breakpoint delineation of structural variation, which is then important for understanding the mutational mechanisms responsible for human genome structural variation. A total of 1,695 structural variants were discovered with fosmid libraries derived from nine individuals. The study also showed that 50 per cent were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant, indicating novel discoveries. More importantly, 525 new insertion sequences (that are not present in the human reference genome) were discovered and many of these were found to be variable in copy number between individuals [86]. This is important because it suggests that structural variants or CNVs could have gone undetected as part of the 'missing sequences' in the human reference genome. Complete sequencing of 261 structural variants provided insights into the different mutational processes that have shaped the human genome. This study therefore provided the first high-resolution sequence map of human structural variation [86]. A subsequent study then expanded the Human Genome Structural Variation clone resource by including capillary end sequencing of 4.1 million additional fosmid clones from eight additional human genomes. The combined set includes 13.8 million clones derived from the genomes of six YRI, five Centre d'Etude du Polymorphisme Humain (CEPH) Europeans, three JPT, two CHB and one individual of unknown ancestry [286]. 
This study characterised the complete sequence of 1,054 large structural variants and analysed their breakpoint junctions to infer their potential mechanisms of origin. Three mechanisms were found to account for the bulk of germline structural variation: microhomology-mediated processes involving short (2-20 bp) stretches of sequence (28 per cent), non-allelic homologous recombination (22 per cent) and L1 retrotransposition (19 per cent).

Box 6. Challenges in cancer genome sequencing

Several major challenges at the forefront of cancer genome sequencing studies are outlined below. The first relates to the collection of 'high-quality' samples of cancer cells or tissues for DNA extraction and sequencing [48, 49]. Primary cancer tissues are usually contaminated by normal cells, which hampers our ability to detect somatic mutations in cancer genomes. This admixture of DNA from non-cancerous cells is particularly problematic because a higher depth of sequencing coverage is required to detect somatic mutations in 'mixed DNA', thereby increasing the cost of sequencing. For example, Ding et al. studied 188 primary lung adenocarcinoma samples, each containing a minimum of 70 per cent tumour cells, as independently determined by pathologists [326]. Single-cell sequencing is now emerging as a promising approach, however, because it is potentially able to resolve genetic and/or cellular heterogeneity among the cancer cells. This approach was applied to investigate tumour population structure and evolution in two cases of human breast cancer. Analysis of 100 single cells from a polygenomic tumour revealed three distinct clonal subpopulations that probably represent sequential clonal expansions. Analysis of 100 single cells from a monogenomic primary tumour and its liver metastasis indicated that a single clonal expansion formed the primary tumour and seeded the metastasis [113].
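
The link between normal-cell contamination and the required sequencing depth, noted at the start of this discussion, can be made concrete with a rough calculation. The sketch below is not taken from the studies cited. It assumes a clonal, heterozygous somatic mutation in a diploid region, ignores sequencing errors, and uses a hypothetical rule that a mutation is 'detected' once at least `min_alt_reads` reads carry it.

```python
from math import comb


def expected_vaf(purity):
    """Expected variant allele fraction: one mutant allele out of two copies,
    present only in the tumour fraction of the sample (assumption)."""
    return purity / 2.0


def detection_prob(depth, purity, min_alt_reads=3):
    """Binomial probability of seeing at least min_alt_reads mutant reads."""
    p = expected_vaf(purity)
    p_miss = sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_miss


def min_depth(purity, min_alt_reads=3, power=0.99):
    """Smallest depth at which detection_prob reaches the requested power."""
    depth = min_alt_reads
    while detection_prob(depth, purity, min_alt_reads) < power:
        depth += 1
    return depth


# Compare a sample with 70 per cent tumour cells to one with 20 per cent.
for purity in (0.7, 0.2):
    print(purity, expected_vaf(purity), min_depth(purity))
```

Under these assumptions, the less pure sample needs a markedly higher depth to reach the same detection probability, which is the cost penalty described above.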

The second challenge is to identify accurately the different types of somatic mutation in the cancer genome. By comparison with Sanger sequencing, NGS technologies are characterised by shorter sequence read lengths and higher sequencing error rates [295]. Data quality could be adversely affected if these sequencing errors are not properly filtered out.

Thirdly, whole-genome resequencing is still prohibitively expensive when applied to hundreds of samples. Furthermore, processing and analysing huge amounts of sequencing data pose significant bioinformatics and analytical challenges. These two constraints currently restrict whole-genome resequencing studies to only a few cancer genomes. This in itself becomes a major barrier to identifying recurrent mutations (which are more likely to be functionally important) and driver mutations. The current approach to identifying recurrent mutations is to select a subset of the somatic mutations detected in cancer genomes and then to test them in a larger study [325]; this approach cannot be used to screen for all mutations, however, so many recurrent mutations remain undetected. For example, a total of 64 mutations were detected in protein-coding genes, regulatory RNAs and highly conserved non-coding regions in the AML genome, but only four of these mutations were subsequently found in additional samples when tested for in more than 180 AML patients. By contrast, targeted resequencing in large sample sizes is able to identify recurrent mutations. This targeted approach focuses only on certain genes, however, and, as a consequence, those recurrent mutations located outside the targeted regions remain undetected. In addition to identifying recurrent mutations, a large sample size is also needed to distinguish 'mountains' from 'hills'.

Although the findings from targeted [315, 316, 320, 326, 327], exome [317, 328, 329] and whole-genome resequencing [82, 321-323, 330-332] studies have increasingly provided new insights into cancer genomes, the greatest challenge for cancer genome sequencing lies in discerning driver mutations from the multitude of other (passenger) mutations. Effective methods for identifying driver mutations in cancer genomes are not yet well developed. In addition, driver mutations may differ between cancer types.

Box 7. Human Gene Mutation Database and the 'human mutome'

As the number of disease-causing or disease-associated germline mutations or variants increases, proper cataloguing is critically important. In this regard, the HGMD represents an attempt to collate all known (published) gene lesions responsible for human inherited disease.

Disease-causing or disease-associated germline mutations/variants collated in the HGMD now exceed 110,000 in > 4,000 different nuclear genes. Newly described human gene mutations are currently being reported at a rate of ~10,000 per annum, with ~300 new 'inherited disease genes' being recognised every year. The HGMD has provided useful insights into the 'human mutome' (ie disease-causing or disease-associated germline mutations/variants in the entire human genome) [99, 100]. For a variety of reasons, however, the number of mutations catalogued so far is likely to represent only a small proportion of the clinically relevant genetic variants present in the human genome. Disease-causing or disease-associated variants located outside the gene-coding regions have probably often been overlooked, as a direct consequence either of focusing exclusively on screening the protein-coding sequence or of the inherent limitations of the mutation detection techniques used. Such considerations are important for improving mutation screening strategies, as well as for facilitating the interpretation of findings from GWASs, exome sequencing and WGS.

Box 8. International projects that are exploring functional genomics

The advent of NGS and TGS will facilitate the undertaking of several international projects (http://commonfund.nih.gov/). These large-scale projects would not have been technically feasible without NGS and TGS technologies, which have potentiated sequencing-based approaches in studying functional genomics. These projects will contribute significantly to functional genomics.

The NIH Roadmap Epigenomics Program

The NIH Roadmap Epigenomics Program aims to generate new research tools, technologies, datasets and infrastructure to accelerate our understanding of the role of epigenetics [374]. This will improve our understanding of instances of transcriptional regulation that are not dependent on the DNA sequence. This will be important in understanding diseases attributed to epigenetic aberrations involving DNA methylation or histone modifications [375]. For example, many cancers are commonly associated with epigenetic aberrations [376].

The Genotype-Tissue Expression (GTEx) Project

Transcriptional regulation is modulated not only by epigenetics, but also by genetic variation in the DNA sequence. The GTEx Project therefore aims to study human gene expression and regulation in multiple tissues, providing valuable insights into the mechanisms of gene regulation and, in the future, its disease-relevant aberrations. Genetic variation between individuals will be examined for a correlation with differences in gene expression level. Major advances have been made in studies of eQTL through the use of high-throughput genotyping and sequencing technologies [377-381]. For example, Montgomery et al. sequenced the mRNA fraction of the transcriptome in 60 HapMap individuals of European descent and integrated the data with SNP information from the HapMap Phase III project, an undertaking which led to discoveries of novel eQTLs and sequence variants responsible for alternative splicing [380].
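
The correlation between genetic variation and expression level that underlies eQTL mapping can be sketched as a simple regression of expression on genotype. This is an illustrative simplification rather than the method of the studies cited: it assumes an additive model in which each genotype is coded as the number of copies of the alternative allele (0, 1 or 2), and the expression values are made-up numbers.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage.

    The slope estimates the change in expression per extra copy of the
    alternative allele under an additive model (an assumption).
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    return sxy / sxx


# Hypothetical data for six individuals at one SNP-gene pair.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope = eqtl_slope(dosages, expression)
print(slope)
```

A genome-wide analysis repeats a test of this kind across very many SNP-gene pairs, so the resulting statistics need correction for multiple testing before any pair is reported as an eQTL.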

The Human Microbiome Project

The Human Microbiome Project aims to characterise the microbial communities found at several different sites in the human body, such as oral cavities, skin, gastrointestinal tract and the urogenital tract. This project is important in providing insights into the roles of these microbes in human health and disease [382]. The first metagenomic sequencing of gut microbes was accomplished using NGS technologies [103]. A human gut microbial gene catalogue was established by characterisation of 3.3 million non-redundant microbial genes derived from faecal samples from 124 European individuals. This research is important in gaining better understanding of the influence of gut microbes on human health and disease [103].

The International Cancer Genome Consortium

New developments have also occurred in cancer genomics, where the International Cancer Genome Consortium aims to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumour types and subtypes [110]. This is in accordance with the notion of integrative analyses incorporating multiple sources of genomics data [383]. This project will be important in dissecting the somatic genetic heterogeneity, a general hallmark of cancer, through studying various tumour types and subtypes.

Box 9. Bioinformatics -- computational and analytical tools -- in the NGS era

Bioinformatics -- and computational and analytical tools -- play a key role in the NGS era, an era in which huge amounts of sequencing data are being generated. Parallel developments in bioinformatics tools have contributed greatly to recent advances in the field of human structural and functional genomics where NGS technologies have been applied. A detailed discussion of the development of these analytical tools and methodological pipelines is beyond the scope of this paper. However, bioinformatics, computational and analytical tools have been developed for a variety of applications at different stages of the analysis of data generated by both structural and functional genomics studies. Exemplars are given below.

Base calling, alignment, mapping and assembly

  1. Base-calling for NGS platforms [397].
  2. Survey of sequence alignment algorithms for NGS [398].
  3. Evaluation of NGS software in mapping and assembly [399].
  4. De novo assembly of short sequence reads [291].
  5. Assembly algorithms for NGS data [400].

Structural genomics (discovery of genetic variations)

  6. Computational methods for discovering structural variation with NGS [282].
  7. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes [401].
  8. A framework for variation discovery and genotyping using NGS DNA data [402].

Functional genomics

  9. Introduction to the analysis of high-throughput-sequencing based epigenome data [403].
  10. Computation for ChIP-seq and RNA-seq studies [404].
  11. Bioinformatics approaches for genomics and post-genomics applications of NGS [405].

Association studies

  12. Association studies for NGS [406].