WO2013073929A1 - Method and apparatus for detecting nucleic acid variation(s)

Info

Publication number: WO2013073929A1
Application number: PCT/MY2012/000273
Authority: WO
Inventors: Yang Ming POH; Soo Heong BOON; Ying Wah LEE
Original assignee: Acgt Intellectual Limited; ACGT, Sdn Bhd
Priority date: 2011-11-15
Filing date: 2012-11-14
Publication date: 2013-05-23
Also published as: WO2013073929A8; TW201323615A; AR088867A1

Abstract

The invention relates to a method for detecting at least one nucleic acid variation based on the ratio of corrected signal intensities of at least two differentially labelled probes capable of detecting the nucleic acid variation. The invention also relates to an apparatus for performing the method.

Description

METHOD AND APPARATUS FOR DETECTING NUCLEIC ACID VARIATION(S)

Field of the invention

The present invention relates to the field of detecting nucleic acid variations, for example in genotyping. More specifically, the invention relates to detecting a nucleic acid variation of a locus present in a sample. In particular, allelic variations or single nucleotide polymorphisms (SNPs) may be detected.

Background of the invention

In a population, the genetic makeup or genotype of individuals varies. Genotyping generally refers to identifying the genetic makeup of an individual organism. Genotyping may detect nucleic acid variations in an individual, for example allelic variations or single nucleotide polymorphisms (SNPs). Recent developments have led to robust genotyping platforms. Examples of genotyping platforms include the Invader assay (Olivier et al., 2005), array-based methods (Perkel, 2008) and arrayed primer extension or APEX (Kurg et al., 2000). The abundant data generated with these robust genotyping platforms need to be analysed, typically using computerised methods. Examples of data analysis methods using diverse algorithms have been reported in US 2009/0062138, Ritchie et al. (2009) and Takitoh et al. (2005). It is desirable to develop methods capable of genotyping with high accuracy.

Summary of the invention

The present invention relates to detecting nucleic acid variation(s). According to a first aspect, the present invention provides a method for detecting a nucleic acid variation of a locus in a sample, comprising the steps of:

(i) contacting at least two differentially labelled probes with the sample; wherein the first labelled probe is capable of detecting a first nucleic acid variation A and the second labelled probe is capable of detecting a second nucleic acid variation B;

(ii) detecting a first signal intensity X for the first labelled probe and a second signal intensity Y for the second labelled probe on a support; wherein the first signal intensity X correlates to the presence of the first nucleic acid variation A and the second signal intensity Y correlates to the presence of the second nucleic acid variation B;

(iii) performing background corrections on the first and second signal intensities to give background corrected first signal intensity XA and background corrected second signal intensity YB;

(iv) expressing X_A:Y_B as a ratio (S_r), wherein if X_A:Y_B ≥ C:1 given X_A, Y_B > 0, or if X_A > 0 and Y_B ≤ 0, then the nucleic acid variation is A:A; if X_A:Y_B ≤ 1:C given X_A, Y_B > 0, or if Y_B > 0 and X_A ≤ 0, then the nucleic acid variation is B:B; if X_A:Y_B is between C:1 and 1:C, then the nucleic acid variation is A:B; wherein C is a real number; and if both X_A and Y_B ≤ 0, either both A and B are not present or the nucleic acid variation cannot be determined.

For example, the method may be for detecting different alleles. In particular, the method may be for detecting single nucleotide polymorphisms (SNPs).

Brief description of the figures

Figure 1 depicts the hybridisation of allele-specific probes and a locus-specific oligonucleotide (probe) comprising an IllumiCode region (a step from the Illumina GoldenGate assay).

Figure 2 depicts the hybridisation of PCR products to beads on an array.

Figure 3 shows an example of the raw intensities output in graphical format from Genome Studio.

Figure 4 shows further examples of raw signal intensity distributions output in graphical format from Genome Studio.

Figure 5 shows a graphical representation of the background corrected signal intensities against the expected genotype.

Figure 6 shows the error margins of the present corrected signal ratio algorithm for detecting nucleic acid variation(s).

Figure 7 shows a Venn diagram comparing the SNP calls from the corrected signal ratio algorithm (Poh), Genome Studio (GS) and sequencing.

Figure 8 shows the flow diagram of an example of the method of the present invention.

Definitions

An array refers to a support including a slide, chip, membrane, bead, or microtiter plate, with a plurality of elements bound or immobilised at defined locations. The elements may comprise molecules (e.g. nucleic acid molecules). In particular, a microarray refers to a high density array. For example, a microarray may have a density of 120 or more elements per cm².

A "double polynucleotide polymorphism (DNP)" refers to two single polynucleotide polymorphisms, and includes the circumstances when the two SNPs are positioned next to each other, separated by other nucleotides, on different strands of the same nucleic acid molecules, or on different nucleic acid molecules.

Differentially labelled probes also include the situation when one probe is labelled and the other probe is not.

A primer refers to an oligonucleotide to which deoxyribonucleotides may be added by a DNA polymerase. A single primer may be used to amplify a DNA or RNA region, for example, for sequencing.

A primer pair usually comprises a first primer complementary to one strand of a DNA or RNA molecule and a second primer complementary to a second strand of a DNA or RNA molecule, with both primers flanking a target DNA or RNA region, to be amplified by a DNA polymerase.

A probe refers to any molecule used to locate and/or identify a target DNA or RNA sequence. Probes may usually be labelled by standard methods, for example, radioactively or with fluorescent markers. For example, probes may be used to detect differences in DNA or RNA sequences, including single nucleotide polymorphism(s). The differentially labelled probes also include at least two differentially labelled nucleotides, wherein one nucleotide is incorporated by a polymerase to a polynucleotide being extended, depending on the SNP present.

"Nucleic acid variation" includes, but is not limited to allelic variations, a single nucleotide polymorphism (SNP), a double nucleotide polymorphism (DNP), a deletion, an insertion, a substitution, a nucleic acid amplification, a rearrangement of a nucleic acid sequence or a gene and/or its corresponding transcriptional and/or translational product, and/or alternative splicing of the transcriptional and/or translational product.

A single nucleotide polymorphism (SNP) refers to a DNA and/or RNA sequence variation occurring when a single nucleotide in an organism's genetic material differs between members of the species (or between paired chromosomes in the organism). SNP includes substitution, deletion or insertion of a single nucleotide.

Detailed description of the invention

The invention relates to a method for detecting nucleic acid variations. As herein described, the method comprises the steps of:

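(i) contacting at least two differentially labelled probes with the sample; wherein the first labelled probe is capable of detecting a first nucleic acid variation A and the second labelled probe is capable of detecting a second nucleic acid variation B;

(ii) detecting a first signal intensity X for the first labelled probe and a second signal intensity Y for the second labelled probe on a support; wherein the first signal intensity X correlates to the presence of the first nucleic acid variation A and the second signal intensity Y correlates to the presence of the second nucleic acid variation B;

(iii) performing background corrections on the first and second signal intensities to give background corrected first signal intensity X_A and background corrected second signal intensity Y_B;

(iv) expressing X_A:Y_B as a ratio (S_r), wherein if X_A:Y_B ≥ C:1 given X_A, Y_B > 0, or if X_A > 0 and Y_B ≤ 0, then the nucleic acid variation is A:A; if X_A:Y_B ≤ 1:C given X_A, Y_B > 0, or if Y_B > 0 and X_A ≤ 0, then the nucleic acid variation is B:B; if X_A:Y_B is between C:1 and 1:C, then the nucleic acid variation is A:B; wherein C is a real number; and if both X_A and Y_B ≤ 0, either both A and B are not present or the nucleic acid variation cannot be determined.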

In particular, the sample comprises an isolated sample.

Any labelling means known in the art may be used in the practice of the invention. Detection of the nucleic acid variation is based on the ratio of background corrected signal intensities of at least two differentially labelled probes. The signal intensity X of the first probe capable of detecting the first nucleic acid variation A is background corrected to give X_A. The signal intensity Y of the second probe capable of detecting the second nucleic acid variation B is background corrected to give Y_B.

Background correction may be performed by any suitable method. For example, background correction may be made by subtracting the background intensity (BI) from the signals X and Y to give X_A and Y_B respectively. The background intensity (BI) may be determined by measuring signal intensity in the absence of any probe (negative control).
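The subtraction described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the Examples: the function names are hypothetical, and taking the mean of several negative-control readings as BI is an assumption made here for the sketch.

```python
from typing import List, Tuple


def mean_background(blank_intensities: List[float]) -> float:
    """Estimate BI from negative-control readings.

    Averaging is an assumption of this sketch; the text only states that
    BI is measured in the absence of any probe.
    """
    if not blank_intensities:
        raise ValueError("at least one negative-control intensity is required")
    return sum(blank_intensities) / len(blank_intensities)


def background_correct(x: float, y: float, bi: float) -> Tuple[float, float]:
    """Return (X_A, Y_B) = (X - BI, Y - BI)."""
    return x - bi, y - bi


# Hypothetical raw intensities, for illustration only.
bi = mean_background([90.0, 110.0])
x_a, y_b = background_correct(1300.0, 400.0, bi)
```

The corrected values may be zero or negative when a raw signal does not exceed the background; how such values are interpreted is a matter for the calling rule, not for the subtraction itself.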

If X_A and Y_B are both > 0, the nucleic acid variation present is determined based on the corrected signal ratio (S_r) X_A:Y_B. The nucleic acid variation is determined as A:A if X_A:Y_B ≥ C:1. Conversely, the nucleic acid variation is determined as B:B if X_A:Y_B ≤ 1:C. The nucleic acid variation is determined as A:B if X_A:Y_B is between C:1 and 1:C. In particular, C (cut-off) is a real number. For example, C may be any value ≥ 2. In particular, C may be 2 or 3. More in particular, C = 3. As an example, consider the situation where C = 3. If X_A:Y_B ≥ 3:1, the nucleic acid variation is determined as A:A. If X_A:Y_B ≤ 1:3, the nucleic acid variation is determined as B:B. If X_A:Y_B is between 3:1 and 1:3, the nucleic acid variation is determined as A:B.

The corrected signal intensities for nucleic acid variations A and B, respectively, should be larger than 0. If X_A or Y_B ≤ 0, the signal is taken to be negligible or absent. Accordingly, if X_A:Y_B ≥ C:1 given X_A, Y_B > 0, or if X_A > 0 and Y_B ≤ 0, then the nucleic acid variation present is A:A. If X_A:Y_B ≤ 1:C given X_A, Y_B > 0, or if Y_B > 0 and X_A ≤ 0, then the nucleic acid variation present is B:B. If both X_A and Y_B are ≤ 0, it is considered that either both the nucleic acid variations A and B are not present or the signal failed (i.e. the nucleic acid variation cannot be determined).

The signal intensities X and Y are detected on a support. The support may comprise an array or, more in particular, a microarray.
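The decision rule of step (iv), including the cases in which one or both corrected intensities are not positive, can be sketched in Python as below. The function name, the return labels and the requirement that C be greater than 1 are assumptions of this sketch; the comparison is written with multiplication rather than division so that a zero Y_B needs no special arithmetic.

```python
def call_variation(x_a: float, y_b: float, c: float = 3.0) -> str:
    """Call A:A, A:B, B:B or NC from background corrected intensities.

    c is the cut-off C of the ratio rule (C:1 and 1:C). Requiring c > 1 is
    an assumption of this sketch so that the two thresholds do not cross.
    """
    if c <= 1:
        raise ValueError("cut-off C is assumed to be greater than 1")
    if x_a <= 0 and y_b <= 0:
        return "NC"      # both signals absent or failed: no call
    if y_b <= 0:
        return "A:A"     # only the A signal is above background
    if x_a <= 0:
        return "B:B"     # only the B signal is above background
    if x_a >= c * y_b:   # X_A:Y_B >= C:1
        return "A:A"
    if x_a * c <= y_b:   # X_A:Y_B <= 1:C
        return "B:B"
    return "A:B"         # ratio lies between C:1 and 1:C


# Hypothetical corrected intensities, for illustration only.
examples = [(900.0, 300.0), (300.0, 900.0), (500.0, 400.0), (100.0, 0.0), (-5.0, -2.0)]
calls = [call_variation(x, y) for x, y in examples]
```

With the default C = 3, the five hypothetical inputs are called A:A, B:B, A:B, A:A and NC respectively.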

Step (i) may further comprise an amplification step. In particular, the amplification is with a polymerase chain reaction (PCR). In particular, the PCR is with the first labelled probe, the second labelled probe and a locus specific oligonucleotide as primers to give PCR products. The PCR products are then hybridised to locus specific nucleic acid immobilised on the support. For instance, the Illumina GoldenGate technology includes an amplification step (see Examples below).

Typically, two-channel detection may be used to detect the signal intensities X and Y. However, one-channel detection may be used if only one probe is labelled.

Further, the method according to any aspect of the present invention may be adapted to detect more than two nucleic acid variations. For example, if there are three nucleic acid variants A, B, C, the analysis can be performed for A & B, A & C and B & C. If there are four nucleic acid variants A, B, C, D, the analysis can be performed for A & B, A & C, A & D, B & C, B & D and C & D. The present method may be adapted accordingly to detect any number of nucleic acid variants.

The present method may be adapted to any suitable array platform for detecting nucleic acid variations and/or genotyping known in the art, including but not limited to Illumina GoldenGate, Illumina Infinium, Affymetrix platform or Invader assay. For example, the present method may also be used with the Illumina Infinium platform (Steemers et al., 2006) where a differentially labelled nucleotide corresponding to the SNP is incorporated during amplification.
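The pairwise analysis described above for more than two variants amounts to enumerating every unordered pair of variants. A minimal Python sketch, using a hypothetical helper name, is:

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def variant_pairs(variants: Sequence[str]) -> List[Tuple[str, str]]:
    """List every unordered pair of variants to be analysed against each other."""
    return list(combinations(variants, 2))


three = variant_pairs(["A", "B", "C"])
four = variant_pairs(["A", "B", "C", "D"])
```

For three variants this yields the pairs A & B, A & C and B & C, and for four variants it yields six pairs, matching the enumeration in the text.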

The invention also includes an apparatus for performing the invention. The apparatus includes the support system and/or associated computer system. The support system includes the array system. The computer system may be used to process and/or analyse the signal intensities from the array. According to a further aspect, the invention relates to a computer system, programmed to perform steps (iii) and (iv) of the method of the invention. The computer system may in principle be any general computer, such as a personal computer, although in practice it is more likely to be a workstation or a mainframe computer.

The invention also relates to software executable by a computer system to cause the computer system to perform steps (iii) and (iv) of the method. The invention also includes a computer program product comprising the software. In particular, the computer program product is tangible. A computer program product includes, for example, a tangible recording, storage and/or computer-readable media. Examples of such media include but are not limited to a computer hard-drive, a compact disc, a flash memory device (e.g. memory cards, USB flash drives, solid state drives) or a floppy disk. Other suitable media known in the art may also be used.

Accordingly, the method of the invention comprises a computer-implemented method. Further, the present method is capable of being automated.

Having now generally described the invention, the same will be more readily understood through reference to the following examples which are provided by way of illustration, and are not intended to be limiting of the present invention.

EXAMPLES

Standard molecular biology techniques known in the art and not specifically described were generally followed as described in Sambrook and Russell (2001).

Example 1 Illumina GoldenGate Assay

SNP analysis was performed using the Illumina GoldenGate Assay according to the manufacturer's instructions. In brief, this genotyping platform uses differentially labelled allele-specific probes and a locus-specific oligonucleotide (or probe) for detecting the SNP (Figure 1). The allele-specific probes are labelled with different fluorescent dyes. Each locus-specific oligonucleotide comprises a specific IllumiCode region which is unique to the locus. Depending on the SNP variant present, the corresponding allele-specific probe will specifically bind to the DNA template of the sample and be extended via PCR to the locus-specific oligonucleotide. After the extension, the PCR products flanked by the allele-specific probe and the locus-specific oligonucleotide are hybridised to a set of beads via the IllumiCode region. Each specific IllumiCode region represents a specific locus, and the position of the bead with the corresponding complementary IllumiCode region on the array is tracked and used to aid identification of the expected SNPs associated with the locus. The PCR products bound to the beads, which are localised to specific locations on the array, are then scanned for the presence or absence of each differentially labelled probe, which indicates the SNP present, either homozygous or heterozygous (Figure 2).

In array-based SNP genotyping platforms, each SNP locus is in general represented by two dyes, where each dye represents one of the SNP alleles and both dyes in combination represent the presence of both alleles (heterozygous). The signal intensities of each dye are collected by instruments and analysed using software provided by the manufacturer. For each SNP locus, there is associated information such as the identity of the SNP, the SNP alleles and the dye representing each allele.

For example, the software Genome Studio, provided by Illumina, is capable of processing and displaying the signal intensities and associated information in graphical format (Figures 3 and 4). Genome Studio also includes a proprietary clustering algorithm which analyses the signal intensities to determine (or "call") the SNP genotype. However, it was found that the algorithm provided in Genome Studio was not satisfactory: it often called the SNP wrongly compared to capillary sequencing of the SNPs, or did not call the SNP at all (uncalled) even though the signal was available and callable. Miscalled SNPs would mislead the analysis, and uncalled SNPs would be excluded from it.

Example 2 Ratio of signal intensities

The present method may be adapted to any suitable array platform for detecting nucleic acid variations and/or genotyping known in the art.

The present method was tested on the Illumina GoldenGate platform. First, the following associated information from Illumina Genome Studio was extracted with a series of preprocessing scripts.

(i) The "Full data table.txt", which carries the name of the subject/sample that was genotyped, the IllumiCode address, the locus and position of the SNP and the raw intensities.

(ii) The "Sample Table.txt", which carries information on the sample and the distribution of intensities of the sample.

(iii) The "Paired sample table.txt", which carries the information on the SNP and bead type and the SNPs in X and Y, including the background intensities of the negative control.

(iv) The "SNPtable.txt", which carries the orientation of the SNP using the TOP/BOT convention of the Illumina GoldenGate Array and the primer sequences.

Consider, for example, alleles A and B, where X and Y represent the raw signal intensities of the respective alleles. For each array, there are negative controls comprising beads or points not bound to any PCR product, and the intensities of these points may be used as the background intensity (BI). In the case of Illumina GoldenGate assays, the background intensity was derived from a few blank beads which are part of the negative control. This could be substituted with signal intensities from control experiments in any SNP genotyping assay.

The corrected signal intensities X_A and Y_B for alleles A and B, respectively, are as follows:

X_A = X - BI

Y_B = Y - BI

The corrected signal intensities X_A and Y_B for A and B, respectively, should be larger than 0 after background subtraction. If X_A or Y_B is found to be less than 0, it is set to 0. When the corrected signal of one allele is 0, that allele is taken to be negligible or absent, and the other allele is called as the genotype. In other words, if only one dye has a significant intensity after background subtraction, the allele it represents is taken as representing the SNP. If both X_A and Y_B are found to be 0, the genotype is called as a No Call (NC). In other words, if the signal intensities of both dyes fall below the background intensity, it is accepted that the signal failed or that the SNP alleles being tested are not present.

For example, if the ratio 1:3 is used (i.e. C = 3), when the signal ratio (S_r) of the corrected signal intensities of alleles A and B (X_A to Y_B, or X_A:Y_B) is ≥ 3:1, the genotype in the sample would be A:A; if the X_A:Y_B signal ratio is ≤ 1:3, the SNP would be B:B; and if the X_A:Y_B signal ratio is between 3:1 and 1:3, it would be A:B. The calculation and genotype calling are represented as follows.

Where,

X_A = corrected signal intensity for A, X_A ≥ 0

X = raw signal intensity of allele A

Y_B = corrected signal intensity for B, Y_B ≥ 0

Y = raw signal intensity of allele B

BI = background intensity

\[
\text{Signal ratio, } S_r = X_A : Y_B =
\begin{cases}
\dfrac{X_A}{Y_B} \ge 2.0000, & \text{Genotype} = \text{AA} \\[6pt]
0.5000 < \dfrac{X_A}{Y_B} < 2.0000, & \text{Genotype} = \text{AB} \\[6pt]
\dfrac{X_A}{Y_B} \le 0.5000, & \text{Genotype} = \text{BB} \\[6pt]
X_A = 0.0000,\ Y_B > 0.0000, & \text{Genotype} = \text{BB} \\
Y_B = 0.0000,\ X_A > 0.0000, & \text{Genotype} = \text{AA} \\
Y_B = 0.0000,\ X_A = 0.0000, & \text{Genotype} = \text{NC}
\end{cases}
\]
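As a purely illustrative worked example, the calculation can be followed through with the cut-off values 2.0000 and 0.5000 shown above. The raw intensities and the background intensity below are hypothetical and are not taken from the assay data.

\[
\begin{aligned}
X = 1300,\ Y = 400,\ BI = 100 &\;\Rightarrow\; X_A = 1200,\ Y_B = 300,\ \tfrac{X_A}{Y_B} = 4.0000 \ge 2.0000 \;\Rightarrow\; \text{AA} \\
X = 500,\ Y = 400,\ BI = 100 &\;\Rightarrow\; X_A = 400,\ Y_B = 300,\ 0.5000 < \tfrac{X_A}{Y_B} \approx 1.3333 < 2.0000 \;\Rightarrow\; \text{AB} \\
X = 80,\ Y = 400,\ BI = 100 &\;\Rightarrow\; X_A = -20 \to 0,\ Y_B = 300 > 0 \;\Rightarrow\; \text{BB} \\
X = 80,\ Y = 90,\ BI = 100 &\;\Rightarrow\; X_A = -20 \to 0,\ Y_B = -10 \to 0 \;\Rightarrow\; \text{NC}
\end{aligned}
\]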

Alternatively, a ratio of 1:2 could also be applied. Accordingly, the present method relies on the principle that the difference between the dyes, seen as a bias in the corrected signal ratio, reflects the genotype that should be called (Figure 5).

Validation of the calls was made by capillary sequencing with 40 SNPs on 27 plant samples. The 40 SNPs were a set of SNPs assayed on oil palm samples for which both GoldenGate assay signals and sequencing data were available for analysis.

After observation of false positive calls compared to the results of capillary sequencing, it was found that the error margin was 15% (Figure 6). In other words, given an X_A:Y_B ratio of 3:1, the ratio value is 3.000, and false positive calls may be present in the region within 15% of that value, i.e. 3.000 +/- 15% (Figure 6). A similar range could be applied for the ratio 1:2. Not all calls falling within the error margin were false. The terms "error margin" and "signal border" are used interchangeably.
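A check of whether a corrected signal ratio falls inside the 15% error margin can be sketched in Python as follows. The text does not give a formula for the percentage variation, so this sketch makes two assumptions: the variation is measured relative to the nearer of the two cut-off values C and 1/C, and the margin applies symmetrically to both. The function name is hypothetical.

```python
from typing import Tuple


def border_variation(ratio: float, c: float = 3.0, margin_pct: float = 15.0) -> Tuple[float, bool]:
    """Return (signed % deviation from the nearer cut-off, within-margin flag).

    Assumption of this sketch: deviation is computed relative to each
    cut-off value (C and 1/C) and the smaller absolute deviation is reported.
    """
    if ratio <= 0 or c <= 1:
        raise ValueError("expects a positive ratio and a cut-off C greater than 1")
    borders = (c, 1.0 / c)
    deviations = [100.0 * (ratio - b) / b for b in borders]
    varpct = min(deviations, key=abs)
    return varpct, abs(varpct) <= margin_pct


# Hypothetical ratios, for illustration only.
near_upper = border_variation(3.3)   # about +10% from 3.000, inside the margin
clear_call = border_variation(4.0)   # about +33% from 3.000, outside the margin
```

A call flagged as inside the margin is not necessarily wrong; as noted above, only some calls within the error margin were false.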

The present corrected signal ratio algorithm for calling SNPs was implemented in a Perl script called GGGTSNPcaller.pl. The code could be ported to other programming languages. The script could output essential data as shown in Table 1, indicating the summary of the sample, the SNP, the SNP address in the array, the expected SNP1 and SNP2 alleles, the corrected signal intensities X_A and Y_B, the genotype call made (homozygous or heterozygous), the SNP called, the corrected signal ratio (S_r), the percentage variation (VARPCT) from the signal borders and the Borders field, i.e. whether the signal is within the 15% borders. Refined data with more details of the call could also be produced.

Table 1 Sample output from signal algorithm analysis

By observing the few good and successful calls in Genome Studio, it was found that corrected signal intensities (after deduction of the background intensity from the dyes) in the 1:2 or 1:3 ratio ranges suggested the correct calls with relatively higher confidence. Benchmarking was done against sequenced genotypes. Table 2 and Figure 7 compare the calls from the corrected signal ratio algorithm (Poh), Genome Studio (GS) and sequencing (Seq). The corrected signal ratio algorithm had a cumulative 56.2% (15.09% + 41.11%) of calls matching sequencing, while Genome Studio had a cumulative 48.6% (41.11% + 7.5%) of calls matching sequencing. This shows that the present corrected signal ratio algorithm correctly identifies a higher percentage of SNPs than Genome Studio on validation with sequencing.

Table 2 Comparison of the present invention (Poh) with capillary sequencing (Seq) and Genome Studio (GS)

Table 3 Description of the Type of comparison

| Type | Description |
| --- | --- |
| AllSame | Sr(Poh), GS and Sequencing call the same result. |
| PohSeqSame | Sr(Poh) and Sequencing call the same result. GS calls a different result. |
| PohGSSame | Sr(Poh) and GS call the same result. Sequencing calls a different result. |
| GSseqSame | GS and Sequencing call the same result. Sr(Poh) calls a different result. |
| AllDiff | Sr(Poh), GS and Sequencing call different results. |
| AllF | Sr(Poh), GS and Sequencing failed to call a result. |
| GS-seqF | GS and Sequencing failed to call a result. Sr(Poh) calls a result. |
| Poh-seqF | Sr(Poh) and Sequencing failed to call a result. GS calls a result. |
| Poh-GSF | Sr(Poh) and GS failed to call a result. Sequencing calls a result. |
| seqF-PohGSSame | Sr(Poh) and GS call the same result. Sequencing failed to call a result. |
| seqF-PohGSDiff | Sr(Poh) and GS call different results. Sequencing failed to call a result. |
| GSF-PohseqSame | Sr(Poh) and Sequencing call the same result. GS failed to call a result. |
| GSF-PohSeqDiff | Sr(Poh) and Sequencing call different results. GS failed to call a result. |
| PohF-GSseqSame | GS and Sequencing call the same result. Sr(Poh) failed to call a result. |
| PohF-GSseqDiff | GS and Sequencing call different results. Sr(Poh) failed to call a result. |
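The comparison types of Table 3 can be assigned mechanically from the three calls made for a SNP. The Python sketch below is an illustration only: the function name is hypothetical, a failed call is represented by None (an assumption of this sketch), and the all-failed type is written as AllF.

```python
from typing import Optional


def comparison_type(poh: Optional[str], gs: Optional[str], seq: Optional[str]) -> str:
    """Classify one SNP by agreement between Sr(Poh), GS and sequencing calls.

    None stands for a failed call (a representation assumed for this sketch).
    """
    n_failed = sum(call is None for call in (poh, gs, seq))
    if n_failed == 3:
        return "AllF"
    if n_failed == 2:
        if poh is not None:
            return "GS-seqF"
        if gs is not None:
            return "Poh-seqF"
        return "Poh-GSF"
    if n_failed == 1:
        if seq is None:
            return "seqF-PohGSSame" if poh == gs else "seqF-PohGSDiff"
        if gs is None:
            return "GSF-PohseqSame" if poh == seq else "GSF-PohSeqDiff"
        return "PohF-GSseqSame" if gs == seq else "PohF-GSseqDiff"
    if poh == gs == seq:
        return "AllSame"
    if poh == seq:
        return "PohSeqSame"
    if poh == gs:
        return "PohGSSame"
    if gs == seq:
        return "GSseqSame"
    return "AllDiff"


# Hypothetical calls, for illustration only.
kind = comparison_type("A:A", "A:B", "A:A")
```

In this hypothetical case the ratio-based call agrees with sequencing while Genome Studio differs, so the SNP is classified as PohSeqSame.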

Figure 8 illustrates the flow diagram of an example of the method according to the invention.

References

Kurg et al., (2000) Arrayed primer extension: Solid phase four color DNA resequencing and mutation detection technology. Genet. Test. 4:1-7.

Olivier (2005) The Invader assay for SNP genotyping. Mutat. Res. 573(1-2):103-110.

Perkel (2008) SNP genotyping: six technologies that keyed a revolution. Nat. Methods 5:447-454.

Ritchie et al., (2009) Bioinformatics 25(19):2621-2623.

Sambrook and Russell, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, New York (2001).

Steemers et al., (2006) Whole-genome genotyping with the single-base extension assay. Nat. Methods 3(1):31-33.

Takitoh et al., (2007) Genome Analysis 23(4):408-

US 2009/0062138

Claims

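1. A method for detecting a nucleic acid variation of a locus in a sample, comprising the steps of:

(i) contacting at least two differentially labelled probes with the sample; wherein the first labelled probe is capable of detecting a first nucleic acid variation A and the second labelled probe is capable of detecting a second nucleic acid variation B;

(ii) detecting a first signal intensity X for the first labelled probe and a second signal intensity Y for the second labelled probe on a support; wherein the first signal intensity X correlates to the presence of the first nucleic acid variation A and the second signal intensity Y correlates to the presence of the second nucleic acid variation B;

(iii) performing background corrections on the first and second signal intensities to give background corrected first signal intensity X_A and background corrected second signal intensity Y_B;

(iv) expressing X_A:Y_B as a ratio (S_r), wherein if X_A:Y_B ≥ C:1 given X_A, Y_B > 0, or if X_A > 0 and Y_B ≤ 0, then the nucleic acid variation is A:A; if X_A:Y_B ≤ 1:C given X_A, Y_B > 0, or if Y_B > 0 and X_A ≤ 0, then the nucleic acid variation is B:B; if X_A:Y_B is between C:1 and 1:C, then the nucleic acid variation is A:B; wherein C is a real number; and if both X_A and Y_B ≤ 0, either both A and B are not present or the nucleic acid variation cannot be determined.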

2. The method according to claim 1, wherein C ≥ 2.

3. The method according to claim 1 or 2, wherein C = 2 or 3.

4. The method according to any one of the preceding claims, wherein C = 3.

5. The method according to any one of the preceding claims, wherein step (iii) comprises subtracting the background intensity (BI) from X and Y.

6. The method according to any one of the preceding claims, wherein step (i) further comprises performing an amplification.

7. The method according to claim 6, wherein amplification is with a polymerase chain reaction (PCR).

8. The method according to claim 7, wherein the PCR is with the first labelled probe, the second labelled probe and a locus specific oligonucleotide as primers to give PCR products.

9. The method according to claim 8, wherein the PCR products are hybridized to locus specific nucleic acid immobilised on the support.

10. The method according to any one of the preceding claims, wherein the method is for detecting different alleles.

11. The method according to any one of the preceding claims, wherein the method is for detecting single nucleotide polymorphisms (SNPs).

12. The method according to any one of the preceding claims, wherein the method comprises a computer-implemented method.

13. A computer system, programmed to perform steps (iii) and (iv) of any one of the preceding claims.

14. A computer program product comprising software executable by a computer system to cause the computer system to perform steps (iii) and (iv) of any one of claims 1 to 12.