Summary of the invention
One aspect of the present invention provides a set of labels comprising at least 2 of the sequences shown in SEQ ID NO:27 ~ 124.
Another aspect provides a set of Tag primers comprising one pair of Tag primers, the structural formula of the Tag primers being
wherein X and X' are each selected from the labels provided by the first aspect of the present invention and may be the same sequence or different sequences; Y is any one of the sequences shown in SEQ ID NO:1, SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7, SEQ ID NO:9, SEQ ID NO:11, SEQ ID NO:13, SEQ ID NO:15, SEQ ID NO:17, SEQ ID NO:19, SEQ ID NO:21, SEQ ID NO:23 and SEQ ID NO:25; and Y' corresponds to Y, the correspondence being: when Y is SEQ ID NO:1, Y' is SEQ ID NO:2; when Y is SEQ ID NO:3, Y' is SEQ ID NO:4; when Y is SEQ ID NO:5, Y' is SEQ ID NO:6; when Y is SEQ ID NO:7, Y' is SEQ ID NO:8; when Y is SEQ ID NO:9, Y' is SEQ ID NO:10; when Y is SEQ ID NO:11, Y' is SEQ ID NO:12; when Y is SEQ ID NO:13, Y' is SEQ ID NO:14; when Y is SEQ ID NO:15, Y' is SEQ ID NO:16; when Y is SEQ ID NO:17, Y' is SEQ ID NO:18; when Y is SEQ ID NO:19, Y' is SEQ ID NO:20; when Y is SEQ ID NO:21, Y' is SEQ ID NO:22; when Y is SEQ ID NO:23, Y' is SEQ ID NO:24; and when Y is SEQ ID NO:25, Y' is SEQ ID NO:26.
Another aspect of the present invention provides a kit comprising the set of labels provided by the first aspect of the present invention and/or the Tag primers provided by the other aspect of the present invention.
Another aspect of the present invention provides use of the aforementioned kit in labeling mixed nucleic acid samples, and/or in pooled sequencing of multiple nucleic acid samples, and/or in determining the sample of origin of the data in pooled sequencing data, and/or in detecting deafness-related gene mutations.
By introducing labels into samples with the Tag primers provided by the invention, or by using the kit of the invention, a number of samples equal to or greater than the number of labels can be distinguished. For example, a sample may be marked with a primer pair whose upstream and downstream primers both carry the same label, so that the nucleic acid amplification products of that sample carry one specific label sequence that distinguishes them from other samples. As another example, a sample may be marked with a primer pair whose upstream and downstream primers carry different labels, so that its amplification products carry two label sequences; the nucleic acid of the sample can then be distinguished as long as either of the two labels differs from those of the other samples. As a further example, a sample may be marked with multiple primer pairs, in which the upstream and downstream primers of each pair carry the same label or different labels, and different pairs carry the same or different labels, so that the amplification products of the sample carry one or more labels in a specific arrangement; a difference in any label, or in the position of any label, is sufficient to distinguish samples.
The set of labels and the Tag primers provided by the invention were obtained by the inventors through repeated testing and creative work. The design took into account the base composition of each sequence, the proportions of the individual bases, and the relationships between sequences, such as between label and label, between label and primer, between primer and primer, and between different Tag primers after the labels are joined to the primers, so that all or any part of these sequences can function together in the same reaction system.
With the primers in the Tag primers provided by the invention, all 13 primer pairs can be used to capture and co-amplify the deafness-related genes common in the Chinese population and to detect mutations in those genes, assisting clinical screening and diagnosis of deafness.
Embodiment
According to one embodiment of the present invention, a set of labels is provided, comprising at least 2 of the sequences shown in SEQ ID NO:27 ~ 124.
According to specific embodiments of the present invention, the set of labels comprises at least 5, at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, or at least 90 of the sequences shown in SEQ ID NO:27 ~ 124, or all 98 of them. The sequences SEQ ID NO:27 ~ 124 are listed in Table 1.
Table 1
The set of labels of this embodiment was obtained by designing a large number of sequences with regard to sequence length, base composition, the proportion of bases at each position, and the relationship to the bases of the other labels, and then screening them in repeated tests. Some or all of the labels of the invention can be placed in the same reaction system without interfering with one another and without interfering with the other reactants or reactions of a conventional system; for example, they do not affect the reaction systems and reactions used in library construction, or the fixed sequences on the sequencing chip.
According to another embodiment of the invention, a set of Tag primers is provided, comprising one pair of Tag primers, the structural formula of the Tag primers being
wherein X and X' are each selected from the labels provided by the first aspect of the present invention; Y is any one of the sequences shown in SEQ ID NO:1, SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7, SEQ ID NO:9, SEQ ID NO:11, SEQ ID NO:13, SEQ ID NO:15, SEQ ID NO:17, SEQ ID NO:19, SEQ ID NO:21, SEQ ID NO:23 and SEQ ID NO:25; and Y' corresponds to Y, the correspondence being:
When Y is SEQ ID NO:1, Y ' is SEQ ID NO:2,
When Y is SEQ ID NO:3, Y ' is SEQ ID NO:4,
When Y is SEQ ID NO:5, Y ' is SEQ ID NO:6,
When Y is SEQ ID NO:7, Y ' is SEQ ID NO:8,
When Y is SEQ ID NO:9, Y ' is SEQ ID NO:10,
When Y is SEQ ID NO:11, Y ' is SEQ ID NO:12,
When Y is SEQ ID NO:13, Y ' is SEQ ID NO:14,
When Y is SEQ ID NO:15, Y ' is SEQ ID NO:16,
When Y is SEQ ID NO:17, Y ' is SEQ ID NO:18,
When Y is SEQ ID NO:19, Y ' is SEQ ID NO:20,
When Y is SEQ ID NO:21, Y ' is SEQ ID NO:22,
When Y is SEQ ID NO:23, Y ' is SEQ ID NO:24,
When Y is SEQ ID NO:25, Y ' is SEQ ID NO:26.
According to a specific embodiment of the present invention, the set of Tag primers comprises 2 pairs of Tag primers, and the Y sequences of these 2 pairs are any 2 of the sequences shown in SEQ ID NO:1, SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7, SEQ ID NO:9, SEQ ID NO:11, SEQ ID NO:13, SEQ ID NO:15, SEQ ID NO:17, SEQ ID NO:19, SEQ ID NO:21, SEQ ID NO:23 and SEQ ID NO:25.
According to a specific embodiment of the present invention, the set of Tag primers comprises 5 pairs of Tag primers, and the Y sequences of these 5 pairs are any 5 of the sequences shown in SEQ ID NO:1, SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7, SEQ ID NO:9, SEQ ID NO:11, SEQ ID NO:13, SEQ ID NO:15, SEQ ID NO:17, SEQ ID NO:19, SEQ ID NO:21, SEQ ID NO:23 and SEQ ID NO:25.
According to a specific embodiment of the present invention, the set of Tag primers comprises 10 pairs of Tag primers, and the Y sequences of these 10 pairs are any 10 of the sequences shown in SEQ ID NO:1, SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7, SEQ ID NO:9, SEQ ID NO:11, SEQ ID NO:13, SEQ ID NO:15, SEQ ID NO:17, SEQ ID NO:19, SEQ ID NO:21, SEQ ID NO:23 and SEQ ID NO:25.
According to a specific embodiment of the present invention, the set of Tag primers comprises 13 pairs of Tag primers, and the Y sequences of these 13 pairs are, respectively, the sequences shown in SEQ ID NO:1, SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7, SEQ ID NO:9, SEQ ID NO:11, SEQ ID NO:13, SEQ ID NO:15, SEQ ID NO:17, SEQ ID NO:19, SEQ ID NO:21, SEQ ID NO:23 and SEQ ID NO:25. The sequences SEQ ID NO:1-26 are shown in Table 2.
Table 2
Primer number | Primer sequence
F1 | TCTTTTCCAGAGCAAACCGC (SEQ ID NO:1)
F2 | ACGTGCATGGCCACTAGGAG (SEQ ID NO:3)
F3 | TGCAGCTGATCTTCGTGTCC (SEQ ID NO:5)
F4 | ATGGTGAGTACGATGCAGAC (SEQ ID NO:7)
F5 | GCCTTTGGTGTGCTAAAGAC (SEQ ID NO:9)
F6 | GGGTTCCAGGAAATTACTTTG (SEQ ID NO:11)
F7 | AAATGATCGGTTTAGACAC (SEQ ID NO:13)
F8 | AGGATCGTTGTCATCCAGTC (SEQ ID NO:15)
F9 | TAGGGCCTATTCCTGATTGG (SEQ ID NO:17)
F10 | CCAAAGCTCCAAATGTATA (SEQ ID NO:19)
F11 | AGAAAAGCTGGAGCAATGCG (SEQ ID NO:21)
F12 | ACACACAATAGCTAAGACCC (SEQ ID NO:23)
F13 | GAGTGCTTAGTTGAACAGGG (SEQ ID NO:25)
R1 | GGGTGTTGCAGACAAAGTCG (SEQ ID NO:2)
R2 | TTGTGGCTGCAAAGGAGGTG (SEQ ID NO:4)
R3 | ACCACAGGGAGCCTTCGATG (SEQ ID NO:6)
R4 | CAAGCTCATCATTGAGTTCC (SEQ ID NO:8)
R5 | GGAGAAGTGTTAAACTCCTG (SEQ ID NO:10)
R6 | ACAGCTAGAGTCCTGATTGC (SEQ ID NO:12)
R7 | TTTCCAGGTTGGCTCCATAT (SEQ ID NO:14)
R8 | AAGGCTGTTGTTCCTACCTG (SEQ ID NO:16)
R9 | CCAGTCCTATTTTCTATGGC (SEQ ID NO:18)
R10 | GTGGATTGGAACTCTGAGC (SEQ ID NO:20)
R11 | GATACATCTGTAGAAAGGTTG (SEQ ID NO:22)
R12 | GATTACAGAACAGGCTCCTC (SEQ ID NO:24)
R13 | AAGCTACACTCTGGTTCGTC (SEQ ID NO:26)
With the set of Tag primers of this embodiment, the corresponding nucleic acid regions in each sample can be amplified, and the amplification products of each sample carry one or more labels in a defined order. This establishes a correspondence between samples and labels, by which pooled sample sequence data, usually a very large volume of mixed data, can be assigned to the correct sample of origin so that the nucleic acid information of each sample can be analyzed.
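As an illustration of how a Tag primer pair and the sample-to-label correspondence might be assembled, the following is a minimal Python sketch. It assumes that each label is joined directly to the 5' end of its primer, as described for the Tag primers in embodiment one, and that the same label is used on both primers of a pair. The primer pair F1/R1 is taken from Table 2; the two 7-nt labels and the sample names are hypothetical placeholders, not sequences from Table 1.

```python
# Primer pair F1/R1 from Table 2 (SEQ ID NO:1 and SEQ ID NO:2).
PRIMER_PAIRS = {
    "F1/R1": ("TCTTTTCCAGAGCAAACCGC", "GGGTGTTGCAGACAAAGTCG"),
}

# HYPOTHETICAL 7-nt labels and sample names, for illustration only.
# Real labels would be chosen from Table 1 (SEQ ID NO:27 ~ 124).
HYPOTHETICAL_SAMPLE_LABELS = {
    "sample_01": "ACGTACG",
    "sample_02": "TGCATGC",
}


def make_tag_primer_pair(label_x, label_x_prime, primer_y, primer_y_prime):
    """Return (X + Y, X' + Y'): each label joined to the 5' end of its primer."""
    return label_x + primer_y, label_x_prime + primer_y_prime


def build_sample_primers(sample_labels, primer_pairs):
    """For every sample, build the Tag primer pairs carrying that sample's label."""
    tagged = {}
    for sample, label in sample_labels.items():
        tagged[sample] = {
            name: make_tag_primer_pair(label, label, fwd, rev)
            for name, (fwd, rev) in primer_pairs.items()
        }
    return tagged


tagged = build_sample_primers(HYPOTHETICAL_SAMPLE_LABELS, PRIMER_PAIRS)
for sample, pairs in tagged.items():
    for name, (fwd, rev) in pairs.items():
        print(sample, name, fwd, rev)
```

The dictionary of sample labels doubles as the recorded sample-to-label correspondence that is later used to sort the pooled reads.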
According to another embodiment of the present invention, a kit is provided, comprising the labels provided by the first aspect of the present invention. According to a specific embodiment of the present invention, the kit further comprises the Tag primers provided by the other aspect.
According to another embodiment of the present invention, use of the aforementioned kit is provided in labeling mixed nucleic acid samples, and/or in pooled sequencing of multiple nucleic acid samples, and/or in determining the sample of origin of the data in pooled sequencing data.
With the sequences, kit, or method provided by the invention, multiple samples can be marked and distinguished, in numbers equal to, greater than, or much greater than the number of labels, so that the data of multiple samples can be processed together and finally separated, by means of the recorded marking correspondence, and assigned to each sample.
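A short worked count shows why the number of distinguishable samples can exceed the number of labels. The sketch below is a simplified illustration under stated assumptions: each sample is marked with one primer pair, the label on the upstream primer and the label on the downstream primer can both be read, and every ordered combination of two labels is usable. The function name is hypothetical.

```python
def distinguishable_samples(n_labels: int, same_label_both_ends: bool) -> int:
    """Upper bound on samples that one primer pair per sample can distinguish."""
    if same_label_both_ends:
        # One label per sample: as many samples as labels.
        return n_labels
    # Ordered (upstream, downstream) label combinations, including identical pairs.
    return n_labels * n_labels


print(distinguishable_samples(98, same_label_both_ends=True))   # 98
print(distinguishable_samples(98, same_label_both_ends=False))  # 9604
```

In practice the usable number would be lower than this bound if some label combinations are excluded.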
Embodiment one: mixed-sample library construction and sequencing
Blood samples from 96 individuals were obtained from Tianjin healthcare hospital for women & children. Reagents and instruments were conventional commercial products, available for example from Life Technologies.
The nucleic acids of the 96 samples were placed in separate wells of a 96-well plate, and a PCR labeling reaction was carried out on all samples in the plate. After DNA was extracted from the blood samples, the DNA template of each sample was added to a different well, and 13 pairs of Tag primers were added to each well. The Tag primers were prepared in advance by joining a label sequence to the 5' end of each primer. The correspondence between each sample and the label (barcode or index) introduced by PCR was recorded. The plate was then placed in a PCR instrument for amplification, yielding amplification products of the target sequences. The components of the multiplex PCR reaction system and their amounts or proportions can follow known multiplex PCR reaction systems, for example the one described in Hayden MJ, Nguyen TM, Waterman A, Chalmers KJ. 2008. Multiplex-ready PCR: a new method for multiplexed SSR and SNP genotyping. BMC Genomics, doi:10.1186/1471-2164-9-80. The labels and primers used to synthesize the Tag primers were selected from Table 1 and Table 2, respectively. The labeled amplification products were then pooled and purified, and a pooled library was constructed according to the library construction instructions of the sequencing platform used, here the Proton library construction manual, including end repair and adapter ligation. One library thus contained 96 samples. Library fragment size and concentration were checked with an Agilent Bioanalyzer 2100 to obtain a qualified library for sequencing, and the library was sequenced following the operating procedure of the Ion Proton semiconductor chip sequencing platform.
Because the amplified regions in this embodiment are small, the data volume of one sample is far below that of one lane of the Proton sequencer. The multiplex sequencing technique of Illumina (Meyer M, Kircher M. 2010. Illumina Sequencing Library Preparation for Highly Multiplexed Target Capture and Sequencing. Cold Spring Harb. Protoc.; doi:10.1101/pdb.prot5448) can therefore be transplanted to the Proton platform: for example, multiple pooled libraries can be marked by ligating labeled adapters during library construction or by introducing new labels through amplification, and the multiple pooled libraries can then be mixed and sequenced together. Fig. 1 illustrates amplifying and labeling each sample with the label sequences of the invention and pooling the sample nucleic acids for library construction to obtain one or more libraries. If multiple pooled-sample libraries are built, new labels are introduced at adapter ligation or by amplification after adapter ligation. The labels introduced here may come from a published label set, or may be selected from the sequences in Table 1 other than those used to mark the samples, in order to distinguish the multiple pooled libraries.
Embodiment two: classification of pooled data
After the raw sequencing data are obtained, reads that are too short are screened out by read length, for example by filtering out reads shorter than 50 bp, and the remaining reads are then classified according to the correspondence between labels and samples.
A dedicated reference sequence is built, consisting of the target sequences (the target amplification regions). The boundaries of each target region can be determined from the positions at which the primers of embodiment one align to the reference genome. The pooled data can be classified by a conventional method: for each read that aligns to the reference sequence with a soft clip at one or both ends (a soft clip read), the 7 bp at the 5' end, that is, the length of the label, are extracted and matched against the label sequences used to mark the samples, and the read is assigned to the matching sample.
A soft clip read usually refers to a read that, when aligned back to the genome, is split into two segments that match different regions. This generally occurs because a genomic segment has been deleted or because of transcript splicing, so that the read spans the deleted segment or the splice site. Here, a soft clip read refers to a read of which only a part aligns to the dedicated reference sequence; the unaligned segment is called the soft clip.
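The length filter and the conventional 5'-end classification described above can be sketched as follows. This is a simplified illustration, not the procedure actually used: it skips the alignment step and takes the first 7 bases of each read directly as the candidate label. The function name, the read representation as `(name, sequence)` tuples, and the label in the usage note are assumptions.

```python
from collections import defaultdict

MIN_READ_LEN = 50  # reads shorter than this are discarded
LABEL_LEN = 7      # label length in bp


def classify_by_prefix(reads, label_to_sample):
    """Group reads by the sample whose label matches their first LABEL_LEN bases.

    reads: iterable of (name, sequence) tuples.
    label_to_sample: dict mapping a label sequence to a sample name.
    Returns a dict: sample -> list of (name, sequence with the label removed).
    """
    by_sample = defaultdict(list)
    for name, seq in reads:
        if len(seq) < MIN_READ_LEN:
            continue  # too short, filtered out
        sample = label_to_sample.get(seq[:LABEL_LEN])
        if sample is not None:
            by_sample[sample].append((name, seq[LABEL_LEN:]))
    return dict(by_sample)
```

For example, with the hypothetical mapping `{"ACGTACG": "sample_01"}`, a 57-base read beginning with `ACGTACG` is assigned to `sample_01` with its first 7 bases removed, while a read of 17 bases is dropped by the length filter.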
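Assuming the alignments are available in SAM format, the length of the soft clip at each end of a read can be read off the CIGAR string, in which the operation `S` denotes soft-clipped bases and `H` denotes hard-clipped bases. The helper below is a minimal sketch with a hypothetical function name; it only parses the CIGAR string and does not read SAM files.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths encoded in a SAM CIGAR string."""
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    # Hard clips, if present, sit outside the soft clips; ignore them.
    ops = [(length, op) for length, op in ops if op != "H"]
    if not ops:
        return 0, 0
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


print(soft_clip_lengths("7S93M7S"))  # (7, 7)
print(soft_clip_lengths("100M"))     # (0, 0)
```

Note that SAM stores reverse-strand alignments reverse-complemented, so for such reads the left clip corresponds to the 3' end of the original read.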
The inventors found that when the raw data from this semiconductor chip sequencing platform are aligned directly to the reference sequence composed of the designed amplification regions (capture regions) and the alignment positions of the reads on the reference are examined, an alignment that starts at the 8th base of a read indicates that the read carries, at its 5' end, exactly one complete barcode of 7 bp. Likewise, when the end of the alignment lies 7 bp from the 3' end of the read, the read carries a complete barcode at its 3' end. According to the statistics, 64% of reads have a complete barcode at both the 5' end and the 3' end, 14% have a complete barcode only at the 5' end, and 12% have a complete barcode only at the 3' end.
Based on this finding, the inventors propose another method of classifying pooled data, which can improve the accuracy of assigning reads to samples by at least 12%. This method also begins by aligning the reads to the reference sequence composed of the target sequences (amplification regions), using the Life Technologies alignment software tmap with default parameters. From the soft clip read information in the alignment result, for each soft clip read, when a soft clip of 7 bp occurs at its 5' end or 3' end, the read is deemed to carry a complete barcode sequence at that end. If a read has a complete barcode sequence at both the 5' end and the 3' end, the two are compared, and the read is discarded if they differ. If a read has a barcode sequence at only the 5' end or only the 3' end, that sequence is taken as the barcode sequence of the read. From the previously recorded correspondence between barcodes and samples, the sample corresponding to the barcode sequence is identified; the part of the read's sequence and quality values corresponding to the barcode is trimmed off, and the remainder is included in the data of that sample. In this step, the data utilization rate can reach 90%.
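The classification rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the soft-clip lengths at the two ends of the read are supplied by the caller (for example, taken from the alignment result), the read is in its original orientation, and the barcode at the 3' end is assumed to appear as the reverse complement of the label, so it is reverse-complemented before comparison. The function names and the example barcode are hypothetical.

```python
BARCODE_LEN = 7
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_read(seq, qual, left_clip, right_clip, barcode_to_sample):
    """Assign one read to a sample from 7-bp soft clips at its ends.

    Returns (sample, trimmed_seq, trimmed_qual), or None if the read is discarded.
    ASSUMPTION: the 3' barcode is the reverse complement of the label.
    """
    five = seq[:BARCODE_LEN] if left_clip == BARCODE_LEN else None
    three = revcomp(seq[-BARCODE_LEN:]) if right_clip == BARCODE_LEN else None
    if five is None and three is None:
        return None  # no complete barcode at either end
    if five is not None and three is not None and five != three:
        return None  # barcodes at the two ends disagree
    barcode = five if five is not None else three
    sample = barcode_to_sample.get(barcode)
    if sample is None:
        return None  # barcode not in the recorded correspondence
    # Trim the barcode from the sequence and from the quality values.
    start = BARCODE_LEN if five is not None else 0
    end = len(seq) - BARCODE_LEN if three is not None else len(seq)
    return sample, seq[start:end], qual[start:end]
```

A read accepted on the strength of a single end keeps the unclipped bases at its other end, which is how reads with only a 5' or only a 3' barcode still contribute data.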
Embodiment three: detection of variation in sample nucleic acids
After the pooled sequencing data have been correctly classified, variant detection analysis is carried out separately for each sample, using the alignment software tmap and the variant detection software tvc provided by Life Technologies. These programs can run on the server supplied with the Proton sequencer; their default parameters are all set for the human genome and can be adjusted. With reference to the documentation of tmap and Torrent Variant Caller (tvc), alignment and variant detection can be completed using the amplification region information and the correctly classified reads. The core principle of the variant detection is Bayesian inference (Shoemaker JS, Painter IS, Weir BS. 1999. Bayesian statistics in genetics: A guide for the uninitiated. Trends Genet 15:354–358). Combining the alignment result with the base quality values of the reads, the algorithm derives the genotype at each site and thereby determines whether the gene is mutated.
By amplifying with the 13 pairs of Tag primers simultaneously, variant detection can be carried out for four deafness-related genes: GJB2, GJB3, SLC26A4 and 12S rRNA. In a test of 300 samples in total, including the 96 samples above, the results were 100% concordant with Sanger sequencing. The results of this method for some of the samples are shown in Table 3. Because humans are diploid, each result in Table 3 is given as two letters. For a SNP, the bases are given directly; for example, the result for the point mutation "9G>A" in sample "14HL078963" is "G.G", meaning that neither allele at this site is mutated, that is, there is no SNP at this site. For insertion/deletion (indel) mutations, R denotes no mutation and V denotes mutation. Sanger sequencing of the mutations in these samples gave the same results as Table 3. In addition, this method generally takes 2 ~ 3 days, which is 2 days shorter than mass spectrometry; its sequencing cost is only 1% of that of whole-genome sequencing; and a single sequencing run can test 500 or more samples simultaneously, greatly increasing detection throughput.
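To make the idea of Bayesian genotype inference concrete, the following is a generic textbook-style sketch for a single diploid site. It is not the algorithm implemented in tvc; the error model (a miscalled base is equally likely to be any of the other three bases), the flat default prior, and the function name are all assumptions for illustration.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, priors=None):
    """Posterior probability of each diploid genotype at one site.

    bases: observed base calls covering the site, e.g. "GGGAG".
    quals: Phred quality for each call.
    ref, alt: the two candidate alleles (must differ).
    priors: optional dict genotype -> prior probability (flat if omitted).
    """
    genotypes = {ref + ref: (ref, ref), ref + alt: (ref, alt), alt + alt: (alt, alt)}
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for g, alleles in genotypes.items():
        total = math.log(priors[g])
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # Phred score to error probability
            # Each of the two alleles is sampled with probability 1/2.
            p = sum(0.5 * ((1 - err) if base == a else err / 3) for a in alleles)
            total += math.log(p)
        log_post[g] = total
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


post = genotype_posteriors("G" * 10, [30] * 10, ref="G", alt="A")
print(max(post, key=post.get))  # GG
```

With ten high-quality G calls and no A calls, the homozygous reference genotype dominates, which corresponds to a site reported as unmutated on both alleles.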
Table 3