
CN107577921A - Method for analyzing tumor target gene sequencing data

Method for analyzing tumor target gene sequencing data

Info

Publication number
CN107577921A
Authority
CN
China
Prior art keywords
mutation
sequence
data
sequencing
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710739726.0A
Other languages
Chinese (zh)
Inventor
李志广
吕德康
张学红
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloud One Biological Technology (dalian) Co Ltd
Original Assignee
Cloud One Biological Technology (dalian) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloud One Biological Technology (dalian) Co Ltd
Priority to CN201710739726.0A
Publication of CN107577921A
Legal status: Pending

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for analyzing tumor target gene sequencing data, belonging to the field of high-throughput genomic sequencing. The analysis steps are: obtaining reads that carry mutation information; sequencing quality control; quality control of the amplification efficiency of the targeted region corresponding to each amplicon; deleting primer sequences from the reads; aligning the reads to the reference sequences of the target regions and obtaining the alignment results; identifying mutations in the reads aligned to the target regions; screening samples by the sequencing depth at each mutation site; screening for significantly different mutations against recorded mutations; and association analysis with case data. The invention can apply a general analysis workflow to the different amplicon libraries customized by different users and, by incorporating clinical information, obtain mutation analysis results significantly associated with disease. It adds a comprehensive assessment of targeted-library amplification efficiency and a primer-trimming step, which improves the reliability of the analysis results.

Description

Method for analyzing tumor target gene sequencing data
Technical field
The present invention relates to the analysis of high-throughput genomic sequencing data. Specifically, it covers quality control of target gene library sequencing data, evaluation and filtering of amplicon efficiency, genome alignment, and mutation identification and annotation, followed by statistical analysis against recorded mutations and case data. It provides tumor patients with a complete solution for detecting and analyzing tumor gene mutations that is not tied to a fixed panel, and provides technical support for personalized tumor medicine.
Background
Tumors are, in essence, genetic diseases. Various environmental and hereditary carcinogenic factors act together to damage DNA, thereby activating proto-oncogenes and/or inactivating tumor suppressor genes; together with changes in apoptosis-regulating genes and/or DNA repair genes, this leads to abnormal expression and gradually transforms target cells into cancer cells. The transformed cells first proliferate polyclonally. Over a long, multistage evolutionary process, one clone expands relatively without limit, accumulates additional mutations, and selectively forms subclones with different characteristics (heterogenization), thereby acquiring the ability to infiltrate and metastasize (malignant transformation) and forming a malignant tumor.
Tumor gene testing extracts genetic material from human cells and uses sequencing to detect oncogenes or tumor susceptibility genes, for use in tumor prevention, diagnosis, prognosis prediction, targeted drug selection, postoperative monitoring, and so on.
Targeted sequencing sequences PCR products of specific length or captured fragments and analyzes the variants in those sequences. It can sequence target regions at high coverage according to different needs and can also detect low-frequency mutations. As sequencing costs fall and human functional genomics research deepens, targeted sequencing is moving from research institutions into the clinic, where it is used for genetic screening, disease risk assessment, tumor diagnosis and treatment, precision medication, and other fields.
Targeted sequencing data analysis currently has three problems. First, no targeted sequencing analysis tool yet combines known mutation databases or case data for statistical analysis, so none can provide mutation analysis results significantly associated with disease. Second, general analysis software only supports sequencing data from fixed panel libraries, such as Illumina's TruSeq Amplicon, and cannot handle the targeted libraries of users with different needs. Third, existing methods neither assess the amplification efficiency of amplicons nor trim primer sequences; leaving both unaddressed gives the SNV results of downstream mutation analysis a high false-positive rate, which affects the conclusions.
Summary of the invention
In view of the defects of existing analysis methods, the present invention provides a complete solution for analyzing sequencing data from independently designed target gene panels.
The object of the invention is achieved through the following technical solution:
A method for analyzing tumor target gene sequencing data, characterized by comprising the following steps:
Step 1: Obtain reads carrying mutation information, i.e. the high-throughput sequencing data;
Step 2: Quality control of the sequencing data: analyze the quality of all sequencing data obtained in step 1 with FastQC to obtain a sequencing quality control report, and filter out the data reported as low quality;
Step 3: Count the amplification efficiency of the different amplification regions and delete data from abnormally amplified regions;
Step 4: Delete the primer sequences from the reads, leaving the true target-region DNA sequence in each read;
Step 5: Align all sequences obtained in step 4 to the target regions to obtain the alignment result data;
Step 6: Detect all mutations in the alignment result data with mutation identification tools;
Step 7: Count the sequencing depth of every covered base in the amplification regions, and screen for highly reliable mutations by the sequencing depth at each mutation site;
Step 8: Screen for significantly different mutations against the mutations annotated in cancer-related databases;
Step 9: Using the clinical data of the cases, statistically analyze the mutations and identify germline mutations and somatic mutations significantly associated with traits;
Step 10: Generate a graphical data analysis report.
The sequencing data come from targeted amplicon libraries sequenced on high-throughput platforms including Illumina MiSeq/HiSeq. The target regions are customizable: the genomic positions of all target sites are supplied when the analysis is first run.
The low-quality data in step 2 are reads whose mean per-base sequencing quality score is below 20.
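The mean-quality criterion can be illustrated with a short Python sketch. It is not the FastQC implementation; it assumes Phred+33 encoded FASTQ quality strings (the usual encoding for recent Illumina data), and the function names and the `(name, sequence, quality)` record layout are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 is assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def is_low_quality(quality, threshold=20.0):
    """True when the mean per-base quality of the read is below the threshold."""
    return mean_phred(quality) < threshold


def filter_fastq_records(records, threshold=20.0):
    """Keep (name, sequence, quality) records whose mean quality is >= threshold."""
    return [rec for rec in records if not is_low_quality(rec[2], threshold)]
```

A read with a mean score of exactly 20 is kept, since the criterion is "below 20".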
In step 3, abnormal amplification-region data are identified and deleted as follows:
(1) align the reads obtained by sequencing to the targeted reference genome;
(2) determine whether the amplicon primer sequences at the two ends of a read pair come from the same primer pair, allowing up to 2 mismatches each between the forward and reverse primers and the 5' and 3' ends of the sequenced fragment, and remove reads that do not meet this condition;
(3) count the reads covering the targeted amplification region of each amplicon pair, and use this read count to measure and compare amplification efficiency;
(4) when an amplicon's read count is below 1/3 of the mean read count over all amplicons, judge its amplification region abnormal and delete all reads amplified from it from the analysis.
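The rule in item (4) can be sketched in Python as follows. This is a minimal illustration, assuming the per-amplicon read counts from item (3) are already available as a dictionary; the function names and the `(read_id, amplicon_id)` layout are hypothetical.

```python
def abnormal_amplicons(read_counts, fraction=1 / 3):
    """Return the amplicons whose read count is below `fraction` of the mean count."""
    if not read_counts:
        return set()
    mean_count = sum(read_counts.values()) / len(read_counts)
    cutoff = mean_count * fraction
    return {amplicon for amplicon, n in read_counts.items() if n < cutoff}


def drop_abnormal_reads(reads, read_counts):
    """Remove reads, given as (read_id, amplicon_id), that belong to abnormal amplicons."""
    bad = abnormal_amplicons(read_counts)
    return [read for read in reads if read[1] not in bad]
```

With counts of 900, 1000 and 100 reads, the mean is about 667 and the cutoff about 222, so only the third amplicon is flagged.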
The alignment in step 5 uses the target-region genomic positions supplied at the start to extract the nucleotide sequences of those target regions from the genome and build an index.
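A minimal Python sketch of the extraction step is given below. It assumes the genome is already loaded as a dictionary of chromosome name to sequence and that target positions are 0-based, half-open intervals (BED style); both are assumptions, and the function names are hypothetical. The resulting FASTA text would then be indexed by the chosen aligner (for BWA, with its `index` command), which is not shown here.

```python
def extract_regions(genome, regions):
    """Cut target sequences out of a genome.

    genome:  dict mapping chromosome name -> sequence string
    regions: list of (name, chrom, start, end), 0-based half-open (assumed)
    """
    targets = {}
    for name, chrom, start, end in regions:
        seq = genome[chrom]
        if not 0 <= start < end <= len(seq):
            raise ValueError(f"region {name} lies outside {chrom}")
        targets[name] = seq[start:end]
    return targets


def to_fasta(targets, width=60):
    """Format the extracted targets as FASTA text, wrapped at `width` bases."""
    lines = []
    for name, seq in targets.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```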
For every mutation site in step 7, only cases whose sequencing depth at that site exceeds 100× are included in the statistical analysis.
In step 8, significantly different mutations are screened using the data stored in currently known cancer-related databases.
The association analysis of clinical data in step 9 applies the chi-square test function in the R statistical software to the patients' clinical data, including age, sex, cancer TNM stage, tumor volume, tumor size, lymph node invasion, Ki67 malignancy grade, and tissue subtype, to find the risk factors associated with the occurrence of specific gene mutations, that is, whether patients with a given clinical feature are prone to particular mutations.
The reference sequences are reference tumor genes or the human reference genome sequence from the UCSC public database.
Beneficial effects of the invention: the method for analyzing tumor target gene sequencing data comprises (1) obtaining reads carrying mutation information, i.e. the sequencing data; (2) sequencing quality control; (3) quality control of target-region amplification efficiency; (4) deleting primer sequences from the reads; (5) aligning the reads to the reference target sequences to obtain aligned reads; (6) identifying mutations in the reads; (7) screening samples by the sequencing depth at each mutation site; (8) screening for significantly different mutations against the mutations annotated in cancer-related databases; (9) association analysis with the clinical data of the cases; (10) generating a graphical data analysis report. It meets the analysis needs of targeted libraries that are not tied to a fixed panel and, using known mutation databases or case data, provides mutation analysis results significantly associated with disease.
The method combines known reference mutation databases with the patients' clinical data and uses different statistical tests to screen for significantly different mutations. The mutation databases fall into two classes, germline mutation databases and somatic mutation databases. Commonly used germline mutation databases include the 1000 Genomes Project database (http://www.1000genomes.org/) and the ExAC integrated database of about 60,000 human exomes (http://exac.broadinstitute.org/). Commonly used somatic mutation databases include The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/) and the International Cancer Genome Consortium (ICGC) database (https://dcc.icgc.org/). Four quantities are generally needed: first, the proportion of mutation carriers, i.e. the number of patients carrying the mutation; second, the proportion of the mutation in the population and the population allele frequency; third, the proportion of homozygous carriers; fourth, the proportion of heterozygous carriers. With these four quantities, Fisher's exact test can be used to screen for significantly different mutations (i.e. gene mutations associated with tumorigenesis). The method applies the chi-square test function in the R statistical software to the patients' clinical data (including age, sex, cancer TNM stage, tumor volume, tumor size, lymph node invasion, Ki67 malignancy grade, and tissue subtype) for association analysis, to find the risk factors associated with the occurrence of specific gene mutations in lung cancer, that is, whether patients with a given clinical feature are prone to particular mutations.
On the one hand, the method is more general than existing methods. On the other hand, it comprehensively assesses the amplification efficiency of user-customized amplicon targeted libraries and trims primer sequences, keeping the amplification efficiency of different amplicons at roughly the same level as far as possible, so as to avoid false-positive SNV results caused by differences in amplification efficiency between amplicons. In summary, the method not only applies a general analysis workflow to the different amplicon libraries customized by different users and, by incorporating clinical information, obtains mutation analysis results significantly associated with disease; it also adds a comprehensive assessment of targeted-library amplification efficiency and a primer-trimming step, so that the whole analysis method is novel and also improves the reliability of the analysis results.
Brief description of the drawings
Fig. 1 is a flow chart of the implementation of the method of the invention.
Detailed description
Several high-throughput sequencing platforms exist, including Illumina NextSeq, MiSeq, and HiSeq. The embodiments of the invention are described using the Illumina HiSeq/MiSeq sequencing platforms.
The method provided by the invention is suitable for mutation detection in targeted DNA or RNA, and is therefore described through separate embodiments. Sample DNA/RNA extraction, library construction, high-throughput sequencing, and so on in the embodiments use existing techniques.
Where specific conditions are not stated in the embodiments, conventional conditions or those recommended by the manufacturer are used; reagents or instruments whose manufacturer is not stated are conventional products available commercially.
Embodiment one: analysis of target gene sequencing data from pleural effusion samples of 10 lung cancer patients:
The libraries in this embodiment are targeted amplicon libraries built from cell-free DNA from the pleural effusion samples of 10 lung cancer patients. Library construction proceeds as follows:
(1) Selection of target genes: tumor hotspot mutation genes, proto-oncogenes, tumor suppressor genes, and genes acted on by targeted drugs were selected, specifically the 48 genes ABL1, EGFR, GNAS, MLH1, RET, AKT1, ERBB2, HNF1A, ALK, ERBB4, HRAS, NOTCH1, SMARCB1, APC, FBXW7, IDH1, NPM1, SMO, ATM, FGFR1, JAK2, NRAS, SRC, BRAF, FGFR2, JAK3, PDGFRA, STK11, CDH1, FGFR3, KDR, PIK3CA, TP53, CDKN2A, FLT3, KIT, PTEN, VHL, CSF1R, GNA11, KRAS, PTPN11, EZH2, TNNB1, GNAQ, MET, RB1, and IDH2, as the target genes of our study.
(2) Extraction and quantification of cell-free DNA: for the pleural effusion samples of lung cancer patients, we first centrifuge at low speed (3,000 rpm) for 5 minutes and collect the supernatant, then centrifuge at high speed (14,000 rpm) for 10 minutes and collect the supernatant, obtaining the cell-free DNA in the pleural effusion sample (average length about 166 bp); the DNA is quantified with a Qubit 2.0 instrument (Invitrogen).
(3) Amplicon design: primers for the 48 target genes were designed with the online primer design tool DesignStudio. This yielded 2,158 amplicon primer pairs covering all exon regions of the 48 target genes, each amplicon fragment being about 150 bp. Because the target genes differ in sequence length while the amplicon fragment size is almost fixed, each target gene corresponds to a different number of amplicon primer pairs. Table 1 lists the target genes and the number of amplicon primer pairs for each.
Table 1
(4) Multiplex PCR amplification of target gene exons: once the amplicon primers are designed, the primer oligonucleotides are synthesized from the primer sequences given in the design report, and all exon sequences of the target genes are amplified by multiplex PCR.
(5) Ligation of Illumina sequencing adapters and library PCR amplification: Illumina sequencing adapters are ligated to the amplification products above. The adapter sequences are as follows:
Upstream sequence: 5'P-NNN……NNNGATCGGAAGAGCACACGTCTGAA-3'
Downstream sequence: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCNNN……NNNT-3'
After adapter ligation, the library is PCR-amplified for 6 to 15 cycles, depending on the amount of starting template, using the KAPA HiFi HotStart PCR kit.
(6) Library quality inspection and Q-PCR quantification: library quality is checked by agarose gel electrophoresis (2% agarose gel, 120 V, 30 minutes) followed by gel imaging; the target band is 270 bp. Library fragment size is determined precisely with an Agilent 2100 Bioanalyzer, and library concentration is quantified precisely by Q-PCR.
(7) Sequencing on the MiSeq sequencer: paired-end sequencing data with a read length of 75 bp are obtained on the Illumina MiSeq platform.
Referring to Fig. 1, the specific steps of this embodiment are:
S101: Through amplicon library construction and sequencing, we obtain reads (paired-end 75 bp sequencing data) covering the nucleotide sequence of all exon regions of the 48 target genes.
S102: Quality control is performed on all reads in the sequencing data with FastQC; reads whose mean per-base sequencing quality score is below 20 are classed as poor quality and deleted from the analysis.
S103: The sequencing data are filtered by amplicon primer sequence: the amplicon primer sequences in the two reads of each pair are extracted, and read pairs whose primer sequences do not come from the same primer pair are removed. The number of reads covering the targeted amplification region of each amplicon primer pair is then counted, the amplification efficiencies of the different amplicons are compared, and the reads belonging to abnormally amplified amplicons are deleted.
In this embodiment, if the amplicon primer sequences at the two ends of a read pair come from the same primer pair that we designed, the read is considered to derive from amplification by that amplicon primer pair. The reads of all amplicons meeting this condition are counted and the amplification efficiencies of all amplicons are compared (amplification efficiency is measured here by read count). When an amplicon's read count is below 1/3 of the mean read count over all amplicons, the amplicon is judged abnormal and all reads amplified from it are deleted from the analysis.
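The same-primer-pair check can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used in the embodiment: read 1 is assumed to begin with the forward primer and read 2 with the reverse primer (only that orientation is tested), up to 2 mismatches are tolerated per primer, and all names are hypothetical.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def starts_with_primer(read, primer, max_mismatches=2):
    """True when the read begins with the primer, allowing a few mismatches."""
    if len(read) < len(primer):
        return False
    return mismatches(read[:len(primer)], primer) <= max_mismatches


def assign_amplicon(read1, read2, primer_pairs, max_mismatches=2):
    """Return the amplicon whose forward primer starts read1 and whose reverse
    primer starts read2, or None when the two ends do not share a primer pair.

    primer_pairs: dict amplicon_id -> (forward_primer, reverse_primer)
    """
    for amplicon_id, (forward, reverse) in primer_pairs.items():
        if (starts_with_primer(read1, forward, max_mismatches)
                and starts_with_primer(read2, reverse, max_mismatches)):
            return amplicon_id
    return None
```

Read pairs for which `assign_amplicon` returns `None` would be removed, and the returned identifiers can be tallied to obtain the per-amplicon read counts.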
S104: Delete the amplicon primer sequences from the reads, to improve the accuracy of mutation identification and the efficiency of alignment.
In this embodiment, we wrote a Python program that compares the paired reads with the sequences of the known amplicon primer pairs and deletes the part of each read that matches a primer sequence, leaving the true target gene DNA sequence. We truncate the reads to remove the primer portion for two reasons. First, it prevents bases arising from primer synthesis errors from being identified as mutations. Second, the shortened sequences reduce the subsequent alignment time.
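The text does not reproduce that Python program. The following is a minimal sketch of primer trimming under stated assumptions: the read's own primer is removed from the 5' end when it matches with at most 2 mismatches, and the reverse complement of the partner primer is removed from the 3' side only when it occurs exactly. The function names and the exact-match rule for the 3' side are hypothetical choices made for this sketch.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_primers(read, own_primer, partner_primer, max_mismatches=2):
    """Remove primer-derived bases from one read of a pair."""
    # 5' end: drop the read's own primer if it matches within the tolerance.
    head = read[:len(own_primer)]
    if (len(head) == len(own_primer)
            and sum(a != b for a, b in zip(head, own_primer)) <= max_mismatches):
        read = read[len(own_primer):]
    # 3' side: when the read runs through the whole insert, it continues into
    # the reverse complement of the partner primer; cut from there onward.
    index = read.find(reverse_complement(partner_primer))
    if index != -1:
        read = read[:index]
    return read
```

Reads shorter than the insert, such as 75 bp reads on amplicons of about 150 bp, would normally only need the 5' trim; the 3' check matters when reads are long enough to reach the opposite primer.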
S105: First, we download the human reference genome sequence hg19 from the UCSC Genome Browser database (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19). Second, we wrote a program that extracts the reference sequences of the 48 target genes from the whole human genome sequence (hg19). Third, we align all reads obtained in the previous step to the reference sequences of the target genes, obtaining BAM files that record the alignment results.
In this embodiment, reads are aligned to the target gene reference sequences rather than to the whole human genome sequence, which improves alignment accuracy and efficiency. Eukaryotic genes are composed of exons and introns, so aligning directly to the reference sequences of the target genes is more direct and accurate. Alignment uses BWA; in other embodiments other aligners, such as Bowtie or SOAP2, may also be used.
S106: For the BAM files from the alignment above, we apply mutation identification tools to detect single nucleotide mutations located in the target gene regions.
In this embodiment, mutation identification uses two tools, VarScan2 and MuTect; the intersection of the two resulting mutation lists is taken as the result data for subsequent analysis.
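Taking the intersection of two callers' results can be sketched as below. The sketch assumes each caller's output has already been parsed into `(chrom, pos, ref, alt)` tuples; parsing the tools' native output formats is not shown, and the function names are hypothetical.

```python
def variant_key(record):
    """Normalize one call so that the same variant compares equal across callers."""
    chrom, pos, ref, alt = record
    return (chrom, int(pos), ref.upper(), alt.upper())


def consensus_calls(calls_a, calls_b):
    """Variants reported by both callers, as a sorted list of normalized keys."""
    keys_a = {variant_key(record) for record in calls_a}
    keys_b = {variant_key(record) for record in calls_b}
    return sorted(keys_a & keys_b)
```

Matching on position and alleles together, rather than on position alone, avoids treating two different substitutions at the same site as agreement.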
S107: Screen for mutation sites with sequencing depth above 100×.
In this embodiment, we first use the depth subcommand of samtools to obtain the sequencing depth at each mutation site, then reject mutation sites whose depth falls below a threshold. As those skilled in the art know, SNV detection in a region by high-throughput sequencing currently generally requires 30× sequencing data for that region, and the higher the sequencing depth, the more reliable the allele frequencies obtained. The threshold is set according to sequencing depth: a larger threshold leaves SNVs that are more accurate and reliable but reduces the data available downstream; a smaller threshold leaves more data but of lower reliability, and statistical analysis that includes low-reliability SNVs yields more false-positive SNVs. Here we set the threshold for screening mutation sites to 100×; in other embodiments the threshold may be changed as appropriate.
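The depth filter can be sketched in Python as follows. It assumes the tab-separated, three-column output of `samtools depth` for a single BAM file (chromosome, 1-based position, depth) and variants given as `(chrom, pos, ref, alt)` tuples. Keeping only sites with depth strictly greater than the threshold follows the "above 100×" wording; how to treat a depth of exactly 100 is an assumption.

```python
def parse_depth(lines):
    """Parse `samtools depth` style lines into {(chrom, pos): depth}."""
    depth = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, pos, value = line.split("\t")[:3]
        depth[(chrom, int(pos))] = int(value)
    return depth


def filter_by_depth(variants, depth, min_depth=100):
    """Keep variants whose site depth exceeds min_depth; unlisted sites count as 0."""
    return [v for v in variants if depth.get((v[0], v[1]), 0) > min_depth]
```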
S108: Based on mutation frequency, a significance test is performed against the SNVs recorded in existing mutation databases, and significantly different mutations are screened (P<0.05). Finally, the significantly different mutations are functionally annotated with ANNOVAR, which annotates the SNVs to genes and to various mutation databases, showing each mutation's type (synonymous, nonsynonymous, nonsense, etc.) and whether it affects the function of the protein encoded by the gene, and thereby further revealing the role of these mutations in the formation and development of lung cancer.
In this embodiment, mutation frequencies come from the output of the SNV identification tools. The reference mutation databases fall into two classes, germline mutation databases and somatic mutation databases. Commonly used germline mutation databases include the 1000 Genomes Project database (http://www.1000genomes.org/) and the ExAC integrated database of about 60,000 human exomes (http://exac.broadinstitute.org/). Commonly used somatic mutation databases include The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/) and the International Cancer Genome Consortium (ICGC) database (https://dcc.icgc.org/). Fisher's exact test is used to screen for significantly different mutations (i.e. gene mutations associated with tumorigenesis). Four quantities are used: first, the proportion of mutation carriers, i.e. the number of patients carrying the mutation; second, the proportion of the mutation in the population and the population allele frequency; third, the proportion of homozygous carriers; fourth, the proportion of heterozygous carriers.
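Fisher's exact test on a 2×2 table can be sketched in pure Python as below. The text does not say how the table is built; the layout assumed here (carriers and non-carriers among the patients versus carriers and non-carriers in a reference population) is one plausible reading, and the function name is hypothetical. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is the usual definition.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout: a, b = carriers and non-carriers among the patients;
    c, d = carriers and non-carriers in the reference population.
    """
    if min(a, b, c, d) < 0 or a + b + c + d == 0:
        raise ValueError("counts must be non-negative and not all zero")
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float rounding
    )
    return min(1.0, p_value)
```

A mutation would be retained when the returned value is below 0.05. The exact binomial arithmetic suits small tables; for database-sized population counts a statistics library would be the practical choice.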
S109: Association analysis is performed on the clinical data of the 10 lung cancer patients in this embodiment (including age, sex, cancer TNM stage, smoking history, tumor volume, tumor size, lymph node invasion, Ki67 malignancy grade, and tissue subtype) to find the risk factors associated with the occurrence of specific gene mutations in lung cancer, that is, whether patients with a given clinical feature are prone to particular mutations.
Here, we apply the chi-square test function in the R statistical software for the association analysis.
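The embodiment uses R for this test. As an illustration only, the same calculation for a 2×2 table (one clinical feature present or absent against one mutation present or absent) can be sketched in Python. The function name is hypothetical; the continuity correction is switched on by default because R's chi-square test function applies it to 2×2 tables by default, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
from math import erfc, sqrt


def chi_square_2x2(table, yates=True):
    """Pearson chi-square test for a 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column needs at least one observation")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            diff = abs(observed - expected)
            if yates:
                # Yates continuity correction, never pushed below zero.
                diff = max(0.0, diff - 0.5)
            statistic += diff * diff / expected
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value
```

With only 10 or 11 patients, expected cell counts are small, so the chi-square approximation is rough and an exact test may be more appropriate for such tables.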
The SNV statistics for embodiment one are shown in Table 2.
Table 2
Embodiment two: analysis of tumor target gene sequencing data from serum samples of 11 lung cancer patients.
The libraries in this embodiment are targeted amplicon libraries built from serum cell-free DNA of 11 lung cancer patients. Library construction proceeds as follows:
(1) Selection of target genes: tumor hotspot mutation genes, proto-oncogenes, tumor suppressor genes, and genes acted on by some targeted drugs were selected, specifically the 48 genes ABL1, EGFR, GNAS, MLH1, RET, AKT1, ERBB2, HNF1A, ALK, ERBB4, HRAS, NOTCH1, SMARCB1, APC, FBXW7, IDH1, NPM1, SMO, ATM, FGFR1, JAK2, NRAS, SRC, BRAF, FGFR2, JAK3, PDGFRA, STK11, CDH1, FGFR3, KDR, PIK3CA, TP53, CDKN2A, FLT3, KIT, PTEN, VHL, CSF1R, GNA11, KRAS, PTPN11, EZH2, TNNB1, GNAQ, MET, RB1, and IDH2, as the target genes of our study.
(2) Extraction and quantification of cell-free DNA: for the pleural effusion samples of lung cancer patients, we first centrifuge at low speed (3,000 rpm) for 5 minutes and collect the supernatant, then centrifuge at high speed (14,000 rpm) for 10 minutes and collect the supernatant, obtaining the cell-free DNA in the pleural effusion sample (average length about 166 bp); the DNA is quantified with a Qubit 2.0 instrument (Invitrogen).
(3) Amplicon design: primers for the 48 target genes were designed with the online primer design tool DesignStudio. This yielded 2,158 amplicon primer pairs covering all exon regions of the 48 target genes, each amplicon fragment being about 150 bp. Because the target genes differ in sequence length while the amplicon fragment size is almost fixed, each target gene corresponds to a different number of amplicon primer pairs. The list of target genes and amplicon primer pair numbers is the same as Table 1.
(4) Multiplex PCR amplification of target gene exons: once the amplicon primers are designed, the primer oligonucleotides are synthesized from the primer sequences given in the design report, and all exon sequences of the target genes are amplified by multiplex PCR.
(5) Ligation of Illumina sequencing adapters and library PCR amplification: Illumina sequencing adapters are ligated to the amplification products above. The adapter sequences are as follows:
Upstream sequence: 5'P-NNN……NNNGATCGGAAGAGCACACGTCTGAA-3'
Downstream sequence: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCNNN……NNNT-3'
After adapter ligation, the library is PCR-amplified for 6 to 15 cycles, depending on the amount of starting template, using the KAPA HiFi HotStart PCR kit.
(6) Library quality inspection and Q-PCR quantification: library quality is checked by agarose gel electrophoresis (2% agarose gel, 120 V, 30 minutes) followed by gel imaging; the target band is 270 bp. Library fragment size is determined precisely with an Agilent 2100 Bioanalyzer, and library concentration is quantified precisely by Q-PCR.
(7) Sequencing on the HiSeq 4000 sequencer: paired-end sequencing data with a read length of 125 bp are obtained on the Illumina HiSeq 4000 platform.
Referring to Fig. 1, the specific steps of this embodiment are:
S101: Through amplicon library construction and sequencing, we obtain reads (paired-end 125 bp sequencing data) covering the nucleotide sequence of all exon regions of the 48 target genes.
S102: Quality control is performed on all reads in the sequencing data with FastQC; reads whose mean per-base sequencing quality is below 20 are classed as poor quality and deleted from the analysis.
S103: The sequencing data are filtered by amplicon primer sequence: the amplicon primer sequences in the two reads of each pair are extracted, and read pairs whose primer sequences do not come from the same primer pair are removed. The number of reads covering the targeted amplification region of each amplicon primer pair is then counted, the amplification efficiencies of the different amplicons are compared, and the reads belonging to abnormally amplified amplicons are deleted.
In this embodiment, the basis and conditions for judging an amplicon abnormal are the same as in embodiment one.
S104: Delete the amplicon primer sequences from the reads, to improve the accuracy of mutation identification and the efficiency of alignment.
In this embodiment, we wrote a Python program that compares the paired reads with the sequences of the known amplicon primer pairs and deletes the part of each read that matches a primer sequence, leaving the true target gene DNA sequence. We truncate the reads to remove the primer portion for two reasons. First, it prevents bases arising from primer synthesis errors from being identified as mutations. Second, the shortened sequences reduce the subsequent alignment time.
S105:First, we download mankind's reference gene group sequence from UCSC genome browser databases hg19.Secondly, we write program and extract the reference sequences of 48 target genes from whole human genomic sequence (hg19) Out.Again, we are by all read sequence alignments obtained in upper step to the reference sequences of target gene, so as to be recorded The BAM files of comparison result.
In the present embodiment, comparison process application Bowtie is as comparison instrument.
S106: For the BAM files produced by the alignment above, we apply mutation identification tools to detect single-nucleotide variants (SNVs) occurring in the target gene regions.
In this embodiment, two SNV identification tools, VarScan2 and FreeBayes, are used for mutation identification, and the intersection of the two resulting mutation lists is taken as the result data for subsequent analysis.
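Taking the intersection of the two call sets can be sketched as follows, assuming each call has already been parsed into a dictionary with `chrom`, `pos`, `ref`, and `alt` fields (hypothetical field names; parsing of the tools' actual output files is not shown).

```python
def variant_key(record):
    """Identify a variant by chromosome, position, reference and alternate allele."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def intersect_calls(calls_a, calls_b):
    """Return the records from calls_a that are also present in calls_b."""
    keys_b = {variant_key(r) for r in calls_b}
    return [r for r in calls_a if variant_key(r) in keys_b]
```

Matching on all four fields means that two different substitutions at the same position are not treated as the same mutation.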
S107: Screen for mutation sites with a sequencing depth greater than 100×; the specific operations are the same as in Embodiment 1.
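A minimal sketch of the depth screen, assuming each call carries a `depth` field holding the sequencing depth at the mutation site (a hypothetical record layout; the operations of Embodiment 1 are not reproduced here).

```python
MIN_DEPTH = 100  # threshold from S107: depth must exceed 100x


def filter_calls_by_depth(calls, min_depth=MIN_DEPTH):
    """Keep only calls whose depth is strictly greater than min_depth."""
    return [c for c in calls if c["depth"] > min_depth]


def samples_by_site(calls):
    """Group the sample names of the given calls by mutation site."""
    grouped = {}
    for c in calls:
        grouped.setdefault(c["site"], []).append(c["sample"])
    return grouped
```

Applying `samples_by_site` to the filtered calls gives, for each site, the cases that would enter the statistical analysis.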
S108: Count the mutation frequencies, test them for significant differences against the SNVs recorded in existing mutation databases, select the mutations that differ significantly (P < 0.05), and functionally annotate the selected mutations using the ANNOVAR tool.
In this embodiment, the mutation frequencies come from the output of the SNV identification tools. The reference mutation databases and the method for screening significantly different mutations are the same as in Embodiment 1.
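The significance screen can be illustrated with the sketch below. The statistical test itself is defined in Embodiment 1 and is not restated in this section, so the one-sided exact binomial test used here is purely an assumption, as is the reading of "mutation frequency" as the fraction of samples carrying the mutation. All names are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def significant_mutations(observed, n_samples, db_freq, alpha=0.05):
    """Return {mutation: p_value} for mutations enriched relative to the database.

    observed: {mutation: number of samples carrying it}
    db_freq:  {mutation: frequency recorded in the reference database}
    Mutations absent from the database are skipped in this sketch.
    """
    result = {}
    for mutation, k in observed.items():
        if mutation not in db_freq:
            continue
        p_value = binomial_upper_tail(k, n_samples, db_freq[mutation])
        if p_value < alpha:
            result[mutation] = p_value
    return result
```

With 11 samples, for example, a mutation seen in 5 of them against a database frequency of 0.05 would pass the P < 0.05 cutoff, whereas one seen in a single sample against a database frequency of 0.1 would not.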
S109: Perform association analysis on the clinical data of the 11 lung cancer patients in this case (including age, sex, cancer TNM stage, smoking history, tumor volume, Ki67 malignancy grade, tissue subtype, and so on) to find the risk factors in lung cancer that are associated with the occurrence of specific gene mutations, i.e., whether patients with a given clinical feature are prone to particular mutations.
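Claim 8 names the chi-square test function of the R statistical software for this association analysis. As a rough Python analogue for a single binary clinical feature against a single mutation, the sketch below computes the Pearson chi-square statistic for a 2×2 table. The function name is hypothetical and no continuity correction is applied, whereas R's `chisq.test` applies Yates' correction to 2×2 tables by default, so the two can give different P values.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Rows: feature present / absent; columns: mutation present / absent.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value
```

For instance, `chi_square_2x2(5, 1, 1, 4)` describes 11 patients of whom 6 have the feature, 5 of those carrying the mutation, against 1 carrier among the 5 patients without the feature.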
The SNV statistics for Embodiment 2 are shown in Table 3.
Table 3
Those skilled in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, and the like.
The above is a further detailed description of the present invention in connection with specific embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make simple deductions or substitutions without departing from the inventive concept.

Claims (9)

1. A method for analyzing tumor target gene sequencing data, characterized by comprising the following steps:
Step 1: Obtain the reads containing mutation information, i.e., the high-throughput sequencing data;
Step 2: Quality control of the sequencing data: perform quality analysis on all sequencing data obtained in Step 1 with the FastQC software, obtain a sequencing data quality control report, and filter out the data reported as low quality;
Step 3: Count the amplification efficiency of the different amplification regions and delete the abnormally amplified data;
Step 4: Delete the primer sequences from the sequencing reads, i.e., obtain the true target-region DNA sequences in the reads;
Step 5: Align all sequence data obtained in Step 4 to the target regions to obtain alignment result data;
Step 6: Detect all mutations from the alignment result data using mutation identification tools;
Step 7: Count the sequencing depth of all covered bases in the amplification regions, and select highly reliable mutations according to the sequencing depth at the mutation sites;
Step 8: Screen for significantly different mutations with reference to the mutations annotated in cancer-related databases;
Step 9: Combining the clinical data of the cases, perform statistical analysis of the mutations and identify the germline mutations and somatic mutations that are significantly associated with traits;
Step 10: Generate a data analysis report in graphical form.
2. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the sequencing data come from targeted amplification libraries sequenced on high-throughput sequencing platforms including Illumina MiSeq/HiSeq, and the target regions are customizable, i.e., the genomic position information of all target genes is provided at the first analysis.
3. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the low-quality data in Step 2 refers to sequencing data whose mean per-base sequencing quality score is below 20.
4. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the abnormal amplification-region data in Step 3 are obtained and deleted as follows:
(1) align the reads obtained by sequencing to the targeted reference genome;
(2) judge whether the amplicon primer sequences at the two ends of a read pair come from the same primer pair, allowing 2 mismatches each between the forward and reverse primers and the 5′ and 3′ ends of the sequenced fragment, and remove the reads that do not meet this condition;
(3) count the reads covering the targeted amplification region of each amplicon, and use the amplification counts to measure and compare their amplification efficiency;
(4) when the amplification count of an amplicon is less than 1/3 of the mean amplification count of all amplicons, judge the amplification region of that amplicon to be abnormal and delete all reads amplified from it from the analysis.
5. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the alignment in Step 5 requires extracting the nucleic acid sequences of the target regions from the genome according to the target-region genomic position information provided at the first analysis, and building an index.
6. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that in Step 7, for every mutation site, only the cases whose sequencing depth at that site exceeds 100× are included in the statistical analysis.
7. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that in Step 8 the screening of significantly different mutations is based on the data stored in currently known cancer-related databases.
8. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the clinical data association analysis in Step 9 specifically applies the chi-square test function of the R statistical software to the patients' clinical data, including age, sex, cancer TNM stage, tumor volume, tumor size, lymph node invasion status, Ki67 malignancy grade, and tissue subtype, to find the risk factors associated with the occurrence of specific gene mutations, i.e., whether patients with a given clinical feature are prone to particular mutations.
9. The method according to any one of claims 1-8, characterized in that the reference sequences are derived from the reference tumor genes or the human reference genome sequence of the UCSC public database.
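The low-quality criterion of claim 3 (a mean per-base quality score below 20) can be sketched as follows. This is not how FastQC itself works; it is a stand-alone illustration that assumes four-line FASTQ records with Phred+33 quality encoding, and the function names are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def filter_low_quality(records, min_mean=20):
    """Drop reads whose mean per-base quality is below min_mean."""
    return [r for r in records if mean_phred(r[2]) >= min_mean]
```

A read whose mean score is exactly 20 is kept, since the claim defines low quality as a mean below 20.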
CN201710739726.0A 2017-08-25 2017-08-25 A kind of tumor target gene sequencing data analytic method Pending CN107577921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710739726.0A CN107577921A (en) 2017-08-25 2017-08-25 A kind of tumor target gene sequencing data analytic method

Publications (1)

Publication Number Publication Date
CN107577921A true CN107577921A (en) 2018-01-12

Family

ID=61034812

Country Status (1)

Country Link
CN (1) CN107577921A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180112
