CN107577921A

CN107577921A - Method for analyzing tumor target gene sequencing data

Info

Publication number: CN107577921A
Application number: CN201710739726.0A
Authority: CN
Inventors: 李志广; 吕德康; 张学红; 张宇
Original assignee: Cloud One Biological Technology (dalian) Co Ltd
Current assignee: Cloud One Biological Technology (dalian) Co Ltd
Priority date: 2017-08-25
Filing date: 2017-08-25
Publication date: 2018-01-12

Abstract

A method for analyzing tumor target gene sequencing data, belonging to the field of high-throughput genomic sequencing. The analysis comprises the following steps: obtaining reads that carry mutation information; sequencing quality control; quality control of the amplification efficiency of the targeted region corresponding to each amplicon; removing primer sequences from the reads; aligning the reads to the reference sequences of the target regions and obtaining the alignment results; identifying mutations in the reads aligned to the target regions; screening samples according to the sequencing depth at each mutation site; screening for significantly different mutations by reference to previously recorded mutations; and performing association analysis with case data. The invention provides a general-purpose analysis for the different amplicon libraries customized by different users and, by incorporating clinical information, yields mutation results that are significantly associated with disease. It adds a comprehensive assessment of targeted-library amplification efficiency and a primer trimming step, which improves the reliability of the results.

Description

Method for analyzing tumor target gene sequencing data

Technical field

The present invention relates to the field of high-throughput genomic sequencing data analysis. Specifically, it covers quality control of target gene library sequencing data, evaluation and filtering of amplicon efficiency, genome alignment, and mutation identification and annotation, followed by statistical analysis that combines previously recorded mutations with case data. It offers tumor patients a complete, general-purpose (non-customized) tumor gene mutation detection and analysis solution, and provides technical support for personalized tumor medicine.

Background art

Tumors are, in essence, genetic diseases. Various environmental and hereditary carcinogenic factors act together to damage DNA, activating proto-oncogenes and/or inactivating tumor suppressor genes; together with changes in apoptosis-regulating genes and/or DNA repair genes, this leads to abnormal gene expression and the gradual transformation of target cells into cancer cells. Transformed cells first proliferate polyclonally. Over a long, multistage evolutionary process, one clone expands relatively without restraint, accumulates additional mutations, and selectively gives rise to subclones with different characteristics (heterogeneity), thereby acquiring the ability to invade and metastasize (malignant transformation) and forming a malignant tumor.

Tumor gene testing extracts genetic material from human cells and uses sequencing to detect oncogenes or tumor susceptibility genes. It is used for tumor prevention, diagnosis, prognosis prediction, targeted drug selection, postoperative monitoring, and so on.

Targeted sequencing sequences PCR products of specific length or captured fragments and analyzes the variants in those sequences. It can sequence target regions at high coverage as required and can also detect low-frequency mutations. With falling sequencing costs and deepening research in human functional genomics, targeted sequencing is moving from research institutions into the clinic, where it is used in genetic screening, disease risk assessment, tumor diagnosis and treatment, precision medication, and other fields.

Problems with targeted sequencing data analysis: First, no targeted sequencing data tool currently combines known mutation databases or case data for statistical analysis, so none can provide mutation results that are significantly associated with disease. Second, general analysis software only supports sequencing data from fixed panel libraries, for example TruSeq Amplicon from Illumina, and cannot serve users whose targeted libraries reflect different requirements. Third, existing methods neither assess the amplification efficiency of amplicons nor trim primer sequences; leaving either unaddressed gives the SNV results of downstream mutation analysis a high false positive rate, which distorts the conclusions.

Summary of the invention

To address the shortcomings of existing analysis methods, the present invention provides a complete solution for analyzing sequencing data from independently designed target gene panels.

The object of the invention is achieved through the following technical solution:

A method for analyzing tumor target gene sequencing data, characterized by comprising the following steps:

Step 1: Obtain the reads that carry mutation information, that is, the high-throughput sequencing data;

Step 2: Quality control of the sequencing data: analyze the quality of all sequencing data obtained in step 1 with the FastQC software, obtain a sequencing data quality control report, and filter out the data reported as low quality;

Step 3: Count the amplification efficiency of the different amplification regions and delete the data from abnormally amplified regions;

Step 4: Delete the primer sequences from the reads, leaving the true target region DNA sequence in each read;

Step 5: Align all sequences obtained in step 4 to the target regions to obtain the alignment result data;

Step 6: Detect all mutations in the alignment result data using mutation identification tools;

Step 7: Count the sequencing depth of every covered base in the amplification regions, and select high-reliability mutations according to the sequencing depth at each mutation site;

Step 8: Screen for significantly different mutations by reference to the mutations annotated in cancer-related databases;

Step 9: Perform statistical analysis of the mutations in combination with the clinical data of the cases, and identify the germline mutations and somatic mutations that are significantly associated with traits;

Step 10: Generate a graphical data analysis report.

The sequencing data come from targeted amplicon libraries sequenced on high-throughput platforms, including Illumina MiSeq/HiSeq. The target regions are customizable: the genomic coordinates of all target sites are supplied at the start of the analysis.

The low-quality data in step 2 are sequencing data whose mean per-base sequencing quality score is below 20.
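As a non-limiting illustration, the mean-quality criterion above can be sketched in Python. The sketch assumes Phred+33 encoded FASTQ quality strings; the function names `mean_phred`, `passes_quality`, and `filter_fastq` are hypothetical and are not part of FastQC or of the disclosed program.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 is an assumption)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_quality(quality: str, threshold: float = 20.0) -> bool:
    """A read is kept unless its mean per-base quality is below the threshold."""
    return mean_phred(quality) >= threshold


def filter_fastq(records, threshold: float = 20.0):
    """Yield the (name, sequence, quality) tuples that pass the mean-quality check."""
    for name, seq, qual in records:
        if passes_quality(qual, threshold):
            yield name, seq, qual


if __name__ == "__main__":
    reads = [("r1", "ACGT", "IIII"), ("r2", "ACGT", "####")]
    print([name for name, _, _ in filter_fastq(reads)])
```

A read whose mean score is exactly 20 is kept here, because the text only excludes scores below 20.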

The abnormal amplification region data in step 3 are identified and deleted as follows:

(1) Align the reads obtained by sequencing to the targeted reference genome;

(2) Determine whether the amplicon primer sequences at the two ends of a read come from the same primer pair, allowing up to 2 mismatches each between the forward and reverse primers and the 5' and 3' ends of the sequenced fragment, and remove the reads that do not meet this condition;

(3) Count the reads covering the targeted amplification region corresponding to each amplicon pair, and use this read count to measure and compare amplification efficiency;

(4) When the read count of an amplicon is less than 1/3 of the mean read count over all amplicons, the amplification region corresponding to that amplicon is judged abnormal, and all reads amplified from it are deleted from the analysis.
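The primer-pair check of item (2) can be sketched as follows. This is a minimal sketch under stated assumptions: read 1 is assumed to begin with the forward primer and read 2 with the reverse primer, mismatches are counted by simple position-wise comparison without indels, and the names `mismatches`, `starts_with_primer`, and `assign_amplicon` are hypothetical.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def starts_with_primer(read: str, primer: str, max_mm: int = 2) -> bool:
    """True if the read begins with the primer, allowing up to max_mm mismatches."""
    if len(read) < len(primer):
        return False
    return mismatches(read[:len(primer)], primer) <= max_mm


def assign_amplicon(read1: str, read2: str, primer_pairs: dict, max_mm: int = 2):
    """Return the id of the primer pair matching both mates, or None.

    primer_pairs maps an amplicon id to a (forward, reverse) primer tuple.
    A pair of reads whose two primers come from different pairs gets None
    and would be removed from the analysis.
    """
    for amp_id, (fwd, rev) in primer_pairs.items():
        if starts_with_primer(read1, fwd, max_mm) and starts_with_primer(read2, rev, max_mm):
            return amp_id
    return None


if __name__ == "__main__":
    pairs = {"amp1": ("ACGTACGTAC", "TTGGCCAATT")}  # made-up primers
    print(assign_amplicon("ACGTACGTACGATTACA", "TTGGCCAATTCATCAT", pairs))
```

A linear scan over all primer pairs is adequate for a sketch; with thousands of amplicons a lookup keyed on primer prefixes would likely be faster.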

The alignment in step 5 requires extracting the nucleic acid sequences of the target regions from the genome, according to the genomic coordinates supplied at the start, and building an index.

For every mutation site in step 7, only cases whose sequencing depth at that site exceeds 100× are included in the statistical analysis.

The screening for significantly different mutations in step 8 is based on the data stored in currently known cancer-related databases.

The association analysis of clinical data in step 9 applies the chi-square test function in the R statistical software to the patients' clinical data, including age, sex, cancer TNM stage, gross tumor volume, tumor size, presence of lymph node invasion, Ki67 malignancy grade, and tissue subtype, to find the risk factors associated with the occurrence of specific gene mutations, that is, whether patients with a given clinical feature are prone to particular mutations.

The reference sequences are reference tumor gene sequences or human reference genome sequences from the UCSC public database.

Beneficial effects of the invention: The method for analyzing tumor target gene sequencing data according to the invention comprises (1) obtaining reads that carry mutation information, that is, the sequencing data; (2) sequencing quality control; (3) quality control of target region amplification efficiency; (4) removing primer sequences from the reads; (5) aligning the reads to the reference target sequences to obtain the aligned reads; (6) identifying mutations in the reads; (7) screening samples according to the sequencing depth at each mutation site; (8) screening for significantly different mutations by reference to mutations annotated in cancer-related databases; (9) performing association analysis with the clinical data of the cases; and (10) generating a graphical data analysis report. The method meets the analysis needs of targeted libraries in a general-purpose, non-customized way and, by drawing on known mutation databases or case data, provides mutation results that are significantly associated with disease.

The method combines known reference mutation databases with patients' clinical data and uses different statistical tests to screen for significantly different mutations. The mutation databases fall into two classes: germline mutation databases and somatic mutation databases. Commonly used germline mutation databases include the 1000 Genomes Project database (http://www.1000genomes.org/) and ExAC, the integrated database of about 60,000 human exomes (http://exac.broadinstitute.org/). Commonly used somatic mutation databases include The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/) and the International Cancer Genome Consortium (ICGC) database (https://dcc.icgc.org/). Four quantities are generally needed: first, the proportion of individuals with the mutation, that is, the number of patients carrying it; second, the mutation proportion in the population and the population allele frequency; third, the proportion of individuals with a homozygous mutation; fourth, the proportion of individuals with a heterozygous mutation. Once these four quantities are available, Fisher's exact test can be used to screen for significantly different mutations (that is, gene mutations associated with tumorigenesis). The method applies the chi-square test function in the R statistical software to the patients' clinical data (including age, sex, cancer TNM stage, gross tumor volume, tumor size, presence of lymph node invasion, Ki67 malignancy grade, and tissue subtype) to find, in lung cancer, the risk factors associated with the occurrence of specific gene mutations, that is, whether patients with a given clinical feature are prone to particular mutations.

On the one hand, the method is more general than existing methods. On the other hand, it comprehensively assesses the amplification efficiency of user-customized amplicon targeted libraries and trims primer sequences, keeping the amplification efficiency of different amplicons at as uniform a level as possible, so as to avoid false positive SNV results caused by differences in amplification efficiency between amplicons. In summary, the method not only provides a general-purpose analysis of the different amplicon libraries customized by different users and, by incorporating clinical information, yields mutation results that are significantly associated with disease, but also adds a comprehensive assessment of targeted-library amplification efficiency and a primer trimming step, so that the method as a whole is novel and also improves the reliability of the results.

Brief description of the drawings

Fig. 1 is a flowchart of an implementation of the method of the invention.

Embodiment

Several high-throughput sequencing platforms exist, including Illumina NextSeq, MiSeq, and HiSeq. The embodiments of the invention are described using the Illumina HiSeq/MiSeq sequencing platforms.

The method provided by the invention is suitable for mutation detection in targeted DNA or RNA and is therefore set out in separate embodiments. Sample DNA/RNA extraction, library construction, high-throughput sequencing, and so on are carried out in the embodiments using existing techniques.

Where specific conditions are not stated in the embodiments, conventional conditions or those recommended by the manufacturer are used; reagents or instruments whose manufacturer is not stated are conventional products available commercially.

Embodiment one: Analysis of target gene sequencing data from pleural effusion samples of 10 lung cancer patients:

The library in this embodiment is a targeted amplicon library built from cell-free DNA in the pleural effusion samples of 10 lung cancer patients. The library construction steps are as follows:

(1) Selection of target genes: Tumor hotspot mutation genes, proto-oncogenes, tumor suppressor genes, and genes acted on by targeted drugs are selected, specifically the following 48 genes, as the target genes of our study: ABL1, EGFR, GNAS, MLH1, RET, AKT1, ERBB2, HNF1A, ALK, ERBB4, HRAS, NOTCH1, SMARCB1, APC, FBXW7, IDH1, NPM1, SMO, ATM, FGFR1, JAK2, NRAS, SRC, BRAF, FGFR2, JAK3, PDGFRA, STK11, CDH1, FGFR3, KDR, PIK3CA, TP53, CDKN2A, FLT3, KIT, PTEN, VHL, CSF1R, GNA11, KRAS, PTPN11, EZH2, CTNNB1, GNAQ, MET, RB1, IDH2.

(2) Extraction and quantification of cell-free DNA: Each pleural effusion sample from a lung cancer patient is first centrifuged at low speed (3,000 rpm) for 5 minutes and the supernatant collected, then centrifuged at high speed (14,000 rpm) for 10 minutes and the supernatant collected, from which the cell-free DNA of the pleural effusion sample is obtained (mean length about 166 bp); the DNA is quantified with a Qubit 2.0 instrument (Invitrogen).

(3) Amplicon design: Primers are designed for the 48 target genes with the online primer design tool DesignStudio. In the end, we obtained 2,158 amplicon pairs covering all exon regions of the 48 target genes, each amplicon fragment being about 150 bp. Because the target genes differ in sequence length while the fragment size of each amplicon is almost fixed, each target gene corresponds to a different number of amplicon primer pairs. The list of target genes and their numbers of amplicon primer pairs is shown in Table 1.

Table 1

(4) Multiplex PCR amplification of the target gene exons: Once amplicon primer design is complete, primer oligonucleotides are synthesized according to the primer sequences given in the design report, and all exon sequences of the target genes are amplified by multiplex PCR.

(5) Ligation of Illumina sequencing adapters and library PCR amplification: The sequencing adapters of the Illumina sequencer are ligated to the amplification products above. The adapter sequences are as follows:

Upstream sequence: 5'P-NNN...NNNGATCGGAAGAGCACACGTCTGAA-3'

Downstream sequence: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCNNN...NNNT-3'

After adapter ligation, the library is amplified by PCR for 6 to 15 cycles, depending on the amount of starting template, using the KAPA HiFi HotStart PCR kit.

(6) Library quality check and Q-PCR quantification: Library quality is checked by agarose gel electrophoresis on a 2% agarose gel at 120 V for 30 minutes, followed by gel imaging; the target band is 270 bp. Library fragment size is measured precisely with the Agilent 2100 Bioanalyzer, and library concentration is quantified precisely by Q-PCR.

(7) Sequencing on the MiSeq sequencer: Paired-end sequencing data with a read length of 75 bp are obtained on the Illumina MiSeq platform.

Referring to Fig. 1, the specific steps of this embodiment include:

S101: Through amplicon library construction and sequencing, we obtain reads (paired-end 75 bp sequencing data) carrying the nucleic acid sequence information of all exon regions of the 48 target genes.

S102: FastQC is used for quality control of all reads in the sequencing data; reads whose mean per-base sequencing quality score is below 20 are deemed poor quality and deleted from the analysis.

S103: The sequencing data are filtered using the amplicon primer sequences. The amplicon primer sequence pair is extracted from the two reads of each pair, and reads whose primer sequences do not come from the same primer pair are removed. The number of reads covering the targeted amplification region of each amplicon primer pair is then counted, the amplification efficiencies of the different amplicons are compared, and the reads belonging to abnormally amplified amplicons are deleted.

In this embodiment, if the amplicon primer sequences at the two ends of a read come from the same primer pair that we designed, the read is considered to derive from amplification by that amplicon primer pair. The reads meeting this condition are counted for every amplicon, and the amplification efficiencies of all amplicons are compared (amplification efficiency is measured here by read count). When the read count of an amplicon is less than 1/3 of the mean read count over all amplicons, that amplicon is judged abnormal, and all reads amplified from it are deleted from the analysis.
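By way of illustration only, the 1/3-of-mean rule can be sketched as below. The inputs are assumptions: `read_assignments` is a list holding one amplicon id per retained read pair, and `all_amplicons` lists every designed amplicon so that amplicons with zero reads are also counted. The function names are hypothetical.

```python
from collections import Counter


def flag_abnormal_amplicons(read_assignments, all_amplicons, fraction=1 / 3):
    """Return the set of amplicons whose read count is below fraction * mean count."""
    counts = Counter(read_assignments)
    per_amplicon = {amp: counts.get(amp, 0) for amp in all_amplicons}
    mean_count = sum(per_amplicon.values()) / len(per_amplicon)
    return {amp for amp, n in per_amplicon.items() if n < fraction * mean_count}


def drop_abnormal_reads(read_assignments, abnormal):
    """Keep only the read assignments that do not belong to an abnormal amplicon."""
    return [amp for amp in read_assignments if amp not in abnormal]


if __name__ == "__main__":
    assignments = ["A"] * 90 + ["B"] * 100 + ["C"] * 10  # made-up counts
    abnormal = flag_abnormal_amplicons(assignments, ["A", "B", "C"])
    print(sorted(abnormal), len(drop_abnormal_reads(assignments, abnormal)))
```

With these made-up counts the mean is about 66.7 reads, so the cutoff is about 22.2 and only amplicon C falls below it.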

S104: The amplicon primer sequences are deleted from the reads, which improves the accuracy of mutation identification and the efficiency of alignment.

In this embodiment, we wrote a Python program that compares each pair of reads with the sequences of the known amplicon primer pairs and deletes from each read the portion that matches a primer sequence, leaving the true target gene DNA sequence. The purpose of truncating the reads in this way is to remove the primer portion. On the one hand, this prevents bases arising from primer synthesis errors from being called as mutations. On the other hand, the shorter sequences reduce the subsequent alignment time.
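The Python program itself is not reproduced in this document. A minimal sketch of the trimming idea, assuming the primer sits at the 5' end of the read and tolerating up to 2 mismatches (an assumption carried over from the primer-pair check), might look like this; `trim_primer` is a hypothetical name.

```python
def trim_primer(seq: str, qual: str, primer: str, max_mm: int = 2):
    """Remove a leading primer from a read and from its quality string.

    The read is returned unchanged when its first len(primer) bases differ
    from the primer at more than max_mm positions.
    """
    n = len(primer)
    if len(seq) >= n:
        mm = sum(a != b for a, b in zip(seq[:n], primer))
        if mm <= max_mm:
            return seq[n:], qual[n:]
    return seq, qual


if __name__ == "__main__":
    # Made-up read: a 6-base primer followed by 7 bases of target sequence.
    print(trim_primer("ACGTAGGATTACA", "I" * 13, "ACGTAG"))
```

Trimming sequence and quality together keeps the FASTQ record consistent, which downstream aligners require.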

S105: First, we download the human reference genome sequence hg19 from the UCSC Genome Browser database (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19). Second, we write a program that extracts the reference sequences of the 48 target genes from the whole human genome sequence (hg19). Third, we align all reads obtained in the previous step to the reference sequences of the target genes, producing BAM files that record the alignment results.

In this embodiment, the reads are aligned to the target gene reference sequences rather than to the whole human genome sequence, which improves alignment accuracy and efficiency. Eukaryotic genes are composed of exons and introns, so aligning directly to the reference sequences of the target genes is more direct and accurate. The alignment uses the BWA aligner; in other embodiments, other alignment software such as Bowtie or SOAP2 may also be used.
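The extraction program is not given in this document. A sketch of the idea, assuming a FASTA-formatted genome small enough to hold in memory and BED-style 0-based, half-open coordinates, could be written as follows; the names `read_fasta` and `extract_regions` are hypothetical. The resulting target FASTA would then be indexed with the chosen aligner before alignment.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping sequence name to sequence."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_regions(genome, regions):
    """Cut target regions out of the genome.

    regions is an iterable of (region_name, chrom, start, end) with
    0-based, half-open coordinates (an assumption of this sketch).
    """
    return {name: genome[chrom][start:end] for name, chrom, start, end in regions}


if __name__ == "__main__":
    fasta = [">chr1 toy", "ACGTACGT", "TTTTGGGG", ">chr2", "CCCCAAAA"]
    targets = extract_regions(read_fasta(fasta), [("t1", "chr1", 4, 10)])
    for name, seq in targets.items():
        print(f">{name}\n{seq}")
```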

S106: For the BAM files produced by the alignment above, we use mutation identification tools to detect the single-nucleotide mutations located in the target gene regions.

In this embodiment, mutation identification uses two mutation identification tools, VarScan2 and MuTect; the intersection of their respective mutation lists is taken as the result data for subsequent analysis.
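Taking the intersection of two callers' lists can be sketched as follows, assuming each call has already been reduced to a (chromosome, position, reference allele, alternate allele) tuple. Parsing of the callers' own output formats is outside this sketch, and the function names are hypothetical.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalize a call so that the same variant compares equal across callers."""
    return (str(chrom), int(pos), str(ref).upper(), str(alt).upper())


def intersect_calls(calls_a, calls_b):
    """Return the sorted list of variants reported by both callers."""
    keys_a = {variant_key(*call) for call in calls_a}
    keys_b = {variant_key(*call) for call in calls_b}
    return sorted(keys_a & keys_b)


if __name__ == "__main__":
    # Made-up calls for illustration only.
    caller_1 = [("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")]
    caller_2 = [("chr1", "100", "a", "g"), ("chr3", 7, "G", "A")]
    print(intersect_calls(caller_1, caller_2))
```

Matching on all four fields, rather than on position alone, avoids treating two different alternate alleles at the same site as agreement between the callers.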

S107: Select the mutation sites whose sequencing depth exceeds 100×.

In this embodiment, we first use the depth subcommand of the samtools software to obtain the sequencing depth at each mutation site. Mutation sites whose sequencing depth is below a given threshold are then rejected. Those skilled in the art know that SNV detection in a region by high-throughput sequencing generally requires about 30× sequencing data for that region. The higher the sequencing depth, the more reliable the allele frequencies obtained. A threshold is set on sequencing depth: the larger the threshold, the more accurate and reliable the retained SNVs, but the less data remains for downstream analysis; the smaller the threshold, the more downstream data, but the lower its reliability. Statistical analysis that includes these low-reliability SNVs yields more false positive SNVs. Here, we set the threshold for screening mutation sites to 100×; in other embodiments, the threshold may be changed as appropriate.
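The depth screen can be sketched as follows, assuming the three-column, tab-separated text that `samtools depth` writes for a single BAM file (reference name, 1-based position, depth). The function names and the example values are hypothetical.

```python
def parse_depth(lines):
    """Read 'chrom<TAB>pos<TAB>depth' lines into a {(chrom, pos): depth} dict."""
    depth = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth[(fields[0], int(fields[1]))] = int(fields[2])
    return depth


def filter_by_depth(variants, depth, min_depth=100):
    """Keep variants whose site depth exceeds min_depth.

    Each variant is a tuple starting with (chrom, pos). Sites missing from
    the depth table are treated as depth 0 and therefore rejected.
    """
    return [v for v in variants if depth.get((v[0], v[1]), 0) > min_depth]


if __name__ == "__main__":
    table = parse_depth(["chr1\t100\t250\n", "chr1\t200\t100\n"])
    calls = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
    print(filter_by_depth(calls, table))
```

The comparison is strict because the text requires a depth of more than 100×; a site at exactly 100× is rejected.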

S108: Based on mutation frequency, significance tests are performed against existing mutation databases and the SNVs recorded in them, and the significantly different mutations are selected (P < 0.05). Finally, the ANNOVAR tool is used for functional annotation of the significantly different mutations: the SNVs are annotated to genes and against various mutation databases, to establish the type of each mutation (synonymous, nonsynonymous, nonsense, and so on) and whether it affects the function of the protein encoded by the gene, further revealing the role of these mutations in the formation and development of lung cancer.

In this embodiment, the mutation frequency comes from the output of the SNV identification tools. The reference mutation databases fall into two classes: germline mutation databases and somatic mutation databases. Commonly used germline mutation databases include the 1000 Genomes Project database (http://www.1000genomes.org/) and ExAC, the integrated database of about 60,000 human exomes (http://exac.broadinstitute.org/). Commonly used somatic mutation databases include The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/) and the International Cancer Genome Consortium (ICGC) database (https://dcc.icgc.org/). Fisher's exact test is used to screen for significantly different mutations (that is, gene mutations associated with tumorigenesis). Four quantities are involved: first, the proportion of individuals with the mutation, that is, the number of patients carrying it; second, the mutation proportion in the population and the population allele frequency; third, the proportion of individuals with a homozygous mutation; fourth, the proportion of individuals with a heterozygous mutation.
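As an illustration of the test named above, a two-sided Fisher's exact test on a 2×2 table can be written with the Python standard library alone. The table layout (carriers and non-carriers in the patient cohort versus a reference population) and the counts in the usage example are assumptions made for this sketch; the document does not specify how the four quantities are arranged into tables.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    The p-value sums the hypergeometric probabilities of every table with
    the same margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)


if __name__ == "__main__":
    # Made-up counts: 6 of 10 patients carry the variant versus 20 of 1,000
    # individuals in a reference population.
    p = fisher_exact_two_sided(6, 4, 20, 980)
    print(p < 0.05)
```

The exact test suits this setting because cohorts of around ten patients give expected cell counts too small for the chi-square approximation.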

S109: Association analysis is performed on the clinical data of the 10 lung cancer patients in this embodiment (including age, sex, cancer TNM stage, smoking history, gross tumor volume, tumor size, presence of lymph node invasion, Ki67 malignancy grade, and tissue subtype) to find, in lung cancer, the risk factors associated with the occurrence of specific gene mutations, that is, whether patients with a given clinical feature are prone to particular mutations.

Here, we use the chi-square test function in the R statistical software for the association analysis.
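The embodiment uses R for this step. Purely for illustration, and in keeping with the Python used elsewhere in the embodiment, the Pearson chi-square test for a 2×2 table (one clinical feature against presence of a mutation) can be sketched as below. The Yates continuity correction is applied by default, mirroring the default behaviour of R's `chisq.test` on 2×2 tables; the function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a: int, b: int, c: int, d: int, yates: bool = True):
    """Chi-square statistic and p-value (1 degree of freedom) for [[a, b], [c, d]]."""
    rows, cols = (a + b, c + d), (a + c, b + d)
    n = a + b + c + d
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column total must be positive")
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(0.0, diff - 0.5)  # continuity correction
            stat += diff * diff / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Made-up table: rows = feature present/absent, columns = mutated/wild type.
    stat, p = chi_square_2x2(10, 20, 30, 40, yates=False)
    print(round(stat, 4))
```

This sketch covers only the 2×2 case; features with more than two levels, such as TNM stage, need the general chi-square distribution with more degrees of freedom.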

The SNV statistical results of embodiment one are shown in Table 2.

Table 2

Embodiment two: Analysis of tumor target gene sequencing data from serum samples of 11 lung cancer patients.

The library in this embodiment is a targeted amplicon library built from cell-free DNA in the serum of 11 lung cancer patients. The library construction steps are as follows:

(1) Selection of target genes: Tumor hotspot mutation genes, proto-oncogenes, tumor suppressor genes, and genes acted on by certain targeted drugs are selected, specifically the following 48 genes, as the target genes of our study: ABL1, EGFR, GNAS, MLH1, RET, AKT1, ERBB2, HNF1A, ALK, ERBB4, HRAS, NOTCH1, SMARCB1, APC, FBXW7, IDH1, NPM1, SMO, ATM, FGFR1, JAK2, NRAS, SRC, BRAF, FGFR2, JAK3, PDGFRA, STK11, CDH1, FGFR3, KDR, PIK3CA, TP53, CDKN2A, FLT3, KIT, PTEN, VHL, CSF1R, GNA11, KRAS, PTPN11, EZH2, CTNNB1, GNAQ, MET, RB1, IDH2.

(3) Amplicon design: Primers are designed for the 48 target genes with the online primer design tool DesignStudio. In the end, we obtained 2,158 amplicon pairs covering all exon regions of the 48 target genes, each amplicon fragment being about 150 bp. Because the target genes differ in sequence length while the fragment size of each amplicon is almost fixed, each target gene corresponds to a different number of amplicon primer pairs. The list of target genes and their numbers of amplicon primer pairs is the same as Table 1.

Upstream sequence：5'P-NNN……NNNGATCGGAAGAGCACACGTCTGAA-3’

(6) Library quality check and Q-PCR quantification: Library quality is checked by agarose gel electrophoresis on a 2% agarose gel at 120 V for 30 minutes, followed by gel imaging; the target band is 270 bp. Library fragment size is measured precisely with the Agilent 2100 Bioanalyzer, and library concentration is quantified precisely by Q-PCR.

(7) Sequencing on the HiSeq 4000 sequencer: Paired-end sequencing data with a read length of 125 bp are obtained on the Illumina HiSeq 4000 platform.

Referring to Fig. 1, the specific steps of this embodiment include:

S101: Through amplicon library construction and sequencing, we obtain reads (paired-end 125 bp sequencing data) carrying the nucleic acid sequence information of all exon regions of the 48 target genes.

S102: FastQC is used for quality control of all reads in the sequencing data; reads whose mean per-base sequencing quality score is below 20 are deemed poor quality and deleted from the analysis.

In this embodiment, the basis and conditions for judging an amplicon abnormal are the same as in embodiment one.

S105: First, we download the human reference genome sequence hg19 from the UCSC Genome Browser database. Second, we write a program that extracts the reference sequences of the 48 target genes from the whole human genome sequence (hg19). Third, we align all reads obtained in the previous step to the reference sequences of the target genes, producing BAM files that record the alignment results.

In this embodiment, the alignment uses Bowtie as the aligner.

S106: For the BAM files produced by the alignment above, we use mutation identification tools to detect the single-nucleotide mutations occurring in the target gene regions.

In this embodiment, mutation identification uses two SNV identification tools, VarScan2 and FreeBayes; the intersection of their respective mutation lists is taken as the result data for subsequent analysis.

S107: Select the mutation sites whose sequencing depth exceeds 100×; the procedure is the same as in embodiment one.

S108: Mutation frequencies are counted, significance tests are performed against existing mutation databases and the SNVs recorded in them, the significantly different mutations are selected (P < 0.05), and the ANNOVAR tool is used for functional annotation of the selected mutations.

In this embodiment, the mutation frequency comes from the output of the SNV identification tools. The reference mutation databases and the method of screening for significantly different mutations are the same as in embodiment one.

S109: Association analysis is performed on the clinical data of the 11 lung cancer patients in this embodiment (including age, sex, cancer TNM stage, smoking history, gross tumor volume, Ki67 malignancy grade, and tissue subtype) to find, in lung cancer, the risk factors associated with the occurrence of specific gene mutations, that is, whether patients with a given clinical feature are prone to particular mutations.

The SNV statistical results of embodiment two are shown in Table 3.

Table 3

Those skilled in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, and the like.

The above is a further detailed description of the invention in connection with specific embodiments, and the specific implementation of the invention is not to be regarded as limited to these descriptions. A person of ordinary skill in the technical field of the invention may make simple deductions or substitutions without departing from the concept of the invention.

Claims

1. A method for analyzing tumor target gene sequencing data, characterized by comprising the following steps:

Step 2: Quality control of the sequencing data: analyze the quality of all sequencing data obtained in step 1 with the FastQC software, obtain a sequencing data quality control report, and filter out the data reported as low quality;

Step 4: Delete the primer sequences from the reads, leaving the true target region DNA sequence in each read;

Step 7: Count the sequencing depth of every covered base in the amplification regions, and select high-reliability mutations according to the sequencing depth at each mutation site;

Step 9: Perform statistical analysis of the mutations in combination with the clinical data of the cases, and identify the germline mutations and somatic mutations that are significantly associated with traits;

Step 10: Generate a graphical data analysis report.

2. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the sequencing data come from targeted amplicon libraries sequenced on high-throughput platforms, including Illumina MiSeq/HiSeq, and the target regions are customizable, that is, the genomic coordinates of all target sites are supplied at the start of the analysis.

3. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the low-quality data in step 2 are sequencing data whose mean per-base sequencing quality score is below 20.

4. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the abnormal amplification region data in step 3 are identified and deleted as follows:

(2) Determine whether the amplicon primer sequences at the two ends of a read come from the same primer pair, allowing up to 2 mismatches each between the forward and reverse primers and the 5' and 3' ends of the sequenced fragment, and remove the reads that do not meet this condition;

(3) Count the reads covering the targeted amplification region corresponding to each amplicon pair, and use this read count to measure and compare amplification efficiency;

(4) When the read count of an amplicon is less than 1/3 of the mean read count over all amplicons, the amplification region corresponding to that amplicon is judged abnormal, and all reads amplified from it are deleted from the analysis.

5. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the alignment in step 5 requires extracting the nucleic acid sequences of the target regions from the genome, according to the genomic coordinates supplied at the start, and building an index.

6. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that, for every mutation site in step 7, only cases whose sequencing depth at that site exceeds 100× are included in the statistical analysis.

7. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the screening for significantly different mutations in step 8 is based on the data stored in currently known cancer-related databases.

8. The method for analyzing tumor target gene sequencing data according to claim 1, characterized in that the association analysis of clinical data in step 9 applies the chi-square test function in the R statistical software to the patients' clinical data, including age, sex, cancer TNM stage, gross tumor volume, tumor size, presence of lymph node invasion, Ki67 malignancy grade, and tissue subtype, to find the risk factors associated with the occurrence of specific gene mutations, that is, whether patients with a given clinical feature are prone to particular mutations.

9. The method according to any one of claims 1-8, characterized in that the reference sequences are reference tumor gene sequences or human reference genome sequences from the UCSC public database.