CN110246543A

CN110246543A - Method and computer system for detecting copy number variation in a single sample based on next-generation sequencing

Info

Publication number: CN110246543A
Application number: CN201910541057.5A
Authority: CN
Inventors: 郎继东; 王博; 杨家亮; 田埂
Original assignee: Meta Code Gene Technology (beijing) Ltd By Share Ltd
Current assignee: Meta Code Gene Technology (beijing) Ltd By Share Ltd
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2019-09-17
Anticipated expiration: 2039-06-21
Also published as: CN110246543B

Abstract

The present invention discloses a method and a computer system for detecting copy number variation in a single sample based on next-generation sequencing. Starting from raw sequencing data, the invention detects copy number variation (CNV) in a single sample without relying on factors that conventional methods need or require, such as alignment, sequencing depth, GC-content correction and paired samples. This simplifies the experimental and analytical steps and reduces cost, while the analysis results remain highly consistent with those of conventional methods. In addition, incorporating clinical test results (for example FISH validation) effectively corrects the false positives and false negatives of conventional detection methods.

Description

Method and computer system for detecting copy number variation in a single sample based on next-generation sequencing

Technical field

The present invention relates to genetic testing, and in particular to a method and a computer system for detecting copy number variation in a single sample based on next-generation sequencing.

Background art

Copy number variation (CNV) is a type of structural variation caused by genomic rearrangement. By size, structural variation is divided into the microscopic level and the submicroscopic level. Microscopic structural variation mainly refers to chromosome aberrations visible under the microscope, including euploidy or aneuploidy, insertion, deletion, inversion and translocation. Submicroscopic structural variation mainly refers to variation in DNA fragments of 1 Kb or longer, including insertion, deletion, duplication, inversion and translocation. Copy number variation is one of the important pathogenic factors in human disease. Current research has found that CNV is related to the pathogenesis of, or susceptibility to, many complex genetic diseases, including tumors, acquired immunodeficiency syndrome, systemic lupus erythematosus and autoimmune inflammatory diseases. Clinical detection of copy number variation is therefore necessary: it allows variation in large DNA fragments of the genome to be found early and thus provides a reference for the diagnosis and treatment of disease.

Many means and methods for detecting copy number variation are currently available. Methods based on the polymerase chain reaction include multiplex ligation-dependent probe amplification and multiplex amplifiable probe hybridization; methods based on hybridization include in situ immunofluorescence and Giemsa banding; methods based on chip technology include single nucleotide polymorphism chips. These methods are complicated to operate and have low resolution, so they can hardly provide specific information about the variant segment; their analysis throughput is also low and their price high, so they are not cost-effective. With the rapid development of next-generation sequencing, sequencing cost has fallen substantially, analysis throughput has increased exponentially, and resolution can reach the Kb level, allowing deeper study of submicroscopic copy number variation. Current CNV detection algorithms, such as CNVkit, CNVnator and Control-FREEC, were essentially all developed for whole-genome sequencing (WGS) data, and for the sake of accuracy they generally require paired samples. For a single sample, CNVs are generally identified after correction for sequencing depth and GC content. As demand for targeted sequencing grows, more specialized algorithms such as PatternCNV and Ioncopy have emerged in addition to those mentioned above. Without exception, however, these methods require alignment and depend heavily on sequencing depth and GC content. They are also constrained by the parameter settings of the alignment and of the analysis algorithm, the overall workflow is cumbersome and complex, and both experimental and analysis costs are high.

Summary of the invention

In view of this, the present invention establishes a method and a computer system for detecting copy number variation in a single sample based on next-generation sequencing. Starting from raw sequencing data, the invention detects copy number variation (CNV) in a single sample without relying on factors that conventional methods need or require, such as alignment, sequencing depth, GC-content correction and paired samples.

Specifically, the present invention includes the following contents.

The first aspect of the present invention provides a method for detecting copy number variation in a single sample based on next-generation sequencing, comprising the following steps:

(1) Establish a first gene sample database and a second gene sample database, wherein the first gene sample database contains A gene samples in which a copy number variation has occurred, and the second gene sample database contains B gene samples in which no copy number variation has occurred in the corresponding gene, A and B each being a natural number of 50 or more. The genes in the first gene sample database preferably all contain the copy number variation, and the genes in the second gene sample database preferably have no copy number variation at the positions (i.e. regions) corresponding to the copy number variation in the first gene sample database. Note that the genes in the second gene sample database may differ from those in the first by mutations outside the copy number variation region, including copy number variations. To ensure the reliability and accuracy of the method, A and B generally each need to be a natural number of 50 or more, preferably 100 or more, more preferably 200 or more, still more preferably 300 or more. There is no particular upper limit on A and B.

(2) Divide each of the j copy number variation regions, of length L_j bp, using a sliding window of m bp with a step of n bp, so that each copy number variation region yields i = L_j/n seed sequences, where i is the quotient if L_j is exactly divisible by n, and otherwise the quotient rounded down plus 1; in total, a matrix of j*i seed sequences is obtained. In general, j is a natural number of 1 or more, preferably 10 or more, more preferably 30 or more. m is a natural number of 50 or more, more preferably 80 or more, and more preferably no greater than L. n is a natural number from 1 to L, preferably from 5 to L.

(3) Match the j*i seed sequences against the first gene sample database and the second gene sample database respectively, using full-sequence matching with no mismatches allowed, to obtain for each database a matrix of j*i exact-match seed counts.

(4) Normalize the matrix of exact-match seed counts for each database, i.e. divide each exact-match seed count by the mean of all exact-match seed counts of its copy number variation region.

(5) Zero-pad the normalized matrix of exact-match seed counts: taking the copy number variation region with the largest number of seed sequences as the reference, set the matrix entries that the remaining regions lack to 0.

(6) Perform mathematical modeling on the A+B zero-padded, normalized matrices of exact-match seed counts, build a statistical model from the positive and negative labels, and finally obtain a mathematical model that judges whether a copy number variation is positive or negative.

(7) Repeat steps (2)-(5) for the sample to be judged, and use the mathematical model obtained in step (6) to predict and judge the copy number variation: if the predicted value is greater than 0.5, preferably greater than 0.6, more preferably greater than 0.8, the sample is judged positive; otherwise it is judged negative.
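Steps (2)-(5) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the implementation of the invention: the function names are hypothetical, reads are plain strings, an "exact-match count" is taken to be the number of reads that contain the seed on the forward strand, windows at the end of a region are kept even when shorter than m, and a region whose counts are all zero is normalized to zeros.

```python
from typing import List


def make_seeds(region: str, m: int, n: int) -> List[str]:
    """Step (2): slide an m-bp window along one region with a step of n bp.

    range(0, len, n) yields ceil(len / n) start positions, which is the
    count described in the text. Trailing windows may be shorter than m
    (an assumption; the text does not say how they are handled).
    """
    return [region[start:start + m] for start in range(0, len(region), n)]


def count_exact(seed: str, reads: List[str]) -> int:
    """Step (3): number of reads containing the seed with no mismatches."""
    return sum(1 for read in reads if seed in read)


def normalize(counts: List[int]) -> List[float]:
    """Step (4): divide each count by the mean count of its region."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return [0.0] * len(counts)  # assumption: avoid division by zero
    return [c / mean for c in counts]


def pad(rows: List[List[float]]) -> List[List[float]]:
    """Step (5): right-pad every row with 0 to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [row + [0.0] * (width - len(row)) for row in rows]


def sample_matrix(regions: List[str], reads: List[str],
                  m: int, n: int) -> List[List[float]]:
    """Build the zero-padded, normalized count matrix for one sample."""
    rows = []
    for region in regions:
        counts = [count_exact(seed, reads) for seed in make_seeds(region, m, n)]
        rows.append(normalize(counts))
    return pad(rows)


# Toy input (invented): two short "regions" and two "reads".
toy_regions = ["ACGTACGTAC", "GGGTTT"]
toy_reads = ["ACGTACGT", "TTACGTAC"]
toy_matrix = sample_matrix(toy_regions, toy_reads, m=4, n=3)
```

Because no alignment is involved, the only inputs are the region sequences and the raw reads; a practical version would likely index the reads (for example with a k-mer hash) rather than scan every read for every seed.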

Preferably, in the method of the present invention for detecting copy number variation in a single sample based on next-generation sequencing, the gene sample data in the first gene sample database and the second gene sample database are derived from whole-genome sequencing data and/or target-region capture/amplification sequencing data.

Preferably, in the method of the present invention for detecting copy number variation in a single sample based on next-generation sequencing, the copy number variation includes gene copy number amplification and/or deletion.

Preferably, in the method of the present invention for detecting copy number variation in a single sample based on next-generation sequencing, the copy number variation includes chromosomal euploidy or aneuploidy, insertion, deletion, inversion and translocation, as well as insertion, deletion, duplication, inversion or translocation of DNA fragments.

Preferably, in the method of the present invention for detecting copy number variation in a single sample based on next-generation sequencing, the length of the DNA fragment is 1 Kb or more, preferably 1.5 Kb or more. On the other hand, it is preferably 10 Kb or less, more preferably 8 Kb or less.

Preferably, in the method of the present invention for detecting copy number variation in a single sample based on next-generation sequencing, the gene is ERBB2.

Preferably, in the method of the present invention for detecting copy number variation in a single sample based on next-generation sequencing, the statistical model in step (6) is built by logistic regression or a deep learning algorithm.
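As a rough illustration of the logistic regression option, the sketch below fits a logistic model to flattened count matrices by plain batch gradient descent. It is a toy under stated assumptions, not the model of the invention: the helper names, learning rate and epoch count are hypothetical, and the four training matrices and their labels are invented solely so the code can run.

```python
import math
from typing import List, Tuple


def flatten(matrix: List[List[float]]) -> List[float]:
    """Turn one zero-padded, normalized matrix into a feature vector."""
    return [value for row in matrix for value in row]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Predicted probability that the sample is CNV-positive."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 1000) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean log-loss (no regularization)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Invented toy data: 1x2 matrices whose rows average to 1 after normalization.
toy_matrices = [[[1.6, 0.4]], [[1.5, 0.5]], [[1.0, 1.0]], [[0.9, 1.1]]]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (made up)

X_train = [flatten(mat) for mat in toy_matrices]
weights, bias = fit_logistic(X_train, toy_labels)
score = predict(weights, bias, flatten([[1.6, 0.4]]))
```

A prediction above the chosen cut-off (0.5 in the text) would then be read as positive. In practice an established library implementation with regularization and cross-validation would be the more usual choice.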

The second aspect of the present invention provides a computer system that comprises a processor and is configured to execute the method according to the first aspect of the present invention.

The present invention not only simplifies the experimental and analytical steps and reduces cost; its analysis results also show a high concordance rate with conventional methods, and incorporating clinical test results (for example FISH validation) effectively corrects the false positives and false negatives of conventional detection methods.

Brief description of the drawings

Fig. 1 is an exemplary flowchart of the present invention.

Specific embodiments

Various exemplary embodiments of the present invention are now described in detail. This detailed description should not be taken as limiting the invention, but as a more detailed description of certain aspects, features and embodiments of the invention.

It should be understood that the terminology used herein is for describing particular embodiments only and is not intended to limit the invention. In addition, for numerical ranges in the present invention, it is understood that the upper and lower limits of the range and every intervening value between them are specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated value or intervening value in that range is also encompassed by the invention. The upper and lower limits of these smaller ranges may independently be included in or excluded from the range.

Unless otherwise stated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. Although only preferred methods and materials are described, any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the invention. All documents mentioned in this specification are incorporated by reference in order to disclose and describe the methods and/or materials in connection with which they are cited. In case of conflict with any incorporated document, the content of this specification prevails. Unless otherwise stated, "%" or "amount" is a percentage by weight.

Embodiment

Whole-exome data from 5 samples known to be positive for ERBB2 amplification and 5 samples known to be negative were chosen to test the analysis method of the invention. One ERBB2-positive sample, Sample1, is described in detail (see the flowchart in Fig. 1); the remaining examples repeat steps 1-11, as follows:

1. Collect whole-exome sequencing data from 272 ERBB2 amplification-positive samples and from 1029 ERBB2 amplification-negative samples, and split the data into a training set and a test set. The training set contains 223 positive samples and 817 negative samples; the test set contains 49 positive samples and 212 negative samples.

2. The ERBB2 gene contains 27 exons. Divide the 27 amplified or deleted regions, of length Lj bp (0 < j < 28), using a 50 bp sliding window with a step of 40 bp. Each amplified or deleted region yields i = Lj/40 seed sequences in total, where i is the quotient if Lj is exactly divisible by 40, and otherwise the quotient rounded down plus 1, so that a 27*i seed sequence matrix is obtained overall. The values of Lj are 311bp, 152bp, 214bp, 135bp, 69bp, 116bp, 142bp, 120bp, 127bp, 74bp, 91bp, 200bp, 133bp, 91bp, 161bp, 48bp, 139bp, 123bp, 99bp, 186bp, 156bp, 76bp, 147bp, 98bp, 189bp, 253bp and 974bp. The corresponding values of i are 8, 4, 6, 4, 2, 3, 4, 4, 4, 2, 3, 6, 4, 3, 5, 2, 4, 4, 3, 5, 4, 2, 4, 3, 5, 7 and 25.

3. Match the 27*i seed sequences against the data of each of the 1040 training-set samples, using full-sequence matching with no mismatches allowed, to obtain for each sample a matrix of 27*i exact-match seed counts.

4. Normalize the matrix of exact-match seed counts of each sample, i.e. divide each exact-match seed count by the mean of all exact-match seed counts of its amplified or deleted region.

5. Zero-pad the normalized matrix of exact-match seed counts: taking the amplified or deleted region with the largest number of seed sequences as the reference, set the matrix entries that the remaining regions lack to 0.

6. Perform mathematical modeling on the 1040 zero-padded, normalized 27*25 matrices of exact-match seed counts. First carry out 10-fold cross-validation on the 1040 samples, then use a convolutional neural network (CNN), a deep learning algorithm, together with the positive and negative labels to select hyperparameters and tune the model. The resulting optimal mathematical model has an AUC of 93.04% on the training set and 94.54% on the test set, and serves as the model for judging new samples of the same data type. The model parameters are shown in Table 1.

Table 1 - Model parameters

7. Repeat steps 2-5 for Sample1 and use the optimal mathematical model obtained in step 6 to predict and judge the copy number variation. The predicted value, 0.9916596, is greater than 0.5, so the sample is considered positive.
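The seed counts listed in step 2 can be checked with a few lines of Python. The function names below are hypothetical. Note that the listed values of i agree with the stated rule for every exon except the two whose length is exactly divisible by 40 (120 bp and 200 bp), where the listing is one higher; taking "quotient rounded down plus 1" in every case reproduces the listing. Which of the two the original pipeline actually used is not stated, so both are shown.

```python
# Exon lengths Lj (bp) and seed counts i exactly as listed in step 2.
LENGTHS = [311, 152, 214, 135, 69, 116, 142, 120, 127, 74, 91, 200, 133, 91,
           161, 48, 139, 123, 99, 186, 156, 76, 147, 98, 189, 253, 974]
LISTED_I = [8, 4, 6, 4, 2, 3, 4, 4, 4, 2, 3, 6, 4, 3,
            5, 2, 4, 4, 3, 5, 4, 2, 4, 3, 5, 7, 25]
STEP = 40  # step length n in bp


def seeds_stated_rule(length: int, step: int = STEP) -> int:
    """Quotient if exactly divisible, otherwise quotient rounded down plus 1."""
    quotient, remainder = divmod(length, step)
    return quotient if remainder == 0 else quotient + 1


def seeds_floor_plus_one(length: int, step: int = STEP) -> int:
    """Quotient rounded down plus 1 in every case."""
    return length // step + 1


# Exons where the stated rule and the listing disagree: (Lj, rule, listed).
mismatches = [(length, seeds_stated_rule(length), listed)
              for length, listed in zip(LENGTHS, LISTED_I)
              if seeds_stated_rule(length) != listed]

# Shape of the zero-padded matrix: one row per exon, width = largest i.
matrix_shape = (len(LENGTHS), max(LISTED_I))
total_seeds = sum(LISTED_I)
```

The resulting shape, 27 rows by 25 columns, matches the 27*25 matrix used for modeling in step 6; only the last exon (974 bp) fills a whole row, and every other row is zero-padded.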

The predicted values and results for Sample2-Sample10 are shown in Table 2 below.

Table 2 - Summary of prediction results for each sample

Sample ID	Predicted value	Result
Sample1	0.9916596	Positive
Sample2	0.9989957	Positive
Sample3	0.9999901	Positive
Sample4	0.9990958	Positive
Sample5	0.99751943	Positive
Sample6	0.012844639	Negative
Sample7	0.006111831	Negative
Sample8	0.003521628	Negative
Sample9	0.008016149	Negative
Sample10	0.002645513	Negative
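The positive/negative calls in Table 2 follow directly from the cut-off described in step (7). The short sketch below applies that rule to the tabulated predicted values; the function name `call_cnv` is hypothetical, and the values are copied from the table.

```python
# Predicted values copied from Table 2.
PREDICTIONS = {
    "Sample1": 0.9916596, "Sample2": 0.9989957, "Sample3": 0.9999901,
    "Sample4": 0.9990958, "Sample5": 0.99751943, "Sample6": 0.012844639,
    "Sample7": 0.006111831, "Sample8": 0.003521628, "Sample9": 0.008016149,
    "Sample10": 0.002645513,
}


def call_cnv(predicted: float, threshold: float = 0.5) -> str:
    """Positive if the predicted value exceeds the threshold, else negative."""
    return "positive" if predicted > threshold else "negative"


calls = {sample: call_cnv(value) for sample, value in PREDICTIONS.items()}

# The stricter cut-offs mentioned as preferred (0.6 and 0.8) can be compared
# with the default by counting positives at each threshold.
positives_by_threshold = {
    t: sum(call_cnv(v, t) == "positive" for v in PREDICTIONS.values())
    for t in (0.5, 0.6, 0.8)
}
```

For these ten samples the predicted values lie far from every cut-off, so the choice among 0.5, 0.6 and 0.8 does not change any call.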

It will be apparent to those skilled in the art that various improvements and changes can be made to the specific embodiments described herein without departing from the scope or spirit of the invention. Other embodiments will be apparent to the skilled person from the specification of the invention. The specification and examples of the present application are merely exemplary.

Claims

1. A method for detecting copy number variation in a single sample based on next-generation sequencing, characterized by comprising the following steps:

(1) establishing a first gene sample database and a second gene sample database, wherein the first gene sample database contains A gene samples in which a copy number variation has occurred, and the second gene sample database contains B genes in which no copy number variation has occurred at the corresponding position, A and B each being a natural number of 50 or more;

(2) dividing each of the j copy number variation regions, of length L_j bp, using a sliding window of m bp with a step of n bp, so that each copy number variation region yields i = L_j/n seed sequences, where i is the quotient if L_j is exactly divisible by n, and otherwise the quotient rounded down plus 1, thereby obtaining in total a matrix of j*i seed sequences;

(3) matching the j*i seed sequences against the first gene sample database and the second gene sample database respectively, using full-sequence matching with no mismatches allowed, to obtain for each database a matrix of j*i exact-match seed counts;

(4) normalizing the matrix of exact-match seed counts for each database, i.e. dividing each exact-match seed count by the mean of all exact-match seed counts of its copy number variation region;

(5) zero-padding the normalized matrix of exact-match seed counts, i.e. taking the copy number variation region with the largest number of seed sequences as the reference and setting the matrix entries that the remaining regions lack to 0;

(6) performing mathematical modeling on the A+B zero-padded, normalized matrices of exact-match seed counts, building a statistical model from the positive and negative labels, and finally obtaining a mathematical model that judges whether a copy number variation is positive or negative;

(7) repeating steps (2)-(5) for the sample to be judged, and using the mathematical model obtained in step (6) to predict and judge the copy number variation, the sample being judged positive if the predicted value is greater than 0.5, and negative otherwise.

2. The method for detecting copy number variation in a single sample based on next-generation sequencing according to claim 1, characterized in that j is a natural number of 1 or more.

3. The method for detecting copy number variation in a single sample based on next-generation sequencing according to claim 2, characterized in that the gene sample data in the first gene sample database and the second gene sample database are derived from whole-genome sequencing data and/or target-region capture/amplification sequencing data.

4. The method for detecting copy number variation in a single sample based on next-generation sequencing according to claim 3, characterized in that the copy number variation includes gene copy number amplification and/or deletion.

5. The method for detecting copy number variation in a single sample based on next-generation sequencing according to claim 4, characterized in that the copy number variation includes chromosomal euploidy or aneuploidy, insertion, deletion, inversion or translocation, and insertion, deletion, duplication, inversion or translocation of DNA fragments.

6. The method for detecting copy number variation in a single sample based on next-generation sequencing according to claim 5, characterized in that the length of the DNA fragment is 1 Kb or more.

7. The method for detecting copy number variation in a single sample based on next-generation sequencing according to claim 1, characterized in that the gene is ERBB2.

8. The method for detecting copy number variation in a single sample based on next-generation sequencing according to claim 1, characterized in that the statistical model in step (6) is built by logistic regression or a deep learning algorithm.

9. A computer system, characterized in that it comprises a processor and is configured to execute the method according to any one of claims 1-8.