CNV detection device
The present application is a divisional application of Chinese patent application No. 201811623637.0, filed on December 28, 2018 and entitled "CNV detection device".
Technical Field
The application relates to a noninvasive CNV detection device and a method for noninvasively detecting CNV by using the noninvasive CNV detection device.
Background
Gene copy number variation (hereinafter abbreviated as CNV) is a clinically important class of structural variation. Most microdeletions and microduplications are polymorphisms, but some are pathogenic or lethal. Identifying a pathogenic or lethal CNV before the fetus is born therefore allows early intervention and can reduce neonatal defects.
Current noninvasive prenatal gene detection (NIPT screening) performs sequencing analysis of maternal peripheral blood on a next-generation sequencing platform (NGS platform); the analysis filters system noise and amplifies the fetal signal so that chromosomal aneuploidy can be detected. Noninvasive CNV detection builds on NIPT by dividing each chromosome into windows and performing signal amplification and significance verification independently for each window.
Since most of the signal in the sequencing data comes from the mother, the fetal signal is easily masked when a maternal CNV or placental mosaicism is present. In addition, when the experimental system is unstable, GC bias or system noise easily leads to misjudged results, producing false positives or false negatives. Fetal concentration is also an important variable affecting the result, with higher concentrations giving higher confidence in the result.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a detection device and a detection method with higher detection sensitivity for CNV.
In particular, the object of the invention is achieved by the following technical scheme.
1. A copy number variation detection apparatus, comprising:
a sequencing data acquisition module that performs sequencing based on the acquired maternal peripheral blood free DNA to obtain chromosomal sequencing data of the sample to be tested and chromosomal sequencing data from the background library sample;
a windowed fragmentation module for aligning the sequencing data to a reference genomic sequence, cutting the sequencing data into windows of equal length with an intersection between every two adjacent windows, and counting window parameters of each window, including reads, unique reads (UR), mappability, genomic GC and/or unique reads GC;
a module for detecting CNV based on the number of reads, which calculates a Z value for each window, calculates the CNV probability, and estimates the fetal concentration by using the CNV probability, thereby judging whether the sample to be detected is suspected to be CNV-positive and eliminating interference from maternal CNV;
a module for detecting CNV based on the number of unique reads, defining a sliding step length m according to the detection resolution, calculating average reads (Mr) and average GC (Mgc) based on m adjacent windows, and constructing a window-specific linear regression model so as to judge whether a sample to be detected is suspected to be CNV;
and the model result summarizing module is used for comparing, analyzing and outputting a final result based on the output results of the two modules for detecting CNV.
2. The detection apparatus according to item 1, wherein the means for detecting CNV based on the reads number includes the following sub-modules:
a data preprocessing and normalization module for GC correction of the reads to eliminate inter-library differences; performing uniformity correction after GC correction so as to enable comparability between all the samples to be tested and the background library samples;
the Z test amplified signal module calculates the mean value and the variance of each window by using the background library sample, and calculates the Z value of each window by Z test;
the chromosome slicing module slices the chromosome by using the Z values of consecutive windows, merges consecutive windows with similar states into an interval to be detected, and judges the attribute of the interval (dup, del or normal);
the module calculates a Z value confidence interval, calculates the median value of Z values of continuous windows existing in the same interval of the background library sample for each interval to be detected combined by the chromosome slicing module, calculates and sets a confidence interval range according to the mean value and the variance of median distribution, judges whether the interval to be detected falls into the confidence interval, and judges the interval which does not fall into the confidence interval as a potential CNV interval;
the module calculates the CNV probability, the module calculates the sum of reads of windows in the interval in the same interval of the background library sample aiming at the potential CNV interval to obtain probability density distribution, calculates the significance probability according to the reads of the CNV interval to be detected, and carries out negative logarithmic conversion on the significance probability and compares the significance probability with a given threshold value;
and the module for calculating the CNV concentration is used for fitting the potential CNV interval by utilizing the UR and the real GC of the same interval of the background library sample, determining the UR and the GC of the potential CNV interval, calculating the CNV concentration by utilizing the UR and the GC of the potential CNV interval, and judging whether the sample to be detected is suspected to be maternal CNV or placenta mosaic according to the comparison of the calculated CNV concentration and the real fetal concentration.
3. The detection apparatus according to item 1 or 2, wherein the module for detecting CNV based on the unique reads number includes the following sub-modules:
the MiniModel construction module performs preprocessing to eliminate the difference in data amount among different libraries; after preprocessing, it defines a step length m according to the resolution, combines every m adjacent windows into a unit to calculate the average reads (Mr) and average GC (Mgc), calculates the distribution of Mr' and Mgc' of the same interval by using the background library samples, fits Mr' against Mgc', calculates the residual according to the theoretical value corresponding to the Mr and Mgc to be tested, judges the attribute of the window (dup, del or normal) according to the residual, and calculates the weight and judges the confidence according to the correlation R of Mr' and Mgc', the Mgc, and the standard deviation sd of the background data Mr';
a chromosome sectioning and slicing module which utilizes a given model or algorithm to identify adjacent areas which are normally distributed from two different mean values and have obvious differences, so as to sectioning and slicing the chromosome and identify the CNV boundary position;
and the saliency evaluation module randomly extracts the same number of window values from other areas of the chromosome of the sample to be tested for the section interval, and repeats the process to determine the saliency of the true values in the background distribution.
4. The detection apparatus according to item 3, wherein in the MiniModel construction module, calculating the residual error and determining the confidence according to the theoretical values corresponding to the measured values Mr and Mgc further comprises:
for each unit, calculating the standard deviation of Mr' over all background library samples and the Pearson correlation coefficient between Mr' and Mgc', and calculating the weight by combining the standard deviation, the correlation coefficient, and the quantile of the Mgc of the sample to be tested within the distribution of the background library Mgc', thereby judging the confidence.
5. The detection apparatus according to any one of items 1 to 4, wherein in the model result summarizing module, if the module for detecting CNV based on the reads number and Z values and the module for detecting CNV based on the UR number and mean values both report a target CNV interval in the sample to be tested, and the coincidence ratio of the reported target CNV intervals exceeds a set threshold, the coincident region is reported as a CNV; and if the results of the two modules are inconsistent for the interval to be tested, the result is output as a false positive.
6. The detection apparatus according to any one of items 3 to 5, wherein in the saliency evaluation module, the process is repeated 10000 times.
7. A computer readable storage medium having stored thereon a computer program for performing the steps of:
sequencing data acquisition, namely sequencing based on the acquired maternal peripheral blood free DNA to obtain chromosome sequencing data of a sample to be tested and chromosome sequencing data from a background library sample;
a windowed fragmenting step for aligning the sequencing data to a reference genome sequence, cutting the sequencing data into windows of equal length with an intersection between every two adjacent windows, and counting window parameters of each window, including reads, unique reads (UR), mappability, genomic GC and/or unique reads GC;
a step of detecting CNV based on the reads number, calculating Z value based on each window, calculating CNV probability, and estimating fetal concentration by using CNV probability, thereby judging whether the sample to be detected is suspected to be positive CNV, and eliminating interference of maternal CNV;
detecting CNV based on the number of unique reads, defining a sliding window length m according to resolution, calculating average reads (Mr) and average GC (Mgc) based on m adjacent windows, and constructing a window-specific linear regression model, so as to judge whether a sample to be detected is suspected to be CNV;
and a model result summarizing step, wherein the final result is output by comparing and analyzing based on the output results of the two steps for detecting CNV.
8. The computer-readable storage medium of item 7, having stored thereon a computer program, wherein the computer program is further configured to perform the steps of:
a data preprocessing and normalization step for GC correction of the reads to eliminate the inter-library differences; performing uniformity correction after GC correction so as to enable comparability between all the samples to be tested and the background library samples;
a step of Z test amplified signal, which calculates the mean value and variance of each window by using the background library sample, and calculates the Z value of each window by Z test;
a chromosome slicing step, wherein a chromosome is sliced by utilizing a continuity window Z value, a continuity window with similar states is combined into a section to be detected, and the attribute of the section including dup, del, normal is judged;
calculating a Z value confidence interval, namely calculating the median value of Z values of continuous windows existing in the same interval of a background library sample aiming at each interval to be detected combined by the chromosome slicing module, calculating a 95% confidence interval range according to the mean value and the variance of median distribution, judging whether the interval to be detected falls into the confidence interval, and judging an interval which does not fall into the confidence interval as a potential CNV interval;
Calculating CNV probability, namely calculating the sum of reads of windows in the potential CNV interval in the same interval of a background library sample to obtain probability density distribution, calculating significance probability according to reads of the CNV interval to be detected, performing negative logarithmic conversion on the significance probability and comparing the significance probability with a given threshold;
and calculating the CNV concentration, wherein the step is to fit the potential CNV interval by utilizing the UR and the real GC of the same interval of the background library sample, determine the UR and the GC of the potential CNV interval, calculate the CNV concentration by utilizing the UR and the GC of the potential CNV interval, and judge whether the sample to be detected is suspected to be maternal CNV or placenta mosaic according to the comparison of the calculated CNV concentration and the real fetal concentration.
9. The computer-readable storage medium of item 7, having stored thereon a computer program, wherein the computer program is further configured to perform the steps of:
a MiniModel construction step of performing preprocessing to eliminate the difference in data amount among different libraries; after the preprocessing, defining a sliding window length m according to the resolution, combining every m adjacent windows into a unit to calculate the average reads (Mr) and average GC (Mgc), calculating the distribution of Mr' and Mgc' of the same interval by using the background library samples, fitting Mr' against Mgc', calculating the residual according to the theoretical value corresponding to the Mr and Mgc to be tested, judging the attribute of the window (dup, del or normal) according to the residual, and calculating the weight and judging the confidence according to the correlation R of Mr' and Mgc', the Mgc, and the standard deviation sd of the background data Mr';
a chromosome sectioning step of identifying, by using a given model or algorithm, adjacent regions that follow normal distributions with two different means and differ significantly, thereby sectioning the chromosome and identifying the CNV boundary positions;
and a significance evaluation step of randomly extracting the same number of window values from other areas of the chromosome of the sample to be measured for the section, and repeating the process to determine the significance of the true value in the background distribution.
10. The computer-readable storage medium of item 7, having stored thereon a computer program, wherein the computer program is further configured to perform the steps of:
if the step of detecting CNV based on the reads number and Z values and the step of detecting CNV based on the UR number and mean values both report a target CNV interval in the sample to be tested, and the coincidence ratio of the reported target CNV intervals exceeds a set threshold, the coincident region is reported as a CNV; and if the results of the two steps are inconsistent for the interval to be tested, the result is output as a false positive.
11. A copy number variation detection method comprising the steps of:
a sequencing data acquisition step of sequencing based on the acquired maternal peripheral blood free DNA to obtain chromosome sequencing data of a sample to be tested and chromosome sequencing data from a background library sample;
a window segmentation step of aligning the sequencing data to a reference genome sequence, cutting the sequencing data into windows of equal length with an intersection between every two adjacent windows, and counting window parameters of each window, including reads, unique reads (UR), mappability, genomic GC and/or unique reads GC;
a step of detecting CNV based on the number of reads, in which Z value is calculated based on each window, CNV probability is calculated, and fetal concentration is estimated by using CNV probability, thereby judging whether the sample to be detected is suspected to be positive CNV, and eliminating interference of maternal CNV;
a step of detecting CNV based on the number of unique reads, in which average reads (Mr) and average GC (Mgc) are calculated based on adjacent 10 windows, and a window-specific linear regression model is constructed, thereby judging whether the sample to be detected is suspected to be CNV;
and a model result summarizing step, wherein the final result is output by comparing and analyzing based on the output results of the two modules for detecting CNV.
12. The detection method according to item 11, wherein the step of detecting CNV based on the numbers of reads includes the steps of:
a data preprocessing and normalization step for GC correction of the reads to eliminate the inter-library differences; performing uniformity correction after GC correction so as to enable comparability between all the samples to be tested and the background library samples;
a step of Z test amplified signal, which calculates the mean value and variance of each window by using the background library sample, and calculates the Z value of each window by Z test;
a chromosome slicing step, wherein a chromosome is sliced by utilizing a continuity window Z value, a continuity window with similar states is combined into a section to be detected, and the attribute of the section including dup, del, normal is judged;
calculating a Z value confidence interval, namely calculating the median value of Z values of continuous windows existing in the same interval of a background library sample aiming at each interval to be detected combined by the chromosome slicing module, calculating a 95% confidence interval range according to the mean value and the variance of median distribution, judging whether the interval to be detected falls into the confidence interval, and judging an interval which does not fall into the confidence interval as a potential CNV interval;
Calculating CNV probability, namely calculating the sum of reads of windows in the potential CNV interval in the same interval of a background library sample to obtain probability density distribution, calculating significance probability according to reads of the CNV interval to be detected, performing negative logarithmic conversion on the significance probability and comparing the significance probability with a given threshold;
and calculating the CNV concentration, wherein the step is to fit the potential CNV interval by utilizing the UR and the real GC of the same interval of the background library sample, determine the UR and the GC of the potential CNV interval, calculate the CNV concentration by utilizing the UR and the GC of the potential CNV interval, and judge whether the sample to be detected is suspected to be maternal CNV or placenta mosaic according to the comparison of the calculated CNV concentration and the real fetal concentration.
13. The detection method according to item 11 or 12, wherein the step of detecting CNV based on the unique reads number includes the steps of:
a MiniModel construction step of performing preprocessing to eliminate the difference in data amount among different libraries; after the preprocessing, defining a sliding window length m according to the resolution, combining every m adjacent windows into a unit to calculate the average reads (Mr) and average GC (Mgc), calculating the distribution of Mr' and Mgc' of the same interval by using the background library samples, fitting Mr' against Mgc', calculating the residual according to the theoretical value corresponding to the Mr and Mgc to be tested, judging the attribute of the window (dup, del or normal) according to the residual, and calculating the weight and judging the confidence according to the correlation R of Mr' and Mgc', the Mgc, and the standard deviation sd of the background data Mr';
a chromosome sectioning step of identifying, by using a given model or algorithm, adjacent regions that follow normal distributions with two different means and differ significantly, thereby sectioning the chromosome and identifying the CNV boundary positions;
and a significance evaluation step of randomly extracting the same number of window values from other areas of the chromosome of the sample to be measured for the section, and repeating the process to determine the significance of the true value in the background distribution.
14. The detection method according to item 13, wherein in the MiniModel construction step, calculating a residual error and determining a confidence according to theoretical values corresponding to the measured values Mr and Mgc further includes:
for each unit, calculating the standard deviation of Mr' over all background library samples and the Pearson correlation coefficient between Mr' and Mgc', and calculating the weight by combining the standard deviation, the correlation coefficient, and the quantile of the Mgc of the sample to be tested within the distribution of the background library Mgc', thereby judging the confidence.
15. The detection method according to any one of items 11 to 14, wherein in the model result summarizing step, if the step of detecting CNV based on the reads number and Z values and the step of detecting CNV based on the UR number and mean values both report a target CNV interval in the sample to be tested, and the coincidence ratio of the reported target CNV intervals exceeds the set threshold, the coincident region is reported as a CNV; and if the results of the two steps are inconsistent for the interval to be tested, the result is output as a false positive.
16. The detection method according to any one of items 13 to 15, wherein in the significance evaluation module, the process is repeated 10000 times.
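The significance evaluation recited in items 3, 6, 13 and 16 can be illustrated with a resampling sketch. This is only a minimal illustration under stated assumptions: the per-window values are taken to be a plain list of numbers for one chromosome, the test statistic is assumed to be the absolute deviation of the segment mean from the mean of the remaining windows, and the function name `permutation_p_value` is hypothetical rather than taken from the invention.

```python
import random
from statistics import mean


def permutation_p_value(values, start, end, n_iter=10000, seed=0):
    """Empirical p-value for the candidate segment values[start:end].

    Assumption: `values` holds one number per window of a single chromosome
    of the sample to be tested (for example a residual or a Z value).
    """
    rng = random.Random(seed)
    segment = values[start:end]
    others = values[:start] + values[end:]  # windows outside the segment
    k = len(segment)
    if k == 0 or len(others) < k:
        raise ValueError("segment is empty or too few background windows")

    baseline = mean(others)
    observed = abs(mean(segment) - baseline)

    hits = 0
    for _ in range(n_iter):
        # Draw the same number of windows from the rest of the chromosome.
        draw = rng.sample(others, k)
        if abs(mean(draw) - baseline) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)
```

With the default `n_iter=10000` the smallest p-value this sketch can report is about 1e-4, which is one reason a large repetition count is useful.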
In the invention, N negative samples are used to establish a background library, and the sample to be tested (namely the fetus) is compared with the background library for significance verification. In the device and the method, the sample to be tested and the background library undergo the same pretreatment, which mainly comprises chromosome windowing, in which each chromosome is cut into equal-length windows with an intersection between every two adjacent windows, and LOWESS GC correction, in which GC correction is performed for each chromosome to be tested together with chromosome 1 and/or chromosome 2. Chromosome 1 and chromosome 2 are relatively stable, have a higher volume rate and diversity, and can serve as a reference for effectively evaluating deletion or duplication of the chromosome to be tested. In addition, using chromosome 1 and chromosome 2 as references eliminates, to a certain extent, the difference in data amount between different libraries. For each window, the mean and variance over the N negative samples in the background library are calculated, and the signal is amplified by three Z-tests. Finally, a window with a Z value larger than 1 is considered duplicated, a window with a Z value smaller than -1 is considered deleted, and the remaining windows are considered normal fluctuation. Windows of the same class are merged, the fetal concentration is calculated from the UR of the merged windows, and false positive results caused by data fluctuation are filtered out by combining the Z value and the fetal concentration. All CNVs are matched against the DGV and OMIM databases, and the corresponding annotation information, including polymorphism, pathogenicity, etc., is output.
In the invention, the whole chromosome is segmented into windows, so that a local microdeletion or microduplication does not distort the analysis of the whole chromosome. All windows have the same length, and the window length can be calculated from the sequencing depth, e.g., such that the number of free DNA fragments aligned to each window is not less than the inverse of the lower sequencing concentration limit. In the present invention, the length of each window is preferably 100 kb, with an intersection of 50 kb between every two adjacent windows.
In the present invention, m may be any integer. The smaller m is, the higher the resolution, but the stronger the fluctuation of each combined bin and the lower the stability. The larger m is, the lower the resolution, but the more stable the combined bin and the more pronounced the correlation between the unique reads and the GC. For example, m may be any integer between 5 and 20, corresponding to a resolution of 0.25 Mb to 1 Mb.
In the present invention, the set threshold is used to evaluate the consistency of the two CNV detection modules. Since the segmentation modules of the two CNV detection modules differ, the identified CNV boundaries may deviate somewhat. The higher the set threshold, the more stringent the consistency requirement for the two modules; the lower the threshold, the more relaxed the requirement. In the present invention, the threshold is preferably set to 50%.
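The per-window Z value, the classification at the thresholds of 1 and -1, and the merging of windows of the same class can be sketched as follows. This is a simplified illustration only: it performs a single Z-test per window rather than the three Z-tests mentioned above, it assumes the read counts have already been GC-corrected and normalized, and the function names are hypothetical.

```python
from statistics import mean, stdev


def window_z_scores(sample, background):
    """Z value of each window of `sample` against the background library.

    `sample` is a list of normalized read counts, one per window;
    `background` is a list of such lists, one per negative sample.
    """
    z_values = []
    for i, x in enumerate(sample):
        column = [b[i] for b in background]
        mu, sd = mean(column), stdev(column)
        z_values.append((x - mu) / sd if sd > 0 else 0.0)
    return z_values


def classify(z, upper=1.0, lower=-1.0):
    """Label a window as dup, del or normal from its Z value."""
    if z > upper:
        return "dup"
    if z < lower:
        return "del"
    return "normal"


def merge_runs(states):
    """Merge consecutive windows with the same label.

    Returns (first_window, last_window_exclusive, label) tuples.
    """
    segments = []
    for i, state in enumerate(states):
        if segments and segments[-1][2] == state:
            first, _, _ = segments[-1]
            segments[-1] = (first, i + 1, state)
        else:
            segments.append((i, i + 1, state))
    return segments
```

A call such as `merge_runs([classify(z) for z in window_z_scores(sample, background)])` then yields the candidate intervals that the later steps would filter.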
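The preferred windowing (equal-length windows of 100 kb with a 50 kb intersection between neighbours) can be sketched as below. The function name `make_windows` is hypothetical, and dropping a trailing partial window is an assumption made here for simplicity; the text does not say how the chromosome end is handled.

```python
def make_windows(chrom_length, size=100_000, overlap=50_000):
    """Return half-open (start, end) windows of equal length.

    Adjacent windows share `overlap` bases. A trailing region shorter
    than `size` is dropped (an assumption of this sketch).
    """
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the window size")
    windows = []
    start = 0
    while start + size <= chrom_length:
        windows.append((start, start + size))
        start += step
    return windows
```

For a 300 kb region this gives five windows starting at 0, 50 kb, 100 kb, 150 kb and 200 kb.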
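The trade-off between m and resolution can be made concrete with a short sketch. It assumes non-overlapping units of m adjacent windows and a 50 kb offset between window starts (100 kb windows with a 50 kb intersection), which reproduces the stated range of 0.25 Mb for m = 5 to 1 Mb for m = 20; the function names are hypothetical.

```python
from statistics import mean


def merge_units(ur_counts, ur_gc, m=10):
    """Combine every m adjacent windows into one unit.

    Returns a list of (Mr, Mgc) pairs: the average unique-read count and
    the average unique-reads GC of each unit. Leftover windows that do not
    fill a whole unit are ignored (an assumption of this sketch).
    """
    units = []
    for i in range(0, len(ur_counts) - m + 1, m):
        mr = mean(ur_counts[i:i + m])
        mgc = mean(ur_gc[i:i + m])
        units.append((mr, mgc))
    return units


def resolution_bp(m, step=50_000):
    """Approximate resolution in base pairs for a unit of m windows."""
    return m * step
```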
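The consistency check between the two detection modules can be sketched as an interval-overlap test. The text does not define the denominator of the coincidence ratio; this sketch assumes it is the length of the shorter of the two reported intervals, and both function names are hypothetical.

```python
def overlap_ratio(a, b):
    """Overlap of two half-open intervals relative to the shorter one."""
    (a0, a1), (b0, b1) = a, b
    shared = max(0, min(a1, b1) - max(a0, b0))
    shorter = min(a1 - a0, b1 - b0)
    return shared / shorter if shorter > 0 else 0.0


def consensus_call(a, b, threshold=0.5):
    """Return the coincident region if the two calls agree, else None.

    `a` and `b` are the intervals reported by the two modules; None
    stands for a call that is treated as a false positive.
    """
    if overlap_ratio(a, b) > threshold:
        return (max(a[0], b[0]), min(a[1], b[1]))
    return None
```

The comparison is strict because the text requires the ratio to exceed the set threshold, so two calls that overlap by exactly 50% are not merged under the preferred setting.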
In the present invention, the confidence interval may be set to a value or range commonly employed by those skilled in the art, for example, 95% or 99%.
In the present invention, the CNV boundaries are identified by chromosome segmentation, relying on a model or algorithm that segments sequence data drawn from normal distributions with different means. Since the mean value of a CNV region differs significantly from that of the adjacent chromosome region, CNV boundary information can be identified using the given model.
Unlike NIPT chromosome aneuploidy detection, noninvasive CNV detection is more susceptible to system noise: under unstable experimental conditions, data fluctuation tends to appear as false positives in the results. One of the main features of a noisy system is a bias in the real GC of the reads, which cannot be removed by genomic GC correction.
As described above, the device according to the invention detects microdeletions of the autosomes and the X chromosome of a sample based on the NIPT platform. The invention provides a noninvasive CNV detection device with higher detection sensitivity, which can reduce the probability of false positives or false negatives and greatly improve the accuracy and sensitivity of detecting fetal CNV.
Drawings
Various other advantages and benefits of the present invention will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. It is evident that the figures described below are only some embodiments of the invention, from which other figures can be obtained without inventive effort for a person skilled in the art. Also, like reference numerals are used to designate like parts throughout the figures.
FIG. 1 shows a data analysis flow performed by the detection apparatus of the present invention.
FIG. 2 shows the result of CNV determination by the method of comparative example 1.
FIG. 3 shows the result of CNV determination by the method of example 1.
Detailed Description
The present invention relates to the following definitions.
High throughput sequencing: high throughput sequencing technology (High-throughput sequencing), also known as "Next generation" sequencing technology ("Next-generation" sequencing technology), allows sequencing of hundreds of thousands to millions of DNA molecules in parallel at a time.
Window (sliding window): generally refers to a region of fixed length on the genome.
Background library: a sample library consisting of N (generally N >= 20) healthy human samples.
Reads: the plural of read; a read is a short sequence fragment generated by the high throughput sequencing platform.
Unique reads: reads that are uniquely aligned to the genome. During sequencing, some reads can be aligned to multiple locations on the genome at the same time; such reads and duplicate reads are filtered out of all reads, and the remainder are the unique reads.
Mappability: for some windows, the uniqueness of short sequences is low, probably mainly because of repetitive sequences from large blocks of heterochromatin or for more complex biological reasons. The efficiency of each window is therefore calculated using the mappability factor and compared with a threshold of 0.625, and windows below the threshold are not taken into account.
Genomic GC: this parameter represents the genomic GC corresponding to each window, which is the same in all libraries. In model one described below, this parameter is used for GC correction, to correct for differences in read counts caused by GC bias.
Reads GC: GC corresponding to all reads in each window.
Unique reads GC: representing the GC corresponding to the unique reads in each window, which is used to calculate the concentration of CNV in model one below; in model two below, the unique reads GC was used to fit the background data for the data points P synthesized for 10 consecutive windows, thus calculating the residual of P.
Dup: duplication; a duplicated region, in which the target CNV is present in 3 copies.
Del: deletion; a deleted region, in which the target CNV is present in a single copy.
Normal: the normal 2 copies.
True GC: is defined with respect to the intrinsic genomic GC. The real GC refers to GC corresponding to the unique reads, and is sequence GC information actually reflected in a sequencing process and an experimental environment.
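The use of the unique reads GC in model two, where a data point P merged from consecutive windows is compared with a fit of the background data, can be illustrated by an ordinary least-squares sketch. This assumes a simple straight-line fit of the background Mr' against the background Mgc' for one unit; the weighting and confidence steps are omitted, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def unit_residual(mr, mgc, background_mr, background_mgc):
    """Residual of the test point (Mgc, Mr) against the background fit.

    A clearly positive residual would hint at dup and a clearly negative
    one at del; the cut-offs are not specified in this sketch.
    """
    slope, intercept = fit_line(background_mgc, background_mr)
    expected = slope * mgc + intercept  # theoretical Mr at the observed Mgc
    return mr - expected
```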
The invention detects microdeletions of the autosomes and the X chromosome of a sample based on an NIPT platform using low-depth whole genome sequencing.
In one embodiment, the copy number variation detection apparatus of the present invention includes:
a sequencing data acquisition module, a windowed fragmentation module, a module for detecting CNV based on the number of all reads, a module for detecting CNV based on the number of unique reads, and a model result summarizing module.
First, the sequencing data acquisition module performs sequencing based on the acquired maternal peripheral blood free DNA to obtain chromosomal sequencing data of the sample to be tested and chromosomal sequencing data from the background library samples. The module is used to extract, amplify, pool, and sequence the mixed DNA in maternal peripheral blood based on SE40. Finally, the sequencing result is aligned to the chromosomes by an information analysis method so that the chromosomal information can be analyzed. The methods for extracting, amplifying, pooling and sequencing the mixed DNA in maternal peripheral blood may be any methods commonly used in the field.
The number of background library samples in this example is not fixed and can be determined according to different time periods, different reagents and different experimental conditions. For example, the background library includes 1000 or more negative samples, preferably 2000 or more, more preferably 3000 or more, still more preferably 3500 or more, and most preferably 4000 negative samples.
The windowed fragmentation module is configured to align the sequencing data to a reference genomic sequence, cut the sequencing data into windows of equal length with an intersection between every two adjacent windows, and count window parameters of each window, including reads, unique reads (UR), mappability and/or unique reads GC.
In the present invention, the reference genome sequence is not limited, and any known reference sequence of human genome can be used as long as it is ensured that the same set of sequences is used for alignment for all samples. In a specific embodiment, the reference genomic sequence is the hg19 reference sequence.
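Among the window parameters counted by the windowed fragmentation module, the mappability value (the "map" parameter) is compared with the threshold of 0.625 given in the definitions, and windows below it are excluded. A minimal sketch of that filter follows; the dictionary layout and the function name are assumptions made for illustration.

```python
def filter_by_mappability(windows, threshold=0.625):
    """Keep only windows whose mappability reaches the threshold.

    `windows` is a list of dicts with at least a "map" key holding the
    mappability factor of the window (hypothetical layout).
    """
    return [w for w in windows if w["map"] >= threshold]
```

Because the reference sequence is the same for all samples, this filter would select the same windows for the sample to be tested and for every background library sample.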
The module for detecting CNV based on all reads is used to execute model one described below and comprises the following sub-modules:
a data preprocessing and normalization module, which performs GC correction on all reads to eliminate differences between libraries, and performs uniformity correction after the GC correction so that all samples to be tested and the background library samples are comparable;
a Z-test signal amplification module, which calculates the mean and variance of each window using the background library samples and calculates the Z value of each window by Z test;
a chromosome slicing module, which slices the chromosome using the Z values of consecutive windows, merges consecutive windows with similar states into an interval to be tested, and judges the attribute of the interval, namely dup, del, or normal;
a Z-value confidence interval calculation module, which, for each interval to be tested merged by the chromosome slicing module, calculates the median of the Z values of the consecutive windows in the same interval of each background library sample, calculates the 95% confidence interval from the mean and variance of the distribution of these medians, judges whether the interval to be tested falls within the confidence interval, and judges an interval that does not fall within it to be a potential CNV interval;
a CNV probability calculation module, which, for each potential CNV interval, sums all reads of the windows in the same interval of the background library samples to obtain a probability density distribution, calculates the significance probability from all reads of the CNV interval to be tested, applies a negative logarithmic transformation to the significance probability, and compares the result with a given threshold;
and a CNV concentration calculation module, which, for each potential CNV interval, fits the UR and the real GC of the same interval of the background library samples, determines the UR and GC of the potential CNV interval, calculates the CNV concentration from them, and judges whether the sample to be tested is suspected of maternal CNV or placental mosaicism by comparing the calculated CNV concentration with the true fetal concentration.
Model one
Model one comprises the following steps:
Step one, data preprocessing and normalization, which comprises the following sub-steps:
(1) GC correction
In model one, GC correction is applied to the reads using the lowess algorithm so that the fluctuation of each chromosome can be evaluated objectively. To eliminate differences between libraries, any chromosome to be tested is corrected together with chromosome 1 and chromosome 2. Because abnormalities of chromosomes 1 and 2 have a low incidence and these chromosomes cover a wide GC range, including them makes the correction more stable. The smoothing coefficient f is set to 0.67. The correction uses high-quality windows, i.e., those with unique reads/(quality+1) >= 0.625, and the reads of low-quality windows are then estimated from the corrected overall mean and variance.
(2) Uniformity correction
To make all samples to be tested comparable with the reference samples, model one estimates the variance of the GC-corrected chromosome window reads (after removing outliers) and divides the window reads of the chromosome to be tested by the corresponding standard deviation, thereby correcting the variance to a level of 1.
Here, the purpose of GC correction is to correct the GC bias inherent in the sequencing process, so that after correction the reads at different positions on the chromosome tend toward the same level. The chromosome to be tested is corrected with chromosome 1 and chromosome 2 as background in order to eliminate differences between libraries: although the data volume differs between libraries, the relative relationship between chromosomes within a library is stable, so using chromosome 1 and chromosome 2 as references removes the difference in data volume between libraries to a certain extent.
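The uniformity correction of sub-step (2) can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the outlier rule (values more than k median absolute deviations from the median) and the value of k are assumptions, since the text only states that abnormal values are removed.

```python
import statistics

def uniformity_correct(window_reads, k=3.0):
    """Scale GC-corrected window reads so that their standard deviation is about 1.

    Outliers are ignored when the standard deviation is estimated. The rule
    used here (more than k median absolute deviations from the median) is an
    assumed example of outlier removal.
    """
    med = statistics.median(window_reads)
    mad = statistics.median(abs(x - med) for x in window_reads)
    kept = [x for x in window_reads if mad == 0 or abs(x - med) <= k * mad]
    sd = statistics.stdev(kept)  # raises or divides by zero if the data are constant
    return [x / sd for x in window_reads]

corrected = uniformity_correct([98, 101, 100, 102, 99, 100, 250])
```

After the division, the non-outlier windows have a standard deviation of 1, so windows from different samples are expressed on the same scale.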
Step two, signal amplification by Z test
The mean and variance of each window are calculated using the background library samples, and the Z value of each window is calculated by Z test. The Z test is repeated three times; in each round the data are converged so that a smaller variance is obtained, which amplifies the signal.
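A minimal sketch of the Z test of step two is given here. The text does not specify how the data are converged in each round; this sketch assumes that background values lying more than `cut` standard deviations from the mean are dropped in each round, and both `cut` and the function name are hypothetical.

```python
import statistics

def window_z(sample_value, background_values, rounds=3, cut=3.0):
    """Z value of one window of the sample against the background library.

    In each round, background values farther than `cut` standard deviations
    from the mean are dropped (assumed convergence rule), which reduces the
    variance and therefore increases the Z value of a real signal.
    """
    values = list(background_values)
    for _ in range(rounds):
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        trimmed = [v for v in values if abs(v - mean) <= cut * sd]
        if len(trimmed) < 3 or len(trimmed) == len(values):
            break  # nothing more to trim, or too few values left
        values = trimmed
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return (sample_value - mean) / sd

background = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.0, 10.1, 9.9, 30.0]
z = window_z(10.5, background)
```

In this toy data set, the single extreme background value inflates the variance; once it is trimmed, the same sample value yields a much larger Z value.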
Step three, slicing the chromosome by sliding window
To distinguish CNV intervals such as dup and del from normal intervals on the chromosome to be tested, model one slices the chromosome using the Z values of consecutive windows. Using a sliding window method, consecutive windows with similar states are merged into one interval, and the attribute of the interval (dup, del, or normal) is then judged.
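The merging of consecutive windows can be sketched as follows. The sliding width of 3 windows and the Z threshold of 1.96 are assumptions chosen for illustration; the text does not fix either value.

```python
import statistics

def smooth(z_values, width=3):
    """Centered sliding-window mean of the Z values (truncated at both ends)."""
    half = width // 2
    return [statistics.fmean(z_values[max(0, i - half): i + half + 1])
            for i in range(len(z_values))]

def state(z, threshold=1.96):
    """Classify one smoothed Z value (the threshold is an assumed example)."""
    if z > threshold:
        return "dup"
    if z < -threshold:
        return "del"
    return "normal"

def merge_windows(z_values, width=3, threshold=1.96):
    """Merge consecutive windows with the same state into (first, last, state)."""
    states = [state(z, threshold) for z in smooth(z_values, width)]
    segments = []
    for i, s in enumerate(states):
        if segments and segments[-1][2] == s:
            segments[-1] = (segments[-1][0], i, s)  # extend the current segment
        else:
            segments.append((i, i, s))              # start a new segment
    return segments

segments = merge_windows([0.1, -0.3, 0.2, 3.1, 2.9, 3.4, 3.0, 0.2, -0.1, 0.0])
```

Each returned tuple gives the index of the first and last window of an interval to be tested together with its attribute.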
Step four, calculating a Z value confidence interval
For each interval obtained by slicing, the median of the Z values of the consecutive windows in the same interval is calculated for each background library sample, and the 95% confidence interval is estimated from the mean and variance of the distribution of these medians. If the interval to be tested falls within the confidence interval, it is considered to have the normal two copies; otherwise, it is a potential CNV interval.
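Step four can be illustrated by the sketch below. It assumes that the interval to be tested is represented by the median Z value of its windows and that the 95% range is taken as mean ± 1.96 standard deviations of the background medians; both choices are assumptions of this sketch.

```python
import statistics

def z_confidence_interval(background_z, first, last, z_crit=1.96):
    """95% range of the median Z over windows first..last across background samples."""
    medians = [statistics.median(sample[first:last + 1]) for sample in background_z]
    mean = statistics.fmean(medians)
    sd = statistics.stdev(medians)
    return mean - z_crit * sd, mean + z_crit * sd

def is_potential_cnv(sample_z, background_z, first, last):
    """True if the sample's median Z in the interval lies outside the 95% range."""
    low, high = z_confidence_interval(background_z, first, last)
    observed = statistics.median(sample_z[first:last + 1])
    return not (low <= observed <= high)

# Each row holds the window Z values of one background library sample.
background_z = [
    [0.1, -0.2, 0.3, 0.0],
    [-0.4, 0.2, 0.1, -0.1],
    [0.3, 0.5, -0.2, 0.1],
    [0.0, -0.3, 0.2, 0.4],
    [-0.1, 0.1, -0.5, 0.2],
]
flag = is_potential_cnv([0.2, 3.0, 2.8, 3.3], background_z, first=1, last=3)
```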
Step five, calculating CNV probability
For each potential CNV interval, the window reads within the same interval of each background library sample are summed to obtain a probability density distribution. The significance probability P is calculated from the reads of the CNV interval to be tested, transformed by negative logarithm, and compared with a threshold.
The threshold is defined by the lowest detection limit of the positive samples, that is, it is a threshold that ensures the CNV intervals of true positive samples are reported.
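The calculation of step five can be sketched as follows, under the assumption that the distribution of background read sums is approximated by a normal distribution and that a two-sided probability and a base-10 logarithm are used. None of these choices is stated in the text, and the threshold itself is not given a numerical value here.

```python
import math
from statistics import NormalDist, fmean, stdev

def cnv_score(sample_reads, background_reads, first, last):
    """-log10 of the two-sided significance probability of the sample's read sum.

    The background distribution of read sums is modelled as a normal
    distribution, which is an assumption of this sketch.
    """
    sums = [sum(bg[first:last + 1]) for bg in background_reads]
    dist = NormalDist(fmean(sums), stdev(sums))
    observed = sum(sample_reads[first:last + 1])
    tail = min(dist.cdf(observed), 1.0 - dist.cdf(observed))
    p = max(2.0 * tail, 1e-300)  # floor avoids taking the logarithm of zero
    return -math.log10(p)

# Each row holds the window reads of one background library sample.
background_reads = [
    [100, 102, 98],
    [101, 99, 100],
    [99, 100, 103],
    [98, 101, 100],
    [102, 100, 99],
]
score = cnv_score([110, 112, 109], background_reads, first=0, last=2)
```

A score above the chosen threshold marks the interval as significant; a sample whose read sum lies inside the background distribution gives a score close to 0.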
Step six, calculating the concentration of CNV
For the interval in which the CNV is located, a fitting line is calculated from the UR and the real GC of the same interval of the background library samples, and the concentration is calculated from the UR and GC of the potential CNV. The CNV concentration is compared with the true fetal concentration: if it is significantly lower than the fetal concentration, the CNV is considered a possible false positive caused by data fluctuation or noise; if it is significantly higher than the fetal concentration, maternal CNV or mosaicism is suspected.
The true fetal concentration may be determined as follows: for a male fetus, it is calculated from the content of the Y chromosome; for a female fetus, it can be estimated from information such as the maternal gestational week and weight, and this estimation does not affect the identification of maternal CNV.
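The comparison of step six can be sketched as follows. The text does not give the concentration formula. This sketch assumes that a one-copy fetal change shifts the read count by half the fetal concentration, so that the concentration is 2 × |observed UR / expected UR − 1|; the tolerance used to decide "significantly" lower or higher is a hypothetical value.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def cnv_concentration(sample_ur, sample_gc, background_ur, background_gc):
    """Assumed formula: 2 * |observed UR / UR expected from the GC fit - 1|."""
    slope, intercept = fit_line(background_gc, background_ur)
    expected = slope * sample_gc + intercept
    return 2.0 * abs(sample_ur / expected - 1.0)

def interpret(cnv_conc, fetal_conc, tolerance=0.5):
    """Compare the CNV concentration with the fetal one (tolerance is hypothetical)."""
    if cnv_conc < fetal_conc * (1.0 - tolerance):
        return "possible false positive (fluctuation or noise)"
    if cnv_conc > fetal_conc * (1.0 + tolerance):
        return "suspected maternal CNV or mosaicism"
    return "consistent with fetal CNV"

bg_gc = [0.40, 0.41, 0.42, 0.43]
bg_ur = [1000.0, 1010.0, 1020.0, 1030.0]
conc = cnv_concentration(1065.75, 0.415, bg_ur, bg_gc)
label = interpret(conc, fetal_conc=0.10)
```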
The module for detecting CNV based on the number of unique reads is used to execute model two described below and comprises the following sub-modules:
a MiniModel construction module, which preprocesses the data to remove the difference in data volume between libraries, defines the sliding window length m according to the resolution, calculates the average reads (Mr) and average GC (Mgc) of every m adjacent windows, calculates the distributions of Mr' and Mgc' in the same interval using the background library samples, fits Mr' against Mgc', calculates the residual between the Mr to be tested and the theoretical value corresponding to its Mgc, judges the attribute of the window (dup, del, or normal) from the residual, and calculates a weight from the correlation R between Mr' and Mgc', the Mgc, and the standard deviation sd of the background Mr' in order to judge the confidence;
a chromosome slicing module, which uses a given model or algorithm to identify adjacent regions that follow normal distributions with significantly different means, thereby slicing the chromosome and identifying the CNV boundary positions;
specifically, this module may slice the chromosome using the HaarSeg model so as to identify chromosome intervals with the same copy number, the parameter BreaksFdrQ of the model being determined adaptively, that is, converged gradually by a specified step until the results of two consecutive slicing cycles are consistent and the model is stable, i.e., the number of slices no longer changes;
a significance evaluation module, which, for each interval, randomly draws the same number of window values from other regions of the chromosome of the sample to be tested and repeats the process, for example 10000 times, to determine the significance of the true value within the background distribution.
Model two
Model two comprises the following steps:
Step one, MiniModel construction
For the chromosome to be tested, each window's reads are divided by the median of the chromosome 1 window reads to eliminate the difference in data volume between libraries. After this preprocessing, the sliding window length m is specified according to the resolution, and the average reads (Mr) and average GC (Mgc) are calculated for every m adjacent windows. The Mr' and Mgc' distributions of the same interval are calculated from the background library samples and fitted with a linear regression model. The residual is calculated between the Mr to be tested and the theoretical value corresponding to its Mgc: the larger the residual, the more likely the m windows belong to dup; the smaller the residual, the more likely they belong to del; and the closer the residual is to 0, the more likely they have the normal two copies. Finally, a weight is calculated from the correlation R between Mr' and Mgc', the Mgc, and the standard deviation sd of the background Mr'; the larger the weight, the higher the confidence.
In detail, all window unique reads are first divided by the average number of unique reads of chromosome 1 to eliminate the difference in data volume between samples. Then, taking every 10 adjacent windows as a unit, we calculate the mean of the corrected unique reads in the sample to be tested (Mr) and the mean GC content of the corresponding region (Mgc). Similarly, for each background library sample we calculate the Mr' and Mgc' of the same region. From the Mr' and Mgc' vectors obtained from the background library samples, a fitting line of Mr against Mgc for the target region is obtained by regression analysis, and the fetal signal is separated from the mixed signal according to the residual between the observed value and the theoretical value. However, owing to the limitations of low data volume sequencing and the fragment preference of DNA during sequencing, unique reads are not evenly distributed across the chromosome. Calculating the residual of every unit directly from the fitting line therefore does not treat all units fairly. We therefore additionally calculate the standard deviation of Mr' across all background library samples, the Pearson correlation coefficient between Mr' and Mgc', and the quantile of the Mgc of the sample to be tested within the distribution of the background Mgc', and combine these three variables to calculate the weight. A larger standard deviation, a smaller correlation coefficient, or a quantile closer to the boundary means that the sequencing quality of the unit region is low or that the correlation between unique reads and GC is weak; the confidence is then low and the weight is small, which removes the influence of the low-confidence unit on the surrounding regions. Conversely, a unit with high confidence has a large weight and therefore a large influence on the final judgment.
In step one, all fragmented regions are classified as dup (duplication), del (deletion), or normal, and dup and del are finally reported as CNVs. The fitting of the Mr' and Mgc' distributions is an analysis of the reference samples in the background library, that is, the Mr' and Mgc' of the same window interval are calculated from the reference samples.
For example, for 1000 reference samples, the same interval yields 1000 values of Mr' paired with 1000 values of Mgc'. Plotting these 1000 data points with Mgc' on the horizontal axis and Mr' on the vertical axis gives the scatter distribution of the background, from which a fitting line is obtained; any position on the fitting line represents the theoretical value of Mr' corresponding to the current Mgc'.
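The fitting and residual calculation of step one can be sketched as follows. The function names are hypothetical, ordinary least squares is assumed for the linear regression, and a trailing group of fewer than m windows is simply dropped in this sketch.

```python
def unit_means(values, m=10):
    """Mean of every m adjacent windows (a trailing partial unit is dropped)."""
    return [sum(values[i:i + m]) / m for i in range(0, len(values) - m + 1, m)]

def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mini_model_residual(sample_mr, sample_mgc, background_mr, background_mgc):
    """Observed Mr minus the theoretical Mr read off the background fitting line."""
    slope, intercept = fit_line(background_mgc, background_mr)
    theoretical = slope * sample_mgc + intercept
    return sample_mr - theoretical

# Mgc' (horizontal axis) and Mr' (vertical axis) of one unit across the background.
background_mgc = [0.40, 0.42, 0.44, 0.46]
background_mr = [0.98, 1.00, 1.02, 1.04]
residual = mini_model_residual(1.06, 0.43, background_mr, background_mgc)
```

A clearly positive residual points toward dup, a clearly negative one toward del, and a residual near 0 toward the normal two copies.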
Step two, chromosome sectioning
Model two slices the chromosome using the HaarSeg model. The parameter BreaksFdrQ is determined adaptively by the model, that is, it is converged gradually by a specified step until the results of two consecutive slicing cycles are consistent, at which point the model is stable.
The HaarSeg model is an analysis model for array CGH data; it performs segmentation of chromosomes and identifies chromosome intervals with the same copy number. The larger BreaksFdrQ is, the higher the model resolution and the more slices are produced; conversely, the lower the resolution, the fewer the slices. The number of slices changes as BreaksFdrQ changes; when the number of slices no longer changes between two adjacent cycles, the model is considered stable. This does not mean that only one slice remains, only that the number of slices no longer changes under different values of BreaksFdrQ. For the HaarSeg model, reference may be made to, for example: http:// webee.technology.ac.il/Sites/People/Yonina eldar/Info/software/Haarseg.htm.
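The adaptive determination of BreaksFdrQ can be sketched as the loop below. The segmenter is passed in as a function; `toy_segment` is a stand-in written for this sketch and is not the HaarSeg algorithm. The starting value, the step, the lower bound, and the direction of the search (decreasing) are all assumptions.

```python
def adapt_breaks_fdr_q(values, segment, q_start=0.5, step=0.1, q_min=0.01):
    """Lower q by `step` until two consecutive runs give the same number of slices."""
    q = q_start
    previous = len(segment(values, q))
    while q - step >= q_min:
        q -= step
        current = len(segment(values, q))
        if current == previous:
            return q, current  # slice count is stable: stop here
        previous = current
    return q, previous

def toy_segment(values, q):
    """Stand-in segmenter (NOT HaarSeg): cut wherever the jump exceeds 0.1 / q.

    A larger q gives a smaller jump threshold and therefore more slices,
    mirroring the behaviour described for BreaksFdrQ.
    """
    cuts = [0]
    cuts += [i for i in range(1, len(values))
             if abs(values[i] - values[i - 1]) > 0.1 / q]
    cuts.append(len(values))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

signal = [0.0, 0.05, 0.0, 1.0, 1.05, 1.0, 0.3, 0.08]
q, n_slices = adapt_breaks_fdr_q(signal, toy_segment)
```

In this toy run, the slice count drops from 4 to 3 and then stays at 3, so the loop stops after the second decrement.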
Step three, significance evaluation
For each interval, the same number of window values is randomly drawn from other regions of the chromosome to be tested, and the process is repeated 10000 times to estimate the significance of the true value within the background distribution.
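The random resampling of step three can be sketched as follows. The text does not state which statistic is compared; this sketch assumes the mean of the window values, a two-sided comparison, drawing without replacement, and a fixed random seed for reproducibility.

```python
import random
import statistics

def permutation_p_value(segment_values, other_values, n_iter=10000, seed=0):
    """Empirical significance of a segment against random draws from other regions.

    Returns the fraction of draws whose mean deviates from the background
    mean at least as much as the segment mean does (with a +1 correction so
    that the result is never exactly 0).
    """
    rng = random.Random(seed)
    k = len(segment_values)
    center = statistics.fmean(other_values)
    observed = abs(statistics.fmean(segment_values) - center)
    hits = 0
    for _ in range(n_iter):
        draw = rng.sample(other_values, k)  # same number of windows as the segment
        if abs(statistics.fmean(draw) - center) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)

# Deterministic toy background with window values between -1.0 and 1.0.
other = [((i * 37) % 21 - 10) / 10 for i in range(200)]
p = permutation_p_value([3.0, 3.2, 2.9, 3.1, 3.0], other)
```

A small value indicates that random regions of the same size rarely reach the observed signal.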
As described above, model one is based on the counts of all reads, whereas model two is based on the counts of unique reads.
The model result summarizing module performs a comparative analysis of the output results of the two CNV detection modules and outputs the final result.
Summary of the results of the two models
Based on the output results of the two models, if the target CNV interval is reported by both models and the coincidence rate exceeds 50%, the coincident region is reported as CNV. Otherwise, the results for the interval to be tested are inconsistent between the two models, and the interval may be a false positive.
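The summarizing rule can be sketched as follows. The text does not define the coincidence rate; this sketch assumes it is the length of the overlap divided by the length of the union of the two intervals, and also assumes that both models must report the same type (dup or del).

```python
def consensus_cnv(call_one, call_two, min_rate=0.5):
    """Return the coincident region if both models agree, otherwise None.

    Each call is (start, end, state) with half-open coordinates. The
    coincidence rate is assumed to be overlap length / union length.
    """
    s1, e1, t1 = call_one
    s2, e2, t2 = call_two
    if t1 != t2 or t1 == "normal":
        return None
    overlap = min(e1, e2) - max(s1, s2)
    if overlap <= 0:
        return None
    union = max(e1, e2) - min(s1, s2)
    if overlap / union > min_rate:
        return (max(s1, s2), min(e1, e2), t1)
    return None

region = consensus_cnv((1_000_000, 3_000_000, "dup"), (1_500_000, 3_200_000, "dup"))
```

Only the region reported by both models is returned; a weakly overlapping or type-inconsistent pair yields no report and is treated as a possible false positive.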
Examples
The present invention will be described more specifically with reference to examples, but the present invention is not limited to these examples.
Peripheral blood of a pregnant woman, which was sent to a hospital in Beijing at 1 st 2017, was used in the following example and comparative example. The clinical test result of the pregnant woman indicated a low risk of CNV, and subsequent follow-up confirmed that she delivered a normal infant without CNV.
Comparative example 1
The sample was sequenced to obtain chromosome sequencing data of the sample to be tested and chromosome sequencing data of the background library samples.
The above sample was analyzed by the method described in "Statistical Approach to Decreasing the Error Rate of Noninvasive Prenatal Aneuploid Detection caused by Maternal Copy Number Variation" (published online 2015 Nov 4; doi: 10.1038/srep16106; PMCID: PMC4632076), the specific procedure following the method described in that document, to obtain the analysis result shown in FIG. 2. According to the analysis result, the sample was judged to carry a duplicated fragment on the long arm of chromosome 15.
The basis for this judgment is as follows. All windows are normalized and corrected so that the normal two-copy regions are consistent with the background library signal and the residuals follow a normal distribution with a mean of 0. With the 95% confidence interval taken as the threshold, consecutive windows above the threshold tend to be multi-copy and consecutive windows below the threshold tend to be single-copy. The chromosome was segmented by the HaarSeg algorithm (for the HaarSeg algorithm, see: https:// academic.comp/bioinformation/arc/24/16/i 139/199827); the front end of the long arm of chromosome 15 is significantly above the threshold and is therefore highly suspected to be a microduplication CNV region.
Example 1
The sample was sequenced to obtain chromosome sequencing data of the sample to be tested and chromosome sequencing data of the background library samples.
The sequencing data of Example 1 were cut into windows of equal length of 100k, with an overlap of 50k between every two adjacent windows, and window parameters including reads, unique reads (UR), map, genomic GC, and/or unique reads GC were counted for each window.
CNV was detected based on the number of all reads: the Z value of each window was calculated, the CNV probability was calculated, and the fetal concentration was estimated using the CNV probability, thereby judging whether the sample to be tested is suspected of positive CNV and excluding interference from maternal CNV. The analysis result of this step is shown in the model one panel of FIG. 3. According to this result, model one identifies potential CNV boundaries by forward and backward continuous difference calculation combined with wavelet analysis and smoothing for noise reduction, and evaluates the significance of each potential CNV region. Comparison within the sample and between samples shows that the signal at the front end of the long arm of chromosome 15 is not significant, so the region is judged to have the normal two copies.
CNV was detected based on the number of unique reads: the average reads (Mr) and average GC (Mgc) were calculated for every 10 adjacent windows, and a window-specific linear regression model was constructed to judge whether the sample to be tested is suspected of CNV. The analysis result of this step is shown in the model two panel of FIG. 3. According to this result, model two extracts the fetal signal using unique reads, divides the chromosome into regions by HaarSeg slicing, and defines the threshold adaptively according to the fluctuation within the sample. The front end of the long arm of chromosome 15 does not exceed the threshold, so the signal is regarded as fluctuation and the region is judged to have the normal two copies.
The results were then summarized: the output results of the two CNV detection modules were compared and analyzed to output the final result. Both models were negative, so the slightly strong signal on the long arm of chromosome 15 is considered to be system noise fluctuation rather than a true microduplication, and the sample is judged negative.
For the specific operation of each step, see the schemes described above in the specification.
As can be seen from FIG. 3, the method of Example 1 judged chromosome 15 of the above sample to be a normal karyotype, which is consistent with the actual result.
It can be seen that the method of the invention greatly reduces the false positive rate by using multiple correction and filtering criteria.