CN118412041A - DNA sequencing data matching enhancement method and system - Google Patents
- Publication number
- CN118412041A (CN202410885945.XA)
- Authority
- CN
- China
- Prior art keywords
- reference genome
- read
- base position
- dna sequencing
- matching
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Abstract
The invention relates to the field of electric digital data processing, and in particular to a DNA sequencing data matching enhancement method and system. The method comprises: extracting DNA sequencing data; obtaining a sequencing error accumulation coefficient for each base position on a reference genome; adjusting the comparison matching strategy of the genome to be tested and the reference genome; and performing matching enhancement processing on the DNA sequencing data. The system comprises a DNA data extraction unit, a sequencing error accumulation coefficient acquisition unit, a comparison matching strategy adjustment unit, and a data matching enhancement processing unit. The sequencing error accumulation coefficient is obtained by analyzing the differences that arise during sequence comparison, which allows the comparison matching process to be analyzed and the comparison matching strategy to be adjusted. Compared with conventional processing, the adjusted strategy further removes the influence of large-fragment insertion and deletion errors, so that the matching result of the DNA sequencing data is more accurate.
Description
Technical Field
The invention relates to the field of electric digital data processing, in particular to a DNA sequencing data matching enhancement method and system.
Background
DNA data matching refers to the process of aligning DNA sequences to be matched with DNA sequences in a reference database to determine similarity and relatedness between them. DNA data matching is commonly used for research and application in identifying genotypes, disease genes, biological species, relatives, and the like.
The prior art generally employs third-generation single-molecule sequencing, which directly detects the base sequence of a single DNA molecule; signal interpretation is typically based on the optical signal detected by the sequencing instrument, from which the identity of each base is determined. The chemical reactions used during sequencing, such as primer binding and labeling, have a certain error rate, and errors in these steps may gradually accumulate in the sequencing result, raising the error rate of the final sequencing data. In particular, structural variations such as large-fragment insertions or deletions can make data matching difficult.
Disclosure of Invention
To solve the above technical problem, the invention provides a DNA sequencing data matching enhancement method and system. The technical scheme adopted is as follows:
A DNA sequencing data matching enhancement method, comprising:
S1, extracting DNA sequencing data;
S2, obtaining, according to the DNA sequencing data, a sequencing error accumulation coefficient for each base position on a reference genome;
S3, adjusting, according to the sequencing error accumulation coefficient, the comparison matching strategy of the genome to be tested and the reference genome;
S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome;
wherein step S2, obtaining the sequencing error accumulation coefficient according to the DNA sequencing data, comprises:
S201, acquiring read coverage degree data of each base position on the reference genome according to the DNA sequencing data;
S202, acquiring the sequencing error accumulation coefficient of each base position on the reference genome according to the read coverage degree data.
In the above DNA sequencing data matching enhancement method, step S201, acquiring the read coverage degree data of each base position on the reference genome according to the DNA sequencing data, comprises:
S2011, calculating, according to the DNA sequencing data, the error degree of the read in which each read end point at each base position on the reference genome is located;
S2012, calculating the read end point influence degree of each base position on the reference genome according to the error degrees of the reads in which the read end points at that base position are located;
S2013, calculating the read coverage degree of each base position on the reference genome according to the read end point influence degree of that base position.
Here, the error degree is defined for the read in which the end point of a given read at a given base position is located; the read end point influence degree is defined for a given base position; and the read coverage degree is defined for a given base position of the reference genome.
In the above DNA sequencing data matching enhancement method, the error degree of the read in which each read end point at each base position on the reference genome is located is determined by the ratio of the length of the read to the number of abnormal bases on the read.
In the above DNA sequencing data matching enhancement method, the read end point influence degree of each base position on the reference genome is determined by accumulating the read error degrees.
In the above DNA sequencing data matching enhancement method, the read coverage degree of each base position on the reference genome is determined by the read end point influence degree and the significance of the number of times that base position of the reference genome has been sequenced.
In the above DNA sequencing data matching enhancement method, in step S202, the sequencing error accumulation coefficient of each base position on the reference genome is determined by the increasing trend of errors during comparison and the read coverage degree at that position.
In the above DNA sequencing data matching enhancement method, step S3, adjusting the comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient, comprises:
S301, during comparison matching, comparing the sequencing error accumulation coefficient of the currently compared base position of the reference genome with a preset error threshold;
S302, if the sequencing error accumulation coefficient of the currently compared base position of the reference genome is greater than the preset error threshold, treating the comparison matching result at that base position as not credible, and relocating and rematching the currently compared base position.
In the above DNA sequencing data matching enhancement method, step S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome, comprises:
S401, determining, from the position in the reference genome assigned to the DNA read to be compared by the comparison matching strategy, the partial bases whose comparison matching results against the reference genome are abnormal;
S402, acquiring a plurality of DNA reads to be compared that cover the position of the DNA read to be compared, taking the inverse of the sequencing error accumulation coefficient of the base positions corresponding to the partial bases as the credibility of the comparison result, and selecting the base type with the maximum credibility as the corrected base type of the partial bases.
The invention also provides a DNA sequencing data matching enhancement system, comprising:
a DNA data extraction unit, configured to extract DNA sequencing data;
a sequencing error accumulation coefficient acquisition unit, configured to obtain the sequencing error accumulation coefficient of each base position on the reference genome according to the DNA sequencing data;
a comparison matching strategy adjustment unit, configured to adjust the comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient;
a data matching enhancement processing unit, configured to perform matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome;
wherein the sequencing error accumulation coefficient acquisition unit comprises:
a read coverage degree acquisition unit, configured to acquire read coverage degree data of each base position on the reference genome according to the DNA sequencing data;
a sequencing error accumulation coefficient acquisition subunit, configured to acquire the sequencing error accumulation coefficient of each base position on the reference genome according to the read coverage degree data.
In the above DNA sequencing data matching enhancement system, the read coverage degree acquisition unit comprises:
an error degree calculation unit, configured to calculate, according to the DNA sequencing data, the error degree of the read in which each read end point at each base position on the reference genome is located;
a read end point influence degree calculation unit, configured to calculate the read end point influence degree of each base position on the reference genome according to the error degrees of the reads in which the read end points at that base position are located;
a read coverage degree acquisition subunit, configured to calculate the read coverage degree of each base position on the reference genome according to the read end point influence degree of that base position.
The invention has the following beneficial effects:
Based on the specific data patterns and characteristics that large-fragment insertions or deletions produce in genome sequencing data, the invention obtains the sequencing error accumulation coefficient by analyzing the differences that arise during sequence comparison, thereby analyzing the comparison matching process and adjusting the comparison matching strategy. Compared with conventional processing, the adjusted strategy further removes the influence of large-fragment insertion and deletion errors, so that the matching result of the DNA sequencing data is more accurate.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a DNA sequencing data matching enhancement method according to an embodiment of the present invention.
FIG. 2 is a flowchart of the method for obtaining the sequencing error accumulation coefficient in step S2 of the embodiment shown in FIG. 1.
FIG. 3 is a flowchart of the method for acquiring the read coverage degree data in step S201 of the embodiment shown in FIG. 1.
FIG. 4 is a flowchart of the method for adjusting the comparison matching strategy in step S3 of the embodiment shown in FIG. 1.
FIG. 5 is a flowchart of the method for matching enhancement processing of DNA sequencing data in step S4 of the embodiment shown in FIG. 1.
FIG. 6 is a block diagram of a DNA sequencing data matching enhancement system according to another embodiment of the present invention.
FIG. 7 is a block diagram of the sequencing error accumulation coefficient acquisition unit in the embodiment shown in FIG. 6.
FIG. 8 is a block diagram of the read coverage degree acquisition unit in FIG. 7.
FIG. 9 is a block diagram of the comparison matching strategy adjustment unit in the embodiment shown in FIG. 6.
FIG. 10 is a block diagram of the data matching enhancement processing unit in the embodiment shown in FIG. 6.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention to achieve the preset purposes, the following detailed description refers to specific embodiments, structures, features and effects of a method and a system for enhancing matching of DNA sequencing data according to the present invention, which are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of a method and a system for enhancing matching of DNA sequencing data.
Referring to FIG. 1, a flowchart of a DNA sequencing data matching enhancement method according to an embodiment of the present invention is shown. The method comprises the following steps: S1, extracting DNA sequencing data; S2, obtaining, according to the DNA sequencing data, the sequencing error accumulation coefficient of each base position on a reference genome; S3, adjusting, according to the sequencing error accumulation coefficient, the comparison matching strategy of the genome to be tested and the reference genome; S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome.
In this embodiment, the DNA sequencing data of step S1 are obtained from a sequencer, as follows. First, DNA is extracted from the sample and suitably processed (for example purification, modification, and fragmentation) to ensure the accuracy and efficiency of the sequencing reaction. The processed DNA sample is loaded into the sequencing device, and single-molecule sequencing is typically performed through a carrier such as a nanopore (Nanopore) or an SMRT cell (PacBio). During sequencing, single DNA molecules pass through the nanopores or SMRT cells one by one; the sequencing instrument collects and records the signal changes of each single DNA molecule in the nanopore or SMRT cell and determines the base sequence from those changes. The acquired base sequence data are then spliced and corrected to obtain the complete DNA sequence information.
The DNA sequencing data above are generally obtained with third-generation single-molecule sequencing, which directly detects the base sequence of a single DNA molecule and typically determines the identity of each base from the optical signal detected by the sequencing instrument. A significant advantage of this technique is its longer read length for gene fragments, but large-fragment insertion or deletion errors also arise when the sequencing data are compared and matched. Such errors change the length and structure of the sequence and cause sequence mismatches during comparison, so that conventional comparison algorithms have difficulty finding the accurate matching position. Variation of the DNA molecule itself can present similar fragment-type matching abnormalities, which hinders accurate and rapid data comparison and matching.
Large fragment insertions or deletions will manifest themselves in genomic sequencing data as specific data patterns and features, one of which is that the sequence depth of the insertion or deletion region may be lower or higher than the depth level of the surrounding normal region. Thus, abnormalities resulting from insertions or deletions can be determined by analysis of the differences in the sequence alignment process. In particular, the insertion region may exhibit low depth due to the newly added fragment not being sufficiently sequenced; the deleted region may exhibit high depth because the deleted fragment results in multiple sequencing of the sequence of the region. In addition, in analyzing coverage during read matching, the effect of the end points of the read length needs to be removed.
Therefore, this embodiment aims to provide more accurate sequencing data by analyzing the above data and adjusting the comparison matching strategy, using the sequencing error accumulation coefficient as the analysis result. FIG. 2 shows a flowchart of step S2. In step S2, obtaining the sequencing error accumulation coefficient according to the DNA sequencing data comprises: S201, acquiring read coverage degree data of each base position on the reference genome according to the DNA sequencing data; S202, acquiring the sequencing error accumulation coefficient of each base position on the reference genome according to the read coverage degree data.
The specific flow of step S201 is shown in FIG. 3 and comprises: S2011, calculating, according to the DNA sequencing data, the error degree of the read in which each read end point at each base position on the reference genome is located; S2012, calculating the read end point influence degree of each base position on the reference genome according to those error degrees; S2013, calculating the read coverage degree of each base position on the reference genome according to the read end point influence degree of that base position.
In one embodiment of the invention, a 30x subset of the original data (a subset at 30-fold sequencing depth) is used, and each sequencing read is treated as a one-dimensional vector, denoted the original vector. The position numbers on the corresponding reference genome are known; the correspondence is established by alignment with the BLAST algorithm. Only the known portion of the reference genome is analyzed and the unknown portion is not considered, i.e., the base positions on the reference genome refer only to known positions.
The error degree of the read in which each read end point at each base position on the reference genome is located is determined by the ratio of the length of the read to the number of abnormal bases on the read. The specific calculation formula is as follows:
$$E_{i,j} = \frac{L_{i,j}}{N_{i,j}}$$
where $E_{i,j}$ represents the error degree of the read in which the end point of the $j$-th read at the $i$-th base position is located; $L_{i,j}$ represents the length of that read; and $N_{i,j}$ represents the number of base positions on that read whose comparison result is abnormal. (The original symbols did not survive in this text; the symbol names here are stand-ins, and the formula is written out from the ratio stated above.)
$E_{i,j}$ represents the degree to which the $i$-th base position is influenced by the matching result of the $j$-th read: the longer the $j$-th read and the fewer its abnormal bases, i.e., the more accurate its matching result, the more likely it is that a difference observed at the $i$-th base position relative to the remaining base positions, when multiple reads are actually matched, is an error caused by the read end point.
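The ratio described above can be sketched in a few lines of Python. This is only an illustration: the function name `read_error_degree` is hypothetical, and the guard that treats a read with zero abnormal bases as having one is an assumption of this sketch, since the text does not say how that case is handled.

```python
def read_error_degree(read_length: int, abnormal_bases: int) -> float:
    """Ratio of read length to the number of abnormally compared bases.

    A longer read with fewer abnormal bases yields a larger value.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if abnormal_bases < 0:
        raise ValueError("abnormal_bases must be non-negative")
    # Assumption (not in the source): clamp the denominator to 1 so that
    # a read without any abnormal base does not cause a division by zero.
    return read_length / max(abnormal_bases, 1)


if __name__ == "__main__":
    print(read_error_degree(10_000, 50))   # 200.0
    print(read_error_degree(10_000, 500))  # 20.0
```

With the same read length, the read with fewer abnormal bases gets the larger value, which matches the interpretation given in the text.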
The read end point influence degree of each base position on the reference genome is determined by accumulating the read error degrees. Based on the error degrees of the reads in which the end points are located, the read end point influence degree of each base position on the reference genome is calculated as follows:
$$Y_i = \operatorname{Norm}\left(\sum_{j=1}^{n_i} E_{i,j}\right)$$
where $Y_i$ represents the read end point influence degree of the $i$-th base position; $n_i$ represents the number of reads in which the read end points at the $i$-th base position are located; $E_{i,j}$ represents the error degree of the read in which the end point of the $j$-th read at the $i$-th base position is located; and $\operatorname{Norm}$ represents a linear normalization function. (The original symbols did not survive in this text; the symbol names here are stand-ins.) Accumulating the read error degrees for which the $i$-th base position serves as a read end point characterizes how strongly that base position is affected, across its local region, by being a read end point.
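The accumulation and linear normalization described above might look like the following sketch. It assumes that "linear normalization" means min-max scaling over all base positions, which the text does not state explicitly; `endpoint_influence` and the input layout (a mapping from base position to the error degrees of the reads ending there) are hypothetical.

```python
from typing import Dict, List


def endpoint_influence(error_degrees: Dict[int, List[float]]) -> Dict[int, float]:
    """Sum the read error degrees per base position, then min-max scale to [0, 1]."""
    totals = {pos: sum(values) for pos, values in error_degrees.items()}
    if not totals:
        return {}
    lowest, highest = min(totals.values()), max(totals.values())
    if highest == lowest:
        # All positions accumulate the same amount; scaling is undefined,
        # so this sketch returns 0.0 everywhere.
        return {pos: 0.0 for pos in totals}
    return {pos: (t - lowest) / (highest - lowest) for pos, t in totals.items()}


if __name__ == "__main__":
    per_position = {
        100: [200.0, 50.0],          # two reads end at position 100
        101: [20.0],                 # one read ends at position 101
        102: [100.0, 100.0, 100.0],  # three reads end at position 102
    }
    print(endpoint_influence(per_position))
```

Position 102 accumulates the largest total and maps to 1.0, position 101 the smallest and maps to 0.0.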
The read end point influence degree of a base position characterizes, from a wider perspective, how reliable a difference between the base comparison result of the reads to be tested at that position and the reference genome is. From it, the read coverage degree of each base position on the reference genome can be obtained. Specifically, the read coverage degree is determined by the read end point influence degree and the significance of the number of times the base position of the reference genome has been sequenced. The calculation formula is as follows:
[Formula not recoverable from this text.]
where $C_i$ represents the read coverage degree at the $i$-th reference genome base position; $m_i$ represents the number of times the $i$-th reference genome base position was sequenced; and $\bar{m}$ represents the mean number of times the base positions of the reference genome were sequenced. (The original symbols did not survive in this text; the symbol names here are stand-ins.) The term combining $m_i$ and $\bar{m}$ represents the significance of the number of times the $i$-th base position was sequenced: the more times a position is sequenced, the more reliable the comparison and analysis of its reads, and the higher the confidence that an end point effect is present at that position. This characterization can also be used in the subsequent analysis of the stepwise distribution of base comparison results.
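The exact formula is not available in this text, so the following is only a sketch under loudly stated assumptions: the significance of the sequencing count is taken to be the count divided by the mean count, and it is combined with the read end point influence degree by multiplication. Both choices, and the name `read_coverage_degree`, are hypothetical and not confirmed by the source.

```python
from statistics import mean
from typing import Dict


def read_coverage_degree(
    influence: Dict[int, float],
    times_sequenced: Dict[int, int],
) -> Dict[int, float]:
    """Weight each position's end point influence by how often it was sequenced.

    Assumptions (not from the source): significance = count / mean count,
    and coverage degree = influence * significance.
    """
    average = mean(times_sequenced.values())
    if average <= 0:
        raise ValueError("mean sequencing count must be positive")
    return {
        pos: influence[pos] * (times_sequenced[pos] / average)
        for pos in influence
    }


if __name__ == "__main__":
    influence = {100: 0.5, 101: 1.0}
    counts = {100: 20, 101: 40}
    print(read_coverage_degree(influence, counts))
```

Under these assumptions, a position sequenced more often than average has its influence degree scaled up, in line with the statement that more sequencing passes make the end point effect more credible.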
The main factors affecting the coverage and variation of sequencing data can be taken to be large-fragment insertion/deletion abnormalities and changes caused by genetic variation. The two are distributed differently in genome sequencing data. During DNA sequencing, a large-fragment insertion/deletion abnormality arises from the read-length limit and appears as a stepwise difference in the coverage values of the comparison results, and this stepwise difference remains fairly stable over a longer range. Genetic variation, by contrast, tends to produce more local stepwise differences and single-point differences.
From this analysis, the coverage values of comparison results affected by the read-length limit show a stepwise difference: typically part of the base comparison results are normal, while an inserted portion cannot be matched at its position at all and yields matching errors. Such an abnormality has a wide range of influence and indicates a large-fragment insertion/deletion abnormality. The portion preceding the abnormality shows a gradually increasing trend of comparison errors, i.e., abnormal base comparison positions occur more and more frequently. The influence of genetic variation spans a smaller stage and shows no such gradually increasing trend.
For any DNA read to be compared, the base sequences at the reference genome positions in two adjacent intervals of equal size are extracted. The comparison process is affected by error accumulation in the actual sequencing process: the matching result of an earlier base position against the reference genome affects the matching results of subsequent positions, and an error in an earlier matching result reduces the reliability of later ones. Therefore, in step S202, the sequencing error accumulation coefficient of each base position on the reference genome is determined by the increasing trend of errors during comparison and the read coverage degree at that position. The calculation formula is as follows:
[Formula not recoverable from this text.]
where $G_i$ represents the sequencing error accumulation coefficient at the $i$-th reference genome base position; $C_i$ represents the read coverage degree at the $i$-th reference genome base position; $S_i$ represents the sequence of read coverage degrees of all base positions spanned by the read in which the $i$-th reference genome base position is located; $S_i'$ represents the sequence of read coverage degrees of all base positions in the interval of equal length preceding that read; $\sigma(\cdot)$ denotes the standard deviation of a sequence; $\operatorname{Norm}(\cdot)$ denotes a normalization operation; and $a$ represents a preset number of neighbors. (The original symbols did not survive in this text; the symbol names here are stand-ins.) In this embodiment, $a$ is set to 5, and it may be set to other values according to the actual situation.
In the above formula, the term comparing the two adjacent intervals characterizes the degree to which the comparison process is affected by error accumulation in the actual sequencing process: when the error in the later interval is larger than that in the preceding interval of equal size, error accumulation is present, the term takes a larger value, and the comparison errors show an increasing trend. In addition, for the comparison result at a single reference genome base position, the greater the read coverage degree at that position, the greater the corresponding sequencing error accumulation coefficient, i.e., the more the error accumulation there is related to the end point influence. The parameter characterizing the end point influence is therefore also incorporated in the formula. In summary, the sequencing error accumulation coefficient $G_i$ can be used to adjust the comparison matching strategy.
As shown in Fig. 4, step S3, adjusting the comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient, includes: S301, during comparison matching, comparing the sequencing error accumulation coefficient of the current comparison base position of the reference genome with a preset error threshold; S302, if the sequencing error accumulation coefficient of the current comparison base position of the reference genome is greater than the preset error threshold, treating the comparison matching result at that position as not credible, and relocating and re-matching the current comparison base position.
In one embodiment of the present invention, the preset error threshold is set to 0.91, and it may be set to other values according to the actual situation. When the sequencing error accumulation coefficient of the current comparison base position of the reference genome is smaller than the error threshold, the comparison result at that reference genome position can be considered reliable, and positioning and matching continue until the comparison of the current read is completed. Together with step S302, this achieves the adjustment of the matching strategy.
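The threshold test of steps S301 and S302 can be illustrated with a minimal Python sketch. The function name `positions_to_relocate` and the sample coefficients are hypothetical; the only values taken from the text are the 0.91 threshold and the "greater than" comparison. The text does not say how a coefficient exactly equal to the threshold is handled, so treating it as reliable here is an assumption.

```python
from typing import List

# Preset error threshold used in this embodiment; other values may be chosen.
ERROR_THRESHOLD = 0.91


def positions_to_relocate(coefficients: List[float],
                          threshold: float = ERROR_THRESHOLD) -> List[int]:
    """Return the indices of base positions whose match is not credible.

    A position is flagged only when its sequencing error accumulation
    coefficient is strictly greater than the threshold (step S302).
    Assumption: a coefficient equal to the threshold is kept as reliable.
    """
    return [i for i, c in enumerate(coefficients) if c > threshold]


if __name__ == "__main__":
    sample = [0.12, 0.95, 0.91, 0.40]  # made-up coefficients for illustration
    print(positions_to_relocate(sample))
```

Positions that are not flagged keep their current match, and matching simply continues along the read.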
The adjusted comparison matching strategy obtained in the above steps is, in effect, an adjustment of the sequencing result that accounts for the error characteristics of third-generation sequencing technology. Based on the obtained sequencing error accumulation coefficients, the sequence structure can be reconstructed through operations such as splicing and correction to obtain more accurate sequence information. The enhancement processing is further described with reference to the flowchart in Fig. 5, taking correction as an example. In step S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome from step S3 includes: S401, determining, according to the position in the reference genome that the comparison matching strategy assigns to the DNA read to be compared, the partial bases whose comparison matching result against the reference genome is abnormal; S402, acquiring a plurality of DNA reads to be compared that cover the position of that DNA read, taking a value inversely proportional to the sequencing error accumulation coefficient of the base position corresponding to the partial bases as the credibility of the comparison result, and selecting the base type with the maximum credibility as the corrected base type of the partial bases.
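The correction in step S402 can be sketched as a credibility-weighted vote. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `corrected_base` is hypothetical, credibility is taken as exactly `1 / coefficient`, and the credibilities of reads reporting the same base type are summed. The text only states that credibility is inversely related to the coefficient and that the base type with maximum credibility is selected.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def corrected_base(observations: Iterable[Tuple[str, float]]) -> str:
    """Pick the corrected base type for one abnormal position.

    observations: (base, coefficient) pairs, one per covering read, where
    coefficient is the sequencing error accumulation coefficient attached
    to that read's observation of the position.
    Assumptions: credibility = 1 / coefficient, summed per base type.
    """
    credibility: Dict[str, float] = defaultdict(float)
    for base, coeff in observations:
        if coeff <= 0:
            raise ValueError("coefficient must be positive to take its inverse")
        credibility[base] += 1.0 / coeff
    if not credibility:
        raise ValueError("at least one covering read is required")
    # The base type with the largest accumulated credibility wins.
    return max(credibility, key=credibility.get)


if __name__ == "__main__":
    # Made-up observations: two reads report A, two report G.
    reads = [("A", 0.2), ("G", 0.8), ("G", 0.9), ("A", 0.5)]
    print(corrected_base(reads))
```

Summing per base type lets several moderately credible reads outweigh a single one; taking the single most credible observation instead would be an equally defensible reading of the text.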
According to this embodiment, based on the specific data patterns and characteristics that large-fragment insertions or deletions exhibit in genome sequencing data, the sequencing error accumulation coefficient is obtained by analyzing how differences manifest during sequence comparison. This enables analysis of the data comparison matching process and adjustment of the data comparison matching strategy. Compared with conventional processing, the method can further remove the influence of large-fragment insertion and deletion errors according to the comparison matching strategy, so that the matching result of the DNA sequencing data is more accurate.
As shown in fig. 6, another embodiment of the present invention provides a DNA sequencing data matching enhancement system, comprising: a DNA data extraction unit 100 for extracting DNA sequencing data; a sequencing error accumulation coefficient obtaining unit 200, configured to obtain a sequencing error accumulation coefficient of each base position on the reference genome according to the DNA sequencing data; the comparison matching strategy adjustment unit 300 is configured to adjust a comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient; and the data matching enhancement processing unit 400 is used for matching and enhancing the DNA sequencing data according to the comparison and matching strategy of the genome to be detected and the reference genome.
Specifically, as shown in fig. 7, the sequencing error accumulation coefficient acquisition unit 200 includes: a read coverage degree acquisition unit 201 for acquiring read coverage degree data of each base position on the reference genome based on the DNA sequencing data; a sequencing error accumulation coefficient obtaining subunit 202, configured to obtain a sequencing error accumulation coefficient of each base position on the reference genome according to the coverage degree data of the read.
More specifically, as shown in Fig. 8, the read coverage degree acquisition unit 201 includes: an error degree calculation unit 2011, configured to calculate, according to the DNA sequencing data, the error degree of the read in which the end point of each read at each base position on the reference genome is located; a read end point influence degree calculation unit 2012, configured to calculate the read end point influence degree of each base position on the reference genome according to those read error degrees; and a read coverage obtaining subunit 2013, configured to calculate the read coverage of each base position on the reference genome according to the read end point influence degree of each base position on the reference genome.
The respective data acquisition methods involved in the above-described read coverage acquisition unit 201 are as follows.
The calculation formula of the read error degree of the end point of each read of each base position on the reference genome is as follows:
;
Here, the formula involves the following quantities: the error degree of the read in which the end point of a given read at a given base position is located; the length of that read; and the number of base positions on that read whose comparison result is abnormal.
The error degree characterizes how strongly the base position is influenced by the matching result of that read. The longer the read and the fewer its abnormal bases, i.e., the more accurate its matching result, the more likely it is that the difference observed at this base position during the actual matching of multiple reads, relative to the remaining base positions, is an error caused by the read end points.
The calculation formula of the influence degree of the end point of the reading of each base position on the reference genome is as follows:
;
Here, the formula involves the following quantities: the read end point influence degree of a given base position; the number of reads whose end points are located at that base position; and a linear normalization function. The accumulated error degree of the reads that have the base position as an end point measures how strongly the base position is affected, within its local region, by serving as a read end point.
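The two quantities described so far can be sketched in Python. The formulas themselves are not reproduced in this text, so the following is a hedged illustration only. It assumes the read error degree is the quotient of read length and abnormal-base count (the ratio named in claim 3), with a `+ 1` in the denominator added here purely to avoid division by zero, and it assumes the linear normalization is min-max scaling. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def read_error_degree(read_length: int, abnormal_bases: int) -> float:
    """Ratio of read length to abnormal-base count.

    Assumption: the +1 guard is not from the source; it only keeps the
    ratio defined when a read has no abnormal bases.
    """
    return read_length / (abnormal_bases + 1)


def endpoint_influence(
    reads_per_position: Sequence[Sequence[Tuple[int, int]]]
) -> List[float]:
    """Accumulate error degrees per base position, then min-max normalize.

    reads_per_position[i] lists (read_length, abnormal_bases) for every
    read whose end point falls on base position i.
    """
    sums = [
        sum(read_error_degree(length, bad) for length, bad in reads)
        for reads in reads_per_position
    ]
    if not sums:
        return []
    lo, hi = min(sums), max(sums)
    if hi == lo:
        # All positions are equally affected; normalization is degenerate.
        return [0.0] * len(sums)
    return [(s - lo) / (hi - lo) for s in sums]


if __name__ == "__main__":
    # Made-up data: three base positions with 1, 2 and 0 read end points.
    data = [[(100, 4)], [(200, 1), (150, 2)], []]
    print(endpoint_influence(data))
```

A position where several long, accurately matched reads end receives a high influence value, which matches the reasoning that differences observed there are more likely caused by read end points.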
The calculation formula of the read coverage of each base position on the reference genome is as follows:
;
Here, the formula involves the following quantities: the read coverage at a given reference genome base position; the number of times that base position was sequenced; and the mean number of times sequenced over the reference genome base positions. A further term represents the significance of the number of times the base position was sequenced: the more times a position is sequenced, the more reliable the comparison and analysis of its reads, and the higher the confidence that the end point influence is exhibited at that position. This characterization can also be used in the subsequent analysis of the stepwise distribution of base comparison results.
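As a rough illustration of how read coverage could combine the end point influence with the sequencing-count significance, the sketch below assumes that significance is the sequencing count divided by the mean count and that the two factors are multiplied. Both choices are assumptions, since the formula is not reproduced in this text; the function name `read_coverage` is hypothetical.

```python
from typing import List, Sequence


def read_coverage(influence: Sequence[float],
                  times_sequenced: Sequence[int]) -> List[float]:
    """Combine end point influence with sequencing-count significance.

    Assumptions: significance = count / mean(count), and read coverage is
    the product of influence and significance.
    """
    if len(influence) != len(times_sequenced):
        raise ValueError("inputs must describe the same base positions")
    if not times_sequenced:
        return []
    mean_count = sum(times_sequenced) / len(times_sequenced)
    if mean_count == 0:
        # No position was sequenced, so no coverage can be attributed.
        return [0.0] * len(times_sequenced)
    return [
        infl * (count / mean_count)
        for infl, count in zip(influence, times_sequenced)
    ]


if __name__ == "__main__":
    # Made-up influence values and sequencing counts for three positions.
    print(read_coverage([0.5, 1.0, 0.0], [10, 30, 20]))
```

Under these assumptions, a position sequenced more often than average has its end point influence scaled up, reflecting the higher confidence described in the text.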
In the sequencing error accumulation coefficient obtaining subunit 202, the calculation formula of the sequencing error accumulation coefficient of each base position on the reference genome is as follows:
;
Here, the formula involves the following quantities: the sequencing error accumulation coefficient at the i-th reference genome base position; the read coverage at the i-th reference genome base position; the read coverage sequence of all base positions over the length of the read in which the i-th reference genome base position is located; the read coverage sequence of all base positions over the equal-length interval preceding that read; the sequence standard deviation operation; a normalization operation; and the preset number of neighbors a. In a specific example, the value of a is set to 5, and it may be set to other values according to the actual situation.
In the above formula, the term comparing the two intervals characterizes the degree to which the comparison process is affected by error accumulation in actual sequencing. When the error in the later of the two equal-sized intervals is larger than the error in the earlier one, error accumulation is present, which shows up as a larger value of this term, i.e., the error in the comparison process tends to increase. Furthermore, for the comparison result at a single reference genome base position, the greater the read coverage at that position, the greater the corresponding sequencing error accumulation coefficient, i.e., the more strongly the error accumulation effect at that position is related to the end point influence. For this reason, a parameter characterizing the end point influence is also incorporated in the above formula. In summary, the sequencing error accumulation coefficient can be used to adjust the comparison matching strategy.
As shown in the block diagram of Fig. 9, the comparison matching strategy adjustment unit 300 includes: a preset error threshold comparison unit 301, configured to compare, during comparison matching, the sequencing error accumulation coefficient of the current comparison base position of the reference genome with a preset error threshold; and a repositioning matching unit 302, configured to relocate and re-match the current comparison base position of the reference genome when its sequencing error accumulation coefficient is greater than the preset error threshold.
Based on the specific data patterns and characteristics that large-fragment insertions or deletions exhibit in genome sequencing data, the system obtains the sequencing error accumulation coefficient by analyzing how differences manifest during sequence comparison, which enables analysis of the data comparison matching process and adjustment of the data comparison matching strategy. Compared with conventional processing, the system can further remove the influence of large-fragment insertion and deletion errors according to the comparison matching strategy, so that the matching result of the DNA sequencing data is more accurate.
In addition, Fig. 10 gives a block diagram of the data matching enhancement processing unit 400 of the above DNA sequencing data matching enhancement system, which includes: an abnormal base confirming unit 401, configured to determine, according to the position in the reference genome that the comparison matching strategy assigns to the DNA read to be compared, the partial bases whose comparison matching result against the reference genome is abnormal; and an abnormal base correction unit 402, configured to acquire a plurality of DNA reads to be compared that cover the position of that DNA read, take a value inversely proportional to the sequencing error accumulation coefficient of the base position corresponding to the partial bases as the credibility of the comparison result, and select the base type with the maximum credibility as the corrected base type of the partial bases. An abnormal comparison matching result means that the base types at the same position are inconsistent. This unit corrects the abnormal base positions and thereby further improves the accuracy of the matching result of the DNA sequencing data.
It should be noted that the order in which the embodiments of the present invention are presented is for description only and does not indicate their relative merits. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
Claims (10)
1. A DNA sequencing data matching enhancement method, comprising:
S1, extracting DNA sequencing data;
S2, acquiring, according to the DNA sequencing data, a sequencing error accumulation coefficient of each base position on a reference genome;
S3, adjusting, according to the sequencing error accumulation coefficient, a comparison matching strategy of the genome to be tested and the reference genome;
S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome;
wherein step S2, acquiring the sequencing error accumulation coefficient according to the DNA sequencing data, comprises:
S201, acquiring read coverage data of each base position on the reference genome according to the DNA sequencing data;
S202, acquiring the sequencing error accumulation coefficient of each base position on the reference genome according to the read coverage data.
2. The DNA sequencing data matching enhancement method of claim 1, wherein:
In step S201, obtaining read coverage data of each base position on the reference genome according to the DNA sequencing data, including:
S2011, calculating the error degree of a read where the end point of each read of each base position on the reference genome is located according to the DNA sequencing data;
S2012, calculating the influence degree of the end points of the reads of the base positions on the reference genome according to the error degree of the reads of the end points of the reads of the base positions on the reference genome;
S2013, calculating the read coverage degree of each base position on the reference genome according to the influence degree of the read end point of each base position on the reference genome.
3. The DNA sequencing data matching enhancement method of claim 2, wherein:
The error degree of the read in which the end point of each read at each base position on the reference genome is located is determined by the ratio of the length of that read to the number of abnormal bases on that read.
4. The DNA sequencing data matching enhancement method of claim 3, wherein:
The read end point influence degree of each base position on the reference genome is determined by the accumulation of the read error degrees.
5. The DNA sequencing data matching enhancement method of claim 4, wherein:
The read coverage of each base position on the reference genome is determined by the read end point influence degree and the significance of the number of times the base position on the reference genome was sequenced.
6. The DNA sequencing data matching enhancement method of claim 5, wherein:
In step S202, the sequencing error accumulation coefficient of each base position on the reference genome is determined by the trend of increasing error during the comparison process and the read coverage at that position.
7. The DNA sequencing data matching enhancement method of claim 1, wherein:
Step S3, adjusting the comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient, comprises:
S301, comparing a sequencing error accumulation coefficient of the current comparison base position of the reference genome with a preset error threshold in the comparison matching process;
S302, if the sequencing error accumulation coefficient of the current comparison base position of the reference genome is larger than the preset error threshold, the comparison matching result of the current comparison base position is not credible, and the current comparison base position is relocated and matched.
8. The DNA sequencing data matching enhancement method of any one of claims 1 to 7, wherein:
In step S4, according to the comparison matching strategy of the genome to be detected and the reference genome, the matching enhancement processing for the DNA sequencing data includes:
S401, determining partial bases with abnormal base comparison matching results with the reference genome according to the corresponding positions of the DNA reads to be compared, which are determined in the comparison matching strategy, in the reference genome;
S402, acquiring a plurality of DNA reads to be compared that cover the position of the DNA read to be compared, taking a value inversely proportional to the sequencing error accumulation coefficient of the base position corresponding to the partial bases as the credibility of the comparison result, and selecting the base type with the maximum credibility as the corrected base type of the partial bases.
9. A DNA sequencing data matching enhancement system, comprising:
a DNA data extraction unit (100) for extracting DNA sequencing data;
a sequencing error accumulation coefficient acquisition unit (200) for acquiring a sequencing error accumulation coefficient of each base position on the reference genome according to the DNA sequencing data;
A comparison matching strategy adjustment unit (300) for adjusting a comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient;
a data matching enhancement processing unit (400) for matching enhancement processing of the DNA sequencing data according to a comparison matching strategy of the genome to be detected and a reference genome;
the sequencing error accumulation coefficient acquisition unit (200) includes:
A read coverage degree acquisition unit (201) for acquiring read coverage degree data of each base position on a reference genome according to the DNA sequencing data;
a sequencing error accumulation coefficient obtaining subunit (202) for obtaining sequencing error accumulation coefficients of each base position on the reference genome according to the read coverage degree data.
10. The DNA sequencing data matching enhancement system of claim 9, wherein:
The read coverage degree acquisition unit (201) includes:
an error degree calculation unit (2011) for calculating the error degree of the read where the end point of each read of each base position on the reference genome is located, based on the DNA sequencing data;
a read end point influence degree calculation unit (2012) for calculating the read end point influence degree of each base position on the reference genome based on the error degree of the read in which the end point of each read of each base position on the reference genome is located;
a read coverage obtaining subunit (2013) configured to calculate the read coverage of each base position on the reference genome according to the read end point influence degree of each base position on the reference genome; wherein the quantities involved are the error degree of the read in which the end point of each read at each base position is located, the read end point influence degree of each base position, and the read coverage at each reference genome base position.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410885945.XA CN118412041B (en) | 2024-07-03 | 2024-07-03 | DNA sequencing data matching enhancement method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118412041A (en) | 2024-07-30 |
CN118412041B (en) | 2024-09-13 |
Family
ID=91990222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410885945.XA Active CN118412041B (en) | 2024-07-03 | 2024-07-03 | DNA sequencing data matching enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118412041B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN118412041B (en) | 2024-09-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||