
CN118412041A - DNA sequencing data matching enhancement method and system - Google Patents


Info

Publication number
CN118412041A
Authority
CN
China
Prior art keywords
reference genome
read
base position
dna sequencing
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410885945.XA
Other languages
Chinese (zh)
Other versions
CN118412041B (en)
Inventor
袁林
赵羚
孙胜国
徐志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN202410885945.XA
Publication of CN118412041A
Application granted
Publication of CN118412041B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the field of electric digital data processing, and in particular to a DNA sequencing data matching enhancement method and system, a specific application of electric digital data processing technology. The method comprises: extracting DNA sequencing data; obtaining a sequencing error accumulation coefficient for each base position on a reference genome; adjusting the comparison matching strategy between the genome to be tested and the reference genome; and performing matching enhancement processing on the DNA sequencing data. The system comprises a DNA data extraction unit, a sequencing error accumulation coefficient acquisition unit, a comparison matching strategy adjustment unit, and a data matching enhancement processing unit. The sequencing error accumulation coefficient is obtained by analyzing the differences that arise during sequence comparison, which allows the comparison matching process to be analyzed and the comparison matching strategy to be adjusted. Compared with conventional processing, the method can further remove the influence of large-fragment insertion and deletion errors according to the comparison matching strategy, so that the matching result of the DNA sequencing data is more accurate.

Description

DNA sequencing data matching enhancement method and system
Technical Field
The invention relates to the field of electric digital data processing, in particular to a DNA sequencing data matching enhancement method and system.
Background
DNA data matching refers to the process of aligning DNA sequences to be matched with DNA sequences in a reference database to determine similarity and relatedness between them. DNA data matching is commonly used for research and application in identifying genotypes, disease genes, biological species, relatives, and the like.
The prior art generally employs third-generation single-molecule sequencing, which directly detects the base sequence of a single DNA molecule; signal interpretation typically determines the identity of each base from the optical signal detected by the sequencing instrument. The chemical reactions used during sequencing, such as primer binding and labeling, have a certain error rate, and errors in these steps may gradually accumulate in the sequencing result, increasing the error rate of the final sequencing data. In particular, during data matching, structural variations such as large-fragment insertions or deletions can make matching difficult.
Disclosure of Invention
In order to solve the technical problems, the invention aims to provide a DNA sequencing data matching enhancement method and a system, and the adopted technical scheme is as follows:
A DNA sequencing data match enhancement method comprising:
S1, extracting DNA sequencing data;
S2, obtaining, according to the DNA sequencing data, sequencing error accumulation coefficients of all base positions on a reference genome;
S3, adjusting, according to the sequencing error accumulation coefficient, a comparison matching strategy of the genome to be tested and the reference genome;
S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome;
Step S2, obtaining the sequencing error accumulation coefficient according to the DNA sequencing data, comprises the following steps:
S201, acquiring read coverage degree data of each base position on the reference genome according to the DNA sequencing data;
S202, acquiring, according to the read coverage degree data, sequencing error accumulation coefficients of each base position on the reference genome.
In the above method for enhancing matching of DNA sequencing data, in step S201, the obtaining of the read coverage degree data of each base position on the reference genome according to the DNA sequencing data includes:
S2011, calculating the error degree of a read where the end point of each read of each base position on the reference genome is located according to the DNA sequencing data;
S2012, calculating the influence degree of the end points of the reads of the base positions on the reference genome according to the error degree of the reads of the end points of the reads of the base positions on the reference genome;
S2013, calculating the read coverage degree of each base position on the reference genome according to the influence degree of the read end point of each base position on the reference genome. The quantities involved are: the error degree of the read containing the end point of the j-th read at the i-th base position; the read end-point influence degree of the i-th base position; and the read coverage degree at the i-th reference genome base position.
According to the DNA sequencing data matching enhancement method, the read error degree of the end points of each read at each base position on the reference genome is determined by the ratio of the length of the read to the number of abnormal bases on the read.
In the above method for enhancing matching of DNA sequencing data, the extent of influence of the end point of the read at each base position on the reference genome is determined by accumulation of the extent of error of the read.
In the above method for enhancing matching of DNA sequencing data, the read coverage degree of each base position on the reference genome is determined by the read end-point influence degree and the significance of the number of times the reference genome base position was sequenced.
In the above method for enhancing matching of DNA sequencing data, in step S202, the cumulative coefficient of sequencing errors at each base position on the reference genome is determined by the increasing trend of errors in the comparison process and the coverage degree of the read at that position.
In the above method for enhancing matching of DNA sequencing data, step S3, according to the sequencing error accumulation coefficient, adjusts a comparison matching policy of a genome to be tested and a reference genome, including:
S301, comparing a sequencing error accumulation coefficient of the current comparison base position of the reference genome with a preset error threshold in the comparison matching process;
S302, if the sequencing error accumulation coefficient of the current comparison base position of the reference genome is larger than the preset error threshold, the comparison matching result of the current comparison base position is not credible, and the current comparison base position is relocated and matched.
In the above method for enhancing matching of DNA sequencing data, in step S4, the processing for enhancing matching of DNA sequencing data according to the comparison matching strategy of the genome to be detected and the reference genome includes:
S401, determining partial bases with abnormal base comparison matching results with the reference genome according to the corresponding positions of the DNA reads to be compared, which are determined in the comparison matching strategy, in the reference genome;
S402, acquiring a plurality of DNA reads to be compared that cover the position of the DNA read to be compared, taking the inverse of the sequencing error accumulation coefficient of the base position corresponding to each of the partial bases as the credibility of the comparison result, and selecting the base type with the maximum credibility as the corrected base type of the partial bases.
The invention also provides a DNA sequencing data matching enhancement system, which comprises:
a DNA data extraction unit for extracting DNA sequencing data;
a sequencing error accumulation coefficient acquisition unit for obtaining the sequencing error accumulation coefficient of each base position on the reference genome according to the DNA sequencing data;
a comparison matching strategy adjustment unit for adjusting the comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient;
a data matching enhancement processing unit for performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome.
The sequencing error accumulation coefficient acquisition unit includes:
a read coverage degree acquisition unit for acquiring read coverage degree data of each base position on the reference genome according to the DNA sequencing data;
a sequencing error accumulation coefficient acquisition subunit for acquiring sequencing error accumulation coefficients of all base positions on the reference genome according to the read coverage degree data.
The DNA sequencing data matching enhancement system described above, the read coverage degree obtaining unit includes:
an error degree calculation unit for calculating, based on the DNA sequencing data, the error degree of the read where the end point of each read of each base position on the reference genome is located;
a read end point influence degree calculation unit for calculating the read end point influence degree of each base position on the reference genome according to the error degree of the read where the end point of each read of each base position on the reference genome is located;
a read coverage degree acquisition subunit for calculating the read coverage degree of each base position on the reference genome according to the read end point influence degree of each base position on the reference genome.
The invention has the following beneficial effects:
According to the invention, according to the specific data mode and characteristic of large fragment insertion or deletion in genome sequencing data, sequencing error accumulation coefficients are obtained by analyzing the difference in the sequence comparison process, the analysis of the data comparison matching process is realized, and the data comparison matching strategy is adjusted. Compared with the conventional processing mode, the method can further remove the influence of large fragment insertion deletion errors according to the comparison matching strategy, so that the matching result of the DNA sequencing data is more accurate.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for enhancing matching of DNA sequencing data according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for obtaining a sequencing error accumulation coefficient in step S2 of the embodiment shown in FIG. 1.
FIG. 3 is a flowchart of a method for acquiring the read coverage degree data in step S201 in the embodiment shown in FIG. 1.
FIG. 4 is a flowchart of a method for adjusting the comparison matching strategy in step S3 in the embodiment shown in FIG. 1.
FIG. 5 is a flow chart of a method of matching enhancement processing of DNA sequencing data in step S4 of the embodiment shown in FIG. 1.
FIG. 6 is a block diagram illustrating a DNA sequencing data matching enhancement system according to another embodiment of the present invention.
FIG. 7 is a block diagram of a sequencing error accumulation coefficient unit in the embodiment shown in FIG. 6.
FIG. 8 is a block diagram of the read coverage degree acquisition unit in FIG. 7.
FIG. 9 is a block diagram showing the structure of the comparison matching strategy adjustment unit in the embodiment shown in FIG. 6.
FIG. 10 is a block diagram of the data matching enhancement processing unit in the embodiment of FIG. 6.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention to achieve the preset purposes, the following detailed description refers to specific embodiments, structures, features and effects of a method and a system for enhancing matching of DNA sequencing data according to the present invention, which are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of a method and a system for enhancing matching of DNA sequencing data.
Referring to FIG. 1, a flowchart of a DNA sequencing data matching enhancement method according to an embodiment of the present invention is shown. The DNA sequencing data matching enhancement method comprises the following steps: S1, extracting DNA sequencing data; S2, obtaining, according to the DNA sequencing data, sequencing error accumulation coefficients of all base positions on a reference genome; S3, adjusting, according to the sequencing error accumulation coefficient, a comparison matching strategy of the genome to be tested and the reference genome; S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome.
In this embodiment, the DNA sequencing data of step S1 is obtained from a sequencing instrument. The specific implementation is as follows. First, DNA is extracted from a sample and subjected to appropriate treatment, such as purification, modification, and fragmentation, to ensure the accuracy and efficiency of the sequencing reaction. The processed DNA sample is loaded into a sequencing device, typically for single-molecule sequencing through a carrier such as a nanopore (Nanopore) or an SMRT Cell (PacBio). During sequencing, single DNA molecules pass through the nanopore or SMRT Cell one by one, and the sequencing instrument records the base sequence information of each molecule: it collects and records the signal changes of the single DNA molecule in the nanopore or SMRT Cell and determines the base sequence from those changes. The acquired base sequence data are then spliced and corrected to obtain complete DNA sequence information.
The DNA sequencing data is generally extracted using third-generation single-molecule sequencing, which directly detects the base sequence of a single DNA molecule; signal interpretation generally determines the identity of each base from the optical signal detected by the sequencing instrument. A significant advantage of this technique is its longer read length for gene fragments, but errors involving the insertion or deletion of large fragments also arise during alignment and matching of the sequencing data. Such errors change the length and structure of the sequence and can cause sequence mismatches during comparison, making it difficult for traditional alignment algorithms to find an accurate matching position. Variation of the DNA molecule itself can also present similar fragment-level matching abnormalities, which hinders accurate and rapid comparison and matching.
Large-fragment insertions or deletions manifest themselves in genome sequencing data as specific data patterns and features. One of these is that the sequencing depth of the inserted or deleted region may be lower or higher than the depth of the surrounding normal region, so abnormalities resulting from insertions or deletions can be identified by analyzing the differences that arise during sequence alignment. In particular, an inserted region may exhibit low depth because the newly added fragment has not been sufficiently sequenced, whereas a deleted region may exhibit high depth because the deletion causes the sequence of that region to be sequenced multiple times. In addition, when analyzing coverage during read matching, the effect of read end points needs to be removed.
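The depth pattern described above can be illustrated with a minimal Python sketch. It is not part of the disclosed method: the function names `per_base_depth` and `flag_depth_outliers`, the half-open interval representation of alignments, and the tolerance `tol` are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def per_base_depth(ref_len: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Count how many aligned reads cover each reference position.

    Each alignment is a half-open interval (start, end) on the reference,
    with 0 <= start < end <= ref_len (an assumed representation).
    """
    diff = [0] * (ref_len + 1)
    for start, end in alignments:
        diff[start] += 1   # coverage begins at start
        diff[end] -= 1     # coverage stops just before end
    depth, running = [], 0
    for i in range(ref_len):
        running += diff[i]
        depth.append(running)
    return depth


def flag_depth_outliers(depth: List[int], tol: float = 0.5) -> List[bool]:
    """Mark positions whose depth deviates from the mean by more than tol * mean.

    The tolerance is a hypothetical parameter, not a value from the source.
    """
    mean = sum(depth) / len(depth)
    return [abs(d - mean) > tol * mean for d in depth]


if __name__ == "__main__":
    depth = per_base_depth(6, [(0, 4), (2, 6), (2, 4)])
    print(depth)
    print(flag_depth_outliers(depth))
```

Positions flagged this way are only candidates; as the text notes, read end points also shift the depth and have to be accounted for separately.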
Therefore, the present embodiment aims to provide more accurate sequencing data by analyzing the above data and adjusting the comparison matching strategy, with the sequencing error accumulation coefficient as the analysis result. FIG. 2 shows a flowchart of step S2. In step S2, obtaining a sequencing error accumulation coefficient according to the DNA sequencing data includes: S201, acquiring read coverage degree data of each base position on a reference genome according to the DNA sequencing data; S202, acquiring, according to the read coverage degree data, sequencing error accumulation coefficients of each base position on the reference genome.
The specific flow of step S201 is shown in FIG. 3, and includes: S2011, calculating the error degree of the read where the end point of each read of each base position on the reference genome is located according to the DNA sequencing data; S2012, calculating the read end point influence degree of each base position on the reference genome according to the error degree of the read where the end point of each read of each base position on the reference genome is located; S2013, calculating the read coverage degree of each base position on the reference genome according to the read end point influence degree of each base position on the reference genome.
In one embodiment of the invention, a 30x subset of the original data (a subset at 30-fold sequencing depth) is used, and each sequencing read is treated as a one-dimensional vector, denoted the original vector. The position numbers on the corresponding reference genome are known; the correspondence is obtained by alignment with the BLAST algorithm. In particular, only the known portion of the reference genome is analyzed and the unknown portion is not considered, that is, the base positions on the reference genome refer only to the known portion.
The error degree of the read containing the end point of each read at each base position on the reference genome is determined by the ratio of the read length to the number of abnormal bases on the read. The specific calculation formula is as follows:
$$E_{i,j} = \frac{L_{i,j}}{N_{i,j}}$$
where $E_{i,j}$ denotes the error degree of the read containing the end point of the $j$-th read at the $i$-th base position; $L_{i,j}$ denotes the length of that read; and $N_{i,j}$ denotes the number of base positions on that read whose comparison result is abnormal.
Here, $E_{i,j}$ represents the degree to which the $i$-th base position is affected by the matching result of the $j$-th read. The longer the $j$-th read and the fewer its abnormal bases, that is, the more accurate its matching result, the more likely it is that the difference observed at the $i$-th base position relative to the other base positions, during the actual matching of multiple reads, is an error caused by the read end point.
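The ratio described above can be sketched in a few lines of Python. The function name `read_error_degree` is hypothetical, and the handling of a read with zero abnormal bases (treating the count as 1 to avoid division by zero) is an assumption, since the text does not say how that case is treated.

```python
def read_error_degree(read_length: int, abnormal_bases: int) -> float:
    """Ratio of read length to the number of abnormal bases on the read.

    Assumption: a read with no abnormal bases is treated as having one,
    so the ratio stays finite. The source does not specify this case.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return read_length / max(abnormal_bases, 1)


if __name__ == "__main__":
    # A long read with few abnormal bases scores higher than a short, noisy one.
    print(read_error_degree(10_000, 20))
    print(read_error_degree(2_000, 100))
```

Note that under this definition a larger value corresponds to a longer, cleaner read, which matches the interpretation given in the text.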
The read end-point influence degree of each base position on the reference genome is determined by accumulating the read error degrees. Based on the error degree of the reads containing the end points, the read end-point influence degree of each base position on the reference genome is calculated as follows:
$$D_i = \mathrm{Norm}\left(\sum_{j=1}^{n_i} E_{i,j}\right)$$
where $D_i$ denotes the read end-point influence degree of the $i$-th base position; $n_i$ denotes the number of reads whose end points are located at the $i$-th base position; $E_{i,j}$ denotes the error degree of the read containing the end point of the $j$-th such read; and $\mathrm{Norm}(\cdot)$ denotes a linear normalization function. Accumulating the read error degrees for which the $i$-th base position is a read end point characterizes how strongly that base position is affected, across all reads in its local region, by serving as a read end point.
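The accumulation and normalization step can be sketched as follows. This is an illustration only: the names `min_max_normalize` and `endpoint_influence` are hypothetical, min-max scaling is assumed as the linear normalization, and the input layout (one list of read error degrees per base position) is an assumed representation.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Linearly rescale values to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def endpoint_influence(error_degrees_by_position: Sequence[Sequence[float]]) -> List[float]:
    """Sum the error degrees of the reads ending at each position, then normalize.

    error_degrees_by_position[i] holds the error degrees of every read whose
    end point falls on base position i (an empty list if none do).
    """
    sums = [sum(errs) for errs in error_degrees_by_position]
    return min_max_normalize(sums)


if __name__ == "__main__":
    print(endpoint_influence([[100.0, 50.0], [], [25.0]]))
```

Normalizing across all positions keeps the influence degrees comparable regardless of how many reads happen to end at any one position.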
From a broader perspective, the read end-point influence degree of a base position characterizes the reliability of the difference between the base comparison result of the reads under test at that position and the corresponding position on the reference genome. From it, the read coverage degree of each base position on the reference genome can be obtained. Specifically, the read coverage degree is determined by the read end-point influence degree and the significance of the number of times the reference genome base position was sequenced, and the calculation formula is as follows:
[Formula not reproduced in this text.] The quantities in the formula are: the read coverage degree at the i-th reference genome base position; the number of times the i-th reference genome base position was sequenced; and the mean number of times sequenced over the reference genome base positions. The term formed from the latter two represents the significance of the number of times the i-th reference genome base position was sequenced: the more times a position is sequenced, the more reliable the comparison and analysis of its reads, and the higher the confidence that the endpoint effect is exhibited at that base position. This characterization can also be used for the subsequent analysis of the stepwise distribution of base alignment results.
The main factors affecting the coverage and variation data of sequencing data can be taken to be large-fragment insertion/deletion abnormalities and variation caused by genetic variation. The two are distributed differently in genome sequencing data. During DNA sequencing, a large-fragment insertion/deletion abnormality produces a stepwise difference in the coverage values of the comparison results, owing to read-length limits, and this stepwise difference is stable over a longer range. Genetic variation, by contrast, tends to produce more local, single-point differences.
From this analysis, the stepwise difference in coverage values caused by read-length limits typically appears as follows: some base comparison results are normal, while part of an inserted segment fails to match the position at all and produces matching errors. An abnormality then exists whose range of influence is large, which indicates a large-fragment insertion/deletion abnormality. The portion preceding the abnormality shows a gradually increasing trend of comparison errors, meaning that abnormal base comparison positions occur more and more frequently. Genetic variation affects a smaller span and shows no such gradual increase.
Consider any DNA read to be compared, and extract the base sequences at the reference genome positions in two adjacent intervals of equal size. The comparison process is affected by error accumulation in the actual sequencing process: the matching result of an earlier base position against the reference genome affects the matching results of subsequent positions, and an error in the earlier matching result reduces the reliability of the later ones. Therefore, in step S202, the sequencing error accumulation coefficient of each base position on the reference genome is determined by the increasing trend of errors during comparison and the read coverage degree at that position, and the calculation formula is as follows:
[Formula not reproduced in this text.] The quantities in the formula are: the sequencing error accumulation coefficient at the i-th reference genome base position; the read coverage degree at the i-th reference genome base position; the sequence of read coverage degrees of all base positions over the length of the read in which the i-th reference genome base position lies; the sequence of read coverage degrees of all base positions in the equal-length interval immediately preceding that read; a function that computes the standard deviation of a sequence; a normalization operation; and a preset number of neighbors, a. In this embodiment, a is set to 5, and it may also be set to other values according to the actual situation.
In the above formula, the term built from the two coverage sequences characterizes the degree to which the comparison process is affected by error accumulation in the actual sequencing process. When the error in the later of the two equal-sized intervals increases relative to the error in the earlier interval, error accumulation is present, which shows up as a larger value of this term, that is, an increasing error trend during comparison. Furthermore, for the comparison result at a single reference genome base position, the greater the read coverage degree at that position, the greater the corresponding sequencing error accumulation coefficient; that is, the error accumulation effect there is more strongly related to endpoint influence. For this reason, the parameter characterizing endpoint influence is also incorporated in the formula. In summary, the sequencing error accumulation coefficient can be used to adjust the comparison matching strategy.
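The two ingredients named above, the coverage sequence over an interval and the one over the equal-length interval before it, can be extracted and summarized as in the following Python sketch. It deliberately stops short of the coefficient itself, because the way the embodiment combines these values is not reproduced in this text. The function name `interval_spreads`, the indexing convention, and the choice of population standard deviation are assumptions.

```python
from statistics import pstdev
from typing import Sequence, Tuple


def interval_spreads(coverage: Sequence[float], start: int, length: int) -> Tuple[float, float]:
    """Return (spread of the preceding interval, spread of the current interval).

    coverage[k] is the read coverage degree at reference position k.
    The current interval is coverage[start:start + length]; the preceding
    interval is the equal-length slice immediately before it.
    """
    if start < length or start + length > len(coverage):
        raise ValueError("both intervals must lie inside the coverage sequence")
    previous = coverage[start - length:start]
    current = coverage[start:start + length]
    return pstdev(previous), pstdev(current)


if __name__ == "__main__":
    coverage = [1, 1, 1, 1, 0, 2, 0, 2]
    before, after = interval_spreads(coverage, start=4, length=4)
    # A larger spread in the later interval suggests a growing error trend.
    print(before, after)
```

Comparing the two spreads is one simple way to see whether variability grows from the earlier interval to the later one, which is the trend the text describes.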
As shown in FIG. 4, step S3, adjusting a comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient, includes: S301, comparing a sequencing error accumulation coefficient of the current comparison base position of the reference genome with a preset error threshold in the comparison matching process; S302, if the sequencing error accumulation coefficient of the current comparison base position of the reference genome is larger than the preset error threshold, the comparison matching result of the current comparison base position is not credible, and the current comparison base position is relocated and matched.
In one embodiment of the present invention, the preset error threshold is set to 0.91, and it may be set to other values according to the actual situation. The sequencing error accumulation coefficient of the current comparison base position of the reference genome is compared with the preset error threshold. When the coefficient of the current matching result is smaller than the threshold, the comparison result at that reference genome position can be considered reliable, and positioning and matching should continue until the comparison of the current read is completed. Together with step S302, this achieves the adjustment of the matching strategy.
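The threshold rule of steps S301 and S302 can be sketched as follows. The threshold value 0.91 comes from the embodiment; the function names, the string labels, and the treatment of a coefficient exactly equal to the threshold (here: continue) are assumptions, since the text only covers the strictly greater and strictly smaller cases.

```python
from typing import Optional, Sequence

ERROR_THRESHOLD = 0.91  # value given in the embodiment; adjustable


def alignment_action(coefficient: float, threshold: float = ERROR_THRESHOLD) -> str:
    """Decide what to do at the current comparison base position.

    Returns "relocate" when the coefficient exceeds the threshold (result not
    credible, S302) and "continue" otherwise. Equality is treated as
    "continue", which is an assumption.
    """
    return "relocate" if coefficient > threshold else "continue"


def first_untrusted_position(coefficients: Sequence[float],
                             threshold: float = ERROR_THRESHOLD) -> Optional[int]:
    """Index of the first position along a read that triggers relocation, if any."""
    for index, coefficient in enumerate(coefficients):
        if alignment_action(coefficient, threshold) == "relocate":
            return index
    return None


if __name__ == "__main__":
    print(first_untrusted_position([0.20, 0.50, 0.93, 0.10]))
```

How the relocation itself is performed is not specified here; the sketch only identifies where it would be triggered.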
The comparison matching strategy adjustment obtained in the above step is, in effect, an adjustment of the sequencing result that takes the error behavior of third-generation sequencing into account. The structure of the sequence can then be reconstructed according to the obtained sequencing error accumulation coefficient, for example by splicing or correction, to obtain more accurate sequence information. The flowchart shown in FIG. 5 takes correction as an example of the enhancement processing. In step S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be tested and the reference genome from step S3 includes: S401, determining partial bases with abnormal base comparison matching results with the reference genome according to the corresponding positions of the DNA reads to be compared, which are determined in the comparison matching strategy, in the reference genome; S402, acquiring a plurality of DNA reads to be compared that cover the position of the DNA read to be compared, taking the inverse of the sequencing error accumulation coefficient of the base position corresponding to each of the partial bases as the credibility of the comparison result, and selecting the base type with the maximum credibility as the corrected base type of the partial bases.
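The selection in step S402 can be illustrated with a short Python sketch. Several points are assumptions rather than details from the source: the credibility is taken as the reciprocal of the coefficient, credibilities of covering reads that report the same base type are summed, each observation carries its own coefficient, and a small `eps` guards against division by zero. The function name `corrected_base` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def corrected_base(observations: Iterable[Tuple[str, float]], eps: float = 1e-9) -> str:
    """Pick the base type with the highest total credibility.

    observations: (base, coefficient) pairs, one per covering read, where
    coefficient is the sequencing error accumulation coefficient associated
    with that read's base at the position being corrected.
    """
    credibility = defaultdict(float)
    for base, coefficient in observations:
        # Lower accumulated error means higher credibility (assumed reciprocal).
        credibility[base] += 1.0 / max(coefficient, eps)
    if not credibility:
        raise ValueError("at least one observation is required")
    return max(credibility, key=credibility.get)


if __name__ == "__main__":
    # Two low-credibility votes for A are outweighed by one high-credibility G.
    print(corrected_base([("A", 0.9), ("G", 0.2), ("A", 0.8)]))
```

Summing per base type lets several moderately credible reads outvote a single one; taking only the single most credible observation would be an equally plausible reading of the step.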
According to the embodiment, according to the specific data mode and characteristic which can be shown in genome sequencing data by large fragment insertion or deletion, sequencing error accumulation coefficients are obtained by analyzing the difference expression in the sequence comparison process, so that the analysis of the data comparison matching process is realized, and the data comparison matching strategy is adjusted. Compared with the conventional processing mode, the method can further remove the influence of large fragment insertion deletion errors according to the comparison matching strategy, so that the matching result of the DNA sequencing data is more accurate.
As shown in fig. 6, another embodiment of the present invention provides a DNA sequencing data matching enhancement system, comprising: a DNA data extraction unit 100 for extracting DNA sequencing data; a sequencing error accumulation coefficient obtaining unit 200, configured to obtain a sequencing error accumulation coefficient of each base position on the reference genome according to the DNA sequencing data; the comparison matching strategy adjustment unit 300 is configured to adjust a comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient; and the data matching enhancement processing unit 400 is used for matching and enhancing the DNA sequencing data according to the comparison and matching strategy of the genome to be detected and the reference genome.
Specifically, as shown in fig. 7, the sequencing error accumulation coefficient acquisition unit 200 includes: a read coverage degree acquisition unit 201 for acquiring read coverage degree data of each base position on the reference genome based on the DNA sequencing data; a sequencing error accumulation coefficient obtaining subunit 202, configured to obtain a sequencing error accumulation coefficient of each base position on the reference genome according to the coverage degree data of the read.
More specifically, as shown in fig. 8, the read coverage degree acquisition unit 201 includes: an error degree calculating unit 2011 for calculating, according to the DNA sequencing data, the error degree of the read containing each read end point at each base position on the reference genome; a read end point influence degree calculation unit 2012 for calculating the read end point influence degree of each base position on the reference genome according to those error degrees; and a read coverage obtaining subunit 2013 for calculating the read coverage of each base position on the reference genome according to the read end point influence degree of each base position on the reference genome.
The respective data acquisition methods involved in the above-described read coverage acquisition unit 201 are as follows.
The error degree of the read containing each read end point at each base position on the reference genome is determined by the ratio of the length of the read to the number of abnormal bases on the read. The quantities involved are: the error degree of the read containing a given read end point at the i-th base position; the length of that read; and the number of base positions on that read whose comparison is abnormal.
The error degree represents the extent to which the i-th base position is affected by the matching result of that read. The longer the read and the fewer abnormal bases it contains, i.e., the more accurate its matching result, the more likely it is that the difference observed at the i-th base position, relative to the remaining base positions during the actual matching of multiple reads, is an error caused by the read end point.
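The ratio described above can be sketched in Python. The function name `read_error_degree` is hypothetical, and the guard that treats zero abnormal bases as one is an assumption added to avoid division by zero; the patent text does not say how that case is handled.

```python
def read_error_degree(read_length: int, abnormal_bases: int) -> float:
    """Ratio of read length to the number of abnormally compared bases.

    A long read with few abnormal bases yields a large value, matching the
    description that a more accurate read makes an end-point effect more
    plausible. Zero abnormal bases is clamped to one (assumption).
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if abnormal_bases < 0:
        raise ValueError("abnormal_bases must be non-negative")
    return read_length / max(abnormal_bases, 1)


if __name__ == "__main__":
    print(read_error_degree(1000, 10))  # 100.0
    print(read_error_degree(500, 0))    # 500.0
```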
The read end point influence degree of each base position on the reference genome is determined by the accumulation of read error degrees. The quantities involved are: the read end point influence degree of the i-th base position; the number of reads whose end points fall at the i-th base position; and a linear normalization function. The accumulated error degree of the reads whose end points fall at the i-th base position characterizes the extent to which that base position, as a read end point, is affected within its local region.
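A minimal Python sketch of this accumulation follows. It assumes that "accumulation" means a plain sum over the reads ending at a position and that the linear normalization is min-max scaling across all positions; both are assumptions, and the function name `endpoint_influence` is hypothetical.

```python
from typing import List, Sequence


def endpoint_influence(
    error_degrees_by_position: Sequence[Sequence[float]],
) -> List[float]:
    """For each base position, sum the error degrees of the reads whose end
    points fall there, then min-max normalize the sums to [0, 1]."""
    totals = [sum(degrees) for degrees in error_degrees_by_position]
    if not totals:
        return []
    lo, hi = min(totals), max(totals)
    if hi == lo:
        # All positions are equally affected; no contrast to normalize.
        return [0.0 for _ in totals]
    return [(t - lo) / (hi - lo) for t in totals]


if __name__ == "__main__":
    # Position 0 has two read end points, position 1 none, position 2 one.
    print(endpoint_influence([[100.0, 50.0], [], [300.0]]))  # [0.5, 0.0, 1.0]
```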
The read coverage of each base position on the reference genome is calculated from the following quantities: the read coverage at the i-th reference genome base position; the number of times the i-th reference genome base position was sequenced; and the mean number of times the base positions of the reference genome were sequenced. The term built from the last two quantities represents the significance of the number of times the i-th reference genome base position was sequenced: the more times a position is sequenced, the more reliable the comparison and analysis of its reads, and the higher the confidence that an end point effect observed at the i-th reference genome base position is real. This characterization can also be used for subsequent analysis of the stepwise distribution of base comparison results.
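The combination of end point influence and sequencing-depth significance can be sketched in Python as follows. The exact formula is not recoverable from this text, so the sketch assumes that significance is the ratio of a position's sequencing count to the mean count and that read coverage is the product of significance and end point influence. The function name `read_coverage` and both modelling choices are hypothetical.

```python
from typing import List, Sequence


def read_coverage(
    influence: Sequence[float],
    depth: Sequence[int],
) -> List[float]:
    """Weight each position's end point influence by the significance of its
    sequencing count, taken here as depth / mean depth (assumption)."""
    if len(influence) != len(depth):
        raise ValueError("influence and depth must have the same length")
    if not depth:
        return []
    mean_depth = sum(depth) / len(depth)
    if mean_depth == 0:
        return [0.0 for _ in depth]
    return [f * d / mean_depth for f, d in zip(influence, depth)]


if __name__ == "__main__":
    # Mean depth is 20, so the significances are 0.5, 1.0 and 1.5.
    print(read_coverage([0.5, 0.0, 1.0], [10, 20, 30]))  # [0.25, 0.0, 1.5]
```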
In the sequencing error accumulation coefficient obtaining subunit 202, the sequencing error accumulation coefficient of each base position on the reference genome is calculated from the following quantities:
the sequencing error accumulation coefficient at the i-th reference genome base position; the read coverage at the i-th reference genome base position; the sequence of read coverage values of all base positions over the length of the read containing the i-th reference genome base position; the sequence of read coverage values of all base positions in the equal-length interval preceding the read containing the i-th reference genome base position; the standard deviation of a sequence; a normalization operation; and the preset number of neighborhoods a, which is set to 5 in a specific example and may be set to other values according to the actual situation.
In this formula, the term comparing the two intervals characterizes how strongly the comparison process is affected by error accumulation in the actual sequencing process. When the error in the later interval increases relative to the error in the preceding interval of equal size, error accumulation is changing, which shows up as a larger value of this term, i.e., the error in the comparison process tends to increase. Furthermore, for the comparison result at a single reference genome base position, the greater the read coverage at that position, the greater the corresponding sequencing error accumulation coefficient, i.e., the more the error accumulation effect there is related to end point influence. The parameter characterizing end point influence is therefore also incorporated into the formula. In summary, the sequencing error accumulation coefficient can be used to adjust the comparison matching strategy.
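The two monotonic properties described above can be illustrated with a Python sketch. The exact formula is not recoverable from this text, so everything here is an assumption: the trend term is modelled as the ratio of the standard deviation of the current interval to that of the preceding interval, it is multiplied by the read coverage of the position, and the normalization is min-max scaling. The preset neighborhood count a is not modelled. The names `raw_accumulation` and `min_max` are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence


def raw_accumulation(
    current: Sequence[float],
    previous: Sequence[float],
    coverage: float,
    eps: float = 1e-9,
) -> float:
    """Unnormalized coefficient for one base position.

    `current` holds read coverage values over the read containing the
    position; `previous` holds those of the preceding equal-length interval.
    A noisier current interval or a larger coverage raises the value.
    """
    trend = pstdev(current) / (pstdev(previous) + eps)
    return trend * coverage


def min_max(values: Sequence[float]) -> List[float]:
    """Min-max normalize the raw values of all positions to [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    calm = [0.4, 0.5, 0.4, 0.5]
    noisy = [0.1, 0.9, 0.2, 0.8]
    raws = [
        raw_accumulation(calm, calm, 0.5),
        raw_accumulation(noisy, calm, 0.5),
    ]
    print(min_max(raws))  # [0.0, 1.0]
```

The normalized values are what would be compared against the preset error threshold during comparison matching.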
As shown in the block diagram of fig. 9, the comparison matching strategy adjustment unit 300 includes: a preset error threshold comparison unit 301 for comparing, during comparison matching, the sequencing error accumulation coefficient of the currently compared base position of the reference genome with a preset error threshold; and a repositioning matching unit 302 for repositioning and re-matching the currently compared base position of the reference genome when its sequencing error accumulation coefficient is greater than the preset error threshold.
Based on the specific data patterns and characteristics that large-fragment insertions or deletions produce in genome sequencing data, the system obtains sequencing error accumulation coefficients by analyzing how differences manifest during sequence comparison, which enables analysis of the data comparison matching process and adjustment of the comparison matching strategy. Compared with conventional processing, the influence of large-fragment insertion and deletion errors can be further removed according to the comparison matching strategy, so that the matching result of the DNA sequencing data is more accurate.
In addition, fig. 10 gives a block diagram of the data matching enhancement processing unit 400 of the above DNA sequencing data matching enhancement system, which includes: an abnormal base confirming unit 401 for determining, according to the corresponding positions in the reference genome of the DNA reads to be compared as determined in the comparison matching strategy, the partial bases whose comparison matching results with the reference genome are abnormal; and an abnormal base correction unit 402 for obtaining a plurality of DNA reads to be compared that cover the position of the DNA read to be compared, taking the inverse of the sequencing error accumulation coefficient of the base position corresponding to the partial bases as the credibility of the comparison result, and selecting the base type with the highest credibility as the corrected base type of the partial bases. An abnormal comparison matching result means that the base types at the same position are inconsistent. This unit corrects the abnormal base positions and thereby further improves the accuracy of the matching result of the DNA sequencing data.
It should be noted that the order in which the embodiments of the present invention are presented is for description only and does not indicate the relative merits of the embodiments. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner. For parts that are identical or similar across embodiments, reference may be made from one embodiment to another, and each embodiment mainly describes its differences from the others.

Claims (10)

1. A method for matching enhancement of DNA sequencing data, comprising:
S1, extracting DNA sequencing data;
S2, obtaining sequencing error accumulation coefficients of all base positions on a reference genome according to the DNA sequencing data;
S3, according to the sequencing error accumulation coefficient, adjusting a comparison matching strategy of the genome to be tested and the reference genome;
S4, performing matching enhancement processing on the DNA sequencing data according to the comparison matching strategy of the genome to be detected and the reference genome;
wherein in step S2, obtaining the sequencing error accumulation coefficients according to the DNA sequencing data comprises the following steps:
S201, acquiring read coverage degree data of each base position on a reference genome according to the DNA sequencing data;
S202, according to the read coverage degree data, acquiring sequencing error accumulation coefficients of each base position on a reference genome.
2. The DNA sequencing data matching enhancement method of claim 1, wherein:
In step S201, obtaining read coverage data of each base position on the reference genome according to the DNA sequencing data, including:
S2011, calculating the error degree of a read where the end point of each read of each base position on the reference genome is located according to the DNA sequencing data;
S2012, calculating the influence degree of the end points of the reads of the base positions on the reference genome according to the error degree of the reads of the end points of the reads of the base positions on the reference genome;
S2013, calculating the read coverage degree of each base position on the reference genome according to the influence degree of the read end point of each base position on the reference genome.
3. The DNA sequencing data matching enhancement method of claim 2, wherein:
The error degree of the read containing each read end point at each base position on the reference genome, i.e., the error degree of the read containing a given read end point at the i-th base position, is determined by the ratio of the length of the read to the number of abnormal bases on the read.
4. The DNA sequencing data matching enhancement method of claim 3, wherein:
the read end point influence degree of each base position on the reference genome is determined by the accumulation of the read error degrees.
5. The DNA sequencing data matching enhancement method of claim 4, wherein:
the degree of read coverage for each base position on the reference genome is determined by the read end point influence degree and the significance of the number of times the base position on the reference genome was sequenced.
6. The DNA sequencing data matching enhancement method of claim 5, wherein:
In step S202, the sequencing error accumulation coefficient at each base position on the reference genome is determined by the increasing trend of the error during comparison and the read coverage degree at that position.
7. The DNA sequencing data matching enhancement method of claim 1, wherein:
In step S3, adjusting the comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient comprises the following steps:
S301, comparing a sequencing error accumulation coefficient of the current comparison base position of the reference genome with a preset error threshold in the comparison matching process;
S302, if the sequencing error accumulation coefficient of the current comparison base position of the reference genome is larger than the preset error threshold, the comparison matching result of the current comparison base position is not credible, and the current comparison base position is relocated and matched.
8. The DNA sequencing data matching enhancement method of any one of claims 1 to 7, wherein:
In step S4, according to the comparison matching strategy of the genome to be detected and the reference genome, the matching enhancement processing for the DNA sequencing data includes:
S401, determining partial bases with abnormal base comparison matching results with the reference genome according to the corresponding positions of the DNA reads to be compared, which are determined in the comparison matching strategy, in the reference genome;
S402, acquiring a plurality of DNA reads to be compared, which are covered with the positions of the DNA reads to be compared, taking the inverse ratio of the sequencing error accumulation coefficient of the base positions corresponding to the partial bases as the credibility of the comparison result, and selecting the base type with the maximum credibility as the corrected base type of the partial bases.
9. A DNA sequencing data matching enhancement system, comprising:
a DNA data extraction unit (100) for extracting DNA sequencing data;
a sequencing error accumulation coefficient acquisition unit (200) for acquiring a sequencing error accumulation coefficient of each base position on the reference genome according to the DNA sequencing data;
A comparison matching strategy adjustment unit (300) for adjusting a comparison matching strategy of the genome to be tested and the reference genome according to the sequencing error accumulation coefficient;
a data matching enhancement processing unit (400) for matching enhancement processing of the DNA sequencing data according to a comparison matching strategy of the genome to be detected and a reference genome;
the sequencing error accumulation coefficient acquisition unit (200) includes:
A read coverage degree acquisition unit (201) for acquiring read coverage degree data of each base position on a reference genome according to the DNA sequencing data;
a sequencing error accumulation coefficient obtaining subunit (202) for obtaining sequencing error accumulation coefficients of each base position on the reference genome according to the read coverage degree data.
10. The DNA sequencing data matching enhancement system of claim 9, wherein:
The read coverage degree acquisition unit (201) includes:
an error degree calculation unit (2011) for calculating the error degree of the read where the end point of each read of each base position on the reference genome is located, based on the DNA sequencing data;
a read end point influence degree calculation unit (2012) for calculating the read end point influence degree of each base position on the reference genome based on the error degree of the read in which the end point of each read of each base position on the reference genome is located;
a read coverage obtaining subunit (2013) configured to calculate a read coverage of each base position on the reference genome according to a read end point influence degree of each base position on the reference genome; wherein the error degree is that of the read containing a given read end point at the i-th base position, the read end point influence degree is that of the i-th base position, and the read coverage is that at the i-th reference genome base position.
CN202410885945.XA 2024-07-03 2024-07-03 DNA sequencing data matching enhancement method and system Active CN118412041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410885945.XA CN118412041B (en) 2024-07-03 2024-07-03 DNA sequencing data matching enhancement method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410885945.XA CN118412041B (en) 2024-07-03 2024-07-03 DNA sequencing data matching enhancement method and system

Publications (2)

Publication Number Publication Date
CN118412041A (en) 2024-07-30
CN118412041B (en) 2024-09-13

Family

ID=91990222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410885945.XA Active CN118412041B (en) 2024-07-03 2024-07-03 DNA sequencing data matching enhancement method and system

Country Status (1)

Country Link
CN (1) CN118412041B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6090550A (en) * 1994-12-23 2000-07-18 Imperial College Of Science, Technology And Medicine Automated DNA sequencing comparing predicted and actual measurements
US20130311105A1 (en) * 2012-05-18 2013-11-21 454 Life Sciences Corporation System And Method For Generation And Use Of Optimal Nucleotide Flow Orders
CA2873146A1 (en) * 2012-05-18 2013-11-21 F. Hoffmann-La Roche Ag System and method for generation and use of optimal nucleotide flow orders
US20140379270A1 (en) * 2013-06-19 2014-12-25 Samsung Sds Co., Ltd. System and method for aligning genome sequence considering mismatch
CN107075730A (en) * 2014-09-12 2017-08-18 利兰·斯坦福青年大学托管委员会 The identification of circle nucleic acid and purposes
CN110785814A (en) * 2018-01-05 2020-02-11 因美纳有限公司 Predicting quality of sequencing results using deep neural networks
CN113278611A (en) * 2021-03-07 2021-08-20 华中科技大学同济医学院附属协和医院 Capture sequencing probes and uses thereof
CN117115468A (en) * 2023-10-19 2023-11-24 齐鲁工业大学(山东省科学院) Image recognition method and system based on artificial intelligence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ATSUFUMI OHTA et al.: "Using nanopore sequencing to identify fungi from clinical samples with high phylogenetic resolution", SCIENTIFIC REPORTS, vol. 13, no. 9, 16 June 2023 (2023-06-16) *
HENG LI et al.: "Fast and accurate long-read alignment with Burrows-Wheeler transform", BIOINFORMATICS, vol. 26, no. 5, 15 January 2010 (2010-01-15), pages 589, XP055700149, DOI: 10.1093/bioinformatics/btp698 *
杨金晶: "Research on genome variation detection methods based on hybrid sequencing", China Master's Theses Full-text Database, 15 June 2020 (2020-06-15) *

Also Published As

Publication number Publication date
CN118412041B (en) 2024-09-13

Similar Documents

Publication Publication Date Title
CN114999573B (en) Genome variation detection method and detection system
CN111081315B (en) Homologous pseudogene mutation detection method
CN113035272B (en) Method and device for obtaining immunotherapeutic new antigen based on intein cell variation
CN109949861B (en) Tumor mutation load detection method, device and storage medium
CN111341383B (en) Method, device and storage medium for detecting copy number variation
CN111718982A (en) Tumor tissue single sample somatic mutation detection method and device
CN108647495B (en) Identity relationship identification method, device, equipment and storage medium
CN108595915A (en) A kind of three generations's data correcting method based on DNA variation detections
CN111321209A (en) Method for double-end correction of circulating tumor DNA sequencing data
WO2001016861A2 (en) Method and apparatus for analyzing nucleic acid sequences
CN118412041B (en) DNA sequencing data matching enhancement method and system
CN112735517A (en) Method, device and storage medium for detecting joint deletion of chromosomes
CN115064209A (en) Malignant cell identification method and system
US20190078155A1 (en) Method for determining nucleotide sequence
KR101163425B1 (en) Individual discrimination method and apparatus
CN112863594A (en) Tumor purity estimation method and device
CN116189763A (en) Single sample copy number variation detection method based on second generation sequencing
Scheetz et al. ESTprep: preprocessing cDNA sequence reads
CN109390034B (en) Method for detecting normal tissue content and tumor copy number in tumor tissue
WO2016176846A1 (en) Reagent kit, apparatus, and method for detecting chromosome aneuploidy
EP3552127B1 (en) Methods for detecting variants in next-generation sequencing genomic data
US20040009521A1 (en) Methods of detecting DNA variation in sequence data
Isakov et al. Deep sequencing data analysis: challenges and solutions
CN113593629A (en) Method for reducing non-invasive prenatal detection false positive and false negative based on semiconductor sequencing
CN113528648A (en) Method for judging aging degree based on gene mutation and DNA methylation characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant