CN117437978A - Low-frequency gene mutation analysis method and device for second-generation sequencing data and application thereof
- Publication number
- CN117437978A (application CN202311696182.6A)
- Authority
- CN
- China
- Prior art keywords
- sequencing
- sequence
- base
- sequences
- family
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Abstract
The invention discloses a low-frequency gene mutation analysis method and device for second-generation sequencing data and application thereof, and particularly relates to an analysis method and device for molecular tag sequencing data of the IonTorrent sequencing platform and application thereof. The invention provides a new molecular tag sequencing data analysis method suitable for multiple sequencing platforms, in particular IonTorrent. The method determines the position of the barcode by searching for and locating the barcode sequence within the reads rather than by searching for a fixed sequence, and, during consensus analysis, performs multiple sequence alignment of the reads within each family and corrects the base sequencing quality values. In addition to eliminating erroneous base substitutions introduced during sequencing, it eliminates erroneous insertions and deletions, so that low-frequency SNV and INDEL mutations can be detected accurately at the same time.
Description
Technical Field
The invention belongs to the technical field of bioinformatics, relates to a low-frequency gene mutation analysis method and device of second-generation sequencing data and application thereof, and particularly relates to an analysis method and device of molecular tag sequencing data for an IonTorrent sequencing platform and application thereof.
Background
In the study and application of clinical precision medicine, low frequency (< 1%) somatic mutations including point mutations, insertions and deletions of genes have been a hotspot of interest.
NGS technology is widely used to detect genetic variations. However, because of limitations of the technology itself, NGS introduces erroneous base information during sequencing, so that the target mutation sites to be detected are masked by noise and cannot be detected correctly.
The Ion Torrent sequencer is the first commercial sequencer that does not need an optical system. It uses semiconductor sequencing technology, in which a semiconductor chip converts chemical signals directly into digital signals; it is an economical, rapid, simple and scalable sequencing technology and is well suited to amplicon sequencing. Because of its short sequencing time and inexpensive instrumentation, it is widely used. However, the sensor does not detect runs of identical consecutive bases perfectly, so the number of bases in such a run may be miscounted, and the target mutation sites to be detected may be masked by noise and not detected correctly.
The Illumina sequencing platform is technically mature, but it still has a sequencing error rate of 0.1%-1%, which manifests mainly as base substitutions and an AT base bias.
To improve the detection accuracy for low-frequency mutations, molecular tag technology can be used to increase detection sensitivity. When the sequencing library is constructed, a random 6 bp sequence, called a barcode, is attached to each end of the molecule to be amplified. The barcode is amplified and sequenced together with the attached molecule in the downstream sequencing process. Reads with identical barcodes belong to the same family and can be considered to have been amplified from the same original molecule. In theory the reads in one family should be identical, so by merging all reads of a family into one consensus read through consensus analysis, base sequencing errors and redundancy introduced during sequencing can be eliminated.
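As a minimal sketch of the family idea described above, the following Python groups reads by their barcode pair and keeps only families with enough members. The function name `group_families`, the `(barcode, sequence)` input layout, the joined "left-right" barcode key and the example reads are illustrative assumptions, not taken from the patent; the default threshold of 3 mirrors the family-size filter used in the method.

```python
from collections import defaultdict


def group_families(reads, min_size=3):
    """Group (barcode, sequence) pairs into families keyed by barcode.

    Families with fewer than `min_size` reads are dropped.
    The input layout is a hypothetical one chosen for illustration.
    """
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)
    return {bc: seqs for bc, seqs in families.items() if len(seqs) >= min_size}


# Hypothetical reads: barcode key = left 6 bp tag + "-" + right 6 bp tag.
reads = [
    ("ACGTAC-TTGACA", "GATTACA"),
    ("ACGTAC-TTGACA", "GATTACA"),
    ("ACGTAC-TTGACA", "GATTTACA"),
    ("GGCATA-CCTAGT", "GATTACA"),  # singleton family, filtered out
]
families = group_families(reads)
```

Only the first barcode survives here, because the second family has a single read and cannot support a consensus.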
Bioinformatics software for analyzing molecular tag sequencing data includes UMI-tools, fgbio, samtools, smCounter and Connor. UMI-tools, fgbio, smCounter and Connor are better suited to data generated by the Illumina platform than to data from the Ion Torrent platform. The consensus module of samtools can process sequencing data from multiple platforms, including Illumina and Ion Torrent, and performs well in detecting SNVs (single nucleotide variants); however, samtools refers to the reference genome when merging reads and removes the insertions and deletions originally present in the reads, so INDELs cannot be detected.
In conclusion, the development of the data analysis method suitable for the multiple sequencing platforms has important significance in the field of gene variation detection.
Disclosure of Invention
Aiming at the defects and actual demands of the prior art, the invention provides a low-frequency gene mutation analysis method and device for second-generation sequencing data and application thereof, in particular to an analysis method and device for molecular tag sequencing data of an IonTorrent sequencing platform and application thereof, and can accurately detect low-frequency SNV and INDEL mutations at the same time.
In order to achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for low frequency gene mutation analysis of second generation sequencing data, the method comprising the steps of:
(1) Taking the sequence of the target gene to be detected as input, and building a blastn database;
(2) Converting the sequencing fastq file into a fasta format file;
(3) Aligning the sequencing fasta sequences (i.e. reads) to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing sequence, and extracting the molecular tag sequence information;
(4) Classifying sequencing sequences with the same molecular tag into the same family sequence (i.e. family), and filtering out family sequences that contain fewer than 3 sequencing sequences;
(5) Performing multiple sequence alignment on all sequencing sequences in each family sequence, introducing null values (i.e. gaps), counting the occurrences of A, T, C, G and the null value across all sequencing sequences of the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q, and converting the result into the corresponding ASCII character, which is the corrected sequencing quality (Phred33) value for each base in the fastq file;
$Q = -10 \log_{10}(P)$    formula (1)
Wherein Q represents the base sequencing quality, and P represents the base sequencing error probability;
(6) Merging all sequencing sequences in the family sequence into one sequencing sequence: at each position, the base with the highest count is taken as the base of the consensus sequence; if the null value has the highest count at a position, a sequencing insertion error is judged to exist there and the position is removed from the consensus sequence; if several bases have the same count, the base with the highest corrected base sequencing quality is taken; a consensus fastq file is thereby obtained.
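The merging rule of step (6) can be sketched as follows, assuming the reads of one family have already been aligned to equal length with "-" as the null value. The function name `consensus`, the per-read quality lists and the tie-break (the best single-read quality supporting each tied base) are assumptions made for illustration; the patent does not spell out how qualities are compared in a tie, nor what happens when the null value ties with a base (this sketch then keeps the base).

```python
from collections import Counter

GAP = "-"


def consensus(aligned, quals):
    """Collapse one aligned family into a consensus string.

    aligned: equal-length gapped reads from one family.
    quals:   per-read lists of integer qualities, one per column
             (the value is ignored where the read has a gap).
    """
    out = []
    for i, column in enumerate(zip(*aligned)):
        counts = Counter(column)
        top = max(counts.values())
        winners = [b for b in counts if counts[b] == top]
        if winners == [GAP]:
            # The gap is the single most frequent symbol:
            # treat the column as an insertion error and drop it.
            continue
        candidates = [b for b in winners if b != GAP]
        # Tie-break (assumption): highest quality among reads carrying the base.
        best = max(
            candidates,
            key=lambda b: max(q[i] for r, q in zip(aligned, quals) if r[i] == b),
        )
        out.append(best)
    return "".join(out)


family = ["GATT-ACA", "GATT-ACA", "GATTTACA", "GATT-AGA"]
family_quals = [[30] * 8 for _ in family]
result = consensus(family, family_quals)  # "GATTACA"
```

In this toy family the extra T in the third read is outvoted by three gaps and removed, and the single G in the last read is outvoted by three C.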
The invention provides a new molecular tag sequencing data analysis method suitable for multiple sequencing platforms, in particular IonTorrent. The position of the barcode is determined by locating the barcode sequence in the reads through blast rather than by searching for a fixed sequence, and during consensus analysis the reads within each family undergo multiple sequence alignment and the base sequencing quality values are corrected. Besides eliminating erroneous base substitutions introduced during sequencing, erroneous insertions and deletions can also be eliminated. The method is compatible with data from the mainstream second-generation sequencing platforms Illumina and IonTorrent, performs particularly well on IonTorrent data, and can accurately detect low-frequency SNV and INDEL mutations at the same time.
In the invention, blastn is used to align each read to the genome to determine the position of the barcode on that read, instead of extracting the barcode only by searching for the fixed sequences at the two ends of the read. This avoids the situation in which the barcode cannot be located because the fixed sequence contains errors caused by base synthesis errors or sequencing errors.
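To illustrate locating the barcode from alignment coordinates, the sketch below parses one line of blastn tabular output (`-outfmt 6`, whose default columns include `qstart` and `qend` as the 7th and 8th fields) and cuts the tags out of the read. The function name `extract_tags`, the assumption that the two 6 bp tags lie immediately outside the aligned amplicon region, and the example read are all hypothetical; the real amplicon layout (see FIG. 2) also contains fixed sequences whose exact positions are not given in the text.

```python
def extract_tags(read_seq, blast_line, tag_len=6):
    """Return (left_tag, right_tag) flanking the aligned region, or None.

    blast_line: one tab-separated line of blastn -outfmt 6 output, with
    1-based inclusive query coordinates qstart/qend in fields 7 and 8.
    Assumes (for illustration only) that tags directly flank the hit.
    """
    fields = blast_line.rstrip("\n").split("\t")
    qstart, qend = int(fields[6]), int(fields[7])
    left = read_seq[max(0, qstart - 1 - tag_len):qstart - 1]
    right = read_seq[qend:qend + tag_len]
    if len(left) < tag_len or len(right) < tag_len:
        return None  # read too short to contain both full tags
    return left, right


# Hypothetical read: 6 bp tag + 14 bp amplicon + 6 bp tag.
read = "ACGTAC" + "GATTACAGATTACA" + "TTGACA"
hit = "read1\tamplicon\t100.000\t14\t0\t0\t7\t20\t1\t14\t1e-05\t28.0"
tags = extract_tags(read, hit)  # ("ACGTAC", "TTGACA")
```

Because the coordinates come from the alignment of the amplicon itself, an error in a fixed adapter sequence does not prevent the tags from being found.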
In the invention, base sequencing quality is an important index for mutation detection, and false-positive mutation sites can be removed by filtering out bases with low quality values.
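The conversion in formula (1), followed by the +33 ASCII offset, can be sketched as below. The function name `phred33_char` and the cap at Q = 93 (the largest value that still maps to a printable ASCII character) are assumptions for illustration; how the error probability P is estimated from the per-position counts is not specified in the text and is left to the caller.

```python
import math


def phred33_char(p_error, q_max=93):
    """Convert an error probability P into a Phred33 quality character.

    Q = -10 * log10(P), rounded, clamped to [0, q_max], then chr(Q + 33).
    The clamp is an assumption so that P = 0 still yields a valid character.
    """
    if p_error <= 0:
        q = q_max
    else:
        q = min(q_max, round(-10 * math.log10(p_error)))
    q = max(0, q)
    return chr(q + 33)


q30 = phred33_char(0.001)  # Q = 30 -> "?"
q20 = phred33_char(0.01)   # Q = 20 -> "5"
```

A base with an error probability of 1 in 1000 therefore gets Q = 30, and 1 in 100 gets Q = 20, the lower end of the threshold range suggested for variant calling.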
In a second aspect, the present invention provides an apparatus for analyzing second-generation sequencing data, the apparatus being configured to perform the steps in the low-frequency gene mutation analysis method for second-generation sequencing data according to the first aspect, comprising:
database building unit: used for taking the sequence of the target gene to be detected as input and building a blastn database;
conversion unit: used for converting the sequencing fastq file into a fasta format file;
data acquisition unit: used for aligning the sequencing fasta sequences to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing sequence, and extracting the molecular tag sequence information;
classification unit: used for classifying sequencing sequences with the same molecular tag into the same family sequence, and filtering out family sequences that contain fewer than 3 sequencing sequences;
corrected sequencing quality calculation unit: used for performing multiple sequence alignment on all sequencing sequences in each family sequence, introducing null values, counting the occurrences of A, T, C, G and the null value across all sequencing sequences of the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q, and converting the result into the corresponding ASCII character, which is the corrected sequencing quality (Phred33) value for each base in the fastq file;
$Q = -10 \log_{10}(P)$    formula (1)
Wherein Q represents the base sequencing quality, and P represents the base sequencing error probability;
analysis unit: used for merging all sequencing sequences in the family sequence into one sequencing sequence, wherein at each position the base with the highest count is taken as the consensus sequence base; if the null value has the highest count at a position, a sequencing insertion error is judged to exist there and the position is removed from the consensus sequence; if several bases have the same count, the base with the highest corrected base sequencing quality is taken; a consensus fastq file is thereby obtained.
In a third aspect, the present invention provides use of the low-frequency gene mutation analysis method of second-generation sequencing data according to the first aspect, or of the analysis device of second-generation sequencing data according to the second aspect, in genetic variation detection.
In a fourth aspect, the present invention provides a method of detecting low frequency genetic variation, the method comprising:
performing second-generation sequencing on a sample to be tested, analyzing the second-generation sequencing data using the low-frequency gene mutation analysis method of second-generation sequencing data according to the first aspect or the analysis device of second-generation sequencing data according to the second aspect, performing genome alignment with alignment software based on the analysis results, detecting variation with variation detection analysis software, and outputting the variation results.
The invention develops a molecular tag sequencing data analysis method suitable for multiple sequencing platforms that rapidly processes sequencing data for subsequent mutation detection analysis, can accurately detect low-frequency SNV and INDEL mutations at the same time, and has broad application prospects, for example in clinical precision medicine and in basic research on gene mutation for purposes other than disease diagnosis.
Preferably, the alignment software comprises any one of bwa software, bowtie2 software, blast software and the like;
preferably, the mutation detection analysis software comprises any one of Varscan2 software, Mutect2 software, GATK software, Freebayes software and the like.
Preferably, the minimum base quality is set to 20 to 25 when the mutation detection analysis software detects SNVs, and may for example be 20, 21, 22, 23, 24 or 25; the minimum base quality is likewise set to 20 to 25 when detecting INDELs, and may for example be 20, 21, 22, 23, 24 or 25.
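What a minimum base quality threshold means on Phred33-encoded data can be sketched as below. Variant callers apply such thresholds internally through their own options; the function `mask_low_quality` and the choice to mask failing bases with "N" are hypothetical and only illustrate the decoding and comparison.

```python
def mask_low_quality(seq, qual, min_q=20):
    """Replace bases whose Phred33 quality is below min_q with 'N'.

    seq and qual are the sequence and quality lines of one fastq record.
    Masking with 'N' is an illustrative choice, not the patent's procedure.
    """
    return "".join(
        base if ord(char) - 33 >= min_q else "N"
        for base, char in zip(seq, qual)
    )


# Qualities "5?+I" decode to 20, 30, 10 and 40.
masked = mask_low_quality("ACGT", "5?+I")  # "ACNT"
```

With the threshold at 20, only the third base (Q = 10) is discarded; raising the threshold to 25 would also discard the first.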
In a fifth aspect, the present invention provides a computer device comprising a memory and a processor, the memory storing a computer program/instruction which when executed by the processor implements the steps of the low frequency gene mutation analysis method of the second generation sequencing data of the first aspect or the steps of the method of detecting low frequency gene mutation of the fourth aspect.
In a sixth aspect, the present invention provides a computer-readable storage medium storing a computer program for causing a computer to establish and/or run the steps of the low frequency gene mutation analysis method of the second generation sequencing data as described in the first aspect or the steps of the method of detecting low frequency gene mutation as described in the fourth aspect.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a molecular tag sequencing data detection method suitable for various sequencing platforms, in particular IonTorrent, which is used for determining the position information of a barcode in reads through blast searching instead of searching for a fixed sequence, comparing the multiple sequences of the reads in family and correcting the base sequencing quality value when carrying out consistency analysis, eliminating false base substitution introduced in the sequencing process, and eliminating false insertion and deletion, and can accurately detect SNV and INDEL mutation at the same time.
Drawings
FIG. 1 is a schematic diagram of an analysis flow;
FIG. 2 is a schematic diagram of amplicon structure;
FIG. 3 is a schematic diagram of consensus generation.
Detailed Description
The technical means adopted by the invention and the effects thereof are further described below with reference to the examples and the attached drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof.
The specific techniques or conditions are not identified in the examples and are described in the literature in this field or are carried out in accordance with the product specifications. The reagents or equipment used were conventional products available for purchase through regular channels, with no manufacturer noted.
Example 1
Sample preparation: first, a cancer patient sample was tested using first-generation sequencing to determine the proportion of the specific mutation site in the patient sample. The patient's cell line was then mixed with a wild-type cell line in defined proportions to prepare reference samples with various known mutation proportions.
Sample sequencing: libraries were built from the prepared samples and sequenced on Ion Torrent to obtain fastq files.
Data analysis: the analysis flow is shown in FIG. 1. The fastq file was converted into fasta format and aligned to the amplification template sequence using blastn to obtain the position coordinates of the amplicon on the reads. As shown in FIG. 2, a molecular tag and a fixed sequence are added to each end of the amplicon when the sequencing library is constructed; in the subsequent analysis, the front and rear molecular tags are used to remove base errors arising during PCR and sequencing and to recover the original sequence information. The position of the molecular tag in the sequencing sequence was located via the fixed sequence and the molecular tag sequence information was extracted, giving a fastq file containing the molecular tag information. After removing reads shorter than the amplification template, families with fewer than 3 reads were filtered out. Multiple sequence alignment was performed on all sequencing sequences in each family sequence, introducing null values (i.e. gaps); the occurrences of A, T, C, G and the null value across all sequencing sequences of the family were counted at each position, the base sequencing quality Q was calculated according to formula (1), 33 was added to Q, and the result was converted into the corresponding ASCII character, which is the corrected sequencing quality (Phred33) value for each base in the fastq file. The reads in each family were then merged into one read by the consensus algorithm to obtain a consensus.fastq file.
FIG. 3 shows one family containing 6 sequencing sequences, which after multiple sequence alignment are reduced to 1 consensus read that is taken to be the original base sequence. Where the bases of all 6 sequences at a position are identical, the original sequence is taken to have that base at the position. At the last position of a run of consecutive T, 2 of the 6 sequences have a T and the other 4 have a gap ("-"), so the original sequence is taken to have no base at that position. 5 of the 6 sequences read GTGT and 1 reads TGTG at the corresponding position, so the original sequence is taken to be GTGTGT at that position. At one position, 5 of the 6 sequences have the base A and 1 has a deletion, so the original sequence is taken to be A at that position. The consensus.fastq file was aligned to the reference genome using bwa to obtain a sam file, and mutation sites were detected using samtools and Varscan2.
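The first data-analysis step of this example, converting fastq records to fasta before the blastn alignment, can be sketched as follows. The function name `fastq_to_fasta` is hypothetical, and the sketch assumes simple four-line fastq records (no wrapped sequence lines) held in memory as a string.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line fastq records into fasta text.

    Keeps the read name and sequence; the '+' and quality lines are dropped.
    Assumes each record occupies exactly four lines.
    """
    lines = fastq_text.strip().splitlines()
    out = []
    for i in range(0, len(lines), 4):
        header, seq = lines[i], lines[i + 1]
        out.append(">" + header[1:])  # replace leading '@' with '>'
        out.append(seq)
    return "\n".join(out) + "\n"


example_fastq = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n##\n"
example_fasta = fastq_to_fasta(example_fastq)
```

The quality lines are discarded here only for the alignment input; the original fastq file is still needed afterwards, since the consensus step writes corrected qualities back to a fastq file.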
Samples with known mutation sites and mutation proportions were prepared and sequenced using IonTorrent. The target mutation types include SNP and INDEL, and the target mutation proportions range from 1% to 30%.
The test results are shown in Table 1.
TABLE 1
Conclusion: within the error range, the detection results are essentially consistent with the expected results, which shows that the method performs very well in detecting point mutations and insertion mutations in IonTorrent sequencing data.
Example 2
Samples of known mutation sites and mutation ratios were prepared simultaneously, using Illumina sequencing, and the detection procedure was as described in example 1. The target mutation type is indel, and the target mutation proportion is about 10%.
The results of the detection are shown in Table 2, which shows that the method of the invention is also applicable to the Illumina sequencing data.
TABLE 2
Comparative example 1
Samples one, two, three and four from Examples 1 and 2 were analyzed using the consensus module of samtools as the comparison technique. The results are shown in Table 3: the samtools consensus module can detect SNPs but cannot analyze INDELs, whereas the method of the invention detects INDELs accurately.
TABLE 3
In summary, the invention develops a molecular tag sequencing data detection method suitable for multiple sequencing platforms, in particular IonTorrent. It determines the position of the barcode by using a blast search to locate the barcode sequence in the reads rather than by searching for a fixed sequence, and during consensus analysis it performs multiple sequence alignment of the reads within each family and corrects the base sequencing quality values. It thus eliminates not only erroneous base substitutions introduced during sequencing but also erroneous insertions and deletions, and can accurately detect low-frequency SNV and INDEL mutations at the same time.
The applicant states that the detailed method of the present invention is illustrated by the above examples, but the present invention is not limited to the detailed method described above, i.e. it does not mean that the present invention must be practiced in dependence upon the detailed method described above. It should be apparent to those skilled in the art that any modification of the present invention, equivalent substitution of raw materials for the product of the present invention, addition of auxiliary components, selection of specific modes, etc., falls within the scope of the present invention and the scope of disclosure.
Claims (10)
1. A method for low frequency gene mutation analysis of second generation sequencing data, the method comprising the steps of:
(1) Taking the sequence of the target gene to be detected as input, and building a blastn database;
(2) Converting the sequencing fastq file into a fasta format file;
(3) Aligning the sequencing fasta sequences to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing sequence, and extracting the molecular tag sequence information;
(4) Classifying sequencing sequences with the same molecular tag into the same family sequence, and filtering out family sequences that contain fewer than 3 sequencing sequences;
(5) Performing multiple sequence alignment on all sequencing sequences in each family sequence, introducing null values, counting the occurrences of A, T, C, G and the null value across all sequencing sequences of the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q, and converting the result into the corresponding ASCII character, which is the corrected sequencing quality (Phred33) value for each base in the fastq file;
$Q = -10 \log_{10}(P)$    formula (1)
Wherein Q represents the base sequencing quality, and P represents the base sequencing error probability;
(6) Merging all sequencing sequences in the family sequence into one sequencing sequence, wherein at each position the base with the highest count is taken as the consensus sequence base; if the null value has the highest count at a position, a sequencing insertion error is judged to exist there and the position is removed from the consensus sequence; if several bases have the same count, the base with the highest corrected base sequencing quality is taken; a consensus fastq file is thereby obtained.
2. An apparatus for analyzing second-generation sequencing data, wherein the apparatus is configured to perform the steps of the method for analyzing low-frequency gene mutation of second-generation sequencing data according to claim 1, comprising:
database building unit: used for taking the sequence of the target gene to be detected as input and building a blastn database;
conversion unit: used for converting the sequencing fastq file into a fasta format file;
data acquisition unit: used for aligning the sequencing fasta sequences to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing sequence, and extracting the molecular tag sequence information;
classification unit: used for classifying sequencing sequences with the same molecular tag into the same family sequence, and filtering out family sequences that contain fewer than 3 sequencing sequences;
corrected sequencing quality calculation unit: used for performing multiple sequence alignment on all sequencing sequences in each family sequence, introducing null values, counting the occurrences of A, T, C, G and the null value across all sequencing sequences of the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q, and converting the result into the corresponding ASCII character, which is the corrected sequencing quality (Phred33) value for each base in the fastq file;
$Q = -10 \log_{10}(P)$    formula (1)
Wherein Q represents the base sequencing quality, and P represents the base sequencing error probability;
analysis unit: used for merging all sequencing sequences in the family sequence into one sequencing sequence, wherein at each position the base with the highest count is taken as the consensus sequence base; if the null value has the highest count at a position, a sequencing insertion error is judged to exist there and the position is removed from the consensus sequence; if several bases have the same count, the base with the highest corrected base sequencing quality is taken; a consensus fastq file is thereby obtained.
3. Use of the low frequency gene mutation analysis method of the second generation sequencing data of claim 1 or the analysis device of the second generation sequencing data of claim 2 in gene mutation detection.
4. A method of detecting low frequency genetic variation, the method comprising:
performing second-generation sequencing on a sample to be tested; analyzing the second-generation sequencing data using the low-frequency gene mutation analysis method for second-generation sequencing data of claim 1 or the analysis device for second-generation sequencing data of claim 2; performing genome alignment with alignment software based on the analysis results; detecting variants with variation detection analysis software; and outputting the variation results.
5. The method of claim 4, wherein the alignment software comprises any of bwa software, bowtie2 software or blast software.
6. The method of claim 4, wherein the mutation detection analysis software comprises any of VarScan2 software, Mutect2 software, GATK software, or FreeBayes software.
7. The method of claim 6, wherein the minimum base quality value of the mutation detection analysis software is 20-25 when detecting SNVs.
8. The method of claim 6, wherein the minimum base quality value of the mutation detection analysis software is 20-25 when detecting INDELs.
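For orientation, the 20-25 thresholds in claims 7 and 8 can be read through formula (1). The worked values below assume that the variation detection software interprets its minimum base quality on the same Phred scale; the claims do not state this explicitly.

$$P = 10^{-Q/10}: \qquad Q = 20 \;\Rightarrow\; P = 10^{-2} = 0.01, \qquad Q = 25 \;\Rightarrow\; P = 10^{-2.5} \approx 0.0032$$

Under that assumption, a minimum base quality of 20 to 25 retains only bases whose estimated error probability is at most about 1% down to about 0.3%.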
9. A computer device comprising a memory and a processor, the memory storing a computer program/instruction, wherein the computer program/instruction when executed by the processor performs the steps of the low frequency gene mutation analysis method of the second generation sequencing data of claim 1 or the steps of the method of detecting low frequency gene variation of any one of claims 4-8.
10. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to establish and/or run the steps of the low frequency gene mutation analysis method of the second generation sequencing data of claim 1 or the steps of the method of detecting low frequency gene mutation of any one of claims 4 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311696182.6A CN117437978A (en) | 2023-12-12 | 2023-12-12 | Low-frequency gene mutation analysis method and device for second-generation sequencing data and application of low-frequency gene mutation analysis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117437978A (en) | 2024-01-23 |
Family
ID=89553645
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202311696182.6A | Pending | CN117437978A (en) | 2023-12-12 | 2023-12-12 | Low-frequency gene mutation analysis method and device for second-generation sequencing data and application of low-frequency gene mutation analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117437978A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101921840A (en) * | 2010-06-30 | 2010-12-22 | 深圳华大基因科技有限公司 | DNA molecular label technology and DNA incomplete interrupt policy-based PCR sequencing method |
CN106701897A (en) * | 2015-11-12 | 2017-05-24 | 深圳华大基因研究院 | Method and apparatus for simultaneously detecting gene point mutation, insertion/deletion and CNV |
CN108154010A (en) * | 2017-12-26 | 2018-06-12 | 东莞博奥木华基因科技有限公司 | A kind of ctDNA low frequencies mutation sequencing data analysis method and device |
CN110734908A (en) * | 2019-11-15 | 2020-01-31 | 福州福瑞医学检验实验室有限公司 | Construction method of high-throughput sequencing library and kit for library construction |
US20210225456A1 (en) * | 2018-07-27 | 2021-07-22 | Myriad Women's Health, Inc. | Method for detecting genetic variation in highly homologous sequences by independent alignment and pairing of sequence reads |
KR20210112350A (en) * | 2019-01-04 | 2021-09-14 | 윌리엄 마쉬 라이스 유니버시티 | Quantitative amplicon sequencing for detection of multiple copy number variations and quantification of allele ratios |
CN114530199A (en) * | 2022-01-19 | 2022-05-24 | 重庆邮电大学 | Method and device for detecting low-frequency mutation based on double sequencing data and storage medium |
US20220364080A1 (en) * | 2019-09-20 | 2022-11-17 | Sophia Genetics S.A. | Methods for dna library generation to facilitate the detection and reporting of low frequency variants |
CN115369159A (en) * | 2022-08-30 | 2022-11-22 | 上海交通大学医学院 | Ultralow frequency mutation detection method based on double-end sequencing overlapping fragment and DNA double-strand complementary fragment |
CN116469462A (en) * | 2023-03-20 | 2023-07-21 | 重庆邮电大学 | Ultra-low frequency DNA mutation identification method and device based on double sequencing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10991453B2 (en) | Alignment of nucleic acid sequences containing homopolymers based on signal values measured for nucleotide incorporations | |
EP2834762B1 (en) | Sequence assembly | |
CN107229841B (en) | A kind of genetic mutation appraisal procedure and system | |
Larsson et al. | Comparative microarray analysis | |
CN113249453B (en) | Method for detecting copy number change | |
Kearse et al. | The Geneious 6.0. 3 read mapper | |
CN115691672B (en) | Base quality value correction method and device for sequencing platform characteristics, electronic equipment and storage medium | |
CN115083521B (en) | Method and system for identifying tumor cell group in single cell transcriptome sequencing data | |
KR101795662B1 (en) | Apparatus and Method for Diagnosis of metabolic disease | |
CN106591451B (en) | Method for determining the content of fetal free DNA and device for carrying out said method | |
CN109461473B (en) | Method and device for acquiring concentration of free DNA of fetus | |
US20240221954A1 (en) | Disease prediction methods and devices, electronic devices, and computer readable storage media | |
CN117437978A (en) | Low-frequency gene mutation analysis method and device for second-generation sequencing data and application of low-frequency gene mutation analysis method and device | |
CN116072222B (en) | Method for identifying and splicing viral genome and application thereof | |
CN116994649A (en) | Intelligent judging method and intelligent judging system for gene detection data | |
WO2019213810A1 (en) | Method, apparatus, and system for detecting chromosome aneuploidy | |
CN112885407B (en) | Second-generation sequencing-based micro-haplotype detection and typing system and method | |
CN113409886A (en) | HIV subtype classification system and classification method | |
Veeramachaneni | Data Analysis in Rare Disease Diagnostics | |
JP2004219140A (en) | Mass spectrum analyzing method and computer program | |
Liu et al. | Systematic biases in reference-based plasma cell-free DNA fragmentomic profiling | |
CN113327646A (en) | Sequencing sequence processing method and device, storage medium and electronic equipment | |
KR101907650B1 (en) | Method of non-invasive trisomy detection of fetal aneuploidy | |
CN114171118B (en) | Data processing method and device for noninvasive gene detection | |
CN117577182B (en) | System for rapidly identifying drug identification sites and application thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20240123 |