CN117437978A

CN117437978A - Low-frequency gene mutation analysis method and device for second-generation sequencing data and application thereof

Info

Publication number: CN117437978A
Application number: CN202311696182.6A
Authority: CN
Inventors: 李宇龙; 张钰; 苏晓云; 李彪; 葛猛; 叶锋
Original assignee: Beijing Genomeprecision Technology Co ltd
Current assignee: Beijing Genomeprecision Technology Co ltd
Priority date: 2023-12-12
Filing date: 2023-12-12
Publication date: 2024-01-23

Abstract

The invention discloses a low-frequency gene mutation analysis method and device for second-generation sequencing data and application thereof, and in particular an analysis method and device for molecular tag sequencing data from the Ion Torrent sequencing platform and application thereof. The invention provides a new molecular tag sequencing data analysis method suitable for multiple sequencing platforms, in particular Ion Torrent. The method determines the position of the barcode by searching for and locating the barcode sequence within the reads rather than by searching for a fixed sequence, and, during consensus analysis, performs multiple sequence alignment on the reads in each family and corrects the base sequencing quality values. It thereby eliminates erroneous base substitutions introduced during sequencing as well as erroneous insertions and deletions, and can accurately detect low-frequency SNV and INDEL mutations at the same time.

Description

Low-frequency gene mutation analysis method and device for second-generation sequencing data and application thereof

Technical Field

The invention belongs to the technical field of bioinformatics, relates to a low-frequency gene mutation analysis method and device of second-generation sequencing data and application thereof, and particularly relates to an analysis method and device of molecular tag sequencing data for an IonTorrent sequencing platform and application thereof.

Background

In the study and application of clinical precision medicine, low frequency (< 1%) somatic mutations including point mutations, insertions and deletions of genes have been a hotspot of interest.

NGS technology is widely used to detect genetic variation. However, owing to limitations of the technology itself, NGS introduces erroneous bases during sequencing, so that the target mutation sites to be detected are masked by noise and cannot be detected correctly.

The Ion Torrent sequencer was the first commercial sequencer that does not need an optical system. It uses semiconductor sequencing technology, in which a semiconductor chip converts chemical signals directly into digital signals, and it is an economical, rapid, simple and scalable sequencing technology that is well suited to amplicon sequencing. Because of its short sequencing time and inexpensive instrumentation, it is widely used. However, its sensor does not detect runs of consecutive identical bases perfectly, so the number of bases counted in such a run can be wrong, and the target mutation site to be detected can be masked by noise and not detected correctly.

The Illumina sequencing platform is technically mature, but it still has a sequencing error rate of 0.1%-1%, mainly in the form of base substitutions and an AT base bias.

To improve the accuracy of low-frequency mutation detection, molecular tags can be used to improve detection sensitivity. When the sequencing library is constructed, a random 6 bp sequence, called a barcode, is attached to each end of the molecule to be amplified. The barcode is amplified and sequenced together with the attached molecule in the downstream sequencing process. Reads with identical barcodes belong to the same family and can be considered to have been amplified from the same original molecule. In theory the reads in a family should be identical, so merging all reads in a family into one consensus read through consensus analysis eliminates the base sequencing errors and redundancy introduced by the sequencing process.
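
The family concept described above can be illustrated with a minimal Python sketch. It assumes that the front and rear barcodes have already been extracted from each read; the function name `group_into_families`, the record layout and the example sequences are hypothetical and chosen only for illustration. The default cutoff of 3 reads mirrors the family-size filter used by the method of this document.

```python
from collections import defaultdict


def group_into_families(tagged_reads, min_family_size=3):
    """Group (front_barcode, rear_barcode, sequence) records into families.

    Reads sharing the same barcode pair are treated as copies of one
    original molecule. Families smaller than min_family_size are dropped.
    """
    families = defaultdict(list)
    for front, rear, sequence in tagged_reads:
        families[(front, rear)].append(sequence)
    return {
        tag: seqs
        for tag, seqs in families.items()
        if len(seqs) >= min_family_size
    }


# Hypothetical example: three reads share one barcode pair, one read stands alone.
example_reads = [
    ("ACGTAC", "TTGACA", "GATTACAGATTACA"),
    ("ACGTAC", "TTGACA", "GATTACAGATTACA"),
    ("ACGTAC", "TTGACA", "GATTACAGATTTACA"),
    ("GGCATT", "CAGTCA", "GATTACAGATTACA"),
]
families = group_into_families(example_reads)
print({tag: len(seqs) for tag, seqs in families.items()})
```

Keying the family on the pair of barcodes, rather than on either barcode alone, is a design choice of this sketch; it reduces the chance that two different original molecules fall into one family.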

Bioinformatics software for analyzing molecular tag sequencing data includes UMItools, fgbio, samtools, smCounter and Conner. UMItools, fgbio, smCounter and Conner are better suited to data generated by the Illumina platform than to data from the Ion Torrent platform. The consensus module of samtools can process sequencing data from multiple platforms, including Illumina and Ion Torrent, and performs well in detecting SNVs (single nucleotide variants). However, samtools refers to the reference genome when merging reads and removes the original insertions and deletions in the reads, so INDELs cannot be detected.

In conclusion, the development of the data analysis method suitable for the multiple sequencing platforms has important significance in the field of gene variation detection.

Disclosure of Invention

Aiming at the defects and actual demands of the prior art, the invention provides a low-frequency gene mutation analysis method and device for second-generation sequencing data and application thereof, in particular to an analysis method and device for molecular tag sequencing data of an IonTorrent sequencing platform and application thereof, and can accurately detect low-frequency SNV and INDEL mutations at the same time.

In order to achieve the above purpose, the invention adopts the following technical scheme:

in a first aspect, the present invention provides a method for low frequency gene mutation analysis of second generation sequencing data, the method comprising the steps of:

(1) Taking the sequence of the detected target gene as input, and establishing a database of blastn;

(2) Converting the sequenced fastq file into a fasta format file;

(3) Aligning the sequencing fasta sequences (i.e. reads) to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing read, and extracting the molecular tag sequence information;

(4) Grouping sequencing reads with the same molecular tag into the same family sequence (i.e. family), and filtering out families containing fewer than 3 sequencing reads;

(5) Performing multiple sequence alignment on all sequencing reads in each family, introducing null values (i.e. gaps), counting the A, T, C, G and null values of all sequencing reads in the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q and converting the result into the corresponding character of the ASCII table, which is the corrected Phred33 sequencing quality value for each base in the fastq file;

Q = -10 log₁₀(P)        formula (1)

Wherein Q represents the base sequencing quality, and P represents the base sequencing error probability;

(6) Merging all sequencing reads in the family into one sequencing read, taking the base with the highest count at each position as the base of the consensus sequence; if the null value has the highest count at a position, judging that a sequencing insertion error exists at that position and removing the position from the consensus sequence; and if several bases have the same count, taking the base with the highest corrected base sequencing quality, thereby obtaining the consensus fastq file.
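
Steps (5) and (6) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the reads are assumed to be already aligned to equal length with "-" as the null value; the error probability is assumed to be P = 1 - (count of the winning base / family size), which the text does not specify; Q is capped at an assumed maximum of 40 when all reads agree; and ties between bases are broken by the summed input Phred scores of the supporting reads as a stand-in for "highest corrected quality". The names `corrected_quality` and `family_consensus` are hypothetical.

```python
import math
from collections import Counter

GAP = "-"
MAX_Q = 40  # ASSUMED cap, used when every read agrees and P would be 0


def corrected_quality(top_count, depth, max_q=MAX_Q):
    """Formula (1), Q = -10 log10(P), with the ASSUMED P = 1 - top_count / depth."""
    p_error = 1.0 - top_count / depth
    if p_error <= 0.0:
        return max_q
    return min(max_q, round(-10.0 * math.log10(p_error)))


def family_consensus(aligned_seqs, aligned_quals):
    """Collapse one aligned family into (consensus_sequence, phred33_string).

    aligned_seqs: equal-length strings over A/C/G/T and '-'.
    aligned_quals: Phred33 strings of the same shape (any placeholder at gaps).
    """
    depth = len(aligned_seqs)
    bases, quals = [], []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs)
        top_count = max(counts.values())
        tied_bases = [s for s, n in counts.items() if n == top_count and s != GAP]
        if not tied_bases:
            # The null value is strictly the most frequent symbol:
            # treat the column as a sequencing insertion error and drop it.
            continue
        # ASSUMED tie-break: highest summed input quality among tied bases.
        support = Counter()
        for seq, qual in zip(aligned_seqs, aligned_quals):
            if seq[col] in tied_bases:
                support[seq[col]] += ord(qual[col]) - 33
        best = max(tied_bases, key=lambda base: support[base])
        bases.append(best)
        quals.append(chr(corrected_quality(top_count, depth) + 33))
    return "".join(bases), "".join(quals)


# Hypothetical family of 6 aligned reads: column 3 is an insertion seen in 2 reads.
seqs = ["ACT-G", "ACT-G", "ACT-G", "ACT-G", "ACTTG", "A-TTG"]
consensus_seq, consensus_qual = family_consensus(seqs, ["IIIII"] * 6)
print(consensus_seq, consensus_qual)
```

In this example the column where four reads carry a null value is removed, and the column where one read has a deletion keeps the majority base with a lower corrected quality than the unanimous columns.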

The invention provides a new molecular tag sequencing data analysis method suitable for multiple sequencing platforms, in particular Ion Torrent. The barcode sequence in each read is found and located by blast, rather than by searching for a fixed sequence to determine the barcode position. During consensus analysis, multiple sequence alignment is performed on the reads in each family and the base sequencing quality values are corrected, which eliminates erroneous insertions and deletions in addition to the erroneous base substitutions introduced during sequencing. The method is compatible with data from the mainstream second-generation sequencing platforms Illumina and Ion Torrent, works particularly well for Ion Torrent data, and can accurately detect low-frequency SNV and INDEL mutations at the same time.

In the invention, blastn is used to align each read to the genome to determine the position of the barcode on each read, rather than extracting the barcode only by searching for the fixed sequences at the two ends of the read. This avoids the situation in which the barcode cannot be located because the fixed sequence contains errors caused by base synthesis errors or sequencing errors.
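
A minimal Python sketch of this idea is given below. It assumes blastn was run with tabular output (`-outfmt 6`, whose default columns place the 1-based query start and end in columns 7 and 8), that the read is in the same orientation as the target, and that the 6 bp barcodes sit immediately next to the amplicon on each side. The actual amplicon layout is defined by FIG. 2, so the adjacency, the function names and the example read are assumptions made for illustration only.

```python
def parse_blastn_hit(line):
    """Parse one line of blastn tabular output (default -outfmt 6 columns)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qseqid": fields[0],
        "sseqid": fields[1],
        "qstart": int(fields[6]),  # 1-based start of the hit on the read
        "qend": int(fields[7]),    # 1-based inclusive end of the hit on the read
    }


def extract_barcodes(read_seq, qstart, qend, tag_len=6):
    """Return (front_barcode, rear_barcode) flanking the aligned amplicon.

    ASSUMPTION: each barcode lies directly adjacent to the amplicon.
    Returns None when the read is too short to contain both barcodes.
    """
    front_start = qstart - 1 - tag_len
    rear_end = qend + tag_len
    if front_start < 0 or rear_end > len(read_seq):
        return None
    return read_seq[front_start:qstart - 1], read_seq[qend:rear_end]


# Hypothetical read: 2 bp adapter remnant, front barcode, amplicon, rear barcode, 2 bp tail.
read = "GG" + "AACCGG" + "TTTTGGGGCCCC" + "CATCAT" + "AA"
hit = parse_blastn_hit("read1\tamplicon\t100.000\t12\t0\t0\t9\t20\t1\t12\t1e-3\t24.3")
print(extract_barcodes(read, hit["qstart"], hit["qend"]))
```

Because the coordinates come from the alignment of the amplicon itself, an error inside a fixed flanking sequence does not prevent the barcode from being located.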

In the invention, the base sequencing quality is an important index when detecting mutations, and false positive mutation sites can be removed by filtering out bases with low quality values.
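
As a worked illustration of formula (1) and the Phred33 encoding, the following arithmetic uses two example error probabilities chosen here for illustration (they are not values reported in this document):

```latex
\[
\begin{aligned}
P = 10^{-2}:&\quad Q = -10\log_{10}\!\left(10^{-2}\right) = 20, &\quad 20 + 33 = 53 &\;\mapsto\; \text{ASCII character ``5''} \\
P = 10^{-3}:&\quad Q = -10\log_{10}\!\left(10^{-3}\right) = 30, &\quad 30 + 33 = 63 &\;\mapsto\; \text{ASCII character ``?''}
\end{aligned}
\]
```

A base with Q = 20 therefore corresponds to an estimated error probability of 1%, so a quality filter at Q = 20 discards bases whose estimated error probability exceeds 1%.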

In a second aspect, the present invention provides an apparatus for analyzing second-generation sequencing data, the apparatus being configured to perform the steps in the low-frequency gene mutation analysis method for second-generation sequencing data according to the first aspect, comprising:

a database building unit: for taking the sequence of the detected target gene as input and establishing a blastn database;

a conversion unit: for converting the sequenced fastq file into a fasta format file;

a data acquisition unit: for aligning the sequencing fasta sequences to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing read, and extracting the molecular tag sequence information;

a classification unit: for grouping sequencing reads with the same molecular tag into the same family, and filtering out families containing fewer than 3 sequencing reads;

a corrected sequencing quality calculation unit: for performing multiple sequence alignment on all sequencing reads in each family, introducing null values, counting the A, T, C, G and null values of all sequencing reads in the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q and converting the result into the corresponding character of the ASCII table, which is the corrected Phred33 sequencing quality value for each base in the fastq file;

Q = -10 log₁₀(P)        formula (1)

an analysis unit: for merging all sequencing reads in the family into one sequencing read, wherein the base with the highest count at each position is taken as the base of the consensus sequence; if the null value has the highest count at a position, the position is judged to contain a sequencing insertion error and is removed from the consensus sequence; and if several bases have the same count, the base with the highest corrected base sequencing quality is taken, thereby obtaining the consensus fastq file.
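
Among the units listed above, the conversion unit is the simplest to illustrate. The following Python sketch assumes the common four-line FASTQ record layout (header, sequence, "+" line, quality); the function name `fastq_to_fasta` is hypothetical, and the sketch does not handle multi-line FASTQ records.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records into FASTA text (quality lines are dropped)."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence, plus, _quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        out.append(">" + header[1:])  # '@name' becomes '>name'
        out.append(sequence)
    return "".join(line + "\n" for line in out)


# Hypothetical two-record input.
example_fastq = "@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n55\n"
print(fastq_to_fasta(example_fastq), end="")
```

The quality strings are discarded here only for the blastn alignment step; the original fastq file is still needed later, because the consensus step writes corrected quality values back into fastq format.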

In a third aspect, the present invention provides use of the low-frequency gene mutation analysis method for second-generation sequencing data according to the first aspect, or of the analysis device for second-generation sequencing data according to the second aspect, in genetic variation detection.

In a fourth aspect, the present invention provides a method of detecting low frequency genetic variation, the method comprising:

performing second-generation sequencing on a sample to be tested, analyzing the second-generation sequencing data using the low-frequency gene mutation analysis method for second-generation sequencing data according to the first aspect or the analysis device for second-generation sequencing data according to the second aspect, performing genome alignment on the analysis results using alignment software, detecting variation using variation detection analysis software, and outputting the variation results.

The invention develops a molecular tag sequencing data analysis method suitable for multiple sequencing platforms, which rapidly processes sequencing data and then performs variation detection analysis. It can accurately detect low-frequency SNV and INDEL mutations at the same time and has broad application prospects, for example in clinical precision medicine and in basic research on gene mutation for non-diagnostic purposes.

Preferably, the alignment software comprises any one of bwa, bowtie2, blast and the like;

preferably, the variation detection analysis software comprises any one of Varscan2, Mutect2, GATK, Freebayes and the like.

Preferably, the minimum base quality value is set to 20 to 25 when the variation detection analysis software detects SNVs, for example 20, 21, 22, 23, 24 or 25, and the minimum base quality value is set to 20 to 25 when it detects INDELs, for example 20, 21, 22, 23, 24 or 25.
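
As one possible illustration of the preferred minimum base quality setting, the Python sketch below only assembles a VarScan 2 command line and does not run it. It assumes that the `mpileup2snp` and `mpileup2indel` subcommands and the `--min-avg-qual` and `--output-vcf` options behave as described in the VarScan 2 documentation; the jar path and pileup file name are hypothetical placeholders, and restricting the value to 20-25 is a choice of this sketch that mirrors the preferred range above.

```python
def varscan_command(mode, pileup_path, min_base_quality=20, varscan_jar="VarScan.jar"):
    """Build (but do not execute) a VarScan 2 call with a minimum base quality.

    mode: 'mpileup2snp' for SNVs or 'mpileup2indel' for INDELs.
    pileup_path and varscan_jar are hypothetical placeholders.
    """
    if mode not in ("mpileup2snp", "mpileup2indel"):
        raise ValueError("mode must be 'mpileup2snp' or 'mpileup2indel'")
    if not 20 <= min_base_quality <= 25:
        raise ValueError("this sketch only accepts the preferred range 20-25")
    return [
        "java", "-jar", varscan_jar, mode, pileup_path,
        "--min-avg-qual", str(min_base_quality),
        "--output-vcf", "1",
    ]


snv_cmd = varscan_command("mpileup2snp", "consensus.pileup", min_base_quality=20)
indel_cmd = varscan_command("mpileup2indel", "consensus.pileup", min_base_quality=25)
print(" ".join(snv_cmd))
print(" ".join(indel_cmd))
```

Returning the command as a list keeps it suitable for `subprocess.run` without shell quoting, should the reader choose to execute it.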

In a fifth aspect, the present invention provides a computer device comprising a memory and a processor, the memory storing a computer program/instruction which when executed by the processor implements the steps of the low frequency gene mutation analysis method of the second generation sequencing data of the first aspect or the steps of the method of detecting low frequency gene mutation of the fourth aspect.

In a sixth aspect, the present invention provides a computer-readable storage medium storing a computer program for causing a computer to establish and/or run the steps of the low frequency gene mutation analysis method of the second generation sequencing data as described in the first aspect or the steps of the method of detecting low frequency gene mutation as described in the fourth aspect.

Compared with the prior art, the invention has the following beneficial effects:

the invention provides a molecular tag sequencing data analysis method suitable for multiple sequencing platforms, in particular Ion Torrent. The position of the barcode is determined by locating the barcode sequence in the reads through a blast search rather than by searching for a fixed sequence, and during consensus analysis multiple sequence alignment is performed on the reads in each family and the base sequencing quality values are corrected. The method thereby eliminates erroneous base substitutions introduced during sequencing as well as erroneous insertions and deletions, and can accurately detect SNV and INDEL mutations at the same time.

Drawings

FIG. 1 is a schematic diagram of an analysis flow;

FIG. 2 is a schematic diagram of amplicon structure;

FIG. 3 is a schematic diagram of consensus generation.

Detailed Description

The technical means adopted by the invention and the effects thereof are further described below with reference to the examples and the attached drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof.

Where specific techniques or conditions are not indicated in the examples, they follow those described in the literature in this field or the product specifications. Reagents or equipment for which no manufacturer is noted are conventional products available through regular commercial channels.

Example 1

Sample preparation: first, a cancer patient sample was tested by first-generation sequencing to determine the proportion of the specific mutation site in the patient sample. The patient cell line was then mixed with a wild-type cell line in defined proportions to prepare reference samples with various known mutation proportions.

Sample sequencing: libraries were constructed from the prepared samples and sequenced on Ion Torrent to obtain fastq files.

Data analysis: the analysis flow is shown in FIG. 1. The fastq file was converted into fasta format and aligned to the amplification template sequence using blastn to obtain the position coordinates of the amplicon on the reads. As shown in FIG. 2, molecular tags and fixed sequences are added to both ends of the amplicon when the sequencing library is constructed; in the subsequent analysis, the front and rear molecular tags are used to remove base errors arising during PCR and sequencing and to recover the original sequence information, and the fixed sequences locate the position of the molecular tags in the sequencing read so that the molecular tag sequence information can be extracted, giving a fastq file containing the molecular tag information. Reads shorter than the amplification template were removed, and families with fewer than 3 reads were then filtered out. Multiple sequence alignment was performed on all sequencing reads in each family, null values (i.e. gaps) were introduced, the A, T, C, G and null values of all sequencing reads in the family were counted at each position, the base sequencing quality Q was calculated according to formula (1), and 33 was added to Q and the result converted into the corresponding character of the ASCII table, which is the corrected Phred33 sequencing quality value for each base in the fastq file. The reads in each family were merged into one read using the consensus algorithm to obtain the consensus.fastq file.

FIG. 3 shows one family containing 6 sequencing reads, which are reduced to 1 consensus read after multiple sequence alignment; this read is considered to be the original base sequence. Where the bases of all 6 reads at a position are identical, the original sequence is considered to have that base at the position. Where 2 of the 6 reads have a T at the end of a run of consecutive T's and the other 4 have a deletion ("-") at the corresponding position, the original sequence is considered to have no base at that position. Where 5 of the 6 reads have GTGT and 1 read has TGTG at the corresponding position, the original sequence is considered to be GTGTGT at that position. Where 5 of the 6 reads have an A at a position and 1 read has a deletion at the corresponding position, the original sequence is considered to have an A at that position.

The consensus.fastq file was aligned to the reference genome using bwa to obtain a sam file, and mutation sites were detected using samtools and Varscan2.

Samples with known mutation sites and mutation proportions were prepared as described above and sequenced on Ion Torrent. The target mutation types include SNPs and INDELs, and the target mutation proportions range from 1% to 30%.

The test results are shown in Table 1.

TABLE 1

Conclusion: the detection results are essentially consistent with the expected results within the error range, which shows that the method performs very well in detecting point mutations and insertion mutations in Ion Torrent sequencing data.

Example 2

Samples of known mutation sites and mutation ratios were prepared simultaneously, using Illumina sequencing, and the detection procedure was as described in example 1. The target mutation type is indel, and the target mutation proportion is about 10%.

The results of the detection are shown in Table 2, which shows that the method of the invention is also applicable to the Illumina sequencing data.

TABLE 2

Comparative example 1

Samples one, two, three and four of Examples 1 and 2 were analyzed using the consensus module of samtools as a comparison technique. The results are shown in Table 3: the samtools consensus module can detect SNPs but cannot analyze INDELs, whereas the method of the invention detects INDELs accurately.

TABLE 3

In summary, the invention develops a molecular tag sequencing data analysis method suitable for multiple sequencing platforms, in particular Ion Torrent. The position of the barcode is determined by locating the barcode sequence in the reads through a blast search rather than by searching for a fixed sequence, and during consensus analysis multiple sequence alignment is performed on the reads in each family and the base sequencing quality values are corrected. The invention thereby eliminates not only the erroneous base substitutions introduced during sequencing but also erroneous insertions and deletions, and can accurately detect low-frequency SNV and INDEL mutations at the same time.

The applicant states that the detailed method of the present invention is illustrated by the above examples, but the present invention is not limited to the detailed method described above, i.e. it does not mean that the present invention must be practiced in dependence upon the detailed method described above. It should be apparent to those skilled in the art that any modification of the present invention, equivalent substitution of raw materials for the product of the present invention, addition of auxiliary components, selection of specific modes, etc., falls within the scope of the present invention and the scope of disclosure.

Claims

1. A method for low frequency gene mutation analysis of second generation sequencing data, the method comprising the steps of:

(1) Taking the sequence of the detected target gene as input, and establishing a blastn database;

(2) Converting the sequenced fastq file into a fasta format file;

(3) Aligning the sequencing fasta sequences to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing read, and extracting the molecular tag sequence information;

(4) Grouping sequencing reads with the same molecular tag into the same family, and filtering out families containing fewer than 3 sequencing reads;

(5) Performing multiple sequence alignment on all sequencing reads in each family, introducing null values, counting the A, T, C, G and null values of all sequencing reads in the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q and converting the result into the corresponding character of the ASCII table, which is the corrected Phred33 sequencing quality value for each base in the fastq file;

Q = -10 log₁₀(P)        formula (1)

(6) Merging all sequencing reads in the family into one sequencing read, wherein the base with the highest count at each position is taken as the base of the consensus sequence; if the null value has the highest count at a position, determining that a sequencing insertion error exists at that position and removing the position from the consensus sequence; and if several bases have the same count, taking the base with the highest corrected base sequencing quality, thereby obtaining the consensus fastq file.

2. An apparatus for analyzing second-generation sequencing data, wherein the apparatus is configured to perform the steps of the method for analyzing low-frequency gene mutation of second-generation sequencing data according to claim 1, comprising:

a database building unit: for taking the sequence of the detected target gene as input and establishing a blastn database;

a conversion unit: for converting the sequenced fastq file into a fasta format file;

a data acquisition unit: for aligning the sequencing fasta sequences to the target gene using blastn to obtain the position coordinates of the amplicon on each sequencing read, and extracting the molecular tag sequence information;

a classification unit: for grouping sequencing reads with the same molecular tag into the same family, and filtering out families containing fewer than 3 sequencing reads;

a corrected sequencing quality calculation unit: for performing multiple sequence alignment on all sequencing reads in each family, introducing null values, counting the A, T, C, G and null values of all sequencing reads in the family at each position, calculating the base sequencing quality Q according to formula (1), adding 33 to Q and converting the result into the corresponding character of the ASCII table, which is the corrected Phred33 sequencing quality value for each base in the fastq file;

Q = -10 log₁₀(P)        formula (1)

an analysis unit: for merging all sequencing reads in the family into one sequencing read, wherein the base with the highest count at each position is taken as the base of the consensus sequence; if the null value has the highest count at a position, the position is judged to contain a sequencing insertion error and is removed from the consensus sequence; and if several bases have the same count, the base with the highest corrected base sequencing quality is taken, thereby obtaining the consensus fastq file.

3. Use of the low frequency gene mutation analysis method of the second generation sequencing data of claim 1 or the analysis device of the second generation sequencing data of claim 2 in gene mutation detection.

4. A method of detecting low frequency genetic variation, the method comprising:

performing second-generation sequencing on a sample to be tested, analyzing the second-generation sequencing data using the low-frequency gene mutation analysis method for second-generation sequencing data of claim 1 or the analysis device for second-generation sequencing data of claim 2, performing genome alignment on the analysis results using alignment software, detecting variation using variation detection analysis software, and outputting the variation results.

5. The method of claim 4, wherein the alignment software comprises any of bwa software, bowtie2 software or blast software.

6. The method of claim 4, wherein the variation detection analysis software comprises any one of Varscan2 software, Mutect2 software, GATK software, or Freebayes software.

7. The method of claim 6, wherein the minimum base quality value is 20-25 when the variation detection analysis software detects SNVs.

8. The method of claim 6, wherein the minimum base quality value is 20-25 when the variation detection analysis software detects INDELs.

9. A computer device comprising a memory and a processor, the memory storing a computer program/instruction, wherein the computer program/instruction when executed by the processor performs the steps of the low frequency gene mutation analysis method of the second generation sequencing data of claim 1 or the steps of the method of detecting low frequency gene variation of any one of claims 4-8.

10. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to establish and/or run the steps of the low frequency gene mutation analysis method of the second generation sequencing data of claim 1 or the steps of the method of detecting low frequency gene mutation of any one of claims 4 to 8.