CN118888001A - Pathogen detection system and method based on metagenome high-throughput sequencing
- Publication number
- CN118888001A CN202411370770.5A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a pathogen detection system and a pathogen detection method based on metagenome high-throughput sequencing. The detection method comprises at least one of the following steps: cleaning the species assemblies of public databases; calculating the S_confidence and L_score of the kmer classification information and filtering on them; and identifying contamination with a basic background model and a DIC background model. The method controls the detection of false-positive species, simplifies downstream report interpretation, and helps clinicians identify the true pathogen.
Description
Technical Field
The invention belongs to the technical field of pathogen detection, and particularly relates to a pathogen detection system and method based on metagenome high-throughput sequencing.
Background
Infectious diseases are diseases caused by pathogenic microorganisms (bacteria, viruses, fungi, parasites, etc.), and they pose a serious threat to human health. Diagnosis and treatment monitoring of infectious diseases have long relied on morphology, immunology, molecular biology, and pathogen isolation and culture. Each of these methods has advantages and disadvantages, and all play an important role in the auxiliary diagnosis of infectious diseases.
The recently emerged pathogen metagenomic next-generation sequencing (mNGS) technology is a detection method that uses high-throughput sequencing to sequence all nucleic acids in a clinical sample and determines by bioinformatic analysis whether a pathogen is present in the sample. Compared with traditional culture-based pathogen detection, it can in theory detect a wide range of microorganisms (viruses, bacteria, fungi, parasites, etc.) without bias, including pathogens that are difficult to culture and novel pathogens. mNGS is an open analysis and diagnosis system in which the set of detectable pathogens is not fixed in advance. According to incomplete statistics, institutions offering such detection services cover nearly ten thousand pathogens, including bacteria, viruses, fungi, and parasites. This provides an effective technical means for diagnosing severe and rare pathogen infections and has clinical significance in specific clinical application scenarios.
mNGS-based pathogen species identification methods fall into several broad categories:
1. Alignment-based methods. mNGS reads are aligned to the assemblies of different species, and the presence of a particular species is determined from the alignment position and the alignment confidence. mNGS makes no assumption about the scope of species identification, and the large annual growth of species assembly sequences in public databases makes it difficult for ordinary computing equipment to cope with the resulting space and time complexity. Restricting the database to the sequences of common known pathogen species eases the hardware and time bottlenecks, but it limits the range of detectable species and reduces the sensitivity of clinical pathogen detection, particularly for rare or novel pathogens;
2. Species identification based on marker genes. Species-specific single-copy genes (marker genes) are first computed, and the mNGS data are then aligned to these marker genes. The advantage of this method is its small computational cost; the disadvantage is that it requires a large amount of sequencing data (so that the marker genes are sufficiently covered), which limits its sensitivity. In addition, the nucleic acids of some species are extremely similar, which makes computing marker genes difficult. The common workaround is for several closely related species to share marker genes, which reduces the resolution between closely related species;
3. Kmer-based methods (a kmer is a contiguous subsequence of a nucleic acid sequence). The species classification problem is converted into an mNGS sequence classification problem, which is in turn converted into kmer matching. This reduces the computational cost, improves the turnaround time of the whole analysis, and achieves very high sensitivity, so kmer-based species detection is popular in the pathogen detection field. However, under the combined effects of poor database quality, biological variation, sequencing noise, and potential contamination at different experimental steps, this very high sensitivity comes with a certain number of false positives. The problem has long been difficult to solve, and it complicates report interpretation and the support of doctors in diagnosis.
Therefore, there is a need for a pathogen detection system and method based on metagenomic high-throughput sequencing that suppresses the false positives occurring in existing mNGS pathogen detection and improves the accuracy of clinical diagnosis.
Disclosure of Invention
The invention aims to provide a pathogen detection system and a pathogen detection method based on metagenome high-throughput sequencing that effectively reduce or suppress false positives in pathogen detection and improve the accuracy of clinical diagnosis.
In view of this, the scheme of the invention is as follows:
In a first aspect of the invention, a pathogen detection system based on metagenome high-throughput sequencing is provided, comprising:
The database module is used for constructing a pathogen species database;
The detection module is used for comparing the metagenome sequencing result with the database to obtain a pathogen species identification result;
The filtering module is used for filtering the pathogenic species identification result. The filtering method comprises: constructing a species kmer database based on the pathogenic species database, judging species reliability based on the distribution of the kmers of each metagenomic sequence on the corresponding species classification tree in the species kmer database, and filtering the identified pathogenic species. The reliability judging process comprises: for each sequence of a species, calculating S_confidence, the ratio of the number of kmers of the sequence assigned to the current species to the total number of kmers of the sequence, and L_score, the ratio of the number of kmers of the sequence assigned to the current species node and its direct-lineage nodes to the total number of kmers of the sequence; taking sequences whose S_confidence and L_score are close to 1 as reliable sequences; and taking species with more than 2 reliable sequences as the filtered pathogenic species.
Further, the reliability judging process is: taking as filtered pathogenic species those identified species for which the upper quartile of S_confidence over all sequences of the species is greater than 80%;
And/or, taking as filtered pathogenic species those identified species whose maximum sequence L_score satisfies the formula: count_k(all) is the number of kmers at the maximum sequence length assigned to the current species.
Further, the process by which the database module constructs the pathogenic species database includes: collecting and screening high-quality pathogen species assembly sequences; and, taking the high-quality assembly sequence within each species as the core, retaining the assembly sequences that are highly similar to the core assembly sequence.
Preferably, the intra-species similarity is based on average nucleotide identity and alignment coverage.
Preferably, the pathogenic species assembly sequences are screened according to assembly completeness and contamination.
Preferably, the construction process of the pathogen species database further comprises filtering out assemblies with abnormal assembly metrics, wherein the assembly metrics comprise total assembly length, number of contigs, and GC content;
and/or, merging highly similar species in the assembly sequences;
and/or, removing mobile elements from the assembly sequences;
and/or masking low-complexity sequences.
Further, the filtering module also removes internal and/or external contamination from the pathogenic species identification result.
Preferably, internal contamination is removed by calculating the significance of the rPM of each identified species relative to the rPM of the negative control samples, and treating species that are not significant as contaminant species and removing them;
preferably, external contamination is removed by normalizing the number of reads of each identified species by the number of reads of the artificially added reference sequence, calculating the significance of the normalized species read count relative to the negative control samples, and treating species that are not significant as contaminant species and removing them.
In a second aspect of the invention, a pathogen detection method based on metagenome high-throughput sequencing is provided, comprising the steps of constructing a pathogen species database, and comparing a metagenome sequencing result with the database to obtain a pathogen species identification result;
The detection method further comprises: constructing a species kmer database based on the pathogenic species database, judging species reliability based on the distribution of the kmers of each metagenomic sequence on the corresponding species classification tree in the species kmer database, and filtering the identified pathogenic species. The reliability judging process comprises: for each sequence of a species, calculating S_confidence, the ratio of the number of kmers of the sequence assigned to the current species to the total number of kmers of the sequence, and L_score, the ratio of the number of kmers of the sequence assigned to the current species node and its direct-lineage nodes to the total number of kmers of the sequence; taking sequences whose S_confidence and L_score are close to 1 as reliable sequences; and taking species with more than 2 reliable sequences as the filtered pathogenic species.
Here, "close to 1" means that the ratio, expressed as a percentage, approaches 100%. Thresholds such as greater than 50%, 60%, 70%, or 80% can be set to select reliable sequences, which are then used to filter for reliable species.
In a third aspect of the present invention, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the pathogen detection method according to the second aspect.
Compared with the prior art, the invention has the beneficial effects that:
The pathogen detection system judges species reliability based on the distribution of the kmers of each metagenomic sequence on the corresponding species classification tree in the species kmer database and filters the identified pathogenic species. This controls the detection of false-positive species, simplifies downstream report interpretation, and helps clinicians judge the true pathogen;
The pathogen detection system constructs the pathogen species database through intra-species similarity screening, which eliminates assembly sequences with low credibility; merging highly similar species and removing mobile elements further controls the detection of false-positive species;
The pathogen detection system sets up a basic background model and a reference sequence background model, each of which uses the z-score statistic to identify and remove potential contaminant species, further reducing false-positive detections.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for detecting pathogens based on metagenome high throughput sequencing according to the present invention;
FIG. 2 is a flow chart of a species database process according to the present invention;
FIG. 3 is a schematic diagram of an intra-species similarity screening process according to the present invention;
FIG. 4 is a schematic diagram of a mobile element rejection process according to the present invention;
FIG. 5 is a schematic representation of species filtration according to the distribution of sequence kmers over classification trees according to the present invention.
Detailed Description
The following provides definitions of some of the terms used in this specification. Unless otherwise defined, all terms used herein are intended to have the meanings commonly understood by those skilled in the art to which the present scheme pertains.
Term interpretation:
mNGS: metagenomic next-generation sequencing, which uses high-throughput sequencing to sequence all nucleic acid sequences (DNA and/or RNA) in a sample in order to analyze the species present, their abundance, their function, and so on. Note that the samples from which the metagenomic sequencing data are derived may be clinical samples, or samples from animals or living environments.
Kmer: a subsequence of k consecutive bases in a DNA or RNA sequence; kmers serve as a feature representation of the sequence.
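As a small illustration of this definition, the sketch below lists the overlapping kmers of a short sequence. The function name `kmers` and the example sequence are illustrative choices, not part of the patent.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 kmers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

A read of 50 bases therefore yields 50 - k + 1 kmers, each of which a kmer-based classifier can assign to a node of the taxonomy tree.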
Z-score: in statistics, Z-score, also known as standard score, represents the distance of a data point from the mean of the data set in standard deviation units. In other words, it measures the degree of deviation of a data point from the average of the dataset.
ANI: Average Nucleotide Identity, the average identity of nucleotides between two genome sequences. It is obtained by aligning the whole-genome sequences of two genomes and calculating the proportion of identical bases between them.
AF: Alignment Fraction. In bioinformatics, the alignment fraction is the ratio of the number of successfully aligned bases to the total number of bases after two sequences are aligned.
NTC: No Template Control, a negative control sample. In an experiment, the negative control is an important control group that helps exclude the influence of other factors on the results, thereby ensuring their reliability.
In one embodiment, a method for detecting a pathogen based on metagenome high throughput sequencing is provided, comprising the steps of comparing metagenome sequencing results with a database of pathogenic species to obtain identification results of pathogenic species, and filtering the identification results of pathogenic species. In order to reduce or inhibit false positives in the species classification process, improvements are made in terms of assembly of pathogen databases, filtration of identification results and the like, and a specific flow is shown in figure 1. The specific improvements include the following aspects:
1. Public database species assembly clean-up
The inventors found that the reliability of kmer-based pathogen identification depends first of all on database quality. On this basis, assembly sequences for viruses, bacteria, archaea, parasites, etc. were collected from public databases (including but not limited to RefSeq and GenBank). Bacteria and archaea account for the vast majority of the database, so database quality treatment is applied to them. The workflow is shown in FIG. 2 and proceeds as follows:
1. Assembly quality screening
Assemblies that meet any of the following conditions are removed, while the remaining assemblies are retained as high-quality candidate assemblies for further processing.
(1) Assembly completeness < 60%;
(2) Assembly contamination > 5%;
(3) (Assembly completeness - 5 x assembly contamination) < 50;
The assembly evaluation metrics are calculated with CheckM.
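The screening rule can be sketched as a simple predicate. This is a minimal sketch under two assumptions: the three listed conditions are read as grounds for removing an assembly, and condition (3) is read as completeness minus five times contamination. The `Assembly` class and its field names are hypothetical; the values would come from a CheckM report.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    accession: str        # hypothetical identifier
    completeness: float   # percent, as reported by CheckM
    contamination: float  # percent, as reported by CheckM


def is_high_quality(a: Assembly) -> bool:
    """Keep an assembly only if none of the three removal conditions holds."""
    too_incomplete = a.completeness < 60
    too_contaminated = a.contamination > 5
    low_combined_score = a.completeness - 5 * a.contamination < 50
    return not (too_incomplete or too_contaminated or low_combined_score)


candidates = [Assembly("asm_A", 98.0, 1.0), Assembly("asm_B", 62.0, 3.0)]
print([a.accession for a in candidates if is_high_quality(a)])  # ['asm_A']
```

In this example `asm_B` passes conditions (1) and (2) but fails the combined score (62 - 15 = 47), which shows why the third condition is not redundant.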
2. Intra-species similarity screening
Public databases contain some erroneous species information, which biases species classification and can even reduce the sensitivity of species detection.
For this reason, within each species, the assembly of the highest quality, or the recognized type strain or representative assembly, is taken as the core, and the average nucleotide identity and alignment coverage of every other assembly of the same species relative to the core assembly are calculated, as shown in FIG. 3. In FIG. 3, the central red dot represents the core assembly of a particular species, and the other dots (black and orange) represent the other assemblies of that species. Average nucleotide identity (ANI) measures the similarity of two assemblies at the nucleic acid level (between 0 and 1; the closer to 1, the more similar). Because ANI is asymmetric, it is calculated twice for each pair of assemblies: the ANI of another assembly versus the core assembly is defined as negative, and the ANI of the core assembly versus the other assembly is defined as positive. The alignment fraction (AF) represents the alignment coverage of the two assemblies and is treated in the same way. For consistency, we define:
where sgn() is the usual sign function. An assembly whose v_ANI (formula 1) falls within the interval [-5, 5] is considered similar to the core assembly of the species in terms of nucleic acid similarity. An assembly whose v_AF (formula 2) falls within the interval [-10, 10] is considered close to the core assembly and complete in terms of coverage. As shown in FIG. 3, assemblies shown as orange dots are trusted intra-species assemblies and are retained; assemblies shown as black dots are suspect intra-species assemblies and are removed.
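The retention rule can be expressed as an interval check. The sketch below assumes that v_ANI and v_AF have already been computed by formulas 1 and 2, which are not reproduced in this text; the function and parameter names are illustrative, and requiring both intervals at once is an assumption.

```python
def is_trusted_assembly(v_ani: float, v_af: float) -> bool:
    """Return True if an assembly lies inside both retention intervals.

    v_ani and v_af are assumed to be precomputed signed values
    (formulas 1 and 2 of the description).
    """
    similar_enough = -5 <= v_ani <= 5      # nucleic acid similarity to the core
    covered_enough = -10 <= v_af <= 10     # alignment coverage against the core
    return similar_enough and covered_enough


print(is_trusted_assembly(2.1, -4.0))   # True
print(is_trusted_assembly(7.5, 3.0))    # False
```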
3. Assembly metric anomaly filtering
Assemblies within a species should be similar in total assembly length and GC content.
Assemblies whose assembly length, GC content, N50, L50, or number of contigs is significantly abnormal compared with the other assemblies of the species are removed. This step further reduces the risk that the assembled biological sample was contaminated or that the assembly is inaccurate.
4. Merging of highly similar species
Traditional taxonomy and genome-based species taxonomy differ, and mNGS essentially infers taxonomy at the nucleic acid level. We therefore merge species whose core assemblies have v_ANI < 1 between them, which increases the sensitivity of species detection. (If the nucleic acids of two species are highly similar, a single mNGS sequence is assigned to the lowest common ancestor node of the two species, which reduces detection sensitivity at the species level; classic examples are Escherichia coli and Shigella.)
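One way to carry out such a merge is a union-find pass over pairwise core-assembly comparisons. This is a sketch, not the patent's procedure: the pairwise v_ANI values below are hypothetical, and comparing the absolute value against the threshold is an assumption made because v_ANI is described as signed.

```python
def merge_close_species(species, pairwise_v_ani, threshold=1.0):
    """Group species whose core assemblies differ by |v_ANI| < threshold."""
    parent = {s: s for s in species}

    def find(s):
        # Follow parent links to the group representative, halving paths.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), v_ani in pairwise_v_ani.items():
        if abs(v_ani) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in species:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


species = ["Escherichia coli", "Shigella flexneri", "Klebsiella pneumoniae"]
pairs = {  # hypothetical v_ANI values between core assemblies
    ("Escherichia coli", "Shigella flexneri"): 0.6,
    ("Escherichia coli", "Klebsiella pneumoniae"): 4.2,
}
print(merge_close_species(species, pairs))
# [['Escherichia coli', 'Shigella flexneri'], ['Klebsiella pneumoniae']]
```

Union-find makes the merge transitive: if A is close to B and B is close to C, all three end up in one group even if A and C were never compared directly.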
5. Removal of mobile elements
During species assembly, some submitters leave in the assembly sequences such as plasmids and bacterial viruses (phages) that can spread horizontally between closely related species, which confuses the downstream species classification of sequences. The relevant horizontally mobile elements are removed according to plasmid databases, phage databases, and sequence name information, ensuring that the assemblies are clean;
As shown in FIG. 4, this mainly comprises two steps: the first step removes the complete mobile element sequences in assemblies of species such as bacteria according to sequence ID; the second step removes, according to sequence similarity, mobile element sequence fragments that ended up inside a species assembly through incorrect assembly or sequence integration.
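The first of the two steps can be sketched as a filter over assembly records. The record layout, the keyword list, and the example IDs are hypothetical; the second step (similarity search against plasmid and phage databases) needs an aligner and is not covered here.

```python
MOBILE_KEYWORDS = ("plasmid", "phage")  # assumed name hints, not from the patent


def drop_mobile_records(records, mobile_ids):
    """Remove whole sequences flagged as mobile elements by ID or by name.

    records: iterable of (sequence_id, description, sequence) tuples.
    mobile_ids: set of sequence IDs known from plasmid or phage databases.
    """
    kept = []
    for seq_id, description, sequence in records:
        named_mobile = any(word in description.lower() for word in MOBILE_KEYWORDS)
        if seq_id in mobile_ids or named_mobile:
            continue  # complete mobile element sequence: discard
        kept.append((seq_id, description, sequence))
    return kept


records = [
    ("seq_1", "Escherichia coli chromosome", "ACGT"),
    ("seq_2", "Escherichia coli plasmid pA", "GGCC"),
    ("seq_3", "Escherichia coli contig 3", "TTAA"),
]
print([r[0] for r in drop_mobile_records(records, mobile_ids={"seq_3"})])  # ['seq_1']
```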
6. Low-complexity sequence masking
In biological sequence alignment, low-complexity sequences carry little information and can cause erroneous alignment results. Software such as DUST (but not limited to DUST) is used to mask low-complexity sequences, thereby avoiding false-positive results during sequence alignment.
2. Filtering of sequence kmer classification information
After the clean database is obtained, a species kmer database is constructed using kmer-based classification software (such as, but not limited to, Kraken 2), and the sequences of the mNGS sample are classified. Because of biological variation, sequencing noise, and the limitations of different kmer classification methods, the classification of any single kmer has some probability of being wrong. To reduce false positives caused by such errors, the distribution of a sequence's kmers over the species classification tree is further analyzed, and two scores are used to judge the classification reliability of each individual sequence: S_confidence (species-level confidence) and L_score (lineage score). The calculation is illustrated in FIG. 5, and the specific process is as follows:
S_confidence definition: among all kmers of a particular sequence (a read or an assembled contig), the ratio of the number of kmers classified to the current species (and its child nodes) to the total number of kmers in the sequence. It ranges from 0 to 1; the more of the sequence's kmers are unique to the current species, the more likely the sequence comes from that species (as shown in formula 3).
where count_k represents the number of kmers at the maximum sequence length assigned to the current species, and likewise below.
L_score definition: the ratio of the number of kmers of a particular sequence assigned to the current species node and its direct-lineage nodes (including direct ancestor nodes and direct descendant nodes) to the total number of kmers. In FIG. 5, the red circle represents "species 1", the blue shading marks its direct-lineage nodes, and the remaining nodes are not on its lineage. If the kmers of a sequence are concentrated on the direct-lineage nodes, this indicates that the classification process and the database construction are reliable. The value lies between 0 and 1; the closer to 1, the more reliable. If too many kmers are assigned to other, non-lineage branches of the classification tree (the non-lineage nodes in FIG. 5), the classification of the current sequence is unreliable (as shown in formula 4).
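Formulas 3 and 4 are not reproduced in this text. Written out from the verbal definitions alone, the two ratios might take the following form; the argument labels are assumed notation rather than the patent's own.

```latex
\[
S_{\mathrm{confidence}}
  = \frac{\mathrm{count}_k(\text{species and its child nodes})}
         {\mathrm{count}_k(\text{all})},
\qquad
L_{\mathrm{score}}
  = \frac{\mathrm{count}_k(\text{species, its ancestors and its descendants})}
         {\mathrm{count}_k(\text{all})}
\]
```

For a sequence with 21 kmers, of which 15 are assigned to the species, 3 to its genus, and 3 to an unrelated branch, this reading gives S_confidence = 15/21 ≈ 0.71 and L_score = 18/21 ≈ 0.86. Since the numerator of L_score contains that of S_confidence, L_score is never smaller than S_confidence.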
According to these definitions, sequences whose S_confidence and L_score are close to 1 are taken as reliable sequences, and species with more than 2 reliable sequences are taken as the filtered pathogenic species.
Specifically, a reliable and efficient filter is as follows: sequences classified by the kmer method are grouped by species; an identified species is regarded as reliable if the upper quartile of S_confidence over all its sequences is greater than 80% and its maximum L_score satisfies formula 5.
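The two scores and the quartile test can be sketched in a few lines. This is a minimal sketch under stated assumptions: the taxonomy is a hypothetical child-to-parent dictionary, each kmer of a sequence already carries a taxon label, unclassified kmers count toward the total, and the upper quartile is taken with the "inclusive" method of `statistics.quantiles`. Formula 5 is not reproduced in this text, so the L_score condition is left out.

```python
from statistics import quantiles


def ancestors(node, parent):
    """Return the set of strict ancestors of node in a child -> parent map."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def read_scores(kmer_taxa, species, parent):
    """Return (S_confidence, L_score) of one sequence for a given species."""
    total = len(kmer_taxa)
    if total == 0:
        raise ValueError("sequence has no kmers")
    lineage_up = ancestors(species, parent)
    # kmers on the species node itself or on any of its descendants
    in_species = sum(1 for t in kmer_taxa
                     if t == species or species in ancestors(t, parent))
    # kmers on the direct ancestors of the species
    on_ancestors = sum(1 for t in kmer_taxa if t in lineage_up)
    return in_species / total, (in_species + on_ancestors) / total


def upper_quartile(values):
    """Upper quartile (Q3); a single value is its own quartile."""
    if len(values) == 1:
        return values[0]
    return quantiles(values, n=4, method="inclusive")[2]


def species_passes(s_confidences, threshold=0.8):
    """Apply the S_confidence upper-quartile test to one species."""
    return upper_quartile(s_confidences) > threshold


# Hypothetical toy taxonomy and one 21-kmer sequence.
parent = {"strain1": "sp1", "sp1": "genusA", "sp2": "genusA",
          "genusA": "root", "sp3": "genusB", "genusB": "root"}
read = ["sp1"] * 15 + ["genusA"] * 3 + ["sp3"] * 3
s_conf, l_score = read_scores(read, "sp1", parent)
print(round(s_conf, 2), round(l_score, 2))  # 0.71 0.86
```

Grouping the per-sequence S_confidence values by species and passing each group to `species_passes` then yields the candidate reliable species.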
3. A background model is built to exclude internal and external contaminant species.
The species obtained by high-throughput sequencing may be from clinical samples per se, environmental pollution, reagent consumables, library-building sequencing background bacteria and the like. Pathogen detection aims at detecting microorganisms from clinical specimens, while microorganisms of other origin are regarded as detecting false positives. In order to identify microorganisms from environmental, consumable, instrument and the like sources, a plurality of Negative Control Samples (NTCs) are arranged, a specified quantity of reference Sequences (DICs) are added to all samples, a background model is established on a biological information layer according to the samples and data, and a statistical index z-score is used for identifying potential pollution species, so that false positives are detected in a reduced manner.
Background models are divided into two: a base background model, a reference sequence background model. The specific description is as follows:
1. Basic background model:
The basic background model does not use reference fragments and is intended for the general experimental workflow. First, the rPM (reads per million, i.e. the number of reads per million sequenced reads) of a specific species (sp) in the biological sample (biosample) is calculated as shown in formula (6); this calculation offsets differences in read counts caused by differences in the amount of sequencing data. The rPM of the same species is then calculated in the three negative control samples (NTCs) of the current experiment, or in the NTCs from the most recent three days. Assuming that the amount of DNA from an experimental contaminant species is normally distributed across the NTC samples, the standard deviation (SD) is used to calculate whether rPM(sp) of the current biological sample is significant under the NTC model. A Z-score threshold (Z-score as in formula (7)) is set according to the characteristics of the species to define significance (for example, Z-score ≥ 1.96 for viruses and bacteria).
Here mean() denotes the average function.
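Formulas (6) and (7) are not reproduced in this text, so the following Python sketch shows one plausible reading of the description rather than the patent's exact formulas. It assumes that rPM is the species read count divided by the sample's total read count and scaled to one million, that the Z-score is the biological sample's rPM minus the NTC mean, divided by the NTC standard deviation, and that the sample standard deviation (n - 1) is meant. The function names are hypothetical.

```python
from statistics import mean, stdev

Z_THRESHOLD = 1.96  # example threshold given in the text for viruses and bacteria


def rpm(species_reads: int, total_reads: int) -> float:
    """Reads per million: species reads scaled to one million sequenced reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return species_reads / total_reads * 1_000_000


def z_score(sample_rpm: float, ntc_rpms: list[float]) -> float:
    """Z-score of the biological sample's rPM under the NTC background.

    Assumption: sample standard deviation (n - 1) over the NTC values.
    """
    if len(ntc_rpms) < 2:
        raise ValueError("at least two NTC values are needed")
    if not any(ntc_rpms):
        # The example section defines the z-score as 100 when the species
        # is not detected in the negative controls.
        return 100.0
    sd = stdev(ntc_rpms)
    if sd == 0:
        raise ValueError("NTC values are identical; the Z-score is undefined")
    return (sample_rpm - mean(ntc_rpms)) / sd


def is_significant(z: float, threshold: float = Z_THRESHOLD) -> bool:
    """A species is treated as significant when Z-score >= threshold."""
    return z >= threshold


# Hypothetical rPM values, chosen only to make the arithmetic easy to follow:
# NTC mean is 2.0 and NTC standard deviation is 1.0.
z = z_score(10.0, [1.0, 2.0, 3.0])
print(z, is_significant(z))
```

Whether the denominator of rPM should be raw reads or quality-controlled reads is not stated in the text; the sketch leaves that choice to the caller.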
2. Reference sequence (DIC) background model:
The reference sequence is added for quality normalization so that exogenous contaminant species can be identified more accurately.
First, the number of reads of a specific species sp in a sample is normalized by the number of DIC reads, giving the normalized read count of sp, denoted reads_norm(sp), as shown in formula (8). The z-score is then calculated from the normalized reads and the background sample (NC) model, as in formula (9), using the same thresholds as above.
A species whose reads in the biological sample differ significantly from either of the two background models is considered a reliable species; otherwise it is considered a suspected contaminant species.
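Formulas (8) and (9) are likewise not reproduced here. As a rough sketch only, the Python below assumes that reads_norm(sp) is the species read count divided by the DIC read count of the same sample, and that the z-score is computed from the normalized values in the same way as for rPM. All numbers are hypothetical and the function names are illustrative, not taken from the patent.

```python
from statistics import mean, stdev

Z_THRESHOLD = 1.96  # same threshold as for the basic background model


def reads_norm(species_reads: int, dic_reads: int) -> float:
    """Species reads normalized by the DIC reads of the same sample (assumed form)."""
    if dic_reads <= 0:
        raise ValueError("dic_reads must be positive")
    return species_reads / dic_reads


def z_against_background(value: float, background: list[float]) -> float:
    """Z-score of a value against background (NC) values, using sample SD."""
    if not any(background):
        return 100.0  # species absent from the negative controls
    sd = stdev(background)
    if sd == 0:
        raise ValueError("background values are identical; z-score undefined")
    return (value - mean(background)) / sd


def is_reliable(z_basic: float, z_dic: float, threshold: float = Z_THRESHOLD) -> bool:
    """Reliable if significant under either background model, as the text states."""
    return z_basic >= threshold or z_dic >= threshold


# Hypothetical counts: 21 species reads and 1000 DIC reads in the sample,
# and three negative controls with small normalized values.
sample_norm = reads_norm(21, 1000)
nc_norm = [reads_norm(1, 1000), reads_norm(2, 1000), reads_norm(3, 1000)]
z_dic = z_against_background(sample_norm, nc_norm)
print(round(z_dic, 2), is_reliable(z_basic=0.5, z_dic=z_dic))
```

Because the decision uses "or", a species that is significant under only one model is still kept; a stricter pipeline could require both, but that would go beyond what the description says.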
Examples
(1) mNGS data preprocessing (quality control of the high-throughput data, removal of human-derived data, removal of low-complexity sequences) yields the quality-controlled reads, human-derived reads, low-complexity reads and microbial reads. The statistics are as follows:
Sample | Sample1 | NTC1 | NTC2 | NTC3 |
Sample_type | Bio | NTC | NTC | NTC |
Raw_reads | 20000000 | 1578463 | 1941510 | 1499540 |
Raw_bases | 1000000000 | 78923150 | 97075500 | 74977000 |
Clean_reads | 19999782 | 1565835 | 1925977 | 1487543 |
Clean_bases | 979989318 | 77508832 | 95335861 | 73633378 |
Q20(%) | 98.53 | 98.55 | 98.28 | 98.56 |
Q30(%) | 91.48 | 91.28 | 90.66 | 91.14 |
Reads_median_length | 50 | 50 | 50 | 50 |
GC(%) | 42.7 | 43.08 | 42.88 | 42.9 |
Host_reads | 19803696 | 1061636 | 1161364 | 1014504 |
Low_complexity_reads | 2459 | 192 | 236 | 182 |
Micro_reads | 193627 | 504007 | 764377 | 472857 |
Micro_rate (fraction of Clean_reads) | 0.01 | 0.322 | 0.397 | 0.318 |
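As a small consistency check on the table above, the following Python sketch recomputes two things from the listed counts: that host, low-complexity and microbial reads add up to the clean reads, and that the Micro_rate values match Micro_reads divided by Clean_reads expressed as a fraction. The dictionary layout and function names are hypothetical; the numbers are copied from the table.

```python
# Per-sample counts copied from the statistics table.
stats = {
    "Sample1": {"clean": 19999782, "host": 19803696, "low_complexity": 2459, "micro": 193627},
    "NTC1": {"clean": 1565835, "host": 1061636, "low_complexity": 192, "micro": 504007},
    "NTC2": {"clean": 1925977, "host": 1161364, "low_complexity": 236, "micro": 764377},
    "NTC3": {"clean": 1487543, "host": 1014504, "low_complexity": 182, "micro": 472857},
}


def is_partition(row: dict) -> bool:
    """True if the three read categories add up exactly to the clean reads."""
    return row["host"] + row["low_complexity"] + row["micro"] == row["clean"]


def micro_fraction(row: dict) -> float:
    """Microbial reads as a fraction of clean reads."""
    return row["micro"] / row["clean"]


for name, row in stats.items():
    print(name, is_partition(row), round(micro_fraction(row), 3))
```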
(2) kmer-based species classification of sequences (Kraken 2 as an example)
The number of species-unique reads was obtained using kmer-based sequence classification software (Kraken 2).
Sample | name | UniqReads |
Sample1 | Mycobacterium tuberculosis | 24 |
Sample1 | Leptotrichia wadei | 3 |
Sample1 | Escherichia coli | 11 |
Sample1 | Oribacterium sinus | 1 |
Sample1 | Klebsiella pneumoniae | 21 |
(3) S_confidence and L_score computation and filtering
Reads with an unreliable kmer distribution are filtered according to the quartile threshold on S_confidence and the threshold on the maximum L_score.
Sample | name | S_confidence(UQ) | Max(L_score) | Pass |
Sample1 | Mycobacterium tuberculosis | 0.9375 | 1 | True |
Sample1 | Leptotrichia wadei | 0.6875 | 0.8125 | False |
Sample1 | Escherichia coli | 0.9375 | 1 | True |
Sample1 | Oribacterium sinus | 0.5625 | 0.5625 | False |
Sample1 | Klebsiella pneumoniae | 0.9375 | 1 | True |
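The per-read scores behind this table can be sketched as follows. This is an illustrative reading of the claims, not the patent's implementation: S_confidence is taken as the share of a read's kmers assigned to the species itself, and L_score as the share assigned to the species node or its direct-lineage nodes. The example read assumes 16 kmers, which is what a 50 bp read yields if k = 35 (an assumption). Only the S_confidence threshold of 80% from claim 2 is applied, because the L_score formula is not reproduced in this text; `l_min` is therefore an optional, hypothetical parameter.

```python
from typing import Optional


def read_scores(kmer_taxa: list[str], species: str, lineage: set[str]) -> tuple[float, float]:
    """Return (S_confidence, L_score) for one read.

    kmer_taxa: taxon assigned to each kmer of the read.
    lineage:   direct-lineage nodes of the species (hypothetical input).
    """
    total = len(kmer_taxa)
    if total == 0:
        raise ValueError("read has no kmers")
    on_species = sum(t == species for t in kmer_taxa)
    on_lineage = sum(t == species or t in lineage for t in kmer_taxa)
    return on_species / total, on_lineage / total


def passes_filter(s_conf_uq: float, max_l_score: float,
                  s_min: float = 0.8, l_min: Optional[float] = None) -> bool:
    """Species-level filter on the S_confidence quartile and the maximum L_score."""
    if s_conf_uq <= s_min:
        return False
    return l_min is None or max_l_score >= l_min


# One hypothetical read: 15 kmers on the species, 1 on its genus.
kmers = ["Mycobacterium tuberculosis"] * 15 + ["Mycobacterium"]
print(read_scores(kmers, "Mycobacterium tuberculosis", {"Mycobacterium"}))

# Species-level values copied from the table: (S_confidence(UQ), Max(L_score)).
table = {
    "Mycobacterium tuberculosis": (0.9375, 1.0),
    "Leptotrichia wadei": (0.6875, 0.8125),
    "Escherichia coli": (0.9375, 1.0),
    "Oribacterium sinus": (0.5625, 0.5625),
    "Klebsiella pneumoniae": (0.9375, 1.0),
}
print({name: passes_filter(s, l) for name, (s, l) in table.items()})
```

With the 80% S_confidence threshold alone, the sketch reproduces the Pass column of the table.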
(4) Background model filtering (basic background model, DIC background model)
Sample | name | Basic background model Z | DIC background model Z | Pass |
Sample1 | Mycobacterium tuberculosis | 100 | 100 | True |
Sample1 | Escherichia coli | 1.03 | 1.25 | False |
Sample1 | Klebsiella pneumoniae | 7.32 | 6.91 | True |
* Note: when the species is not detected in the NC samples, the z-score is defined as 100.
(5) Final species information output
The reliably detected species that remain after the false-positive filtering of steps (3) and (4) are output.
Sample | name | UniqReads |
Sample1 | Mycobacterium tuberculosis | 24 |
Sample1 | Klebsiella pneumoniae | 21 |
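The way steps (3) and (4) combine into this final list can be sketched in a few lines of Python. The values are copied from the tables of steps (2) to (4); the data layout and the function name are hypothetical, and the background decision follows the description's wording that significance under either model is sufficient.

```python
Z_THRESHOLD = 1.96  # threshold stated in the description for viruses and bacteria

# Step (2): species-unique read counts.
uniq_reads = {
    "Mycobacterium tuberculosis": 24,
    "Leptotrichia wadei": 3,
    "Escherichia coli": 11,
    "Oribacterium sinus": 1,
    "Klebsiella pneumoniae": 21,
}

# Step (3): Pass column of the S_confidence / L_score filter.
kmer_pass = {
    "Mycobacterium tuberculosis": True,
    "Leptotrichia wadei": False,
    "Escherichia coli": True,
    "Oribacterium sinus": False,
    "Klebsiella pneumoniae": True,
}

# Step (4): (basic background model Z, DIC background model Z).
background_z = {
    "Mycobacterium tuberculosis": (100.0, 100.0),
    "Escherichia coli": (1.03, 1.25),
    "Klebsiella pneumoniae": (7.32, 6.91),
}


def final_species(threshold: float = Z_THRESHOLD) -> list[tuple[str, int]]:
    """Species that pass the kmer filter and are significant under either model."""
    kept = []
    for name, reads in uniq_reads.items():
        if not kmer_pass.get(name, False):
            continue  # removed in step (3)
        z_basic, z_dic = background_z[name]
        if z_basic >= threshold or z_dic >= threshold:
            kept.append((name, reads))
    return kept


print(final_species())
```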
Although the present disclosure has been described above, its scope of protection is not limited thereto. Those skilled in the art may make various changes and modifications without departing from the spirit and scope of the disclosure, and such changes and modifications fall within the scope of protection of the disclosure.
Claims (10)
1. A metagenome high throughput sequencing-based pathogen detection system, comprising:
a database module for constructing a pathogen species database;
a detection module for comparing the metagenome sequencing result with the database to obtain a pathogen species identification result; and
a filtering module for filtering the pathogen species identification result, wherein the filtering method comprises: constructing a species kmer database based on the pathogen species database, judging species reliability based on the distribution of the kmers of the metagenome sequences over the corresponding species classification tree in the species kmer database, and filtering the identified pathogenic species; and wherein the reliability judgment comprises: calculating, for each species sequence, the ratio S_confidence of the number of its kmers on the current species to its total number of kmers, and the ratio L_score of the number of its kmers on the current species node and its direct-lineage nodes to its total number of kmers; taking sequences whose S_confidence and L_score are close to 1 as reliable sequences; and taking species with more than 2 reliable sequences as the filtered pathogenic species.
2. The pathogen detection system of claim 1, wherein the reliability judgment comprises: taking, as filtered pathogenic species, those identified species for which the quartile value of S_confidence over all of their sequences is greater than 80%;
and/or, taking, as filtered pathogenic species, those identified species whose maximum sequence L_score satisfies the formula:
where Count_k(all) is the number of kmers in the longest sequence assigned to the current species.
3. The pathogen detection system of claim 1, wherein the construction of the pathogen species database comprises: collecting and screening high-quality pathogen species assembly sequences; and, taking the high-quality assembly sequence within a species as the core, retaining the assembly sequences that are highly similar to the core assembly sequence.
4. The pathogen detection system of claim 3, wherein the similarity is determined from average nucleotide identity and alignment coverage.
5. The pathogen detection system of claim 3, wherein the pathogen species assembly sequences are screened according to assembly completeness and contamination level.
6. The pathogen detection system of claim 3, wherein the construction of the pathogen species database further comprises filtering out assemblies with abnormal assembly metrics, the metrics including total assembly length, number of contigs and GC content;
and/or, merging closely related species in the assembly sequences;
and/or, removing mobile elements from the assembly sequences;
and/or, masking low-complexity sequences.
7. The pathogen detection system of claim 1, wherein the filtering module further performs a step of removing internal and/or external contamination from the pathogen species identification result.
8. The pathogen detection system of claim 7, wherein the internal contamination removal comprises calculating the significance of the rPM of an identified species relative to its rPM in the negative control samples, and treating a species with low significance as a contaminant species and removing it;
and/or, the external contamination removal comprises normalizing the number of reads of an identified species by the number of artificially added reference sequences, calculating the significance of the normalized species read count relative to the negative control samples, and treating a species with low significance as a contaminant species and removing it.
9. A pathogen detection method based on metagenome high-throughput sequencing, characterized by comprising: constructing a pathogen species database, and comparing a metagenome sequencing result with the database to obtain a pathogen species identification result;
the detection method further comprising: constructing a species kmer database based on the pathogen species database, judging species reliability based on the distribution of the kmers of the metagenome sequences over the corresponding species classification tree in the species kmer database, and filtering the identified pathogenic species; wherein the reliability judgment comprises: calculating, for each species sequence, the ratio S_confidence of the number of its kmers on the current species to its total number of kmers, and the ratio L_score of the number of its kmers on the current species node and its direct-lineage nodes to its total number of kmers; taking sequences whose S_confidence and L_score are close to 1 as reliable sequences; and taking species with more than 2 reliable sequences as the filtered pathogenic species.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the pathogen detection method of claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411370770.5A CN118888001A (en) | 2024-09-29 | 2024-09-29 | Pathogen detection system and method based on metagenome high-throughput sequencing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118888001A (en) | 2024-11-01 |
Family
ID=93229952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411370770.5A Pending CN118888001A (en) | 2024-09-29 | 2024-09-29 | Pathogen detection system and method based on metagenome high-throughput sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118888001A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination |