
CN118888001A - Pathogen detection system and method based on metagenome high-throughput sequencing - Google Patents

Pathogen detection system and method based on metagenome high-throughput sequencing

Info

Publication number
CN118888001A
Authority
CN
China
Prior art keywords
species
pathogen
sequences
database
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411370770.5A
Other languages
Chinese (zh)
Inventor
费宏
柳佳琦
未庆
李翰鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Original Assignee
Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinnuo Baishi Medical Laboratory Co ltd filed Critical Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Priority to CN202411370770.5A priority Critical patent/CN118888001A/en
Publication of CN118888001A publication Critical patent/CN118888001A/en
Pending legal-status Critical Current

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a pathogen detection system and a pathogen detection method based on metagenomic high-throughput sequencing. The detection method comprises at least one of the following steps: cleaning the public database; calculating the S_confidence and L_score of the kmer classification information and filtering on them; and identifying contamination with a basic background model and a DIC background model. The method controls the detection of false-positive species, simplifies downstream report interpretation, and helps clinicians identify the true pathogens.

Description

Pathogen detection system and method based on metagenome high-throughput sequencing
Technical Field
The invention belongs to the technical field of pathogen detection, and particularly relates to a pathogen detection system and method based on metagenome high-throughput sequencing.
Background
Infectious diseases are diseases caused by pathogenic microorganisms (bacteria, viruses, fungi, parasites, etc.) and pose a serious threat to human health. The diagnosis and treatment monitoring of infectious diseases have long depended on methods such as morphology, immunology, molecular biology, and pathogen isolation and culture; each method has its own advantages and disadvantages, and all play an important role in the auxiliary diagnosis of infectious diseases.
The recently emerged pathogen metagenomic high-throughput sequencing technology (metagenomic next-generation sequencing, mNGS) is a detection method that uses high-throughput sequencing to sequence all nucleic acids in a clinical sample and determines by bioinformatic analysis whether pathogens are present in the sample. Compared with traditional pathogen detection based on isolation and culture, it can in theory detect all kinds of microorganisms (viruses, bacteria, fungi, parasites, etc.) without bias, including pathogens that are difficult to culture and novel pathogens. mNGS is an open analysis and diagnosis system: the number of pathogens it can detect is not fixed in advance. According to incomplete statistics, institutions offering such testing services cover nearly ten thousand pathogens, including bacteria, viruses, fungi and parasites, which provides an effective technical means for diagnosing severe and rare pathogen infections and has clinical significance in specific clinical application scenarios.
mNGS-based pathogen species identification falls into several broad categories:
1. Alignment-based methods. mNGS reads are aligned to the assemblies of different species, and the presence of a particular species is determined from the alignment positions and the alignment confidence. mNGS makes no assumption about the scope of species identification, and the large yearly growth in species assembly sequences in public databases makes it difficult for ordinary computing equipment to cope with the resulting space and time complexity. Restricting the analysis to the sequences of common known pathogen species relieves the hardware and time bottlenecks, but it limits the range of detectable species and reduces the sensitivity of clinical pathogen detection, particularly for rare or novel pathogens;
2. Identification of species based on marker genes. The method first computes species-level specific single-copy genes (marker genes) and then aligns the mNGS data to the marker genes. Its advantage is a small amount of computation; its disadvantage is a high requirement on the amount of sequencing data (the marker genes must be sufficiently covered), which limits its sensitivity. In addition, the nucleic acids of some species are extremely similar, which makes marker genes hard to compute; the usual workaround is to let several closely related species share marker genes, which reduces the resolution among closely related species;
3. Kmer-based methods (a kmer is a contiguous subsequence of a nucleic acid sequence). The species classification problem is converted into an mNGS sequence classification problem, which is in turn converted into kmer matching. This reduces the amount of computation, shortens the overall analysis time, and achieves very high sensitivity, so kmer-based species detection is popular in the pathogen detection field. However, under the combined effects of poor database quality, biological variation, sequencing noise, potential contamination at different experimental steps and the like, this very high sensitivity produces a certain number of false positives. The problem has long been difficult to solve, and it complicates report interpretation and diagnostic support for physicians.
Therefore, there is a need to provide a pathogen detection system and method based on metagenomic high throughput sequencing to suppress false positives occurring in existing metagenomic high throughput sequencing pathogen detection, improving the accuracy of clinical diagnosis.
Disclosure of Invention
The invention aims to provide a pathogen detection system and a pathogen detection method based on metagenome high-throughput sequencing, which are used for effectively reducing or inhibiting pathogen detection false positive and improving the accuracy of clinical diagnosis.
In view of this, the scheme of the invention is as follows:
in a first aspect of the invention, a metagenomic high throughput sequencing-based pathogen detection system is presented, comprising:
The database module is used for constructing a pathogen species database;
The detection module is used for comparing the metagenome sequencing result with the database to obtain a pathogen species identification result;
The filtering module is used for filtering the pathogen species identification result. The filtering method comprises: constructing a species kmer database based on the pathogen species database; judging species reliability based on the distribution of the kmers of each metagenomic sequence over the corresponding species classification tree in the species kmer database; and filtering the identified pathogen species. The reliability judgment comprises: for each sequence assigned to a species, calculating S_confidence, the ratio of the number of kmers classified to the current species to the total number of kmers, and L_score, the ratio of the number of kmers assigned to the current species node and its direct-line nodes to the total number of kmers; taking sequences whose S_confidence and L_score are close to 1 as reliable sequences; and taking species with more than 2 reliable sequences as the filtered pathogen species.
Further, the reliability judgment is: taking as filtered pathogen species those identified species for which the upper quartile of S_confidence over all of their sequences is more than 80%;
And/or, taking as filtered pathogen species those identified species whose maximum sequence L_score satisfies the formula: count_k(all) is the number of kmers at the maximum sequence length assigned to the current species.
Further, the process of constructing the pathogenic species database by the pathogenic species database construction module includes: collecting and screening high-quality pathogen species assembly sequences; the high quality assembly sequence in the species is used as a core, and the assembly sequence with high similarity with the core assembly sequence is reserved.
Preferably, the intra-species similarity is assessed based on average nucleotide identity and alignment coverage.
Preferably, the pathogen species assembly sequences are screened according to assembly completeness and contamination.
Preferably, the construction process of the pathogen species database further comprises filtering out assemblies with abnormal assembly metrics, wherein the assembly metrics comprise total assembly length, contig number and GC content;
and/or, merging highly similar species among the assembly sequences;
and/or, removing mobile elements from the assembly sequences;
and/or, masking low-complexity sequences.
Further, the filtration module further comprises a step of removing internal and/or external contamination of the pathogenic species identification result.
Preferably, the internal contamination removal process is to calculate the significance of the rPM of each identified species relative to the rPM of the negative control samples, and to treat species with low significance as contaminant species and remove them;
preferably, the external contamination removal process is to normalize the number of reads of each identified species by the number of reads of the artificially added reference sequence, calculate the significance of the normalized species read count relative to the negative control samples, and treat species with low significance as contaminant species and remove them.
In a second aspect of the invention, a pathogen detection method based on metagenome high-throughput sequencing is provided, comprising the steps of constructing a pathogen species database, and comparing a metagenome sequencing result with the database to obtain a pathogen species identification result;
The detection method further comprises constructing a species kmer database based on the pathogen species database, judging species reliability based on the distribution of the kmers of each metagenomic sequence over the corresponding species classification tree in the species kmer database, and filtering the identified pathogen species. The reliability judgment comprises: for each sequence assigned to a species, calculating S_confidence, the ratio of the number of kmers classified to the current species to the total number of kmers, and L_score, the ratio of the number of kmers assigned to the current species node and its direct-line nodes to the total number of kmers; taking sequences whose S_confidence and L_score are close to 1 as reliable sequences; and taking species with more than 2 reliable sequences as the filtered pathogen species.
"Close to 1" means that the ratio, expressed as a percentage, is near 100%. A threshold such as more than 50%, 60%, 70% or 80% can be set to select reliable sequences, and species are then filtered on those sequences to obtain reliable species.
In a third aspect of the present invention, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the pathogen detection method according to the second aspect.
Compared with the prior art, the invention has the beneficial effects that:
The pathogen detection system judges species reliability based on the distribution of the kmers of each metagenomic sequence over the corresponding species classification tree in the species kmer database and filters the identified pathogen species accordingly. This controls the detection of false-positive species, simplifies downstream report interpretation, and helps clinicians identify the true pathogens;
The pathogen detection system constructs the pathogen species database through intra-species similarity screening, which eliminates assembly sequences of low credibility, and further controls the detection of false-positive species by merging highly similar species and removing mobile elements;
The pathogen detection system sets up a basic background model and a reference sequence background model, each of which uses the statistical index z-score to identify and remove potential contaminant species, further reducing false-positive detections.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for detecting pathogens based on metagenome high throughput sequencing according to the present invention;
FIG. 2 is a flow chart of a species database process according to the present invention;
FIG. 3 is a schematic diagram of an intra-species similarity screening process according to the present invention;
FIG. 4 is a schematic diagram of a mobile element rejection process according to the present invention;
FIG. 5 is a schematic representation of species filtration according to the distribution of sequence kmers over classification trees according to the present invention.
Detailed Description
The following provides definitions of some of the terms used in this specification. Unless otherwise defined, all terms used herein are intended to have the meanings commonly understood by those skilled in the art to which the present scheme pertains.
Term interpretation:
mNGS: metagenomic next-generation sequencing, which uses high-throughput sequencing to sequence all biological sequences (DNA and/or RNA) in a sample in order to analyze the species, abundance, function, etc. of the organisms in the sample. The samples for which metagenomic sequencing data are generated may be clinical samples, or samples from animals and living environments.
Kmer: a run of k consecutive bases in a DNA or RNA sequence; kmers serve as a feature representation of the sequence.
Z-score: in statistics, the Z-score, also known as the standard score, is the distance of a data point from the mean of the data set, expressed in units of the standard deviation. In other words, it measures how far a data point deviates from the average of the data set.
ANI: Average Nucleotide Identity, the average identity of nucleotides between two genomic sequences. It is obtained by aligning the whole-genome sequences of two genomes and calculating the proportion of identical bases between them.
AF: Alignment Fraction. In bioinformatics, the alignment fraction is the ratio of the number of successfully aligned bases to the total number of bases after two sequences are aligned.
NTC: No Template Control, i.e. a negative control sample. The negative control is an important control group in an experiment: it helps exclude the influence of other factors on the experimental result and thereby ensures the reliability of the result.
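To make the Kmer definition above concrete, the following minimal Python sketch enumerates the kmers of a sequence. The function name `kmers` is illustrative and does not come from the patent.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every run of k consecutive bases in sequence, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 kmers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ACGTAC"
print(kmers(read, 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

Because a sequence of length n has n - k + 1 kmers, a 50-base read would contain 16 kmers if k were 35.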
In one embodiment, a method for detecting a pathogen based on metagenome high throughput sequencing is provided, comprising the steps of comparing metagenome sequencing results with a database of pathogenic species to obtain identification results of pathogenic species, and filtering the identification results of pathogenic species. In order to reduce or inhibit false positives in the species classification process, improvements are made in terms of assembly of pathogen databases, filtration of identification results and the like, and a specific flow is shown in figure 1. The specific improvements include the following aspects:
1. Public database species assembly clean-up
The inventors found that the reliability of pathogen identification by kmer methods depends first of all on database quality. Accordingly, assembly sequences of viruses, bacteria, archaea, parasites, etc. were collected from public databases (including but not limited to RefSeq and GenBank). Bacteria and archaea account for the vast majority of the database, so database quality processing is applied to the bacteria and archaea. The operation flow is shown in FIG. 2 and is as follows:
1. Assembly quality screening
Assemblies that meet the following conditions are removed, and the remaining assemblies are retained as high-quality candidate assemblies for further processing.
(1) Assembly completeness < 60%;
(2) Assembly contamination > 5%;
(3) Assembly completeness - 5 x assembly contamination < 50;
The assembly evaluation indices are calculated by CheckM.
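The screening step can be sketched in Python as follows. This is only an illustration under two assumptions that the text does not state outright: the three listed conditions are treated as exclusion criteria (an assembly meeting any one of them is dropped), and condition (3) is read as completeness - 5 x contamination < 50. The `AssemblyQC` record and the accessions are hypothetical; in practice the two percentages would be taken from CheckM output.

```python
from dataclasses import dataclass


@dataclass
class AssemblyQC:
    accession: str        # hypothetical identifier
    completeness: float   # percent, e.g. as reported by CheckM
    contamination: float  # percent, e.g. as reported by CheckM


def is_high_quality(a: AssemblyQC) -> bool:
    """Return True if the assembly triggers none of the exclusion criteria."""
    if a.completeness < 60:                            # condition (1)
        return False
    if a.contamination > 5:                            # condition (2)
        return False
    if a.completeness - 5 * a.contamination < 50:      # condition (3), assumed reading
        return False
    return True


assemblies = [
    AssemblyQC("asm_A", 98.5, 0.8),  # passes all three checks
    AssemblyQC("asm_B", 55.0, 1.0),  # completeness too low
    AssemblyQC("asm_C", 90.0, 9.0),  # contamination too high
    AssemblyQC("asm_D", 62.0, 4.0),  # 62 - 5 * 4 = 42, below 50
]
kept = [a.accession for a in assemblies if is_high_quality(a)]
print(kept)  # ['asm_A']
```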
2. Intra-species similarity screening
The public databases contain some erroneous species information, which biases species classification and can even reduce the sensitivity of species detection.
To address this, the assembly with the highest assembly quality, or the recognized type strain or representative assembly of the species, is taken as the core, and the average nucleotide identity and alignment coverage of every other assembly of the same species relative to the core assembly are calculated, as shown in FIG. 3. In FIG. 3, the central red dot represents the core assembly of a particular species, and the other dots (black and orange) represent the other assemblies of that species. Average nucleotide identity (ANI) measures the similarity of two assemblies at the nucleic acid level (between 0 and 1; the closer to 1, the more similar). Because ANI is asymmetric, it is calculated twice for each pair of assemblies: the ANI of another assembly against the core assembly is defined as negative, and the ANI of the core assembly against the other assembly is defined as positive. The alignment fraction (AF) represents the alignment coverage between two assemblies. Similarly, for consistency, we define:
where sgn() is the usual sign function. An assembly whose v_ANI (formula 1) falls within the interval [-5, 5] is considered similar to the core assembly of the species in terms of nucleic acid similarity. An assembly whose v_AF (formula 2) falls within the interval [-10, 10] is considered close to, and complete relative to, the core assembly in terms of coverage. As shown in FIG. 3, the assemblies shown as orange dots are trusted intra-species assemblies and are retained; the assemblies shown as black dots are suspect intra-species assemblies and are removed.
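Formulas 1 and 2 are not reproduced in this text, so the following Python sketch rests on an assumed form: v = sgn(direction) x 100 x (1 - value), i.e. a signed percent divergence, with a positive sign for core-versus-other and a negative sign for other-versus-core. That form is only one reading consistent with the stated intervals [-5, 5] and [-10, 10]; the function names are hypothetical.

```python
def signed_divergence(value: float, core_is_query: bool) -> float:
    """Assumed form of v_ANI / v_AF: signed percent divergence from 1."""
    sign = 1.0 if core_is_query else -1.0
    return sign * 100.0 * (1.0 - value)


def is_trusted(ani_core_vs_other: float, ani_other_vs_core: float,
               af_core_vs_other: float, af_other_vs_core: float,
               ani_limit: float = 5.0, af_limit: float = 10.0) -> bool:
    """Keep an assembly only if both directions fall inside both intervals."""
    v_ani = [signed_divergence(ani_core_vs_other, True),
             signed_divergence(ani_other_vs_core, False)]
    v_af = [signed_divergence(af_core_vs_other, True),
            signed_divergence(af_other_vs_core, False)]
    return (all(abs(v) <= ani_limit for v in v_ani)
            and all(abs(v) <= af_limit for v in v_af))


print(is_trusted(0.98, 0.975, 0.95, 0.93))  # True: inside both intervals
print(is_trusted(0.90, 0.91, 0.95, 0.95))   # False: ANI divergence about 10
```

Under this reading, the two intervals correspond to ANI of at least 0.95 and AF of at least 0.90 in both directions.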
3. Assembly index anomaly filtering
Assemblies within a species should be similar in total assembly length and GC content.
Assemblies whose assembly length, GC content, N50, L50 or contig number is significantly abnormal compared with the other assemblies are removed. This step further reduces the risk that the assembled biological sample was contaminated or inaccurate.
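The text does not say how "significantly abnormal" is measured. As one possible illustration, the Python sketch below flags outliers with a median/MAD-based modified z-score; this is a stand-in technique chosen for the example, not the patent's stated method, and the threshold of 3.5 and the GC values are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are less
    affected by the outliers themselves than the mean and standard deviation.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: flag anything that differs from it.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical GC content (%) of six assemblies of one species.
gc_content = [50.1, 50.3, 49.9, 50.2, 50.0, 62.0]
print(mad_outliers(gc_content))  # [5]
```

The same function could be applied to each metric in turn (assembly length, N50, L50, contig number), removing an assembly flagged on any of them.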
4. Merging of highly similar species
Traditional taxonomy and genome-based species taxonomy differ, and mNGS essentially infers taxonomy at the nucleic acid level. Therefore, species whose core assemblies have v_ANI < 1 between them are merged to increase the sensitivity of species detection. (If the nucleic acids of two species are highly similar, a single mNGS sequence is assigned to the lowest common ancestor node of the two species, which reduces detection sensitivity at the species level; classic examples are Escherichia coli and Shigella.)
5. Removal of mobile elements
When submitting species assemblies, some submitters leave in sequences such as plasmids and bacterial viruses (phages), which can spread horizontally between different closely related species. These sequences confuse downstream species classification of sequences. The relevant horizontally mobile elements are therefore removed according to plasmid databases, phage databases and sequence name information, which keeps the assemblies clean;
As shown in FIG. 4, this mainly comprises two steps: the first step removes, by sequence ID, the complete mobile element sequences included in the assemblies of species such as bacteria; the second step removes, by sequence similarity, the mobile element fragments that ended up inside a species assembly through incorrect assembly or sequence integration.
6. Low-complexity sequence masking
In biological sequence alignment, low-complexity sequences carry little information and can cause erroneous alignment results. Software such as DUST (but not limited to DUST) is used to mask low-complexity sequences, which avoids false-positive results during biological sequence alignment.
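The first of the two steps (removal by sequence ID or name) can be sketched in Python as below. The keyword list, the `known_mobile_ids` set standing in for IDs drawn from plasmid and phage databases, and the FASTA records are all hypothetical. The second step, removal by sequence similarity, needs an alignment tool and is not shown.

```python
MOBILE_KEYWORDS = ("plasmid", "phage")  # assumed name-based hints


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_mobile_elements(records, known_mobile_ids=frozenset()):
    """Drop records flagged by sequence ID or by a keyword in the header."""
    kept = []
    for header, seq in records:
        seq_id = header.split()[0]
        lowered = header.lower()
        if seq_id in known_mobile_ids or any(k in lowered for k in MOBILE_KEYWORDS):
            continue
        kept.append((header, seq))
    return kept


fasta = """>seq1 Example bacterium chromosome, complete genome
ACGTACGT
ACGT
>seq2 Example bacterium plasmid pX, complete sequence
GGGCCC
>seq3 Example bacterium contig 7
TTTTAAAA
"""
kept = drop_mobile_elements(parse_fasta(fasta), known_mobile_ids={"seq3"})
print([header.split()[0] for header, _ in kept])  # ['seq1']
```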
2. Filtering of sequence kmer classification information
After the clean database is obtained, a species kmer database is constructed using kmer-based classification software (such as, but not limited to, Kraken2), and the sequences of the mNGS sample are classified. Because of biological mutation, sequencing noise and the limitations of the different kmer classification methods, the classification of any single kmer has a certain probability of being wrong. To reduce the false positives caused by such errors, the distribution of a sequence's kmers over the species classification tree is analyzed further, and two scores are used to judge the classification reliability of each individual sequence: S_confidence (species-level confidence) and L_score (lineage score). The calculation is illustrated in FIG. 5, and the specific process is as follows:
S_confidence definition: the ratio of the number of kmers of a particular sequence (a read or an assembled contig) that are classified to the current species (and its child nodes) to the total number of kmers in the sequence. The value ranges from 0 to 1; the more of the sequence's kmers belong uniquely to the current species, the more likely the sequence comes from the current species (as shown in equation 3).
Wherein count_k represents the number of kmers at the maximum sequence length assigned to the current species; the same applies below.
L_score definition: the ratio of the number of kmers of a particular sequence that are assigned to the current species node and its direct-line nodes (its direct ancestor nodes and direct descendant nodes) to the total number of kmers in the sequence. In FIG. 5, the red circle represents "species 1", the blue shading represents its direct-line nodes, and all other nodes are non-direct-line nodes. If the kmers of a particular sequence are concentrated on the direct-line nodes, this indicates that the classification process and the database construction are reliable. The value lies between 0 and 1; the closer to 1, the more reliable. If too many kmers are distributed to other, non-direct-line branches of the classification tree (e.g., the non-direct-line nodes in FIG. 5), the classification of the current sequence is unreliable (as shown in equation 4).
According to these definitions, sequences whose S_confidence and L_score are close to 1 are taken as reliable sequences, and species with more than 2 reliable sequences are taken as the filtered pathogen species.
Specifically, a reliable and efficient filter is as follows: the sequences classified by the kmer method are grouped by species, and a species is regarded as reliable when the upper quartile of S_confidence over all of its sequences exceeds the 80% threshold and the maximum L_score of the species satisfies equation 5.
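The two scores can be illustrated with a small Python sketch that follows the verbal definitions above. The toy taxonomy, node names and kmer assignments are hypothetical, the quartile interpolation method is an assumption (the text does not specify one), and equation 5 is not applied because it is not reproduced in this text.

```python
import statistics
from collections import Counter

# Toy taxonomy as child -> parent; all node names are hypothetical.
PARENT = {
    "root": None,
    "genus_A": "root",
    "species_1": "genus_A",
    "strain_1a": "species_1",
    "species_2": "genus_A",
    "genus_B": "root",
    "species_3": "genus_B",
}


def ancestors(node):
    """Direct-line ancestors of node, excluding node itself."""
    found = set()
    parent = PARENT[node]
    while parent is not None:
        found.add(parent)
        parent = PARENT[parent]
    return found


def descendants(node):
    """All nodes below node in the tree, excluding node itself."""
    return {other for other in PARENT if node in ancestors(other)}


def read_scores(kmer_taxa, species):
    """Return (S_confidence, L_score) for one read.

    kmer_taxa lists the taxon each kmer of the read was assigned to;
    unclassified kmers can be given as None and still count in the total.
    """
    total = len(kmer_taxa)
    counts = Counter(kmer_taxa)
    clade = {species} | descendants(species)   # species and its child nodes
    lineage = clade | ancestors(species)       # plus direct-line ancestors
    s_confidence = sum(counts[t] for t in clade) / total
    l_score = sum(counts[t] for t in lineage) / total
    return s_confidence, l_score


def species_summary(scores):
    """Upper quartile of S_confidence and maximum L_score over a species' reads."""
    s_values = [s for s, _ in scores]
    upper_quartile = statistics.quantiles(s_values, n=4, method="inclusive")[2]
    return upper_quartile, max(l for _, l in scores)


# One read with 16 kmers, classified to species_1.
read = ["species_1"] * 9 + ["strain_1a"] * 2 + ["genus_A"] * 2 + ["species_3"] * 3
print(read_scores(read, "species_1"))  # (0.6875, 0.8125)
```

Here 11 of the 16 kmers fall on the species or its child node and 13 fall on the direct line, so L_score is never smaller than S_confidence. `species_summary` needs at least two reads per species, since `statistics.quantiles` requires at least two data points.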
3. A background model is built to exclude internal and external contaminant species.
The species found by high-throughput sequencing may come from the clinical sample itself, from environmental contamination, from reagents and consumables, from background bacteria introduced during library construction and sequencing, and so on. Pathogen detection aims to detect the microorganisms that come from the clinical specimen; microorganisms from other sources are regarded as false-positive detections. To identify microorganisms originating from the environment, consumables, instruments and the like, several negative control samples (NTCs) are set up and a specified quantity of reference sequence (DIC) is added to all samples. A background model is built at the bioinformatics level from these samples and data, and the statistical index z-score is used to identify potential contaminant species, thereby reducing false-positive detections.
There are two background models: a basic background model and a reference sequence background model. They are described below:
1. Basic background model:
The basic background model uses no reference fragment and is intended for the general experimental workflow. First, the rPM (reads per million, i.e. the number of reads per million sequenced reads) of a specific species (sp) in the biological sample (biosample) is calculated, as shown in formula (6); the purpose of this calculation is to cancel out differences in read counts caused by differences in the amount of sequencing data. The rPM of the corresponding species is then calculated for the three negative control samples (NTCs) of the current experiment, or for the NTCs of the last three days. Assuming that the DNA amount of an experimental contaminant species is normally distributed across the NTC samples, the standard deviation (SD) is used to calculate whether the rPM(sp) of the current biological sample is significant under the NTC model. A Z-score threshold (formula 7) is set according to the characteristics of the species to define significance (for example, Z-score >= 1.96 for viruses and bacteria).
Where mean() represents the average function.
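Formulas (6) and (7) are not reproduced in this text. The Python sketch below follows the verbal description: rPM scales a species' read count to one million reads, and the z-score compares the sample rPM with the mean and standard deviation of the NTC rPM values. The choice of the sample standard deviation, the use of total sample reads as the denominator, and all read counts are assumptions made for illustration. The fallback value of 100 follows the note attached to the example tables.

```python
import statistics


def rpm(species_reads, total_reads):
    """Reads per million: species reads scaled to one million sequenced reads."""
    return species_reads / total_reads * 1_000_000


def background_z(sample_value, ntc_values, undetected_z=100.0):
    """Z-score of the sample value against the NTC distribution.

    Returns undetected_z when the species is absent from every NTC.
    """
    if not any(ntc_values):
        return undetected_z
    mean = statistics.mean(ntc_values)
    sd = statistics.stdev(ntc_values)  # sample SD; an assumption
    if sd == 0:
        raise ValueError("NTC values have zero variance; z-score undefined")
    return (sample_value - mean) / sd


# Hypothetical read counts for one species.
sample_rpm = rpm(240, 20_000_000)
ntc_rpms = [rpm(2, 1_000_000), rpm(3, 1_000_000), rpm(4, 1_000_000)]
z = background_z(sample_rpm, ntc_rpms)
print(round(z, 2), z >= 1.96)
```

With these numbers the sample rPM is about 12 against an NTC mean of about 3 and SD of about 1, so the species clears the 1.96 threshold.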
2. Reference sequence (DIC) background model:
The reference sequence is added for quantity normalization, so that exogenous contaminant species can be identified more accurately.
First, the number of reads of a specific species sp in a sample is normalized by the number of DIC reads to obtain the normalized read count of sp, denoted reads_norm(sp), as shown in formula (8). The z-score is then calculated from the normalized reads and the background sample (NC) model, as in formula (9), with the same thresholds as above.
A species whose reads in the biological sample differ significantly from either of the two background models is considered a reliable species; otherwise it is a suspected contaminant species.
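Formulas (8) and (9) are likewise not reproduced here. The Python sketch below assumes the simplest normalization consistent with the description, reads_norm(sp) = reads(sp) / reads(DIC), followed by the same kind of z-score against the negative controls. The ratio form, the sample standard deviation, and every read count are assumptions for illustration only.

```python
import statistics


def reads_norm(species_reads, dic_reads):
    """Species reads normalized by the reads of the spiked-in reference (DIC)."""
    if dic_reads <= 0:
        raise ValueError("DIC reads must be positive to normalize")
    return species_reads / dic_reads


def dic_background_z(sample_norm, nc_norms, undetected_z=100.0):
    """Z-score of the normalized sample value against the negative controls."""
    if not any(nc_norms):
        return undetected_z  # species absent from every negative control
    mean = statistics.mean(nc_norms)
    sd = statistics.stdev(nc_norms)  # sample SD; an assumption
    if sd == 0:
        raise ValueError("control values have zero variance; z-score undefined")
    return (sample_norm - mean) / sd


# Hypothetical (species reads, DIC reads) pairs.
sample_norm = reads_norm(50, 1000)
nc_norms = [reads_norm(10, 1000), reads_norm(20, 1000), reads_norm(30, 1000)]
z_dic = dic_background_z(sample_norm, nc_norms)
print(round(z_dic, 2), z_dic >= 1.96)
```

Normalizing by the DIC count makes samples comparable even when the total microbial load differs between the biological sample and the controls.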
Examples
(1) mNGS data preprocessing (high-throughput data quality control, removal of human-derived data, removal of low-complexity sequences) yields quality-controlled reads, human-derived reads, low-complexity reads and microbial reads; the statistics are as follows:
Sample                  Sample1      NTC1       NTC2       NTC3
Sample_type             Bio          NTC        NTC        NTC
Raw_reads               20000000     1578463    1941510    1499540
Raw_bases               1000000000   78923150   97075500   74977000
Clean_reads             19999782     1565835    1925977    1487543
Clean_bases             979989318    77508832   95335861   73633378
Q20(%)                  98.53        98.55      98.28      98.56
Q30(%)                  91.48        91.28      90.66      91.14
Reads_median_length     50           50         50         50
GC(%)                   42.7         43.08      42.88      42.9
Host_reads              19803696     1061636    1161364    1014504
Low_complexity_reads    2459         192        236        182
Micro_reads             193627       504007     764377     472857
Micro_rate (fraction)   0.01         0.322      0.397      0.318
(2) Kmer-based sequence species classification (Kraken2 as an example)
The number of species-unique reads was obtained using kmer-based sequence classification software (Kraken2).
Sample    Name                          UniqReads
Sample1   Mycobacterium tuberculosis    24
Sample1   Leptotrichia wadei            3
Sample1   Escherichia coli              11
Sample1   Oribacterium sinus            1
Sample1   Klebsiella pneumoniae         21
(3) S_confidence and L_score computation and filtering
Species whose reads have an unreliable kmer distribution are filtered out according to the threshold on the upper quartile (UQ) of S_confidence and the threshold on the maximum L_score.
Sample    Name                          S_confidence(UQ)   Max(L_score)   Pass
Sample1   Mycobacterium tuberculosis    0.9375             1              True
Sample1   Leptotrichia wadei            0.6875             0.8125         False
Sample1   Escherichia coli              0.9375             1              True
Sample1   Oribacterium sinus            0.5625             0.5625         False
Sample1   Klebsiella pneumoniae         0.9375             1              True
(4) Background model filtering (basic background model, DIC background model)
Sample    Name                          Basic background model Z   DIC background model Z   Pass
Sample1   Mycobacterium tuberculosis    100                        100                      True
Sample1   Escherichia coli              1.03                       1.25                     False
Sample1   Klebsiella pneumoniae         7.32                       6.91                     True
* Note: when the species is not detected in the negative control (NC) samples, the z-score is defined as 100.
(5) Final species information output
The reliably detected species remaining after the false-positive filtering of steps (3) and (4) are as follows.
Sample    Name                          UniqReads
Sample1   Mycobacterium tuberculosis    24
Sample1   Klebsiella pneumoniae         21
Although the present disclosure is disclosed above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the disclosure, and these changes and modifications will fall within the scope of the disclosure.

Claims (10)

1. A metagenome high throughput sequencing-based pathogen detection system, comprising:
The database module is used for constructing a pathogen species database;
The detection module is used for comparing the metagenome sequencing result with the database to obtain a pathogen species identification result;
The filtering module is used for filtering the identification result of the pathogenic species; the filtering method comprises the following steps: constructing a species kmer database based on the pathogenic species database, judging the reliability of the species based on the distribution condition of the metagenome sequence kmer on the corresponding species classification tree in the species kmer database, and filtering the identified pathogenic species; the reliability judging process comprises the following steps: and respectively calculating the ratio S_confidence of the number of the species sequences to the total number of the present species and the ratio L_score of the number of the species sequences to the total number of the present species nodes and the direct line nodes thereof, taking the sequences of which the S_confidence and the L_score are close to 1 as reliable sequences, and taking the species with more than 2 reliable sequences as filtered pathogenic species.
2. The pathogen detection system of claim 1, wherein the reliability judging process is: taking, as filtered pathogen species, the identified species for which the quartile value of S_confidence over all of their sequences is greater than 80%;
And/or, taking, as filtered pathogen species, the identified species whose maximum sequence L_score satisfies the formula:
Count k (all) is the number of kmers at the maximum sequence length assigned to the current species.
3. The pathogen detection system of claim 1, wherein the pathogen species database construction process includes: collecting and screening high-quality pathogen species assembly sequences; and, taking the high-quality assembly sequence within each species as a core, retaining the assembly sequences that are highly similar to the core assembly sequence.
4. The pathogen detection system of claim 3, wherein the similarity is determined based on average nucleotide identity and alignment coverage.
5. The pathogen detection system of claim 3, wherein the pathogen species assembly sequences are screened according to assembly completeness and contamination level.
6. The pathogen detection system of claim 3, wherein the process of constructing the pathogen species database further includes filtering out assemblies with abnormal assembly indicators, the indicators including total assembly length, contig number, and GC content;
and/or, merging closely related species in the assembly sequences;
and/or, removing mobile elements from the assembly sequences;
and/or, masking low-complexity sequences.
7. The pathogen detection system of claim 1, wherein the filtering module further performs a step of removing internal and/or external contamination from the pathogen species identification result.
8. The pathogen detection system of claim 7, wherein the internal contamination removal process is: calculating the significance of the rPM of each identified species relative to the rPM of the negative control sample, and treating identified species with low significance as contaminant species and removing them;
And/or, the external contamination removal process is: normalizing the number of reads of each identified species according to the number of manually added reference sequences, calculating the significance of the normalized species read number relative to the negative control sample, and treating species with low significance as contaminant species and removing them.
9. A pathogen detection method based on metagenome high-throughput sequencing, characterized by comprising: constructing a pathogen species database, and comparing a metagenome sequencing result with the database to obtain a pathogen species identification result;
The detection method further comprises: constructing a species kmer database based on the pathogen species database, judging species reliability based on the distribution of the kmers of the metagenome sequences on the corresponding species classification tree in the species kmer database, and filtering the identified pathogen species; the reliability judging process comprises: calculating, for each sequence of a species, the ratio S_confidence of the number of kmers of the sequence assigned to the current species to the total number of kmers, and the ratio L_score of the number of kmers of the sequence on the current species node and its direct lineage nodes to the total number of kmers; taking the sequences whose S_confidence and L_score are close to 1 as reliable sequences; and taking the species with more than 2 reliable sequences as filtered pathogen species.
10. A computer-readable storage medium having stored thereon a computer program, wherein a processor executes the computer program to implement the pathogen detection method of claim 9.
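The reliability judgment recited in claims 1 and 9 can be illustrated with a minimal Python sketch. Several points are assumptions made for illustration and are not stated in the claims: the taxonomy is represented as a child-to-parent map, "direct lineage nodes" is read as the ancestors and descendants of the species node, "close to 1" is replaced by hypothetical cutoffs of 0.9, and all function and parameter names are invented here.

```python
from collections import Counter


def lineage(taxid, parent):
    """Return the node itself plus all of its ancestors up to the root."""
    nodes = set()
    while taxid is not None and taxid not in nodes:
        nodes.add(taxid)
        taxid = parent.get(taxid)
    return nodes


def read_scores(kmer_hits, species, parent):
    """Compute (S_confidence, L_score) for one sequence.

    kmer_hits: Counter mapping a taxonomy node to the number of the
    sequence's kmers assigned to that node.
    """
    total = sum(kmer_hits.values())
    if total == 0:
        return 0.0, 0.0
    ancestors = lineage(species, parent)
    # Kmers on the species node, its ancestors, or its descendants.
    on_line = sum(
        n for node, n in kmer_hits.items()
        if node in ancestors or species in lineage(node, parent)
    )
    return kmer_hits.get(species, 0) / total, on_line / total


def reliable_species(reads_by_species, parent, s_cutoff=0.9, l_cutoff=0.9):
    """Keep species that have more than 2 reliable sequences."""
    kept = []
    for species, reads in reads_by_species.items():
        n_reliable = 0
        for hits in reads:
            s_conf, l_score = read_scores(hits, species, parent)
            if s_conf >= s_cutoff and l_score >= l_cutoff:
                n_reliable += 1
        if n_reliable > 2:
            kept.append(species)
    return kept


# Toy taxonomy: root -> genus -> sp -> strain, plus an unrelated node.
parent = {"root": None, "genus": "root", "sp": "genus",
          "strain": "sp", "other": "root"}
scores = read_scores(Counter({"sp": 8, "genus": 1, "other": 1}), "sp", parent)
```

In the toy call, 8 of 10 kmers fall on the species node and 9 of 10 fall on the species node or its direct lineage, so the two ratios differ only by the kmers placed on ancestor or descendant nodes.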
CN202411370770.5A 2024-09-29 2024-09-29 Pathogen detection system and method based on metagenome high-throughput sequencing Pending CN118888001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411370770.5A CN118888001A (en) 2024-09-29 2024-09-29 Pathogen detection system and method based on metagenome high-throughput sequencing

Publications (1)

Publication Number Publication Date
CN118888001A (en) 2024-11-01

Family

ID=93229952

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination