CN118335203B

CN118335203B - Coronavirus recombination detection method, system, equipment and medium for large-scale genome data

Info

Publication number: CN118335203B
Application number: CN202410763582.2A
Authority: CN
Inventors: 王辛; 胡明达; 任洪广; 王博千; 赵云祥; 柴子力; 靳远; 岳俊杰
Original assignee: Academy of Military Medical Sciences AMMS of PLA
Current assignee: Academy of Military Medical Sciences AMMS of PLA
Priority date: 2024-06-14
Filing date: 2024-06-14
Publication date: 2024-08-20
Anticipated expiration: 2044-06-14
Also published as: CN118335203A

Abstract

The invention relates to the technical field of bioinformatics and discloses a coronavirus recombination detection method, system, device and medium for large-scale genome data. The invention applies data mining methods in combination with bioinformatics tools, and builds a coronavirus genome sequence fragment library and query libraries by processing coronavirus genome sequences. For the sequence of an unknown coronavirus, the genome sources and the presence of recombination events are determined through mechanisms such as fragment matching and iterative voting. Through query and alignment, the parent sequences corresponding to the different genome sources and their corresponding recombinant fragments are determined. While retaining the rich information in large-scale coronavirus genome data, the invention makes effective use of the library data, completes recombination detection and analysis of the coronavirus sequence to be detected efficiently and accurately, realizes recombination event detection for large-scale coronavirus genome data, and provides important data support for subsequent research and application.

Description

Coronavirus recombination detection method, system, equipment and medium for large-scale genomic data

Technical Field

The present invention relates to the technical field of bioinformatics, and in particular to a coronavirus recombination detection method, system, device and medium for large-scale genome data.

Background Art

Coronaviruses are positive-sense single-stranded RNA viruses that mainly infect the human respiratory tract, causing symptoms such as fever, cough, sore throat and headache. The SARS virus, the MERS virus and the novel coronavirus (SARS-CoV-2) are all coronaviruses. Genetic recombination occurs during the genetic evolution of pathogens. In a host organism infected with several viruses at the same time, viral replication produces a series of genome fragments; when a genome fragment of one virus is inserted into, or replaces part of, the genome sequence of another virus, recombination occurs between closely related genes of different viruses, which in turn causes substantial changes in the viral genome sequence. A major feature of coronavirus genetic evolution is that frequent and complex recombination events occur among the viruses within the group. For example, the dominant MERS virus lineage, produced by genetic recombination of camel coronaviruses, is closely related to the 2015 MERS outbreak in South Korea. In addition, the outbreaks of many coronaviruses, including the SARS virus, are closely related to recombination-driven evolution. Given this impact, studying the evolutionary patterns of coronavirus recombination is of great significance for pathogen surveillance and for epidemic prevention and control.

Existing mainstream recombination detection techniques generally use models to detect recombination from the results of phylogenetic analysis of pathogenic microorganisms, and they have considerable limitations in the coronavirus field. First, traditional recombination analysis methods are designed only for small-scale genetic sequence data. With the rapid development of next-generation sequencing, they struggle to support genetic analysis of rapidly growing large-scale genome data, and in particular cannot make full use of the rich information contained in massive coronavirus genome collections, exemplified by the novel coronavirus with sequence data on the order of tens of millions. Second, traditional methods examine only the recombination events that may exist among the genome sequences supplied as input, and cannot detect recombination events of an input sequence relative to the coronavirus background as a whole. Third, traditional methods require the input sequences to be aligned beforehand, and most of them need phylogenetic analysis during detection, which is also difficult to achieve on large-scale coronavirus genome data.

Therefore, there is an urgent need for a coronavirus recombination detection method for large-scale genome data that detects coronavirus recombination efficiently and accurately.

Summary of the Invention

The present invention provides a coronavirus recombination detection method, system, device and medium for large-scale genome data, to overcome the defect that traditional recombination analysis methods can hardly support recombination detection on massive coronavirus genome data.

The present invention provides a method for constructing a multi-dimensional recombination detection corpus, comprising:

obtaining coronavirus genome data, wherein the coronavirus genome data include a plurality of genome sequences;

performing quality screening on the plurality of genome sequences according to the illegal character rate of each genome sequence, to obtain a plurality of high-quality genome sequences;

clustering the plurality of high-quality genome sequences to obtain a plurality of genome sequence clusters, wherein each genome sequence cluster is annotated with a classification label, each high-quality genome sequence in each cluster is annotated with a biological category label, and the biological category label includes a main label and a group number;

screening the plurality of high-quality genome sequences according to the sequence length of each high-quality genome sequence, and screening the plurality of genome sequence clusters according to the number of high-quality genome sequences in each cluster that meet the sequence length requirement, to obtain a plurality of genome sequence clusters that are candidates for library construction and the plurality of high-quality genome sequences they contain;

cutting each candidate high-quality genome sequence into sequence fragments according to each of a plurality of preset segmentation lengths, to obtain a plurality of base subsequences for each preset segmentation length; constructing a coronavirus genome sequence fragment library from the base subsequences of each preset segmentation length; and annotating each base subsequence with a source label according to the biological category label of the candidate high-quality genome sequence from which it was cut, wherein the plurality of preset segmentation lengths include a first preset segmentation length and a second preset segmentation length;

grouping the plurality of high-quality genome sequences according to the main label of each high-quality genome sequence to obtain a plurality of main-label groups, and building a separate library for each main-label group to form a primary query library, wherein each main-label group includes the high-quality genome sequences annotated with the same main label;

building a library from the candidate high-quality genome sequences to obtain a secondary query library; and

building a library from the high-quality genome sequences obtained after quality screening and clustering to obtain a tertiary query library.
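By way of illustration only, the fragment-library step of the above construction method might be sketched in Python as follows. The sketch assumes non-overlapping fragments, an in-memory dictionary as the library, and a label format such as `A_1` (main label plus group number); none of these details is specified in the text, and the function names are hypothetical.

```python
from collections import defaultdict


def slice_fragments(sequence, length, step=None):
    """Cut a sequence into fixed-length subsequences.

    Assumption: fragments do not overlap unless a smaller step is given.
    """
    step = step or length
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]


def build_fragment_libraries(genomes, lengths):
    """Build one fragment library per preset segmentation length.

    genomes: dict mapping sequence id -> (biological_label, sequence)
    returns: {length: {fragment: set of source labels}}
    """
    libraries = {}
    for length in lengths:
        library = defaultdict(set)
        for label, sequence in genomes.values():
            for fragment in slice_fragments(sequence, length):
                # The source label of a fragment is the biological category
                # label of the genome sequence it was cut from.
                library[fragment].add(label)
        libraries[length] = dict(library)
    return libraries
```

In this sketch a fragment that occurs in several genomes simply carries several source labels; how such fragments are treated in practice is left open by the text.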

In one embodiment, performing quality screening on the plurality of genome sequences according to the illegal character rate of each genome sequence, to obtain a plurality of high-quality genome sequences, includes:

obtaining the illegal character rate of each genome sequence from the number of nucleotide characters and the number of non-nucleotide characters in that genome sequence; and

determining each genome sequence whose illegal character rate is less than or equal to a preset illegal character rate threshold to be a high-quality genome sequence, thereby screening the plurality of high-quality genome sequences out of the plurality of genome sequences.
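A minimal Python sketch of this screening is given below. It assumes that the illegal character rate is the number of non-nucleotide characters divided by the total number of characters, that only A, C, G, T and U count as nucleotide characters, and that the threshold is 0.01; the text fixes none of these choices.

```python
# Assumption: ambiguity codes such as N count as non-nucleotide characters.
NUCLEOTIDES = frozenset("ACGTU")


def illegal_character_rate(sequence):
    """Fraction of characters in the sequence that are not nucleotides."""
    seq = sequence.upper()
    if not seq:
        return 1.0  # treat an empty record as entirely illegal
    illegal = sum(1 for ch in seq if ch not in NUCLEOTIDES)
    return illegal / len(seq)


def screen_by_quality(sequences, max_rate=0.01):
    """Keep sequences whose illegal character rate is <= max_rate.

    sequences: dict mapping sequence id -> sequence string
    max_rate:  hypothetical preset threshold
    """
    return {seq_id: seq for seq_id, seq in sequences.items()
            if illegal_character_rate(seq) <= max_rate}
```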

In one embodiment, screening the plurality of high-quality genome sequences according to sequence length and screening the plurality of genome sequence clusters according to the number of high-quality genome sequences in each cluster that meet the sequence length requirement, to obtain the candidate genome sequence clusters and the high-quality genome sequences they contain, includes:

determining each high-quality genome sequence whose sequence length is greater than or equal to a preset base number threshold to be a high-quality genome sequence that meets the sequence length requirement; and

determining each genome sequence cluster in which the number of high-quality genome sequences meeting the sequence length requirement is greater than or equal to a preset number threshold to be a candidate genome sequence cluster for library construction, and determining the high-quality genome sequences contained in the candidate clusters to be candidate high-quality genome sequences for library construction, thereby obtaining the plurality of candidate genome sequence clusters and the plurality of high-quality genome sequences they contain.
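The two-stage screening can be sketched as follows. The sketch keeps only the members of a candidate cluster that themselves meet the length requirement, which is one possible reading of the text; the function name and the threshold parameters are hypothetical.

```python
def select_library_candidates(clusters, min_length, min_count):
    """Select clusters and sequences that are candidates for library construction.

    clusters:   dict mapping cluster label -> {sequence id: sequence}
    min_length: preset base number threshold
    min_count:  preset number threshold for qualifying sequences per cluster
    """
    selected = {}
    for cluster_label, members in clusters.items():
        # Stage 1: sequences that meet the sequence length requirement.
        qualifying = {seq_id: seq for seq_id, seq in members.items()
                      if len(seq) >= min_length}
        # Stage 2: clusters that contain enough qualifying sequences.
        if len(qualifying) >= min_count:
            selected[cluster_label] = qualifying
    return selected
```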

The present invention also provides a coronavirus recombination detection method for large-scale genome data, comprising:

receiving coronavirus genome data to be detected;

performing a prior-knowledge query on the coronavirus genome data to be detected, using the secondary query library obtained by any of the above methods for constructing a multi-dimensional recombination detection corpus;

performing fragment matching on the coronavirus genome data to be detected, using the coronavirus genome sequence fragment library obtained by any of the above construction methods, and, according to the source labels of the matched base subsequences and the prior knowledge obtained by the query, obtaining the source labels of the coronavirus genome data to be detected and staining the corresponding segments of the coronavirus genome data to be detected, wherein the source labels of the coronavirus genome data to be detected include main labels and/or a special label, and the special label indicates that the source is unknown;

determining whether a recombination event exists in the coronavirus genome data to be detected according to the numbers of main labels and special labels among its source labels;

when a recombination event exists, concatenating, for each main label, the base fragments of all stained segments corresponding to that main label in the order of their positions in the coronavirus genome data to be detected, to obtain a query subsequence, and performing a parent sequence query on the query subsequence using the primary, secondary and tertiary query libraries obtained by any of the above construction methods, to obtain the parent sequence data of the coronavirus genome data to be detected; and

extracting each stained segment corresponding to a main label from the coronavirus genome data to be detected as a subsequence, and locating the subsequence within the parent sequence data by sequence alignment, to obtain the recombinant fragment site data of the coronavirus genome data to be detected.
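As a rough illustration of how the query subsequences in the method above could be assembled, the following Python sketch concatenates the stained segments of each main label in positional order. The representation of a stained segment as a `(start, end, label)` tuple with an exclusive end, and the `UNKNOWN_SOURCE` marker standing for the special label, are assumptions made for the example.

```python
UNKNOWN_SOURCE = "UNKNOWN"  # hypothetical marker for the special label


def build_query_subsequences(genome, stained_segments):
    """Concatenate, per main label, the stained segments in genome order.

    genome:           sequence string of the genome to be detected
    stained_segments: iterable of (start, end, label), end exclusive
    returns:          dict mapping main label -> query subsequence
    """
    parts_by_label = {}
    for start, end, label in sorted(stained_segments, key=lambda seg: seg[0]):
        if label == UNKNOWN_SOURCE:
            continue  # segments of unknown source have no parent to query
        parts_by_label.setdefault(label, []).append(genome[start:end])
    return {label: "".join(parts) for label, parts in parts_by_label.items()}
```

For a genome stained alternately with two main labels, each label yields one query subsequence made of its own segments, which can then be submitted to the parent sequence query.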

In one embodiment, performing the prior-knowledge query on the coronavirus genome data to be detected using the secondary query library includes:

calculating the similarity between the coronavirus genome data to be detected and the high-quality genome sequences in the secondary query library, and taking the high-quality genome sequence with the highest similarity, together with its biological category label, as the prior knowledge of the coronavirus genome data to be detected.
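The text does not state which similarity measure is used. Purely as an example, the sketch below substitutes the Jaccard similarity of k-mer sets; the measure, the value of `k` and the function names are assumptions and are not taken from the method itself.

```python
def kmer_set(sequence, k):
    """Set of all length-k substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_prior_knowledge(query, library, k=4):
    """Return (sequence id, biological label) of the most similar sequence.

    library: dict mapping sequence id -> (biological_label, sequence)
    Returns None when the library is empty.
    """
    query_kmers = kmer_set(query, k)
    best, best_score = None, -1.0
    for seq_id, (label, sequence) in library.items():
        score = jaccard(query_kmers, kmer_set(sequence, k))
        if score > best_score:
            best, best_score = (seq_id, label), score
    return best
```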

In one embodiment, performing fragment matching on the coronavirus genome data to be detected using the coronavirus genome sequence fragment library, obtaining the source labels of the coronavirus genome data to be detected according to the source labels of the matched base subsequences and the prior knowledge obtained by the query, and staining the corresponding segments, includes:

cutting the coronavirus genome data to be detected into sequence fragments according to the first preset segmentation length to obtain a plurality of first sequence fragments to be detected; matching each first sequence fragment against the high-quality genome sequence data of the coronavirus genome sequence fragment library corresponding to the first preset segmentation length; taking the biological category label of the high-quality genome sequence data matched by each first sequence fragment as the matching label of that fragment; and casting votes for the matching labels according to the matching results of the plurality of first sequence fragments;

when exactly one matching label has the most votes, taking that matching label as the optimal label, using the optimal label as the source label, and labeling and staining the base sites of the first sequence fragments that voted for the optimal label;

when several matching labels are tied for the most votes, taking the biological category label in the prior knowledge as the optimal label if it is one of the tied matching labels, and otherwise randomly selecting one of the tied matching labels as the optimal label; then using the optimal label as the source label, and labeling and staining the base sites of the first sequence fragments that voted for the optimal label;

when the matching results show that none of the first sequence fragments has a matching label, cutting the coronavirus genome data to be detected into sequence fragments according to the second preset segmentation length to obtain a plurality of second sequence fragments to be detected; matching each second sequence fragment against the high-quality genome sequence data of the coronavirus genome sequence fragment library corresponding to the second preset segmentation length; taking the biological category label of the high-quality genome sequence data matched by each second sequence fragment as the matching label of that fragment; and casting votes for the matching labels according to the matching results of the plurality of second sequence fragments; if the matching results still show that none of the second sequence fragments has a matching label, annotating the plurality of second sequence fragments with the special label;

for the stained segments in the coronavirus genome data to be detected, filtering out each stained segment whose base length is less than or equal to a preset staining length threshold; and

for the unstained segments in the coronavirus genome data to be detected, staining an unstained segment in the same way as its stained context if its base length is less than or equal to the preset staining length threshold, and starting a new round of fragment matching and matching-label voting if its base length is greater than the preset staining length threshold.
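One voting round of this embodiment, including the tie-break by prior knowledge, might look like the following Python sketch. It simplifies the embodiment by assuming that each fragment contributes at most one matching label and that fragments are non-overlapping and of equal length; the function names are hypothetical, and the filtering of short segments and the iteration over unstained segments are not shown.

```python
import random
from collections import Counter


def choose_optimal_label(fragment_labels, prior_label=None, rng=random):
    """Pick the optimal label from one round of fragment votes.

    fragment_labels: one entry per query fragment, either the matching
                     label or None when the fragment matched nothing
    prior_label:     biological category label from the prior-knowledge query
    Returns None when no fragment has a matching label.
    """
    votes = Counter(label for label in fragment_labels if label is not None)
    if not votes:
        return None
    top = max(votes.values())
    leaders = [label for label, count in votes.items() if count == top]
    if len(leaders) == 1:
        return leaders[0]        # unique label with the most votes
    if prior_label in leaders:
        return prior_label       # tie resolved by prior knowledge
    return rng.choice(leaders)   # tie resolved by random choice


def stain_intervals(fragment_labels, optimal_label, fragment_length):
    """Intervals (start, end) of the fragments that voted for the optimal label."""
    return [(i * fragment_length, (i + 1) * fragment_length)
            for i, label in enumerate(fragment_labels)
            if label == optimal_label]
```

A return value of `None` from `choose_optimal_label` corresponds to the case in which the round is repeated with the second preset segmentation length, and, if that also fails, the fragments receive the special label.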

In one embodiment, determining whether a recombination event exists in the coronavirus genome data to be detected according to the numbers of main labels and special labels among its source labels includes:

when the source labels of the coronavirus genome data to be detected contain no special label and only main labels, determining that no recombination event exists if there is only one main label, and determining that a recombination event exists if there are several main labels;

when the source labels of the coronavirus genome data to be detected contain both the special label and a main label, determining that a recombination event exists in the coronavirus genome data to be detected; and

when the source labels of the coronavirus genome data to be detected contain only the special label, determining that no recombination event exists in the coronavirus genome data to be detected.
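The three cases above reduce to a small decision function. In the sketch below the special label is represented by the hypothetical constant `SPECIAL_LABEL`, and every other label is treated as a main label.

```python
SPECIAL_LABEL = "UNKNOWN"  # hypothetical marker for "source unknown"


def has_recombination(source_labels):
    """Apply the decision rules to the source labels of one genome."""
    labels = set(source_labels)
    main_labels = labels - {SPECIAL_LABEL}
    if SPECIAL_LABEL in labels:
        # Special label plus at least one main label: recombination.
        # Special label only: no recombination.
        return len(main_labels) >= 1
    # Main labels only: recombination if and only if there are several.
    return len(main_labels) > 1
```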

In one embodiment, concatenating the base fragments of all stained segments corresponding to a main label in the order of their positions in the coronavirus genome data to be detected to obtain a query subsequence, and performing the parent sequence query on the query subsequence using the primary, secondary and tertiary query libraries to obtain the parent sequence data of the coronavirus genome data to be detected, includes:

performing a parent sequence query on the query subsequence using the primary query library, and ending the step if parent sequence data of the coronavirus genome data to be detected are obtained; otherwise, performing the parent sequence query again using the secondary query library, and ending the step if parent sequence data are obtained; otherwise, performing the parent sequence query once more using the tertiary query library, and, if parent sequence data still cannot be obtained, taking unknown-parent data as the parent sequence data of the coronavirus genome data to be detected.
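The cascaded lookup can be sketched as a loop over the three libraries that stops at the first hit. The `search` callable stands in for whatever sequence search tool is actually used and is an assumption of the example, as is the `UNKNOWN_PARENT` marker.

```python
UNKNOWN_PARENT = "unknown parent"  # hypothetical marker for an unresolved parent


def query_parent(query, libraries, search):
    """Query the libraries in order and stop at the first hit.

    libraries: [primary, secondary, tertiary] query libraries
    search:    callable (query, library) -> hit, or None when nothing is found
    returns:   (level, hit), or (None, UNKNOWN_PARENT) when every level fails
    """
    for level, library in enumerate(libraries, start=1):
        hit = search(query, library)
        if hit is not None:
            return level, hit
    return None, UNKNOWN_PARENT


def substring_search(query, library):
    """Toy search used only for the example: exact substring lookup."""
    for seq_id, sequence in library.items():
        if query in sequence:
            return seq_id
    return None
```

For instance, `query_parent("GGT", [{"p1": "AAAA"}, {"p2": "CCGGTT"}, {"p3": "ACGT"}], substring_search)` falls through the primary library and returns the hit found in the secondary one.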

The present invention also provides a coronavirus recombination detection system for large-scale genome data, comprising:

a data receiving module, configured to receive coronavirus genome data to be detected;

a prior-knowledge query module, configured to perform a prior-knowledge query on the coronavirus genome data to be detected, using the secondary query library obtained by any of the above methods for constructing a multi-dimensional recombination detection corpus;

a sequence fragment matching module, configured to perform fragment matching on the coronavirus genome data to be detected, using the coronavirus genome sequence fragment library obtained by any of the above construction methods, and, according to the source labels of the matched base subsequences and the prior knowledge obtained by the query, to obtain the source labels of the coronavirus genome data to be detected and stain the corresponding segments of the coronavirus genome data to be detected, wherein the source labels of the coronavirus genome data to be detected include main labels and/or a special label, and the special label indicates that the source is unknown;

a recombination event determination module, configured to determine whether a recombination event exists in the coronavirus genome data to be detected according to the numbers of main labels and special labels among its source labels;

a parent sequence query module, configured to, when a recombination event exists in the coronavirus genome data to be detected, concatenate the base fragments of all stained segments corresponding to a main label in the order of their positions in the coronavirus genome data to be detected to obtain a query subsequence, and perform a parent sequence query on the query subsequence using the primary, secondary and tertiary query libraries obtained by any of the above construction methods, to obtain the parent sequence data of the coronavirus genome data to be detected; and

a recombinant fragment site locating module, configured to extract each stained segment corresponding to a main label from the coronavirus genome data to be detected as a subsequence, and to locate the subsequence within the parent sequence data by sequence alignment, to obtain the recombinant fragment site data of the coronavirus genome data to be detected.

The present invention also provides an electronic device, comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements any of the above methods for constructing a multi-dimensional recombination detection corpus and/or the above coronavirus recombination detection method for large-scale genome data.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements any of the above methods for constructing a multi-dimensional recombination detection corpus and/or the above coronavirus recombination detection method for large-scale genome data.

The present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, a computer is able to perform any of the above methods for constructing a multi-dimensional recombination detection corpus and/or the above coronavirus recombination detection method for large-scale genome data.

The coronavirus recombination detection method, system, device and medium for large-scale genome data provided by the present invention have at least the following advantages:

1. The present invention applies data mining methods in combination with bioinformatics tools, and builds a coronavirus genome sequence fragment library and query libraries by processing coronavirus genome sequences. For the sequence of an unknown coronavirus, the genome sources and the presence of recombination events are determined through mechanisms such as fragment matching and iterative voting. Through query and alignment, the parent sequences corresponding to the different genome sources and their corresponding recombinant fragments are determined. Recombination event detection for large-scale coronavirus genome data is thereby realized, providing important data support for subsequent research and application.

2. The present invention avoids the limitation that traditional recombination analysis methods are designed only for small-scale genetic sequence data. Based on quality control, cluster annotation and library construction before operation, it retains the rich information in large-scale coronavirus genome data while making effective use of the library data, and completes recombination detection and analysis of the coronavirus sequence to be detected efficiently and accurately.

3. The present invention avoids the limitation that traditional recombination analysis methods examine only the recombination events that may exist among genome sequences within a specific range and cannot detect recombination events of the coronavirus sequence to be detected relative to the coronavirus background as a whole. It first constructs a multi-dimensional corpus from a rich and comprehensive collection of coronavirus genome data, which effectively supports the detection and evaluation of all possible recombination events of the coronavirus sequence to be detected, and of the sources of its recombinant sequences, against the whole coronavirus background, so that the analysis results are more objective and reliable.

4. The present invention does not require the user to process the coronavirus sequence to be detected before detection; the user only needs to provide the sequence, which makes the method simpler and more convenient to use.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a first schematic flow chart of the coronavirus recombination detection method for large-scale genome data provided by the present invention; Figure 2 is a second schematic flow chart of the coronavirus recombination detection method for large-scale genome data provided by the present invention; Figure 3 is a schematic structural diagram of the coronavirus recombination detection system for large-scale genome data provided by the present invention; Figure 4 is a schematic structural diagram of the electronic device provided by the present invention.

DETAILED DESCRIPTION

In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention, and they should not be understood as limiting the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention. In the description of the present invention, the terms used are for descriptive purposes only and should not be understood as indicating or implying relative importance.

The coronavirus recombination detection method, system, device and medium for large-scale genome data provided by the present invention are described below with reference to Figures 1 to 4.

Figures 1 and 2 are schematic flow charts of the coronavirus recombination detection method for large-scale genome data provided by the present invention. Referring to Figure 1, the coronavirus recombination detection method for large-scale genome data provided by the present invention may include:

Step S110: construct a multidimensional recombination detection corpus, wherein the multidimensional recombination detection corpus includes a coronavirus genome sequence fragment library, a primary query library, a secondary query library and a tertiary query library. The coronavirus genome sequence fragment library may comprise two fragment libraries: a long-fragment library (the primary fragment library) and a short-fragment library (the secondary fragment library);

Step S120: receive the coronavirus genome data to be detected;

Step S130: according to the coronavirus genome data to be detected, perform a prior knowledge query using the secondary query library. The secondary query library is used because, without any prior knowledge, the primary query library cannot be used (it requires a label), whereas the secondary query library contains all sequences used for library construction and excludes the sequences in the tertiary library that do not meet the library construction quality requirements;

Step S140: according to the coronavirus genome data to be detected, perform fragment matching using the coronavirus genome sequence fragment library; according to the source labels of the matched base subsequence data and the queried prior knowledge, obtain the source labels of the coronavirus genome data to be detected and stain the corresponding segments of the coronavirus genome data to be detected, wherein the source labels of the coronavirus genome data to be detected include a main label and/or a special label, and the special label indicates that the source is unknown;

Step S150: according to the number of main labels and special labels among the source labels of the coronavirus genome data to be detected, determine whether a recombination event exists in the coronavirus genome data to be detected;

Step S160: when a recombination event exists in the coronavirus genome data to be detected, concatenate the base fragments of all stained segments corresponding to a main label in the order of their positions in the coronavirus genome data to be detected to obtain a query subsequence, and query the parental sequence of the query subsequence using the primary, secondary and tertiary query libraries to obtain the parental sequence data of the coronavirus genome data to be detected;

Step S170: extract a subsequence from each stained segment corresponding to a main label in the coronavirus genome data to be detected to obtain subsequence data, and locate the subsequence data within the parental sequence data by sequence alignment to obtain the recombinant fragment site data of the coronavirus genome data to be detected.

It should be noted that the coronavirus recombination detection method for large-scale genome data provided by the present invention may be executed by any terminal-side device that meets the technical requirements, for example a coronavirus recombination detection apparatus for large-scale genome data.

It should be noted that step S110 only needs to build the multidimensional recombination detection corpus once, before the other recombination detection steps (steps S120 to S170) are executed. The corpus does not need to be rebuilt before each detection run.

In one embodiment, step S110 may include:

Step S1101: obtain coronavirus genome data, wherein the coronavirus genome data includes multiple genome sequence data. In this embodiment, step S1101 may obtain the genome sequence data of all coronaviruses from online public databases such as NCBI GenBank and GISAID, or from a local genome database.

Step S1102: perform quality screening on the multiple genome sequence data according to the illegal character rate of each genome sequence data, to obtain multiple high-quality genome sequence data.

In this embodiment, the illegal character rate of each genome sequence data can be computed from its number of nucleotide characters and its number of non-nucleotide characters. Genome sequence data whose illegal character rate is less than or equal to a preset illegal character rate threshold is then determined to be high-quality genome sequence data, so that multiple high-quality genome sequence data are screened out of the multiple genome sequence data.

Specifically, for a given genome sequence data (nucleotide sequence), let L be the total length of the sequence and let e be the total number of characters in the sequence other than the four nucleotides A, C, G and T (illegal characters). This embodiment requires that the sequence contain few illegal characters, that is, that the illegal character rate satisfy E ≤ 0.10, where the illegal character rate E is calculated as:

E = e / L

Sequences that meet the above illegal character rate requirement are regarded as high-quality genome sequences and are used in the subsequent steps; the remaining low-quality genome sequence data is discarded.
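The quality screening of step S1102 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and upper-casing the input (so that lowercase a, c, g, t count as valid) and treating an empty record as invalid are assumptions not stated in the source.

```python
from typing import Dict

NUCLEOTIDES = frozenset("ACGT")


def illegal_character_rate(sequence: str) -> float:
    """Return E = e / L, the fraction of characters other than A, C, G, T."""
    seq = sequence.upper()  # assumption: case is not significant
    if not seq:
        return 1.0  # assumption: an empty record is treated as fully invalid
    illegal = sum(1 for ch in seq if ch not in NUCLEOTIDES)
    return illegal / len(seq)


def filter_high_quality(records: Dict[str, str],
                        max_rate: float = 0.10) -> Dict[str, str]:
    """Keep only sequences whose illegal character rate is at most max_rate."""
    return {name: seq for name, seq in records.items()
            if illegal_character_rate(seq) <= max_rate}
```

A sequence with one ambiguous base in ten (E = 0.10) is kept, since the threshold is inclusive.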

Step S1103: cluster the multiple high-quality genome sequence data to obtain multiple genome sequence clusters, wherein each genome sequence cluster is annotated with a classification label and each high-quality genome sequence data in each cluster is annotated with a biological category label consisting of a main label and a group number. In this embodiment, the CD-HIT clustering tool is used to cluster the multiple high-quality genome sequence data.

Specifically, the number of SARS-CoV-2 sequences is far higher than that of other coronavirus sequences, and SARS-CoV-2 sequences are extremely similar to one another. After CD-HIT clustering, the sequences can therefore be resampled so that sequence diversity and completeness are preserved as far as possible while data bias is reduced as far as possible, yielding a non-redundant data set. CD-HIT clusters nucleotide sequences according to their similarity and thereby divides them into different classification labels (cluster labels). Using CD-HIT, the coronavirus sequences can be clustered and calibrated, and the sequences are classified by label through cluster annotation. Afterwards, each retained high-quality genome sequence data has a label that characterizes its biological category, such as SARS-rat_1, MERS_2 or SARS-CoV-2_2. Taking SARS-rat_1 as an example, the biological category label consists of a main label and a group number, indicating that the biological category of this high-quality genome sequence is group 1 of the SARS-rat class.

Step S1104: screen the multiple high-quality genome sequence data according to the sequence length of each, and screen the multiple genome sequence clusters according to the number of high-quality genome sequence data in each cluster that meet the sequence length requirement, to obtain the multiple genome sequence clusters that are candidates for library construction and the multiple high-quality genome sequence data they contain.

Because a recombinant strain is itself the result of recombination between two or more sequences, its similarity to other sequences is often limited. Examining the label distribution of known recombinant sequences in the data set shows that recombinant sequence data often forms a CD-HIT label cluster of the form "one or a few full-length recombinant genome sequences (generally more than 20,000 bases) + several shorter recombinant fragments".

Therefore, in this embodiment, high-quality genome sequence data whose sequence length is greater than or equal to a preset base number threshold (for example, 3500 bases) is determined to meet the sequence length requirement. Then, a genome sequence cluster that contains at least a preset number (for example, 10) of high-quality genome sequence data meeting the sequence length requirement is determined to be a candidate cluster for library construction, and the high-quality genome sequence data contained in a candidate cluster is determined to be candidate high-quality genome sequence data for library construction. In this way, the multiple candidate genome sequence clusters and the multiple high-quality genome sequence data they contain are obtained.
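The two-stage screening of step S1104 might look like the sketch below. The record layout and function name are hypothetical. The sketch also assumes one reading of the text: a qualifying cluster keeps all of its members, including those shorter than the length threshold, because the source says the data "contained in" a candidate cluster becomes candidate data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def select_library_candidates(
    records: List[Tuple[str, str, str]],  # (sequence id, cluster label, sequence)
    min_length: int = 3500,               # example threshold from the text
    min_cluster_size: int = 10,           # example threshold from the text
) -> Dict[str, List[str]]:
    """Return {cluster label: member ids} for clusters that qualify."""
    long_enough = defaultdict(int)   # count of length-qualifying members
    members = defaultdict(list)      # all members of each cluster
    for seq_id, label, seq in records:
        members[label].append(seq_id)
        if len(seq) >= min_length:
            long_enough[label] += 1
    return {label: ids for label, ids in members.items()
            if long_enough[label] >= min_cluster_size}
```

Clusters dominated by short fragments, as is typical of known recombinants, fall below the count threshold and are excluded from library construction.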

Step S1105: according to multiple preset segmentation lengths, split each of the candidate high-quality genome sequence data into sequence fragments to obtain, for each preset segmentation length, the corresponding multiple base subsequence data; construct a coronavirus genome sequence fragment library from the base subsequence data corresponding to each preset segmentation length; and annotate each base subsequence data with a source label according to the biological category label of the high-quality genome sequence data it was cut from, wherein the multiple preset segmentation lengths include a first preset segmentation length and a second preset segmentation length, each expressed in bp (bp denotes length in bases, here and below).

Specifically, for a sequence with label A, starting from the first base, base subsequences of the fixed preset segmentation length are cut out one after another with a stride of 1, so that a total of (sequence length − segmentation length + 1) base subsequences are obtained. Duplicates may occur, but they do not affect subsequent results and need not be considered. The base subsequences (that is, sequence fragments) are saved in the sequence fragment library, and each sequence fragment is associated with label A, indicating that the fragment was cut from a genome sequence in label A. If a sequence fragment is already saved in the library together with its association to some other label, a new association is added so that the fragment is associated with both that label and label A.

This process is applied to at most 100 sequences in each label: for labels with more than 100 sequences, only the first 100 are used and the rest are regarded as redundant and ignored; for labels with fewer than 100 sequences, all are used. The result is a sequence fragment library containing all sequence fragments of the preset segmentation length, in which each fragment corresponds to all of its source label clusters, that is, the genome sources of the fragment, such as SARS-rat_1, MERS_2 or SARS-CoV-2_2. In this embodiment, two sequence fragment libraries are constructed at the same time: the primary sequence fragment library uses the first preset segmentation length and the secondary sequence fragment library uses the second preset segmentation length.
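The fragment library construction described above can be sketched as a mapping from each fragment to the set of labels it was cut from. The function name and the parameter `k` (the segmentation length) are hypothetical; the concrete segmentation lengths are not reproduced in this text, so `k` is left for the caller to supply.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_fragment_library(
    labelled_sequences: Iterable[Tuple[str, str]],  # (label, sequence)
    k: int,                                         # segmentation length in bp
    max_per_label: int = 100,                       # cap stated in the text
) -> Dict[str, Set[str]]:
    """Map every length-k fragment to the set of labels it was cut from."""
    library = defaultdict(set)
    used = defaultdict(int)  # sequences already consumed per label
    for label, seq in labelled_sequences:
        if used[label] >= max_per_label:
            continue  # extra sequences of a label are treated as redundant
        used[label] += 1
        # Stride 1 yields len(seq) - k + 1 fragments (none if seq is shorter than k).
        for start in range(len(seq) - k + 1):
            library[seq[start:start + k]].add(label)
    return dict(library)
```

Storing a set per fragment makes repeated fragments harmless and lets one fragment carry several source labels, as the text requires. Calling the function twice with the two segmentation lengths would give the primary and secondary fragment libraries.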

Step S1106: group the multiple high-quality genome sequence data according to the main label of each to obtain multiple main label groups, and build a separate library for each main label group to form the primary query library, wherein each main label group includes the multiple high-quality genome sequence data annotated with the same main label.

Specifically, each genome sequence obtains its biological category label from CD-HIT cluster annotation, such as SARS-rat_1, MERS_2 or SARS-CoV-2_2, whose main labels are SARS-rat, MERS and SARS-CoV-2 respectively. For each main label, a BLAST database can be built with the makeblastdb command from all genome sequences it contains, regardless of whether they took part in constructing the sequence fragment library. One database is built per main label.

Step S1107: build a library from the multiple candidate high-quality genome sequence data, regardless of whether they took part in constructing the sequence fragment library. The makeblastdb command can be used to build a single BLAST database from all of these sequences, giving the secondary query library.

Step S1108: build a library from the multiple high-quality genome sequence data obtained after quality screening and clustering. The makeblastdb command can be used to build a single BLAST database from all of these sequences, giving the tertiary query library.
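As an illustration of steps S1106 to S1108, the sketch below assembles `makeblastdb` command lines for nucleotide databases. The `-in`, `-dbtype` and `-out` options are standard BLAST+ options; the file layout, database names and helper names are hypothetical, and splitting a label on its last underscore to get the main label is an assumption based on the example labels in the text.

```python
from typing import Dict, Iterable, List


def makeblastdb_command(fasta_path: str, db_name: str) -> List[str]:
    """Command line that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def main_label(label: str) -> str:
    """'SARS-CoV-2_2' -> 'SARS-CoV-2' (drop the trailing group number)."""
    return label.rsplit("_", 1)[0]


def primary_library_commands(labels: Iterable[str], fasta_dir: str,
                             db_dir: str) -> Dict[str, List[str]]:
    """One database per main label; assumes one FASTA file per main label."""
    commands = {}
    for name in sorted({main_label(label) for label in labels}):
        commands[name] = makeblastdb_command(
            f"{fasta_dir}/{name}.fasta", f"{db_dir}/primary_{name}")
    return commands
```

Each command could then be run with `subprocess.run(cmd, check=True)`. The secondary and tertiary query libraries need only a single `makeblastdb_command` call each, over their respective combined FASTA files.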

In one embodiment, step S120 may receive the coronavirus genome data to be detected (sequence X) as input from a user.

In one embodiment, step S130 may use BLAST (Basic Local Alignment Search Tool) to compute the similarity between the coronavirus genome data to be detected and the high-quality genome sequence data in the secondary query library. The high-quality genome sequence data most similar to the coronavirus genome data to be detected, together with its biological category label, is taken as the prior knowledge for the coronavirus genome data to be detected. If the query fails (no hit), no prior knowledge is considered to exist.
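A minimal sketch of the prior knowledge query follows, assuming `blastn` is run with tabular output (`-outfmt 6`, whose default columns end with the bit score). Ranking hits by bit score is an assumption, since the source only says "highest similarity". The helper names and the `label_of` lookup from sequence id to biological category label are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def blastn_command(query_fasta: str, db_name: str) -> List[str]:
    """Command line for a tabular blastn search against one database."""
    return ["blastn", "-query", query_fasta, "-db", db_name, "-outfmt", "6"]


def best_hit(tabular_output: str) -> Optional[Tuple[str, float]]:
    """Return (subject id, bit score) of the top-scoring hit, or None if no hit."""
    best = None
    for line in tabular_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        subject, bitscore = fields[1], float(fields[11])  # columns 2 and 12
        if best is None or bitscore > best[1]:
            best = (subject, bitscore)
    return best


def prior_knowledge(tabular_output: str,
                    label_of: Dict[str, str]) -> Optional[str]:
    """Label of the best hit, or None when the query is a no-hit."""
    hit = best_hit(tabular_output)
    return None if hit is None else label_of.get(hit[0])
```

Returning `None` for an empty result mirrors the no-hit case, in which the method proceeds without prior knowledge.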

Fragment matching is performed on sequence X. The matched fragments vote according to their genome sources in the sequence fragment library; the genome source with the most votes is determined, and the corresponding genome fragments are all assigned to that source. Fragments whose source remains undetermined continue through multi-level iterative matching until the process ends. In one embodiment, step S140 may include:

According to the first preset segmentation length, split the coronavirus genome data to be detected into sequence fragments to obtain multiple first sequence fragment data to be detected; match each first sequence fragment data to be detected against the coronavirus genome sequence fragment library corresponding to the first preset segmentation length; take the biological category label of the high-quality genome sequence data matched by each first sequence fragment data to be detected as the matching label of that fragment; and vote for the matching labels according to the matching results of the multiple first sequence fragment data to be detected.

Specifically, fragment matching and voting are first performed over the full length of X. Using the fragment length of the primary sequence fragment library, the input sequence X is likewise cut into base subsequences of that fixed length with a stride of 1, giving the sequence fragments of X. A voting pool is created. The fragments of X are matched against the primary fragment library, where every matching fragment has one or more biological category labels, such as SARS-rat_1, MERS_2 or SARS-CoV-2_2. Each fragment cut from X casts a vote in the voting pool for its corresponding label; if a fragment is associated with several labels at once, the vote count of each of those labels is increased by one.
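One round of fragment matching and voting can be sketched as below, assuming the fragment library is held as a mapping from fragment to label set. The function name and the parameter `k` are hypothetical, and the concrete segmentation length is left to the caller.

```python
from collections import Counter
from typing import Dict, Set


def vote_labels(query: str, library: Dict[str, Set[str]], k: int) -> Counter:
    """Cut query into length-k fragments (stride 1) and tally label votes."""
    votes = Counter()
    for start in range(len(query) - k + 1):
        fragment = query[start:start + k]
        # A fragment linked to several labels gives each of them one vote.
        for label in library.get(fragment, ()):
            votes[label] += 1
    return votes
```

`votes.most_common(1)` then gives the label with the most votes, and an empty counter corresponds to the case where no fragment matches and the shorter, secondary fragment library is tried.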

After a round of voting, the results are tallied. The possible outcomes include the following.

When the matching results show that none of the multiple first sequence fragment data to be detected has a matching label, split the coronavirus genome data to be detected into sequence fragments according to the second preset segmentation length to obtain multiple second sequence fragment data to be detected; match each second sequence fragment data to be detected against the coronavirus genome sequence fragment library corresponding to the second preset segmentation length; take the biological category label of the high-quality genome sequence data matched by each second sequence fragment data to be detected as the matching label of that fragment; and vote for the matching labels according to the matching results of the multiple second sequence fragment data to be detected. If the matching results still show that none of the multiple second sequence fragment data to be detected has a matching label, annotate the multiple second sequence fragment data to be detected with the special label;

For the stained segments in the coronavirus genome data to be detected, filter out those whose length in bases is less than or equal to a preset staining length threshold. Segments no longer than this threshold are regarded as clusters of mutations rather than recombination; the threshold serves to eliminate the interference of dense point mutations with sequence matching;

In addition, if the prior knowledge query stage yielded no prior knowledge label, then after the full-length vote (the first vote) the label selected by that vote is set as the prior knowledge label for use in subsequent iterative voting.

After step S140 ends, the iterations return and their results are merged, so that every base position has been stained. At this point, the length of each stained segment is checked once more from the beginning. If a stained segment has a length less than or equal to the preset staining length threshold, that segment is removed and restained to match the staining of its surrounding context.
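The final clean-up of short stained segments could be sketched as follows, with the staining held as one label per base position. The text does not say which neighbour a short segment should inherit from, so this sketch assumes it takes the label of the nearest preceding long segment (or of the first long segment, if it lies before any). The function names and the `min_length` parameter are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def to_segments(colors: List[str]) -> List[Tuple[str, int]]:
    """Run-length encode per-base labels into (label, length) segments."""
    return [(label, len(list(run))) for label, run in groupby(colors)]


def smooth_short_segments(colors: List[str], min_length: int) -> List[str]:
    """Restain segments of length <= min_length with a neighbouring label."""
    segments = to_segments(colors)
    long_labels = [label for label, n in segments if n > min_length]
    if not long_labels:
        return list(colors)  # no long segment to inherit from; leave as is
    result = []
    current = long_labels[0]  # leading short segments inherit from the first long one
    for label, n in segments:
        if n > min_length:
            current = label
        result.extend([current] * n)
    return result
```

After smoothing, adjacent segments that now share a label merge automatically when the per-base list is run-length encoded again.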

Through the above mechanisms of fragment matching, iterative voting and primary and secondary matching, every base fragment of the unknown coronavirus genome sequence X input by the user can be stained (marked) with the CD-HIT label that characterizes its biological source category, so that the source of the viral genome can be determined.

In one embodiment, step S150 may include:

It should be noted that, because CD-HIT clusters at a high sequence similarity, sequences within the same label (for example, within SARS-CoV-2_2) are very similar to one another and may even include duplicate or redundant sequences. Different labels within the same main label, such as SARS-CoV-2_1, SARS-CoV-2_2 and SARS-CoV-2_3 within SARS-CoV-2, are also relatively similar because they belong to the same or closely related species. The present invention therefore focuses on recombination and evolution events between different main labels (for example, between MERS and SARS-CoV-2), and regards differing group numbers within one main label as normal genetic variation. Labels with group numbers are used in the preceding staining stage mainly to make matching more precise.

At the algorithm level, for sequence X after it has been fully stained (marked) in the previous stage, the label of each stained segment is simplified to its main label (for example, SARS-CoV-2_2 is simplified to SARS-CoV-2), and the sequence is restained according to the main labels. The staining over the full length of X is then examined. If the full length of the sequence carries only one main label (for example, SARS-CoV-2_1 and SARS-CoV-2_2 both simplify to SARS-CoV-2), no recombination is considered to exist; conversely, if several main labels are present (for example, SARS-rat and SARS-CoV-2), recombination is considered to exist. In particular, the presence of the special label "Unknown" almost always implies recombination, because it differs from every other main label. The only exception is a sequence that is "Unknown" over its full length, an extreme case that almost never occurs in practice.

If the algorithm determines that no recombination exists over the full length of X, the main label determined for the full length of X is output directly as its genome source, and the run ends. If recombination exists, the stained segment division of X and the main label of each stained segment are used to query the parental sequences and locate the recombinant fragments.
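The recombination decision of step S150 reduces to counting distinct main labels, as in this minimal sketch. The function names are hypothetical, and deriving the main label by splitting on the last underscore is an assumption based on the example labels in the text.

```python
from typing import Iterable


def main_label(label: str) -> str:
    """'SARS-CoV-2_2' -> 'SARS-CoV-2'; 'Unknown' has no group number and is unchanged."""
    return label.rsplit("_", 1)[0]


def has_recombination(segment_labels: Iterable[str]) -> bool:
    """True when the stained segments carry more than one distinct main label."""
    return len({main_label(label) for label in segment_labels}) > 1
```

Because "Unknown" is treated like any other main label, a sequence mixing "Unknown" with a known label is reported as recombinant, while a sequence that is "Unknown" throughout is not.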

在一种实施例中，步骤S160可以利用一级查询库对查询子序列进行亲代序列查询，当得到待检测的冠状病毒基因组数据的亲代序列数据时，结束该步骤，否则，利用二级查询库对查询子序列进行再次亲代序列查询，当得到待检测的冠状病毒基因组数据的亲代序列数据时，结束该步骤，否则，利用三级查询库对查询子序列进行又一次亲代序列查询，如果依然未能得到待检测的冠状病毒基因组数据的亲代序列数据时，以未知亲代数据作为待检测的冠状病毒基因组数据的亲代序列数据。In one embodiment, step S160 can use the primary query library to perform a parent sequence query on the query subsequence. When the parent sequence data of the coronavirus genome data to be detected is obtained, the step is terminated. Otherwise, the query subsequence is queried again for the parent sequence using the secondary query library. When the parent sequence data of the coronavirus genome data to be detected is obtained, the step is terminated. Otherwise, the query subsequence is queried again for the parent sequence using the tertiary query library. If the parent sequence data of the coronavirus genome data to be detected is still not obtained, the unknown parent data is used as the parent sequence data of the coronavirus genome data to be detected.

Specifically, for a given genome source label (main label), all stained segments carrying that label are collected, and their base fragments are concatenated in the order of their positions in the genome sequence of X to obtain a "query subsequence". The parent sequence is then queried with multi-level BLAST (the search program is BLASTn). Because a query subsequence corresponds to one main label, the search starts in the primary query library of that label. If it succeeds, the query ends and the parent sequence is determined. If it fails (no hit), the secondary query library is searched, and a hit there determines the parent sequence. If that also fails, the tertiary query library is searched, and a hit there determines the parent sequence. If the tertiary search fails as well, the query fails and the parent sequence is returned as NA (unknown parent). This procedure is repeated for every genome source label (main label) until the parent sequences for all main labels of X are determined.
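The construction of the query subsequence and the three-level fallback can be sketched as below. This is an illustration under stated assumptions: segments are given as 1-based inclusive (start, end, main label) tuples, and each query library is represented by a hypothetical callable that returns a parent accession or None on no hit. How such a callable wraps BLASTn is left out on purpose.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]             # (start, end, main_label), 1-based inclusive
SearchFn = Callable[[str], Optional[str]]  # hypothetical: parent accession, or None on no hit


def build_query_subsequence(genome: str, segments: Sequence[Segment], label: str) -> str:
    """Concatenate, in genome order, the bases of every segment carrying `label`."""
    chosen = sorted(s for s in segments if s[2] == label)
    return "".join(genome[start - 1:end] for start, end, _ in chosen)


def cascade_query(query: str, levels: List[SearchFn]) -> str:
    """Try each query library in turn; fall back to 'NA' when all of them miss."""
    for search in levels:
        hit = search(query)
        if hit is not None:
            return hit
    return "NA"
```

Passing the primary, secondary and tertiary searches as an ordered list keeps the fallback order explicit and makes the logic easy to exercise with stub functions.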

In one embodiment, step S170 may extract the subsequence of each stained segment of sequence X and, using a sequence alignment method, locate the position of that subsequence (the recombinant fragment) in its corresponding parent sequence. When the parent sequence returned by the parent query is NA (a rare case), no alignment is performed and the fragment is skipped. Any sequence alignment method from the bioinformatics field may be used; this method uses the pairwise2.align.localms() function of the pairwise2 module in the Biopython library. In practice, every stained segment (recombinant fragment) has to be aligned and located in its parent sequence, which is time-consuming, so the Python multiprocessing library is used to run the alignments in parallel across multiple processes.
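To make the locating step concrete, the sketch below finds where a fragment aligns in a parent sequence. It is not the Biopython call named above: it substitutes a dependency-free Smith-Waterman local alignment with a linear gap penalty, runs serially, and uses scoring values that are assumptions chosen only for illustration.

```python
from typing import Tuple


def locate_fragment(fragment: str, parent: str,
                    match: int = 2, mismatch: int = -1, gap: int = -2) -> Tuple[int, int, int]:
    """Return (score, start, end): the 1-based inclusive span in `parent`
    covered by the best-scoring local alignment, or (0, 0, 0) if none scores above zero."""
    cols = len(parent) + 1
    prev_score = [0] * cols
    prev_start = [0] * cols  # 0-based parent index where the alignment ending here began
    best = (0, 0, 0)
    for i in range(1, len(fragment) + 1):
        cur_score = [0] * cols
        cur_start = [0] * cols
        for j in range(1, cols):
            sub = match if fragment[i - 1] == parent[j - 1] else mismatch
            # Candidates: diagonal (pair two bases), up (skip a fragment base),
            # left (skip a parent base). A diagonal step from a zero cell starts a new alignment.
            candidates = [
                (prev_score[j - 1] + sub,
                 prev_start[j - 1] if prev_score[j - 1] > 0 else j - 1),
                (prev_score[j] + gap, prev_start[j]),
                (cur_score[j - 1] + gap, cur_start[j - 1]),
            ]
            score, start = max(candidates, key=lambda c: c[0])
            if score > 0:
                cur_score[j], cur_start[j] = score, start
                if score > best[0]:
                    best = (score, start + 1, j)
        prev_score, prev_start = cur_score, cur_start
    return best
```

Since each fragment is aligned independently, the calls could be distributed over worker processes, which is the role the text assigns to multiprocessing.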

Finally, for a sequence X determined to be recombinant, the recombination result is output in CSV format. For example, "1-11285bp<-SARS-CoV-2|MT873842|235-11519bp" means that the fragment at positions 1-11285 bp of this sequence derives from positions 235-11519 bp of sequence MT873842 under the main label SARS-CoV-2. The other recombinant fragments are reported in the same way until the complete result is output.
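The record format quoted above can be produced and read back with a few lines of code. The sketch below follows the layout of the quoted example only; the function names are hypothetical, and it assumes that the accession and the coordinate fields never contain the "|" character.

```python
from typing import Tuple


def format_event(q_start: int, q_end: int, main_label: str, accession: str,
                 p_start: int, p_end: int) -> str:
    """Render one recombinant fragment as '<query span><-<main label>|<accession>|<parent span>'."""
    return f"{q_start}-{q_end}bp<-{main_label}|{accession}|{p_start}-{p_end}bp"


def _parse_span(text: str) -> Tuple[int, int]:
    """Turn '235-11519bp' into (235, 11519)."""
    body = text[:-2] if text.endswith("bp") else text
    start, end = body.split("-")
    return int(start), int(end)


def parse_event(record: str) -> Tuple[Tuple[int, int], str, str, Tuple[int, int]]:
    """Split a record into (query span, main label, accession, parent span)."""
    query_part, parent_part = record.split("<-", 1)
    # Split from the right so that hyphens inside the main label are left untouched.
    main_label, accession, parent_span = parent_part.rsplit("|", 2)
    return _parse_span(query_part), main_label, accession, _parse_span(parent_span)
```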

Example 1 is provided below to describe how the coronavirus recombination detection method for large-scale genome data provided by the present invention is applied to mining recombination events in SARS-LIKE viruses (SARS-like viruses).

In Example 1, 38 recombinant SARS-LIKE virus sequences were collected through literature review and detection with the RDP software, and combined with 165 non-recombinant SARS-LIKE sequences to form a SARS-LIKE data set of 203 sequences. Example 1 uses the present method to detect the recombinant strains in the SARS-LIKE data set and their recombination events (parent sequence sources and recombination base positions).

1. Constructing the multi-dimensional recombination detection corpus

1.1) Collection and preprocessing of genome sequences

Coronavirus genome sequences were downloaded from online databases such as NCBI GenBank and GISAID, about 13 million in total. Most are SARS-CoV-2 sequences, and about 40,000 are non-SARS-CoV-2 sequences. To reduce bias, the sequences were screened and resampled so that data bias is lowered while sequence diversity and integrity are preserved. The quality of each genome sequence was then examined, and sequences whose illegal character rate met the quality requirement were retained as high-quality sequences. The CD-HIT clustering method was used to cluster and calibrate the coronavirus sequences, and the resulting cluster annotations provide the label classification of the sequences.
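A quality filter based on the illegal character rate could look like the sketch below. The set of legal characters (A, C, G, T) and the direction of the comparison are assumptions, and the threshold is left as a caller-supplied parameter because its value is not given in this passage.

```python
from typing import Dict

LEGAL_BASES = frozenset("ACGT")  # assumption: everything else counts as illegal


def illegal_character_rate(sequence: str) -> float:
    """Fraction of characters in `sequence` that are not legal bases."""
    if not sequence:
        return 1.0  # treat an empty record as unusable
    illegal = sum(1 for base in sequence.upper() if base not in LEGAL_BASES)
    return illegal / len(sequence)


def keep_high_quality(records: Dict[str, str], max_rate: float) -> Dict[str, str]:
    """Keep the records whose illegal character rate does not exceed `max_rate`."""
    return {name: seq for name, seq in records.items()
            if illegal_character_rate(seq) <= max_rate}
```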

1.2) Construction of the viral genome sequence fragment library

The viral genome sequence fragment library is built from the preprocessed genome sequences described above.

Following the quality control strategy described in this method, the CD-HIT label clusters and the sequences they contain were screened, and only clusters containing at least ten sequences of length 3500 bp or more were retained. After quality control, 19 CD-HIT main labels remained, comprising 98 label clusters.
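The cluster-level quality control can be expressed compactly. The sketch below applies the two figures stated in the text (at least ten sequences, each at least 3500 bp); the dictionary layout mapping a cluster label to its sequences is an assumption made for illustration.

```python
from typing import Dict, List

MIN_LENGTH = 3500    # bp, as stated in the text
MIN_SEQUENCES = 10   # as stated in the text


def filter_clusters(clusters: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop short sequences, then drop clusters left with fewer than MIN_SEQUENCES sequences."""
    kept: Dict[str, List[str]] = {}
    for label, sequences in clusters.items():
        long_enough = [seq for seq in sequences if len(seq) >= MIN_LENGTH]
        if len(long_enough) >= MIN_SEQUENCES:
            kept[label] = long_enough
    return kept
```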

The retained label clusters were then segmented and built into libraries, yielding a primary fragment library and a secondary fragment library, each constructed with its own fragment length.
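One way to picture a fragment library is as a mapping from each fragment to the labels of the sequences it was cut from. The sketch below is illustrative only: the fragment length and the step between windows are explicit parameters because their values are not given in this passage, and the dictionary-based layout is an assumption rather than the storage format of the invention.

```python
from typing import Dict, Iterator, Set, Tuple


def split_fragments(sequence: str, length: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start, fragment) for each full-length window of `sequence`."""
    for start in range(0, len(sequence) - length + 1, step):
        yield start + 1, sequence[start:start + length]


def build_fragment_library(labelled: Dict[str, Tuple[str, str]],
                           length: int, step: int) -> Dict[str, Set[str]]:
    """Map every fragment to the set of source labels it was observed under.

    `labelled` maps a sequence id to (biological category label, sequence).
    """
    library: Dict[str, Set[str]] = {}
    for label, sequence in labelled.values():
        for _, fragment in split_fragments(sequence, length, step):
            library.setdefault(fragment, set()).add(label)
    return library
```

Building the library twice with two different `length` values would give the primary and secondary fragment libraries.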

1.3) Construction of the viral genome sequence query libraries

The BLAST libraries were built with the makeblastdb command. One primary query library was built for each main label, giving 19 primary query libraries for the 19 main labels. Together with one secondary query library and one tertiary query library, this makes 21 sequence query libraries in total.
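The library layout can be planned as a set of makeblastdb invocations, one per query library. The sketch below only assembles the command lines and does not run them; the FASTA and database paths are hypothetical placeholders, while `-in`, `-dbtype nucl` and `-out` are standard makeblastdb options for nucleotide databases.

```python
from typing import Dict, List


def makeblastdb_command(fasta_path: str, db_name: str) -> List[str]:
    """Argument list for building one nucleotide BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def plan_query_libraries(main_labels: List[str]) -> Dict[str, List[str]]:
    """One primary library per main label, plus one secondary and one tertiary library."""
    plan = {
        f"primary/{label}": makeblastdb_command(
            f"fasta/primary_{label}.fasta", f"db/primary_{label}")  # hypothetical paths
        for label in main_labels
    }
    plan["secondary"] = makeblastdb_command("fasta/secondary.fasta", "db/secondary")
    plan["tertiary"] = makeblastdb_command("fasta/tertiary.fasta", "db/tertiary")
    return plan
```

With 19 main labels the plan contains 21 entries, matching the count given in the text. Each argument list could be passed to `subprocess.run` on a machine where BLAST+ is installed.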

2. Determination of the genome source of the input sequence

2.1) Genome source determination

The sequences of the 203-sequence SARS-LIKE test data set were read one by one. For each sequence, prior knowledge was first acquired as described in the method, and the primary fragment library was then used for fragment matching and iterative voting. Where the method requires it, the secondary fragment library was used for further iterative voting. After the iteration returned, the length of each segment was checked: stained segments no longer than the preset staining length threshold were removed and re-stained to match their surrounding context. In the end, every base position of every input sequence was stained (the staining encodes the genome source), which completes the determination of the viral genome source.
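The clean-up of short stained segments can be illustrated as follows. This sketch makes two assumptions that the passage does not settle: a short segment takes the label of the nearest long segment on its left (or on its right when there is none on the left), and the threshold is supplied by the caller.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int, str]  # (start, end, label), 1-based inclusive, contiguous and ordered


def smooth_segments(segments: List[Segment], threshold: int) -> List[Segment]:
    """Relabel segments no longer than `threshold` from their context, then merge equal neighbours."""
    labels: List[Optional[str]] = [
        lbl if (end - start + 1) > threshold else None
        for start, end, lbl in segments
    ]
    # Forward fill: a short segment inherits the label on its left.
    for i in range(1, len(labels)):
        if labels[i] is None:
            labels[i] = labels[i - 1]
    # Backward fill: leading short segments inherit the label on their right.
    for i in range(len(labels) - 2, -1, -1):
        if labels[i] is None:
            labels[i] = labels[i + 1]
    merged: List[Segment] = []
    for (start, end, original), lbl in zip(segments, labels):
        final = lbl if lbl is not None else original  # every segment was short: keep as is
        if merged and merged[-1][2] == final:
            merged[-1] = (merged[-1][0], end, final)
        else:
            merged.append((start, end, final))
    return merged
```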

2.2) Determination of recombination events

Each sequence in the data set was judged recombinant or not from its staining and its CD-HIT label information. For each sequence, the label attached to each stained segment was first simplified so that only the main label remained. If the full length of the sequence carried a single main label, no recombination was considered to exist; if it carried several main labels, recombination was considered to exist. For sequences judged non-recombinant, the current result was output directly. For recombinant sequences, the parent sequence query and recombinant fragment location described below were carried out.

3. Parent sequence query and location

3.1) Parent sequence query

For each sequence in the SARS-LIKE data set that had been judged recombinant, the following was done: for each genome source label (main label) in the sequence, the corresponding query subsequence was constructed and the parent sequence was queried with the multi-level BLAST search, until the parent sequences for all genome sources of the sequence were determined.

3.2) Location of recombinant fragments

For each sequence in the SARS-LIKE data set that had been judged recombinant, the following was done: for each segment of the sequence, sequence alignment was used to locate the position of the segment's sequence fragment (the recombinant fragment) in its parent sequence, which determines the source of the recombinant fragment, and the result was output.

Because the recombination status of each sequence was already known from literature review and RDP detection when the SARS-LIKE data set was assembled, accuracy can be computed by comparing those known recombination results with the results determined by this method, and the reliability of the method can be assessed accordingly. The accuracy results are shown in Table 1. All accuracy measures of this method on the SARS-LIKE data set are high, which demonstrates the reliability of the detection results of the present invention.

In this application case, the coronavirus recombination detection method for large-scale genome data provided by the present invention effectively detects the recombination events present in the evolution of SARS-LIKE coronaviruses, with high detection accuracy and reliable results, and can provide important data support for subsequent research and applications.

The coronavirus recombination detection method for large-scale genome data provided by the present invention has at least the following advantages: 1. The present invention applies data mining methods in combination with bioinformatics tools, and builds a coronavirus genome sequence fragment library and query libraries by processing coronavirus genome sequences. For the sequence of an unknown coronavirus, the genome source and the recombination events are determined through mechanisms such as fragment matching and iterative voting. Through query and alignment, the parent sequence corresponding to each genome source and its recombinant fragments are determined. This enables recombination event detection on large-scale coronavirus genome data and provides important data support for subsequent research and applications.

The system for constructing the multi-dimensional recombination detection corpus and the coronavirus recombination detection system for large-scale genome data provided by the present invention are described below. These systems and the corresponding methods described above (the method for constructing the multi-dimensional recombination detection corpus and the coronavirus recombination detection method for large-scale genome data) may be read with reference to each other.

The system for constructing a multi-dimensional recombination detection corpus provided by the present invention may include:

a data acquisition module, configured to acquire coronavirus genome data, wherein the coronavirus genome data includes a plurality of genome sequence data;

a quality screening module, configured to perform quality screening on the plurality of genome sequence data according to the illegal character rate of each genome sequence data, to obtain a plurality of high-quality genome sequence data;

a clustering module, configured to cluster the plurality of high-quality genome sequence data to obtain a plurality of genome sequence clusters, wherein each genome sequence cluster is annotated with a classification label, each high-quality genome sequence data in each genome sequence cluster is annotated with a biological category label, and the biological category label includes a main label and a group number;

a length screening module, configured to screen the plurality of high-quality genome sequence data according to the sequence length of each high-quality genome sequence data, and to screen the plurality of genome sequence clusters according to the number of high-quality genome sequence data in each cluster that meet the sequence length requirement, to obtain the genome sequence clusters that are candidates for library construction and the high-quality genome sequence data they contain;

a segmentation module, configured to segment each candidate high-quality genome sequence data into sequence fragments according to a plurality of preset segmentation lengths, to obtain a plurality of base subsequence data for each preset segmentation length, to construct a coronavirus genome sequence fragment library from the base subsequence data of each preset segmentation length, and to annotate each base subsequence data with a source label according to the biological category label of the candidate high-quality genome sequence data it was cut from, wherein the plurality of preset segmentation lengths include a first preset segmentation length and a second preset segmentation length;

a first library building module, configured to group the plurality of high-quality genome sequence data according to the main label of each high-quality genome sequence data to obtain a plurality of main label groups, and to build a separate library for each main label group to form the primary query libraries, wherein each main label group includes the high-quality genome sequence data annotated with the same main label;

a second library building module, configured to build a library from the candidate high-quality genome sequence data to obtain the secondary query library;

a third library building module, configured to build a library from the high-quality genome sequence data obtained after quality screening and clustering, to obtain the tertiary query library.

Referring to FIG. 3, the coronavirus recombination detection system for large-scale genome data provided by the present invention may include:

In one embodiment, the prior knowledge query module may include:

a similarity calculation submodule, configured to calculate the similarity between the coronavirus genome data to be detected and the high-quality genome sequence data in the secondary query library, and to take the high-quality genome sequence data with the highest similarity, together with its biological category label, as the prior knowledge of the coronavirus genome data to be detected.

In one embodiment, the sequence fragment matching module may include:

a first segmentation submodule, configured to segment the coronavirus genome data to be detected into sequence fragments according to the first preset segmentation length to obtain a plurality of first sequence fragments to be detected, to match each first sequence fragment against the high-quality genome sequence data of the coronavirus genome sequence fragment library corresponding to the first preset segmentation length, to take the biological category label of the high-quality genome sequence data matched by each first sequence fragment as the matching label of that fragment, and to vote for the matching labels according to the matching results of the first sequence fragments;

a first voting submodule, configured to, when exactly one matching label has the most votes, take that label as the optimal label and, using it as the source label, label and stain the base positions of the first sequence fragments that voted for it;

a second voting submodule, configured to, when several matching labels are tied for the most votes, take the biological category label from the prior knowledge as the optimal label if it is one of the tied labels, and otherwise select one of the tied labels at random as the optimal label, and then, using the optimal label as the source label, label and stain the base positions of the first sequence fragments that voted for it;

a third voting submodule, configured to, when the matching results show that none of the first sequence fragments has a matching label, segment the coronavirus genome data to be detected according to the second preset segmentation length to obtain a plurality of second sequence fragments to be detected, to match each second sequence fragment against the high-quality genome sequence data of the coronavirus genome sequence fragment library corresponding to the second preset segmentation length, to take the biological category label of the high-quality genome sequence data matched by each second sequence fragment as the matching label of that fragment, and to vote for the matching labels according to the matching results of the second sequence fragments, and, if none of the second sequence fragments has a matching label either, to annotate the second sequence fragments with the special label;

a first optimization submodule, configured to filter out, from the stained segments of the coronavirus genome data to be detected, the stained segments whose base length is less than or equal to the preset staining length threshold;

a second optimization submodule, configured to, for each unstained segment of the coronavirus genome data to be detected, stain the segment to match its surrounding context if its base length is less than or equal to the preset staining length threshold, and start a new round of fragment matching and matching label voting if its base length is greater than the preset staining length threshold.
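The choice of the optimal label made by the first and second voting submodules can be sketched as a single function. This is an illustration under assumptions: `votes` holds one matching label per matched fragment, `prior_label` is the biological category label from the prior knowledge, and an empty vote list stands for the no-match case in which the caller would fall back to the second segmentation length. The function name is hypothetical.

```python
import random
from collections import Counter
from typing import List, Optional


def pick_best_label(votes: List[str], prior_label: Optional[str],
                    rng: Optional[random.Random] = None) -> Optional[str]:
    """Return the optimal label, or None when no fragment produced a matching label."""
    if not votes:
        return None
    counts = Counter(votes)
    top = max(counts.values())
    leaders = sorted(label for label, n in counts.items() if n == top)
    if len(leaders) == 1:
        return leaders[0]            # unique winner
    if prior_label in leaders:
        return prior_label           # tie broken by the prior knowledge
    return (rng or random).choice(leaders)  # tie broken at random
```

Accepting an optional `random.Random` instance makes the random tie-break reproducible when a seed is needed.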

In one embodiment, the recombination event determination module may include:

a first determination submodule, configured to, when the source labels of the coronavirus genome data to be detected contain only main labels and no special label, determine that no recombination event exists if there is a single main label, and determine that a recombination event exists if there are several main labels;

a second determination submodule, configured to determine that a recombination event exists when the source labels of the coronavirus genome data to be detected contain both the special label and a main label;

a third determination submodule, configured to determine that no recombination event exists when the source labels of the coronavirus genome data to be detected contain only the special label.

In one embodiment, the parent sequence query module may include:

an iterative query submodule, configured to query the parent sequence of the query subsequence in the primary query library and end the step if the parent sequence data of the coronavirus genome data to be detected is obtained; otherwise, to query the parent sequence again in the secondary query library and end the step if parent sequence data is obtained; otherwise, to query the parent sequence once more in the tertiary query library; and, if parent sequence data still cannot be obtained, to use unknown-parent data as the parent sequence data of the coronavirus genome data to be detected.

FIG. 4 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include a processor 810, a communications interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communications interface 820 and the memory 830 communicate with one another through the communication bus 840. The processor 810 may call the logic instructions in the memory 830 to execute the method for constructing a multi-dimensional recombination detection corpus and/or the coronavirus recombination detection method for large-scale genome data provided above.

In addition, the logic instructions in the memory 830 may be implemented as software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the method for constructing a multi-dimensional recombination detection corpus and/or the coronavirus recombination detection method for large-scale genome data provided above.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for constructing a multi-dimensional recombination detection corpus and/or the coronavirus recombination detection method for large-scale genome data provided above.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the description of the embodiments above, those skilled in the art can clearly understand that each embodiment may be implemented by software together with the necessary general-purpose hardware platform, or by hardware. Based on this understanding, the above technical solution, in essence, or the part of it that contributes to the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for constructing a multi-dimensional recombination detection corpus, characterized by comprising:

Obtaining coronavirus genome data, wherein the coronavirus genome data includes multiple genome sequence data;

performing quality screening on the multiple genome sequence data according to the illegal character rate of each genome sequence data in the multiple genome sequence data to obtain multiple high-quality genome sequence data;

Clustering multiple high-quality genome sequence data to obtain multiple genome sequence clusters, wherein each genome sequence cluster in the multiple genome sequence clusters is labeled with a classification label, and each high-quality genome sequence data in each genome sequence cluster is labeled with a biological category label, and the biological category label includes a main label and a grouping number;

Screening the clustered multiple high-quality genome sequence data by length according to the sequence length of each high-quality genome sequence data in the multiple high-quality genome sequence data, and screening the multiple genome sequence clusters according to the number of high-quality genome sequence data in each genome sequence cluster that meet the sequence length requirement, to obtain multiple genome sequence clusters to be selected for library construction and the multiple high-quality genome sequence data to be selected for library construction contained therein;

According to a plurality of preset segmentation lengths, each high-quality genome sequence data in the plurality of high-quality genome sequence data to be selected for library construction is segmented into sequence fragments to obtain a plurality of base subsequence data corresponding to each preset segmentation length in the plurality of preset segmentation lengths, a coronavirus genome sequence fragment library is constructed for the plurality of base subsequence data corresponding to each preset segmentation length, and a source label is annotated for each base subsequence data in the plurality of base subsequence data according to the biological category label of each high-quality genome sequence data in the plurality of high-quality genome sequence data to be selected for library construction, wherein the plurality of preset segmentation lengths include a first preset segmentation length and a second preset segmentation length;

Grouping the multiple high-quality genome sequence data according to the main label of each high-quality genome sequence data in the multiple high-quality genome sequence data to obtain multiple main label groups, and building a library independently for each main label group in the multiple main label groups to form a primary query library, wherein each main label group includes multiple high-quality genome sequence data with the same main label;

Building a library from the multiple high-quality genome sequence data to be selected for library construction to obtain a secondary query library;

Building a library from the multiple high-quality genome sequence data after quality screening and clustering to obtain a tertiary query library;

Wherein the secondary query library is used to perform a prior knowledge query when the coronavirus genome data to be detected is received, and the primary query library, the secondary query library, and the tertiary query library are used to perform a parent sequence query on the query subsequence when there is a recombination event in the coronavirus genome data to be detected.
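As a non-limiting illustration, the fragment library construction step of claim 1 can be sketched in Python as follows. The function name `build_fragment_libraries`, the input layout, and the use of non-overlapping windows are assumptions made for this sketch; the claim does not state whether adjacent fragments overlap or how the library is stored.

```python
from collections import defaultdict


def build_fragment_libraries(sequences, segment_lengths):
    """Cut every sequence into fixed-length fragments and record source labels.

    sequences: dict mapping a sequence id to (sequence string, biological
        category label). The layout is hypothetical.
    segment_lengths: iterable of preset segmentation lengths.
    Returns {segment_length: {fragment: set of source labels}}.
    """
    libraries = {}
    for length in segment_lengths:
        library = defaultdict(set)
        for seq, label in sequences.values():
            # Assumption: non-overlapping windows; a trailing remainder shorter
            # than the segmentation length is dropped.
            for start in range(0, len(seq) - length + 1, length):
                library[seq[start:start + length]].add(label)
        libraries[length] = dict(library)
    return libraries


if __name__ == "__main__":
    demo = {"s1": ("ACGTACGT", "A.1"), "s2": ("ACGTTTTT", "B.2")}
    print(build_fragment_libraries(demo, [4, 2]))
```

In this sketch one library is produced per preset segmentation length, and a fragment shared by several source sequences keeps all of their labels.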

2. The method for constructing a multidimensional recombination detection corpus according to claim 1, characterized in that performing quality screening on the multiple genome sequence data according to the illegal character rate of each genome sequence data in the multiple genome sequence data to obtain multiple high-quality genome sequence data comprises:

According to the number of nucleotide characters and the number of non-nucleotide characters of each genome sequence data, the illegal character rate of each genome sequence data is obtained;

Determining genome sequence data whose illegal character rate is less than or equal to a preset illegal character rate threshold as high-quality genome sequence data, so as to screen out a plurality of high-quality genome sequence data from the plurality of genome sequence data;

Screening the clustered multiple high-quality genome sequence data by length according to the sequence length of each high-quality genome sequence data in the multiple high-quality genome sequence data, and screening the multiple genome sequence clusters according to the number of high-quality genome sequence data in each genome sequence cluster that meet the sequence length requirement, to obtain multiple genome sequence clusters to be selected for library construction and the multiple high-quality genome sequence data contained therein, comprises:

According to the sequence length of each high-quality genome sequence data in the multiple high-quality genome sequence data, the high-quality genome sequence data whose sequence length is greater than or equal to a preset base number threshold is determined as the high-quality genome sequence data that meets the sequence length requirement;

According to the number of high-quality genome sequence data meeting the sequence length requirement in each genome sequence cluster, each genome sequence cluster in which that number is greater than or equal to a preset number threshold is determined as a genome sequence cluster to be selected for library construction, and the high-quality genome sequence data contained in the genome sequence clusters to be selected for library construction are determined as high-quality genome sequence data to be selected for library construction, so as to obtain multiple genome sequence clusters to be selected for library construction and the multiple high-quality genome sequence data contained therein.
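As a non-limiting illustration, the two screening stages of claim 2 can be sketched in Python as follows. The claim states only that the illegal character rate is obtained from the counts of nucleotide and non-nucleotide characters; the sketch assumes the rate is the non-nucleotide count divided by the total count, and assumes that A, C, G and T are the legal characters. All function and parameter names are hypothetical.

```python
NUCLEOTIDES = frozenset("ACGT")  # assumption: the set of legal characters


def illegal_character_rate(seq):
    """Fraction of characters that are not nucleotides (assumed definition)."""
    if not seq:
        return 1.0  # treat an empty record as fully illegal
    illegal = sum(1 for ch in seq.upper() if ch not in NUCLEOTIDES)
    return illegal / len(seq)


def quality_screen(sequences, max_rate):
    """Keep sequences whose illegal character rate is <= the preset threshold."""
    return {sid: s for sid, s in sequences.items()
            if illegal_character_rate(s) <= max_rate}


def select_clusters(clusters, min_length, min_count):
    """Keep clusters holding at least min_count sequences of length >= min_length.

    clusters: dict mapping a cluster label to a list of sequence strings.
    Assumption: only the members that pass the length screen are retained.
    """
    selected = {}
    for label, members in clusters.items():
        long_enough = [s for s in members if len(s) >= min_length]
        if len(long_enough) >= min_count:
            selected[label] = long_enough
    return selected
```

Both comparisons are inclusive, matching the "less than or equal to" and "greater than or equal to" wording of the claim.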

3. A coronavirus recombination detection method for large-scale genomic data, characterized by comprising:

Receiving coronavirus genome data to be detected;

Performing a prior knowledge query according to the coronavirus genome data to be detected, using the secondary query library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2;

Performing fragment matching according to the coronavirus genome data to be detected, using the coronavirus genome sequence fragment library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2, and, according to the source labels of the matched base subsequence data and the queried prior knowledge, obtaining the source label of the coronavirus genome data to be detected and staining the corresponding segments in the coronavirus genome data to be detected, wherein the source label of the coronavirus genome data to be detected includes a main label and/or a special label, and the special label indicates that the source is unknown;

Judging whether there is a recombination event in the coronavirus genome data to be detected according to the number of main labels and special labels in the source labels of the coronavirus genome data to be detected;

When there is a recombination event in the coronavirus genome data to be detected, concatenating the base fragments of all stained segments corresponding to the main label in the coronavirus genome data to be detected, in the order of their positions in the coronavirus genome data to be detected, to obtain a query subsequence, and performing a parent sequence query on the query subsequence using the primary query library, the secondary query library, and the tertiary query library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2, to obtain parent sequence data of the coronavirus genome data to be detected;

Extracting a subsequence from each stained segment corresponding to the main label in the coronavirus genome data to be detected to obtain subsequence data, and locating the subsequence data in the parent sequence data using a sequence alignment method to obtain the recombinant fragment site data of the coronavirus genome data to be detected.
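As a non-limiting illustration, the last two steps of claim 3 can be sketched in Python as follows. The representation of a stained segment as a `(start, end, label)` tuple and both function names are assumptions. The claim calls for a sequence alignment method to locate each subsequence; the sketch substitutes exact substring search (`str.find`) purely to stay self-contained, so it finds only identical matches.

```python
def build_query_subsequence(genome, stained_segments, main_label):
    """Concatenate the stained segments of one main label in genome order.

    stained_segments: list of (start, end, label) half-open intervals
        (hypothetical representation).
    """
    intervals = sorted((s, e) for s, e, label in stained_segments
                       if label == main_label)
    return "".join(genome[s:e] for s, e in intervals)


def locate_segments(genome, stained_segments, main_label, parent):
    """Locate each stained segment of main_label inside the parent sequence.

    Returns (start, end, parent_position) tuples; parent_position is -1 when
    the segment does not occur verbatim. Exact search is a stand-in for the
    sequence alignment method named in the claim.
    """
    sites = []
    for start, end, label in sorted(stained_segments):
        if label != main_label:
            continue
        sites.append((start, end, parent.find(genome[start:end])))
    return sites
```

A practical implementation would replace `str.find` with an alignment tool that tolerates mismatches and gaps.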

4. The coronavirus recombination detection method for large-scale genome data according to claim 3, characterized in that performing a prior knowledge query according to the coronavirus genome data to be detected, using the secondary query library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2, comprises:

Calculating the similarity between the coronavirus genome data to be detected and the high-quality genome sequence data in the secondary query library to obtain the high-quality genome sequence data with the highest similarity to the coronavirus genome data to be detected, and using that high-quality genome sequence data and its biological category label as the prior knowledge of the coronavirus genome data to be detected.
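As a non-limiting illustration, the prior knowledge query of claim 4 can be sketched in Python as follows. The claim does not specify the similarity measure; the sketch assumes Jaccard similarity over k-mer sets as a simple stand-in, and the function names, the value of `k`, and the library layout are all hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prior_knowledge(query, library, k=3):
    """Return the (sequence, label) entry most similar to the query.

    library: non-empty list of (sequence string, biological category label).
    Assumption: similarity is k-mer Jaccard, not the measure used in the patent.
    """
    query_kmers = kmers(query, k)
    return max(library, key=lambda item: jaccard(query_kmers, kmers(item[0], k)))
```

The returned label then serves as the tie-breaking prior in the voting step of claim 5.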

5. The coronavirus recombination detection method for large-scale genome data according to claim 4, characterized in that performing fragment matching according to the coronavirus genome data to be detected, using the coronavirus genome sequence fragment library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2, and, according to the source labels of the matched base subsequence data and the queried prior knowledge, obtaining the source label of the coronavirus genome data to be detected and staining the corresponding segments in the coronavirus genome data to be detected, comprises:

Segmenting the coronavirus genome data to be detected into sequence fragments according to the first preset segmentation length to obtain a plurality of first sequence fragment data to be detected, matching each first sequence fragment data to be detected in the plurality of first sequence fragment data to be detected against the high-quality genome sequence data of the coronavirus genome sequence fragment library corresponding to the first preset segmentation length, using the biological category label of the high-quality genome sequence data matched by each first sequence fragment data to be detected as the matching label corresponding to that first sequence fragment data to be detected, and voting among the plurality of matching labels according to the matching results of the plurality of first sequence fragment data to be detected;

When there is a unique matching label with the most votes among the multiple matching labels, the unique matching label with the most votes is used as the optimal label, and the optimal label is used as the source label to label and stain the base sites corresponding to the first sequence fragment data to be detected that voted for the optimal label;

When there are multiple matching labels tied for the most votes among the multiple matching labels, if the biological category label in the prior knowledge is one of the matching labels tied for the most votes, the biological category label in the prior knowledge is used as the optimal label; otherwise, one of the matching labels tied for the most votes is randomly selected as the optimal label; and the optimal label is used as the source label to label and stain the base sites corresponding to the first sequence fragment data to be detected that voted for the optimal label;

When the matching results of the multiple first sequence fragment data to be detected show that none of them has a matching label, segmenting the coronavirus genome data to be detected according to the second preset segmentation length to obtain multiple second sequence fragment data to be detected, matching each second sequence fragment data to be detected in the multiple second sequence fragment data to be detected against the high-quality genome sequence data of the coronavirus genome sequence fragment library corresponding to the second preset segmentation length, using the biological category label of the high-quality genome sequence data matched by each second sequence fragment data to be detected as the matching label corresponding to that second sequence fragment data to be detected, and voting among the multiple matching labels according to the matching results of the multiple second sequence fragment data to be detected; if the matching results of the multiple second sequence fragment data to be detected still show that none of them has a matching label, marking the multiple second sequence fragment data to be detected with the special label, wherein the first preset segmentation length is greater than the second preset segmentation length;

For the stained segments in the coronavirus genome data to be detected, filtering out the stained segments whose base length is less than or equal to a preset stained length threshold;

For the unstained segments in the coronavirus genome data to be detected, if the base length of an unstained segment is less than or equal to the preset staining length threshold, the unstained segment is given the same staining as its surrounding context; if the base length of the unstained segment is greater than the preset staining length threshold, a new round of fragment matching and matching label voting is started for it.
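As a non-limiting illustration, the voting and tie-breaking rules of claim 5 can be sketched in Python as follows. The sketch assumes each query fragment has already been matched to at most one label (`None` when unmatched) and that fragments are non-overlapping windows of equal length; the function names and the `rng` parameter are hypothetical. The fallback to the second segmentation length and the staining length filters are omitted.

```python
import random
from collections import Counter


def vote_optimal_label(fragment_labels, prior_label, rng=random):
    """Pick the optimal label by majority vote over fragment matching labels.

    fragment_labels: one matching label per query fragment, or None when the
        fragment matched nothing.
    prior_label: biological category label from the prior knowledge query.
    Returns None when no fragment matched at all.
    """
    votes = Counter(label for label in fragment_labels if label is not None)
    if not votes:
        return None
    top = max(votes.values())
    leaders = sorted(label for label, n in votes.items() if n == top)
    if len(leaders) == 1:
        return leaders[0]           # unique winner
    if prior_label in leaders:
        return prior_label          # tie broken by prior knowledge
    return rng.choice(leaders)      # tie broken at random


def stain(fragment_labels, optimal_label, segment_length):
    """Half-open base intervals of the fragments that voted for optimal_label."""
    return [(i * segment_length, (i + 1) * segment_length)
            for i, label in enumerate(fragment_labels)
            if label == optimal_label]
```

When `vote_optimal_label` returns `None`, the caller would retry with the shorter second segmentation length and, failing that, assign the special label.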

6. The coronavirus recombination detection method for large-scale genome data according to claim 5, characterized in that judging whether there is a recombination event in the coronavirus genome data to be detected according to the number of main labels and special labels in the source labels of the coronavirus genome data to be detected comprises:

In the case where there is no special label in the source labels of the coronavirus genome data to be detected and only main labels exist, when there is only one main label in the source labels of the coronavirus genome data to be detected, it is determined that there is no recombination event in the coronavirus genome data to be detected; when the source labels of the coronavirus genome data to be detected include multiple main labels, it is determined that there is a recombination event in the coronavirus genome data to be detected;

When the source labels of the coronavirus genome data to be detected contain both the special label and a main label, it is determined that there is a recombination event in the coronavirus genome data to be detected;

When only the special label exists in the source labels of the coronavirus genome data to be detected, it is determined that there is no recombination event in the coronavirus genome data to be detected.
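As a non-limiting illustration, the decision rule of claim 6 can be sketched in Python as follows. The `UNKNOWN` marker standing for the special label and the function name are hypothetical.

```python
UNKNOWN = "UNKNOWN"  # hypothetical marker for the special (unknown source) label


def has_recombination(source_labels):
    """Decide whether the source labels indicate a recombination event.

    source_labels: iterable of labels assigned to the stained segments.
    """
    labels = set(source_labels)
    main_labels = labels - {UNKNOWN}
    has_special = UNKNOWN in labels
    if has_special and main_labels:
        return True                 # special label plus at least one main label
    if has_special:
        return False                # only the special label
    return len(main_labels) > 1     # one main label: no event; several: event
```

For example, `has_recombination(["A", "A"])` is `False`, while `has_recombination(["A", "B"])` and `has_recombination(["A", UNKNOWN])` are both `True`.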

7. The coronavirus recombination detection method for large-scale genome data according to any one of claims 3 to 6, characterized in that concatenating the base fragments of all stained segments corresponding to the main label in the coronavirus genome data to be detected, in the order of their positions in the coronavirus genome data to be detected, to obtain a query subsequence, and performing a parent sequence query on the query subsequence using the primary query library, the secondary query library, and the tertiary query library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2, to obtain the parent sequence data of the coronavirus genome data to be detected, comprises:

The query subsequence is queried for a parent sequence using the primary query library. When the parent sequence data of the coronavirus genome data to be detected is obtained, the step is terminated. Otherwise, the query subsequence is queried for a parent sequence again using the secondary query library. When the parent sequence data of the coronavirus genome data to be detected is obtained, the step is terminated. Otherwise, the query subsequence is queried for a parent sequence again using the tertiary query library. If the parent sequence data of the coronavirus genome data to be detected is still not obtained, the unknown parent data is used as the parent sequence data of the coronavirus genome data to be detected.
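As a non-limiting illustration, the tiered fallback of claim 7 can be sketched in Python as follows. The claim does not specify how a single library is searched; `exact_search` is a hypothetical stand-in that uses substring containment, and the sketch assumes the primary query library for the relevant main label has already been chosen by the caller.

```python
UNKNOWN_PARENT = "unknown parent"  # hypothetical marker for an unresolved parent


def exact_search(library, query):
    """Stand-in search: id of the first sequence containing the query, else None.

    library: dict mapping a sequence id to a sequence string.
    """
    for seq_id, seq in library.items():
        if query in seq:
            return seq_id
    return None


def query_parent(query, primary, secondary, tertiary, search=exact_search):
    """Query the three libraries in order and stop at the first hit."""
    for library in (primary, secondary, tertiary):
        hit = search(library, query)
        if hit is not None:
            return hit
    return UNKNOWN_PARENT
```

Passing a different `search` callable would let the same fallback order be reused with a real similarity or alignment search.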

8. A coronavirus recombination detection system for large-scale genomic data, comprising:

A data receiving module, used to: receive coronavirus genome data to be detected;

A prior knowledge query module, used to: perform a prior knowledge query according to the coronavirus genome data to be detected, using the secondary query library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2;

A sequence fragment matching module, used to: perform fragment matching according to the coronavirus genome data to be detected, using the coronavirus genome sequence fragment library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2, and, according to the source labels of the matched base subsequence data and the queried prior knowledge, obtain the source label of the coronavirus genome data to be detected and stain the corresponding segments in the coronavirus genome data to be detected, wherein the source label of the coronavirus genome data to be detected includes a main label and/or a special label, and the special label indicates that the source is unknown;

A recombination event determination module, used to: determine whether there is a recombination event in the coronavirus genome data to be detected according to the number of main labels and special labels in the source labels of the coronavirus genome data to be detected;

A parent sequence query module, used to: when there is a recombination event in the coronavirus genome data to be detected, concatenate the base fragments of all stained segments corresponding to the main label in the coronavirus genome data to be detected, in the order of their positions in the coronavirus genome data to be detected, to obtain a query subsequence, and perform a parent sequence query on the query subsequence using the primary query library, the secondary query library, and the tertiary query library obtained by the method for constructing a multidimensional recombination detection corpus according to claim 1 or 2, to obtain parent sequence data of the coronavirus genome data to be detected;

A recombinant fragment site positioning module, used to: extract a subsequence from each stained segment corresponding to the main label in the coronavirus genome data to be detected to obtain subsequence data, and locate the subsequence data in the parent sequence data using a sequence alignment method to obtain the recombinant fragment site data of the coronavirus genome data to be detected.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the method for constructing a multidimensional recombination detection corpus as described in claim 1 or 2 and/or the coronavirus recombination detection method for large-scale genomic data as described in any one of claims 3 to 7 is implemented.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the method for constructing a multidimensional recombination detection corpus as described in claim 1 or 2 and/or the coronavirus recombination detection method for large-scale genomic data as described in any one of claims 3 to 7.