CN114580557A - Method and device for determining the similarity of documents based on semantic analysis - Google Patents
- Publication number: CN114580557A
- Application number: CN202210240186.2A
- Authority: CN (China)
- Prior art keywords: document, semantic analysis, word, documents, similarity
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/22—Matching criteria, e.g. proximity measures (pattern recognition)
- G06F16/31—Indexing; data structures therefor; storage structures (information retrieval of unstructured textual data)
- G06F16/3344—Query execution using natural language analysis
- G06F16/353—Clustering; classification into predefined classes
- G06F18/2431—Classification techniques relating to multiple classes
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/295—Named entity recognition
- G06F40/30—Semantic analysis
- G06N20/00—Machine learning
Description
Technical Field
The present invention relates to the technical field of artificial intelligence, and in particular to a method and device for determining document similarity based on semantic analysis.
Background Art
This section is intended to provide background or context for the embodiments of the invention recited in the claims. The descriptions herein are not admitted to be prior art merely by inclusion in this section.
At present, when determining document similarity, the prior art presets different weights for different parts of a document according to manual experience, and then takes the weighted sum of the per-part similarities under these fixed, manually set weights as the document similarity. Because these existing methods set weights empirically, the weights are inaccurate, which in turn makes the determined document similarity inaccurate.
Summary of the Invention
An embodiment of the present invention provides a method for determining document similarity based on semantic analysis, used to accurately determine the weights of different parts of a document based on semantic analysis and thereby accurately determine document similarity. The method includes:
dividing each document to be compared into multiple parts;
performing semantic analysis on each part to obtain a semantic analysis result for each part;
determining, according to the semantic analysis result of each part, a weight value for each part of each document to be compared;
obtaining a weighted average result for each document to be compared according to the weight values of its parts;
determining the similarity between the documents to be compared according to their weighted average results.
An embodiment of the present invention further provides a device for determining document similarity based on semantic analysis, used to accurately determine the weights of different parts of a document based on semantic analysis and thereby accurately determine document similarity. The device includes:
a division unit, configured to divide each document to be compared into multiple parts;
a semantic analysis unit, configured to perform semantic analysis on each part to obtain a semantic analysis result for each part;
a weight value determination unit, configured to determine a weight value for each part of each document to be compared according to the semantic analysis result of each part;
a processing unit, configured to obtain a weighted average result for each document to be compared according to the weight values of its parts;
a similarity determination unit, configured to determine the similarity between the documents to be compared according to the weighted average result of each document.
An embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above method for determining document similarity based on semantic analysis.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method for determining document similarity based on semantic analysis.
An embodiment of the present invention further provides a computer program product comprising a computer program which, when executed by a processor, implements the above method for determining document similarity based on semantic analysis.
In the embodiments of the present invention, compared with the prior-art technical solution in which different fixed weights are empirically preset for different parts of a document to determine document similarity, where the inaccurate weight setting makes the determined similarity inaccurate as well, the solution for determining document similarity based on semantic analysis divides each document to be compared into multiple parts; performs semantic analysis on each part to obtain its semantic analysis result; determines a weight value for each part of each document from these results; obtains a weighted average result for each document from these weight values; and determines the similarity between the documents to be compared from their weighted average results. In this way, the weights of different parts of a document can be accurately determined based on semantic analysis, and the document similarity can in turn be accurately determined.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic flowchart of a method for determining document similarity based on semantic analysis in an embodiment of the present invention;
FIG. 2 is a schematic flowchart of performing semantic analysis on each part to obtain a semantic analysis result for each part in an embodiment of the present invention;
FIG. 3 is a schematic flowchart of performing semantic analysis on each part to obtain a semantic analysis result for each part in another embodiment of the present invention;
FIG. 4 is a schematic flowchart of a document preprocessing process in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a device for determining document similarity based on semantic analysis in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a semantic analysis unit in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a semantic analysis unit in another embodiment of the present invention;
FIG. 8 is a schematic diagram of a training process for specialized-domain Chinese word segmentation in an embodiment of the present invention.
Detailed Description
To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the accompanying drawings. The exemplary embodiments of the present invention and their descriptions are used here to explain the invention, not to limit it.
FIG. 1 is a schematic flowchart of the method for determining document similarity based on semantic analysis in an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step 101: dividing each document to be compared into multiple parts;
Step 102: performing semantic analysis on each part to obtain a semantic analysis result for each part;
Step 103: determining, according to the semantic analysis result of each part, a weight value for each part of each document to be compared;
Step 104: obtaining a weighted average result for each document to be compared according to the weight values of its parts;
Step 105: determining the similarity between the documents to be compared according to their weighted average results.
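The five steps above can be wired together as a minimal sketch. This is not the patented implementation: the `part_weights` function below is a placeholder stand-in for the semantic-analysis-based weighting of steps 102–103 (it simply weights each part by its number of distinct terms), and cosine similarity over bags of words stands in for the per-part comparison.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def part_weights(parts: dict) -> dict:
    """Placeholder for steps 102-103: derive a weight per part from a
    'semantic analysis result'. Here we weight by the number of distinct
    terms, normalized to sum to 1 (an assumption, not the patent's method)."""
    raw = {name: len(set(text.split())) or 1 for name, text in parts.items()}
    total = sum(raw.values())
    return {name: v / total for name, v in raw.items()}

def document_similarity(doc_a: dict, doc_b: dict) -> float:
    """Steps 104-105: weighted combination of per-part similarities."""
    weights_a, weights_b = part_weights(doc_a), part_weights(doc_b)
    score = 0.0
    for name in doc_a.keys() & doc_b.keys():
        w = (weights_a[name] + weights_b[name]) / 2  # average the two documents' weights
        sim = cosine(Counter(doc_a[name].split()), Counter(doc_b[name].split()))
        score += w * sim
    return score
```

Here a document is represented as a dict mapping part names (e.g. "title", "abstract", "claims") to the text of that part.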
In operation, the method for determining document similarity based on semantic analysis provided by the embodiments of the present invention divides each document to be compared into multiple parts (which may be called sub-documents); performs semantic analysis on each part to obtain its semantic analysis result; determines, from these results, a weight value for each part of each document to be compared; obtains a weighted average result for each document from these weight values; and determines the similarity between the documents to be compared from their weighted average results.
Compared with the prior-art approach of empirically presetting different fixed weights for different parts of a document, whose inaccurate weights make the resulting document similarity inaccurate as well, the method provided by the embodiments of the present invention can accurately determine the weights of different parts of a document based on semantic analysis and thereby accurately determine document similarity. The method is described in detail below.
I. First, step 101 above is described.
In specific implementation, the documents in the embodiments of the present invention may be patent documents, trademark documents, non-patent literature, and so on. Taking patent documents as an example, the multiple parts into which each document to be compared is divided may be the title of the invention, the abstract, the claims, and the description.
II. Next, step 102 above is described.
In one embodiment, as shown in FIG. 2, performing semantic analysis on each part to obtain a semantic analysis result for each part may include the following steps:
Step 1021: performing word segmentation on each part to obtain multiple keywords corresponding to each part;
Step 1022: extracting, according to the multiple keywords corresponding to each part and a preset document feature extraction strategy, multiple types of key features from each part to form a feature set corresponding to each part;
Step 1024: performing word-level, syntax-level and discourse-level semantic analysis on each part according to its feature set, to obtain the semantic analysis result of each part.
In specific implementation, in step 1021 above, document word segmentation proceeds as follows: based on a subject-term lexicon, the document is segmented into words; function words lacking substantive meaning, rarely occurring low-frequency words, and overly common high-frequency words are removed, finally yielding the multiple keywords corresponding to each part.
In specific implementation, a pre-trained word segmentation model may be used for the segmentation. To better understand how the word segmentation of the present invention is carried out, this model is introduced below.
1. Evaluation metrics.
Chinese word segmentation is evaluated with precision and recall, where:
Precision = number of correctly segmented words / total number of segmented words;
Recall = number of correctly segmented words / total number of words that should be segmented.
The composite performance metric is the F-measure:
Fβ = (β² + 1) × Precision × Recall / (β² × Precision + Recall),
where β is a weighting factor; treating precision and recall equally, i.e. taking β = 1, gives the most commonly used F1-measure:
F1 = 2 × Precision × Recall / (Precision + Recall).
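The metric definitions above can be computed by comparing word spans (character-offset intervals) between a predicted segmentation and a gold segmentation of the same string; a small sketch:

```python
def segmentation_prf(predicted: list, gold: list) -> tuple:
    """Precision, recall, and F1 for word segmentation, computed over word
    spans, as in the definitions above. Both inputs must segment the same
    character string."""
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))  # each word -> its (start, end) offsets
            pos += len(w)
        return out
    p_spans, g_spans = spans(predicted), spans(gold)
    correct = len(p_spans & g_spans)  # correctly segmented words
    precision = correct / len(p_spans) if p_spans else 0.0
    recall = correct / len(g_spans) if g_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```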
The following takes a Chinese word segmentation model as the example.
2. Training method for the Chinese word segmentation model.
Commonly used word segmentation methods fall into two categories: dictionary-based and statistics-based. Because statistics-based segmentation considerably outperforms dictionary-based segmentation in resolving ambiguities and recognizing out-of-vocabulary words, it has become the mainstream approach in recent years. Commonly used statistical segmentation models include hidden Markov models, conditional random fields, maximum entropy models, and neural network models. However, when the test corpus and the training corpus come from different domains, segmentation accuracy and out-of-vocabulary recognition degrade sharply. Therefore, segmenting text in a specialized domain with a statistics-based method requires annotated training corpora for that domain. Yet annotating such corpora consumes considerable manpower and resources, and at present few specialized domains have completed this annotation work.
Domain terminology embodies and carries the core knowledge of a subject area, and a specialized-domain dictionary is a collection of such terminology. Methods for building one can be grouped into: building from existing dictionary resources, building from corpora with statistical methods, and building by mining encyclopedias. The inventors propose a construction method that combines dictionary resources with encyclopedias: new terms are obtained from encyclopedias, while existing dictionary resources improve completeness.
The training process for specialized-domain Chinese word segmentation can be as shown in FIG. 8:
(1) Build a domain dictionary by mining encyclopedia entries and combining them with existing dictionary resources.
(2) Take the output of the basic natural language processing model as the first-pass segmentation, apply reverse maximum matching against the domain dictionary to this result, and combine it with the ambiguity resolution rules to obtain the second-pass segmentation.
After the first-pass result is obtained from the basic natural language processing tokenizer, applying reverse maximum matching against the domain dictionary improves the domain adaptability of the segmentation. The reverse maximum matching algorithm extracts character strings from the segmentation result from right to left and looks them up in the domain dictionary; on a successful match, the string is cut out as a word. For example, for "岩棉复合板更换为苯板复合板。" ("replace the rock wool composite board with a polystyrene-board composite board"), the basic tokenizer produces "岩/棉/复合/板/更换/为/苯/板/复合/板/。". The domain dictionary contains "岩棉" (rock wool), "苯板" (polystyrene board) and "复合板" (composite board). After reverse maximum matching, the adjusted segmentation is "岩棉/复合板/更换/为/苯板/复合板/。".
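A minimal sketch of the reverse maximum matching step: it rescans the characters of the first-pass result from right to left, always preferring the longest dictionary match. The patent's additional constraint that a match must not cross into an existing word, and the disambiguation rules below, are omitted here; `max_len` is an assumed cap on dictionary word length.

```python
def reverse_max_match(tokens: list, lexicon: set, max_len: int = 4) -> list:
    """Re-segment a first-pass token list by reverse maximum matching
    against a domain dictionary (simplified sketch)."""
    text = "".join(tokens)
    out, end = [], len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first.
        for size in range(min(max_len, end), 0, -1):
            cand = text[end - size:end]
            if size == 1 or cand in lexicon:  # fall back to a single character
                out.append(cand)
                end -= size
                break
    out.reverse()  # we collected words right-to-left
    return out
```

With the document's example, the first-pass tokens "岩/棉/复合/板/…" are re-cut so that "岩棉", "苯板" and "复合板" come out as whole words.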
Directly adjusting the first-pass segmentation with reverse maximum matching against the domain dictionary often runs into ambiguity. For example, for "在工期满后" ("after the construction period expires"), the first-pass result is "在/工/期满/后". Supposing that both "工期" (construction period) and "期满" (expiration) are in the domain dictionary, there are two possible outcomes: "在/工/期满/后" and "在/工期/满/后". Direct reverse maximum matching yields the former, whereas the correct segmentation is the latter. A disambiguation algorithm is therefore needed to improve segmentation accuracy.
Design principle: when the reverse maximum matching algorithm adjusts the first-pass segmentation, one of the following occurs: the first-pass result is kept unchanged; words in the first-pass result are merged without affecting other words; or words in the first-pass result are re-cut to form new words. The proposed disambiguation algorithm mainly targets the third case: it decides whether the string currently matched in the domain dictionary should be merged into a new word. The algorithm mainly considers the change in the number of words, the change in the number of single-character words, and positional features of characters within words. If the adjustment increases the number of words, no adjustment is made. If it increases the number of single-character words, no adjustment is made. The positional features refer to six character lists concerning first, middle and last characters: the list of first characters frequently cut out as single-character words (L1); the list of first characters rarely cut out as single-character words (L2); the list of middle characters that frequently combine with a first character to form a word (L3); the list of middle characters that frequently combine with a last character to form a word (L4); the list of last characters frequently cut out as single-character words (L5); and the list of last characters rarely cut out as single-character words (L6).
3. Training procedure for the Chinese word segmentation model.
Define the first-pass segmentation result S: a1a2a3/a4/…/aM-1aM. W0 denotes the word currently matched in the domain dictionary; HD(W0) denotes the first character of W0; TL(W0) its last character; LEN(W0) its length; and minSub(S, W0) the smallest subsequence of S containing W0 that begins and ends at segmentation boundaries. combine(S, W0) denotes the sequence obtained by merging W0 within S. LMatch(ai) denotes the word obtained by reverse maximum matching in the domain dictionary starting at ai, provided that the word's left segmentation point does not fall inside an existing word. new(S, W0) denotes the set of newly produced sequences of two or more characters after merging W0 within S. Suppose the first-pass result is a1a2a3/a4a5/a6a7a8 and W0 is a5a6; then HD(W0) is a5, TL(W0) is a6, minSub(S, W0) is a4a5/a6a7a8, combine(minSub(S, W0), W0) is a4/a5a6/a7a8, and the leftmost element of combine(minSub(S, W0), W0) is the single character a4. If a2a3a4 is in the domain dictionary, LMatch(a4) is empty, because its left segmentation point falls inside the existing word a1a2a3. new(S, W0) = {a7a8}.
When LEN(W0) = 3, i.e. the matched word is three characters long, priority is given to whether single characters are produced on the left or right and to their positional features; next, the case where minSub(S, W0) consists of a single character and a two-character word is considered; finally, changes in the number of single-character words and in the total number of words are considered. The specific rules are as follows.
规则1:如果combine(minSub(S,W0),W0)最左端是单字,记为aL,满足LMatch(aL)为空,并且combine(minSub(S,W0),W0)最右端不是单字,那么,如果aL∈L1,且combine(minSub(S,W0),W0)的词数目小于等于minSub(S,W0),合并W0。Rule 1: If the leftmost element of combine(minSub(S,W0),W0) is a single character, denoted aL, LMatch(aL) is empty, and the rightmost element of combine(minSub(S,W0),W0) is not a single character, then, if aL ∈ L1 and the number of words in combine(minSub(S,W0),W0) is less than or equal to that of minSub(S,W0), merge W0.
规则2:如果combine(minSub(S,W0),W0)最右端是单字(aR),最左端不是单字,或者combine(minSub(S,W0),W0)最左端是单字(aL),但是LMatch(aL)不为空,那么,如果aR在L5字表中,且combine(minSub(S,W0),W0)的词数目小于等于minSub(S,W0),合并W0。Rule 2: If the rightmost element of combine(minSub(S,W0),W0) is a single character (aR) and the leftmost element is not a single character, or the leftmost element of combine(minSub(S,W0),W0) is a single character (aL) but LMatch(aL) is not empty, then, if aR is in the L5 character list and the number of words in combine(minSub(S,W0),W0) is less than or equal to that of minSub(S,W0), merge W0.
规则3:minSub(S,W0)是A/BC,LMatch(A)不为空,不合并W0,且下一次匹配从A开始。Rule 3: minSub(S, W0) is A/BC, LMatch(A) is not empty, W0 is not merged, and the next match starts from A.
规则4:如果combine(minSub(S,W0),W0)中单字词数目为0,且词数目小于等于minSub(S,W0),同时,new(S,W0)中的序列在NLPIR词典或领域词典中,则合并W0。combine(minSub(S,W0),W0)最左端如果是单字词aL,且LMatch(aL)不为空,此时aL不计入单字词。Rule 4: If the number of single-character words in combine(minSub(S,W0),W0) is 0, the number of words is less than or equal to that of minSub(S,W0), and at the same time every sequence in new(S,W0) is in the NLPIR dictionary or the domain dictionary, then merge W0. If the leftmost element of combine(minSub(S,W0),W0) is a single-character word aL and LMatch(aL) is not empty, then aL is not counted as a single-character word.
规则5:如果W0是AB,minSub(S,W0)是A/B*,*表示长度大于2的字串,那么,如果A在L1中,不合并W0,否则,合并W0。Rule 5: If W0 is AB, minSub(S, W0) is A/B*, and * represents a string with a length greater than 2, then if A is in L1, do not merge W0, otherwise, merge W0.
规则6:如果W0是AB,minSub(S,W0)是*A/B,*表示长度大于2的字串,那么,如果B在L5中,不合并W0,否则,合并W0。Rule 6: If W0 is AB, minSub(S, W0) is *A/B, and * represents a string with a length greater than 2, then if B is in L5, do not merge W0, otherwise, merge W0.
规则7:规则2~6之外,均不合并W0。Rule 7: Except for rules 2 to 6, W0 is not merged.
以上规则按顺序依次执行。在实际效果验证时,相较于直接使用基础自然语言处理模型中的分词器,我们训练出的领域中文切词模型在准确率、召回率、F值上都有大幅提升。The above rules are applied in order. In actual verification, compared with directly using the tokenizer of a general-purpose natural language processing model, our trained domain-specific Chinese word segmentation model achieves large improvements in precision, recall, and F-score.
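The bookkeeping behind these rules can be illustrated with a minimal sketch. The following is not the patented implementation: it only shows minSub(S,W0) (the shortest run of words whose concatenation contains W0) and combine(...) (re-splitting that run so that W0 becomes one word), with the L1/L5 character lists and LMatch omitted. Class and method names are illustrative.

```java
import java.util.*;

public class SegmentMerge {
    // minSub(S, W0): the smallest contiguous run of words of S whose
    // concatenation contains w0, searched in order of increasing length.
    public static List<String> minSub(List<String> s, String w0) {
        for (int len = 1; len <= s.size(); len++)
            for (int i = 0; i + len <= s.size(); i++) {
                String cat = String.join("", s.subList(i, i + len));
                if (cat.contains(w0)) return new ArrayList<>(s.subList(i, i + len));
            }
        return Collections.emptyList();
    }

    // combine(sub, W0): re-segment the characters of sub so that w0 is one word;
    // whatever is left on either side becomes its own segment.
    public static List<String> combine(List<String> sub, String w0) {
        String chars = String.join("", sub);
        int pos = chars.indexOf(w0);
        List<String> out = new ArrayList<>();
        if (pos > 0) out.add(chars.substring(0, pos));                 // left remainder
        out.add(w0);                                                   // merged word W0
        if (pos + w0.length() < chars.length())
            out.add(chars.substring(pos + w0.length()));               // right remainder
        return out;
    }
}
```

With S = abc/de/fgh and W0 = "ef", minSub returns de/fgh and combine re-splits it as d/ef/gh, mirroring the a4/a5a6/a7a8 example in the text.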
具体实施时,在上述步骤1022中,特征抽取(提取):During specific implementation, in the above step 1022, feature extraction is performed:
在一个实施例中,所述预设的文献特征提取策略可以包括:根据关键词在文献中出现的频率,关键词的逆文档频率,关键词的词性,关键词是否为专业词,关键词出现在文献中的位置,关键词的text-rank值,关键词的信息熵值,关键词的词向量与整体偏差值,关键词长度,关键词作为句子的成分,关键词是否再被切分成子关键词,关键词在文献中第一次出现与最后一次出现位置的长度,关键词分布偏差的其中之一或任意组合,进行文献特征提取。In one embodiment, the preset document feature extraction strategy may include performing document feature extraction according to one or any combination of: the frequency of the keyword in the document, the inverse document frequency of the keyword, the part of speech of the keyword, whether the keyword is a professional term, the position where the keyword appears in the document, the text-rank value of the keyword, the information entropy of the keyword, the deviation between the keyword's word vector and the overall vector, the keyword length, the syntactic role of the keyword in its sentence, whether the keyword can be further segmented into sub-keywords, the distance between the first and last occurrence of the keyword in the document, and the deviation of the keyword's distribution.
具体实施时,上述文献特征提取策略可以提高特征提取的准确率,从而提高语义分析的精度,进而提高每一待比较文献的每一部分的权重值确定的精度,最终提高文献相似度确定的精度。In specific implementation, the above document feature extraction strategy can improve the accuracy of feature extraction, thereby improving the accuracy of semantic analysis, thereby improving the accuracy of determining the weight value of each part of each document to be compared, and finally improving the accuracy of document similarity determination.
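A few of the listed keyword features lend themselves to a direct sketch. The following is a hedged illustration, not the embodiment's exact formulas: it computes the keyword's frequency in a document, its first-occurrence position, and the distance between its first and last occurrence; names are illustrative.

```java
public class KeywordFeatures {
    // Frequency of the keyword in the document (non-overlapping occurrences).
    public static int frequency(String doc, String kw) {
        int count = 0, idx = 0;
        while ((idx = doc.indexOf(kw, idx)) >= 0) { count++; idx += kw.length(); }
        return count;
    }

    // Character position where the keyword first appears (-1 if absent).
    public static int firstPosition(String doc, String kw) {
        return doc.indexOf(kw);
    }

    // Distance between the first and the last occurrence, one of the listed features.
    public static int occurrenceSpan(String doc, String kw) {
        int first = doc.indexOf(kw), last = doc.lastIndexOf(kw);
        return first < 0 ? -1 : last - first;
    }
}
```

In practice each such value would become one component of the feature vector fed into the semantic analysis of step 1022.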
在一个实施例中,在关键词能再被切分到子关键词时,所述预设的文献特征提取策略还可以包括:根据子关键词的词频-逆文档频率,子关键词的词性,子关键词是否为专业词的其中之一或任意组合进行文献特征提取。In one embodiment, when a keyword can be further segmented into sub-keywords, the preset document feature extraction strategy may further include performing document feature extraction according to one or any combination of: the term frequency-inverse document frequency of the sub-keywords, the part of speech of the sub-keywords, and whether the sub-keywords are professional terms.
具体实施时,上述考虑了子关键词情况的文献特征提取策略可以进一步提高特征提取的准确率,从而进一步提高语义分析的精度,进而进一步提高每一待比较文献的每一部分的权重值确定的精度,最终进一步提高文献相似度确定的精度。During specific implementation, the above document feature extraction strategy that takes sub-keywords into account can further improve the accuracy of feature extraction, and thus the accuracy of the semantic analysis, of the weight value determined for each part of each document to be compared, and ultimately of the document similarity determination.
综上所述,为了便于理解本发明如何实施,下面以下面表格1的形式进一步说明文献特征提取策略。To sum up, in order to facilitate understanding of how the present invention is implemented, the document feature extraction strategy is further described below in the form of Table 1 below.
具体实施时,在分析了专利文献的数据特征和业务场景之后,我们提出了利用自然语言处理技术从专利文献中提取特征的具体策略,见下表1:In specific implementation, after analyzing the data features and business scenarios of patent documents, we propose a specific strategy for extracting features from patent documents using natural language processing technology, as shown in Table 1 below:
表1Table 1
根据上表1所示,与现有技术中根据经验为文献的不同部分设置固定权重来确定文献相似度的方案相比较,本发明实施例相当于基于语义分析调整了权重,从而得到更加准确的权重值,进而提高文献相似度确定的精度。As shown in Table 1 above, compared with the prior-art scheme of setting fixed, experience-based weights for different parts of a document to determine document similarity, the embodiment of the present invention in effect adjusts the weights based on semantic analysis, thereby obtaining more accurate weight values and improving the accuracy of document similarity determination.
在一个实施例中,所述多个类型的关键特征可以包括:文献静态特征,文献与查询关联的特征,以及查询的特征。In one embodiment, the plurality of types of key characteristics may include: document static characteristics, characteristics of documents associated with the query, and characteristics of the query.
具体实施时,特征可以是算法、模型的养料之源。特征选择的好坏直接关系到算法训练学习出的模型的效果。与传统的文本分类不同,专利领域的MLR输出的是给定query(查询或检索特征)的文档集合的排序,不仅要考虑文档自身的特征,还要考虑query与文档关联关系的特征。综合来说,专利领域的MLR需要考虑三个方面的特征(文献静态特征,文献与查询关联的特征,以及查询的特征):During specific implementation, features are the raw material that feeds algorithms and models. The quality of feature selection directly determines the effectiveness of the model the algorithm learns. Unlike traditional text classification, MLR in the patent field outputs a ranking of the document collection for a given query (query or retrieval features), which must consider not only the features of the documents themselves but also the features of the relationship between the query and the documents. In summary, MLR in the patent field needs to consider three kinds of features (document static features, document-query features, and query features):
专利信息的特征选取是指通过对专利文献的内在特征,即对专利技术内容进行归纳、演绎、分析、综合,以及抽象与概括等,以达到把握某一技术发展状况的目的。具体地说,根据专利文献提供的技术主题、专利国别、专利发明人、专利受让人、专利分类号、专利申请日、专利授权日和专利引证文献等技术内容,广泛进行信息搜集,对搜集的内容进行阅读和摘记等,在此基础上,进一步对这些信息进行分类、比较和分析等研究活动,形成有机的信息集合。进而有重点地研究那些有代表性、关键性和典型性的专利文献,最终找出专利信息之间的内在的甚至是潜在的相互关系,从而形成一个比较完整的认识。The feature selection of patent information refers to grasping the development status of a certain technology through induction, deduction, analysis, synthesis, abstraction and generalization of the intrinsic features of patent documents, i.e., of their technical content. Specifically, based on the technical content provided by patent documents, such as the technical subject, patent country, patent inventor, patent assignee, patent classification number, patent application date, patent grant date and patent citations, information is collected extensively, and the collected content is read and excerpted. On this basis, research activities such as classification, comparison and analysis of the information are carried out to form an organic collection of information. Representative, key and typical patent documents are then studied with emphasis, and finally the internal and even latent relationships among patent information are identified, so as to form a relatively complete understanding.
1、文档本身的静态特征(文献静态特征),包括文档的文本特征,如带权重的词向量,文档不同域(标题、著录项目、摘要、权利要求、专利说明书、专利正文等)的TF、IDF、BM25和其他语言模型得分,也包括文档的质量分、专利的权利要求等重要性得分。关于文档的质量分,搜索根据不同的专利分类有不同的计算指标,比如机械领域的专利文件的质量分计算除了要考虑专利本身的文本丰富度,更多的还要考虑专利的各种相关内容如词性的变化、词的位置、机械领域句法结构等。1. The static features of the document itself (document static features), including its text features, such as weighted word vectors; the TF, IDF, BM25 and other language-model scores for the different fields of the document (title, bibliographic data, abstract, claims, patent specification, patent body, etc.); and importance scores such as the document quality score and the claims score. Regarding the document quality score, the search uses different calculation indicators for different patent classifications. For example, the quality score of a patent document in the mechanical field considers not only the textual richness of the patent itself but also various related aspects of the patent, such as part-of-speech variation, word position, and the syntactic structures typical of the mechanical field.
2、文档和query关联的特征(文献与查询关联的特征),比如query对应文档的TF-IDF score,BM25 score等。2. Features relating the document to the query (document-query features), such as the TF-IDF score and BM25 score of the document for the query.
3、query本身的特征(查询的特征),比如政策文本特征,带权重的专利领域专业词向量,query长度,query所属的分类,query的BM25的sum/avg/min/max/median分数,query上个月的热度等。3. The features of the query itself (query features), such as policy text features, weighted patent-domain professional word vectors, the query length, the classification to which the query belongs, the sum/avg/min/max/median of the query's BM25 scores, the query's popularity over the last month, etc.
在query与文档的特征工程中,除了从词法上分析,还需要从"被阐述"的词法所"真正想表达"的语义即概念上进行分析提取。比如一词多义,同义词和近义词,不同的场景下同一个词表达不同的意思,不同场景下不同的词也可能表达相同的意思。LSA(隐语义分析)是处理这类问题的著名技术,其主要思想是映射高维向量空间到低维的潜在语义空间或概念空间,也即进行降维。具体做法是将词项文档矩阵做奇异值分解(SVD)。奇异值分解中,C是以文档为行,词项terms为列的矩阵(假设M x N),元素为term的词频-逆文档频率值。C被分解成3个小矩阵相乘;U的每一列表示一个主题,其中的每个非零元素表示一个主题与一篇文章的相关性,数值越大越相关;V表示keyword与所有term的相关性;∑表示文章主题和keyword之间的相关性。In the feature engineering of queries and documents, besides lexical analysis, it is also necessary to analyze and extract the semantics, i.e., the concepts, that the surface words are really meant to express. Consider polysemy, synonyms and near-synonyms: the same word expresses different meanings in different contexts, and different words may express the same meaning in different contexts. LSA (Latent Semantic Analysis) is a well-known technique for dealing with such problems. Its main idea is to map a high-dimensional vector space to a low-dimensional latent semantic (concept) space, i.e., to perform dimensionality reduction. Concretely, a singular value decomposition (SVD) is applied to the term-document matrix. In the singular value decomposition, C is a matrix with documents as rows and terms as columns (say M x N), whose elements are term frequency-inverse document frequency values. C is decomposed into the product of three smaller matrices: each column of U represents a topic, and each non-zero element in it represents the correlation between a topic and an article (the larger the value, the stronger the correlation); V represents the correlation between the keywords and all terms; Σ represents the correlation between article topics and keywords.
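The decomposition just described can be written compactly. This is the standard truncated SVD used by LSA; the notation is generic rather than specific to this embodiment:

```latex
% Term-document matrix C (documents as rows, terms as columns, tf-idf entries),
% factored and truncated to the top k singular values:
C \;=\; U \Sigma V^{\mathsf{T}} \;\approx\; U_k \Sigma_k V_k^{\mathsf{T}},
\qquad C \in \mathbb{R}^{M \times N}.
% The rows of U_k \Sigma_k give the M documents' coordinates in the
% k-dimensional latent semantic space; a query vector q of term weights
% (length N) is folded into the same space as
\hat{q} \;=\; q\, V_k\, \Sigma_k^{-1}.
```

Keeping only the top k singular values is what performs the dimensionality reduction: documents and queries that share concepts end up close in the latent space even when they share few literal terms.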
在一个实施例中,如图3所示,上述基于语义分析的文献相似度确定方法还可以包括步骤1023:利用主成分分析法、线性判别分析法和互信息法,对每一部分对应的特征集合进行特征的筛选和组合,得到特征降维处理后的每一部分对应的特征集合;In one embodiment, as shown in FIG. 3, the above method for determining document similarity based on semantic analysis may further include step 1023: using principal component analysis, linear discriminant analysis and the mutual information method to screen and combine the features in the feature set corresponding to each part, obtaining the feature set corresponding to each part after dimensionality reduction;
根据每一部分对应的特征集合,对每一部分进行词级、句法级和篇章级的语义分析,得到每一部分的语义分析结果,可以包括:根据特征降维处理后的每一部分对应的特征集合,对每一部分进行词级、句法级和篇章级的语义分析,得到每一部分的语义分析结果。Performing word-level, syntax-level and discourse-level semantic analysis on each part according to its corresponding feature set to obtain the semantic analysis result of each part may include: performing word-level, syntax-level and discourse-level semantic analysis on each part according to its corresponding feature set after dimensionality reduction, to obtain the semantic analysis result of each part.
具体实施时,构成文本的词汇数量一般特别多,因此表示文本的向量空间的维数也相当大,可以达到上千维,因此必须进行降维处理。一般通过特征提取的方法进行降维,表示词汇的特征指标可以包括:文档频率、信息增益、互信息、开方拟合检验、术语强度等。通过计算词汇的上述任一指标,然后由大到小排序选取指定数量的或指标值大于指定阈值的词汇构成特征集。下面对降维处理进行进一步的介绍。During specific implementation, the number of words constituting a text is generally very large, so the dimensionality of the vector space representing the text is also quite large, possibly reaching thousands of dimensions; dimensionality reduction is therefore necessary. Dimensionality is generally reduced by feature extraction; the indicators for scoring words may include document frequency, information gain, mutual information, the chi-square (goodness-of-fit) test, term strength, etc. Any of the above indicators is computed for each word, the words are sorted in descending order, and a specified number of words (or the words whose indicator value exceeds a specified threshold) are selected to form the feature set. Dimensionality reduction is further introduced below.
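The "score, sort, keep the top N" selection just described can be sketched directly. The sketch below uses document frequency as the indicator because it is the simplest of those listed; any of the other indicators would slot into the same structure. Names are illustrative, not from the embodiment.

```java
import java.util.*;

public class FeatureSelect {
    // Score every term by document frequency, sort descending, keep the top n
    // terms as the feature set (ties broken alphabetically for determinism).
    public static List<String> topByDocFreq(List<Set<String>> docs, int n) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs)
            for (String term : doc) df.merge(term, 1, Integer::sum);

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(df.entrySet());
        entries.sort((a, b) -> b.getValue().equals(a.getValue())
                ? a.getKey().compareTo(b.getKey())
                : b.getValue() - a.getValue());

        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) out.add(entries.get(i).getKey());
        return out;
    }
}
```

The threshold variant mentioned in the text differs only in the last step: keep every term whose score exceeds the threshold instead of the top N.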
具体实施时,在将专利文献转换为特征集合后,即在步骤1022后的步骤1023中,很多算法就可以直接进入检索排序阶段,但是由于专利文献的数据太过庞大,每个专利文献的特征又非常众多,算法的计算速度至关重要,否则算法可能看上去很美,但是无实用效果。为了能够进一步加快计算速度,在特征集合的基础上,利用主成分分析、线性判别分析、互信息等方法进行特征的筛选和组合,得到新的特征,实现特征降维的目的。During specific implementation, after the patent documents have been converted into feature sets, i.e., in step 1023 following step 1022, many algorithms could proceed directly to the retrieval and ranking stage. However, because the volume of patent data is so large and each patent document has so many features, the computational speed of the algorithm is critical; otherwise the algorithm may look elegant but have no practical effect. To further speed up computation, on the basis of the feature sets, methods such as principal component analysis, linear discriminant analysis and mutual information are used to screen and combine features, obtaining new features and achieving dimensionality reduction.
具体实施时,降维就是一种对高维度特征数据预处理方法。降维是将高维度的数据保留下最重要的一些特征,去除噪声和不重要的特征,从而实现提升数据处理速度的目的。在实际的生产和应用中,降维在一定的信息损失范围内,可以为我们节省大量的时间和成本。降维也成为应用非常广泛的数据预处理方法。In specific implementation, dimensionality reduction is a method of preprocessing high-dimensional feature data. Dimensionality reduction is to retain some of the most important features of high-dimensional data, remove noise and unimportant features, so as to achieve the purpose of improving data processing speed. In actual production and application, dimensionality reduction can save us a lot of time and cost within a certain range of information loss. Dimensionality reduction has also become a very widely used data preprocessing method.
具体实施时,降维具有如下一些优点:When implemented, dimensionality reduction has the following advantages:
1、使得数据集更易使用。1. Make the dataset easier to use.
2、降低算法的计算开销。2. Reduce the computational cost of the algorithm.
3、去除噪声。3. Remove noise.
4、使得结果容易理解。4. Make the results easy to understand.
在上述步骤1024中,利用自然语言处理引擎对专利文献进行字词级、句法级、篇章级的语义分析,字词级分析包括中文分词、命名实体识别、词性标注、同义词分析、字词向量分析、n-gram分析、词粒度分析、停用词分析等;专利文献句法级分析包括依存文法分析、语言模型、短串分析等;专利文献篇章级分析包括专利文献标签提取、主题模型、文本聚类等。In the above step 1024, a natural language processing engine is used to perform word-level, syntax-level and discourse-level semantic analysis on the patent documents. Word-level analysis includes Chinese word segmentation, named entity recognition, part-of-speech tagging, synonym analysis, word vector analysis, n-gram analysis, word granularity analysis, stop-word analysis, etc.; syntax-level analysis of patent documents includes dependency grammar analysis, language models, short-string analysis, etc.; discourse-level analysis of patent documents includes patent document tag extraction, topic models, text clustering, etc.
具体实施时,在上述步骤1024中,特征评估加权:特征权值的计算方法主要是运用词频-逆文档频率公式。根据挖掘目的不同,目前存在多种词频-逆文档频率公式构造方法。从上述内容可以看出,文本向量空间的构造完全按照概率统计规律进行,并不考虑词与词之间的关系。在文本挖掘过程中另一关键步骤是计算文本向量之间的相似距离,对任意两个向量X=(x1,x2,…,xn)与X′=(x′1,x′2,…,x′n),主要存在3种最常用的距离度量:欧氏距离、余弦距离和内积,以上3种距离度量同样不涉及向量中词与词之间关系的分析。通过上述分析可以得出无论在文本向量构造过程还是相似距离度量过程都是根据概率统计的原理设计的,并不考虑特征在语义上的关系。而在自然语言使用中,为了表达的需要,文本中常常大量出现同义词和关联词(依据上下文关系经常搭配使用的词),如在IT技术里"计算机"和"电脑"同义;在司法领域"警察"与"案件"同时出现的几率非常大。这些同义词和关联词同样会大量出现在文本特征向量中,一方面增加了文本特征向量的维数,另一方面降低了文本特征向量对文档的表达精度。虽然文本特征抽取可以通过预先设定的阈值来降低特征向量的维数,但它不是在基于保证语义精度的前提下,因而常常适得其反。虽然也可以在分词的过程中,使用同义词词典和蕴涵词词典来减少同义词和关联词,但同时也带来词典维护和更新的问题。During specific implementation, in the above step 1024, feature evaluation and weighting are performed: feature weights are mainly computed with the term frequency-inverse document frequency formula, and there are currently many ways to construct such formulas depending on the mining purpose. As can be seen from the above, the text vector space is constructed entirely according to the laws of probability and statistics, without considering the relationships between words. Another key step in the text mining process is computing the similarity distance between text vectors. For any two vectors X=(x1,x2,…,xn) and X′=(x′1,x′2,…,x′n), there are three most commonly used distance measures: Euclidean distance, cosine distance and inner product; these three measures likewise involve no analysis of the relationships between the words in the vectors. From this analysis it follows that both the construction of text vectors and the measurement of similarity distance are designed on the principles of probability and statistics, without considering the semantic relationships between features. In natural language use, however, texts often contain many synonyms and associated words (words frequently used together depending on context) for expressive purposes; for example, in IT the words "计算机" and "电脑" both mean "computer", and in the judicial domain "police" and "case" co-occur with very high probability. These synonyms and associated words also appear in large numbers in text feature vectors, which on the one hand increases the dimensionality of the vectors and on the other hand reduces how precisely they represent the documents. Although text feature extraction can reduce the dimensionality of the feature vectors through a preset threshold, it does not do so on the premise of preserving semantic precision, and is therefore often counterproductive. Synonym dictionaries and entailment dictionaries can also be used during word segmentation to reduce synonyms and associated words, but this in turn brings the problem of dictionary maintenance and updating.
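The three distance measures named in the text (Euclidean distance, cosine distance, inner product) follow directly from their standard definitions for vectors X and X′; the sketch below implements exactly those definitions.

```java
public class VectorDistance {
    // Inner product: sum of element-wise products.
    public static double inner(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    // Euclidean distance: square root of the summed squared differences.
    public static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    // Cosine similarity: inner product normalized by the two vector norms.
    public static double cosine(double[] x, double[] y) {
        return inner(x, y) / (Math.sqrt(inner(x, x)) * Math.sqrt(inner(y, y)));
    }
}
```

As the text notes, all three operate purely on the numeric vectors and capture no word-to-word semantic relationships, which is what motivates the retrieval association model introduced next.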
通过以上分析,可以看出,通过给定词联想出近义词和关联词,不论是对于用户的应用,还是对于文本语义挖掘,都非常重要。检索联想模型的功能主要是解决这个问题,下文会对该检索联想模型进行介绍。Through the above analysis, it can be seen that it is very important to associate synonyms and related words with a given word, whether it is for user applications or for text semantic mining. The function of the retrieval association model is mainly to solve this problem, and the retrieval association model will be introduced below.
三、接着,为了便于理解,一同介绍上述步骤103至步骤105。3. Next, to facilitate understanding, the above steps 103 to 105 are introduced together.
具体实施时,根据每一部分的语义分析结果,确定每一待比较文献的每一部分的权重值;根据每一待比较文献的每一部分的权重值,得到每一待比较文献的加权平均结果;根据每一待比较文献的加权平均结果,确定待比较文献之间的相似度。During specific implementation, the weight value of each part of each document to be compared is determined according to the semantic analysis result of each part; the weighted average result of each document to be compared is obtained according to the weight value of each part of each document to be compared; The weighted average result of each document to be compared determines the similarity between documents to be compared.
四、接着,介绍进一步优选的步骤。4. Next, further preferred steps are introduced.
本发明实施例可以用于语义检索时使用,对待检索文献数据库中的每一文献均进行文献预处理,以专利文献为例,专利文献文档预处理通过对文档进行处理,抽取,编码转化,归一化等,转化为包含纯文本的半结构化数据,这些半结构化数据将作为智能专利检索的输入,到预先建立的同样的半结构化数据中检索,根据本发明实施例提供的基于语义分析的文献相似度确定方法进行比较排序,获得智能搜索结果,也可以称作是语义搜索结果,检索精度高。文档及其处理结果,都通过Kafka分布式消息队列来传递。整个过程如图4所示:The embodiment of the present invention can be used in semantic retrieval. Every document in the document database to be retrieved undergoes document preprocessing. Taking patent documents as an example, preprocessing converts a document, through processing, extraction, encoding conversion, normalization, etc., into semi-structured data containing plain text. This semi-structured data serves as the input of intelligent patent retrieval and is matched against the same kind of pre-built semi-structured data; comparison and ranking are performed with the semantic-analysis-based document similarity determination method provided by the embodiment of the present invention to obtain intelligent search results, which may also be called semantic search results, with high retrieval accuracy. Documents and their processing results are delivered through a Kafka distributed message queue. The whole process is shown in FIG. 4:
专利文献数据主要格式为XML。首先需要从这两个XML数据表中提取拼接出需要的数据。The main format of patent document data is XML. First, the required data needs to be extracted and spliced from the two XML data tables.
XML文件多用于信息的描述,所以在得到一个XML文档之后按照XML中的元素取出对应的信息就是XML的解析。XML解析有两种方式,一种是DOM解析,另一种是SAX解析,本发明实施例采用的是DOM解析方式对XML进行处理。XML files are mostly used for the description of information, so after obtaining an XML document, extracting the corresponding information according to the elements in the XML is the parsing of XML. There are two ways of XML parsing, one is DOM parsing, and the other is SAX parsing. The embodiment of the present invention adopts the DOM parsing way to process XML.
基于DOM解析的XML分析器是将其转换为一个对象模型的集合,用树这种数据结构对信息进行储存。通过DOM接口,应用程序可以在任何时候访问XML文档中的任何一部分数据,因此这种利用DOM接口访问的方式也被称为随机访问。An XML parser based on DOM parsing converts it into a collection of object models and stores information in a tree data structure. Through the DOM interface, the application can access any part of the data in the XML document at any time, so this way of using the DOM interface to access is also called random access.
这种方式也有缺陷,因为DOM分析器将整个XML文件转换为了树存放在内存中,当文件结构较大或者数据较复杂的时候,这种方式对内存的要求就比较高,且对于结构复杂的树进行遍历也是一种非常耗时的操作。不过DOM所采用的树结构与XML存储信息的方式相吻合,同时其随机访问还可利用,所以DOM接口还是具有广泛的使用价值。This method also has drawbacks: because the DOM parser converts the entire XML file into a tree held in memory, when the file is large or the data is complex, the memory requirements are relatively high, and traversing a complex tree is a very time-consuming operation. However, the tree structure used by DOM matches the way XML stores information, and its random access remains available, so the DOM interface still has wide practical value.
DOM解析中有以下4个核心操作接口:There are four core operation interfaces in DOM parsing:
1.Document:此接口代表了整个XML文档,表示为整个DOM的根,即为该树的入口,通过该接口可以访问XML中所有元素的内容。其常用方法如下:1.Document: This interface represents the entire XML document, represented as the root of the entire DOM, which is the entry of the tree, through which the content of all elements in XML can be accessed. The commonly used methods are as follows:
public NodeList getElementsByTagName(String tagname);public NodeList getElementsByTagName(String tagname);
取得指定节点名称的NodeList;Get the NodeList of the specified node name;
Public Element createElement(String tagName)throws DOMException;Public Element createElement(String tagName) throws DOMException;
创建一个指定名称的节点;Create a node with the specified name;
Public Text createTextNode(String data)throws DOMException;Public Text createTextNode(String data) throws DOMException;
创建一个文本内容节点;Create a text content node;
Element createElement(String tagName)throws DOMException;Element createElement(String tagName) throws DOMException;
创建一个节点元素;create a node element;
Public Attr createAttribute(String name)throws DOMException;Public Attr createAttribute(String name) throws DOMException;
创建一个属性:Create an attribute:
2.Node:此接口在整个DOM树中有着举足轻重的地位,DOM操作的核心接口都继承于Node(Document、Element、Attr)。在DOM树中,每一个Node接口代表了一个DOM树节点。2.Node: This interface plays a pivotal role in the entire DOM tree. The core interfaces of DOM operations are inherited from Node (Document, Element, Attr). In the DOM tree, each Node interface represents a DOM tree node.
Node接口常用方法:Common methods of Node interface:
Node appendChild(Node newChild)throws DOMException;Node appendChild(Node newChild) throws DOMException;
在当前节点下增加下一个新节点;Add the next new node under the current node;
Public NodeList getChildNodes();Public NodeList getChildNodes();
取得本节点下的全部子节点;Get all child nodes under this node;
Public Node getFirstChild();Public Node getFirstChild();
取得该节点下的第一个子节点;Get the first child node under the node;
Public Node getLastChild();Public Node getLastChild();
取得本节点下的最后一个子节点;Get the last child node under this node;
Public boolean hasChildNodes();Public boolean hasChildNodes();
判断是否还有其他节点;Determine whether there are other nodes;
String getNodeValue()throws DOMException;String getNodeValue() throws DOMException;
获取节点内容:Get node content:
3.NodeList:此接口表示一个节点的集合,一般用于有序关系的一组节点。NodeList常用方法:3. NodeList: This interface represents a collection of nodes, generally a set of nodes in an ordered relationship. Common NodeList methods:
Public int getLength();Public int getLength();
取得NodeList中节点的个数;Get the number of nodes in NodeList;
Public Node item(int index);Public Node item(int index);
根据索引取得节点对象;Get the node object according to the index;
4.NamedNodeMap:此接口表示一组节点和其唯一名称对应的一一关系,主要用于节点属性的表示。4. NamedNodeMap: This interface represents a one-to-one relationship between a group of nodes and their unique names, and is mainly used for the representation of node attributes.
除了以上四个核心接口外,如果一个程序需要进行DOM解析操作,则需要按照如下步骤进行:In addition to the above four core interfaces, if a program needs to perform DOM parsing operations, it needs to follow the steps below:
1、建立DocumentBuilderFactor,用于获得DocumentBuilder对象:1. Create a DocumentBuilderFactor to obtain a DocumentBuilder object:
DocumentBuilderFactory factory=DocumentBuilderFactory.newInstance();DocumentBuilderFactory factory=DocumentBuilderFactory.newInstance();
2、建立DocumentBuidler:2. Create DocumentBuilder:
DocumentBuilder builder=factory.newDocumentBuilder();DocumentBuilder builder=factory.newDocumentBuilder();
3、建立Document对象,获取树的入口:3. Create a Document object and get the entry of the tree:
Document doc=builder.parse(“XML文件的相对路径或者绝对路径”);Document doc=builder.parse("relative path or absolute path of XML file");
4、建立NodeList:4. Create NodeList:
NodeList n1=doc.getElementsByTagName(“读取节点”)。NodeList n1=doc.getElementsByTagName("node to read").
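The four steps above can be assembled into one runnable sketch. To stay self-contained it parses a tiny in-memory XML string rather than a patent file from disk; the element name "title" is illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomDemo {
    // Returns the text content of the first element with the given tag name.
    public static String firstText(String xml, String tag) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); // step 1
        DocumentBuilder builder = factory.newDocumentBuilder();                // step 2
        Document doc = builder.parse(                                          // step 3
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nl = doc.getElementsByTagName(tag);                           // step 4
        return nl.getLength() > 0 ? nl.item(0).getTextContent() : null;
    }
}
```

For real patent XML, the `ByteArrayInputStream` would be replaced by the relative or absolute path of the XML file, exactly as step 3 in the text describes.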
下面针对利用本发明实施例提供的基于语义分析的文献相似度确定方法,进行专利文献智能检索进行介绍。The following will introduce the intelligent retrieval of patent documents by using the method for determining the similarity of documents based on semantic analysis provided by the embodiments of the present invention.
1.检索召回算法。1. Retrieval recall algorithm.
检索系统最重要的是排序,排序分为初级排序(即检索召回)和基于业务理解、NLP、机器学习的学习排序,检索召回主要采用词频-逆文档频率、BM25等算法。The most important thing in the retrieval system is sorting, which is divided into primary sorting (ie, retrieval recall) and learning sorting based on business understanding, NLP, and machine learning. Retrieval and recall mainly use algorithms such as word frequency-inverse document frequency and BM25.
检索召回的目的是利用相对轻量级的算法快速获取候选项,从而方便后续利用复杂的排序算法进行排序,优化整个流程的查找效率,避免不必要的运算。候选项要大于我们最终要找的目标项。比如我们目标是找到最可能的50个专利文献,那么候选项可以设置为500个。The purpose of retrieval recall is to use a relatively lightweight algorithm to quickly obtain candidates, so as to facilitate subsequent sorting using complex sorting algorithms, optimize the search efficiency of the entire process, and avoid unnecessary operations. The candidate item is larger than the target item we are ultimately looking for. For example, our goal is to find the most likely 50 patent documents, then the candidates can be set to 500.
(1)基于词频-逆文档频率检索召回(1) Retrieval based on word frequency-inverse document frequency
词频-逆文档频率是一种用于信息检索与文本挖掘的常用加权技术。词频-逆文档频率是一种统计方法,用以评估一字词对于一个文件集或一个语料库中的其中一份文件的重要程度。在一份给定的文件里,词频(term frequency,tf)指的是某一个给定的词语在该文件中出现的频率。这个数字是对词数(term count)的归一化,以防止它偏向长的文件。对于在某一特定文件里的词语ti来说,它的重要性可表示为:Term frequency-inverse document frequency is a commonly used weighting technique in information retrieval and text mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus. In a given document, the term frequency (tf) is the frequency with which a given word appears in that document; it is a normalization of the raw term count, to prevent bias towards long documents. For a term ti in a particular document, its importance can be expressed as:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
以上式子中n_{i,j}是该词在文件dj中的出现次数,而分母则是在文件dj中所有字词的出现次数之和。In the above formula, n_{i,j} is the number of occurrences of the term ti in document dj, and the denominator is the sum of the occurrences of all terms in document dj.
逆向文件频率(inverse document frequency,idf)是一个词语普遍重要性的度量。某一特定词语的idf,可以由总文件数目除以包含该词语之文件的数目,再将得到的商取以10为底的对数得到:Inverse document frequency (idf) is a measure of the general importance of a word. The idf of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the base-10 logarithm of the quotient:

idf_i = log( |D| / |{ j : ti ∈ dj }| )
其中:in:
|D|:语料库中的文件总数;|D|: The total number of documents in the corpus;
|{j : ti ∈ dj}|:包含词语ti的文件数目(即n_{i,j} != 0的文件数目)。如果词语不在数据中,就导致分母为零,因此一般情况下使用1+|{j : ti ∈ dj}|;|{j : ti ∈ dj}|: the number of documents containing the term ti (i.e., the number of documents with n_{i,j} != 0). If the term does not appear in the data the denominator would be zero, so in general 1+|{j : ti ∈ dj}| is used;
然后 tfidf_{i,j} = tf_{i,j} × idf_i;Then tfidf_{i,j} = tf_{i,j} × idf_i;
某一特定文件内的高词语频率,以及该词语在整个文件集合中的低文件频率,可以产生出高权重的词频-逆文档频率。因此,词频-逆文档频率倾向于过滤掉常见的词语,保留重要的词语。The high word frequency within a particular document, and the low document frequency of that word in the entire document collection, can result in a high weighted word frequency - inverse document frequency. Therefore, term frequency-inverse document frequency tends to filter out common terms and keep important terms.
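The tf-idf definitions above translate directly into code. The sketch below follows the formulas as given, including the 1+df guard in the idf denominator; documents are represented simply as token lists.

```java
import java.util.*;

public class TfIdf {
    // tf_{i,j}: count of the term in one document, normalized by document length.
    public static double tf(List<String> doc, String term) {
        long n = doc.stream().filter(term::equals).count();
        return (double) n / doc.size();
    }

    // idf_i: log10(|D| / (1 + document frequency)); 1+df guards against a
    // zero denominator when the term is absent, as noted in the text.
    public static double idf(List<List<String>> corpus, String term) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log10((double) corpus.size() / (1 + df));
    }

    public static double tfidf(List<List<String>> corpus, List<String> doc, String term) {
        return tf(doc, term) * idf(corpus, term);
    }
}
```

A term occurring often in one document but rarely across the corpus gets a high weight, which is exactly the filtering behavior described above.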
(2)基于BM25检索召回(2) Recall based on BM25 retrieval
BM25是一种常见的用来做相关度打分的公式,相当于把tfidf进行了扩展,主要就是计算一个query里面所有词和文档的相关度,然后再把分数做累加操作,而每个词的相关度分数主要还是受到tf/idf的影响,BM25的原始公式如下:BM25 is a common formula for relevance scoring; it can be seen as an extension of tf-idf. It mainly computes the relevance between every word in a query and a document and then accumulates the scores, with each word's relevance score still mainly influenced by tf/idf. The original formula of BM25 is as follows:

Σ_i log( ((ri + 0.5) / (R - ri + 0.5)) / ((ni - ri + 0.5) / (N - ni - R + ri + 0.5)) )
其中ri是包含词项i的相关文档数量,ni是包含词项i的文档数量,N是整个文档数据集中所有文档的数量,R是和这个查询相关的文档数量,由于在典型的情况下,没有相关信息,即ri和R都是0,而通常的查询中,不会有某个词项出现的次数大于1,因此打分的公式score变为:where ri is the number of relevant documents containing term i, ni is the number of documents containing term i, N is the number of documents in the whole collection, and R is the number of documents relevant to this query. Since in the typical case there is no relevance information, i.e. ri and R are both 0, and in ordinary queries no term appears more than once, the scoring formula becomes:

score = Σ_i log( (N - ni + 0.5) / (ni + 0.5) )
BM25算法,在用来作搜索相关性评分时主要的操作为:对Query进行语素解析,生成语素qi;然后,对于每个搜索结果D,计算每个语素qi与D的相关性得分,最后,将qi相对于D的相关性得分进行加权求和,从而得到Query与D的相关性得分,将上述公式转写成BM25的一般性公式如下:When the BM25 algorithm is used for search relevance scoring, the main operations are: perform morpheme analysis on the Query to generate morphemes qi; then, for each search result D, compute the relevance score between each morpheme qi and D; finally, take a weighted sum of the relevance scores of the qi with respect to D to obtain the relevance score of the Query and D. Transcribing the above into the general formula of BM25 gives:

Score(Q,d) = Σ_{i=1..n} Wi · R(qi, d)
其中,Q表示Query,qi表示Q解析后的一个语素(对中文而言,我们可以把对Query的分词作为语素分析,把每个词看成语素qi);d表示一个搜索结果文档;Wi表示语素的权重;R(qi,d)表示语素qi与文档d的相关性得分。where Q denotes the Query and qi denotes a morpheme obtained by parsing Q (for Chinese, we can treat the segmentation of the Query as morpheme analysis and regard each word as a morpheme qi); d denotes a search result document; Wi denotes the weight of the morpheme; and R(qi,d) denotes the relevance score between morpheme qi and document d.
下面定义Wi,表示一个词与一个文档的相关性的权重,方法有多种,如果作为BM25的一般性公式,则可以描述为类似IDF的形式,公式如下:Wi, the weight of a word's relevance to a document, can be defined in several ways; in the general formula of BM25 it is described in an IDF-like form:

Wi = IDF(qi) = log( (N - n(qi) + 0.5) / (n(qi) + 0.5) )
其中N为索引中的全部文档数,n(qi)为包含了qi的文档数。根据IDF的定义可以看出,对于给定的文档集合,包含了qi的文档数越多,qi的权重则越低,也就是说,当很多文档都包含了qi时,qi的区分度就不高,因此使用qi来判断相关性的重要度就越低。再来看语素qi与文档d的相关性得分R(qi,d),首先看BM25中相关性得分的一般形式:where N is the total number of documents in the index and n(qi) is the number of documents containing qi. By the definition of IDF, for a given document collection, the more documents contain qi, the lower the weight of qi; that is, when many documents contain qi, its discriminative power is low, so qi matters less for judging relevance. Next, consider the relevance score R(qi,d) between morpheme qi and document d, starting from its general form in BM25:

R(qi,d) = ( fi·(k1+1) / (fi + K) ) · ( qfi·(k2+1) / (qfi + k2) ),  K = k1·(1 - b + b·dl/avgdl)
其中,k1、k2、b为调节因子,通常根据经验设置,一般k1=2,b=0.75;fi为qi在d中的出现频率,qfi为qi在Query中的出现频率。dl为文档d的长度,avgdl为所有文档的平均长度。由于在绝大部分情况,qi在Query中只会出现一次,即qfi=1,因此公式可以简化为:where k1, k2 and b are tuning factors, usually set empirically, generally k1=2 and b=0.75; fi is the frequency of qi in d, and qfi is the frequency of qi in the Query; dl is the length of document d, and avgdl is the average length of all documents. Since in the vast majority of cases qi appears only once in the Query, i.e. qfi=1, the formula simplifies to:

R(qi,d) = fi·(k1+1) / (fi + K)
从K的定义中可以看到,参数b的作用是调整文档长度对相关性影响的大小。b越大,文档长度对相关性得分的影响越大,反之越小;而文档的相对长度越大,K值将越大,则相关性得分会越小。这可以理解为,当文档较长时,包含qi的机会越大,因此,同等fi的情况下,长文档与qi的相关性应该比短文档与qi的相关性弱。As can be seen from the definition of K, the parameter b adjusts how much the document length influences relevance. The larger b is, the greater the influence of document length on the relevance score, and vice versa; and the greater the relative length of the document, the larger K becomes and the smaller the relevance score. This can be understood as follows: the longer a document is, the greater the chance that it contains qi; therefore, for the same fi, a long document's relevance to qi should be weaker than a short document's.
综上,BM25算法的相关性得分公式可总结为:In summary, the correlation score formula of the BM25 algorithm can be summarized as:
从BM25的公式可以看到,通过使用不同的语素分析方法、语素权重判定方法,以及语素与文档的相关性判定方法,我们可以衍生出不同的搜索相关性得分计算方法,这就为我们设计算法提供了较大的灵活性。As can be seen from the formula of BM25, by using different morpheme analysis methods, morpheme weight determination methods, and morpheme-document correlation determination methods, we can derive different search relevance score calculation methods, which design the algorithm for us Provides greater flexibility.
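The simplified BM25 scoring described above can be sketched in a few lines (a minimal illustration of the formulas in the text, not the patented implementation; the toy corpus and parameter defaults are made up for the example):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=2.0, b=0.75):
    """Score one document against a query with simplified BM25 (qfi = 1)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    dl = len(doc)
    K = k1 * (1 - b + b * dl / avgdl)  # length-normalization factor
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)       # documents containing q
        if n_q == 0:
            continue
        w = math.log((N - n_q + 0.5) / (n_q + 0.5))  # IDF-like weight Wi
        f = doc.count(q)                              # frequency fi of q in doc
        score += w * f * (k1 + 1) / (f + K)           # Wi * R(qi, d)
    return score

corpus = [["patent", "search", "ranking", "model"],
          ["semantic", "analysis", "method"],
          ["image", "recognition", "model"],
          ["data", "storage", "device"]]
print(bm25_score(["search", "ranking"], corpus[0], corpus))
```

A document containing both query terms scores above one containing neither, matching the discussion of Wi and fi above.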
(3) Function-based sorting

Many ranking models are function-based; for example, a decay function can down-weight scores along dimensions such as time or count. Common decay functions include Gaussian decay and linear decay.

Gaussian decay: a decay function derived from the Gaussian function, defined as:

S(doc) = exp( -max(0, |fieldvalue_doc - origin| - offset)² / (2σ²) )

The score of this formula decays once the field value moves beyond origin ± offset. Given scale (the decay limit) and decay (the degree of decay), the shape of the decay function is fixed, so σ is computed as:
σ² = -scale² / (2 × ln(decay))
Exponential decay:

S(doc) = exp( λ × max(0, |fieldvalue_doc - origin| - offset) )

Similarly, the score of this formula decays once the field value moves beyond origin ± offset. Given scale (the decay limit) and decay (the degree of decay), the shape of the decay function is fixed, so λ is computed as:

λ = ln(decay) / scale
Linear decay:

S(doc) = max( (s - max(0, |fieldvalue_doc - origin| - offset)) / s, 0 )

where s is computed as:

s = scale / (1.0 - decay)

Compared with the previous two formulas, the score of this formula drops to 0 once the field value exceeds twice the scale.
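The three decay curves can be reproduced directly from the formulas above (a sketch; the parameter names mirror the text, and the default offset and decay values are illustrative):

```python
import math

def _distance(value, origin, offset):
    # distance beyond the origin +/- offset plateau; 0 inside it
    return max(0.0, abs(value - origin) - offset)

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma2 = -scale ** 2 / (2 * math.log(decay))  # sigma^2 from scale and decay
    return math.exp(-_distance(value, origin, offset) ** 2 / (2 * sigma2))

def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale                 # lambda from scale and decay
    return math.exp(lam * _distance(value, origin, offset))

def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)
    return max((s - _distance(value, origin, offset)) / s, 0.0)

# all three return exactly `decay` at distance `scale` from the origin
for fn in (gauss_decay, exp_decay, linear_decay):
    print(fn.__name__, fn(110.0, origin=100.0, scale=10.0))
```

At distance 2 × scale the linear variant reaches 0 while the other two stay positive, as the text notes.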
2. Retrieval ranking algorithm.

The essence of a retrieval system is to achieve semantic-level matching based on an understanding of the query and the patents, and then rank the matched candidate results. Ranking has two stages: recall and fine ranking. Recall, described above, returns the Top-N candidates; in the fine-ranking stage, the Top-N results are precisely re-ranked with a machine-learned ranking model to select the Top-M results (N > M). This is known as LTR (learning to rank).
The following describes in detail how latent semantic analysis and association rule mining are used to construct the retrieval association model mentioned above.

1. Latent semantic analysis.
Latent semantic analysis (LSA) reduces synonym noise by introducing a concept space. LSA exploits the contextual relatedness of words: words that appear in similar contexts are considered close in usage and meaning. To implement the LSA idea, first construct the word-document matrix:

A = [aij] m×n

where m is the vocabulary size and n is the number of documents; aij is a non-negative value giving the weight of the i-th word in the j-th document. Each word corresponds to a row of A, and each document to a column. Usually aij combines two contributions, a local weight L(i, j) and a global weight C(i, j); in the VSM model various weighting schemes exist for them, such as IDF and TF-IDF. Since each word occurs in only a small number of documents, A is typically a high-order sparse matrix. Let the i-th and j-th words correspond to the i-th and j-th rows of the word-document matrix, written ti = (ai1, ai2, …, ain) and tj = (aj1, aj2, …, ajn); their similarity is defined as the cosine of the two row vectors:

sim(ti, tj) = (ti · tj) / (‖ti‖ ‖tj‖)
Before computing similarity, aij is usually transformed to log(aij + 1) and divided by its entropy; this preprocessing takes the word's context into account and highlights the environment in which the word is used in the text. After this information-entropy transformation we obtain the normalized word-document matrix A′ = [a′ij] m×n.

The theoretical basis of latent semantic analysis is the singular value decomposition (SVD) of a matrix, a method commonly used in mathematical statistics. Once the word-document matrix A′ is built, SVD is used to compute its rank-k approximation A′k (k ≪ min(m, n)). By SVD, A′ can be expressed as the product of three matrices:
A′ = UΣVᵀ;
where U and V are the left and right singular vector matrices corresponding to the singular values of A′; Σ is the diagonal matrix of singular values; and Vᵀ is the transpose of V. The singular values of A′, arranged in decreasing order, form the diagonal matrix Σk; taking the first k columns of U and V yields the rank-k approximation of A′:

A′k = Uk Σk Vkᵀ

The column vectors of Uk and Vk are orthonormal. Assuming A′ has rank r, we have:

UᵀU = VᵀV = Ir (Ir is the r×r identity matrix);
A′k approximately represents the original word-document matrix A′, and the row vectors of Uk and Vk serve as word vectors and document vectors respectively; text classification and various other document processing tasks are carried out on this basis. This is the latent semantic indexing (LSI) technique. Although LSI also represents a document's semantics by the words it contains, the LSI model does not treat all words in a document as a reliable representation of the document's concepts. Because the diversity of words in a document largely obscures its semantic structure, LSI, through SVD and the rank-k approximation, on the one hand removes the "noise" contained in the original word-document matrix, making the semantic relationship between words and documents more prominent; on the other hand, it greatly reduces the dimensionality of the word and document vector spaces, improving the efficiency of text mining.
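The rank-k truncation described above can be carried out with an off-the-shelf SVD. The sketch below (using NumPy on a toy word-document matrix; the matrix values are invented for illustration) reproduces A′ = UΣVᵀ and A′k = UkΣkVkᵀ, then reads off word vectors from Uk:

```python
import numpy as np

def lsa_truncate(A, k):
    """Return the rank-k approximation of A and the reduced word/doc vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]   # keep first k columns / values
    Ak = Uk @ np.diag(sk) @ Vtk                 # A'_k = U_k Σ_k V_k^T
    word_vecs = Uk * sk      # row i: word i in the k-dim concept space
    doc_vecs = Vtk.T * sk    # row j: document j in the concept space
    return Ak, word_vecs, doc_vecs

# toy word-document matrix: 5 words x 4 documents
A = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],   # same contexts as word 0 -> near-synonym
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
Ak, W, D = lsa_truncate(A, k=2)
cos = W[0] @ W[1] / (np.linalg.norm(W[0]) * np.linalg.norm(W[1]))
print("similarity of words 0 and 1:", round(float(cos), 3))
```

Words 0 and 1 occur in identical contexts, so their concept-space vectors coincide and their cosine similarity is 1, which is exactly the signal the synonym-merging algorithm below relies on.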
2. Association rule mining.

Association rules are one of the main mining techniques in data mining; they can discover potentially useful associations or correlations in massive data. Let I = {i1, i2, …, im} be a set of items, and let D be a set of transactions T, where each transaction T is a set of items with T ⊆ I. Each transaction has a unique identifier, denoted TID. Let X be a set of items in I; if X ⊆ T, transaction T is said to contain X. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The support of rule X → Y in the transaction set D is the ratio of the number of transactions containing both X and Y to the total number of transactions, written support(X → Y):

support(X → Y) = |{T ∈ D : X ∪ Y ⊆ T}| / |D|

The confidence of rule X → Y in the transaction set is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, written confidence(X → Y):

confidence(X → Y) = support(X → Y) / support(X)

Given a transaction set D, the association rule mining problem is to generate all association rules whose support and confidence exceed a user-specified minimum support and minimum confidence; common algorithms include Apriori and FP-Growth.
For associated-word mining, suppose a mined association rule has the form {ti → tj, s, c}, meaning that when word ti appears in a document, word tj appears in the same document with support s (0 ≤ s ≤ 1) and confidence c (0 ≤ c ≤ 1). If the support and confidence exceed the specified thresholds, the two words can be considered so strongly associated that even ignoring one of them causes no information loss, which ensures the independence of the retained words.
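Support and confidence as defined above can be computed directly over a transaction set (a toy sketch in which documents serve as transactions of words; the example documents are invented):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    sx = support(X, transactions)
    return support(X | Y, transactions) / sx if sx else 0.0

docs = [{"patent", "search"}, {"patent", "search", "ranking"},
        {"patent", "analysis"}, {"image", "recognition"}]
print(support({"patent", "search"}, docs))
print(confidence({"search"}, {"patent"}, docs))
```

Here the rule search → patent holds with support 0.5 and confidence 1.0, so under thresholds below those values the pair would qualify for merging.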
3. Algorithm procedure.

Based on the above analysis, the algorithms for constructing the synonym set and the associated word set are described as follows:
Algorithm 1: synonym set construction.

Input:
1) the training document set;
2) the concept space dimension k;
3) the number N of feature words to retain after synonym merging.

Output:
1) a feature word set of N elements;
2) a set of merge schemes (determining the retained feature word into which each merged synonym is folded).
Steps:
1) Construct the word-document matrix from the words contained in the training documents;
2) Normalize the word-document matrix;
3) Apply SVD to the word-document matrix to obtain its left and right singular vectors and the diagonal matrix of singular values;
4) Keep the first k columns of the left singular vector matrix and zero out the remaining columns;
5) Keep the first k diagonal entries of the singular value matrix and zero out the rest;
6) Keep the first k columns of the right singular vector matrix and zero out the remaining columns;
7) Multiply the truncated left and right singular vector matrices with the truncated singular value matrix to obtain a new word-document matrix;
8) Normalize the new word-document matrix;
9) While the number of words exceeds N, perform steps 10-14 to merge synonyms;
10) Find the two feature words with the greatest similarity in the feature word set;
11) Delete either one of the most-similar pair from the feature word set;
12) Search the merge scheme set for a scheme containing either word of the most-similar pair;
13) If found, add the other feature word to that scheme's merged feature word set;
14) If no scheme matches, construct a new merge scheme from the two feature words and add it to the merge scheme set.
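Steps 9-14 amount to a greedy agglomeration over the LSA word vectors. A compact sketch of that merging loop follows (cosine similarity over precomputed word vectors; the dict-of-lists representation of the merge scheme set and the sample words are illustrative choices, not the patent's data structures):

```python
import numpy as np

def merge_synonyms(words, vectors, N):
    """Greedily merge the most similar word pair until N feature words remain.
    Returns the kept words and a {kept_word: [merged words]} scheme set."""
    words = list(words)
    vecs = {w: v / np.linalg.norm(v) for w, v in zip(words, vectors)}
    schemes = {}
    while len(words) > N:
        # step 10: find the most similar pair of feature words
        best = max(((a, b) for i, a in enumerate(words) for b in words[i + 1:]),
                   key=lambda p: float(vecs[p[0]] @ vecs[p[1]]))
        keep, drop = best
        words.remove(drop)                         # step 11: delete one of the pair
        schemes.setdefault(keep, []).append(drop)  # steps 12-14: record the scheme
        schemes[keep] += schemes.pop(drop, [])     # fold drop's own merges into keep
    return words, schemes

vecs = np.array([[1.0, 0.0], [0.98, 0.1], [0.0, 1.0], [0.1, 0.95]])
kept, plan = merge_synonyms(["car", "automobile", "bank", "finance"], vecs, N=2)
print(kept, plan)
```

With these toy vectors, "automobile" folds into "car" and "finance" into "bank", leaving two feature words and two merge schemes.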
Algorithm 2: associated word set construction.

Input:
1) the training document set;
2) association merge thresholds: support s and confidence c.

Output:
1) the feature word set after association merging;
2) the merge scheme set after association merging.
Steps:
1) Use the Apriori algorithm to find all single-dimensional associations whose support and confidence exceed s and c respectively;
2) For each single-dimensional association rule whose support and confidence exceed s and c, perform steps 3-6 to merge associated words;
3) Delete the feature word on the right-hand side of the association rule from the feature word set;
4) Search the merge scheme set for a scheme containing a feature word from either side of the rule;
5) If found, add the other feature word to that scheme's merged feature word set;
6) If no scheme matches, construct a new merge scheme from the two feature words on the left and right of the rule and add it to the merge scheme set.
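A minimal version of this association-merge loop can be sketched with brute-force pairwise rule mining standing in for full Apriori (function names, thresholds and sample documents are illustrative; a real system would use an Apriori or FP-Growth implementation):

```python
from itertools import permutations

def mine_pair_rules(docs, min_s, min_c):
    """All rules ti -> tj over single words with support >= min_s and
    confidence >= min_c, over documents treated as transactions."""
    vocab = sorted(set().union(*docs))
    n = len(docs)
    rules = []
    for ti, tj in permutations(vocab, 2):
        both = sum(1 for d in docs if ti in d and tj in d)
        left = sum(1 for d in docs if ti in d)
        s, c = both / n, both / left if left else 0.0
        if s >= min_s and c >= min_c:
            rules.append((ti, tj, s, c))
    return rules

def merge_associated(vocab, rules):
    """Steps 3-6: drop the right-hand word of each rule, recording the merge."""
    kept = set(vocab)
    schemes = {}
    for ti, tj, _, _ in rules:
        if ti in kept and tj in kept:
            kept.discard(tj)
            schemes.setdefault(ti, []).append(tj)
    return kept, schemes

docs = [{"neural", "network"}, {"neural", "network", "model"}, {"graph", "model"}]
rules = mine_pair_rules(docs, min_s=0.6, min_c=0.9)
kept, plan = merge_associated({"neural", "network", "model", "graph"}, rules)
print(rules, kept, plan)
```

Here "neural" and "network" always co-occur, so one is folded into the other and the rest of the vocabulary is untouched.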
Practice has shown that latent semantic analysis and association rule mining can effectively construct the retrieval association model.

In summary, the semantic-analysis-based document similarity determination method provided by the embodiments of the present invention can accurately determine the weights of different parts of a document based on semantic analysis, and thereby accurately determine document similarity.

The embodiments of the present invention also provide a semantic-analysis-based document similarity determination apparatus, as described in the following embodiments. Since the principle by which the apparatus solves the problem is similar to that of the semantic-analysis-based document similarity determination method, the implementation of the apparatus may refer to the implementation of the method; repeated details are omitted.

FIG. 5 is a schematic structural diagram of the semantic-analysis-based document similarity determination apparatus in an embodiment of the present invention. As shown in FIG. 5, the apparatus includes:
a dividing unit 01, configured to divide each document to be compared into multiple parts;
a semantic analysis unit 02, configured to perform semantic analysis on each part to obtain a semantic analysis result for each part;
a weight value determination unit 03, configured to determine, according to the semantic analysis result of each part, a weight value for each part of each document to be compared;
a processing unit 04, configured to obtain a weighted average result for each document to be compared according to the weight values of its parts;
a similarity determination unit 05, configured to determine the similarity between the documents to be compared according to their weighted average results.
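The cooperation of the five units can be sketched end to end (a toy illustration of the claimed flow only: bag-of-words counts stand in for the real semantic analysis, and the part-weight heuristic and sample texts are invented for the example):

```python
import math
from collections import Counter

def analyze_part(text):
    # stand-in for the semantic analysis unit: bag-of-words counts
    return Counter(text.lower().split())

def part_weight(analysis):
    # stand-in for the weight determination unit: richer parts weigh more
    return 1.0 + len(analysis)

def doc_vector(parts):
    analyses = [analyze_part(p) for p in parts]
    weights = [part_weight(a) for a in analyses]
    total = sum(weights)
    merged = Counter()
    for a, w in zip(analyses, weights):   # processing unit: weighted average
        for term, cnt in a.items():
            merged[term] += w / total * cnt
    return merged

def cosine(u, v):
    # similarity determination unit: cosine over the weighted vectors
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_a = ["semantic analysis of patents", "similarity of patent documents"]
doc_b = ["patent document similarity", "image compression hardware"]
print(round(cosine(doc_vector(doc_a), doc_vector(doc_b)), 3))
```

Because the weights are derived from each part's own analysis rather than fixed in advance, parts that carry more signal contribute more to the final similarity, which is the point of the claimed scheme.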
In one embodiment, as shown in FIG. 6, the semantic analysis unit 02 may include:

a word segmentation module 021, configured to perform word segmentation on each part to obtain multiple keywords corresponding to each part;
a feature extraction module 022, configured to extract multiple types of key features from each part according to its keywords and a preset document feature extraction strategy, forming a feature set corresponding to each part;
a feature evaluation module 024, configured to perform word-level, syntax-level and discourse-level semantic analysis on each part according to its feature set, obtaining the semantic analysis result of each part.
In one embodiment, as shown in FIG. 7, the semantic analysis unit 02 may further include a feature dimensionality reduction module 023, configured to filter and combine the features in each part's feature set using principal component analysis, linear discriminant analysis and mutual information, obtaining a dimensionality-reduced feature set corresponding to each part;

the feature evaluation module 024 is then specifically configured to perform word-level, syntax-level and discourse-level semantic analysis on each part according to its dimensionality-reduced feature set, obtaining the semantic analysis result of each part.
In one embodiment, the multiple types of key features may include: static document features, features relating the document to the query, and features of the query.
In one embodiment, the preset document feature extraction strategy may include extracting document features according to one or any combination of: the frequency of a keyword in the document, the inverse document frequency of the keyword, the part of speech of the keyword, whether the keyword is a domain term, the positions where the keyword appears in the document, the TextRank value of the keyword, the information entropy of the keyword, the deviation of the keyword's word vector from the document average, the keyword length, the syntactic role of the keyword in its sentence, whether the keyword can be further segmented into sub-keywords, the span between the first and last occurrence of the keyword in the document, and the deviation of the keyword's distribution.
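A few of the listed signals can be computed straightforwardly per keyword (a sketch with illustrative feature names over a made-up tokenized corpus; TextRank, entropy and word-vector deviation are omitted, and the keyword is assumed to occur in the document):

```python
import math

def keyword_features(keyword, doc_tokens, corpus):
    """Compute a handful of the listed features for one keyword in one document."""
    n = len(doc_tokens)
    positions = [i for i, t in enumerate(doc_tokens) if t == keyword]
    tf = len(positions) / n
    df = sum(1 for d in corpus if keyword in d)
    idf = math.log(len(corpus) / (1 + df))
    return {
        "tf": tf,                                    # frequency in the document
        "idf": idf,                                  # inverse document frequency
        "first_pos": positions[0] / n,               # relative first occurrence
        "span": (positions[-1] - positions[0]) / n,  # first-to-last occurrence span
        "length": len(keyword),                      # keyword length
    }

corpus = [["semantic", "analysis", "of", "patent", "text"],
          ["patent", "similarity", "ranking"],
          ["image", "coding"]]
print(keyword_features("patent", corpus[0], corpus))
```

In practice each keyword of each part would yield such a vector, which then feeds the feature set consumed by the evaluation module.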
In one embodiment, when a keyword can be further segmented into sub-keywords, the preset document feature extraction strategy may further include extracting document features according to one or any combination of: the TF-IDF of the sub-keywords, the part of speech of the sub-keywords, and whether a sub-keyword is a domain term.
In the technical solution of this application, the acquisition, storage, use and processing of data all comply with the relevant provisions of national laws and regulations.

An embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the above semantic-analysis-based document similarity determination method.

An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above semantic-analysis-based document similarity determination method.

An embodiment of the present invention further provides a computer program product comprising a computer program which, when executed by a processor, implements the above semantic-analysis-based document similarity determination method.
In the embodiments of the present invention, the semantic-analysis-based document similarity determination scheme is contrasted with the prior art, in which different fixed weights are preset empirically for different parts of a document before determining document similarity, so that the weights are set inaccurately and the resulting similarity determination is also inaccurate. The present scheme proceeds by: dividing each document to be compared into multiple parts; performing semantic analysis on each part to obtain its semantic analysis result; determining a weight value for each part of each document to be compared according to its semantic analysis result; obtaining a weighted average result for each document to be compared according to the weight values of its parts; and determining the similarity between the documents to be compared according to their weighted average results. This makes it possible to accurately determine the weights of different parts of a document based on semantic analysis, and thereby accurately determine document similarity.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210240186.2A CN114580557B (en) | 2022-03-10 | 2022-03-10 | Method and device for determining document similarity based on semantic analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210240186.2A CN114580557B (en) | 2022-03-10 | 2022-03-10 | Method and device for determining document similarity based on semantic analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114580557A true CN114580557A (en) | 2022-06-03 |
CN114580557B CN114580557B (en) | 2024-12-03 |
Family
ID=81775645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210240186.2A Active CN114580557B (en) | 2022-03-10 | 2022-03-10 | Method and device for determining document similarity based on semantic analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114580557B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116975626A (en) * | 2023-06-09 | 2023-10-31 | 浙江大学 | Automatic updating method and device for supply chain data model |
CN118333033A (en) * | 2024-06-14 | 2024-07-12 | 广东省技术经济研究发展中心 | Advanced learning-based technological project innovation potential prediction method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101507521B1 (en) * | 2014-03-31 | 2015-03-31 | 주식회사 솔샘넷 | Method and apparatus for classifying automatically IPC and recommending F-Term |
US20160224622A1 (en) * | 2013-09-05 | 2016-08-04 | Jiangsu University | Method for detecting the similarity of the patent documents on the basis of new kernel function luke kernel |
KR101681109B1 (en) * | 2015-10-01 | 2016-11-30 | 한국외국어대학교 연구산학협력단 | An automatic method for classifying documents by using presentative words and similarity |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN108763213A (en) * | 2018-05-25 | 2018-11-06 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Theme feature text key word extracting method |
CN109598275A (en) * | 2017-09-30 | 2019-04-09 | 富士通株式会社 | Feature selecting device, method and electronic equipment |
CN110413738A (en) * | 2019-07-31 | 2019-11-05 | 腾讯科技(深圳)有限公司 | An information processing method, device, server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114580557B (en) | 2024-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111104794B (en) | Text similarity matching method based on subject term | |
CN110442760B (en) | A synonym mining method and device for question answering retrieval system | |
CN108763333B (en) | Social media-based event map construction method | |
CN106649597B (en) | Method for auto constructing is indexed after a kind of books book based on book content | |
CN103136352B (en) | Text retrieval system based on double-deck semantic analysis | |
CN107247780A (en) | A kind of patent document method for measuring similarity of knowledge based body | |
CN108763196A (en) | A kind of keyword extraction method based on PMI | |
CN106649260A (en) | Product feature structure tree construction method based on comment text mining | |
CN111309925A (en) | Knowledge graph construction method of military equipment | |
CN113268995A (en) | Chinese academy keyword extraction method, device and storage medium | |
CN103064969A (en) | Method for automatically creating keyword index table | |
CN104834735A (en) | Automatic document summarization extraction method based on term vectors | |
CN113221559B (en) | Method and system for extracting Chinese key phrase in scientific and technological innovation field by utilizing semantic features | |
CN103617280A (en) | Method and system for mining Chinese event information | |
CN111444713B (en) | Method and device for extracting entity relationship in news event | |
CN110209818A (en) | A kind of analysis method of Semantic-Oriented sensitivity words and phrases | |
CN111831786A (en) | Full-text database accurate and efficient retrieval method for perfecting subject term | |
CN109783806A (en) | A kind of text matching technique using semantic analytic structure | |
CN114580557B (en) | Method and device for determining document similarity based on semantic analysis | |
CN114064855A (en) | Information retrieval method and system based on transformer knowledge base | |
Yilahun et al. | Entity extraction based on the combination of information entropy and TF-IDF | |
CN103646017B (en) | Acronym generating system for naming and working method thereof | |
Wang et al. | Microblog summarization using paragraph vector and semantic structure | |
Maheswari et al. | Rule based morphological variation removable stemming algorithm | |
Wang et al. | Query construction based on concept importance for effective patent retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||