CN118170836B

CN118170836B - File knowledge extraction method and device based on structure priori knowledge

Info

Publication number: CN118170836B
Application number: CN202410592269.7A
Authority: CN
Inventors: 马兵; 尹旭; 王玉石; 许然然; 王明华; 王永婷; 张晓良; 董航
Original assignee: Shandong Energy Shuzhiyun Technology Co ltd
Current assignee: Shandong Energy Shuzhiyun Technology Co ltd
Priority date: 2024-05-14
Filing date: 2024-05-14
Publication date: 2024-09-13
Anticipated expiration: 2044-05-14
Also published as: CN118170836A

Abstract

The embodiment of the invention provides a method and a device for extracting archival knowledge based on structure priori knowledge, which relate to the technical field of data processing and are characterized in that a target document is extracted by multiple features, the multi-feature information in the target document is obtained, the multi-feature information comprises structural features and font features, semantic information of Chinese characters can be captured more accurately, and rich and accurate feature representation is provided for entity identification. And the multi-feature information is subjected to feature fusion based on the feature correlation of the multi-feature information, and the feature extraction and the data dimension reduction are performed, so that the processing effect of the model on the long-distance dependence problem of the data can be enhanced, and the generalization capability of the model is improved. And the structure priori knowledge of deep feature representation is introduced to construct a label prediction model so as to identify the entity of the key information, so that the model can better understand the structure and the feature of the data, the dependency relationship among the entities and the context information thereof can be utilized more effectively, and the accuracy and the robustness of the entity identification are obviously improved.

Description

Archival knowledge extraction method and device based on structural prior knowledge

技术领域Technical Field

本发明涉及数据处理技术领域，尤其涉及一种基于结构先验知识的档案知识抽取方法及装置。The present invention relates to the technical field of data processing, and in particular to a method and device for extracting archive knowledge based on structural prior knowledge.

背景技术Background Art

在当今数字化时代，大量的文档和档案信息需要被有效管理和利用，特别是在中文档案领域，由于中文自身的复杂性以及档案内容的多样性，传统的信息抽取方法面临着巨大的挑战。中文汉字的结构复杂，含有丰富的偏旁部首和字形信息，这些信息对于理解汉字的语义非常关键，但在传统的档案知识抽取方法中往往得不到充分利用。此外，传统方法在处理长距离依赖关系、复杂数据结构以及实体间复杂的逻辑关系时，效果往往不尽人意，导致信息抽取的准确性和效率较低。In today's digital age, a large amount of document and archival information needs to be effectively managed and utilized, especially in the field of Chinese archives. Due to the complexity of Chinese itself and the diversity of archival content, traditional information extraction methods face huge challenges. Chinese characters have a complex structure and contain rich radicals and glyph information. This information is critical for understanding the semantics of Chinese characters, but it is often not fully utilized in traditional archival knowledge extraction methods. In addition, traditional methods often fail to achieve satisfactory results when dealing with long-distance dependencies, complex data structures, and complex logical relationships between entities, resulting in low accuracy and efficiency in information extraction.

现有技术存在以下技术问题：（1）现有技术可能未能充分利用文本数据的结构和字形特征，特别是在处理具有丰富结构和语义信息的中文文档时，限制了实体识别的准确性；（2）在处理复杂数据结构和长距离依赖问题时，现有技术可能未能有效地融合来自不同源的特征，导致模型泛化能力有限；（3）现有技术可能未能充分利用结构先验知识来增强实体识别的准确性和鲁棒性。The existing technology has the following technical problems: (1) The existing technology may fail to fully utilize the structure and glyph features of text data, especially when processing Chinese documents with rich structural and semantic information, which limits the accuracy of entity recognition; (2) When dealing with complex data structures and long-distance dependency problems, the existing technology may fail to effectively integrate features from different sources, resulting in limited model generalization ability; (3) The existing technology may fail to fully utilize structural prior knowledge to enhance the accuracy and robustness of entity recognition.

发明内容Summary of the invention

有鉴于此，本发明的目的在于提供一种基于结构先验知识的档案知识抽取方法及装置，能够增强实体识别的准确性。In view of this, an object of the present invention is to provide a method and device for extracting archival knowledge based on structural prior knowledge, which can enhance the accuracy of entity recognition.

第一方面，本发明实施例提供一种基于结构先验知识的档案知识抽取方法，该方法包括：获取目标文档；对所述目标文档进行多特征提取，得到所述目标文档中的多特征信息；所述多特征信息包括结构特征和字形特征；将所述多特征信息输入至预先构建的特征融合模型中，基于所述多特征提取特征的特征相关性对所述多特征信息进行特征融合，生成融合特征；对所述融合特征进行特征提取以及数据降维，得到所述融合特征中的关键信息；通过预先构建的基于结构先验知识的标签预测模型对所述关键信息进行实体识别，输出所述目标文档中的关键实体信息。In a first aspect, an embodiment of the present invention provides an archival knowledge extraction method based on structural prior knowledge, the method comprising: acquiring a target document; performing multi-feature extraction on the target document to obtain multi-feature information in the target document; the multi-feature information comprises structural features and glyph features; inputting the multi-feature information into a pre-constructed feature fusion model, performing feature fusion on the multi-feature information based on the feature correlation of the multi-feature extraction features to generate fused features; performing feature extraction and data dimensionality reduction on the fused features to obtain key information in the fused features; performing entity recognition on the key information through a pre-constructed label prediction model based on structural prior knowledge, and outputting the key entity information in the target document.

第二方面，本发明实施例还提供一种基于结构先验知识的档案知识抽取装置，该装置包括：数据获取模块，用于获取目标文档；特征提取模块，用于对所述目标文档进行多特征提取，得到所述目标文档中的多特征信息；所述多特征信息包括结构特征和字形特征；特征融合模块，用于将所述多特征信息输入至预先构建的特征融合模型中，基于所述多特征提取特征的特征相关性对所述多特征信息进行特征融合，生成融合特征；预处理模块，用于对所述融合特征进行特征提取以及数据降维，得到所述融合特征中的关键信息；输出模块，用于通过预先构建的基于结构先验知识的标签预测模型对所述关键信息进行实体识别，输出所述目标文档中的关键实体信息。In a second aspect, an embodiment of the present invention further provides an archival knowledge extraction device based on structural prior knowledge, the device comprising: a data acquisition module, used to acquire a target document; a feature extraction module, used to perform multi-feature extraction on the target document to obtain multi-feature information in the target document; the multi-feature information includes structural features and glyph features; a feature fusion module, used to input the multi-feature information into a pre-constructed feature fusion model, perform feature fusion on the multi-feature information based on the feature correlation of the multi-feature extraction features, and generate fused features; a pre-processing module, used to perform feature extraction and data dimensionality reduction on the fused features to obtain key information in the fused features; an output module, used to perform entity recognition on the key information through a pre-constructed label prediction model based on structural prior knowledge, and output the key entity information in the target document.

本发明实施例带来了以下有益效果：本发明提供的一种基于结构先验知识的档案知识抽取方法及装置，通过结合汉字的结构特征和字形特征的细致提取，能够更精确地捕获到汉字的语义信息，为实体识别提供了丰富和精确的特征表示。得到多特征信息后，基于特征相关性进行特征融合，能够增强模型对复杂数据结构的处理能力及对序列数据中长距离依赖问题的处理效果，从而提升了模型的泛化能力。此外，还引入深层特征表示的结构先验知识构建标签预测模型，能够使模型更好的理解数据的结构和特征，因此，本发明实施例能够更有效地利用实体间的依赖关系及其上下文信息，显著提升实体识别的准确性和鲁棒性。The embodiments of the present invention bring about the following beneficial effects: The present invention provides a method and device for extracting archival knowledge based on structural prior knowledge, which can more accurately capture the semantic information of Chinese characters by combining the structural features of Chinese characters and the careful extraction of glyph features, and provide rich and accurate feature representation for entity recognition. After obtaining multi-feature information, feature fusion is performed based on feature correlation, which can enhance the model's ability to handle complex data structures and the effect of handling long-distance dependency problems in sequence data, thereby improving the generalization ability of the model. In addition, the structural prior knowledge of deep feature representation is introduced to construct a label prediction model, which can enable the model to better understand the structure and characteristics of the data. Therefore, the embodiments of the present invention can more effectively utilize the dependencies between entities and their contextual information, significantly improving the accuracy and robustness of entity recognition.

本发明的其他特征和优点将在随后的说明书中阐述，并且，部分地从说明书中变得显而易见，或者通过实施本发明而了解。本发明的目的和其他优点在说明书以及附图中所特别指出的结构来实现和获得。Other features and advantages of the present invention will be described in the following description, and partly become apparent from the description, or understood by practicing the present invention. The purpose and other advantages of the present invention are realized and obtained by the structures particularly pointed out in the description and the drawings.

为使本发明的上述目的、特征和优点能更明显易懂，下文特举较佳实施例，并配合所附附图，作详细说明如下。In order to make the above-mentioned objects, features and advantages of the present invention more obvious and easy to understand, preferred embodiments are given below and described in detail with reference to the accompanying drawings.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明具体实施方式或现有技术中的技术方案，下面将对具体实施方式或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施方式，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the specific implementation methods of the present invention or the technical solutions in the prior art, the drawings required for use in the specific implementation methods or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are some implementation methods of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1为本发明实施例提供的一种基于结构先验知识的档案知识抽取模型训练方法的流程图；FIG1 is a flow chart of a method for training an archive knowledge extraction model based on structural prior knowledge provided by an embodiment of the present invention;

图2为本发明实施例提供的另一种基于结构先验知识的档案知识抽取模型训练方法的流程图；FIG2 is a flow chart of another archival knowledge extraction model training method based on structural prior knowledge provided by an embodiment of the present invention;

图3为本发明实施例提供的一种多特征提取模型的结构图；FIG3 is a structural diagram of a multi-feature extraction model provided by an embodiment of the present invention;

图4为本发明实施例提供的一种基于结构先验知识的档案知识抽取模型训练装置的结构示意图；FIG4 is a schematic diagram of the structure of an archive knowledge extraction model training device based on structural prior knowledge provided by an embodiment of the present invention;

图5为本发明实施例提供的一种电子设备的结构示意图。FIG5 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

为使本发明实施例的目的、技术方案和优点更加清楚，以下通过特定的具体实例说明本公开的实施方式，本领域技术人员可由本说明书所揭露的内容轻易地了解本公开的其他优点与功效。显然，所描述的实施例仅仅是本公开一部分实施例，而不是全部的实施例。本公开还可以通过另外不同的具体实施方式加以实施或应用，本说明书中的各项细节也可以基于不同观点与应用，在没有背离本公开的精神下进行各种修饰或改变。需说明的是，在不冲突的情况下，以下实施例及实施例中的特征可以相互组合。基于本公开中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本公开保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the following embodiments of the present invention are described by specific specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. Obviously, the described embodiments are only part of the embodiments of the present invention, rather than all of the embodiments. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the following embodiments and features in the embodiments can be combined with each other without conflict. Based on the embodiments in the present invention, all other embodiments obtained by ordinary technicians in the field without making creative work are within the scope of protection of the present invention.

为解决上述技术问题，本发明实施例提出一种基于结构先验知识的档案知识抽取方法及装置，可以增强实体识别的准确性。To solve the above technical problems, an embodiment of the present invention proposes a method and device for extracting archival knowledge based on structural prior knowledge, which can enhance the accuracy of entity recognition.

实施例一Embodiment 1

为便于对本实施例进行理解，首先对本发明的实施例所公开的一种基于结构先验知识的档案知识抽取方法进行详细介绍，图1示出了本发明实施例提供的一种基于结构先验知识的档案知识抽取方法的流程图，如图1所示，该方法包括如下步骤：To facilitate understanding of this embodiment, a method for extracting archival knowledge based on structural prior knowledge disclosed in an embodiment of the present invention is first introduced in detail. FIG. 1 shows a flow chart of a method for extracting archival knowledge based on structural prior knowledge provided in an embodiment of the present invention. As shown in FIG. 1 , the method includes the following steps:

步骤S101，获取目标文档。Step S101, obtaining a target document.

步骤S102，对目标文档进行多特征提取，得到目标文档中的多特征信息。Step S102, performing multi-feature extraction on the target document to obtain multi-feature information in the target document.

文档中包含人物、地点、场所等文字，这些文字代表不同实体，为了获取文档中的实体信息，本发明实施例预先对文档中的内容进行多特征提取，以基于提取到的多特征信息识别文档中的实体。本发明进行知识抽取的档案为中文档案，通常而言，中文汉字具有偏旁和部首结构，本发明实施例的多特征信息包括结构特征和字形特征，通过采用多特征提取的方式，对汉字的结构特征和字形特征进行提取，以识别实体类别。The document contains characters such as people, places, and venues, which represent different entities. In order to obtain entity information in the document, the embodiment of the present invention pre-extracts multiple features from the content in the document to identify the entities in the document based on the extracted multiple feature information. The archives for knowledge extraction in the present invention are Chinese archives. Generally speaking, Chinese characters have a radical and a radical structure. The multiple feature information in the embodiment of the present invention includes structural features and glyph features. By adopting a multi-feature extraction method, the structural features and glyph features of Chinese characters are extracted to identify the entity category.

具体的，字形和结构特征表征汉字的外部形状，和部首的组合形式、偏旁的位置关系以及整体的构造方式，这些特征有助于揭示字词的语义和功能属性，进而有助于实体识别的准确性。通过对文档中的每个文字的结构特征和字形特征进行提取，能够利用汉字的视觉和结构属性来增强模型对文档内容的理解能力，尤其是在处理中文文本时增强语义理解，提高模型的准确性和鲁棒性。具体的，可以根据文字内容组合确定文档内所包含的语义，进而确定该文档中的实体。Specifically, the glyph and structural features characterize the external shape of Chinese characters, the combination of radicals, the positional relationship of radicals, and the overall construction method. These features help reveal the semantics and functional attributes of words, and thus help improve the accuracy of entity recognition. By extracting the structural features and glyph features of each word in a document, the visual and structural properties of Chinese characters can be used to enhance the model's ability to understand the content of the document, especially to enhance semantic understanding when processing Chinese text, and improve the accuracy and robustness of the model. Specifically, the semantics contained in the document can be determined based on the combination of text content, and then the entities in the document can be determined.

步骤S103，将多特征信息输入至预先构建的特征融合模型中，基于多特征信息的特征相关性对多特征信息进行特征融合，生成融合特征。Step S103: input the multiple feature information into a pre-built feature fusion model, perform feature fusion on the multiple feature information based on feature correlation of the multiple feature information, and generate fusion features.

提取出来的多特征信息表示为高维的特征向量，这些向量能够综合表达文本的多种属性。如，字形特征向量可能包括笔画数、笔画类型和相互关系的编码；结构特征向量则可能包括编码部首的种类、数量以及它们在汉字中的布局信息。The extracted multi-feature information is represented as high-dimensional feature vectors, which can comprehensively express multiple attributes of the text. For example, the glyph feature vector may include the number of strokes, stroke types and the encoding of their relationship; the structural feature vector may include the type and number of encoded radicals and their layout information in the Chinese character.

本发明实施例还将提取到的多种特征结合起来，捕捉特征间的动态相关性，增强模型对复杂数据结构的处理能力，同时有效地处理序列数据中的长距离依赖问题。以提升信息抽取或分类任务的效果，通过提取并融合多种特征，可以显著提升处理中文文档的能力，解决现有技术在处理中文复杂性和多样性方面的不足。The embodiment of the present invention also combines the extracted multiple features to capture the dynamic correlation between the features, enhance the model's ability to process complex data structures, and effectively handle the long-distance dependency problem in sequence data. This can improve the effect of information extraction or classification tasks. By extracting and fusing multiple features, the ability to process Chinese documents can be significantly improved, solving the shortcomings of existing technologies in processing the complexity and diversity of Chinese.

在具体实现时，本发明实施例利用已训练完成的模型对新样本进行处理，以实现档案知识抽取。在一个实施例中，在会议纪要文档中抽取关键参与人员、日期、地点等实体信息。首先，对新的会议纪要文档进行多特征提取。进一步地，使用已训练的特征融合模型进行特征融合。In a specific implementation, the embodiment of the present invention uses a trained model to process new samples to achieve archival knowledge extraction. In one embodiment, entity information such as key participants, dates, and locations are extracted from meeting minutes documents. First, multiple features are extracted from the new meeting minutes document. Further, feature fusion is performed using a trained feature fusion model.

步骤S104，对融合特征进行特征提取以及数据降维，得到融合特征中的关键信息。Step S104, performing feature extraction and data dimension reduction on the fused features to obtain key information in the fused features.

步骤S105，通过预先构建的标签预测模型对关键信息进行实体识别，输出目标文档中的关键实体信息。Step S105, performing entity recognition on key information through a pre-built tag prediction model, and outputting key entity information in the target document.

为提高识别精度，本发明实施例还对上述融合特征进行特征提取和数据降维，以减少计算复杂度同时保留关键信息。在具体实现时，可以通过预先训练的模型处理文档，得到高维特征向量。进一步地，使用已训练的特征降维模型对特征向量进行降维，以减少计算复杂度同时保留关键信息。To improve recognition accuracy, the embodiment of the present invention also performs feature extraction and data dimensionality reduction on the above fusion features to reduce computational complexity while retaining key information. In a specific implementation, the document can be processed by a pre-trained model to obtain a high-dimensional feature vector. Further, the feature vector is reduced in dimension using a trained feature dimensionality reduction model to reduce computational complexity while retaining key information.

进一步地，使用已训练的模型将降维后的特征向量进行标签预测，进行实体识别和分类。其中，基于结构先验知识的标签预测模型基于标签进行实体识别和分类，标签包括文本中特定的实体类别，如人名。进一步地，根据模型预测结果，输出文档中的关键实体信息，如参与人员名单、会议日期和地点等。Furthermore, the trained model is used to predict the labels of the reduced feature vectors for entity recognition and classification. The label prediction model based on structural prior knowledge performs entity recognition and classification based on labels, which include specific entity categories in the text, such as names. Furthermore, based on the model prediction results, key entity information in the document is output, such as a list of participants, meeting date and location, etc.

其中，本发明实施例引入深层特征表示的结构先验知识构建标签预测模型，能够使模型更好的理解数据的结构和特征。进一步地，通过结合汉字的结构特征和字形特征的细致提取，得到多特征信息，并得到融合信息和关键信息后，本发明能够更精确地捕获到汉字的语义信息，为实体识别提供了丰富和精确的特征表示。Among them, the embodiment of the present invention introduces structural prior knowledge of deep feature representation to construct a label prediction model, which enables the model to better understand the structure and characteristics of the data. Furthermore, by combining the structural features of Chinese characters and the careful extraction of glyph features, obtaining multi-feature information, and obtaining fusion information and key information, the present invention can more accurately capture the semantic information of Chinese characters and provide rich and accurate feature representation for entity recognition.

综上，本发明实施例提供的一种基于结构先验知识的档案知识抽取方法，能够更有效地利用实体间的依赖关系及其上下文信息，尤其是在处理复杂的实体关系和丰富的上下文信息时，显著提升实体识别的准确性和鲁棒性。In summary, the archival knowledge extraction method based on structural prior knowledge provided by the embodiment of the present invention can more effectively utilize the dependencies between entities and their contextual information, especially when dealing with complex entity relationships and rich contextual information, and significantly improve the accuracy and robustness of entity recognition.

实施例二Embodiment 2

进一步地，现有技术可能缺乏在多个粒度上搜索和提取特征的能力，以及缺少有效的双向增强机制来优化特征提取过程，此外，在特征压缩和降维过程中，现有技术可能未能有效地保留关键的交互信息，导致高维特征空间中的信息损失，同时也可能影响了处理效率。因此，在上述实施例的基础上，本发明实施例还提供另一种基于结构先验知识的档案知识抽取方法，图2示出了本发明实施例提供的另一种基于结构先验知识的档案知识抽取方法的流程图，如图2所示，该方法包括如下步骤：Furthermore, the prior art may lack the ability to search and extract features at multiple granularities, and lack an effective bidirectional enhancement mechanism to optimize the feature extraction process. In addition, in the process of feature compression and dimensionality reduction, the prior art may fail to effectively retain key interactive information, resulting in information loss in the high-dimensional feature space, and may also affect the processing efficiency. Therefore, based on the above embodiment, the embodiment of the present invention also provides another archival knowledge extraction method based on structural prior knowledge. FIG2 shows a flowchart of another archival knowledge extraction method based on structural prior knowledge provided by an embodiment of the present invention. As shown in FIG2, the method includes the following steps:

步骤S201，获取目标文档。Step S201, obtaining a target document.

步骤S202，将目标文档输入至预先构建的多特征提取模型，通过多特征提取模型对目标文档进行多特征提取，得到目标文档中的多特征信息。Step S202: input the target document into a pre-built multi-feature extraction model, perform multi-feature extraction on the target document through the multi-feature extraction model, and obtain multi-feature information in the target document.

在具体实现时，本发明实施例通过预先构建的多特征提取模型进行多特征提取，以确定文档中的字形特征和结构特征。其中，多特征提取模型包括结构特征提取模型和字形特征提取模型；结构特征提取模型用于提取目标文档中文字的结构特征，字形特征提取模型用于提取目标文档中文字的字形特征。In specific implementation, the embodiment of the present invention performs multi-feature extraction through a pre-built multi-feature extraction model to determine the glyph features and structural features in the document. The multi-feature extraction model includes a structural feature extraction model and a glyph feature extraction model; the structural feature extraction model is used to extract the structural features of the characters in the target document, and the glyph feature extraction model is used to extract the glyph features of the characters in the target document.

在结构特征提取方面，本发明通过深度学习模型来识别和提取汉字中的偏旁部首和构造特征。与传统方法相比，本方法不仅识别汉字的基本偏旁，还进一步分析了汉字的构造方式，如偏旁的组合形态和相对位置关系，这些都是传统方法难以触及的深层特征。本发明实施例对偏旁部首和构造特征的融合能够提高汉字及其语义的理解深度，还增强了模型在特定数据集和情境下的适应性和准确性，特别是在中文文本的处理中，能够显著提升信息抽取的准确性。基于此，所提出方法能够更加精准地捕捉到汉字的语义信息，为后续的实体识别提供了更为丰富和精确的特征表示。In terms of structural feature extraction, the present invention uses a deep learning model to identify and extract radicals and construction features in Chinese characters. Compared with traditional methods, this method not only identifies the basic radicals of Chinese characters, but also further analyzes the construction methods of Chinese characters, such as the combination form and relative position relationship of radicals, which are deep features that are difficult to reach with traditional methods. The fusion of radicals and construction features in the embodiments of the present invention can improve the depth of understanding of Chinese characters and their semantics, and also enhance the adaptability and accuracy of the model in specific data sets and situations, especially in the processing of Chinese texts, which can significantly improve the accuracy of information extraction. Based on this, the proposed method can more accurately capture the semantic information of Chinese characters, and provide a richer and more accurate feature representation for subsequent entity recognition.

在字形特征提取方面，本发明通过改进的卷积神经网络模型来处理汉字文档，在一个实施例中，采用VGG16网络（卷积神经网络的一类）架构作为基础，但对其输入层进行了特别设计，以适应汉字的特殊性。所提出方法能够有效提取出汉字字形的高维特征，这些特征不仅包括了汉字的笔画和结构，还细化到了笔画的弯曲度、粗细变化等微观特征，丰富了特征集合，为档案知识抽取的精度提供基础。In terms of character shape feature extraction, the present invention processes Chinese character documents through an improved convolutional neural network model. In one embodiment, the VGG16 network (a type of convolutional neural network) architecture is used as the basis, but its input layer is specially designed to adapt to the particularity of Chinese characters. The proposed method can effectively extract high-dimensional features of Chinese character shapes. These features not only include the strokes and structure of Chinese characters, but also include microscopic features such as the curvature and thickness changes of the strokes, enriching the feature set and providing a basis for the accuracy of archival knowledge extraction.

在具体实现时，图3示出了一种多特征提取模型的结构图，该模型能够对每个字进行处理，得到分别对应的特征向量。多特征提取模型的构建方法，包括以下步骤：In the specific implementation, FIG3 shows a structure diagram of a multi-feature extraction model, which can process each word to obtain the corresponding feature vector. The construction method of the multi-feature extraction model includes the following steps:

1）获取预先收集的中文档案，采集中文档案中的汉字数据。1) Obtain the pre-collected Chinese archives and collect the Chinese character data in the Chinese archives.

2）针对汉字数据的汉字特征，对汉字数据进行标注，得到数据标签。2) Based on the Chinese character features of the Chinese character data, the Chinese character data is labeled to obtain data labels.

其中，先从中文档案中进行数据采集，并对采集的数据根据其汉字特征进行标注，以构建多特征提取模型。汉字特征包括汉字数据的偏旁部首和字形，可以基于汉字的偏旁部首、字形等特征生成数据标签，如，将“村”的偏旁部首分为“木”和“寸”，则偏旁为“木”、部首为“寸”的字标注为“村”，进而基于该数据标签由标签预测模型确定该汉字对应的实体，如地点。在一个实施例中，中文文档是会议纪要文档，通过采集会议纪要文档中的汉字数据，并对这些数据进行标注，得到标签，以训练多特征提取模型。Among them, data is first collected from Chinese archives, and the collected data is annotated according to its Chinese character features to build a multi-feature extraction model. Chinese character features include radicals and glyphs of Chinese character data. Data labels can be generated based on the radicals, glyphs and other features of Chinese characters. For example, the radicals of "村" are divided into "木" and "寸", and the characters with the radical "木" and the radical "寸" are annotated as "村", and then the entity corresponding to the Chinese character, such as a place, is determined by the label prediction model based on the data label. In one embodiment, the Chinese document is a meeting minutes document, and labels are obtained by collecting Chinese character data in the meeting minutes document and annotating these data to train a multi-feature extraction model.

3）利用第一卷积神经网络对汉字数据按部首结构进行拆分和结构特征提取，确定结构特征表示；以及，通过VGG16网络捕捉汉字数据的字形细节特征。3) Use the first convolutional neural network to split the Chinese character data according to the radical structure and extract the structural features to determine the structural feature representation; and capture the glyph details of the Chinese character data through the VGG16 network.

对文档中的数据进行标注后，本发明实施例分别使用不同的模型处理结构特征和字形特征，再将结构特征和字形特征拼接为训练样本，基于该训练样本训练第三种模型，以基于训练好的第三种模型构建多特征提取模型。After annotating the data in the document, the embodiment of the present invention uses different models to process structural features and glyph features respectively, and then splices the structural features and glyph features into training samples, trains a third model based on the training samples, and constructs a multi-feature extraction model based on the trained third model.

a-通过下述步骤确定结构特征表示：a-Determine the structural feature representation by the following steps:

在具体实现时，利用卷积神经网络，对每个汉字按部首结构进行拆分和结构特征提取。卷积神经网络将学习到汉字部首的组合方式和结构特征，并将这些特征转化为向量形式，为每个汉字生成一个结构特征表示。In the specific implementation, a convolutional neural network is used to split each Chinese character according to its radical structure and extract its structural features. The convolutional neural network will learn the combination and structural features of the Chinese character radicals, and convert these features into vector form to generate a structural feature representation for each Chinese character.

具体的，给定一组中文档案数据，其中表示单个文档，每个文档包含一系列的汉字。对于每个汉字，通过结构分解获取其偏旁部首集合，其中是汉字的一个偏旁部首。Specifically, given a set of Chinese archive data ,in Represents a single document, each document contains a series of Chinese characters For each Chinese character , obtain its radical set through structural decomposition ,in It is a Chinese character A radical of .

进一步地，结构特征提取的目标是将每个偏旁部首映射到一个高维特征空间，可以表示为：Furthermore, the goal of structural feature extraction is to transform each radical Mapped to a high-dimensional feature space, it can be expressed as:

其中，是偏旁部首的结构特征表示，和是卷积神经网络中的权重和偏置参数，是Sigmoid非线性激活函数。in, It is a radical The structural characteristics of and are the weight and bias parameters in the convolutional neural network, It is the Sigmoid nonlinear activation function.

进一步地，和是通过卷积层学习得到的权重和偏置，的更新依赖于梯度下降法，计算方式可以表示为：Further, and are the weights and biases learned through the convolutional layer, The update depends on the gradient descent method, and the calculation method can be expressed as:

其中，是权重的学习率，是损失函数，是对的梯度。在一个实施例中，设置为0.01。且，的更新也遵循梯度下降法，计算方式可以表示为：in, is the learning rate of the weights, is the loss function, yes right In one embodiment, is set to 0.01. And, The update also follows the gradient descent method, and the calculation method can be expressed as:

其中，是对的梯度，为偏置的学习率。in, yes right The gradient of is the bias learning rate.

b-通过下述步骤捕捉汉字数据的字形细节特征：b- Capture the detailed features of the Chinese character data through the following steps:

在具体实现时，利用VGG16网络处理汉字，以提取字形特征。VGG16网络的特点在于其深度和简单性，它包含了16个层，其中包括13个卷积层、5个池化层和3个全连接层。VGG16网络使用3x3的小尺寸卷积核进行特征提取，并通过ReLU激活函数进行非线性激活，这种小的卷积核尺寸被广泛应用于后续的卷积层中。在每个卷积层之后，VGG16使用2x2的最大池化层进行空间降采样，以提取主要特征并减少参数数量。In the specific implementation, the VGG16 network is used to process Chinese characters to extract glyph features. The VGG16 network is characterized by its depth and simplicity. It contains 16 layers, including 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. The VGG16 network uses a small 3x3 convolution kernel for feature extraction and performs nonlinear activation through the ReLU activation function. This small convolution kernel size is widely used in subsequent convolution layers. After each convolution layer, VGG16 uses a 2x2 maximum pooling layer for spatial downsampling to extract the main features and reduce the number of parameters.

网络通过深层次的卷积和池化操作，捕捉到汉字字形的细节特征，并将这些特征编码为高维特征向量，在一个实施例中，汉字的字形特征通过VGG16网络提取，对于给定汉字文档，字形特征提取可以表示为：The network captures the detailed features of the Chinese character shape through deep convolution and pooling operations, and encodes these features into high-dimensional feature vectors. In one embodiment, the Chinese character The glyph features are extracted through the VGG16 network. For a given Chinese character document , the glyph feature extraction can be expressed as:

其中，表示经过输入层与输入字符对应的VGG16网络，是汉字的字形特征表示。in, Represents the VGG16 network corresponding to the input character after the input layer, It is a Chinese character Character feature representation of .

4）将结构特征表示和字形细节特征拼接为综合特征向量，通过综合特征向量对第二卷积神经网络模型进行训练，采用Adam优化器对第二卷积神经网络模型的模型参数进行优化。4) The structural feature representation and the glyph detail features are concatenated into a comprehensive feature vector, the second convolutional neural network model is trained through the comprehensive feature vector, and the model parameters of the second convolutional neural network model are optimized using the Adam optimizer.

其中，根据结构特征表示和字形细节特征，确定汉字数据对应的上下文特征，该上下文特征表征结构特征表示和字形细节特征之间的关联。设为文档的上下文特征，由结构特征和字形特征共同生成，可以表示为：According to the structural feature representation and the glyph detail feature, the context feature corresponding to the Chinese character data is determined, and the context feature represents the association between the structural feature representation and the glyph detail feature. is the context feature of the document, which is generated by structural features and glyph features and can be expressed as:

其中，和分别是权重矩阵和偏置向量，表示特征拼接。in, and are the weight matrix and bias vector respectively, Represents feature concatenation.

进一步地，通过函数评估结构特征表示和字形细节特征分别与上下文特征的相关性；基于相关性，计算结构特征表示和字形细节特征分别对应的特征权重。基于计算结构特征和字形特征的动态权重，可以表示为：Further, through The function evaluates the correlation between the structural feature representation and the glyph detail feature and the context feature respectively; based on the correlation, the feature weights corresponding to the structural feature representation and the glyph detail feature are calculated respectively. The dynamic weights of structural features and glyph features can be calculated as follows:

其中，和分别为结构特征权重和字形特征的动态权重。函数用于评估特征与上下文的相关性，可以表示为：in, and They are the structural feature weights and the dynamic weights of glyph features respectively. The function is used to evaluate the relevance of features to context and can be expressed as:

其中，和是卷积神经网络中的权重和偏置参数，是权重矩阵，F_x为被评估特征。in, and are the weight and bias parameters in the convolutional neural network, is the weight matrix, and _Fx is the feature being evaluated.

进一步地，基于特征权重将结构特征表示和字形细节特征拼接为综合特征向量。具体地，通过一个基于动态特征校正机制的拼接函数实现，该函数考虑了两种特征的相互作用和补充性，并且通过评估文档中的上下文信息动态调整结构特征和字形特征在拼接过程中的权重，能够根据文档的具体内容和特征的重要性自适应调整特征权重，进而优化特征表示，可以表示为：Furthermore, the structural feature representation and the glyph detail features are concatenated into a comprehensive feature vector based on the feature weights. Specifically, a concatenation function based on a dynamic feature correction mechanism is used. This function takes into account the interaction and complementarity of the two features, and dynamically adjusts the weights of structural features and glyph features in the splicing process by evaluating the contextual information in the document. It can adaptively adjust the feature weights according to the specific content of the document and the importance of the features, thereby optimizing the feature representation. It can be expressed as:

进一步地，α_struct的计算是基于文档的上下文特征F_context得到。Furthermore, the calculation of α _struct is based on the contextual features F _context of the document.

得到综合特征向量后，通过综合特征向量对第二卷积神经网络模型进行训练，具体步骤如下：After obtaining the comprehensive feature vector, the second convolutional neural network model is trained through the comprehensive feature vector. The specific steps are as follows:

将综合特征向量输入另一卷积神经网络模型（也即第二卷积神经网络模型），以学习汉字与其语义信息之间的映射关系。具体的，给定综合特征向量，训练一个卷积神经网络模型来预测汉字的标签（也即上述数据标签），模型训练的目标是最小化预测标签和真实标签之间的差异，使用交叉熵损失函数表示为：The comprehensive feature vector is input into another convolutional neural network model (i.e., the second convolutional neural network model) to learn the mapping relationship between Chinese characters and their semantic information. Specifically, given the comprehensive feature vector , train a convolutional neural network model To predict the label of the Chinese character (that is, the data label mentioned above), the goal of model training is to minimize the difference between the predicted label and the true label, using the cross entropy loss function as follows:

其中，是真实标签，是模型对汉字的预测标签，求和遍历所有训练样本。in, is the true label, It is a model Chinese characters The predicted labels are summed over all training samples.

进一步地，预测标签通过预设的softmax函数计算得到的，可以表示为：Furthermore, the predicted label Calculated by the preset softmax function, it can be expressed as:

其中，是softmax函数输出的原始分数，是类别的总数。in, is the raw score output by the softmax function, is the total number of categories.

进一步地，模型的参数优化采用Adam优化器，更新规则为：Furthermore, the model parameters are optimized using the Adam optimizer, and the update rule is:

其中，是在第次迭代的模型参数，是在第次迭代的模型参数。是学习率，由人为预设。和分别是一阶和二阶矩阵估计的偏差校正值，是为了维持数值稳定性而加入的一个很小的常数。在一个实施例中，设置为0.01，设置为0.001。in, It is in The model parameters of the iteration, It is in The model parameters for the iteration. It is the learning rate, which is preset manually. and are the bias correction values of the first-order and second-order matrix estimates, respectively, is a small constant added to maintain numerical stability. In one embodiment, Set to 0.01, Set to 0.001.

进一步地，和的计算方式可以表示为：Further, and The calculation method can be expressed as:

其中，为第t次迭代的一阶矩阵估计，为第t-1次迭代的一阶矩阵估计；是一阶矩阵估计的指数衰减率，是二阶矩阵估计的指数衰减率。为第t次迭代的损失函数；为第t次迭代的二阶矩阵估计；in, is the first-order matrix estimate for the t-th iteration, is the first-order matrix estimate for the t-1th iteration; is the exponential decay rate of the first-order matrix estimate, is the exponential decay rate of the second-order matrix estimate. is the loss function of the tth iteration; is the second-order matrix estimate of the t-th iteration;

5）基于训练好的第二卷积神经网络模型构建多特征提取模型。5) Build a multi-feature extraction model based on the trained second convolutional neural network model.

综上，得到构建好的多特征提取模型，以通过该模型对目标文档进行多特征提取，得到的特征为多特征信息。In summary, a well-constructed multi-feature extraction model is obtained, through which multi-feature extraction is performed on the target document, and the obtained features are multi-feature information.

之后，对多特征信息进行特征融合。其中，本发明提出一种基于裂化特征的Transformer算法，通过利用切比雪夫理论进行改进，以优化特征融合过程，提升算法的效率和准确性，增强模型对复杂数据结构的处理能力，同时有效地处理序列数据中的长距离依赖问题。具体地，参照下述步骤S203-S206。After that, the multi-feature information is subjected to feature fusion. The present invention proposes a Transformer algorithm based on cracking features, which is improved by using Chebyshev theory to optimize the feature fusion process, improve the efficiency and accuracy of the algorithm, enhance the model's ability to handle complex data structures, and effectively handle the long-distance dependency problem in sequence data. Specifically, refer to the following steps S203-S206.

步骤S203，对多特征信息进行特征裂化处理，得到细粒度的子特征集。Step S203, performing feature cracking processing on the multi-feature information to obtain a fine-grained sub-feature set.

为了更细致地捕捉特征间的相关性，本发明实施例将结构特征和字形特征进行特征裂化处理，将特征分解为更细粒度的子特征集。对于结构特征和字形特征，通过裂化处理细分这些特征以捕获更细粒度的信息，设裂化操作为，则裂化后的特征表示为：In order to capture the correlation between features in more detail, the embodiment of the present invention performs feature cracking processing on structural features and glyph features, and decomposes the features into more fine-grained sub-feature sets. and glyph features , these features are subdivided by cracking to capture more fine-grained information. Let the cracking operation be , then the characteristics after cracking are expressed as:

在一个实施例中，裂化处理的目的是将原始特征细分成更小的子特征集合，以提高特征处理的精细度，具体计算过程可以表示为：In one embodiment, the cracking process The purpose of is to subdivide the original features into smaller sub-feature sets to improve the precision of feature processing. The specific calculation process can be expressed as:

其中，是原始特征向量，是裂化矩阵，用于将原始特征向量映射到一个更高维度的特征空间。且，在本实施例中，裂化矩阵的计算基于特征的统计属性得到，具体为，通过主成分分析提取特征的主要变化方向。in, is the original eigenvector, is a cracking matrix, which is used to map the original feature vector to a higher-dimensional feature space. And, in this embodiment, the cracking matrix The calculation of is based on the statistical properties of the features. Specifically, the main change directions of the features are extracted through principal component analysis.

步骤S204，采用Transformer结构捕捉子特征集对应的动态相关性，得到整合特征。Step S204: using the Transformer structure to capture the dynamic correlation corresponding to the sub-feature sets to obtain integrated features.

采用Transformer结构处理裂化后的特征，通过自注意力机制有效捕捉特征间的动态相关性，使得远距离的特征信息能够被有效整合。具体的，设Transformer的编码函数为，则特征经过Transformer处理后的表示为：The Transformer structure is used to process the cracked features, and the dynamic correlation between features is effectively captured through the self-attention mechanism, so that the long-distance feature information can be effectively integrated. Specifically, let the encoding function of the Transformer be , then the feature after Transformer processing is represented as:

其中，Transformer编码函数包括自注意力机制和前馈神经网络，自注意力机制中的权重计算如下：Among them, Transformer encoding function Including the self-attention mechanism and the feedforward neural network, the weight calculation in the self-attention mechanism is as follows:

其中，，，分别为查询、键、值矩阵，为键向量的维度。in, , , are query, key, and value matrices respectively, is the dimension of the key vector.

进一步地，查询、键和值的计算方式如下：Further, query ,key Sum The calculation method is as follows:

其中，，，是可学习的权重矩阵，用于将特征映射到查询、键和值空间。是键向量的维度，用于缩放点积，防止点积过大导致的梯度消失问题。in, , , is a learnable weight matrix used to transform features Maps to query, key, and value spaces. It is the dimension of the key vector, which is used to scale the dot product to prevent the gradient vanishing problem caused by too large a dot product.

步骤S205，基于整合特征在高维空间中的失真和信息损失，利用切比雪夫多项式对整合特征进行优化，得到优化特征。Step S205 , based on the distortion and information loss of the integrated features in the high-dimensional space, the integrated features are optimized using Chebyshev polynomials to obtain optimized features.

具体的，利用切比雪夫多项式来优化Transformer输出的特征，该优化能够有效降低特征在高维空间中的失真和信息损失，设的递归定义为：Specifically, using the Chebyshev polynomial To optimize the features of Transformer output, this optimization can effectively reduce the distortion and information loss of features in high-dimensional space. The recursive definition of is:

进一步地，将应用于和的每个元素，以实现特征优化，得到优化后的特征表示为：Further, Application and Each element of is optimized to achieve feature optimization, and the optimized feature is expressed as:

步骤S206，计算优化特征和多特征信息分别对应的重要性得分，基于重要性得分将优化特征和多特征信息进行融合，得到融合特征。Step S206, calculating the importance scores corresponding to the optimized features and the multi-feature information respectively, and fusing the optimized features and the multi-feature information based on the importance scores to obtain a fused feature.

具体的，设假设特征、和的重要性得分分别为、和，以结构特征的动态权重为例，计算方式可以表示为：Specifically, assume that the characteristic , and The importance scores are , and , with the dynamic weight of structural features For example, the calculation method can be expressed as:

将特征优化后的特征、与拼接后的综合特征进行融合，得到融合特征，融合函数可以表示为：The optimized features , The comprehensive characteristics after splicing Fusion is performed to obtain fusion features and fusion functions It can be expressed as:

其中，是加权函数，且加权的权重为动态权重，即，通过动态权重分配可以根据特征的重要性自适应调整其在融合特征中的占比。in, It is a weighted function, and the weighted weight is a dynamic weight, that is, through dynamic weight allocation, the proportion of the feature in the fusion feature can be adaptively adjusted according to its importance.

最终的融合特征为：The final fusion features are:

其中，重要性得分和基于特征的主成分分析中特征的重要性得到。、、分别为特征、、的动态权重。Among them, the importance score and The importance of features is obtained in the feature-based principal component analysis. , , Characteristics , , Dynamic weight.

步骤S207，对融合特征进行特征提取以及数据降维，得到融合特征中的关键信息。Step S207, performing feature extraction and data dimension reduction on the fused features to obtain key information in the fused features.

上述融合特征是基于多特征信息更细粒度的特征间的动态相关性融合确定，远距离的特征信息能够被有效整合，并能够有效地处理序列数据中的长距离依赖问题。通过确定融合特征中的关键信息，能够提高实体识别的准确性。具体的，本发明实施例先通过预先构建的特征提取模型对融合特征进行特征提取，再通过特征降维模型对提取的特征进行降维，以确定关键信息。The above-mentioned fusion features are determined by the dynamic correlation fusion between finer-grained features based on multiple feature information. The long-distance feature information can be effectively integrated and can effectively handle the long-distance dependency problem in the sequence data. By determining the key information in the fusion features, the accuracy of entity recognition can be improved. Specifically, the embodiment of the present invention first extracts the fusion features through a pre-constructed feature extraction model, and then reduces the dimensionality of the extracted features through a feature dimensionality reduction model to determine the key information.

a-通过下述步骤构建特征提取模型：a-Build the feature extraction model through the following steps:

在具体实现时，可以通过训练样本对应的融合特征进行特征提取模型的训练，本发明提出一种基于多粒度鲸鱼优化算法的神经网络算法，采用双向增强的方式从数据中提取更加丰富和有意义的特征，以便更准确地进行实体识别。本发明对鲸鱼优化算法进行了改进，使其能够在多个粒度上搜索特征空间，从而捕获不同层次的数据特征，使得算法能够更全面地理解数据结构，提取出对实体识别更有利的特征。此外，在多粒度特征提取的基础上，本发明采用双向增强机制，从两个方向调整特征提取过程：一是通过深度学习模型加深特征的语义理解，二是通过反向传播优化算法指导鲸鱼优化算法的搜索策略，这种双向作用不仅提高了特征提取的准确性，也增强了算法的自适应能力。In specific implementation, the feature extraction model can be trained by the fusion features corresponding to the training samples. The present invention proposes a neural network algorithm based on the multi-granularity whale optimization algorithm, which uses a bidirectional enhancement method to extract richer and more meaningful features from the data, so as to more accurately perform entity recognition. The present invention improves the whale optimization algorithm so that it can search the feature space at multiple granularities, thereby capturing data features at different levels, so that the algorithm can understand the data structure more comprehensively and extract features that are more beneficial to entity recognition. In addition, on the basis of multi-granularity feature extraction, the present invention adopts a bidirectional enhancement mechanism to adjust the feature extraction process from two directions: one is to deepen the semantic understanding of the features through a deep learning model, and the other is to guide the search strategy of the whale optimization algorithm through a back-propagation optimization algorithm. This two-way effect not only improves the accuracy of feature extraction, but also enhances the algorithm's adaptive ability.

在具体实现时，包括下述步骤a1-a4：In the specific implementation, the following steps a1-a4 are included:

a1）使用改进后的鲸鱼优化算法在多个粒度上对预设的训练样本集进行初步特征提取，通过模拟鲸鱼的社会行为和捕食策略，在特征空间中搜索最优特征子集。a1) Use the improved whale optimization algorithm to perform preliminary feature extraction on the preset training sample set at multiple granularities, and search for the optimal feature subset in the feature space by simulating the social behavior and predation strategy of whales.

具体的，初步特征提取的步骤如下：Specifically, the steps of preliminary feature extraction are as follows:

（1）初始化：生成初始鲸鱼群体位置，其中，是鲸鱼群体的大小。每个鲸鱼位置代表特征空间中的一个解，即一组特征的选择。(1) Initialization: Generate the initial whale group position ,in , is the size of the whale group. Each whale position Represents a solution in the feature space, i.e., a selection of a set of features.

（2）计算适应度：对每个鲸鱼位置，计算其适应度。适应度评估是基于特征选择后的分类器性能，在一个实施例中，适应度值为F1分数。(2) Calculate fitness: For each whale position , calculate its fitness The fitness evaluation is based on the classifier performance after feature selection, and in one embodiment, the fitness value is the F1 score.

（3）更新位置：根据当前最优解更新鲸鱼的位置，位置更新公式包括模拟鲸鱼捕食行为和随机搜索，具体为：(3) Update position: based on the current optimal solution Update the position of the whale. The position update formula includes simulating whale predation behavior and random search, specifically:

其中，和是根据算法迭代计算的系数，是当前鲸鱼位置与目标鲸鱼（最优解）之间的距离。进一步地，系数和是根据算法的迭代进行动态计算，计算方式可以表示为：in, and is the coefficient calculated by the algorithm iteration, is the distance between the current whale position and the target whale (optimal solution). Furthermore, the coefficient and It is dynamically calculated according to the iteration of the algorithm, and the calculation method can be expressed as:

其中，是一个随迭代次数线性减小从2到0的系数，是一个[0，1]范围内的随机数。in, is a coefficient that decreases linearly from 2 to 0 with the number of iterations, is a random number in the range [0, 1].

a2）将最优特征子集输入到深度学习模型中进行语义加深，得到增强特征。a2) Input the optimal feature subset into the deep learning model for semantic deepening to obtain enhanced features.

a3）计算增强特征对应的损失，根据损失调整鲸鱼优化算法的搜索策略。a3) Calculate the loss corresponding to the enhanced features and adjust the search strategy of the whale optimization algorithm based on the loss.

具体的，将初步提取的特征输入到深度学习模型中进行语义加深，并利用模型的反馈调整鲸鱼优化算法的搜索策略。深度学习模型和鲸鱼优化算法相互作用，相互学习，共同优化特征集。Specifically, the initially extracted features are input into the deep learning model for semantic deepening, and the feedback from the model is used to adjust the search strategy of the whale optimization algorithm. The deep learning model and the whale optimization algorithm interact and learn from each other to jointly optimize the feature set.

具体的，优化特征集的步骤如下：Specifically, the steps for optimizing the feature set are as follows:

（1）利用预设的深度神经网络对提取的特征进行语义增强，即：(1) Using the preset deep neural network The extracted features are semantically enhanced, namely:

其中，是经过神经网络增强后的特征表示。in, It is the feature representation enhanced by the neural network.

（2）根据深度学习模型的反馈调整鲸鱼优化算法的搜索策略。(2) Adjust the search strategy of the whale optimization algorithm based on the feedback from the deep learning model.

具体的，利用损失函数的梯度信息指导鲸鱼的位置更新，即：Specifically, using the loss function The gradient information guides the whale's position update, namely:

其中，是动态标签校正后的标签，是学习率，是损失函数关于的梯度。在一个实施例中，损失函数为交叉熵损失。in, is the label after dynamic label correction, is the learning rate, is the loss function about In one embodiment, the loss function is the cross entropy loss.

动态标签校正后的标签通过动态标签校正模块采用自适应调整机制的方式得到，根据模型在训练过程中的性能反馈动态调整标签的权重，不仅考虑了原始标注的不确定性，还通过模型的实时反馈来优化标签分配，以此来减少错误标签对训练的负面影响，增强模型的鲁棒性。Dynamic label correction label The dynamic label correction module adopts an adaptive adjustment mechanism to dynamically adjust the label weight according to the performance feedback of the model during the training process. It not only takes into account the uncertainty of the original annotation, but also optimizes the label allocation through the real-time feedback of the model, so as to reduce the negative impact of incorrect labels on training and enhance the robustness of the model.

具体的，设为原始训练标签集，模型预测输出为，校正后的标签的计算方式可以表示为：Specifically, is the original training label set, and the model prediction output is , the corrected label The calculation method can be expressed as:

其中，是自适应调整系数，根据模型在验证集上的性能动态调整，可以表示为：in, is the adaptive adjustment coefficient, which is dynamically adjusted according to the performance of the model on the validation set and can be expressed as:

其中，是预设的缩放因子，用于控制标签校正的强度。在一个实施例中，缩放因子设置为0.01。是模型在当前迭代的验证集上的准确率，用于评估模型的当前性能。in, is a preset scaling factor used to control the strength of label correction. In one embodiment, the scaling factor Set to 0.01. It is the accuracy of the model on the validation set of the current iteration, which is used to evaluate the current performance of the model.

进一步地，梯度信息的计算方式可以表示为：Furthermore, the gradient information The calculation method can be expressed as:

其中，是基于增强特征和真实标签计算的损失函数，在一个实施例中，为交叉熵损失。in, Based on enhanced features and the true label The loss function calculated, in one embodiment, is the cross entropy loss.

进一步地，为了确保最终用于实体识别的特征集既包含丰富的语义信息，又具有较高的区分度，本发明实施例还将经过双向增强优化后的特征进行融合和选择，以基于选择后的特征确定深度学习模型的反馈。Furthermore, in order to ensure that the feature set ultimately used for entity recognition contains both rich semantic information and high discrimination, the embodiment of the present invention also fuses and selects the features after bidirectional enhancement optimization to determine the feedback of the deep learning model based on the selected features.

具体的，将经过双向增强后的特征与原始特征（也即步骤a1得到的最优特征子集）进行融合，得到最终的特征表示为：Specifically, the features after bidirectional enhancement With the original features (that is, the optimal feature subset obtained in step a1) is fused to obtain the final feature representation for:

其中，是融合权重，用于平衡增强特征和原始特征的贡献，由人为预设。在一个实施例中，设置为0.3。in, is a fusion weight used to balance the contribution of enhanced features and original features, which is preset manually. In one embodiment, Set to 0.3.

进一步地，基于的重要性评分进行特征选择，选择一定比例的特征，以减少特征维度并提高模型性能。可以采用预设的随机森林算法通过交叉验证的方式计算得到重要性评分。在一个实施例中，根据重要性评分，选择前50%的特征。Further, based on The importance score is used to perform feature selection, and a certain proportion of features are selected to reduce the feature dimension and improve the model performance. The importance score can be calculated by cross-validation using a preset random forest algorithm. In one embodiment, the top 50% of features are selected based on the importance score.

a4）直到损失达到预设阈值，基于鲸鱼优化算法和深度学习模型构建特征提取模型。a4) Until the loss reaches the preset threshold, a feature extraction model is built based on the whale optimization algorithm and deep learning model.

综上，构建特征提取模型，以通过该模型对融合特征进行特征提取。In summary, a feature extraction model is constructed to extract features from fusion features through the model.

b-通过下述步骤构建特征降维模型：b-Build the feature dimensionality reduction model through the following steps:

本发明提出一种基于潜在交互压缩的自编码神经网络算法，通过采用潜在交互压缩机制，允许自编码器在压缩特征的同时，学习和保留不同特征之间的关键交互信息，增加了模型对特征之间非线性关系的捕获能力，从而为分类器提供了更丰富的信息。The present invention proposes an autoencoder neural network algorithm based on potential interaction compression. By adopting the potential interaction compression mechanism, the autoencoder is allowed to learn and retain the key interaction information between different features while compressing the features, thereby increasing the model's ability to capture the nonlinear relationship between features, thereby providing richer information for the classifier.

具体地，包括以下步骤b1-b4：Specifically, the following steps b1-b4 are included:

b1）自定义自编码器的编码器和解码器。b1) Customize the encoder and decoder of the autoencoder.

定义自编码器结构，包括编码器部分和解码器部分，其中为输入特征向量，为潜在表示向量，和分别为编码器和解码器的参数。编码器用于将高维输入特征映射到一个低维潜在空间，解码器则将这些低维表示恢复到原始空间。Define the autoencoder structure, including the encoder part and the decoder part ,in is the input feature vector, is the potential representation vector, and are the parameters of the encoder and decoder respectively. The encoder is used to map high-dimensional input features to a low-dimensional latent space, and the decoder restores these low-dimensional representations to the original space.

进一步地，设置潜在交互学习损失函数，鼓励模型在潜在空间中学习特征之间的交互信息。潜在交互学习损失函数包括重构误差和正则化项，重构误差基于特征的重要性加权计算得到。其中，训练过程中优化的损失函数包括重构误差和正则化项，可以表示为：Furthermore, a potential interaction learning loss function is set to encourage the model to learn the interaction information between features in the latent space. The potential interaction learning loss function includes reconstruction error and regularization term, and the reconstruction error is calculated based on the weighted importance of the feature. Among them, the loss function optimized during the training process includes reconstruction error and regularization term, which can be expressed as:

其中，是输入和重构输出之间的重构误差，是交互权重的L2正则化项，为正则化系数。in, Yes Input and reconstruct the output The reconstruction error between is the interaction weight The L2 regularization term, is the regularization coefficient.

在一个实施例中，重构误差的计算采用加权计算的方式，考虑到不同特征的重要性可能不同，其计算方式可以表示为：In one embodiment, the reconstruction error The calculation of is weighted. Considering that the importance of different features may be different, the calculation method can be expressed as:

其中，是特征的权重，根据特征在训练数据中的表现自适应调整。in, It is a feature The weights are adaptively adjusted according to the performance of the features in the training data.

且，正则化项用于防止模型过拟合，确保交互权重的值不会过大，计算方式可以表示为：And, the regularization term Used to prevent model overfitting and ensure interaction weights The value of will not be too large, and the calculation method can be expressed as:

b2）获取预设的训练样本集，通过训练样本集对自编码器的编码器和解码器进行训练。b2) Obtain a preset training sample set, and train the encoder and decoder of the autoencoder using the training sample set.

具体的，可以使用特征提取模型的输出训练自编码器模型，其中，在编码过程，输入特征向量被编码为潜在表示，可以表示为：；即，编码器通过一系列非线性变换学习将高维输入映射到低维潜在空间。Specifically, the output of the feature extraction model can be used to train the autoencoder model, where, during the encoding process, the input feature vector Encoded as a latent representation , which can be expressed as: ; That is, the encoder It learns to map high-dimensional inputs into a low-dimensional latent space through a series of nonlinear transformations.

b3）通过预先设置的潜在交互学习损失函数对自编码器的参数进行优化，并在潜在空间引入特征交互项，以基于特征交互项进行特征重构。b3) The parameters of the autoencoder are optimized through a pre-set latent interaction learning loss function, and feature interaction terms are introduced in the latent space to reconstruct features based on the feature interaction terms.

在具体实现时，基于上述潜在交互学习损失函数进行参数优化，且，本发明实施例在潜在空间中引入交互项，用于增强模型捕捉特征间复杂关系的能力，可以表示为：In the specific implementation, the parameters are optimized based on the potential interaction learning loss function, and the embodiment of the present invention introduces the interaction term in the latent space. , which is used to enhance the model's ability to capture complex relationships between features and can be expressed as:

其中，表示特征和特征之间交互的权重。进一步地，权重用来衡量潜在表示中特征和特征之间交互强度的参数，通过自适应学习机制得到，该机制考虑了特征间的相关性和对任务的贡献度。in, Representation characteristics and Features The weight of the interaction between . Further, the weight To measure the potential Medium Features and Features The parameters of the interaction strength between them are obtained through an adaptive learning mechanism that takes into account the correlation between features and their contribution to the task.

在一个实施例中，的计算通过一个预设的神经网络实现，该网络以和为输入，输出为，可以表示为：In one embodiment, The calculation is done through a preset neural network To achieve this, the network and is the input and the output is , which can be expressed as:

其中，是参数化的网络，是网络参数，该网络通过训练学习如何有效地评估特征间交互的重要性。in, is a parameterized network, are network parameters, and the network learns through training how to effectively evaluate the importance of interactions between features.

进一步地，在解码过程中，潜在表示通过解码器重构为，同时考虑交互项，可以表示为：Furthermore, during decoding, the latent representation Through the decoder Refactored to , while considering the interaction terms , which can be expressed as:

b4）基于训练好的自编码器构建特征降维模型。b4) Build a feature dimensionality reduction model based on the trained autoencoder.

综上，构建特征降维模型，以通过训练好的编码器部分对输入特征进行降维处理，得到压缩后的特征表示，该特征表示作为关键信息由标签预测模型进行实体识别。In summary, a feature dimensionality reduction model is constructed to reduce the dimensionality of the input features through the trained encoder part to obtain the compressed feature representation. ,This feature representation is used as key information by the label prediction model for entity recognition.

进一步地，本发明实施例还计算关键信息的重标定权重和潜在维度任务相关性度量；根据重标定权重和潜在维度任务相关性度量对关键信息进行调整，得到重标定信息；将重标定信息确定为融合特征的关键信息。Furthermore, the embodiment of the present invention also calculates the recalibration weight of the key information and the potential dimension task relevance measure; adjusts the key information according to the recalibration weight and the potential dimension task relevance measure to obtain the recalibrated information; and determines the recalibrated information as the key information of the fusion feature.

在具体实现时，计算每个特征的重标定权重，通过预设的神经元计算得到，可以表示为：In the specific implementation, calculate the recalibration weight of each feature , calculated by the preset neurons, can be expressed as:

其中，和是预设的神经元的权重和偏置参数，是sigmoid激活函数，用于将权重限制在(0，1)范围内。in, and are the preset neuron weights and bias parameters, is the sigmoid activation function, which is used to limit the weights to the range (0, 1).

进一步地，使用来调整潜在表示，得到重标定后的表示，可以表示为：Further, using To adjust the latent representation , and get the recalibrated representation , which can be expressed as:

其中，表示元素的Hadamard积，是一个对角矩阵，其对角元素表示第个潜在维度的权重，通过以下方式动态计算：in, represents the Hadamard product of elements, is a diagonal matrix whose diagonal elements Indicates The weights of the potential dimensions are dynamically calculated as follows:

在这里，relevance(z_i) 表示第个潜在维度与最终任务相关性的度量，是一个可调节的尺度参数，用于控制解耦强度。在一个实施例中，函数通过辅助网络或任务性能反馈来实现。Here, relevance(z _i ) represents the A measure of the relevance of latent dimensions to the final task, is an adjustable scale parameter used to control the decoupling strength. In one embodiment, The function is implemented through auxiliary networks or task performance feedback.

基于此，得到的特征为降维后的特征，也即关键信息。Based on this, the obtained features It is the feature after dimensionality reduction, that is, the key information.

步骤S208，通过预先构建的标签预测模型对关键信息进行实体识别，输出目标文档中的关键实体信息。Step S208, performing entity recognition on key information through a pre-built tag prediction model, and outputting key entity information in the target document.

本发明提出一种基于结构先验知识的标签预测模型，结合结构先验知识的深度学习框架与条件随机场模型，形成一个结构化深度学习模型。与传统条件随机场相比，结构化深度学习模型能够更有效地利用实体间的依赖关系及其上下文信息，通过结构化的方式显著提升实体识别的准确性和鲁棒性。其中，模型的训练流程首先包括使用自编码器学习文档的深层特征表示。随后，结构化深度学习模型利用这些特征和结构先验知识来预测文档中实体的标签。The present invention proposes a label prediction model based on structural prior knowledge, combining a deep learning framework of structural prior knowledge with a conditional random field model to form a structured deep learning model. Compared with traditional conditional random fields, structured deep learning models can more effectively utilize the dependencies between entities and their contextual information, and significantly improve the accuracy and robustness of entity recognition in a structured manner. Among them, the training process of the model first includes using an autoencoder to learn the deep feature representation of the document. Subsequently, the structured deep learning model uses these features and structural prior knowledge to predict the labels of entities in the document.

在具体实现时，本发明实施例通过下述步骤构建标签预测模型：In specific implementation, the embodiment of the present invention constructs a label prediction model through the following steps:

1）获取预设的训练样本集。1) Obtain the preset training sample set.

训练样本集包括文档样本，以及文档样本对应的样本标签，样本标签用于表征文档样本对应的实体，如人名、地点、时间等。具体地，可以通过降维后的上述特征训练标签预测模型。The training sample set includes document samples and sample labels corresponding to the document samples, and the sample labels are used to characterize entities corresponding to the document samples, such as names, places, times, etc. Specifically, the label prediction model can be trained by the above-mentioned features after dimensionality reduction.

2）通过自编码器学习训练样本集的深层特征表示，并将深层特征表示融入条件随机场模型中，进行结构化实体标签预测。2) The deep feature representation of the training sample set is learned through the autoencoder, and the deep feature representation is integrated into the conditional random field model to perform structured entity label prediction.

在具体实现时，本发明实施例基于档案文档的结构先验知识，设计特定的模型结构和训练策略，以显式地模拟实体间的逻辑关系和依赖性，并进行结构先验知识的融合。In a specific implementation, the embodiment of the present invention designs a specific model structure and training strategy based on the structural prior knowledge of archival documents to explicitly simulate the logical relationship and dependency between entities and integrate the structural prior knowledge.

对于降维后的特征，将其融入条件随机场模型中进行结构化实体标签的预测，条件随机场层考虑提取特征序列并输出实体标签序列。For the features after dimensionality reduction, they are integrated into the conditional random field model to predict the structured entity labels. The conditional random field layer considers extracting feature sequences And output entity label sequence .

具体的，对于给定输入序列的标签序列的条件概率定义为：Specifically, the conditional probability of the label sequence for a given input sequence is defined as:

其中，是规范化因子，是潜力函数，捕捉相邻标签与输入特征之间的依赖性。in, is the normalization factor, is a potential function that captures the dependencies between adjacent labels and input features.

潜力函数可以进一步展开为：The potential function can be further expanded as:

其中，和是与从标签转移到的转移相关的权重和偏置，表示从输入在位置提取的特征向量。in, and is with the label from Transfer to The transfer-related weights and biases, Indicates that from the input at position Extracted feature vectors.

在一个实施例中，特征函数结合词嵌入向量和位置编码，可以表示为：In one embodiment, the characteristic function Combining word embedding vectors and position encoding, it can be expressed as:

其中，表示位置处词的嵌入向量，表示位置的编码向量，表示向量拼接操作。in, Indicates location The embedding vector of the word, Indicates location The encoding vector of Represents a vector concatenation operation.

进一步地，对于特定的状态转移权重和偏置，其更新策略为：Furthermore, for a specific state transition weight and bias , and its update strategy is:

其中，是学习率，是条件随机场层的对数似然损失，反映了真实标签序列和模型预测序列之间的差异。in, is the learning rate, is the log-likelihood loss of the conditional random field layer, reflecting the difference between the true label sequence and the model prediction sequence.

3）在结构化深度学习模型框架下，结合自编码器的重构损失和条件随机场模型中正确标签序列的对数似然，联合优化自编码器和条件随机场模型的参数。3) Under the framework of structured deep learning models, the reconstruction loss of the autoencoder and the log-likelihood of the correct label sequence in the conditional random field model are combined to jointly optimize the parameters of the autoencoder and the conditional random field model.

在结构化深度学习模型框架下，联合优化自编码器和条件随机场模型参数，确保实体标签预测的准确性，同时反映实体间的依赖关系。该自编码器为上述特征降维模型对应的自编码器。In the framework of structured deep learning model, the parameters of autoencoder and conditional random field model are jointly optimized to ensure the accuracy of entity label prediction and reflect the dependency relationship between entities. This autoencoder is the autoencoder corresponding to the above-mentioned feature dimensionality reduction model.

具体的，训练目标结合了深度学习框架的重构损失和条件随机场模型中正确标签序列的对数似然，可以表示为：Specifically, the training objective combines the reconstruction loss of the deep learning framework and the log-likelihood of the correct label sequence in the conditional random field model, which can be expressed as:

其中，是训练样本的数量，是平衡两个损失部分的正则化参数，和分别是第个示例的真实标签序列和输入序列。in, is the number of training samples, is the regularization parameter that balances the two loss parts, and They are The true label sequence and input sequence of examples.

在一个实施例中，深度学习框架为上述步骤S214对应的数据降维模型所采用的自编码器网络。对于自编码器中的编码器权重和解码器权重的计算，本发明采用基于梯度下降的优化方法，结合反向传播算法进行参数更新。In one embodiment, the deep learning framework is the autoencoder network used by the data dimensionality reduction model corresponding to step S214. and decoder weights The present invention adopts an optimization method based on gradient descent and combines the back propagation algorithm to update the parameters.

具体的，对于编码器权重的更新可以表示为：Specifically, the update of the encoder weight can be expressed as:

其中，是学习率，是损失函数，表示为输入和重构输出之间的均方误差；解码器权重的更新遵循相似的过程。in, is the learning rate, is the loss function, expressed as input and reconstruct the output The mean square error between the decoder weights Updates follow a similar process.

4）对自编码器和条件随机场模型进行性能评估，并基于满足性能评估要求的自编码器和条件随机场模型构建标签预测模型。4) Perform performance evaluation on the autoencoder and conditional random field models, and build a label prediction model based on the autoencoder and conditional random field models that meet the performance evaluation requirements.

通过交叉验证和独立测试集评估模型性能，确保模型在未见过的档案文档上具有良好的泛化能力。Model performance was evaluated through cross-validation and independent test sets to ensure that the model has good generalization ability on unseen archival documents.

综上，本发明实施例提供的一种基于结构先验知识的档案知识抽取方法，通过结合汉字的结构特征和字形特征的提取，采用深度学习模型和VGG16网络，对汉字的偏旁部首、构造特征及字形的高维特征进行精准提取，能够更全面地捕获汉字的语义信息，为实体识别提供了更为丰富和精确的特征表示。通过结合汉字的结构特征和字形特征的细致提取，本发明能够更精确地捕获到汉字的语义信息，为实体识别提供了丰富和精确的特征表示。特别是在处理具有复杂结构的中文文档时，能够显著提高实体识别的准确率。In summary, the embodiment of the present invention provides a method for extracting archival knowledge based on structural prior knowledge. By combining the extraction of structural features and glyph features of Chinese characters, a deep learning model and a VGG16 network are used to accurately extract the radicals, structural features and high-dimensional features of the glyphs of Chinese characters, which can more comprehensively capture the semantic information of Chinese characters and provide a richer and more accurate feature representation for entity recognition. By combining the careful extraction of structural features and glyph features of Chinese characters, the present invention can more accurately capture the semantic information of Chinese characters and provide a richer and more accurate feature representation for entity recognition. In particular, when processing Chinese documents with complex structures, the accuracy of entity recognition can be significantly improved.

此外，本发明通过特征融合模型的创新应用，采用基于裂化特征的Transformer算法，并利用切比雪夫多项式进行改进，优化特征融合过程，不仅提升了算法处理复杂数据结构的能力，还有效地处理了序列数据中的长距离依赖问题，增强了模型对特征间相互作用和补充性的捕获能力，从而提升了模型的泛化能力。In addition, the present invention uses a Transformer algorithm based on cracking features and uses Chebyshev polynomials to improve and optimize the feature fusion process through the innovative application of feature fusion models. This not only improves the algorithm's ability to handle complex data structures, but also effectively handles long-distance dependency problems in sequence data, and enhances the model's ability to capture interactions and complementarities between features, thereby improving the model's generalization ability.

本发明还对鲸鱼优化算法进行了改进，使其能够在多个粒度上搜索特征空间，捕获不同层次的数据特征。结合双向增强机制，既加深了特征的语义理解，也优化了特征的搜索和提取策略，使得算法能够更全面地理解数据结构，提高了特征提取的准确性和模型的自适应能力，从而在数据中提取出对实体识别更有利的特征。The present invention also improves the whale optimization algorithm, enabling it to search feature space at multiple granularities and capture data features at different levels. Combined with the bidirectional enhancement mechanism, it not only deepens the semantic understanding of features, but also optimizes the search and extraction strategy of features, enabling the algorithm to understand the data structure more comprehensively, improve the accuracy of feature extraction and the adaptability of the model, and thus extract features that are more beneficial to entity recognition in the data.

进一步地，本发明实施例通过采用潜在交互压缩机制，允许自编码器在压缩特征的同时，学习和保留不同特征之间的关键交互信息，有效地减少了高维特征空间中的信息损失，同时提升了处理效率。并且，增加了模型对特征之间非线性关系的捕获能力，为分类器提供了更丰富的信息。Furthermore, the embodiment of the present invention adopts a potential interaction compression mechanism, which allows the autoencoder to learn and retain key interaction information between different features while compressing features, effectively reducing information loss in high-dimensional feature space and improving processing efficiency. In addition, the model's ability to capture nonlinear relationships between features is increased, providing richer information for the classifier.

进一步地，本发明实施例还结合结构先验知识的深度学习框架与条件随机场模型，形成一个结构化深度学习模型，能够有效地利用实体间的依赖关系及其上下文信息，尤其是在处理复杂的实体关系和丰富的上下文信息时，通过结构化的方式显著提升实体识别的准确性和鲁棒性。Furthermore, the embodiments of the present invention also combine the deep learning framework of structural prior knowledge with the conditional random field model to form a structured deep learning model, which can effectively utilize the dependencies between entities and their contextual information, especially when dealing with complex entity relationships and rich contextual information, and significantly improve the accuracy and robustness of entity recognition in a structured manner.

进一步地，在上述方法实施例的基础上，本发明实施例还提供一种基于结构先验知识的档案知识抽取装置，图4示出了本发明实施例提供的一种基于结构先验知识的档案知识抽取装置的结构示意图，如图4所示，基于结构先验知识的档案知识抽取装置包括：数据获取模块100，用于获取目标文档；特征提取模块200，用于对目标文档进行多特征提取，得到目标文档中的多特征信息；多特征信息包括结构特征和字形特征；特征融合模块300，用于将多特征信息输入至预先构建的特征融合模型中，基于多特征提取特征的特征相关性对多特征信息进行特征融合，生成融合特征；预处理模块400，用于对融合特征进行特征提取以及数据降维，得到融合特征中的关键信息；输出模块500，用于通过预先构建的标签预测模型对关键信息进行实体识别，输出目标文档中的关键实体信息；标签预测模型基于引入深层特征表示的结构先验知识构建。Further, on the basis of the above method embodiment, the embodiment of the present invention further provides an archival knowledge extraction device based on structural prior knowledge. FIG4 shows a schematic structural diagram of an archival knowledge extraction device based on structural prior knowledge provided by the embodiment of the present invention. As shown in FIG4, the archival knowledge extraction device based on structural prior knowledge includes: a data acquisition module 100, used to acquire a target document; a feature extraction module 200, used to perform multi-feature extraction on the target document to obtain multi-feature information in the target document; the multi-feature information includes structural features and glyph features; a feature fusion module 300, used to input the multi-feature information into a pre-constructed feature fusion model, perform feature fusion on the multi-feature information based on the feature correlation of the multi-feature extraction features, and generate fusion features; a pre-processing module 400, used to perform feature extraction and data dimensionality reduction on the fused features to obtain key information in the fused features; an output module 500, used to perform entity recognition on the key information through a pre-constructed label prediction model, and output the key entity information in the target document; the label prediction model is constructed based on structural prior knowledge introduced with deep feature representation.

本发明实施例提供的一种基于结构先验知识的档案知识抽取装置，与上述实施例提供的一种基于结构先验知识的档案知识抽取方法具有相同的技术特征，所以也能解决相同的技术问题，达到相同的技术效果。An archival knowledge extraction device based on structural prior knowledge provided in an embodiment of the present invention has the same technical features as an archival knowledge extraction method based on structural prior knowledge provided in the above embodiment, so it can also solve the same technical problems and achieve the same technical effects.

进一步地，在上述实施例的基础上，本发明实施例还提供另一种基于结构先验知识的档案知识抽取装置，其中，上述特征提取模块200，还用于对目标文档进行多特征提取，得到目标文档中的多特征信息的步骤，包括：将目标文档输入至预先构建的多特征提取模型，通过多特征提取模型对目标文档进行多特征提取，得到目标文档中的多特征信息；其中，多特征提取模型包括结构特征提取模型和字形特征提取模型；结构特征提取模型用于提取目标文档中文字的结构特征，字形特征提取模型用于提取目标文档中文字的字形特征。Furthermore, based on the above-mentioned embodiment, the embodiment of the present invention also provides another archival knowledge extraction device based on structural prior knowledge, wherein the above-mentioned feature extraction module 200 is also used to perform multi-feature extraction on the target document, and the step of obtaining multi-feature information in the target document includes: inputting the target document into a pre-built multi-feature extraction model, performing multi-feature extraction on the target document through the multi-feature extraction model, and obtaining multi-feature information in the target document; wherein the multi-feature extraction model includes a structural feature extraction model and a glyph feature extraction model; the structural feature extraction model is used to extract the structural features of the characters in the target document, and the glyph feature extraction model is used to extract the glyph features of the characters in the target document.

上述特征提取模块200，还用于获取预先收集的中文档案，采集中文档案中的汉字数据；针对汉字数据的汉字特征，对汉字数据进行标注，得到数据标签；汉字特征包括汉字数据的偏旁部首和字形；利用第一卷积神经网络对汉字数据按部首结构进行拆分和结构特征提取，确定结构特征表示；以及，通过VGG16网络捕捉汉字数据的字形细节特征；将结构特征表示和字形细节特征拼接为综合特征向量，通过综合特征向量和数据标签对第二卷积神经网络模型进行训练，采用Adam优化器对第二卷积神经网络模型的模型参数进行优化；基于训练好的第二卷积神经网络模型构建多特征提取模型。The feature extraction module 200 is also used to obtain pre-collected Chinese archives and collect Chinese character data in the Chinese archives; annotate the Chinese character data according to the Chinese character features of the Chinese character data to obtain data labels; the Chinese character features include radicals and glyphs of the Chinese character data; use the first convolutional neural network to split the Chinese character data according to the radical structure and extract structural features to determine the structural feature representation; and capture the glyph detail features of the Chinese character data through the VGG16 network; splice the structural feature representation and the glyph detail features into a comprehensive feature vector, train the second convolutional neural network model through the comprehensive feature vector and the data label, and use the Adam optimizer to optimize the model parameters of the second convolutional neural network model; and build a multi-feature extraction model based on the trained second convolutional neural network model.

上述特征提取模块200，还用于根据结构特征表示和字形细节特征，确定汉字数据对应的上下文特征；通过函数评估结构特征表示和字形细节特征分别与上下文特征的相关性；基于相关性，计算结构特征表示和字形细节特征分别对应的特征权重；基于特征权重将结构特征表示和字形细节特征拼接为综合特征向量。The feature extraction module 200 is also used to determine the context features corresponding to the Chinese character data according to the structural feature representation and the glyph detail features; The function evaluates the correlation between the structural feature representation and the glyph detail feature and the context feature respectively; based on the correlation, the feature weights corresponding to the structural feature representation and the glyph detail feature are calculated; based on the feature weights, the structural feature representation and the glyph detail feature are concatenated into a comprehensive feature vector.

上述特征融合模块300，还用于对多特征信息进行特征裂化处理，得到细粒度的子特征集；采用Transformer结构捕捉子特征集对应的动态相关性，得到整合特征；基于整合特征在高维空间中的失真和信息损失，利用切比雪夫多项式对整合特征进行优化，得到优化特征；计算优化特征和多特征信息分别对应的重要性得分，基于重要性得分将优化特征和多特征信息进行融合，得到融合特征。The feature fusion module 300 is also used to perform feature cracking processing on multiple feature information to obtain a fine-grained sub-feature set; use the Transformer structure to capture the dynamic correlation corresponding to the sub-feature set to obtain an integrated feature; based on the distortion and information loss of the integrated feature in the high-dimensional space, use the Chebyshev polynomial to optimize the integrated feature to obtain an optimized feature; calculate the importance scores corresponding to the optimized feature and the multiple feature information respectively, and fuse the optimized feature and the multiple feature information based on the importance scores to obtain a fused feature.

上述预处理模块400，还用于计算关键信息的重标定权重和潜在维度任务相关性度量；根据重标定权重和潜在维度任务相关性度量对关键信息进行调整，得到重标定信息；将重标定信息确定为融合特征的关键信息。The above-mentioned preprocessing module 400 is also used to calculate the recalibration weight of the key information and the potential dimension task relevance measurement; adjust the key information according to the recalibration weight and the potential dimension task relevance measurement to obtain the recalibrated information; and determine the recalibrated information as the key information of the fusion feature.

进一步地，本发明实施例通过预先构建的特征提取模型对融合特征进行特征提取，上述预处理模块400，还用于：使用改进后的鲸鱼优化算法在多个粒度上对预设的训练样本集进行初步特征提取，通过模拟鲸鱼的社会行为和捕食策略，在特征空间中搜索最优特征子集；将最优特征子集输入到深度学习模型中进行语义加深，得到增强特征；计算增强特征对应的损失，根据损失调整鲸鱼优化算法的搜索策略；直到损失达到预设阈值，基于鲸鱼优化算法和深度学习模型构建特征提取模型。Furthermore, the embodiment of the present invention extracts features from fused features through a pre-constructed feature extraction model. The preprocessing module 400 is also used to: use the improved whale optimization algorithm to perform preliminary feature extraction on a preset training sample set at multiple granularities, and search for the optimal feature subset in the feature space by simulating the social behavior and predation strategy of whales; input the optimal feature subset into the deep learning model for semantic deepening to obtain enhanced features; calculate the loss corresponding to the enhanced features, and adjust the search strategy of the whale optimization algorithm according to the loss; until the loss reaches a preset threshold, a feature extraction model is constructed based on the whale optimization algorithm and the deep learning model.

进一步地，本发明实施例通过预先构建的特征降维模型对融合特征进行数据降维，上述预处理模块400，还用于：自定义自编码器的编码器和解码器；获取预设的训练样本集，通过训练样本集对自编码器的编码器和解码器进行训练；通过预先设置的潜在交互学习损失函数对自编码器的参数进行优化，并在潜在空间引入特征交互项，以基于特征交互项进行特征重构；其中，潜在交互学习损失函数包括重构误差和正则化项，重构误差基于特征的重要性加权计算得到；基于训练好的自编码器构建特征降维模型。Furthermore, the embodiment of the present invention performs data dimensionality reduction on the fused features through a pre-constructed feature dimensionality reduction model, and the above-mentioned preprocessing module 400 is also used to: customize the encoder and decoder of the autoencoder; obtain a preset training sample set, and train the encoder and decoder of the autoencoder through the training sample set; optimize the parameters of the autoencoder through a pre-set potential interaction learning loss function, and introduce feature interaction terms in the latent space to reconstruct features based on the feature interaction terms; wherein the potential interaction learning loss function includes a reconstruction error and a regularization term, and the reconstruction error is calculated based on the weighted importance of the feature; and construct a feature dimensionality reduction model based on the trained autoencoder.

上述输出模块500，还用于获取预设的训练样本集；训练样本集包括文档样本，以及文档样本对应的样本标签，样本标签用于表征文档样本对应的实体；通过自编码器学习训练样本集的深层特征表示，并将深层特征表示融入条件随机场模型中，进行结构化实体标签预测；在结构化深度学习模型框架下，结合自编码器的重构损失和条件随机场模型中正确标签序列的对数似然联合优化自编码器和条件随机场模型的参数；对自编码器和条件随机场模型进行性能评估；基于满足性能评估要求的自编码器和条件随机场模型构建标签预测模型。The above-mentioned output module 500 is also used to obtain a preset training sample set; the training sample set includes document samples and sample labels corresponding to the document samples, and the sample labels are used to characterize the entities corresponding to the document samples; the deep feature representation of the training sample set is learned through the autoencoder, and the deep feature representation is integrated into the conditional random field model to perform structured entity label prediction; under the framework of the structured deep learning model, the parameters of the autoencoder and the conditional random field model are jointly optimized by combining the reconstruction loss of the autoencoder and the log-likelihood of the correct label sequence in the conditional random field model; the performance of the autoencoder and the conditional random field model is evaluated; and a label prediction model is constructed based on the autoencoder and the conditional random field model that meet the performance evaluation requirements.

本发明实施例还提供一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，处理器执行计算机程序时实现上述图1至图3所示的方法的步骤。An embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method shown in FIGS. 1 to 3 when executing the computer program.

本发明实施例还提供一种计算机可读存储介质，该计算机可读存储介质上存储有计算机程序，计算机程序被处理器运行时执行上述图1至图3所示的方法的步骤。An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of the method shown in FIG. 1 to FIG. 3 are executed.

本发明实施例还提供了一种电子设备的结构示意图，如图5所示，为该电子设备的结构示意图，其中，该电子设备包括处理器51和存储器50，该存储器50存储有能够被该处理器51执行的计算机可执行指令，该处理器51执行该计算机可执行指令以实现上述图1至图3所示的方法。An embodiment of the present invention also provides a structural diagram of an electronic device, as shown in Figure 5, which is a structural diagram of the electronic device, wherein the electronic device includes a processor 51 and a memory 50, the memory 50 stores computer executable instructions that can be executed by the processor 51, and the processor 51 executes the computer executable instructions to implement the methods shown in Figures 1 to 3 above.

在图5示出的实施方式中，该电子设备还包括总线52和通信接口53，其中，处理器51、通信接口53和存储器50通过总线52连接。In the embodiment shown in FIG. 5 , the electronic device further includes a bus 52 and a communication interface 53 , wherein the processor 51 , the communication interface 53 and the memory 50 are connected via the bus 52 .

其中，存储器50可能包含高速随机存取存储器（RAM，Random Access Memory），也可能还包括非易失性存储器（non-volatile memory），例如至少一个磁盘存储器。通过至少一个通信接口53（可以是有线或者无线）实现该系统网元与至少一个其他网元之间的通信连接，可以使用互联网，广域网，本地网，城域网等。总线52可以是ISA（Industry StandardArchitecture，工业标准体系结构）总线、PCI（Peripheral Component Interconnect，外设部件互连标准）总线或EISA（Extended Industry Standard Architecture，扩展工业标准结构）总线等，还可以是AMBA（Advanced Microcontroller Bus Architecture，片上总线的标准）总线，其中，AMBA定义了三种总线，包括APB（Advanced Peripheral Bus）总线、AHB（Advanced High-performance Bus）总线和AXI（Advanced eXtensible Interface）总线。总线52可以分为地址总线、数据总线、控制总线等。为便于表示，图5中仅用一个双向箭头表示，但并不表示仅有一根总线或一种类型的总线。The memory 50 may include a high-speed random access memory (RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk storage. The communication connection between the system network element and at least one other network element is realized through at least one communication interface 53 (which may be wired or wireless), and the Internet, wide area network, local area network, metropolitan area network, etc. may be used. The bus 52 may be an ISA (Industry Standard Architecture, Industrial Standard Architecture) bus, a PCI (Peripheral Component Interconnect, Peripheral Component Interconnect Standard) bus or an EISA (Extended Industry Standard Architecture, Extended Industry Standard Architecture) bus, etc., and may also be an AMBA (Advanced Microcontroller Bus Architecture, on-chip bus standard) bus, wherein AMBA defines three types of buses, including APB (Advanced Peripheral Bus) bus, AHB (Advanced High-performance Bus) bus and AXI (Advanced eXtensible Interface) bus. The bus 52 may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, FIG5 shows only one bidirectional arrow, but this does not mean that there is only one bus or one type of bus.

处理器51可能是一种集成电路芯片，具有信号的处理能力。在实现过程中，上述方法的各步骤可以通过处理器51中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器51可以是通用处理器，包括中央处理器(Central Processing Unit，简称CPU)、网络处理器(Network Processor，简称NP)等；还可以是数字信号处理器(Digital SignalProcessor，简称DSP)、专用集成电路(Application Specific Integrated Circuit，简称ASIC)、现场可编程门阵列(Field-Programmable Gate Array，简称FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器，闪存、只读存储器，可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器，处理器51读取存储器中的信息，结合其硬件完成前述图1至图3任一所示的方法。The processor 51 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the hardware integrated logic circuit or software instructions in the processor 51. The above processor 51 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The general-purpose processor can be a microprocessor or the processor can also be any conventional processor. The steps of the method disclosed in the embodiment of the present application can be directly embodied as a hardware decoding processor to execute, or the hardware and software modules in the decoding processor can be executed. The software module can be located in a mature storage medium in the field such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, etc. The storage medium is located in the memory, and the processor 51 reads the information in the memory and completes the method shown in any one of the above-mentioned Figures 1 to 3 in combination with its hardware.

本发明实施例所提供的一种基于结构先验知识的档案知识抽取方法及装置的计算机程序产品，包括存储了程序代码的计算机可读存储介质，所述程序代码包括的指令可用于执行前面方法实施例中所述的方法，具体实现可参见方法实施例，在此不再赘述。The computer program product of a method and apparatus for extracting archival knowledge based on structural prior knowledge provided in an embodiment of the present invention includes a computer-readable storage medium storing program code, and the instructions included in the program code can be used to execute the method described in the previous method embodiment. The specific implementation can be found in the method embodiment, which will not be repeated here.

所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system described above can refer to the corresponding process in the aforementioned method embodiment, and will not be repeated here.

最后应说明的是：以上实施例，仅为本发明的具体实施方式，用以说明本发明的技术方案，而非对其限制，本发明的保护范围并不局限于此，尽管参照前述实施例对本发明进行了详细的说明，本领域技术人员应当理解：任何熟悉本技术领域的技术人员在本发明揭露的技术范围内，其依然可以对前述实施例所记载的技术方案进行修改或可轻易想到变化，或者对其中部分技术特征进行等同替换；而这些修改、变化或者替换，并不使相应技术方案的本质脱离本发明实施例技术方案的精神和范围，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应以权利要求的保护范围为准。Finally, it should be noted that the above embodiments are only specific implementations of the present invention, which are used to illustrate the technical solutions of the present invention, rather than to limit them. The protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the above embodiments, those skilled in the art should understand that any person skilled in the art can still modify the technical solutions recorded in the above embodiments within the technical scope disclosed by the present invention, or can easily think of changes, or make equivalent replacements for some of the technical features therein; and these modifications, changes or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should be included in the protection scope of the present invention. Therefore, the protection scope of the present invention shall be based on the protection scope of the claims.

Claims

1. The archive knowledge extraction method based on the structure priori knowledge is characterized by comprising the following steps of:

Acquiring a target document;

Extracting multiple features of the target document to obtain multiple feature information in the target document; the multi-feature information comprises structural features and font features;

inputting the multi-feature information into a pre-constructed feature fusion model, and carrying out feature fusion on the multi-feature information based on the feature correlation of the multi-feature information to generate fusion features;

performing feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features;

Performing entity identification on the key information through a pre-constructed label prediction model, and outputting key entity information in the target document; the label prediction model is constructed based on structure priori knowledge of the deep feature representation;

The step of extracting the multiple features of the target document to obtain the multiple feature information in the target document comprises the following steps:

Inputting the target document into a pre-constructed multi-feature extraction model, and performing multi-feature extraction on the target document through the multi-feature extraction model to obtain multi-feature information in the target document;

the multi-feature extraction model comprises a structural feature extraction model and a font feature extraction model; the structural feature extraction model is used for extracting structural features of characters in the target document, and the font feature extraction model is used for extracting font features of the characters in the target document;

The method for constructing the multi-feature extraction model comprises the following steps:

Acquiring a Chinese file collected in advance, and collecting Chinese character data in the Chinese file;

labeling the Chinese character data according to the Chinese character characteristics of the Chinese character data to obtain a data label; the Chinese character features comprise radicals and fonts of the Chinese character data;

Splitting the Chinese character data according to a radical structure and extracting structural features by using a first convolutional neural network to determine structural feature representation; capturing the character form detail characteristics of the Chinese character data through a VGG16 network;

Splicing the structural feature representation and the font detail feature into a comprehensive feature vector;

Training a second convolutional neural network model through the comprehensive feature vector and the data tag, and optimizing model parameters of the second convolutional neural network model by adopting an Adam optimizer;

constructing a multi-feature extraction model based on the trained second convolutional neural network model;

the step of stitching the structural feature representation and the font detail feature into a composite feature vector comprises:

Determining the context characteristics corresponding to the Chinese character data according to the structural characteristic representation and the font detail characteristics;

By passing through Functionally evaluating the relevance of the structural feature representation and the font detail feature, respectively, to the contextual feature;

calculating feature weights respectively corresponding to the structural feature representation and the font detail features based on the correlation;

and splicing the structural feature representation and the font detail feature into a comprehensive feature vector based on the feature weight.

2. The method according to claim 1, wherein the step of inputting the multi-feature information into a pre-built feature fusion model, feature-fusing the multi-feature information based on feature correlation of the multi-feature information, and generating a fused feature, comprises:

Performing characteristic cracking treatment on the multi-characteristic information to obtain a sub-characteristic set with fine granularity;

capturing the dynamic correlation corresponding to the sub-feature set by adopting a transducer structure to obtain an integrated feature;

Optimizing the integrated feature by using a chebyshev polynomial based on the distortion and information loss of the integrated feature in a high-dimensional space to obtain an optimized feature;

and calculating importance scores corresponding to the optimized features and the multi-feature information respectively, and fusing the optimized features and the multi-feature information based on the importance scores to obtain fusion features.

3. The method according to claim 1, wherein the method further comprises:

calculating recalibration weight and potential dimension task correlation measurement of the key information;

Adjusting the key information according to the recalibration weight and the potential dimension task correlation measurement to obtain recalibration information;

and determining the recalibration information as key information of the fusion characteristic.

4. The method according to claim 1, wherein the feature extraction is performed on the fused features by a pre-built feature extraction model, and the method for constructing the feature extraction model comprises:

preliminary feature extraction is carried out on a preset training sample set on a plurality of granularities by using an improved whale optimization algorithm, and an optimal feature subset is searched in a feature space by simulating social behaviors and predation strategies of whales;

inputting the optimal feature subset into a deep learning model for semantic deepening to obtain enhanced features;

Calculating the loss corresponding to the enhancement features, and adjusting the searching strategy of the whale optimization algorithm according to the loss;

And constructing a feature extraction model based on the whale optimization algorithm and the deep learning model until the loss reaches a preset threshold.

5. The method according to claim 1, wherein the data dimension reduction is performed on the fused features by a pre-constructed feature dimension reduction model, and the method for constructing the feature dimension reduction model comprises the following steps:

An encoder and a decoder of the custom encoder;

Acquiring a preset training sample set, and training an encoder and the decoder of the self-encoder through the training sample set;

Optimizing parameters of the self-encoder through a preset potential interaction learning loss function, and introducing a feature interaction item into a potential space so as to reconstruct features based on the feature interaction item; the potential interactive learning loss function comprises a reconstruction error and a regularization term, wherein the reconstruction error is obtained by weighting calculation based on the importance of the features;

And constructing a feature dimension reduction model based on the trained self-encoder.

6. The method of claim 1, wherein the method of constructing the tag prediction model comprises:

Acquiring a preset training sample set; the training sample set comprises a document sample and a sample label corresponding to the document sample, wherein the sample label is used for representing an entity corresponding to the document sample;

Learning a deep feature representation of the training sample set through a self-encoder, and merging the deep feature representation into a conditional random field model to perform structured entity tag prediction;

under the framework of a structured deep learning model, combining the reconstruction loss of the self-encoder and the log likelihood of a correct tag sequence in a conditional random field model to jointly optimize parameters of the self-encoder and the conditional random field model;

Performing a performance evaluation on the self-encoder and the conditional random field model;

a label prediction model is constructed based on the self-encoder and conditional random field model that meet performance evaluation requirements.

7. An archival knowledge extraction device based on structure priori knowledge, the device comprising:

The data acquisition module is used for acquiring a target document;

The feature extraction module is used for extracting multiple features of the target document to obtain multiple feature information in the target document; the multi-feature information comprises structural features and font features;

The feature fusion module is used for inputting the multi-feature information into a pre-constructed feature fusion model, and carrying out feature fusion on the multi-feature information based on the feature correlation of the multi-feature extraction features to generate fusion features;

The preprocessing module is used for carrying out feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features;

The output module is used for carrying out entity identification on the key information through a pre-constructed label prediction model and outputting the key entity information in the target document; the label prediction model is constructed based on structure priori knowledge of the deep feature representation;

the feature extraction module is further used for inputting the target document into a pre-built multi-feature extraction model, and performing multi-feature extraction on the target document through the multi-feature extraction model to obtain multi-feature information in the target document; the multi-feature extraction model comprises a structural feature extraction model and a font feature extraction model; the structural feature extraction model is used for extracting structural features of characters in the target document, and the font feature extraction model is used for extracting font features of the characters in the target document;

The feature extraction module is also used for acquiring a Chinese file collected in advance and collecting Chinese character data in the Chinese file; labeling the Chinese character data according to the Chinese character characteristics of the Chinese character data to obtain a data label; the Chinese character features comprise radicals and fonts of the Chinese character data; splitting the Chinese character data according to a radical structure and extracting structural features by using a first convolutional neural network to determine structural feature representation; capturing the character form detail characteristics of the Chinese character data through a VGG16 network; splicing the structural feature representation and the font detail feature into a comprehensive feature vector; training a second convolutional neural network model through the comprehensive feature vector and the data tag, and optimizing model parameters of the second convolutional neural network model by adopting an Adam optimizer; constructing a multi-feature extraction model based on the trained second convolutional neural network model;

The characteristic extraction module is further used for determining context characteristics corresponding to the Chinese character data according to the structural characteristic representation and the font detail characteristics; by passing through Functionally evaluating the relevance of the structural feature representation and the font detail feature, respectively, to the contextual feature; calculating feature weights respectively corresponding to the structural feature representation and the font detail features based on the correlation; and splicing the structural feature representation and the font detail feature into a comprehensive feature vector based on the feature weight.