CN114580389B

CN114580389B - Chinese medical field causal relation extraction method integrating radical information

Info

Publication number: CN114580389B
Application number: CN202210220870.4A
Authority: CN
Inventors: 李晓庆; 朱广丽; 张顺香; 吴厚月; 许鑫; 苏明星; 李健; 黄菊; 魏苏波; 孙争艳; 张镇江; 赵彤
Original assignee: Anhui University of Science and Technology
Current assignee: Anhui University of Science and Technology
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2024-08-20
Anticipated expiration: 2042-03-08
Also published as: CN114580389A

Abstract

The invention discloses a method for extracting causal relations from Chinese medical-domain text that integrates radical information, and belongs to the technical field of data mining. The method comprises: acquiring a Chinese medical-domain text data set through a web crawler; preprocessing the acquired data; converting English terms in the text into Chinese using a translation technique; obtaining the radicals of all characters from an online Xinhua dictionary; incrementally training on the radicals with the Word2Vec architecture to obtain radical feature representations; and feeding the radical feature vectors into a causal relation extraction model, which extracts causal relations from the data set to obtain causal entities. The invention addresses the problem of effectively extracting causal relations from Chinese medical-domain text data and yields the causal entities contained in such data.

Description

A causal relationship extraction method in Chinese medical field integrating radical information

Technical Field

The present invention relates to causal relationship extraction in the medical field, and in particular to a causal relationship extraction method for the Chinese medical field that integrates radical information.

Background Art

Informatization of the medical field is progressing steadily, and modern medical information systems have accumulated massive amounts of medical data. As this data keeps growing, using natural language processing and deep learning to mine the rich information contained in medical text has become an active topic of interdisciplinary research between medicine and artificial intelligence. Medical text contains a large number of records of medical activities, including diseases, drugs, examinations and treatment results. This information is important clinical data, and analyzing and mining it accurately and efficiently can provide theoretical and technical support for building medical knowledge bases and clinical diagnosis and treatment systems. However, medical text differs from conventional text in several respects: it contains many English entity names, and its semantics are highly correlated with character radicals. These characteristics pose new challenges for causal relationship extraction and call for an extraction method that can integrate radical information and enrich the semantic information of the text.

Research on radical information has so far concentrated on named entity recognition. A single Chinese character can itself form a word, and the radical of a character often carries important information. Existing work mainly uses conditional random field models, bidirectional long short-term memory network models and similar models to obtain radical features, and integrates these radical features into the character features to enrich the semantic information of the text, yielding character feature vector representations that incorporate radical information.

The resulting character feature representation, fused with radical information, must then be used as the input of a causal relationship extraction model to obtain causal entities. Causal relationship extraction is commonly approached with machine learning-based methods and deep learning-based methods. Machine learning methods first model the task as a multi-class classification problem, extract feature vectors, and then apply a supervised classifier to perform event extraction. With the rapid growth of neural network research, applying neural network models to causal relationship extraction can improve its accuracy and efficiency. However, existing methods rarely consider the radical features of characters, so the semantic information they capture is incomplete, which brings risks when causal relationship extraction models are applied in the medical field. The present invention extracts causal relationships from medical text data by fusing radical information, improving the accuracy of causal relationship extraction.

Summary of the Invention

To solve the above problems, the object of the present invention is to provide a causal relationship extraction method for the Chinese medical field that integrates radical information.

To achieve this object, the causal relationship extraction method for the Chinese medical field integrating radical information provided by the present invention is carried out in the following steps:

Step 1: Data acquisition. Obtain a Chinese medical-domain text data set D = {D_1, D_2, ..., D_n}, where D_i denotes the i-th text, 1 ≤ i ≤ n, and n is the total number of texts in the set D.

Step 2: Preprocess the acquired text data. The basic steps are as follows:

Step 2.1: Remove stop words, web page tags and the like from the text, and perform word segmentation.

Step 2.2: Extract the text into structured data and load it into a database.
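Steps 2.1 and 2.2 can be sketched as follows. This is a minimal illustration and not the patented implementation: the stop-word list, the `docs` table layout, and the character-level `segment` stand-in are all assumptions made here, since the text does not name a segmentation tool or a database schema.

```python
import re
import sqlite3
from html import unescape

# Hypothetical stop-word list; a real pipeline would load a full Chinese list.
STOP_WORDS = {"的", "了", "和", "是"}

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(raw):
    """Remove web page tags, then decode any HTML entities."""
    return unescape(TAG_RE.sub("", raw)).strip()


def segment(text):
    """Stand-in segmenter (assumption): one token per non-space character.
    A real system would call a Chinese word segmenter here."""
    return [ch for ch in text if not ch.isspace()]


def remove_stop_words(tokens):
    """Drop stop words from an already segmented token list."""
    return [t for t in tokens if t not in STOP_WORDS]


def store(conn, doc_id, tokens):
    """Load one preprocessed text into the (hypothetical) docs table."""
    conn.execute(
        "INSERT INTO docs (id, tokens) VALUES (?, ?)",
        (doc_id, " ".join(tokens)),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, tokens TEXT)")

raw = "<p>吸烟是肺癌的原因</p>"
tokens = remove_stop_words(segment(strip_tags(raw)))
store(conn, 1, tokens)
print(tokens)
```

Tags are stripped before entities are decoded so that an escaped `&lt;` in the text is not mistaken for the start of a tag.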

Step 3: Convert the English technical terms in the text data into Chinese. The basic steps are as follows:

Step 3.1: Use ASCII code values to locate the English technical terms in the data set.

Step 3.2: Use a translation interface to convert the English technical terms into Chinese, obtaining a data set that contains only Chinese characters.
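One way to read Step 3 is sketched below. Runs of ASCII letters are located by their code values (65 to 90 and 97 to 122) and then replaced. The `GLOSSARY` dictionary is a hypothetical stand-in for the translation interface, which the text does not specify, and the sketch treats each unbroken run of letters as one term, so multi-word terms would need extra handling.

```python
# Hypothetical glossary standing in for the unspecified translation interface.
GLOSSARY = {"CT": "计算机断层扫描", "aspirin": "阿司匹林"}


def is_ascii_letter(ch):
    """True if the character's code value falls in A-Z or a-z."""
    code = ord(ch)
    return 65 <= code <= 90 or 97 <= code <= 122


def find_english_terms(text):
    """Return (start, end, term) for each run of consecutive ASCII letters."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if is_ascii_letter(ch):
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i, text[start:i]))
            start = None
    if start is not None:
        spans.append((start, len(text), text[start:]))
    return spans


def translate_terms(text, translate=GLOSSARY.get):
    """Replace each located term with its translation; unknown terms stay as-is."""
    out, last = [], 0
    for start, end, term in find_english_terms(text):
        out.append(text[last:start])
        out.append(translate(term) or term)
        last = end
    out.append(text[last:])
    return "".join(out)


sentence = "患者服用aspirin后进行CT检查"
print(find_english_terms(sentence))
print(translate_terms(sentence))
```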

Step 4: Radical feature acquisition. The basic steps are as follows:

Step 4.1: Obtain the radicals of all characters in the data set by querying the online Xinhua Dictionary. For a Chinese character that has no radical, treat the character itself as the word.

Step 4.2: Treat each radical as a word and use it as the input of the Word2Vec architecture. Perform incremental training on the radicals to obtain the radical feature vector representation.
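The radical lookup of Step 4.1, and the shaping of its output into word-like sequences for Step 4.2, might look like the sketch below. The small `RADICALS` table is an assumption that stands in for queries to the online Xinhua Dictionary, and the Word2Vec training itself is not shown.

```python
# Hypothetical lookup table standing in for the online Xinhua Dictionary.
RADICALS = {"肺": "月", "癌": "疒", "咳": "口", "嗽": "口", "痛": "疒"}


def radical_of(ch, table=RADICALS):
    """Return the radical of a character, or the character itself if none is known."""
    return table.get(ch, ch)


def radical_sentence(sentence):
    """Map a sentence to its radical sequence, one radical 'word' per character."""
    return [radical_of(ch) for ch in sentence]


sentences = ["肺癌咳嗽", "头痛"]
# A list of token lists, where each token is a radical treated as a word.
radical_corpus = [radical_sentence(s) for s in sentences]
print(radical_corpus)
```

The fallback in `radical_of` mirrors the rule in Step 4.1 that a character without a radical is itself treated as the word.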

Step 5: Causal relationship extraction for the Chinese medical field integrating radical information. The basic steps are as follows:

Step 5.1: Input layer. For the raw Chinese medical-domain text data, sentences are input into the BERT model to obtain character-level features, while the radicals are input into Word2Vec for incremental training to obtain the radical feature representation.

Step 5.2: The character features and radical features are received, and two embedding matrices are output by looking up the embedding dictionaries. The vector dimensions of characters and radicals are set to the same size, so that each Chinese character is represented by two vector sequences: a character sequence and a radical sequence.

Step 5.3: The representation layer combines the character information with the radical information to produce a comprehensive representation of the input text. A bidirectional long short-term memory network (Bi-LSTM) captures both preceding and following context, and thus bidirectional semantic dependencies. The radical feature is concatenated as a row vector after the character feature, which encodes the radical information into the character feature vector: the text is passed through the BERT model and the Word2Vec architecture separately to obtain the character features and the radical features, and these two independent feature vectors are then concatenated to give a text feature vector representation that fuses the radical information.

Step 5.4: The final hidden states of the Bi-LSTM in the representation layer are taken as output and concatenated to form a combined representation. This representation is then input into the conditional random field (CRF) model, which uses the Softmax function as the activation function to map each word to a conditional probability. Finally, the output text is tagged with the BIO sequence labeling scheme to obtain the final extraction result.
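The feature fusion in Steps 5.2 and 5.3 amounts to concatenating, for each character, its character vector and its radical vector of the same dimension. The sketch below shows only that concatenation. The embedding values are made-up placeholders for the BERT and Word2Vec outputs, the names `CHAR_EMB`, `RADICAL_EMB` and `DIM` are hypothetical, and the Bi-LSTM and CRF layers are not modeled.

```python
DIM = 3  # character and radical vectors share the same dimension (Step 5.2)

# Placeholder embeddings (assumptions): stand-ins for BERT and Word2Vec outputs.
CHAR_EMB = {"肺": [0.1, 0.2, 0.3], "癌": [0.4, 0.5, 0.6]}
RADICAL_EMB = {"月": [1.0, 0.0, 0.0], "疒": [0.0, 1.0, 0.0]}
RADICALS = {"肺": "月", "癌": "疒"}


def fuse(sentence):
    """Concatenate each character vector with its radical vector.

    Returns one row of length 2 * DIM per character; unknown characters or
    radicals fall back to a zero vector in this sketch.
    """
    zero = [0.0] * DIM
    rows = []
    for ch in sentence:
        char_vec = CHAR_EMB.get(ch, zero)
        radical_vec = RADICAL_EMB.get(RADICALS.get(ch, ch), zero)
        rows.append(char_vec + radical_vec)  # radical appended after character
    return rows


fused = fuse("肺癌")
print(fused)
```

In a full model, rows of this shape would be the per-character inputs to the Bi-LSTM described in Step 5.3.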

Step 6: Sequence labeling. Extracting causal relationships by sequence labeling requires assigning a label to every word in the sentence: B-cause marks the beginning of a cause event, B-effect marks the beginning of an effect event, I-cause marks a middle or final word of a cause event, I-effect marks a middle or final word of an effect event, and O marks a word that belongs to neither a cause event nor an effect event. Probabilities are computed for the sentences in the prediction layer to obtain the causal label of each character, from which the causal entities are obtained.
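Given per-character labels in the scheme of Step 6, the causal entities can be recovered by grouping consecutive B- and I- tags. The following is a minimal sketch: the function name `decode_bio` and the strict handling of stray I- tags (they are dropped) are choices made here, not details stated in the text.

```python
def decode_bio(chars, labels):
    """Group characters into (role, text) spans from BIO causal labels."""
    spans = []
    role, buf = None, []

    def flush():
        # Close the span currently being built, if any.
        nonlocal role, buf
        if role is not None:
            spans.append((role, "".join(buf)))
        role, buf = None, []

    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            flush()
            role, buf = label[2:], [ch]
        elif label.startswith("I-") and role == label[2:]:
            buf.append(ch)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
    flush()
    return spans


chars = list("吸烟导致肺癌")
labels = ["B-cause", "I-cause", "O", "O", "B-effect", "I-effect"]
print(decode_bio(chars, labels))
```

Here the sentence "吸烟导致肺癌" ("smoking causes lung cancer") yields one cause span and one effect span.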

The advantages and positive effects of the present invention are as follows: the causal relationship extraction method for the Chinese medical field integrating radical information can fuse character radical information to enrich the semantic information of the text and use the result as the input of the causal relationship extraction model, thereby improving the accuracy of causal relationship extraction. In the medical field, it can be used for tasks such as building medical knowledge bases and online consultation platforms.

Brief Description of the Drawings

To illustrate the technical solution of the present invention more clearly, the drawings used in the present invention are briefly introduced below.

FIG. 1 is a flow chart of the causal relationship extraction method for the Chinese medical field integrating radical information provided by the present invention;

FIG. 2 is a structural block diagram of the acquisition of Chinese character radical features provided by the present invention;

FIG. 3 is a structural block diagram of the causal relationship extraction provided by the present invention.

Detailed Description

The present invention is further described below.

The object of the present invention is to provide a causal relationship extraction method for the Chinese medical field that integrates radical information. Building on existing causal relationship extraction, the method achieves better extraction results by fusing Chinese character radical information to enrich the semantic information of the text.

With reference to FIGS. 1, 2 and 3, the causal relationship extraction method for the Chinese medical field integrating radical information of the present invention is performed in the following steps:

In addition, the above embodiments merely illustrate specific implementations of the present invention and do not limit it. Those skilled in the art should understand that some of the technical features may be replaced by equivalents, and that such modifications and replacements also fall within the scope of protection of the present invention.

Claims

1. A causal relationship extraction method for the Chinese medical field integrating radical information, characterized by comprising the following steps:

Step 1: data acquisition, namely obtaining a Chinese medical-domain text data set D = {D_1, D_2, ..., D_n}, wherein D_i denotes the i-th text, 1 ≤ i ≤ n, and n is the total number of texts in the set D;

Step 2: preprocessing the acquired text data, comprising the following basic steps:

Step 2.1: removing stop words and web page tags from the text, and performing word segmentation;

Step 2.2: extracting the text into structured data, and loading the structured data into a database;

Step 3: converting the English technical terms in the text data into Chinese, comprising the following basic steps:

Step 3.1: locating the English technical terms in the data set by using ASCII code values;

Step 3.2: converting the English technical terms into Chinese by using a translation interface to obtain a data set containing only Chinese characters;

Step 4: radical feature acquisition, comprising the following basic steps:

Step 4.1: obtaining the radicals of all characters in the data set by querying an online Xinhua dictionary, and, for a Chinese character without a radical, treating the character itself as the word;

Step 4.2: treating the radicals as words and using them as the input of a Word2Vec architecture, and performing incremental training on the radicals to obtain a radical feature vector representation;

Step 5: causal relationship extraction for the Chinese medical field integrating radical information, comprising the following basic steps:

Step 5.1: in the input layer, for the raw Chinese medical-domain text data, inputting sentences into the BERT model to obtain character-level features, and inputting the radicals into Word2Vec for incremental training to obtain a radical feature representation;

Step 5.2: receiving the character features and the radical features, outputting two embedding matrices by looking up the embedding dictionaries, and setting the vector dimensions of the characters and the radicals to the same size, so that one Chinese character is represented by two vector sequences, namely a character sequence and a radical sequence;

Step 5.3: in the representation layer, combining the character information and the radical information to generate a comprehensive representation of the input text, wherein a bidirectional long short-term memory network captures the preceding and following context information and the bidirectional semantic dependencies, the radical features are concatenated after the character features so that the radical information is encoded into the character feature vectors, the text is passed through the BERT model and the Word2Vec architecture respectively to obtain the character features and the radical features, and the two independent feature vectors are then concatenated to obtain a text feature vector representation fused with the radical information;

Step 5.4: taking the final hidden states of the Bi-LSTM in the representation layer as output, and concatenating them to form a combined representation; then inputting the combined representation into a conditional random field model, and mapping each word to a conditional probability by using a Softmax function as the activation function; and finally, tagging the output text by using the BIO sequence labeling method to obtain the final extraction result;

Step 6: sequence labeling, namely performing causal relationship extraction by sequence labeling, wherein each word in a sentence is assigned a corresponding label, B-cause denotes the beginning of a cause event, B-effect denotes the beginning of an effect event, I-cause denotes a middle or final word of a cause event, I-effect denotes a middle or final word of an effect event, and the O label denotes that the word belongs to neither a cause event nor an effect event; and performing probability calculation on the sentences of the prediction layer to obtain the causal label corresponding to each character, thereby obtaining the causal entities.