
CN117708339A - ICD automatic coding method based on pre-training language model - Google Patents


Info

Publication number
CN117708339A
CN117708339A
Authority
CN
China
Prior art keywords
model
icd
training
prediction
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410165651.XA
Other languages
Chinese (zh)
Other versions
CN117708339B (en)
Inventor
陈先来
黄伟斌
黄金彩
陈翔
安莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202410165651.XA
Publication of CN117708339A
Application granted
Publication of CN117708339B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

An embodiment of the invention provides an ICD automatic coding method based on a pre-trained language model, belonging to the technical field of data processing and specifically comprising the following steps: constructing an ICD automatic coding data set; forming a mapping set; constructing a prefix tree and combining it with an adjusted pre-trained model to form an LEDT model; dividing the ICD automatic coding data set into a training set and a validation set; splitting the clinical texts in the training set and the validation set and their corresponding ICD codes; training the LEDT model with the seq2seq training data set; feeding the input text of a data set to be coded into the target model, using the prefix tree to constrain the generated tokens during the decoding and generation process of the target model, keeping the k predicted descriptions with the highest output scores with a beam search algorithm, and finally converting the k output predicted descriptions into the corresponding ICD codes through the mapping set as the prediction output. The scheme of the invention improves coding efficiency, accuracy, and adaptability.

Description

An ICD automatic coding method based on a pre-trained language model

Technical Field

Embodiments of the present invention relate to the field of data processing technology, and in particular to an ICD automatic coding method based on a pre-trained language model.

Background

At present, the coding task of the International Classification of Diseases (ICD) consists of assigning the corresponding ICD codes to the medical primitives in a medical text. It is of great significance for automating medical research and practice, improving coding quality, reducing the influence of human error and subjective factors, assisting diagnosis and treatment, supporting Diagnosis Related Groups (DRGs), and enabling intelligent medical insurance cost-control systems.

Early coding work was done mainly by hand. According to statistics, a coding expert needs more than 30 minutes on average to code a single electronic medical record, which cannot keep pace with the rapid growth of medical data. In addition, manual coding requires experts with sufficient background knowledge to read the medical record carefully while consulting related reference material; this process is costly, inefficient, and error-prone.

It can be seen that an efficient, accurate, and adaptable ICD automatic coding method based on a pre-trained language model is urgently needed.

Summary of the Invention

In view of this, embodiments of the present invention provide an ICD automatic coding method based on a pre-trained language model, which at least partially solves the problems of poor coding efficiency, accuracy, and adaptability in the prior art.

An embodiment of the present invention provides an ICD automatic coding method based on a pre-trained language model, comprising:

Step 1: construct an ICD automatic coding data set from electronic medical records, where the ICD automatic coding data set contains clinical texts and their corresponding ICD codes;

Step 2: obtain the code descriptions corresponding to the ICD codes from an ICD code description library and form a mapping set;

Step 3: tokenize the code descriptions to obtain ids sequences and construct a prefix tree from them; adjust the input range and attention window of the encoder on the basis of a pre-trained model, and combine it with the prefix tree to form the LEDT model;

Step 4: divide the ICD automatic coding data set into a training set and a validation set;

Step 5: split the clinical texts and their corresponding ICD codes in the training set and the validation set to obtain text sequences and their corresponding ICD code sequences, map the ICD code sequences to their code descriptions through the mapping set, and thereby form a seq2seq training data set and a seq2seq validation data set;

Step 6: train the LEDT model on the seq2seq training data set with the teacher forcing method and update the model parameters; after each training epoch, feed the seq2seq validation data set into the LEDT model and record the model parameters with the smallest loss;

Step 7: select the model parameters with the smallest loss on the seq2seq validation data set to obtain the target model; feed the input text of the data set to be coded into the target model and, during the decoding and generation process of the target model, use the prefix tree to constrain the generated tokens so that every string generated by the LEDT model is one of the code descriptions; at the same time use a beam search algorithm to keep the k predicted descriptions with the highest output scores, and finally convert the k predicted descriptions into the corresponding ICD codes through the mapping set as the prediction output.

According to a specific implementation of the embodiment of the present invention, step 3 specifically includes:

Step 3.1: tokenize a code description and convert it into an ids sequence of the pre-trained language model; prepend the id of the start symbol used during generation by the pre-trained language model and append the id of the end symbol used during generation, thereby constructing the target ids sequence to be generated by the model;

Step 3.2: perform the above operation on the ids sequences of all code descriptions and construct a prefix tree;

Step 3.3: expand the range of input data that the encoder of the pre-trained model can process, set its attention window, and combine it with the prefix tree to form the LEDT model.

According to a specific implementation of the embodiment of the present invention, step 6 specifically includes:

Step 6.1: at each time step, use a sample of the seq2seq training data set as the current input of the LEDT model; use the tokenizer to tokenize the clinical text and the code description corresponding to its ICD code, and convert them into ids sequences that serve as the input of the encoder and the decoder of the LEDT model;

Step 6.2: the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector for each time step of the clinical text;

Step 6.3: input the context encoding vector for each time step and the ids of the code description into the decoder to obtain a predicted description, compute the loss between the predicted description and the code description, and update the model parameters of the LEDT model by backpropagation;

Step 6.4: after each training epoch, feed the seq2seq validation data set into the LEDT model and record the model parameters with the smallest loss.

According to a specific implementation of the embodiment of the present invention, step 6.3 specifically includes:

Step 6.3.1: initialize the decoder hidden state of the previous time step with the value of the context encoding vector corresponding to the current time step;

Step 6.3.2: use the decoder hidden state of the previous time step and the tokenized symbols of the code description as the decoder input, and update the decoder hidden state;

Step 6.3.3: feed the updated decoder hidden state into a linear layer and compute the output of the current time step as the predicted description;

Step 6.3.4: compute the loss between the predicted description of the current time step and the code description of the next time step, update the model parameters of the LEDT model by backpropagation, and return to step 6.3.1 for the next time step; when predictions for all time steps are finished, the current training epoch ends;

Step 6.3.5: after the current training epoch is finished, feed the seq2seq validation data set into the LEDT model and record the model parameters with the smallest loss.

According to a specific implementation of the embodiment of the present invention, step 7 specifically includes:

Step 7.1: feed the input text of the data set to be coded into the target model; the tokenizer of the target model tokenizes the input text and converts it into an ids sequence, which is encoded by the encoder of the target model to obtain the context vector of the input text;

Step 7.2: set the first token generated by the decoder to the start symbol;

Step 7.3: use the decoder hidden state and the predicted description of the previous time step as the decoder input, and update the decoder hidden state;

Step 7.4: feed the updated decoder hidden state into a linear layer and compute the output of the current time step as the predicted description;

Step 7.5: query the prefix tree and set to zero the score probabilities of predicted tokens that do not belong to the prefix tree;

Step 7.6: use the beam search algorithm to keep the k predicted descriptions with the highest scores (a minimal sketch of steps 7.5 and 7.6 follows this list);

Step 7.7: repeat steps 7.3 to 7.6 until predictions for all time steps are completed, obtain k predicted descriptions, and convert them into the corresponding ICD codes through the mapping set as the prediction output.
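As an illustration of steps 7.5 and 7.6, the following minimal sketch shows how a prefix-tree lookup can zero out the scores of tokens that cannot continue any valid code description before the beam search step; the Trie class, the helper names, and the toy token ids are assumptions of this sketch rather than part of the claimed method.

```python
# Sketch of steps 7.5-7.6: keep only tokens allowed by the prefix tree (Trie),
# then beam search keeps the k best continuations. Toy ids are illustrative only.
import torch

class Trie:
    def __init__(self):
        self.children = {}                     # token id -> child node

    def insert(self, ids):
        node = self
        for tok in ids:
            node = node.children.setdefault(tok, Trie())

    def allowed_next(self, prefix):
        """Token ids that may follow `prefix`; empty if the prefix is invalid."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        return list(node.children.keys())

def mask_scores(scores, trie, prefix):
    """Step 7.5: zero the probability of every token the Trie does not allow."""
    mask = torch.zeros_like(scores)
    mask[trie.allowed_next(prefix)] = 1.0
    return scores * mask

# Toy usage: two target ids sequences sharing the start/end symbol id 2.
trie = Trie()
trie.insert([2, 11, 12, 2])
trie.insert([2, 11, 30, 2])
probs = torch.full((50,), 0.02)                # pretend decoder distribution over 50 tokens
print(mask_scores(probs, trie, [2, 11]).nonzero().flatten())   # -> tensor([12, 30])
```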

The beneficial effects of the embodiments of the present invention are as follows:

1. Using the pre-trained language model LEDT for the ICD automatic coding task makes it easier to handle morphological variation of words in electronic medical records.

2. A generative formulation converts the ICD coding task from a multi-label classification task into a generation task. On top of the efficient denoising of the generative model, the ICD code descriptions are used effectively to strengthen the interaction between the input text and the code descriptions, and the prefix tree constrains the generation process of the LEDT model, avoiding the out-of-vocabulary (OOV) problem during generation.

3. Longformer is used as the architecture of the pre-trained language model, replacing the global attention of the basic Transformer with window-based local attention, so that the model strikes a balance between computational complexity, memory overhead, and model performance in the ICD coding task, improving coding efficiency, accuracy, and adaptability.

Brief Description of the Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic flow chart of an ICD automatic coding method based on a pre-trained language model provided by an embodiment of the present invention;

Figure 2 is a schematic flow chart of a specific implementation of an ICD automatic coding method based on a pre-trained language model provided by an embodiment of the present invention;

Figure 3 is a schematic diagram of local attention and global attention in the LEDT model provided by an embodiment of the present invention;

Figure 4 is a schematic diagram of the generation process of the LEDT model combined with a prefix tree provided by an embodiment of the present invention;

Figure 5 is a diagram of the length statistics of the data set used in the embodiment of the present invention.

Detailed Description of the Embodiments

The embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The following describes the implementation of the present invention through specific examples; those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, provided there is no conflict, the following embodiments and the features in the embodiments can be combined with each other. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

It should be noted that the following describes various aspects of embodiments within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is illustrative only. Based on this disclosure, those skilled in the art should appreciate that an aspect described herein may be implemented independently of any other aspect and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, the apparatus may be implemented and/or the method practiced using other structures and/or functionality in addition to one or more of the aspects set forth herein.

It should also be noted that the figures provided in the following embodiments only schematically illustrate the basic concept of the present invention. The figures show only the components related to the present invention rather than the number, shape, and size of the components in an actual implementation; in an actual implementation the form, number, and proportion of each component may be changed arbitrarily, and the component layout may also be more complex.

In addition, in the following description specific details are provided to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects may be practiced without these specific details.

At present, the field of ICD automatic coding presents the following challenges and opportunities:

(1) Multi-label classification: mainstream work treats the ICD automatic coding task as a multi-label classification task, but the huge label space of ICD codes makes it difficult for a model to accurately capture the relationship between ICD codes and fragments of the input text.

(2) Characteristics of medical natural language processing: the ICD automatic coding task belongs to the field of medical natural language processing and involves problems such as inconsistent word forms and differences in writing style. Medical texts may contain large numbers of abbreviations, colloquial words, synonyms, polysemous words, multiple expressions with the same meaning, and even spelling errors. Pre-training on a large-scale general corpus can improve a model's ability to understand the textual context.

(3) Challenges of long-text processing: the clinical texts input to the ICD automatic coding task are often longer than basic pre-trained language models can handle, and the fully connected attention mechanism and deep network structure of these models cause their computational complexity and memory overhead to grow quadratically with the input length. As a result, current ICD automatic coding work still tends to process the input text with convolutional neural networks (CNNs), recurrent neural networks (RNNs), or their variants, and makes less use of pre-trained language models. However, CNN-based models have difficulty accurately capturing the complex relationship between the input text and the labels, while RNN-based models may suffer from forgetting when processing long texts, which leaves room for improvement in the ICD automatic coding task.

(4) Challenges of noise: in the ICD automatic coding task the input text often contains a large amount of noise, which makes coding harder. Transformer encoder-decoder generative pre-trained language models (such as BART) inject noise into the original text during pre-training and restore the original text with the decoder, thereby learning knowledge from the corpus. The BART model shows superior performance on tasks such as question answering, summarization, and machine translation, and its autoregressive way of generating text can better capture the interaction between the input text and the target sentence.

An embodiment of the present invention provides an ICD automatic coding method based on a pre-trained language model, which can be applied to the coding of electronic medical records in medical scenarios.

Figure 1 is a schematic flow chart of an ICD automatic coding method based on a pre-trained language model provided by an embodiment of the present invention. As shown in Figures 1 and 2, the method mainly includes the following steps:

Step 1: construct an ICD automatic coding data set from electronic medical records, where the ICD automatic coding data set contains clinical texts and their corresponding ICD codes.

In specific implementation, electronic medical record information can be collected. For a patient's visit record i, the relevant diagnosis information includes the admission diagnosis AD_i, the discharge diagnosis DD_i, the procedure name PN_i, the preoperative diagnosis PreD_i, the postoperative diagnosis PostD_i, the medical orders MO_i, and so on. The relevant diagnosis information of the i-th visit is merged into one visit document:

DOC_i = [AD_i; DD_i; PN_i; PreD_i; PostD_i; MO_i]

At the same time, the ICD codes involved in DOC_i are collected as

Y_i = {y_1, y_2, ..., y_|C|}, where y_j = 1 when DOC_i involves the code c_j and y_j = 0 otherwise, and C denotes the ICD code space of this task (unless otherwise stated, |A| denotes the number of elements of a set A).

The ICD automatic coding task of this embodiment can therefore be described as: given an input text DOC_i, predict all the ICD codes Y_i that it involves.
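As a concrete illustration of this formulation, the sketch below merges the diagnosis fields of one visit into a document and builds the multi-label target vector over the code space; the field names, the toy code space, and the gold codes are placeholders invented for this sketch, not data from the embodiment.

```python
# Illustrative sketch of the task formulation: build DOC_i and its label vector Y_i.
# All field values and codes below are placeholders for demonstration only.
visit = {
    "admission_diagnosis": "...",
    "discharge_diagnosis": "...",
    "procedure_name": "...",
    "preoperative_diagnosis": "...",
    "postoperative_diagnosis": "...",
    "medical_order": "...",
}
doc_i = " ".join(visit.values())                      # DOC_i: merged visit document

code_space = ["code_A", "code_B", "code_C"]           # C: ICD code space (toy)
gold_codes = {"code_A"}                               # codes actually involved in DOC_i
y_i = [1 if c in gold_codes else 0 for c in code_space]   # Y_i = {y_1, ..., y_|C|}
```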

Step 2: obtain the code descriptions corresponding to the ICD codes from an ICD code description library and form a mapping set.

In specific implementation, the text descriptions of the ICD codes involved in the data set can be obtained from the ICD code description library, and a one-to-one (ICD code, ICD code description) mapping set is constructed; since ICD codes and code descriptions correspond one to one, the mapping set is invertible.
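A minimal sketch of such an invertible mapping set, assuming the descriptions are available as a plain code-to-description table; the code keys and the second description are placeholders, and only the first description reuses the example mentioned later in this embodiment.

```python
# Sketch of the (ICD code, ICD code description) mapping set and its inverse.
# The code keys are placeholders; the pairings are illustrative, not authoritative.
code2desc = {
    "code_A": "胃淋巴结继发恶性肿瘤",
    "code_B": "慢性胃炎",
}
desc2code = {desc: code for code, desc in code2desc.items()}   # invertible mapping
assert len(desc2code) == len(code2desc)                        # sanity check: one-to-one
```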

Step 3: tokenize the code descriptions to obtain ids sequences and construct a prefix tree from them; adjust the input range and attention window of the encoder on the basis of a pre-trained model, and combine it with the prefix tree to form the LEDT model.

On the basis of the above embodiment, step 3 specifically includes:

Step 3.1: tokenize a code description and convert it into an ids sequence of the pre-trained language model; prepend the id of the start symbol used during generation by the pre-trained language model and append the id of the end symbol used during generation, thereby constructing the target ids sequence to be generated by the model;

Step 3.2: perform the above operation on the ids sequences of all code descriptions and construct a prefix tree;

Step 3.3: expand the range of input data that the encoder of the pre-trained model can process, set its attention window, and combine it with the prefix tree to form the LEDT model.

In specific implementation, every code description obtained in step 2 can be tokenized and converted into an ids sequence of the pre-trained language model; the id of the start symbol used during generation is prepended and the id of the end symbol used during generation is appended, forming the target ids sequence to be generated by the model. This operation is performed on the ids sequences of all code descriptions to construct the prefix tree Trie.

On this basis, since no public checkpoint of an LEDT model for the Chinese medical domain has been found, a Chinese BART pre-trained model can be modified: the range of input data that the encoder can process is extended and its attention window is set so as to reduce the computational complexity, forming an LED, which is finally combined with the prefix tree described above to form the LEDT model. As shown in Figure 3, with Longformer as the architecture of the pre-trained model, the [CLS] token is given global attention so that it can attend to all other tokens and all other tokens can attend to it, while the remaining tokens can only attend to nearby tokens. Because Longformer blocks are stacked, when enough blocks are stacked the receptive field of a token that receives only local attention in the top block is still large enough. This attention scheme allows the pre-trained language model to strike a balance between computational complexity, memory overhead, and model performance in the ICD coding task.
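To make the attention pattern of Figure 3 concrete, the sketch below builds a boolean attention mask in which every token attends only to a local window around itself while one designated position (such as the [CLS] token) attends to, and is attended by, all positions. This is a simplified illustration of the sliding-window idea, assuming a single sequence and ignoring padding, dilation, and the blockwise implementation of the real LED.

```python
# Sketch of window-based local attention plus one global token, as in Figure 3.
import torch

def longformer_style_mask(seq_len, window, global_positions):
    """mask[i, j] is True when position i may attend to position j."""
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() <= window   # local sliding window
    for g in global_positions:
        mask[g, :] = True                                   # the global token sees all
        mask[:, g] = True                                   # and is seen by all
    return mask

print(longformer_style_mask(seq_len=8, window=1, global_positions=[0]).int())
```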

Take the ICD code description "胃淋巴结继发恶性肿瘤" (secondary malignant tumor of gastric lymph nodes) as an example: it is tokenized and converted into the corresponding ids sequence. In this pre-trained language model the start symbol and the end symbol used during generation are the same token, whose id is 2, so the target ids sequence for "胃淋巴结继发恶性肿瘤" is its tokenized ids sequence with this id prepended and appended.
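The sketch below corresponds to steps 3.1 and 3.2: each code description is tokenized into ids, wrapped with the generation start and end ids, and inserted into a prefix tree. The checkpoint name, the choice of start/end ids, and the dict-based trie are assumptions of this sketch rather than the exact setup of the embodiment.

```python
# Sketch of steps 3.1-3.2: target ids sequences for all code descriptions -> prefix tree.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fnlp/bart-base-chinese")  # assumed checkpoint
start_id = tokenizer.cls_token_id   # assumed start symbol used during generation
end_id = tokenizer.sep_token_id     # assumed end symbol used during generation

def target_ids(description):
    ids = tokenizer.encode(description, add_special_tokens=False)
    return [start_id] + ids + [end_id]

def trie_insert(node, ids):
    for tok in ids:
        node = node.setdefault(tok, {})   # nested dict: token id -> sub-trie

descriptions = ["胃淋巴结继发恶性肿瘤", "慢性胃炎"]   # placeholders for the full library
trie = {}
for desc in descriptions:
    trie_insert(trie, target_ids(desc))
```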

Step 4: divide the ICD automatic coding data set into a training set and a validation set.

In specific implementation, the data set collected in step 1 can be randomly divided into a training set (Train set) and a validation set (Dev set) at a ratio of 9:1, and the corresponding index sets Train set index and Dev set index are created for the subsequent operations.

Step 5: split the clinical texts and their corresponding ICD codes in the training set and the validation set to obtain text sequences and their corresponding ICD code sequences, map the ICD code sequences to their code descriptions through the mapping set, and thereby form a seq2seq training data set and a seq2seq validation data set.

In specific implementation, the data in the training set and the validation set are converted into the seq2seq training data set and the seq2seq validation data set. Taking the training set as an example, each record (DOC_i, Y_i) is split by ICD code, forming a training data set in which every gold code of a document becomes a separate item; each code c_j with y_j = 1 is then converted into its code description through the (ICD code, ICD code description) mapping, so that the final seq2seq training data set seq2seqTrain set consists of pairs of the form (DOC_i, description of c_j).
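A sketch of steps 4 and 5 under the assumption that the collected data set is a list of (document, gold ICD codes) pairs: it performs the 9:1 random split and then expands every record into one (document, code description) pair per gold code. The function and variable names are illustrative.

```python
import random

def split_and_expand(dataset, code2desc, dev_ratio=0.1, seed=42):
    """dataset: list of (doc_text, [icd_code, ...]); code2desc: the mapping set."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    n_dev = int(len(indices) * dev_ratio)
    dev_index, train_index = indices[:n_dev], indices[n_dev:]

    def to_seq2seq(index_set):
        pairs = []
        for i in index_set:
            doc, codes = dataset[i]
            for code in codes:                        # one pair per gold code
                pairs.append((doc, code2desc[code]))
        return pairs

    return to_seq2seq(train_index), to_seq2seq(dev_index), train_index, dev_index
```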

Step 6: train the LEDT model on the seq2seq training data set with the teacher forcing method and update the model parameters; after each training epoch, feed the seq2seq validation data set into the LEDT model and record the model parameters with the smallest loss.

On the basis of the above embodiment, step 6 specifically includes:

Step 6.1: at each time step, use a sample of the seq2seq training data set as the current input of the LEDT model; use the tokenizer to tokenize the clinical text and the code description corresponding to its ICD code, and convert them into ids sequences that serve as the input of the encoder and the decoder of the LEDT model;

Step 6.2: the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector for each time step of the clinical text;

Step 6.3: input the context encoding vector for each time step and the ids of the code description into the decoder to obtain a predicted description, compute the loss between the predicted description and the code description, and update the model parameters of the LEDT model by backpropagation;

Step 6.4: after each training epoch, feed the seq2seq validation data set into the LEDT model and record the model parameters with the smallest loss.

Further, step 6.3 specifically includes:

Step 6.3.1: initialize the decoder hidden state of the previous time step with the value of the context encoding vector corresponding to the current time step;

Step 6.3.2: use the decoder hidden state of the previous time step and the tokenized symbols of the code description as the decoder input, and update the decoder hidden state;

Step 6.3.3: feed the updated decoder hidden state into a linear layer and compute the output of the current time step as the predicted description;

Step 6.3.4: compute the loss between the predicted description of the current time step and the code description of the next time step, update the model parameters of the LEDT model by backpropagation, and return to step 6.3.1 for the next time step; when predictions for all time steps are finished, the current training epoch ends;

Step 6.3.5: after the current training epoch is finished, feed the seq2seq validation data set into the LEDT model and record the model parameters with the smallest loss.

In specific implementation, during the model training stage the LEDT model is trained in a seq2seq manner. In this embodiment teacher forcing is used: while training the seq2seq model, at each time step the model does not take its own output from the previous step as the current input, but directly takes the ground-truth label of the training data as the current input, so that the model learns to map an input clinical document DOC to the ICD code description desc of that document. For each training pair (DOC, desc), the tokenizer tokenizes the document DOC and the ICD description desc separately and converts them into ids, which serve as the input of the LED encoder and the LED decoder. The LED encoder encodes the tokenized ids of the document to obtain the context encoding vector h of DOC:

h = Encoder(tokenize(DOC)),

where tokenize(·) denotes tokenizing a text and converting it into ids. During decoding, the generated tokens are aligned with the tokenized ids of desc, and teacher forcing is used for training. The decoder hidden state is first initialized with the value of the context:

s_0 = h.

Suppose the tokenized ids of desc are d_1, d_2, ..., d_T. For each time step t = 1, ..., T, the following operations are performed:

Step 6.1: update the LED decoder hidden state, taking the decoder hidden state of the previous step and the tokenized symbol of the ground-truth description as the decoder input: s_t = Decoder(s_{t-1}, d_{t-1}).

Step 6.2: feed the decoder hidden state into a linear layer and compute the output of the current time step: o_t = Linear(s_t).

Step 6.3: use the cross-entropy function to compute the loss between the predicted output o_t and the gold token d_t.

When the seq2seq training data set is fed to the model, step 6.3 updates the model parameters by backpropagating the loss between o_t and d_t, optimizing the training of the model. After each training epoch, the seq2seq validation data set is fed to the model; after the loss between o_t and d_t is computed, the loss of the model on the entire seq2seq validation data set is computed, and the model parameters with the smallest loss are saved.
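The sketch below expresses this teacher-forcing training loop with the Hugging Face seq2seq interface, where passing labels makes the model feed the shifted gold ids to the decoder and return the cross-entropy loss over all time steps. The checkpoint path, the batch size of one, the optimizer settings, and the variable names seq2seq_train and seq2seq_dev are assumptions of this sketch, not the configuration of the embodiment.

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("path/to/chinese-led")   # placeholder path
model = LEDForConditionalGeneration.from_pretrained("path/to/chinese-led")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def run_epoch(pairs, train=True):
    """pairs: list of (clinical document, code description); returns the mean loss."""
    model.train(train)
    total = 0.0
    for doc, desc in pairs:                                         # batch size 1 for brevity
        enc = tokenizer(doc, truncation=True, max_length=2048, return_tensors="pt")
        labels = tokenizer(desc, return_tensors="pt").input_ids
        out = model(**enc, labels=labels)     # teacher forcing: gold ids drive the decoder
        if train:
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        total += out.loss.item()
    return total / max(len(pairs), 1)

best = float("inf")
for epoch in range(10):
    run_epoch(seq2seq_train, train=True)                 # assumed variables from step 5
    with torch.no_grad():
        dev_loss = run_epoch(seq2seq_dev, train=False)
    if dev_loss < best:                                  # keep the lowest-loss parameters
        best = dev_loss
        model.save_pretrained("best_ledt")
        tokenizer.save_pretrained("best_ledt")
```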

Step 7: select the model parameters with the smallest loss on the seq2seq validation data set to obtain the target model; feed the input text of the data set to be coded (with index set Test set index) into the target model and, during the decoding and generation process of the target model, use the prefix tree to constrain the generated tokens so that every string generated by the LEDT model is one of the code descriptions; at the same time use a beam search algorithm to keep the k predicted descriptions with the highest output scores, and finally convert the k predicted descriptions into the corresponding ICD codes through the mapping set as the prediction output.

On the basis of the above embodiment, step 7 specifically includes:

Step 7.1: feed the input text of the data set to be coded into the target model; the tokenizer of the target model tokenizes the input text and converts it into an ids sequence, which is encoded by the encoder of the target model to obtain the context vector of the input text;

Step 7.2: set the first token generated by the decoder to the start symbol;

Step 7.3: use the decoder hidden state and the predicted description of the previous time step as the decoder input, and update the decoder hidden state;

Step 7.4: feed the updated decoder hidden state into a linear layer and compute the output of the current time step as the predicted description;

Step 7.5: query the prefix tree and set to zero the score probabilities of predicted tokens that do not belong to the prefix tree;

Step 7.6: use the beam search algorithm to keep the k predicted descriptions with the highest scores;

Step 7.7: repeat steps 7.3 to 7.6 until predictions for all time steps are completed, obtain k predicted descriptions, and convert them into the corresponding ICD codes through the mapping set as the prediction output.

In specific implementation, as shown in Figure 4, the model loads the parameters saved in step 6 that gave the smallest loss on the validation set. In the test and generation stage, the LEDT model is combined with the prefix tree Trie to generate ICD disease code descriptions, and beam search is used to realize the multi-label classification task. For each document DOC in the data set to be coded, the tokenizer tokenizes DOC and converts it into ids, which serve as the input of the encoder of the LEDT model. The encoder of the LEDT model encodes the tokenized ids of the document to obtain the context encoding vector of DOC.

The decoder hidden state is initialized with the value of the context.

Assume the first token generated by the LEDT model is the start symbol.

For each time step, the following operations are performed:

Step 7.1: update the decoder hidden state, taking the decoder hidden state and the predicted token of the previous step as the decoder input.

Step 7.2: feed the decoder hidden state into a linear layer network Linear and compute the output of the current time step.

Step 7.3: query the prefix tree Trie and set the score probabilities of the tokens that do not belong to the prefix tree to zero.

Step 7.4: use the beam search algorithm to keep the top-k most likely output sequences.

After steps 7.1 to 7.4, the top-k ICD code descriptions for DOC are obtained. Using the (ICD code, ICD code description) mapping, the generated top-k code descriptions are converted into the top-k code predictions for DOC.
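The constrained generation above can be sketched with the Hugging Face generate interface, whose prefix_allowed_tokens_fn hook plays the role of the trie lookup in step 7.3 and whose beam search keeps the top-k sequences of step 7.4. The checkpoint path, the trie built in the step-3 sketch, the desc2code mapping from the step-2 sketch, and the whitespace handling of the decoded strings are all assumptions of this sketch.

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("best_ledt")     # parameters saved in step 6
model = LEDForConditionalGeneration.from_pretrained("best_ledt")
model.eval()

def trie_allowed_next(node, prefix):
    """Children of the trie node reached by `prefix` (empty list if invalid)."""
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return list(node.keys())

def allowed_tokens(batch_id, generated_ids):
    # Step 7.3: only tokens that extend some valid code description survive.
    return trie_allowed_next(trie, generated_ids.tolist())   # trie from the step-3 sketch

def predict_codes(doc, k=5):
    enc = tokenizer(doc, truncation=True, max_length=2048, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **enc,
            num_beams=k,
            num_return_sequences=k,                 # keep the top-k descriptions (step 7.4)
            prefix_allowed_tokens_fn=allowed_tokens,
            max_new_tokens=64,
        )
    descs = [d.replace(" ", "") for d in tokenizer.batch_decode(outputs, skip_special_tokens=True)]
    return [desc2code[d] for d in descs if d in desc2code]   # descriptions -> ICD codes
```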

In the ICD automatic coding method based on a pre-trained language model provided by this embodiment, using the pre-trained language model LEDT for the ICD automatic coding task makes it easier to handle morphological variation of words in electronic medical records; a generative formulation converts the ICD coding task from a multi-label classification task into a generation task, and on top of the efficient denoising of the generative model, the ICD code descriptions are used effectively to strengthen the interaction between the input text and the code descriptions, with the prefix tree constraining the generation process of the LED to avoid the out-of-vocabulary problem; and Longformer is used as the architecture of the pre-trained language model, replacing the global attention of the basic Transformer with window-based local attention, so that the model strikes a balance between computational complexity, memory overhead, and model performance in the ICD coding task, improving coding efficiency, accuracy, and adaptability.

The present invention is further described below with reference to a specific embodiment.

1. Data description: the data set used in this experiment comes from the clinical diagnosis coding task (task 5) of the CHIP2022 evaluation; its length statistics are shown in Figure 5. Given the relevant diagnosis information of one visit (including the admission diagnosis, preoperative diagnosis, postoperative diagnosis, and discharge diagnosis) together with the procedure names, drug names, and medical order names, the task requires giving the corresponding ICD codes in the vocabulary of the "Classification and Codes of Diseases, National Clinical Version 2.0". All visit data come from real medical data and were annotated against this vocabulary.

1.1. The number of documents in the training set, test set, and validation set, together with the maximum number of labels per document (i.e., the number of codes with y_j = 1 in Y_i), the minimum number of labels, and the average number of labels, are shown in Table 1:

Table 1

1.2. The ICD code descriptions in this embodiment come from the vocabulary of the "Classification and Codes of Diseases, National Clinical Version 2.0".

2. Source of the pre-trained language model parameters and construction of the LED: since no public pre-trained LED parameters for Chinese were found, this embodiment converts the Chinese BART model, which is also a seq2seq model, with a script, extending the maximum text length it can handle and adding a local attention mechanism to it, thereby forming the LED. The pre-trained language model in this embodiment comes from a preset model database.

3. Evaluation metrics: the ICD automatic coding task is a multi-label classification task, and the metrics commonly used for it are adopted: Micro_F1, Macro_F1, Micro_AUC, Macro_AUC, and P@K; in this experiment K = 1, 2, 5. In particular, because a generative model is used as the basis of the framework, which may cause OOV problems, len(OOV) is added as a metric in the ablation experiments; len(OOV) denotes the total number of out-of-vocabulary words generated by the model during generation.

3.1. Micro_F1 and Macro_F1: the ICD coding task is a multi-label classification task, and Micro_F1 and Macro_F1 evaluate a model on such tasks. Micro_F1 aggregates the precision and recall over all labels into a single overall metric, while Macro_F1 averages the per-label F1 scores. Together they measure the overall classification performance across all ICD labels and the performance on individual labels.

3.2. Micro_AUC and Macro_AUC: AUC (Area Under the ROC Curve) is a commonly used metric for measuring classification quality. In the ICD coding task, Micro_AUC merges all predicted labels into one set and computes a single AUC value, while Macro_AUC computes the AUC from the true labels and predicted probabilities of each label and averages the AUC over all labels. These two metrics evaluate the overall prediction quality of the model and the differences across labels.

3.3. P@K: for the ICD coding task, P@K evaluates the average precision of the model's top-k predictions. It measures how well the model ranks its predictions and helps select the highest-ranked results, which is useful for ICD coding because it presents the most relevant ICD codes for professionals to review.
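A sketch of how these metrics can be computed from label matrices, assuming y_true holds binary gold labels and y_score holds predicted scores, both of shape (number of documents, |C|); the threshold and helper name are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(y_true, y_score, ks=(1, 2, 5), threshold=0.5):
    """y_true: binary gold matrix; y_score: predicted scores; both (n_docs, |C|)."""
    y_pred = (y_score >= threshold).astype(int)
    results = {
        "Micro_F1": f1_score(y_true, y_pred, average="micro", zero_division=0),
        "Macro_F1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "Micro_AUC": roc_auc_score(y_true, y_score, average="micro"),
        "Macro_AUC": roc_auc_score(y_true, y_score, average="macro"),
    }
    for k in ks:                                   # P@K: hits among the top-k predictions
        topk = np.argsort(-y_score, axis=1)[:, :k]
        hits = np.take_along_axis(y_true, topk, axis=1).sum(axis=1)
        results[f"P@{k}"] = float(np.mean(hits / k))
    return results
```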

4. Comparison with other methods:

Several state-of-the-art models for the ICD automatic coding task were selected as baselines:

4.1. CAML: uses a CNN to extract information from the input text and combines it with a label attention mechanism for the ICD automatic coding task.

4.2. LAAT: to handle the fact that different ICD codes correspond to different fragments of the input text, LAAT uses a bidirectional LSTM to extract features of the input text and a new label attention mechanism to associate these fragments with the corresponding ICD codes.

4.3. MVC-LDA: uses multi-view convolution to extract text features from multiple perspectives and introduces description-information constraints to improve the prediction accuracy of the model.

4.4. KAICD: uses a multi-scale CNN to extract input text features and, on this basis, a bidirectional GRU to process the ICD code descriptions and build a code description knowledge base, introducing the knowledge of this knowledge base into the prediction process through an attention mechanism; it achieved excellent results on both the English MIMIC-III data set and the Chinese Xiangya data set for the automatic ICD coding task.

4.5. LD-PLAM: proposes a new label attention mechanism, the pseudo-label attention mechanism, for the ICD automatic coding task; it applies the same attention pattern to similar ICD codes to reduce computational overhead and improve prediction performance, and achieved excellent results on both the English MIMIC-III data set and the Chinese Xiangya data set.

Table 2

As shown in Table 2, the overall performance of the proposed LEDT_ICD model is better than that of the current baseline models, which verifies the effectiveness of the generative pre-trained language model used in the present invention for ICD automatic coding.

5. Ablation experiments

Table 3

MSL_1024 (Max Source Length 1024) means that the input text is truncated and only the first 1024 characters are fed into the model. As shown in Table 3, when the prefix tree (Trie) is likewise used, MSL_2048 outperforms MSL_1024 on every metric, because truncating the input text of the model loses information; this also verifies the effectiveness of the local attention mechanism of Longformer for modeling long texts. Compared with the LED_ICD (MSL_2048) model in row 2, because the average number of labels per record in this data set is small and the model takes the top-k generated labels (with small k) as the prediction when computing the F1 scores, the Macro_F1 and Micro_F1 results of the two differ little, with LEDT_ICD (MSL_2048), which uses the prefix tree, performing slightly better. In addition, compared with a generative model that does not use the prefix tree, the model combined with the prefix tree has a prominent advantage in the multi-label classification task: it produces no OOV problems.

应当理解,本发明的各部分可以用硬件、软件、固件或它们的组合来实现。It should be understood that various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed herein shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the appended claims.

Claims (5)

1. An ICD automatic coding method based on a pre-training language model, comprising:
step 1, an ICD automatic coding data set is constructed according to an electronic medical record, wherein the ICD automatic coding data set comprises clinical texts and corresponding ICD codes;
step 2, obtaining the code description corresponding to the ICD code from the ICD code description library, and forming a mapping set;
step 3, word segmentation is carried out on the code description to obtain an ids sequence, a prefix tree is constructed according to the ids sequence, the input range and the attention field of view of the encoder are adjusted on the basis of a pre-training model, and an LEDT model is formed in combination with the prefix tree;
step 4, dividing the ICD automatic coding data set into a training set and a verification set;
step 5, respectively segmenting clinical texts in the training set and the verification set and corresponding ICD codes thereof to obtain text sequences and corresponding ICD code sequences thereof, and obtaining corresponding code descriptions through the mapping set by the ICD code sequences to form a seq2seq training data set and a seq2seq verification data set according to the text sequences and the corresponding ICD code sequences;
step 6, training the LEDT model on the seq2seq training data set in a teacher-forcing manner, updating model parameters, inputting the seq2seq verification data set into the LEDT model after each training round is finished, and recording the model parameters when the loss is minimum;
and 7, selecting the model parameters with minimum loss on the seq2seq verification data set to obtain a target model, inputting an input text in the data set to be encoded into the target model, and, in the decoding generation process of the target model, using the prefix tree to constrain the generated characters so that the character strings generated by the LEDT model form a subset of the code descriptions, simultaneously using a beam search algorithm to retain the k prediction descriptions with the highest output scores, and finally using the mapping set to convert the output k prediction descriptions into corresponding ICD codes as the prediction output.
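
For illustration only, the following sketch shows how the decoding procedure of step 7 could look with the HuggingFace transformers LED model and its prefix_allowed_tokens_fn hook; the checkpoint name, the two code descriptions, the mapping set, the input text, and k are assumptions, and the fine-tuned LEDT weights of the invention are not reproduced here.

```python
from transformers import LEDTokenizer, LEDForConditionalGeneration

ckpt = "allenai/led-base-16384"            # assumed base checkpoint, not the fine-tuned LEDT
tok = LEDTokenizer.from_pretrained(ckpt)
model = LEDForConditionalGeneration.from_pretrained(ckpt)

# Hypothetical mapping set (step 2): code description -> ICD code.
desc2code = {
    "essential (primary) hypertension": "I10",
    "type 2 diabetes mellitus without complications": "E11.9",
}

# Step 3: prefix tree over the tokenized code descriptions, as nested dicts.
trie = {}
for desc in desc2code:
    ids = tok(desc, add_special_tokens=False).input_ids + [tok.eos_token_id]
    node = trie
    for t in ids:
        node = node.setdefault(t, {})

def allowed_tokens(batch_id, prefix_ids):
    """Step 7.5: only tokens that keep the output inside some code description."""
    node = trie
    for t in prefix_ids.tolist():
        if t in (model.config.decoder_start_token_id, tok.bos_token_id):
            continue                                   # skip the generation start symbols
        if t not in node:
            return [tok.eos_token_id]
        node = node[t]
    return list(node.keys()) or [tok.eos_token_id]

text = "Discharge summary: long-standing high blood pressure, on metformin ..."
enc = tok(text, return_tensors="pt", truncation=True, max_length=2048)
k = 2
out = model.generate(**enc, num_beams=k, num_return_sequences=k, max_length=32,
                     prefix_allowed_tokens_fn=allowed_tokens)   # steps 7.2-7.6
descs = tok.batch_decode(out, skip_special_tokens=True)
pred_codes = [desc2code.get(d.strip()) for d in descs]           # step 7.7: map back to ICD codes
print(pred_codes)
```

Constraining decoding through prefix_allowed_tokens_fn is only one possible realization of the prefix-tree restriction; the claim itself does not prescribe a particular library.
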
2. The method according to claim 1, wherein the step 3 specifically comprises:
step 3.1, word segmentation is carried out on the code description, the code description is converted into an ids sequence in a pre-training language model, ids of a start symbol used in the generation process of the pre-training language model are added in front of the ids sequence, ids of an end symbol used in the generation process of the model are added at the tail of the ids sequence, and an object code ids sequence generated by the model is constructed;
step 3.2, performing the operation on the ids sequences described by all codes, and constructing a prefix tree;
and 3.3, expanding the range of input data which can be processed by the encoder in the pre-training model, setting the field of view of the attention of the encoder, and forming the LEDT model by combining the prefix tree form.
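
The following is a small sketch of steps 3.1 and 3.3, assuming the pre-training model follows the HuggingFace LED configuration; the description string and the parameter values (a 2048-token input range and a 512-token attention window) are illustrative, and which special symbols serve as start and end markers depends on the concrete pre-trained model. The prefix tree of step 3.2 is built from such ids sequences as in the sketch after claim 1.

```python
from transformers import LEDConfig, LEDTokenizer

tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")

# Step 3.1: ids sequence of one code description, framed by start and end symbols.
desc = "essential (primary) hypertension"     # hypothetical code description
target_ids = ([tok.bos_token_id]
              + tok(desc, add_special_tokens=False).input_ids
              + [tok.eos_token_id])

# Step 3.3: adjust the input range the encoder can process and the
# field of view of its local attention, here via the LED configuration.
cfg = LEDConfig.from_pretrained("allenai/led-base-16384")
cfg.max_encoder_position_embeddings = 2048            # encoder input range (MSL_2048)
cfg.attention_window = [512] * cfg.encoder_layers     # per-layer attention field of view
```
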
3. The method according to claim 2, wherein the step 6 specifically comprises:
step 6.1, using ICD codes of the seq2seq training data set as input of the LEDT model at the current moment in each time step, using a word segmentation device to segment clinical texts and the code descriptions corresponding to the ICD codes, converting them into corresponding ids sequences, and using the ids sequences as input of the encoder and decoder in the LEDT model;
step 6.2, the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector corresponding to each time step of the clinical text;
step 6.3, inputting the corresponding context coding vector and ids of the code description of each time step into a decoder to obtain a prediction description, calculating a loss value between the prediction description and the code description, and updating model parameters of the LEDT model through back propagation;
at step 6.4, after each training round has ended, the seq2seq verification dataset is entered into the LEDT model, recording the model parameters at which the loss is minimal.
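
An illustrative training loop for step 6 is sketched below, assuming the HuggingFace LED model with teacher forcing supplied through the labels argument; the toy training pairs, batch size, learning rate, and the serialization of the target code descriptions into a single string are assumptions rather than the concrete settings of the invention.

```python
import torch
from torch.utils.data import DataLoader
from transformers import LEDTokenizer, LEDForConditionalGeneration

tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy (clinical text, target code description) pairs standing in for the seq2seq sets.
train_pairs = [("patient admitted with poorly controlled blood pressure ...",
                "essential (primary) hypertension")]
valid_pairs = [("known type 2 diabetic on metformin ...",
                "type 2 diabetes mellitus without complications")]

def collate(batch):
    texts, descs = zip(*batch)
    enc = tok(list(texts), truncation=True, max_length=2048,
              padding=True, return_tensors="pt")
    labels = tok(list(descs), truncation=True, max_length=128,
                 padding=True, return_tensors="pt").input_ids
    labels[labels == tok.pad_token_id] = -100      # padding does not contribute to the loss
    enc["labels"] = labels
    return enc

train_loader = DataLoader(train_pairs, batch_size=2, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(valid_pairs, batch_size=2, collate_fn=collate)

best = float("inf")
for epoch in range(2):
    model.train()
    for batch in train_loader:                     # steps 6.1-6.3: teacher forcing
        loss = model(**batch).loss                 # cross entropy against the code description
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
    with torch.no_grad():                          # step 6.4: validation after each round
        val = sum(model(**b).loss.item() for b in valid_loader) / len(valid_loader)
    if val < best:
        best = val
        torch.save(model.state_dict(), "ledt_best.pt")   # keep the lowest-loss parameters
```
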
4. A method according to claim 3, wherein said step 6.3 comprises:
step 6.3.1, initializing the hidden state of the decoder in the previous time step to be the value of the context coding vector corresponding to the current time step;
step 6.3.2, using the decoder hidden state and the code description word-segmented symbol in the previous time step as decoder input to update the decoder hidden state;
step 6.3.3, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 6.3.4, calculating a loss value between the predicted description of the current time step and the code description of the next time step, updating model parameters of the LEDT model through back propagation, and returning to step 6.3.1 to execute the next time step until the prediction of all the time steps is completed, and judging that the current training round is finished;
and 6.3.5, after the current training round is completed, inputting the seq2seq verification data set into the LEDT model, and recording model parameters when the loss is minimum.
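
A toy illustration of the one-step shift in step 6.3.4 is given below: the prediction made at time step t is scored against the reference token of time step t+1; the vocabulary size and sequence length are arbitrary.

```python
import torch
import torch.nn.functional as F

vocab_size, steps = 12, 6
logits = torch.randn(1, steps, vocab_size, requires_grad=True)   # decoder outputs for steps 1..T
target = torch.randint(0, vocab_size, (1, steps + 1))            # [start symbol, y1, ..., yT]

# The logit of step t is compared with the reference token of step t+1.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), target[:, 1:].reshape(-1))
loss.backward()        # back propagation would then update the LEDT parameters
print(float(loss))
```
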
5. The method according to claim 4, wherein the step 7 specifically includes:
step 7.1, inputting an input text in a data set to be encoded into a target model, segmenting words of the input text by a word segmentation device of the target model, converting the words into an ids sequence, and encoding the ids sequence by an encoder of the target model to obtain a context vector of the input text;
step 7.2, setting the first character generated by the decoder as a start character;
step 7.3, the decoder hidden state and the prediction description of the previous time step are used as decoder input to update the decoder hidden state;
step 7.4, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 7.5, inquiring the prefix tree, and setting the symbol score probability of the predictive description which does not belong to the prefix tree to zero;
step 7.6, retaining the k predictive descriptions with the highest scores by using a beam search algorithm;
and 7.7, repeating the steps 7.3 to 7.6 until all time step predictions are completed, obtaining k prediction descriptions, and converting the output k prediction descriptions into corresponding ICD codes by using a mapping set to serve as prediction output.
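
For illustration, the following sketch shows a single decoding step of steps 7.5 and 7.6: scores of symbols that the prefix tree does not allow are masked out (probability zero, i.e. minus infinity in log space) and the k highest-scoring candidates are kept; the nested-dict trie and the token ids are toy assumptions.

```python
import torch

def constrain_and_prune(scores, prefixes, trie, k):
    """scores: (num_beams, vocab) log-probabilities for the next token;
    prefixes: generated token ids of each beam; trie: nested-dict prefix tree."""
    vocab = scores.size(-1)
    for i, prefix in enumerate(prefixes):
        node = trie
        for t in prefix:                              # walk the trie along this beam
            node = node.get(t, {})
        mask = torch.full((vocab,), float("-inf"))
        allowed = list(node.keys())
        if allowed:
            mask[allowed] = 0.0                       # step 7.5: forbid everything else
        scores[i] = scores[i] + mask
    top = torch.topk(scores.view(-1), k)              # step 7.6: keep the k best candidates
    return top.values, top.indices // vocab, top.indices % vocab

# Toy usage: token ids 1->2 and 3->4 are the only two "descriptions" in the trie.
trie = {1: {2: {}}, 3: {4: {}}}
scores = torch.log_softmax(torch.randn(2, 5), dim=-1)     # 2 beams, vocabulary of 5 symbols
print(constrain_and_prune(scores, prefixes=[[1], [3]], trie=trie, k=2))
```
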
CN202410165651.XA 2024-02-05 2024-02-05 ICD automatic coding method based on pre-training language model Active CN117708339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410165651.XA CN117708339B (en) 2024-02-05 2024-02-05 ICD automatic coding method based on pre-training language model

Publications (2)

Publication Number Publication Date
CN117708339A (en) 2024-03-15
CN117708339B CN117708339B (en) 2024-04-23

Family

ID=90153845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410165651.XA Active CN117708339B (en) 2024-02-05 2024-02-05 ICD automatic coding method based on pre-training language model

Country Status (1)

Country Link
CN (1) CN117708339B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118013962A (en) * 2024-04-09 2024-05-10 华东交通大学 A Chinese discourse connector identification method based on bidirectional sequence generation
CN118428858A (en) * 2024-06-24 2024-08-02 华南理工大学 A warehouse management method, device, equipment and medium based on large language model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108175A1 (en) * 2016-04-08 2019-04-11 Koninklijke Philips N.V. Automated contextual determination of icd code relevance for ranking and efficient consumption
CN110895580A (en) * 2019-12-12 2020-03-20 山东众阳健康科技集团有限公司 ICD operation and operation code automatic matching method based on deep learning
US20210343410A1 (en) * 2020-05-02 2021-11-04 Petuum Inc. Method to the automatic International Classification of Diseases (ICD) coding for clinical records
CN115270715A (en) * 2021-12-17 2022-11-01 郑州大学第一附属医院 Intelligent auxiliary ICD automatic coding method and system for electronic medical record

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜逸超; 徐童; 马建辉; 陈恩红; 郑毅; 刘同柱; 童贵显: "A deep neural network-based automatic ICD coding method for clinical records" (一种基于深度神经网络的临床记录ICD自动编码方法), 大数据 (Big Data), no. 05 *

Also Published As

Publication number Publication date
CN117708339B (en) 2024-04-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant