CN117093729A

CN117093729A - Retrieval method, system and retrieval terminal based on medical scientific research information

Info

Publication number: CN117093729A
Application number: CN202311336929.7A
Authority: CN
Inventors: 蒋江涛; 马杰; 金剑; 邓小宁
Original assignee: North Health Medical Big Data Technology Co ltd
Current assignee: North Health Medical Big Data Technology Co ltd
Priority date: 2023-10-17
Filing date: 2023-10-17
Publication date: 2023-11-21
Anticipated expiration: 2043-10-17
Also published as: CN117093729B

Abstract

The application provides a retrieval method, a retrieval system and a retrieval terminal based on medical scientific research information, belonging to the technical field of medical scientific research information processing. The method comprises: acquiring retrieval information input by a user; parsing the retrieval information using natural language processing based on a large model, parsing out the sentence structure, word-meaning relationships and context information, and extracting keyword information; configuring the extracted keyword information into a retrieval expression; matching the retrieval expression against the medical scientific research literature in a medical scientific research literature database; and displaying the matched medical scientific research literature on a user interface. Because this large-model-based retrieval method for medical scientific research relies on natural language processing, it can understand the user's intention more accurately and improve the relevance of the retrieval results. The retrieved literature is displayed to the user, providing a better user experience.

Description

A retrieval method, system and retrieval terminal based on medical scientific research information

Technical field

The invention belongs to the technical field of medical scientific research information retrieval, and specifically relates to a retrieval method, system and retrieval terminal based on medical scientific research information.

Background

In the field of medical research, researchers need to obtain relevant information from a large body of literature and data to support their work. Traditional retrieval methods usually require users to edit retrieval expressions, which can be difficult for non-experts and takes a lot of time.

The existing solution most similar to the present invention is keyword-based retrieval. This method requires users to edit retrieval expressions and retrieves literature and data through keyword matching. However, it has the following disadvantages. Expressions are difficult to edit: non-experts may not be familiar with domain-specific terminology and phrasing, which makes editing expressions difficult. It is time-consuming: editing complex retrieval expressions takes a lot of time, which reduces the user's retrieval efficiency.

Disadvantages of the prior art: expressions are difficult for users to edit, because non-experts may not be familiar with domain-specific terminology and phrasing. Moreover, retrieving medical scientific research literature takes a long time: professional vocabulary must be edited and then combined before a search can be run. If the professional vocabulary cannot form an effective retrieval expression, the desired literature cannot be matched, which degrades researchers' experience of the system.

Summary of the invention

The invention provides a retrieval method based on medical scientific research information, which addresses the difficulty and time cost of editing retrieval expressions in the traditional approach. The purpose of the invention is to improve the user's retrieval efficiency and reduce the time the user spends editing expressions.

The method includes:

Step 1. Obtain the retrieval information input by the user;

Step 2. Parse the retrieval information using natural language processing based on a large model, parsing out the sentence structure, word-meaning relationships and context information, and extract keyword information;

The large model includes multiple identical encoder layers, each of which contains a self-attention mechanism and a feed-forward neural network;

The mathematical formula of the self-attention mechanism is as follows:

Attention(Q1, K1, V1) = softmax(Q1 × K1^T / sqrt(d_k1)) × V1;

where Q1, K1 and V1 denote the query, key and value input matrices respectively, and d_k1 denotes the dimension of the attention mechanism (the dimension of the key vectors);

The mathematical formula of the feed-forward neural network is as follows:

FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;

where x1 denotes the input vector, and W_1, b_1, W_2 and b_2 denote the parameters of the model;

The large model also includes multiple identical decoder layers, each of which contains a self-attention mechanism and an encoder-decoder attention mechanism;

In the large model structure, the connections between the encoder layers and the decoder layers use residual connections and layer normalization;

Step 3. Obtain the logical operators set by the user, combine the extracted keyword information according to those logical operators, and configure the result into a retrieval expression;

Step 4. Match the retrieval expression against the medical scientific research literature in the medical scientific research literature database;

Step 5. Display the matched medical scientific research literature on the user interface.

It should be further noted that parsing the retrieval information in Step 2 using natural language processing based on the large model also includes:

segmenting the input natural language text into words;

determining the part of speech of each word, and extracting keyword information;

analyzing the sentence structure in the retrieval information, determining the dependency relationships between words, and extracting keyword information.

It should be further noted that parsing the retrieval information in Step 2 using natural language processing based on the large model also includes: performing lexical analysis on the sentences in the retrieval information input by the user, and segmenting the sentences into words.

It should be further noted that the lexical analysis of a sentence includes: determining the dependency relationships between the words in the sentence, using either dependency parsing or phrase-structure parsing.

It should be further noted that the logical operators include: AND logic, OR logic and NOT logic.

It should be further noted that Step 4 also includes: the medical scientific research literature database stores multiple medical scientific research documents;

each medical scientific research document is configured with keyword tags;

the retrieval expression is matched against the keyword tags in the medical scientific research literature database;

the medical scientific research documents corresponding to the matched keyword tags are displayed on the user interface.

The invention also provides a retrieval system based on medical scientific research information. The system includes: an information input module, an information parsing module, an expression configuration module, a literature matching module and a literature display module;

The information input module is used to obtain the retrieval information input by the user;

The information parsing module is used to parse the retrieval information using natural language processing based on the large model, parsing out the sentence structure, word-meaning relationships and context information, and to extract keyword information;

FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;

In the large model structure, the connections between the encoder layers and the decoder layers use residual connections and layer normalization;

The expression configuration module is used to obtain the logical operators set by the user, combine the extracted keyword information according to those logical operators, and configure the result into a retrieval expression;

The literature matching module is used to match the retrieval expression against the medical scientific research literature in the medical scientific research literature database;

The literature display module is used to display information about the retrieval process and to display the matched medical scientific research literature.

The invention also provides a retrieval terminal, including a memory, a processor and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps of the retrieval method based on medical scientific research information.

It can be seen from the above technical solutions that the invention has the following advantages:

In the retrieval method based on medical scientific research information, the user enters a retrieval request in natural language, and the interface passes the input to the large model for natural language understanding. The large model converts the user input into a retrieval expression and passes it to the retrieval system. The retrieval system retrieves the relevant medical scientific research literature and data from the database according to the retrieval expression, and returns the results to the user interface for display. In this way, users do not need to edit complex retrieval expressions and can search in natural language, which reduces their learning cost and editing time. Moreover, the large model has natural language understanding capabilities, so it can understand the user's intention more accurately and improve the relevance of the retrieval results. The user-friendly interface provides a better user experience, so that non-experts can also search easily.

Description of the drawings

In order to explain the technical solution of the invention more clearly, the drawings required for the description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a flow chart of the retrieval method based on medical scientific research information;

Figure 2 is a schematic diagram of the retrieval system based on medical scientific research information.

Detailed description

The retrieval method based on medical scientific research information provided by the invention is a retrieval approach aimed at the field of medical scientific research, intended for researchers who need to retrieve medical scientific research literature. To make this retrieval more convenient, the invention can acquire and process the associated data using artificial intelligence technology. The method may involve technologies such as dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and operating/interaction systems. The software technologies of the retrieval method mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. Deep learning in turn usually includes artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, learning from demonstration and similar techniques. With these, the method can quickly match the medical scientific research literature that researchers want: it can return a set of relevant documents for reference, or locate one specific document precisely. This further addresses the problem that retrieving medical scientific research literature takes a long time, because professional vocabulary must be edited and then combined before a search can be run, and that, if the professional vocabulary cannot form an effective retrieval expression, the desired literature cannot be matched, which degrades researchers' experience of the system.

The retrieval method based on medical scientific research information can be applied in one or more retrieval terminals. A retrieval terminal is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.

The retrieval terminal can be any electronic product capable of human-computer interaction with the user, for example a personal computer, tablet computer, smartphone, personal digital assistant (PDA), or Internet Protocol television (IPTV).

The network in which the retrieval terminal is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), etc.

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

Figure 1 is a flow chart of the retrieval method based on medical scientific research information in a specific embodiment. The method includes:

S101. Obtain the retrieval information input by the user;

According to embodiments of the present application, the user can enter the information to be retrieved through devices such as a keyboard, keypad, switch, dial, mouse, trackball or voice recognizer. Input is not limited to typing and speech, and it can take the form of a sentence or of a phrase.

S102. Parse the retrieval information using natural language processing based on a large model, parsing out the sentence structure, word-meaning relationships and context information, and extract keyword information;

According to embodiments of the present application, the natural language processing of the large model can use models such as T5, GLM and GPT, which have strong natural language understanding capabilities. The large model can understand the natural language input by the user, extract the key information, and convert it into a retrieval expression that the retrieval system can understand.

This embodiment uses the semantic understanding capability of the large model to automatically convert the retrieval information input by the user into keyword information and then form the corresponding retrieval expression, so that retrieval is automatic. The retrieved literature is returned to the user. The user can therefore interact with the retrieval system directly in natural language, without manually constructing retrieval expressions from AND, OR and NOT operators, which makes the retrieval process easier to operate and more efficient.

As an example, the large model of this embodiment may use a Transformer-encoder, Transformer-decoder or full Transformer structure. The Transformer-encoder is mainly used to encode input sequences, such as natural language text. The large model consists of multiple identical encoder layers, each of which contains a self-attention mechanism and a feed-forward neural network.

The mathematical formula of the self-attention mechanism is as follows:

Attention(Q1, K1, V1) = softmax(Q1 × K1^T / sqrt(d_k1)) × V1;

where Q1, K1 and V1 denote the query, key and value input matrices respectively, and d_k1 denotes the dimension of the attention mechanism (the dimension of the key vectors).

The mathematical formula of the feed-forward neural network is as follows:

FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;

where x1 denotes the input vector, and W_1, b_1, W_2 and b_2 denote the parameters of the model.
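The self-attention and feed-forward formulas of the encoder layer can be sketched in a few lines of code. The following is a minimal illustration only, assuming matrices are stored as plain Python lists of rows so that no external library is needed; the helper names `matmul`, `transpose`, `softmax`, `attention` and `ffn` are hypothetical and are not part of the original text.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, where d_k is the width of K."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def ffn(x, w1, b1, w2, b2):
    """max(0, x W_1 + b_1) W_2 + b_2, applied to each row of x."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

# Two identical keys receive equal weight, so the output is the mean of the value rows.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

A production model would use a tensor library and learned projection matrices for Q, K and V; the sketch only mirrors the two formulas as written.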

The Transformer-decoder structure is mainly used to generate output sequences, for example in machine translation tasks. The large model consists of multiple identical decoder layers, each of which contains a self-attention mechanism and an encoder-decoder attention mechanism. The mathematical formula of the self-attention mechanism is the same as in the Transformer-encoder.

The mathematical formula of the encoder-decoder attention mechanism in this embodiment is as follows:

Attention(Q2, K2, V2) = softmax(Q2 × K2^T / sqrt(d_k2)) × V2;

where Q2 denotes the query vectors of the decoder, K2 denotes the key vectors of the encoder, V2 denotes the value vectors of the encoder, and d_k2 denotes the dimension of the key vectors. The mathematical formula of the feed-forward neural network is the same as in the Transformer-encoder.
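To make the encoder-decoder attention formula concrete, the shapes of its terms can be written out. This is an illustrative sketch under assumed notation: m (number of decoder positions), n (number of encoder positions) and d_v (width of the value vectors) are symbols introduced here and do not appear in the original text.

```latex
\begin{aligned}
Q_2 &\in \mathbb{R}^{m \times d_{k2}}, \qquad
K_2 \in \mathbb{R}^{n \times d_{k2}}, \qquad
V_2 \in \mathbb{R}^{n \times d_v}, \\
Q_2 K_2^{T} &\in \mathbb{R}^{m \times n}
  \quad \text{(one score per decoder/encoder position pair)}, \\
\operatorname{softmax}\!\left(\frac{Q_2 K_2^{T}}{\sqrt{d_{k2}}}\right) &\in \mathbb{R}^{m \times n}
  \quad \text{(each row sums to 1)}, \\
\operatorname{Attention}(Q_2, K_2, V_2) &\in \mathbb{R}^{m \times d_v}.
\end{aligned}
```

Each decoder position therefore receives a weighted average of the encoder's value vectors, with weights determined by how well its query matches each encoder key.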

The full Transformer structure combines the Transformer-encoder and the Transformer-decoder for sequence-to-sequence tasks, such as machine translation. It is formed by stacking multiple encoder layers and multiple decoder layers alternately.

In the Transformer structure, the connections between the encoder layers and the decoder layers use residual connections and layer normalization.

The mathematical formula of the residual connection is as follows:

LayerNorm(x3 + Sublayer(x3));

where x3 denotes the input vector, and Sublayer(x3) denotes the output of the sublayer.

The mathematical formula of layer normalization is as follows:

LayerNorm(x3) = (x3 - mean(x3)) / sqrt(var(x3) + Ep) × GA + Be;

where mean(x3) and var(x3) denote the mean and variance of x3 respectively, Ep is a small constant for numerical stability, and GA and Be are learnable parameters.
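The residual connection and layer normalization formulas can be illustrated with a short sketch. It assumes a single input vector stored as a Python list and uses the population variance; the function names `layer_norm` and `residual_block`, and the default `eps` value, are assumptions made for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """(x - mean(x)) / sqrt(var(x) + eps) * gamma + beta, element by element."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [(v - mean) / scale * g + b for v, g, b in zip(x, gamma, beta)]

def residual_block(x, sublayer, gamma, beta):
    """LayerNorm(x + Sublayer(x)): add the sublayer output to its input, then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)], gamma, beta)

# Example with a toy sublayer that doubles its input.
normalized = residual_block([1.0, 2.0, 3.0], lambda v: [2 * t for t in v],
                            gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
```

Here `gamma` and `beta` play the roles of GA and Be, and `eps` plays the role of Ep.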

With the above processing, the Transformer model can effectively capture the context information in the input sequence and generate the corresponding output sequence.

In an exemplary embodiment, training the large model generally includes the following steps:

Data preparation: prepare a large-scale training data set, including input samples and the corresponding target outputs.

Model initialization: initialize the weight and bias parameters of the neural network.

Forward propagation: pass the input samples forward through the neural network to obtain the model's predicted output.

Loss calculation: compare the model's predicted output with the target output and compute the value of the loss function, which measures the difference between the predicted and true results.

Backpropagation: use the backpropagation algorithm to compute the gradient of the loss function with respect to the model parameters, that is, the derivatives of the loss with respect to the weights and biases.

Parameter update: use an optimization algorithm (such as gradient descent) to update the model parameters according to the gradients, so that the loss gradually decreases.

Iteration: repeat the forward propagation, loss calculation, backpropagation and parameter update steps until a predetermined number of training epochs or a convergence condition is reached.

Training a large model usually requires large-scale computing resources and a long training time. Techniques such as distributed training and parallel computing can be used to accelerate it.
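The training steps listed above can be sketched on a deliberately tiny problem. The example below fits a one-variable linear model with plain gradient descent and a mean squared error loss; it stands in for the large model only to show the loop structure, and the function name `train`, the learning rate and the epoch count are assumptions.

```python
def train(samples, targets, lr=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                      # model initialization
    n = len(samples)
    loss = float("inf")
    for _ in range(epochs):              # repeat until the epoch budget is used
        preds = [w * x + b for x in samples]               # forward propagation
        errors = [p - t for p, t in zip(preds, targets)]
        loss = sum(e * e for e in errors) / n              # loss calculation
        # Gradients of the loss with respect to w and b, derived by hand here;
        # a deep network would obtain them through backpropagation.
        grad_w = 2 * sum(e * x for e, x in zip(errors, samples)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w                                   # parameter update
        b -= lr * grad_b
    return w, b, loss

# Data preparation: inputs and targets generated from y = 2x + 1.
w, b, loss = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

On this data the loop recovers a slope close to 2 and an intercept close to 1. A real large model differs in scale and optimizer, not in the order of the steps.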

In this embodiment, the large model is exposed to a large amount of natural language text during training and learns the statistical regularities and semantic information of language. By modeling the relationships between words, sentences and context in the text data, it acquires rich language knowledge. When the user enters natural language, the large model can understand and process the input based on this learned knowledge, extract the key information, and convert it into a retrieval expression that the retrieval system can understand.

According to embodiments of the present application, the method uses natural language processing to analyze and understand the semantics of the user's natural language input. By analyzing the sentence structure, word-meaning relationships and context information of the input, the system can accurately understand the user's retrieval intention and extract the key information. For example, the user can enter "find new drugs to treat cancer", and the system will understand that the user wants to find new drugs related to the treatment of cancer. The user input is split, for example on the basis of verbs and nouns, and attributives, adverbials and so on can be separated out. Keywords are then identified and combined into a retrieval expression. "Find new drugs to treat cancer" can be split into "treating cancer" and "drugs", which are used to match the corresponding literature. "Treating cancer" and "drugs" can be combined according to the logic the user selects. For example, if the user selects AND logic, "treating cancer" AND "drugs" are combined into a retrieval expression that satisfies the retrieval request.

The overall natural language processing procedure of the invention usually includes the following stages:

Word segmentation: segment the input natural language text into words.

Part-of-speech tagging: determine the part of speech (noun, verb, etc.) of each word.

Syntactic analysis: analyze the structure of the sentence and determine the dependency relationships between words.

Semantic analysis: understand the semantics of the sentence and determine what it means and expresses.

Semantic understanding: extract the key information and semantic roles from the sentence, and understand its meaning and intention.

Mathematical models commonly used in semantic analysis and understanding include:

Word embedding models (such as Word2Vec and GloVe): map words into a continuous vector space and capture the semantic relationships between words.

Recurrent neural network (RNN): used to process sequence data, such as sentences or text.

Attention mechanism: used to give weighted attention to different parts of the input when processing long text.

Transformer model: a neural network model based on the self-attention mechanism, used to process sequence data.

In this embodiment, once the retrieval information input by the user has been analyzed, the sentence structure and word-meaning relationships can be obtained. Natural language processing (NLP) techniques and the corresponding mathematical models are usually used for this.

The following is one way in which the large model can carry out this process:

Lexical analysis: the large model first performs lexical analysis on the sentence input by the user and segments it into words. This can be done with a lexical analyzer or a pre-trained lexical analysis model.

Syntactic analysis: next, the large model analyzes the structure of the sentence and determines the dependency relationships between words, such as subject-predicate and verb-object relationships. This can be done with a dependency parser, a phrase-structure parser or a pre-trained parsing model.

Semantic analysis: in this stage, the large model understands the semantics of the sentence and determines what it means and expresses. This includes tasks such as word sense disambiguation, coreference resolution and semantic role labeling, and can be done with models dedicated to each task or with a pre-trained semantic analysis model.

Context modeling: to understand the sentence better, the large model takes into account its context, including the preceding and following text. Context-aware models such as recurrent neural networks (RNN) or Transformer models can be used to model and represent the sentence.

Through the above steps, the large model can analyze the sentence input by the user and extract its structure, word-meaning relationships and context information, so as to better understand the user's intention and needs. The large model can then convert the user input into a retrieval expression that the retrieval system can understand and process, and provide more accurate retrieval results. It should be noted that the specific implementation and the models used may vary with the application scenario and requirements.
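The path from word segmentation through part-of-speech tagging to keyword extraction can be illustrated with a toy example. This sketch is not the large model described in the text: it replaces the model with a tiny hand-written lexicon, and the names `POS_LEXICON`, `QUERY_VERBS`, `tokenize`, `tag` and `extract_keywords` are all hypothetical.

```python
import re

# Hypothetical toy lexicon; a real system would rely on a trained tagger or a large model.
POS_LEXICON = {
    "find": "VERB", "treat": "VERB", "new": "ADJ",
    "drugs": "NOUN", "cancer": "NOUN", "to": "PART",
}
# Verbs that express the request itself rather than the topic being searched for.
QUERY_VERBS = {"find", "search", "show"}

def tokenize(text):
    """Word segmentation: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tag(tokens):
    """Part-of-speech tagging by lexicon lookup; unknown words get the tag 'X'."""
    return [(tok, POS_LEXICON.get(tok, "X")) for tok in tokens]

def extract_keywords(text):
    """Keep nouns and verbs, dropping verbs that only express the search request."""
    return [tok for tok, pos in tag(tokenize(text))
            if pos in {"NOUN", "VERB"} and tok not in QUERY_VERBS]

keywords = extract_keywords("Find new drugs to treat cancer")
# keywords == ["drugs", "treat", "cancer"]
```

The rule of keeping nouns and verbs is a simplification chosen for the example; the text leaves the actual keyword selection to the large model's semantic understanding.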

S103、获取用户设置在逻辑运算符，并基于用户设置的逻辑运算符将提取的关键词信息进行组合，配置成检索表达式；S103. Obtain the logical operators set by the user, combine the extracted keyword information based on the logical operators set by the user, and configure it into a search expression;

本实施例的检索表达式是将提取出关键词信息进行有效的组合形成。The search expression in this embodiment is formed by effectively combining the extracted keyword information.

It should be noted that the way the keyword information is combined is determined by the logical operators set by the user: the system obtains these operators and combines the extracted keyword information accordingly to form the search expression.

Optionally, the logical operators include AND logic, OR logic, and NOT logic.

比如检索治疗癌症药物，用户输入检索信息的同时可以选择逻辑关系。For example, when searching for cancer drugs, users can select logical relationships while inputting the search information.

Of course, the system can also default to AND logic, in which case the expression is "treatment" AND "cancer" AND "drug". It can also be set to OR logic.

In this way, the search expression is formed by applying natural language processing to convert the natural language input by the user into a structured search expression that the retrieval system can understand and process.
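The combination of keywords and logical operators in step S103 can be sketched as follows. This is a minimal illustration, not the expression format of this embodiment: the function name `build_expression`, the quoted-keyword syntax, and the handling of NOT as a separate list of excluded keywords are all assumptions.

```python
from typing import Sequence

def build_expression(keywords: Sequence[str],
                     operator: str = "AND",
                     excluded: Sequence[str] = ()) -> str:
    """Join keywords with AND/OR and append NOT terms (hypothetical syntax)."""
    op = operator.upper()
    if op not in {"AND", "OR"}:
        raise ValueError("operator must be AND or OR")
    if not keywords:
        raise ValueError("at least one keyword is required")
    positive = f" {op} ".join(f'"{k}"' for k in keywords)
    if len(keywords) > 1:
        positive = f"({positive})"  # group so NOT applies to the whole set
    negative = "".join(f' NOT "{k}"' for k in excluded)
    return positive + negative

# Default AND logic, as in the cancer drug example above.
default_expr = build_expression(["treatment", "cancer", "drug"])
# User-selected OR logic with one excluded keyword.
custom_expr = build_expression(["treatment", "cancer", "drug"], "or", ["surgery"])
```

Defaulting `operator` to AND mirrors the default logical relationship described above, while the user's selection simply overrides it.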

S104、基于检索表达式与医疗科研文献数据库中的医疗科研文献进行匹配；S104. Match the medical scientific research literature in the medical scientific research literature database based on the search expression;

It can be understood that a large amount of medical scientific research literature is collected and stored in advance in the medical scientific research literature database. The database is not limited to medical scientific research literature and can also include other literature. To make it easier for the system to find documents, each medical scientific research document is configured with a keyword tag. If a document covers many chapters, multiple keyword tags can be configured on it, and the document can be matched as long as a search keyword corresponds to one of its tags.

Keyword tags can be set based on the subject name, field, abstract, core chapter content, and so on of a medical scientific research document. They can be assigned automatically by the system or set manually.

这样，基于检索表达式与医疗科研文献数据库中的关键词标签进行匹配；将匹配出关键词标签所对应的医疗科研文献在用户界面上显示。In this way, the keyword tags in the medical scientific research literature database are matched based on the search expression; the medical scientific research documents corresponding to the matched keyword tags are displayed on the user interface.
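The tag matching of step S104 can be sketched as a set lookup. This is a minimal sketch under stated assumptions: the in-memory `DATABASE` dictionary, the document identifiers, and the functions `matches` and `search` are hypothetical, and the query is passed as keywords plus an operator rather than as a parsed expression string.

```python
def matches(tags, keywords, operator="AND", excluded=()):
    """Return True if a document's keyword tags satisfy the query."""
    op = operator.upper()
    if op not in {"AND", "OR"}:
        raise ValueError("operator must be AND or OR")
    tags = {t.lower() for t in tags}
    wanted = [k.lower() for k in keywords]
    if not wanted:
        return False
    # NOT logic: any excluded keyword among the tags rejects the document.
    if any(k.lower() in tags for k in excluded):
        return False
    hits = [k in tags for k in wanted]
    return all(hits) if op == "AND" else any(hits)

def search(database, keywords, operator="AND", excluded=()):
    """Return the ids of all documents whose tags match the query."""
    return [doc_id for doc_id, tags in database.items()
            if matches(tags, keywords, operator, excluded)]

# Hypothetical tag database: document id -> keyword tags.
DATABASE = {
    "doc-001": {"treatment", "cancer", "drug"},
    "doc-002": {"cancer", "surgery"},
    "doc-003": {"diabetes", "drug"},
}

and_hits = search(DATABASE, ["treatment", "cancer", "drug"])
or_hits = search(DATABASE, ["cancer", "drug"], "OR", ["surgery"])
```

A production system would more likely use an inverted index or a search engine than a linear scan; the scan is used here only to keep the logic visible.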

S105、将匹配出的医疗科研文献在用户界面上显示。S105. Display the matched medical scientific research documents on the user interface.

根据本申请的实施例，本发明提供用户友好的界面，用户可以通过自然语言表达的方式输入检索需求。用户界面还可以提供实时的反馈，帮助用户更好地理解和调整检索需求。According to the embodiments of the present application, the present invention provides a user-friendly interface, and the user can input search requirements through natural language expression. The user interface can also provide real-time feedback to help users better understand and adjust retrieval needs.

这里，可以根据用户的反馈不断优化检索结果，可以采用以下方式之一：反馈回路：在用户获取检索结果后，系统可以要求用户提供反馈，例如通过评分、喜欢/不喜欢等方式。根据用户的反馈，系统可以调整检索算法或重新排序结果，以提供更符合用户需求的结果。Here, the search results can be continuously optimized based on user feedback, which can be done in one of the following ways: Feedback loop: After the user obtains the search results, the system can ask the user to provide feedback, such as through ratings, likes/dislikes, etc. Based on user feedback, the system can adjust the search algorithm or reorder the results to provide results that better suit the user's needs.

强化学习：系统可以使用强化学习算法，通过与用户的交互来学习如何优化检索结果。系统根据用户的反馈（奖励信号）调整检索算法的参数，以使得未来的检索结果更符合用户需求。Reinforcement learning: The system can use reinforcement learning algorithms to learn how to optimize search results through interaction with users. The system adjusts the parameters of the retrieval algorithm based on user feedback (reward signals) so that future retrieval results are more in line with user needs.

用户模型：系统可以根据用户的历史行为和偏好建立用户模型，利用该模型来预测用户的喜好和需求，并根据预测结果进行检索结果的优化。User model: The system can establish a user model based on the user's historical behavior and preferences, use the model to predict the user's preferences and needs, and optimize the search results based on the prediction results.
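Of the three options above, the feedback loop is the simplest to illustrate. The sketch below keeps a per-document adjustment that grows with likes and shrinks with dislikes, and re-sorts results by the adjusted score. The class name `FeedbackReranker`, the fixed `step` size, and the base scores are assumptions for illustration; this is not a reinforcement learning algorithm and not the method of this embodiment.

```python
class FeedbackReranker:
    """Re-rank results using accumulated like/dislike feedback (illustrative)."""

    def __init__(self, step=1.0):
        self.step = step   # how much one piece of feedback shifts a score
        self.bonus = {}    # document id -> accumulated adjustment

    def record(self, doc_id, liked):
        """Store one like (True) or dislike (False) for a document."""
        delta = self.step if liked else -self.step
        self.bonus[doc_id] = self.bonus.get(doc_id, 0.0) + delta

    def rerank(self, scored_results):
        """Sort (doc_id, base_score) pairs by base score plus feedback bonus."""
        return sorted(
            scored_results,
            key=lambda item: item[1] + self.bonus.get(item[0], 0.0),
            reverse=True,
        )

reranker = FeedbackReranker()
results = [("doc-001", 2.0), ("doc-003", 1.5)]
reranker.record("doc-003", liked=True)   # the user liked the second result
reordered = reranker.rerank(results)     # doc-003 now ranks first
```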

应理解，上述实施例中各步骤的序号的大小并不意味着执行顺序的先后，各过程的执行顺序应以其功能和内在逻辑确定，而不应对本发明实施例的实施过程构成任何限定。It should be understood that the sequence number of each step in the above embodiment does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present invention.

可以看出本发明涉及的基于医疗科研信息的检索方法中，用户通过自然语言表达的方式输入检索需求，界面将用户输入传递给大模型进行自然语言理解。大模型将用户输入转换成检索表达式，并将其传递给检索系统。检索系统根据检索表达式从数据库中检索相关的医疗科研文献和数据，并将结果返回给用户界面展示给用户。这样，用户无需编辑复杂的检索表达式，通过自然语言表达的方式进行检索，减少了用户的学习成本和编辑时间。而且本发明涉及的大模型具备自然语言理解能力，能够更准确地理解用户的意图，提高检索结果的相关性。用户友好的界面提供了更好的用户体验，使非专业人士也能够轻松进行检索。It can be seen that in the retrieval method based on medical scientific research information involved in the present invention, the user inputs retrieval requirements through natural language expression, and the interface passes the user input to the large model for natural language understanding. The large model converts user input into retrieval expressions and passes them to the retrieval system. The retrieval system retrieves relevant medical research documents and data from the database according to the search expression, and returns the results to the user interface to display to the user. In this way, users do not need to edit complex search expressions and can search through natural language expressions, reducing users' learning costs and editing time. Moreover, the large model involved in the present invention has natural language understanding capabilities, can more accurately understand the user's intention, and improve the relevance of the search results. The user-friendly interface provides a better user experience, making retrieval easy even for non-experts.

The following is an embodiment of the retrieval system based on medical scientific research information provided by the embodiments of the present disclosure. This system and the retrieval method based on medical scientific research information in the above embodiments belong to the same inventive concept. For details not described exhaustively in the embodiment of the retrieval system, reference may be made to the above embodiments of the retrieval method based on medical scientific research information.

如图2所示，系统包括：信息输入模块、信息解析模块、表达式配置模块、文献匹配模块以及文献展示模块；As shown in Figure 2, the system includes: information input module, information analysis module, expression configuration module, document matching module and document display module;

信息输入模块用户向用户提供信息输入的输入装置，并基于输入装置获取用户输入的检索信息；The information input module provides the user with an input device for information input, and obtains the search information input by the user based on the input device;

信息解析模块基于大模型的自然语言处理方式对检索信息进行解析，解析出句子结构、词义关系和上下文信息，并提取出关键词信息；The information parsing module parses the retrieved information based on the natural language processing method of large models, parses out the sentence structure, word meaning relationships and contextual information, and extracts keyword information;

表达式配置模块用于将提取的关键词信息配置成检索表达式；The expression configuration module is used to configure the extracted keyword information into a search expression;

文献匹配模块基于检索表达式与医疗科研文献数据库中的医疗科研文献进行匹配；The literature matching module matches the medical research literature in the medical research literature database based on the search expression;

文献展示模块提供系统运行信息的显示模块，显示检索过程信息以及将匹配出的医疗科研文献在用户界面上显示。The literature display module provides a display module for system operation information, displays retrieval process information, and displays matched medical scientific research documents on the user interface.
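The way the five modules hand data to one another can be sketched as a small pipeline. This is a structural sketch only: the class `RetrievalSystem`, its field names, and the stub lambdas are hypothetical, and the whitespace-based keyword extraction merely stands in for the large model.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RetrievalSystem:
    """Wires the five modules together; each field is one module's behavior."""
    read_input: Callable[[], str]            # information input module
    parse: Callable[[str], List[str]]        # information parsing module
    configure: Callable[[List[str]], str]    # expression configuration module
    match: Callable[[str], List[str]]        # literature matching module
    display: Callable[[List[str]], None]     # literature display module

    def run(self) -> List[str]:
        query = self.read_input()
        keywords = self.parse(query)
        expression = self.configure(keywords)
        documents = self.match(expression)
        self.display(documents)
        return documents

shown: List[str] = []  # stands in for the user interface
system = RetrievalSystem(
    read_input=lambda: "drugs for treating cancer",
    # Placeholder for the large model: lowercase, split, drop one stop word.
    parse=lambda text: [w for w in text.lower().split() if w != "for"],
    configure=lambda kws: " AND ".join(kws),
    match=lambda expr: ["doc-001"] if "cancer" in expr else [],
    display=shown.extend,
)
result = system.run()
```

Passing each module in as a callable keeps the modules replaceable, which fits the statement that the functions may be realized in hardware, software, or a combination of both.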

The units and algorithm steps of the examples described in connection with the embodiments disclosed herein, on which the retrieval system based on medical scientific research information provided by the present invention is built, can be implemented by electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each specific application, but such implementations should not be considered to be beyond the scope of the present invention.

From the above description of the embodiments, those skilled in the art can easily understand that the retrieval system based on medical scientific research information described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the disclosed embodiments of the retrieval method based on medical scientific research information can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a mobile terminal, a network device, etc.) to execute the indexing method according to the embodiments of the present disclosure.

In embodiments of the present invention, computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof. These programming languages include, but are not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.

对所公开的实施例的上述说明，使本领域专业技术人员能够实现或使用本发明。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的，本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下，在其它实施例中实现。因此，本发明将不会被限制于本文所示的这些实施例，而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be practiced in other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A retrieval method based on medical scientific research information, characterized in that the method includes:

Step 1: Obtain the search information input by the user;

Step 2: Analyze the retrieval information based on the natural language processing method of the large model, analyze the sentence structure, word meaning relationship and contextual information, and extract the keyword information;

Among them, the large model includes: multiple identical encoder layers, each encoder layer contains a self-attention mechanism and a feed-forward neural network;

The mathematical formula of the self-attention mechanism is as follows:

Attention(Q1, K1, V1) = softmax((Q1 × K1^T) / sqrt(d_k1)) × V1;

Among them, Q1, K1 and V1 represent the query, key and value input matrices respectively, and d_k1 represents the dimension of the key vectors in the attention mechanism;

The mathematical formula of feedforward neural network is as follows:

FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;

Among them, x1 represents the input vector, W_1, b_1, W_2 and b_2 represent the parameters of the model;

The large model also includes: multiple identical decoder layers, each of which contains a self-attention mechanism and an encoder-decoder attention mechanism;

In the large model structure, the connection between the encoder layer and the decoder layer uses residual connection and layer normalization processing;

Step 3: Obtain the logical operators set by the user, combine the extracted keyword information based on the logical operators set by the user, and configure it into a search expression;

Step 4: Match the medical research literature in the medical research literature database based on the search expression;

Step 5: Display the matched medical research documents on the user interface.

2. The retrieval method based on medical scientific research information according to claim 1, characterized in that the method of parsing the retrieval information based on the natural language processing method of the large model in step 2 also includes:

Segment the input natural language text into words;

Determine the part of speech of each word and extract keyword information.

3. The retrieval method based on medical scientific research information according to claim 1, characterized in that the method of parsing the retrieval information based on the natural language processing method of the large model in step 2 also includes:

Analyze the sentence structure in the retrieved information, determine the dependencies between words, and extract keyword information.

4. The retrieval method based on medical scientific research information according to claim 3, characterized in that the method of parsing the retrieval information based on the natural language processing method of the large model in step 2 also includes: performing lexical analysis on the sentences of the retrieval information input by the user and dividing them into words.

5. The retrieval method based on medical scientific research information according to claim 4, characterized in that performing lexical analysis on the sentence includes: determining the dependency relationship between words in the sentence, which is achieved by dependency syntax analysis or phrase structure syntax analysis.

6. The retrieval method based on medical scientific research information according to claim 1, characterized in that the logical operators include: AND logic, OR logic and NOT logic.

7. The retrieval method based on medical scientific research information according to claim 1, characterized in that step 4 further includes: storing multiple medical scientific research documents in the medical scientific research document database;

Each medical research document is configured with keyword tags;

Match the keyword tags in the medical research literature database based on the search expression;

The medical scientific research documents corresponding to the matched keyword tags are displayed on the user interface.

8. A retrieval system based on medical scientific research information, characterized in that the system implements the retrieval method based on medical scientific research information as described in any one of claims 1 to 7;

The system includes: information input module, information analysis module, expression configuration module, document matching module and document display module;

The information input module is used to obtain the retrieval information input by the user;

The information parsing module is used to parse the retrieved information in combination with the natural language processing method of the large model, parse out the sentence structure, word meaning relationships and contextual information, and extract keyword information;

The mathematical formula of the self-attention mechanism is as follows:

Attention(Q1, K1, V1) = softmax((Q1 × K1^T) / sqrt(d_k1)) × V1;

The mathematical formula of feedforward neural network is as follows:

FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;

In the large model structure, the connection between the encoder layer and the decoder layer uses residual connection and layer normalization processing;

The expression configuration module is used to obtain the logical operators set by the user, combine the extracted keyword information based on the logical operators set by the user, and configure it into a search expression;

The literature matching module is used to match the medical research literature in the medical research literature database based on the search expression;

The document display module is used to display the search process information and display the matched medical scientific research documents.

9. A retrieval terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the program, the processor implements the steps of the retrieval method based on medical scientific research information according to any one of claims 1 to 7.