CN117093729B - Retrieval method, system and retrieval terminal based on medical scientific research information - Google Patents
- Publication number
- CN117093729B CN202311336929.7A
- Authority
- CN
- China
- Prior art keywords
- information
- retrieval
- medical scientific
- user
- scientific research
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Molecular Biology (AREA)
- Public Health (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Pathology (AREA)
- Library & Information Science (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a retrieval method, a retrieval system and a retrieval terminal based on medical scientific research information, belonging to the technical field of medical scientific research information processing. The method comprises: acquiring retrieval information input by a user; analyzing the retrieval information using large-model-based natural language processing, including analyzing sentence structure, word-meaning relations and context information, and extracting keyword information; configuring the extracted keyword information into a retrieval expression; matching the retrieval expression against the medical scientific research literature in a medical scientific research literature database; and displaying the matched literature on a user interface. Because the large-model-based retrieval method relies on natural language processing, it can understand the user's intention more accurately and improve the relevance of the retrieval results. The retrieved documents are also displayed to the user, providing a better user experience.
Description
Technical Field
The invention belongs to the technical field of medical scientific research information retrieval, and particularly relates to a retrieval method, a retrieval system and a retrieval terminal based on medical scientific research information.
Background
In the field of medical research, researchers need to obtain relevant information from a large volume of documents and data to support their work. Conventional retrieval methods typically require the user to write the retrieval expression by hand, which can be difficult for non-professionals and can take a significant amount of time.
The existing scheme most similar to the invention is keyword-based retrieval. This approach requires the user to edit a retrieval expression and retrieves documents and data through keyword matching. It has the following disadvantages. Editing expressions is difficult: non-professionals may not be familiar with domain-specific terms and expression syntax. Time consumption is large: editing complex retrieval expressions takes a lot of time and reduces the user's retrieval efficiency.
Moreover, retrieving medical scientific research literature is time-consuming: professional vocabulary has to be edited and then combined before a search can be run, and if the vocabulary does not form an effective retrieval formula, the documents being sought cannot be matched, which degrades researchers' experience of the system.
Disclosure of Invention
The invention provides a retrieval method based on medical scientific research information that addresses the difficulty and time cost of manually editing retrieval expressions. The invention aims to improve the user's retrieval efficiency and reduce the time the user spends editing expressions.
The method comprises the following steps:
step one, acquiring search information input by a user;
step two, analyzing the search information using large-model-based natural language processing, including analyzing sentence structure, word-meaning relations and context information, and extracting keyword information;
wherein the large model includes: a plurality of identical encoder layers, each encoder layer comprising a self-attention mechanism and a feed-forward neural network;
the mathematical formula for the self-attention mechanism is as follows:
Attention(Q1, K1, V1) = softmax(Q1 × K1^T / sqrt(d_k1)) × V1;
wherein Q1, K1 and V1 represent the input matrices of queries, keys and values, respectively, and d_k1 represents the dimension of the attention mechanism;
the mathematical formula of the feed-forward neural network is as follows:
FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;
where x1 represents the input vector, and W_1, b_1, W_2 and b_2 represent parameters of the model;
the large model further includes: a plurality of identical decoder layers, each decoder layer containing a self-attention mechanism and an encoder-decoder attention mechanism;
in the large model structure, the connection between the encoder layer and the decoder layer uses residual connection and layer normalization;
step three, acquiring a logical operator set by the user, and combining the extracted keyword information based on that logical operator to configure a retrieval expression;
step four, matching against the medical scientific research literature in a medical scientific research literature database based on the retrieval expression;
step five, displaying the matched medical scientific research literature on a user interface.
It should be further noted that, in step two, analyzing the search information using large-model-based natural language processing further includes:
segmenting the input natural language text into words;
determining the part of speech of each word, and extracting keyword information.
It should be further noted that, in step two, analyzing the search information using large-model-based natural language processing further includes:
analyzing the sentence structure in the search information, determining the dependency relationships among the words, and extracting keyword information.
It should be further noted that, in step two, analyzing the search information using large-model-based natural language processing further includes: performing lexical analysis on the sentences in the search information input by the user, and segmenting the sentences into words.
It should be further noted that performing lexical analysis on a sentence includes: determining the dependency relationships between the words in the sentence, implemented using dependency syntax analysis or phrase structure syntax analysis.
It should be further noted that the logical operators include: AND logic, OR logic, and NOT logic.
It should be further noted that step four further includes: storing a plurality of medical scientific research documents in the medical scientific research literature database;
each medical scientific research document is configured with a keyword tag;
matching the retrieval expression against the keyword tags in the medical scientific research literature database;
and displaying the medical scientific research documents corresponding to the matched keyword tags on a user interface.
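The tag-matching step described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Document` class, the `search` and `matches` helpers, the placeholder documents, and in particular the reading of NOT as "first keyword present, remaining keywords absent" are not specified by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical record: a document title plus its keyword tags."""
    title: str
    tags: frozenset


def matches(tags, keywords, operator):
    """Evaluate a flat retrieval expression against one document's tags."""
    if not keywords:
        return False
    hits = [k.lower() in tags for k in keywords]
    if operator == "AND":
        return all(hits)
    if operator == "OR":
        return any(hits)
    if operator == "NOT":
        # Assumed semantics: first keyword required, the rest excluded.
        return hits[0] and not any(hits[1:])
    raise ValueError(f"unsupported operator: {operator}")


def search(database, keywords, operator="AND"):
    """Return the titles of all documents whose tags satisfy the expression."""
    return [d.title for d in database if matches(d.tags, keywords, operator)]


# Placeholder database; titles and tags are illustrative only.
DATABASE = [
    Document("Doc A", frozenset({"cancer", "drug"})),
    Document("Doc B", frozenset({"cancer", "surgery"})),
    Document("Doc C", frozenset({"diabetes", "drug"})),
]

and_hits = search(DATABASE, ["cancer", "drug"], "AND")
or_hits = search(DATABASE, ["cancer", "drug"], "OR")
not_hits = search(DATABASE, ["cancer", "drug"], "NOT")
```

Matching against pre-assigned tags rather than full text keeps the lookup simple, at the cost of depending on how well each document was tagged.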
The invention also provides a retrieval system based on medical scientific research information, which comprises: the system comprises an information input module, an information analysis module, an expression configuration module, a document matching module and a document display module;
the information input module is used for acquiring search information input by a user;
the information analysis module is used for analyzing the search information by combining the natural language processing mode of the large model, analyzing sentence structure, word meaning relation and context information, and extracting keyword information;
wherein the large model includes: a plurality of identical encoder layers, each encoder layer comprising a self-attention mechanism and a feed-forward neural network;
the mathematical formula for the self-attention mechanism is as follows:
Attention(Q1, K1, V1) = softmax(Q1 × K1^T / sqrt(d_k1)) × V1;
wherein Q1, K1 and V1 represent the input matrices of queries, keys and values, respectively, and d_k1 represents the dimension of the attention mechanism;
the mathematical formula of the feed-forward neural network is as follows:
FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;
where x1 represents the input vector, and W_1, b_1, W_2 and b_2 represent parameters of the model;
the large model further includes: a plurality of identical decoder layers, each decoder layer containing a self-attention mechanism and an encoder-decoder attention mechanism;
in the large model structure, the connection between the encoder layer and the decoder layer uses residual connection and layer normalization;
the expression configuration module is used for acquiring the logical operator set by the user and combining the extracted keyword information based on that logical operator to configure a retrieval expression;
the document matching module is used for matching with medical scientific research documents in the medical scientific research document database according to the retrieval expression;
and the document display module is used for displaying the information of the retrieval process and displaying the matched medical scientific research documents.
The invention also provides a retrieval terminal which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the retrieval method based on medical scientific research information when executing the program.
From the above technical scheme, the invention has the following advantages:
In the retrieval method based on medical scientific research information, the user states the retrieval requirement in natural language, and the interface passes the user input to the large model for natural language understanding. The large model converts the user input into a retrieval expression and passes it to the retrieval system. The retrieval system retrieves the relevant medical scientific research documents and data from the database according to the retrieval expression and returns the result to the user interface for display. The user therefore does not need to edit complex search expressions and can search in natural language, which reduces both learning cost and editing time. Moreover, the large model used by the invention has natural language understanding capability, so it can understand the user's intention more accurately and improve the relevance of the search results. The user-friendly interface provides a better user experience and lets non-professionals retrieve documents easily as well.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a retrieval method based on medical scientific research information;
FIG. 2 is a schematic diagram of a retrieval system based on medical scientific research information.
Detailed Description
The retrieval method based on medical scientific research information provided by the invention is a retrieval approach aimed at the medical scientific research field and is used to provide researchers with retrieval of medical scientific research documents. To make it easier for researchers to search medical scientific research documents, the invention can acquire and process the associated data using artificial intelligence technology. The underlying technologies may include dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and the like. The retrieval method mainly involves computer vision, speech processing, natural language processing, machine learning/deep learning, and the like. Deep learning here generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. The method can therefore quickly match the medical scientific research literature a researcher wants: it can return a corresponding number of documents for reference, and it can also locate one specific document precisely. This effectively addresses the problem that retrieving medical scientific research literature is time-consuming, that professional vocabulary has to be edited and then combined before a search can be run, and that, if the vocabulary does not form an effective retrieval formula, the documents being sought cannot be matched, which degrades researchers' experience of the system.
The retrieval method based on medical scientific research information can be applied to one or more retrieval terminals. A retrieval terminal is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices and the like.
The retrieval terminal may be any electronic product that can interact with a user, such as a personal computer, tablet computer, smart phone, personal digital assistant (Personal Digital Assistant, PDA), interactive network television (Internet Protocol Television, IPTV), etc.
The network in which the retrieval terminal is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, a flowchart of the retrieval method based on medical scientific research information in an embodiment is shown. The method includes:
S101, acquiring search information input by a user;
According to embodiments of the present application, a user may input the information to be retrieved through a corresponding device such as a keyboard, keypad, switch, dial, mouse, trackball or voice-recognition device. Input is of course not limited to typing and speech. The input can take the form of sentences or phrases.
S102, analyzing the search information based on a natural language processing mode of a large model, analyzing sentence structures, word meaning relations and context information, and extracting keyword information;
According to the embodiment of the application, the large-model-based natural language processing can use models such as T5, GLM and GPT, which have strong natural language understanding capability. The large model can understand the natural language input by the user, extract the key information in it, and convert that information into a retrieval expression that the retrieval system can understand.
In this embodiment, the semantic understanding capability of the large model is used to convert the search information input by the user into keyword information automatically, and the corresponding search formula is then matched, so that the search runs automatically. The retrieved documents are returned to the user. The user can thus interact with the search system directly in natural language without constructing a search expression manually, which improves both the operability and the efficiency of the search process.
As an example, the large model of the present embodiment may use the Transformer-encoder, Transformer-decoder and Transformer structures. The Transformer-encoder is used primarily to encode input sequences, such as natural language text. It is composed of multiple identical encoder layers, each of which contains a self-attention mechanism and a feed-forward neural network.
The mathematical formula for the self-attention mechanism is as follows:
Attention(Q1, K1, V1) = softmax(Q1 × K1^T / sqrt(d_k1)) × V1;
where Q1, K1 and V1 represent the input matrices of queries, keys and values, respectively, and d_k1 represents the dimension of the attention mechanism.
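The self-attention formula above can be worked through on a tiny example. The sketch below is a plain-Python illustration, not the patent's implementation: the helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy matrices are assumptions, and d_k1 is taken to be the length of each key vector.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def attention(q, k, v):
    """Compute softmax(Q x K^T / sqrt(d_k)) x V; also return the weights."""
    d_k = len(k[0])  # assumed: d_k is the key vector length
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v), weights


# Toy input: 2 queries, 3 key/value pairs, d_k = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value rows, with more weight on the keys that align with the query.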
The mathematical formula of the feedforward neural network is as follows:
FFN(x1) = max(0, x1 × W_1 + b_1) × W_2 + b_2;
where x1 represents the input vector, and W_1, b_1, W_2 and b_2 represent parameters of the model.
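As a hedged illustration of the feed-forward formula, the following sketch applies it to a single input vector in plain Python. The function name `ffn` and the weight values are made up for the example; real models learn these parameters during training.

```python
def ffn(x, w1, b1, w2, b2):
    """Compute max(0, x W_1 + b_1) W_2 + b_2 for one input vector x."""
    # First linear layer followed by ReLU, i.e. max(0, .) element-wise.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    # Second linear layer projects back to the output dimension.
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]


# Illustrative parameters: input size 2, hidden size 3, output size 2.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
B2 = [0.5, -0.5]

y = ffn([1.0, 2.0], W1, B1, W2, B2)
```

Here the hidden pre-activations are 1, 3 and -1.5; the ReLU zeroes the negative one, so the third hidden unit contributes nothing to the output.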
The Transformer-decoder structure is mainly used to generate output sequences, for example in machine translation tasks. It is composed of multiple identical decoder layers, each containing a self-attention mechanism and an encoder-decoder attention mechanism. The mathematical formula of the self-attention mechanism is the same as in the Transformer-encoder.
The mathematical formula of the encoder-decoder attention mechanism of this embodiment is as follows:
Attention(Q2, K2, V2) = softmax(Q2 × K2^T / sqrt(d_k2)) × V2;
where Q2 represents a query vector of the decoder, K2 represents a key vector of the encoder, V2 represents a value vector of the encoder, and d_k2 represents the dimension of the attention mechanism, as d_k1 does above. The mathematical formula of the feed-forward neural network is the same as that in the Transformer-encoder.
The Transformer structure is a combination of a Transformer-encoder and a Transformer-decoder and is used for sequence-to-sequence tasks such as machine translation. It is formed by stacking a plurality of encoder layers and a plurality of decoder layers.
In the Transformer structure, the connection between the encoder layer and the decoder layer uses residual connection and layer normalization.
The mathematical formula for the residual connection is as follows:
LayerNorm(x3 + Sublayer(x3));
where x3 represents the input vector and Sublayer(x3) represents the output of the sub-layer.
The mathematical formula for layer normalization is as follows:
LayerNorm(x3) = (x3 - mean(x3)) / sqrt(var(x3) + Ep) × GA + Be;
where mean(x3) and var(x3) represent the mean and variance of x3, respectively, Ep is a small constant for numerical stability, and GA and Be are learnable parameters.
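The two formulas above can be combined in a short plain-Python sketch. The names `layer_norm` and `residual_block`, the value chosen for Ep (`eps=1e-5`), and the doubling function used as the sub-layer are assumptions for illustration; in a real model the sub-layer would be an attention or feed-forward block.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """(x - mean(x)) / sqrt(var(x) + eps) * gamma + beta, element-wise."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]


def residual_block(x, sublayer, gamma, beta):
    """LayerNorm(x + Sublayer(x)): add the input back, then normalize."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed, gamma, beta)


def double(v):
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return [2.0 * e for e in v]


x3 = [1.0, 2.0, 3.0, 4.0]
out = residual_block(x3, double, gamma=[1.0] * 4, beta=[0.0] * 4)
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```

With GA set to 1 and Be set to 0, the output has mean close to 0 and variance close to 1, which is the property layer normalization is meant to enforce before the learnable scale and shift are applied.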
Through this processing, the Transformer model can effectively capture the context information in the input sequence and generate a corresponding output sequence.
In one exemplary embodiment, training of a large model generally includes the steps of:
Data preparation: a large-scale training dataset is prepared, including input samples and corresponding target outputs.
Model initialization: the weights and bias parameters of the neural network are initialized.
Forward propagation: the input samples are propagated forward through the neural network to obtain the model's predicted output.
Loss calculation: the predicted output is compared with the target output and the value of the loss function is computed, measuring the difference between the prediction and the true result.
Back propagation: the gradient of the loss function with respect to the model parameters, i.e. its derivatives with respect to the weights and biases, is calculated by the back propagation algorithm.
Parameter updating: the parameters of the model are updated according to the gradient information using an optimization algorithm (such as gradient descent), so that the loss function gradually decreases.
Iteration: forward propagation, loss calculation, back propagation and parameter updating are repeated until a predetermined number of training rounds or a convergence condition is reached.
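The training steps listed above can be sketched on a deliberately tiny model. The example below fits a one-weight linear model with mean squared error and plain gradient descent; the model, the data, the learning rate and the epoch count are all assumptions chosen so the loop fits in a few lines, and they stand in for the much larger networks the text describes.

```python
def train(samples, lr=0.05, epochs=2000):
    """Fit pred = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0                       # model initialization
    n = len(samples)
    loss = 0.0
    for _ in range(epochs):               # iteration over training rounds
        grad_w = grad_b = 0.0
        loss = 0.0
        for x, target in samples:
            pred = w * x + b              # forward propagation
            err = pred - target
            loss += err * err / n         # loss calculation (MSE)
            grad_w += 2.0 * err * x / n   # gradient of the loss w.r.t. w
            grad_b += 2.0 * err / n       # gradient of the loss w.r.t. b
        w -= lr * grad_w                  # parameter updating
        b -= lr * grad_b
    return w, b, loss


# Data preparation: toy samples drawn from target = 2 * x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, final_loss = train(data)
```

For this two-parameter model the gradients are written out by hand; in a deep network the back propagation algorithm computes the same kind of derivatives layer by layer.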
The training process of large models usually requires the use of large-scale computing resources and training time, and may be accelerated using techniques such as distributed training, parallel computing, etc.
In this embodiment, the large model learns statistical rules and semantic information of the language by exposing it to a large amount of natural language text data during training. It learns rich linguistic knowledge by modeling the relationships between words, sentences and contexts in text data. When a user inputs natural language, the large model can understand and process the input according to the learned knowledge, extract key information in the input and convert the key information into a retrieval expression which can be understood by a retrieval system.
According to embodiments of the present application, the method uses natural language processing techniques to semantically analyze and understand the user's natural language input. By analyzing the sentence structure, word-meaning relations and context information of the input, the system can accurately understand the user's retrieval intention and extract the key information. Illustratively, the user may enter "find new drugs for treating cancer", and the system will understand that the user needs to find new drugs associated with treating cancer. The input is first split; the split can be based on verbs and nouns, and can also separate out fixed expressions, technical terms and the like. Keywords are then identified and combined into a search formula. For example, "find new drugs for treating cancer" can be split into "treating cancer" and "drug", which are then matched against the corresponding documents. "Treating cancer" and "drug" are combined according to the logical combination mode the user selects: if the user selects AND logic, "treating cancer" AND "drug" form the search combination that meets the retrieval requirement.
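As a rough sketch of the combination step in the example above, the function below joins already-extracted keywords with the operator the user selected. The function name `build_expression`, the quoted-phrase syntax of the resulting string, and the handling of NOT (first keyword kept, the rest excluded) are assumptions; the source does not define the concrete expression format.

```python
def build_expression(keywords, operator="AND"):
    """Join extracted keywords into a flat retrieval expression string."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    if not keywords:
        raise ValueError("at least one keyword is required")
    quoted = [f'"{k}"' for k in keywords]
    if op == "NOT":
        # Assumed semantics: keep the first keyword, exclude the others.
        return quoted[0] + "".join(f" NOT {k}" for k in quoted[1:])
    return f" {op} ".join(quoted)


# Keywords taken from the example in the text; the user selects AND logic.
expr = build_expression(["treating cancer", "drug"], "AND")
```

Quoting each keyword keeps multi-word phrases such as "treating cancer" intact when the expression is later parsed.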
The overall process steps for the natural language processing approach of the present invention generally include the following stages:
Word segmentation: the input natural language text is segmented into words.
Part-of-speech tagging: the part of speech of each word (e.g., noun, verb) is determined.
Syntax analysis: the structure of the sentence is analyzed and the dependency relationships between the words are determined.
Semantic analysis: the semantics of the sentence are analyzed to determine what the sentence means and what it expresses.
Semantic understanding: key information and semantic roles are extracted from the sentence, and its meaning and intention are understood.
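The first two stages and the final keyword extraction can be imitated with a toy sketch. It is a simplified stand-in rather than the large-model processing the text describes: the hand-written `LEXICON`, the tag names, and the choice to keep only nouns as keywords are all assumptions, and the syntax and semantic analysis stages are not modeled.

```python
import re

# Hypothetical mini-lexicon mapping words to part-of-speech tags.
LEXICON = {
    "find": "VERB", "new": "ADJ", "drugs": "NOUN",
    "for": "ADP", "treating": "VERB", "cancer": "NOUN",
}
KEYWORD_TAGS = {"NOUN"}  # assumed: only nouns are kept as keywords


def segment(text):
    """Word segmentation: lower-case the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def tag(words):
    """Part-of-speech tagging by lexicon lookup; unknown words get 'X'."""
    return [(w, LEXICON.get(w, "X")) for w in words]


def extract_keywords(tagged):
    """Keep the words whose tag marks them as keyword candidates."""
    return [w for w, pos in tagged if pos in KEYWORD_TAGS]


tagged = tag(segment("Find new drugs for treating cancer"))
keywords = extract_keywords(tagged)
```

A lexicon lookup cannot resolve ambiguity or handle unseen words, which is exactly where a trained tagger or a large model would replace this stand-in.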
In the process of semantic analysis and understanding, commonly used mathematical models include:
Word embedding models (e.g., Word2Vec, GloVe): words are mapped to a continuous vector space, capturing the semantic relationships between them.
Recurrent neural network (Recurrent Neural Network, RNN): used for processing sequence data such as sentences or text.
Attention mechanism (Attention Mechanism): used to weight different parts of the input when processing long text.
Transformer model (Transformer): a neural network model based on a self-attention mechanism, used for processing sequence data.
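The self-attention and feed-forward formulas recited in the claims of this document can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: matrices are lists of rows, there is a single attention head, and no learned projections, residual connections or layer normalization are included. The helper names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)]
              for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]


# Two toy token vectors used as queries, keys and values at once.
q = k = v = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(q, k, v)

identity = [[1.0, 0.0], [0.0, 1.0]]
ffn_out = ffn([[1.0, -2.0]], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

Each row of `weights` sums to 1, and each token attends most strongly to itself here because its query matches its own key.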
In this embodiment, after the search information input by the user is analyzed, a sentence structure and a word meaning relationship may be formed. Natural language processing (Natural Language Processing, NLP) techniques and corresponding mathematical models may be generally employed herein.
The following is one way for the large model to accomplish this:
Lexical analysis (Lexical Analysis): the large model first performs lexical analysis on the sentence input by the user and segments it into words. This may be accomplished using a lexical analyzer or a pre-trained lexical analysis model.
Syntactic analysis (Syntactic Analysis): next, the large model analyzes the structure of the sentence and determines the dependency relationships between words, such as subject-predicate and verb-object relationships. Syntactic analysis may be implemented using a dependency parser, a phrase structure parser, or a pre-trained syntactic analysis model.
Semantic analysis (Semantic Analysis): in the semantic analysis stage, the large model determines the meaning of the sentence and of the expressions within it. This includes tasks such as word sense disambiguation, coreference resolution and semantic role labeling. Semantic analysis may be implemented using word sense disambiguation models, coreference resolution models, semantic role labeling models, or pre-trained semantic analysis models.
Context modeling (Context Modeling): to better understand the meaning of a sentence, the large model considers the context information in the sentence, including both the preceding and the following text. Context modeling may represent sentences using context-aware models such as recurrent neural networks (RNNs) or Transformer models.
Through the steps, the large model can analyze sentences input by the user, and extract the structure, word meaning relation and context information of the sentences, so that the intention and the requirement of the user are better understood. In this way, the large model may further translate user input into a search expression that the search system can understand and process, and provide more accurate search results. It should be noted that the specific implementation and the model used may vary depending on the application scenario and the specific requirements.
S103, acquiring a logical operator set by a user, and combining the extracted keyword information based on the logical operator set by the user to configure a retrieval expression;
The search expression of this embodiment is formed by combining the extracted keyword information.
The keyword information is combined according to the logical operator set by the user: the system obtains the user's operator setting and joins the extracted keywords with that operator to form the search expression.
Optionally, the logical operators include: AND logic, OR logic, and NOT logic.
For example, when searching for cancer treatment medication, the user may select a logical relationship while entering the search information.
The system may default the logical relationship to AND logic, namely "treatment" AND "cancer" AND "medicament", although it may also be set to OR logic.
Thus, the search expression is formed by converting the natural language input by the user into a structured search expression after the natural language processing so that the search system can understand and process the search expression.
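Step S103 can be sketched as follows, assuming the keywords have already been extracted. The function name `build_expression`, the quoting of each keyword and the textual form of the expression are illustrative assumptions; the source does not specify a concrete expression syntax.

```python
def build_expression(keywords, operator="AND"):
    """Join keywords with one logical operator; AND is the default."""
    operator = operator.upper()
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator}")
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Quote each keyword so multi-word terms stay together.
    quoted = [f'"{kw}"' for kw in keywords]
    return f" {operator} ".join(quoted)


default_expr = build_expression(["treatment", "cancer", "medicament"])
or_expr = build_expression(["treat cancer", "drug"], operator="or")
print(default_expr)
print(or_expr)
```

Defaulting to AND mirrors the default described above, and the caller passes "OR" or "NOT" when the user selects a different logical relationship.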
S104, matching the medical scientific literature in the medical scientific literature database based on the retrieval expression;
It will be appreciated that a large number of medical scientific research documents are collected and pre-stored in the medical scientific research document database; the database is of course not limited to medical scientific research documents and may also contain other documents. To make it easier for the system to search the documents, each medical scientific research document is configured with a keyword tag. If a document has many chapters, several keyword tags can be configured for it, and the document can be matched simply by matching the search keywords against these tags.
The keyword tags can be set based on the title of the medical scientific research document, its field, its abstract, the content of its core chapters and the like, and can be assigned automatically by the system or set manually.
In this way, matching is performed with keyword tags in the medical scientific literature database based on the search expression; and displaying the medical scientific literature corresponding to the matched keyword label on a user interface.
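The tag-matching step can be sketched as follows. The in-memory `DATABASE`, the document titles and the reading of NOT as "the first keyword must be present and the remaining keywords absent" are assumptions made for illustration; the source does not define how NOT is applied to a keyword list.

```python
# Hypothetical in-memory database; each document carries keyword tags.
DATABASE = [
    {"title": "Document A", "tags": ["cancer", "drug", "treatment"]},
    {"title": "Document B", "tags": ["cancer", "imaging"]},
    {"title": "Document C", "tags": ["diabetes", "drug"]},
]


def matches(tags, keywords, operator="AND"):
    """Check one document's tags against the keywords and operator."""
    if not keywords:
        return False
    tag_set = {t.lower() for t in tags}
    hits = [kw.lower() in tag_set for kw in keywords]
    if operator == "AND":
        return all(hits)
    if operator == "OR":
        return any(hits)
    if operator == "NOT":
        # Assumed reading: first keyword required, the rest excluded.
        return hits[0] and not any(hits[1:])
    raise ValueError(f"unsupported operator: {operator}")


def search(database, keywords, operator="AND"):
    """Return the titles of all documents whose tags match."""
    return [doc["title"] for doc in database
            if matches(doc["tags"], keywords, operator)]


and_hits = search(DATABASE, ["cancer", "drug"], "AND")
or_hits = search(DATABASE, ["cancer", "drug"], "OR")
not_hits = search(DATABASE, ["cancer", "drug"], "NOT")
```

Matching against tags rather than full text keeps the lookup simple, at the cost of depending on how well the tags describe each document.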
And S105, displaying the matched medical scientific research literature on a user interface.
According to the embodiment of the application, the user-friendly interface is provided, and the user can input the search requirement in a natural language expression mode. The user interface may also provide real-time feedback to help the user better understand and adjust the retrieval needs.
Here, the search results may be continuously optimized according to the user's feedback, and one of the following manners may be adopted:
Feedback loop: after the user obtains the search results, the system may ask the user to provide feedback, for example by scoring or by marking results as liked or disliked. Based on the user feedback, the system may adjust the search algorithm or reorder the results to better meet the user's needs.
Reinforcement learning: the system may use a reinforcement learning algorithm to learn how to optimize the search results through interactions with the user. The system adjusts the parameters of the search algorithm according to the feedback (reward signal) of the user so that the future search result meets the user requirement.
User model: the system can build a user model according to the historical behaviors and preferences of the user, predict the preferences and demands of the user by using the model, and optimize the retrieval result according to the prediction result.
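The feedback loop can be illustrated with a very small re-ranking sketch. The scores, the document identifiers and the fixed adjustment `step` are hypothetical; the source does not specify how feedback changes the ranking, and a reinforcement learning or user-model approach would be considerably more involved.

```python
def rerank(results, feedback, step=0.1):
    """Re-order results after applying like (+1) / dislike (-1) feedback.

    results:  list of (doc_id, relevance_score) pairs
    feedback: dict mapping doc_id to +1 (like) or -1 (dislike)
    """
    adjusted = [(doc_id, score + step * feedback.get(doc_id, 0))
                for doc_id, score in results]
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


results = [("doc-1", 0.80), ("doc-2", 0.75), ("doc-3", 0.60)]
feedback = {"doc-1": -1, "doc-2": +1}
reranked = rerank(results, feedback)
order = [doc_id for doc_id, _ in reranked]
print(order)
```

Documents without feedback keep their original score, so only explicitly rated results move in the ranking.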
It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention.
According to the retrieval method based on medical scientific research information, a user inputs retrieval requirements in a natural language expression mode, and the interface transmits the user input to the large model for natural language understanding. The large model converts the user input into a retrieval expression and passes it to the retrieval system. The retrieval system retrieves relevant medical scientific research documents and data from the database according to the retrieval expression, and returns the result to the user interface for display to the user. Therefore, the user does not need to edit complex search expressions, and searches in a natural language expression mode, so that the learning cost and editing time of the user are reduced. Moreover, the large model related by the invention has natural language understanding capability, so that the intention of a user can be more accurately understood, and the relevance of a search result is improved. The user-friendly interface provides a better user experience that enables non-professionals to easily retrieve as well.
The following is an embodiment of a medical scientific information-based search system provided by the embodiment of the present disclosure, which belongs to the same inventive concept as the medical scientific information-based search method of the above embodiments, and details which are not described in detail in the medical scientific information-based search system embodiment may refer to the above medical scientific information-based search method embodiment.
As shown in fig. 2, the system includes an information input module, an information analysis module, an expression configuration module, a document matching module and a document display module;
the information input module provides the user with an input device for entering information, and acquires the search information input by the user via the input device;
the information analysis module analyzes the search information based on the natural language processing mode of the large model, analyzes sentence structure, word meaning relation and context information, and extracts keyword information;
the expression configuration module is used for configuring the extracted keyword information into a retrieval expression;
the document matching module is used for matching the medical scientific literature in the medical scientific literature database based on the retrieval expression;
the document display module serves as the display module for system operation information; it displays retrieval process information and displays the matched medical scientific research documents on a user interface.
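The way the five modules hand data to one another can be sketched as a simple pipeline. The class `RetrievalSystem`, its field names and the stub callables in the usage example are hypothetical; they only show the order of calls, not an actual implementation of any module.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalSystem:
    read_input: Callable[[], str]            # information input module
    analyze: Callable[[str], List[str]]      # information analysis module
    configure: Callable[[List[str]], str]    # expression configuration module
    match: Callable[[str], List[str]]        # document matching module
    display: Callable[[List[str]], None]     # document display module

    def run(self) -> List[str]:
        """Pass the user's query through the five modules in order."""
        query = self.read_input()
        keywords = self.analyze(query)
        expression = self.configure(keywords)
        documents = self.match(expression)
        self.display(documents)
        return documents


# Stub modules standing in for the real components.
shown: List[str] = []
system = RetrievalSystem(
    read_input=lambda: "cancer drug",
    analyze=lambda text: text.split(),
    configure=lambda kws: " AND ".join(kws),
    match=lambda expr: ["Document A"] if expr == "cancer AND drug" else [],
    display=shown.extend,
)
result = system.run()
```

Passing each module in as a callable keeps the modules interchangeable, which fits the statement that the modules may be realized in hardware, software or a combination of both.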
The units and algorithm steps of each example of the medical scientific research information-based retrieval system, described in connection with the embodiments disclosed herein, can be implemented in electronic hardware, computer software or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
As will be readily understood by those skilled in the art from the description of the above embodiments, the retrieval system based on medical scientific research information described herein may be implemented by software or by a combination of software and the necessary hardware. Accordingly, the technical solution according to the disclosed embodiments of the retrieval method based on medical scientific research information may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the retrieval method according to the disclosed embodiments.
In embodiments of the present invention, computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. The retrieval method based on the medical scientific research information is characterized by comprising the following steps:
step one, acquiring search information input by a user;
step two, analyzing the search information based on a natural language processing mode of the large model, analyzing sentence structures, word meaning relations and context information, and extracting keyword information;
wherein the large model includes: a plurality of identical encoder layers, each encoder layer comprising a self-attention mechanism and a feed-forward neural network;
the mathematical formula for the self-attention mechanism is as follows:
$\mathrm{Attention}(Q_1, K_1, V_1) = \mathrm{softmax}\left(\frac{Q_1 K_1^{T}}{\sqrt{d_{k1}}}\right) V_1$;
wherein $Q_1$, $K_1$ and $V_1$ represent the input matrices of queries, keys and values, respectively, and $d_{k1}$ represents the dimension of the self-attention mechanism;
the mathematical formula of the feedforward neural network is as follows:
$\mathrm{FFN}(x_1) = \max(0,\ x_1 W_1 + b_1)\, W_2 + b_2$;
where $x_1$ represents the input vector, and $W_1$, $b_1$, $W_2$ and $b_2$ represent parameters of the model;
the large model further includes: a plurality of identical decoder layers, each decoder layer comprising a self-attention mechanism and an encoder-decoder self-attention mechanism;
in large model structures, the connection between the encoder layer and the decoder layer uses residual connection and layer normalization;
step three, acquiring a logical operator set by a user, and combining the extracted keyword information based on the logical operator set by the user to configure a retrieval expression;
step four, matching with medical scientific research literature in a medical scientific research literature database based on the retrieval expression;
step five, displaying the matched medical scientific research literature on a user interface.
2. The medical scientific research information-based retrieval method according to claim 1, wherein the analyzing the retrieval information based on the natural language processing mode of the large model in the second step further comprises:
segmenting into words based on the input natural language text;
the part of speech of each word is determined, and keyword information is extracted.
3. The medical scientific research information-based retrieval method according to claim 1, wherein the analyzing the retrieval information based on the natural language processing mode of the large model in the second step further comprises:
and analyzing sentence structures in the search information, determining the dependency relationship among the words, and extracting keyword information.
4. The medical scientific research information-based retrieval method according to claim 3, wherein the parsing of the retrieval information based on the natural language processing method of the large model in the second step further comprises: and performing lexical analysis on sentences in the search information input by the user, and segmenting the sentences into words.
5. The method for retrieving information based on medical science research of claim 4, wherein lexical analysis of sentences comprises: the dependency relationship between words in the sentence is determined and implemented using a dependency syntax analysis manner or a phrase structure syntax analysis manner.
6. The medical research information-based retrieval method of claim 1, wherein the logical operators include: AND logic, OR logic, and NOT logic.
7. The medical research information-based retrieval method of claim 1, wherein step four further comprises: a plurality of medical scientific literature is stored in a medical scientific literature database;
each medical scientific literature is configured with a keyword tag;
matching the keyword labels in the medical scientific literature database based on the retrieval expression;
and displaying the medical scientific literature corresponding to the matched keyword label on a user interface.
8. A retrieval system based on medical scientific research information, characterized in that the system implements the retrieval method based on medical scientific research information according to any one of claims 1 to 7;
the system comprises: an information input module, an information analysis module, an expression configuration module, a document matching module and a document display module;
the information input module is used for acquiring search information input by a user;
the information analysis module is used for analyzing the search information by combining the natural language processing mode of the large model, analyzing sentence structure, word meaning relation and context information, and extracting keyword information;
wherein the large model includes: a plurality of identical encoder layers, each encoder layer comprising a self-attention mechanism and a feed-forward neural network;
the mathematical formula for the self-attention mechanism is as follows:
$\mathrm{Attention}(Q_1, K_1, V_1) = \mathrm{softmax}\left(\frac{Q_1 K_1^{T}}{\sqrt{d_{k1}}}\right) V_1$;
wherein $Q_1$, $K_1$ and $V_1$ represent the input matrices of queries, keys and values, respectively, and $d_{k1}$ represents the dimension of the self-attention mechanism;
the mathematical formula of the feedforward neural network is as follows:
$\mathrm{FFN}(x_1) = \max(0,\ x_1 W_1 + b_1)\, W_2 + b_2$;
where $x_1$ represents the input vector, and $W_1$, $b_1$, $W_2$ and $b_2$ represent parameters of the model;
the large model further includes: a plurality of identical decoder layers, each decoder layer comprising a self-attention mechanism and an encoder-decoder self-attention mechanism;
in large model structures, the connection between the encoder layer and the decoder layer uses residual connection and layer normalization;
the expression configuration module is used for acquiring the logical operators set by the user, combining the extracted keyword information based on the logical operators set by the user and configuring the extracted keyword information into a retrieval expression;
the document matching module is used for matching with medical scientific research documents in the medical scientific research document database according to the retrieval expression;
and the document display module is used for displaying the information of the retrieval process and displaying the matched medical scientific research documents.
9. A search terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the search method based on medical scientific information according to any one of claims 1 to 7 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311336929.7A CN117093729B (en) | 2023-10-17 | 2023-10-17 | Retrieval method, system and retrieval terminal based on medical scientific research information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117093729A (en) | 2023-11-21 |
CN117093729B (en) | 2024-01-09 |
Family
ID=88783580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311336929.7A Active CN117093729B (en) | 2023-10-17 | 2023-10-17 | Retrieval method, system and retrieval terminal based on medical scientific research information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117093729B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |