CN118013017A - Intelligent text automatic generation method based on AI large language model - Google Patents
- Publication number
- CN118013017A (application CN202410281005.XA)
- Authority
- CN
- China
- Prior art keywords
- entity
- input
- screening
- named entity
- input named
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/3329—Natural language query formulation (information retrieval; querying of unstructured textual data)
- G06F40/279—Recognition of textual entities (natural language analysis)
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- G06N5/04—Inference or reasoning models (computing arrangements using knowledge-based models)
Abstract
The application relates to the technical field of natural language processing and provides an intelligent text automatic generation method based on an AI large language model, which comprises the following steps: converting collected text information data into requirement word segmentation sequences; determining the requirement descriptor and the screening entities of each input named entity based on the named entity recognition results of the words in the requirement word segmentation sequences and the dependency relationships between the words; determining an entity dissimilarity index based on the depths of words in the dependency syntax trees corresponding to all requirement word segmentation sequences containing the input named entity; determining entity guiding weights based on the degree of information similarity between each named entity and the entities adjacent to its screening entities; determining entity guiding extension correction weights based on the information extension characteristics between each input named entity and each of its screening entities; and automatically generating reply text based on the entity guiding extension correction weights by using a generation model. The application avoids the influence of phenomena such as polysemy on named entity matching and improves the efficiency and quality of text generation.
Description
Technical Field
The application relates to the technical field of natural language processing, and in particular to an intelligent text automatic generation method based on an AI large language model.
Background
Intelligent text automatic generation is the process of automatically generating text content by using natural language processing and large language models in artificial intelligence technology. It includes, but is not limited to, automatically generating various types of text content such as articles, news, and blogs, and automatically generating documents such as customer-service replies, e-mails, and reports. Its characteristic is that a large-scale language model learns from a large amount of text data and understands the grammar, semantics, and context of the language, thereby generating text that is logical, consistent, and natural.
Traditional methods for automatic text generation, such as those based on a knowledge graph, can provide semantic information such as entity attributes and the relationships between entities, helping the generation model better understand the input text and generate richer, more accurate, and more reasonable text content. A knowledge graph also provides rich context information that can supplement the context of text generation, making the generated text richer and more coherent. However, a knowledge graph contains a large number of nodes, and complex semantic analysis and reasoning must be performed before the generation model can use it effectively, which increases model complexity and computation cost and reduces the efficiency of automatic text generation.
Disclosure of Invention
The application provides an intelligent text automatic generation method based on an AI large language model, aiming to solve the problem that text generation is inefficient because the generation model must perform complex semantic analysis and reasoning over knowledge-graph nodes before it can use them. The application adopts the following technical scheme:
the application relates to an intelligent text automatic generation method based on an AI large language model, which comprises the following steps:
Converting the collected text information data into a required word segmentation sequence by using a data cleaning technology and a word segmentation tool;
determining a requirement descriptor of each input named entity and a screening entity of each input named entity based on a named entity recognition result of the words in the requirement word segmentation sequence and the dependency relationship among the words;
determining an entity dissimilarity index between each input named entity and each description word based on the depth of the words in the dependency syntax tree corresponding to all the required word segmentation sequences comprising each input named entity;
determining the entity guiding weight between each input named entity and each screening entity based on the entity dissimilarity index and the degree of information similarity between each named entity and the entities adjacent to its screening entities;
Determining the entity guiding extension correction weight between each input named entity and each screening entity based on the information extension features between each input named entity and each screening entity and the entity guiding weight between each input named entity and each screening entity; and automatically generating, by using a generation model, the reply text corresponding to each piece of text information data input by a user based on the entity guiding extension correction weights.
Preferably, the method for determining the requirement descriptor of each input named entity and the screening entity of each input named entity based on the recognition result of the named entity of the words in the requirement word segmentation sequence and the dependency relationship between the words comprises the following steps:
taking all the required word segmentation sequences as input, and acquiring the word vector of each word in each required word segmentation sequence by adopting a Word2Vec model;
The method comprises the steps of taking word vectors of all words as input of a named entity recognition model, obtaining named entity recognition results of all words by using the named entity recognition model, taking the named entity recognition results of all words in each required word segmentation sequence as an input named entity, taking words corresponding to each node directly connected with the node where each input named entity is located in a dependency syntax tree of each required word segmentation sequence as a description word of each input named entity, and taking a sequence consisting of all description words of each input named entity as a required descriptor of each input named entity;
And calculating the cosine similarity between the word vector of each input named entity and the word vector of the entity corresponding to each entity node in the knowledge graph, and taking each entity in the knowledge graph whose word vector has a cosine similarity with the word vector of the input named entity larger than a preset threshold as a screening entity of that input named entity.
Preferably, the method for determining the entity dissimilarity index between each input named entity and each description word based on the depth of the word in the dependency syntax tree corresponding to each required word segmentation sequence of each input named entity comprises the following steps:
The method comprises the steps that a word corresponding to each node, which is directly connected with a node where each input named entity is located, in a dependency syntax tree of each requirement word segmentation sequence is used as a description word of each input named entity, and a sequence formed by all description words of each input named entity is used as a requirement descriptor of each input named entity;
Taking the average value of the depth of a node where each input named entity is located and the depth of a node where the description word is located in a dependency syntax tree containing each input named entity and one description word of each input named entity as an average depth value;
Taking the minimum value of the number of nodes from the node of each input named entity to the node of the description word in a dependency syntax tree containing each input named entity and one description word of each input named entity as the dependency path distance between each input named entity and one description word of each input named entity;
and taking the accumulated result of the product of the average depth value and the dependency path distance on all dependency syntax trees containing each input named entity and one description word of each input named entity as an entity dissimilarity index between each input named entity and each description word.
Preferably, the method for determining the entity guiding weight between each input named entity and each screening entity based on the entity dissimilarity index and the degree of information similarity between each named entity and the entities adjacent to its screening entities is as follows:
taking the cosine similarity between the word vector of each description word of each input named entity and the word vector of each piece of attribute information of any screening entity of that input named entity as a numerator;
taking the sum of the normalized Google distance, over all requirement sequences, between each description word of each input named entity and each piece of attribute information of any screening entity of that input named entity, the entity dissimilarity index between the input named entity and the description word, and 0.01 as a denominator;
And taking the average value of the accumulated results of the ratio of the numerator to the denominator over all the description words of all the screening entities of each input named entity as the entity guiding weight between each input named entity and each screening entity.
Preferably, the method for determining the entity guiding extension correction weight between each input named entity and each screening entity based on the information extension features between each input named entity and each screening entity and the entity guiding weight between each input named entity and each screening entity comprises the following steps:
acquiring the generated information supply and demand degree of the screening entity in the knowledge graph based on the information node set determined by the screening entity of each input named entity and the adjacent entity node of the screening entity in the knowledge graph;
determining entity guide correction coefficients between each input named entity and each screening entity based on the generated information supply and demand degree of the screening entity of each input named entity and entity guide weights between each input named entity and different screening entities;
taking the product of the entity guiding weight and the entity guiding correction coefficient between each input named entity and each screening entity as a numerator;
And taking the ratio of the numerator to the generated-information supply-demand degree of each screening entity of each input named entity as the entity guiding extension correction weight between each input named entity and each screening entity.
Preferably, the method for acquiring the information node set comprises the following steps:
The method comprises the steps that a set formed by all entity nodes which are directly connected with entity nodes of each screening entity of each input named entity in a knowledge graph is recorded as a node intersection set of the screening entities;
the intersection between the set of all screening entities of each input named entity and the node intersection set of each screening entity of each input named entity is taken as the information node set of each input named entity.
Preferably, the method for obtaining the information supply and demand degree of the screening entity in the knowledge graph based on the information node set determined by the screening entity of each input named entity and the adjacent entity node of the screening entity in the knowledge graph includes:
taking a set formed by all the attributes of each entity node in the knowledge graph as an attribute information set of each entity node;
Taking the accumulated cosine similarity between the word vectors of each pair of input named entities corresponding to each screening entity, over all input named entities corresponding to that screening entity, as a first characteristic value, and taking the product of the sum of the Jaccard coefficients between the attribute information sets of all entity nodes in the node intersection set of each screening entity and the first characteristic value, plus 0.01, as a denominator;
and taking the variation coefficient of the entity guiding weight between all the input named entities and the same screening entity as a numerator and taking the ratio of the numerator to the denominator as the generated information supply and demand degree of each screening entity.
Preferably, the method for determining the entity guiding correction coefficient between each input named entity and each filtering entity based on the generated information supply and demand degree of the filtering entity of each input named entity and the entity guiding weight between each input named entity and different filtering entities comprises the following steps:
taking the ratio of the number of nodes in the information node set of each input named entity to the number of screening entities of each input named entity as the guiding proportion of each input named entity;
taking each screening entity of each input named entity as a reference entity, and taking entity guiding weights between each input named entity and each reference entity as reference values;
Taking the sum of the absolute differences between the reference value and the entity guiding weights corresponding to any two non-reference entity nodes in the information node set of each input named entity as a numerator;
taking the absolute difference between the entity guiding weights corresponding to those two non-reference entity nodes in the information node set of each input named entity, plus 0.01, as a denominator;
and taking the product of the accumulated result of the ratio of the numerator and the denominator on the nodes of the non-reference entity in the information node set of each input named entity and the guiding proportion of each input named entity as an entity guiding correction coefficient between each input named entity and each screening entity.
Preferably, the method for automatically generating the reply text corresponding to each text information data input by the user based on the entity guiding extension correction weight by using the generation model comprises the following steps:
Obtaining a matching entity of each input named entity based on entity guiding extension correction weight between each input named entity and all screening entities by using a GAT model;
And taking each required word segmentation sequence and attribute information of all the matching entities of the input named entities in each required word segmentation sequence as input of a generation model, and generating a reply text corresponding to each required word segmentation sequence by using the generation model.
Preferably, the method for obtaining the matching entity of each input named entity based on the entity guiding extension correction weights between each input named entity and all its screening entities by using the GAT model includes:
Taking the sum of the entity guiding extension correction weights between each input named entity and all its screening entities as a denominator, and taking the ratio of the entity guiding extension correction weight between each input named entity and each screening entity to that denominator as the graph structure similarity score between each input named entity and each screening entity;
Taking the graph structure similarity score between each input named entity and each screening entity as one element of a matrix, and taking the matrix formed by the graph structure similarity scores between each input named entity and all its screening entities as the structure adjacency matrix of each input named entity;
And taking all the required word segmentation sequences, the structural adjacency matrix of each input named entity, word vectors and attribute information of screening entities of all named entities as inputs, and obtaining a matching entity of each input named entity by using a GAT model.
The beneficial effects of the application are as follows: the method determines the entity guiding weight between each input named entity and each screening entity by analyzing the degree of semantic influence of the context information between the named entity recognition results of the requirement word segmentation sequences and the entity nodes in the knowledge graph. The entity guiding weight considers the degree of similarity between the information implied by each input named entity in the user's input requirement and the attribute information of the entity nodes in the knowledge graph, and can accurately reflect the guiding condition of different entity nodes in the knowledge graph during entity matching. Secondly, the method evaluates the generated-information supply-demand degree of each screening entity based on the semantic extension characteristics between entity nodes and their adjacent nodes in the knowledge graph, and determines the entity guiding extension correction weight between each input named entity and each screening entity based on the richness of the guiding information between the same screening entity and different input named entities. The entity guiding extension correction weight considers the transitivity of attribute information between connected entity nodes in the knowledge graph, avoids the misleading influence of complex semantics such as polysemy in Chinese words on entity matching, improves the accuracy of the entity matching results, and thereby improves the quality of the reply text generated for each user requirement.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the application, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a schematic flow chart of an automatic intelligent text generation method based on an AI large language model according to one embodiment of the application;
fig. 2 is a flowchart of an implementation of an intelligent text automatic generation method based on an AI large language model according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, a flowchart of an intelligent text automatic generation method based on an AI large language model according to an embodiment of the application is shown. The method includes the following steps:
Step S001, collecting input text information data and preprocessing the obtained text information data.
This embodiment uses the automatic generation of movie evaluation text for the subsequent analysis. The aim is to make the matched entities better meet the user's requirements by analyzing the degree of requirement matching between the entities in the text information data input by the user and the entities in the knowledge graph, improving the accuracy of entity matching in the knowledge graph and thereby the efficiency of automatic movie-evaluation-text generation.
Specifically, the Sogou knowledge graph covers entities at the billion level and relations between entities at the billion level, including people, film and television drama, and other domains, so the Sogou knowledge graph is used as the knowledge graph required for intelligent text automatic generation.
Further, the text information data input by users is collected through the user-interaction module of the large language model, so that the key information in the user's input text can be identified conveniently in subsequent steps, improving the efficiency and effect of automatic generation. The collected text information data is then preprocessed, specifically as follows. First, the collected text information data is cleaned: useless information such as special characters, punctuation marks, HTML tags, and URL links is removed so that the main content of the text is retained. Second, each cleaned piece of text information data is segmented and stop words are removed: each cleaned piece of text is decomposed into words by the jieba word segmentation tool, which makes it convenient for the large language model to understand and process word-level information; then the stop words in the segmentation result are removed using a Chinese stop-word list, completing the preprocessing flow and improving the computational efficiency of the text generation model. The sequence formed by all remaining words of each piece of collected text information data after this preprocessing flow is taken as a requirement word segmentation sequence.
So far, the required word segmentation sequence is obtained and used for matching with the entity in the knowledge graph in the follow-up process.
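For illustration only, a minimal Python sketch of this preprocessing flow might look as follows; the stop-word file name and the sample input are placeholders rather than details fixed by the patent.

```python
import re
import jieba  # Chinese word segmentation tool named in the embodiment

def build_requirement_sequence(text, stopwords):
    """Clean one piece of user text and return its requirement word segmentation sequence."""
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)         # strip URL links
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)  # strip special chars / punctuation
    # Segment with jieba, then drop whitespace tokens and stop words.
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

# "cn_stopwords.txt" is a placeholder name for any Chinese stop-word list.
with open("cn_stopwords.txt", encoding="utf-8") as f:
    stopwords = set(f.read().split())

print(build_requirement_sequence("电影《蔡文姬》好看吗?", stopwords))
```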
Step S002, determining an entity dissimilarity index between the input named entity and each description word based on the depths of the words in the dependency syntax trees of the requirement word segmentation sequences; and determining the entity guiding weight between each input named entity and its screening entities based on the entity dissimilarity index and the degree of information similarity between each named entity and the entities adjacent to its screening entities.
Because the number of entity nodes in the knowledge graph is very large, if the same matching or embedding comparison operation were performed on every entity node in the knowledge graph after a user inputs a requirement, the generation time of the reply text would be too long. Moreover, the more entity nodes are involved, the more their attributes and semantics influence the semantics of the reply text, which may cause the automatically generated reply text to deviate from the standard reply for the user's input requirement. Therefore, the application first determines the matching guidance between the named entity recognition results of the words in the requirement word segmentation sequence and the entity nodes in the knowledge graph, and then uses the entity matching result for each requirement to determine the input information of the generation model. The implementation flow of the scheme is shown in figure 2.
Specifically, all the requirement word segmentation sequences are taken as input, and a Word2Vec model is adopted to acquire the word vector of each word in each requirement word segmentation sequence. Next, the word vectors of all words are used as the input of a named entity recognition model, whose main structure is a bidirectional long short-term memory network with a conditional random field, and the named entity recognition result of each word in each requirement word segmentation sequence is taken as an input named entity; the Word2Vec model and the named entity recognition model are known technologies, and their specific processes are not repeated. Then, taking the a-th input named entity as an example, the cosine similarity between the word vector of the a-th input named entity and the word vector of the entity corresponding to each entity node in the knowledge graph is calculated, and each entity in the knowledge graph whose word vector has a cosine similarity with the a-th input named entity larger than a threshold is taken as a screening entity of the a-th input named entity; the threshold is empirically set to 0.9. All attribute information of each screening entity is acquired from the knowledge base of the knowledge graph, and the set formed by all attributes of each entity node in the knowledge graph is taken as the attribute information set of that entity node. The attribute information consists of the words of all attributes of each screening entity in the knowledge base. It should be noted that an implementer may set a suitable threshold according to the specific situation of the user input requirements.
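A minimal sketch of the screening step, assuming `kg_vectors` is a dict mapping knowledge-graph entity names to their word vectors (the patent does not fix a storage format):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def screen_entities(entity_vec, kg_vectors, threshold=0.9):
    """Return the screening entities of one input named entity: all
    knowledge-graph entities whose word-vector cosine similarity with the
    input named entity exceeds the threshold (empirically 0.9)."""
    return [name for name, vec in kg_vectors.items()
            if cosine(entity_vec, vec) > threshold]
```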
Further, each requirement word segmentation sequence is taken as input, and dependency syntax parsing is used to obtain the dependency syntax tree of each requirement word segmentation sequence, outputting the dependency relation and part of speech between each input named entity and the remaining words in the sequence. The dependency relations include the subject-predicate relation, the verb-object relation, and the like; the parts of speech include nouns, verbs, adjectives, and the like. Dependency syntax parsing is a known technology, and its specific process is not repeated. The word corresponding to each node directly connected with the node where each input named entity is located in the dependency syntax tree of each requirement word segmentation sequence is taken as a description word of that input named entity, and the sequence formed by all description words of each input named entity is taken as its requirement descriptor (a sketch of this extraction follows this paragraph). The information implied by each input named entity in the user's input requirement is determined based on the modification relations between each input named entity and its description words. For example, suppose the a-th input named entity "comedy" has the description words "humorous", "comic", and "interactive"; the presence of "interactive" indicates that the a-th input named entity "comedy" may be a situation comedy, so when entity nodes related to comedy are matched in the knowledge graph, entity nodes close to situation comedy need to be searched, so that reply text meeting the user's requirement can be generated efficiently and correctly.
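A sketch of collecting the description words of one input named entity from a dependency syntax tree. The tree is assumed here to be a list of (word, head) pairs with a 1-based head index and 0 denoting the virtual root, which is the structure common dependency parsers expose; this representation is an assumption, not something fixed by the patent.

```python
def requirement_descriptors(tree, entity_idx):
    """Collect the description words of one input named entity: the words
    whose nodes are directly connected to the entity's node in the tree."""
    entity_head = tree[entity_idx][1]
    descriptors = []
    for i, (word, head) in enumerate(tree):
        if i == entity_idx:
            continue
        # Direct connection: either the entity is this word's head,
        # or this word is the entity's head.
        if head == entity_idx + 1 or i + 1 == entity_head:
            descriptors.append(word)
    return descriptors
```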
Based on the above analysis, entity guiding weights are constructed herein to characterize the information guidance between each input named entity and each of its screening entities. The entity dissimilarity index and the entity guiding weight between the a-th input named entity and its i-th screening entity are calculated as:

$$S_{ac}=\sum_{j=1}^{N_1}\bar{h}_{ac}(j)\,d_{ac}(j)$$

$$T_{ai}=\frac{1}{m_1}\sum_{c=1}^{m_1}\sum_{b}\frac{Y\left(l_{ac},l_{ib}\right)}{gd\left(A_{ac},A_{ib}\right)+S_{ac}+\mu}$$

wherein S_{ac} is the entity dissimilarity index between the a-th input named entity and its c-th description word; N_1 is the number of dependency syntax trees that contain both the a-th input named entity and its c-th description word, and j indexes these trees; \bar{h}_{ac}(j) is the average of the depths of the nodes corresponding to the a-th input named entity and the c-th description word in the j-th such dependency syntax tree; d_{ac}(j) is the dependency path distance between the a-th input named entity and the c-th description word in the j-th tree, equal to the minimum number of nodes between their corresponding nodes. T_{ai} is the entity guiding weight between the a-th input named entity and its i-th screening entity; m_1 is the number of description words in the requirement descriptor of the a-th input named entity; A_{ac} is the c-th description word of the a-th input named entity and l_{ac} is its word vector; A_{ib} is the b-th piece of attribute information of the i-th screening entity and l_{ib} is its word vector; Y(l_{ac},l_{ib}) is the cosine similarity between l_{ac} and l_{ib}; gd(A_{ac},A_{ib}) is the normalized Google distance between A_{ac} and A_{ib}; and \mu is a parameter that prevents the denominator from being 0, with an empirical value of 0.01. Cosine similarity and the normalized Google distance are known techniques, and their specific processes are not repeated.

The more complex the context semantic information between the a-th input named entity and its c-th description word in the requirement word segmentation sequence, the deeper both words sit in the same dependency syntax tree and the larger the average depth \bar{h}_{ac}(j); the more complex the semantics expressed by the two words, the longer the path from the node of the a-th input named entity to the node of the c-th description word in the same tree, the larger d_{ac}(j), and hence the larger S_{ac}. Conversely, the more frequently the context information of the a-th input named entity co-occurs with the attribute information of the i-th screening entity in user input requirements, and the more similar the description words in the requirement descriptor of the a-th input named entity are to the attribute information of the i-th screening entity, the larger Y(l_{ac},l_{ib}) and the smaller gd(A_{ac},A_{ib}), i.e., the larger T_{ai}; the i-th screening entity is then preferred as a matching object, which reduces the matching guidance toward other entity nodes in the knowledge graph.
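A sketch of these two formulas as reconstructed above; `cosine` repeats the helper from the earlier screening sketch, and the input containers (per-tree depth statistics, per-descriptor dissimilarity list, normalized-Google-distance table) are assumed data layouts:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def dissimilarity_index(tree_stats):
    """S_ac: tree_stats holds one (entity_depth, descriptor_depth, path_distance)
    triple per dependency syntax tree containing both words."""
    return sum(((de + dd) / 2.0) * dist for de, dd, dist in tree_stats)

def guiding_weight(desc_vecs, attr_vecs, s, ngd, mu=0.01):
    """T_ai: desc_vecs[c] = l_ac, attr_vecs[b] = l_ib, s[c] = S_ac,
    ngd[c][b] = normalized Google distance gd(A_ac, A_ib)."""
    total = sum(cosine(lc, lb) / (ngd[c][b] + s[c] + mu)
                for c, lc in enumerate(desc_vecs)
                for b, lb in enumerate(attr_vecs))
    return total / len(desc_vecs)  # average over the m_1 description words
```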
Thus, the entity guiding weight between each input named entity and each screening entity of each input named entity is obtained and is used for obtaining the matching entity of each input named entity by using the matching model subsequently.
Step S003, determining the entity guiding extension correction weight between the input named entity and each screening entity based on the information extension characteristics between the input named entity and each screening entity and the entity guiding weight between the input named entity and each screening entity.
The entity guiding weight obtained in the above step reflects the guidance between an input named entity in the requirement word segmentation sequence and the entity nodes in the knowledge graph. However, ambiguity and polysemy are common in movie-related text. For example, "Cai Wenji" can denote either a movie title or a person's name; if description words such as "Eastern Han", "female", and "literature" are attached to the movie "Cai Wenji", a wrong modification relationship arises between the named entity and the description words, and in that case an inappropriate matching entity may be obtained when the knowledge graph is searched only through the named entity and its modifiers.
For example, if the user inputs the question "Is the movie Cai Wenji good?", the knowledge graph needs to query the named entities related to "Cai Wenji" in the movie domain and connect the remaining entities, such as finding the two entity nodes "Gao Zhanquan" (director) and "costume drama" (movie type); then, after finding entity nodes such as "Zhang Yimeng, Chen Zhihui, Pan Yuanjia" (starring) adjacent to those two entity nodes, the related attributes of the found entity nodes are integrated, and an answer to the user's question can be obtained. For example, the answer might be: "The movie Cai Wenji is a costume drama directed by Gao Zhanquan and starring Zhang Yimeng, Chen Zhihui, and Pan Yuanjia, which tells the legend of the girl Cai Wenji and is well worth recommending." That is, the attribute similarity between each entity node and its adjacent entity nodes in the knowledge graph has a certain transitivity, and the farther the distance in the knowledge graph, the weaker the attribute similarity between entity nodes. Therefore, if a certain entity node in the knowledge graph is simultaneously a screening entity of multiple input named entities, the word corresponding to that entity node has a high probability of carrying a strong, prominent semantic meaning (for example, a subject term of a certain class of text), and the words corresponding to the multiple input named entities are synonyms of that subject term, or paraphrases whose semantics shift under the influence of the remaining context description words.
Further, for each screening entity of each input named entity, taking the i-th screening entity of the a-th input named entity as an example, the set formed by all entity nodes directly connected with the entity node of the i-th screening entity in the knowledge graph is recorded as the node intersection set of the i-th screening entity. Then the intersection between the set formed by all screening entities of the a-th input named entity and the node intersection set of the i-th screening entity is computed and taken as the information node set of the a-th input named entity.
Based on the above analysis, entity guiding extension correction weights are constructed herein to characterize the guiding effect of each screening entity when the knowledge graph is used to obtain the matching entity of each input named entity. The entity guiding extension correction weight between the a-th input named entity and its i-th screening entity is calculated as:

$$D_i=\frac{cv\left(G_{A,i}\right)}{Jac_i\cdot\sum_{a=1}^{n_1}\sum_{g\neq a}Y\left(C_a,C_g\right)+\mu}$$

$$U_{ai}=r_a\sum_{x\neq y}\frac{\left|T_{ai}-T_{ax}\right|+\left|T_{ai}-T_{ay}\right|}{\left|T_{ax}-T_{ay}\right|+\mu}$$

$$V_{ai}=\frac{T_{ai}\,U_{ai}}{D_i}$$

wherein D_i is the generated-information supply-demand degree of the i-th screening entity; G_{A,i} is the set of entity guiding weights between all input named entities and the i-th screening entity, and cv(G_{A,i}) is the coefficient of variation of the set G_{A,i}; Jac_i is the sum of the Jaccard coefficients between the attribute information sets of all entity nodes in the node intersection set of the i-th screening entity; n_1 is the number of input named entities that take the i-th screening entity as a screening entity, a and g index two such input named entities, C_a and C_g are their word vectors, and Y(C_a,C_g) is the cosine similarity between C_a and C_g; \mu is a parameter that prevents the denominator from being 0, with an empirical value of 0.01.

U_{ai} is the entity guiding correction coefficient between the a-th input named entity and its i-th screening entity; r_a is the guiding proportion of the a-th input named entity, equal to the ratio of the number of nodes in the information node set of the a-th input named entity to the number of its screening entities; n_2 is the number of elements in the information node set of the a-th input named entity; x and y index two screening entities other than the i-th in the information node set of the a-th input named entity; and T_{ai}, T_{ax}, T_{ay} are the entity guiding weights between the a-th input named entity and its i-th, x-th, and y-th screening entities. V_{ai} is the entity guiding extension correction weight between the a-th input named entity and its i-th screening entity.

The more input named entities take the i-th entity node of the knowledge graph as a screening entity, the stronger the semantic extensibility of that node, the more easily its semantic information changes under the influence of context, and the larger cv(G_{A,i}); the richer the semantic information the i-th screening entity can provide for generated text, the larger the differences between the attribute information sets of the entity nodes in its node intersection set, the smaller Jac_i and Y(C_a,C_g), and hence the larger D_i. Meanwhile, the more definite the semantic information of the a-th input named entity, the more screening entities its information node set shares with its screening-entity set; the stronger the semantic extensibility of the i-th screening entity, the more connected entity nodes it has in the knowledge graph, and the larger r_a. The larger the entity guiding weight between the a-th input named entity and the i-th screening entity, the more evident the guidance between them; and the closer the entity guiding weights between the a-th input named entity and the remaining screening entities in its information node set, the smaller |T_{ax}-T_{ay}| and the larger U_{ai}. In other words, the larger the value of V_{ai}, the less likely a guiding error occurs when the a-th input named entity is matched with its i-th screening entity, and the less the entity guiding weight T_{ai} is corrected.
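A sketch assembling D_i, U_ai, and V_ai from the reconstructed formulas; the argument containers (guiding-weight table, Jaccard sum, pairwise-similarity sum) are assumed layouts for illustration:

```python
import numpy as np
from itertools import combinations

def supply_demand_degree(T_col, jac_sum, pair_sim_sum, mu=0.01):
    """D_i: T_col is the set G_{A,i} of guiding weights toward screening entity i;
    jac_sum = Jac_i; pair_sim_sum = accumulated Y(C_a, C_g) over entity pairs."""
    cv = float(np.std(T_col) / (np.mean(T_col) + 1e-12))  # coefficient of variation
    return cv / (jac_sum * pair_sim_sum + mu)

def guide_correction(T_a, i, info_nodes, r_a, mu=0.01):
    """U_ai: T_a maps screening-entity index -> T_ai; info_nodes lists the
    screening entities in the information node set of input named entity a."""
    acc = sum((abs(T_a[i] - T_a[x]) + abs(T_a[i] - T_a[y]))
              / (abs(T_a[x] - T_a[y]) + mu)
              for x, y in combinations([n for n in info_nodes if n != i], 2))
    return r_a * acc

def extension_weight(T_ai, U_ai, D_i):
    """V_ai = T_ai * U_ai / D_i."""
    return T_ai * U_ai / D_i
```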
The entity guiding extension correction weight between each input named entity and each screening entity is obtained, so that the subsequent generation model can automatically generate texts according to the input requirements of the user.
Step S004, automatically generating a reply text corresponding to each text information data input by the user based on the entity guiding extension correction weight by using the generation model.
According to the above steps, the entity guiding extension correction weight between each input named entity and each screening entity is calculated, reflecting the degree to which each input named entity inclines toward each entity node in the knowledge graph. Next, a graph structure similarity score between each input named entity and each screening entity is determined based on these weights. The graph structure similarity score between the a-th input named entity and its i-th screening entity is calculated as:

$$P_{ai}=\frac{V_{ai}}{\sum_{i=1}^{M_1}V_{ai}}$$

where P_{ai} is the graph structure similarity score between the a-th input named entity and its i-th screening entity, M_1 is the number of screening entities of the a-th input named entity, and V_{ai} is the entity guiding extension correction weight between the a-th input named entity and its i-th screening entity.
Further, the graph structure similarity scores between each input named entity and each of its screening entities are obtained as above. The graph structure similarity score between each input named entity and each screening entity is taken as one element of a matrix, and the matrix formed by the graph structure similarity scores between each input named entity and all its screening entities is taken as the structure adjacency matrix of that input named entity. Then all the requirement word segmentation sequences, the structure adjacency matrix of each input named entity, and the word vectors and attribute information of the screening entities of all named entities are used as inputs, and a GAT (Graph Attention Network) model is used to output the matching entity of each input named entity; the GAT model is a known technology, and its detailed process is not repeated.
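A short sketch of the row normalization that produces the structure adjacency matrix, assuming V is an array whose rows index input named entities and whose columns index their screening entities:

```python
import numpy as np

def structure_adjacency(V):
    """Row-normalize the extension correction weights: P_ai = V_ai / sum_i V_ai."""
    V = np.asarray(V, dtype=float)
    return V / (V.sum(axis=1, keepdims=True) + 1e-12)
```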
Further, each requirement word segmentation sequence and the attribute information sets of all the matching entities of the input named entities in that sequence are used as the input of a generation model, GPT (Generative Pre-trained Transformer), which generates the reply text corresponding to each requirement word segmentation sequence; the training of the generation model is a known technology, and its specific process is not repeated.
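A hedged sketch of this final step using the Hugging Face transformers API; the "gpt2" checkpoint and the prompt format are placeholders, since the patent only specifies "a generation model GPT" without naming a checkpoint or prompt scheme.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt = requirement word segmentation sequence + matched entities' attributes.
prompt = "电影 蔡文姬 好看 吗 [导演: 高占全] [类型: 古装]"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=120, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```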
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. The foregoing description of the preferred embodiments of the present application is not intended to be limiting, but rather, any modifications, equivalents, improvements, etc. that fall within the principles of the present application are intended to be included within the scope of the present application.
Claims (10)
1. The intelligent text automatic generation method based on the AI large language model is characterized by comprising the following steps:
Converting the collected text information data into a required word segmentation sequence by using a data cleaning technology and a word segmentation tool;
determining a requirement descriptor of each input named entity and a screening entity of each input named entity based on a named entity recognition result of the words in the requirement word segmentation sequence and the dependency relationship among the words;
determining an entity dissimilarity index between each input named entity and each description word based on the depth of the word in the dependency syntax tree corresponding to each required word segmentation sequence of each input named entity;
determining the entity guiding weight between each input named entity and each screening entity based on the entity dissimilarity index and the degree of information similarity between each named entity and the entities adjacent to its screening entities;
Determining the entity guiding extension correction weight between each input named entity and each screening entity based on the information extension features between each input named entity and each screening entity and the entity guiding weight between each input named entity and each screening entity; and automatically generating, by using a generation model, the reply text corresponding to each piece of text information data input by a user based on the entity guiding extension correction weights.
2. The intelligent text automatic generation method based on the AI large language model of claim 1, wherein the method for determining the requirement descriptor of each input named entity and the screening entity of each input named entity based on the named entity recognition results of the words in the requirement word segmentation sequence and the dependency relationships between the words is as follows:
taking all the required word segmentation sequences as input, and acquiring the word vector of each word in each required word segmentation sequence by adopting a Word2Vec model;
The method comprises the steps of taking word vectors of all words as input of a named entity recognition model, obtaining named entity recognition results of all words by using the named entity recognition model, taking the named entity recognition results of all words in each required word segmentation sequence as an input named entity, taking words corresponding to each node directly connected with the node where each input named entity is located in a dependency syntax tree of each required word segmentation sequence as a description word of each input named entity, and taking a sequence consisting of all description words of each input named entity as a required descriptor of each input named entity;
And calculating the cosine similarity between the word vector of each input named entity and the word vector of the entity corresponding to each entity node in the knowledge graph, and taking each entity in the knowledge graph whose word vector has a cosine similarity with the word vector of the input named entity larger than a preset threshold as a screening entity of that input named entity.
3. The intelligent text automatic generation method based on the AI large language model of claim 1, wherein the method for determining the entity dissimilarity index between each input named entity and each description word based on the depth of the word in the dependency syntax tree corresponding to each required word segmentation sequence of each input named entity is as follows:
Taking the average value of the depth of a node where each input named entity is located and the depth of a node where the description word is located in a dependency syntax tree containing each input named entity and one description word of each input named entity as an average depth value;
Taking the minimum value of the number of nodes from the node of each input named entity to the node of the description word in a dependency syntax tree containing each input named entity and one description word of each input named entity as the dependency path distance between each input named entity and one description word of each input named entity;
and taking the accumulated result of the product of the average depth value and the dependency path distance on all dependency syntax trees containing each input named entity and one description word of each input named entity as an entity dissimilarity index between each input named entity and each description word.
4. The intelligent text automatic generation method based on the AI large language model of claim 1, wherein the method for determining the entity guiding weight between each input named entity and each screening entity based on the entity dissimilarity index and the degree of information similarity between each named entity and the entities adjacent to its screening entities is as follows:
taking the cosine similarity between the word vector of each description word of each input named entity and the word vector of each piece of attribute information of any screening entity of that input named entity as a numerator;
taking the sum of the normalized Google distance, over all requirement sequences, between each description word of each input named entity and each piece of attribute information of any screening entity of that input named entity, the entity dissimilarity index between the input named entity and the description word, and 0.01 as a denominator;
And taking the average value of the accumulated results of the ratio of the numerator to the denominator over all the description words of all the screening entities of each input named entity as the entity guiding weight between each input named entity and each screening entity.
5. The intelligent text automatic generation method based on the AI large language model of claim 1, wherein the method for determining the entity guiding extension correction weight between each input named entity and each screening entity based on the information extension features between each input named entity and each screening entity and the entity guiding weight between each input named entity and each screening entity is as follows:
acquiring the generated information supply and demand degree of the screening entity in the knowledge graph based on the information node set determined by the screening entity of each input named entity and the adjacent entity node of the screening entity in the knowledge graph;
determining entity guide correction coefficients between each input named entity and each screening entity based on the generated information supply and demand degree of the screening entity of each input named entity and entity guide weights between each input named entity and different screening entities;
taking the product of the entity guiding weight and the entity guiding correction coefficient between each input named entity and each screening entity as a numerator;
And taking the ratio of the numerator to the generated-information supply-demand degree of each screening entity of each input named entity as the entity guiding extension correction weight between each input named entity and each screening entity.
6. The intelligent text automatic generation method based on the AI large language model of claim 5, wherein the information node set obtaining method is as follows:
The method comprises the steps that a set formed by all entity nodes which are directly connected with entity nodes of each screening entity of each input named entity in a knowledge graph is recorded as a node intersection set of the screening entities;
the intersection between the set of all screening entities of each input named entity and the node intersection set of each screening entity of each input named entity is taken as the information node set of each input named entity.
7. The intelligent text automatic generation method based on the AI large language model of claim 5, wherein the method for acquiring the information supply and demand degree of the screening entity in the knowledge graph based on the information node set determined by the screening entity of each input named entity and the adjacent entity node of the screening entity in the knowledge graph is as follows:
taking a set formed by all the attributes of each entity node in the knowledge graph as an attribute information set of each entity node;
Taking the accumulated cosine similarity between the word vectors of each pair of input named entities corresponding to each screening entity, over all input named entities corresponding to that screening entity, as a first characteristic value, and taking the product of the sum of the Jaccard coefficients between the attribute information sets of all entity nodes in the node intersection set of each screening entity and the first characteristic value, plus 0.01, as a denominator;
and taking the variation coefficient of the entity guiding weight between all the input named entities and the same screening entity as a numerator and taking the ratio of the numerator to the denominator as the generated information supply and demand degree of each screening entity.
8. The intelligent text automatic generation method based on the AI large language model of claim 5, wherein the method for determining the entity guiding correction coefficient between each input named entity and each screening entity based on the generated-information supply-demand degree of the screening entities of each input named entity and the entity guiding weights between each input named entity and different screening entities is as follows:
taking the ratio of the number of nodes in the information node set of each input named entity to the number of screening entities of each input named entity as the guiding proportion of each input named entity;
taking each screening entity of each input named entity as a reference entity, and taking entity guiding weights between each input named entity and each reference entity as reference values;
Taking the sum of the absolute differences between the reference value and the entity guiding weights corresponding to any two non-reference entity nodes in the information node set of each input named entity as a numerator;
taking the absolute difference between the entity guiding weights corresponding to those two non-reference entity nodes in the information node set of each input named entity, plus 0.01, as a denominator;
and taking the product of the accumulated result of the ratio of the numerator and the denominator on the nodes of the non-reference entity in the information node set of each input named entity and the guiding proportion of each input named entity as an entity guiding correction coefficient between each input named entity and each screening entity.
9. The automatic generation method of intelligent text based on AI large language model of claim 1, wherein the method for automatically generating reply text corresponding to each text information data input by user based on entity oriented extension correction weight by using generation model is:
Obtaining a matching entity of each input named entity based on entity guiding extension correction weight between each input named entity and all screening entities by using a GAT model;
And taking each required word segmentation sequence and attribute information of all the matching entities of the input named entities in each required word segmentation sequence as input of a generation model, and generating a reply text corresponding to each required word segmentation sequence by using the generation model.
10. The intelligent text automatic generation method based on the AI large language model of claim 9, wherein the method for obtaining the matching entity of each input named entity based on the entity guiding extension correction weights between each input named entity and all its screening entities by using the GAT model is as follows:
Taking the sum of the entity guiding extension correction weights between each input named entity and all its screening entities as a denominator, and taking the ratio of the entity guiding extension correction weight between each input named entity and each screening entity to that denominator as the graph structure similarity score between each input named entity and each screening entity;
Taking the graph structure similarity score between each input named entity and each screening entity as one element of a matrix, and taking the matrix formed by the graph structure similarity scores between each input named entity and all its screening entities as the structure adjacency matrix of each input named entity;
And taking all the required word segmentation sequences, the structural adjacency matrix of each input named entity, word vectors and attribute information of screening entities of all named entities as inputs, and obtaining a matching entity of each input named entity by using a GAT model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410281005.XA CN118013017B (en) | 2024-03-12 | 2024-03-12 | Intelligent text automatic generation method based on AI large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118013017A true CN118013017A (en) | 2024-05-10 |
CN118013017B CN118013017B (en) | 2024-07-05 |
Family
ID=90956083
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410281005.XA Active CN118013017B (en) | 2024-03-12 | 2024-03-12 | Intelligent text automatic generation method based on AI large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118013017B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180232443A1 (en) * | 2017-02-16 | 2018-08-16 | Globality, Inc. | Intelligent matching system with ontology-aided relation extraction |
CN112860781A (en) * | 2021-02-05 | 2021-05-28 | 陈永朝 | Mining and displaying method combining vocabulary collocation extraction and semantic classification |
CN114330352A (en) * | 2022-01-05 | 2022-04-12 | 北京京航计算通讯研究所 | Named entity identification method and system |
Non-Patent Citations (2)
Title |
---|
YONGCHUN GU et al.: "Mining Similar Words with Similarity Ranking", Springer, 27 June 2023 (2023-06-27) *
PU Ting: "Research on multi-label classification of law articles and charges in criminal cases based on deep learning", China Master's Theses Full-text Database, 15 August 2023 (2023-08-15) *
Also Published As
Publication number | Publication date |
---|---|
CN118013017B (en) | 2024-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298033B (en) | Keyword corpus labeling training extraction system | |
CN108804521B (en) | Knowledge graph-based question-answering method and agricultural encyclopedia question-answering system | |
CN109829104B (en) | Semantic similarity based pseudo-correlation feedback model information retrieval method and system | |
CN110750995B (en) | File management method based on custom map | |
CN116628173B (en) | Intelligent customer service information generation system and method based on keyword extraction | |
CN110209818B (en) | Semantic sensitive word and sentence oriented analysis method | |
CN110674252A (en) | High-precision semantic search system for judicial domain | |
CN108538294B (en) | Voice interaction method and device | |
CN112417846B (en) | Text automatic generation method and device, electronic equipment and storage medium | |
CN113505209A (en) | Intelligent question-answering system for automobile field | |
CN112632250A (en) | Question and answer method and system under multi-document scene | |
CN117474703B (en) | Topic intelligent recommendation method based on social network | |
CN115827819A (en) | Intelligent question and answer processing method and device, electronic equipment and storage medium | |
Valčič et al. | Information technology for management and promotion of sustainable cultural tourism | |
CN109522396B (en) | Knowledge processing method and system for national defense science and technology field | |
CN111858896A (en) | A Knowledge Base Question Answering Method Based on Deep Learning | |
CN109284389A (en) | A kind of information processing method of text data, device | |
CN118277509A (en) | Knowledge graph-based data set retrieval method | |
CN111680493B (en) | English text analysis method and device, readable storage medium and computer equipment | |
CN108595413A (en) | A kind of answer extracting method based on semantic dependent tree | |
CN118013017B (en) | Intelligent text automatic generation method based on AI large language model | |
Altaf et al. | Efficient natural language classification algorithm for detecting duplicate unsupervised features | |
CN119128115B (en) | Multi-round question-answering implementation method aiming at standard specification | |
CN118551024B (en) | Question answering method, device, storage medium and gateway system | |
CN117540747B (en) | Book publishing intelligent question selecting system based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |