CN114528368A - Spatial relationship extraction method based on pre-training language model and text feature fusion
- Publication number
- CN114528368A (application number CN202111338542.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- spatial relationship
- word
- entity
- text data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a spatial relationship extraction method based on the fusion of a pre-trained language model and text features. The method first cleans and preprocesses the text data and uses the pre-trained language model to convert single or batched text data into low-dimensional word vectors, ensuring that texts of different lengths yield word vectors of consistent dimensionality. Classifiers built from feedforward neural networks then predict, from the word vectors, the start and end positions of the geographic entities and spatial-relationship feature words in the text, and character span representations are generated from these positions and the word vectors by pooling. Finally, geographic entity recognition and spatial relationship classification are performed on the character span representations, realizing the extraction of spatial relationships from text. The method takes full account of the association between geographic entity types, spatial-relationship feature words and spatial relationship extraction, extracts text-oriented spatial relationships in triple form, and offers good extensibility and generality.
Description
Technical Field
The invention belongs to the fields of natural language processing and geographic big data mining, and in particular relates to a method for geographic entity recognition and spatial relationship extraction based on the fusion of a pre-trained language model and text features.
Background
A spatial relationship describes the mutual constraint, interaction and correlation between geographic entities, and is indispensable when people describe spatial positions. Everyday communication frequently involves descriptions of spatial location, usually in the form of a pair of geographic entities plus a spatial relationship, which lets people infer the position of an unknown geographic entity from a known one and links the semantic space of human thought with the physical space of the real world. Text is one of the most common media of communication and information exchange in daily life and contains rich position descriptions and corresponding spatial relationship information; however, because textual expression is flexible and ambiguous, it is difficult to correctly understand the spatial positions described in text. Accurately recognizing the geographic entities and spatial relationships in text has therefore become a pressing scientific problem for a fuller understanding of spatial position descriptions.
To obtain spatial relationships from text, researchers have adapted relation extraction methods from natural language processing, proposing approaches based on rule templates and on machine learning. Rule-template methods formulate extraction rules and templates through steps such as enumerating spatial vocabularies, defining spatial relationships, building dictionaries of spatial-relationship feature words and summarizing syntactic patterns, but they rely heavily on expert knowledge and their rules are never exhaustive, so they generalize poorly and their extraction results have low recall. Machine-learning methods introduce statistical learning techniques such as frequency statistics, Bootstrapping, kernel methods and support vector machines to extract key features of natural language, largely removing the dependence on rule templates, but they are difficult to apply when spatial relationship instances are sparsely distributed. With deep learning, many researchers have turned to joint extraction, encoding entity information and relation information in a text with the same encoder; this strengthens the dependency between the entity recognition and relation extraction tasks, avoids the error accumulation caused by treating them as independent tasks, and mitigates the effect of sparse spatial relationship instances on the model.
However, existing experiments and analyses show that joint extraction is not an ideal relation extraction approach: blindly sharing the context representations of entities and relations can harm the model's spatial relationship extraction performance. In addition, joint extraction does not make full use of entity type information or relation feature word information, does not fully consider their influence on the relation classification task, and therefore struggles to further alleviate the problems caused by the sparse distribution of spatial relationship instances.
Disclosure of Invention
The invention aims to provide a spatial relationship extraction method based on the fusion of a pre-trained language model and text features, addressing the defects and shortcomings of existing methods for extracting spatial relationships from text.
The technical solution adopted by the invention to solve the above problem is a spatial relationship extraction method based on the fusion of a pre-trained language model and text features, comprising the following steps:
Step 1: First, preprocess the text data. Remove meaningless characters such as #, %, $, ' and spaces from the text with regular expressions, and ensure that double or single quotation marks are correctly paired. Then segment the text data character by character and add the [CLS] and [SEP] identifiers at the beginning and end of the segmentation result. If text data are input in batches, every piece of text must have the same length; shorter texts are padded with the [PAD] identifier.
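As an illustration of Step 1, the sketch below implements the cleaning, character-by-character segmentation and [CLS]/[SEP]/[PAD] handling described above. It is a minimal Python sketch; the function name, the exact character set removed and the padding strategy are assumptions for illustration rather than the patent's reference implementation.

```python
import re

def preprocess(texts, max_len=None):
    """Clean, segment character by character, and pad a batch of texts (Step 1 sketch)."""
    cleaned = []
    for text in texts:
        # Remove meaningless symbols such as #, %, $, ' and whitespace with a regular expression.
        text = re.sub(r"[#%$'\s]+", "", text)
        # Character-by-character segmentation, with [CLS]/[SEP] identifiers at the boundaries.
        cleaned.append(["[CLS]"] + list(text) + ["[SEP]"])
    # For batch input, pad shorter sequences with [PAD] so every text has the same length.
    max_len = max_len or max(len(seq) for seq in cleaned)
    return [seq + ["[PAD]"] * (max_len - len(seq)) for seq in cleaned]

batch = preprocess(["Town A is located in the northeast of County B.",
                    "County B borders County C."])
```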
Step 2: Input the preprocessed text data into a pre-trained language model, which converts the character-by-character segmentation result T = {t1, t2, …, tN} into dense real-valued word vectors Z = {z1, z2, …, zN}.
Step 3: Input the dense real-valued vectors obtained in Step 2 into two single-layer feedforward neural networks, which act as two classifiers predicting whether each word vector zi is the start or the end character of a geographic entity or spatial-relationship feature word. The prediction results of the two classifiers are recorded in the POS_start and POS_end index sets, respectively, and each set is sorted in ascending order. Based on the dense vectors Z = {z1, z2, …, zN} and the POS_start and POS_end index sets, a pair of start and end indexes [i, j] is selected and zi to zj in Z are fused by max pooling to form a character span representation (Span Representation). Start and end indexes are selected strictly according to the proximity principle, and no start or end index may appear in more than one pair.
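A minimal PyTorch sketch of Step 3, assuming one single-layer feedforward classifier per boundary type, a threshold δ on the GeLU output, and max pooling over the selected span; module, function and parameter names are illustrative only.

```python
import torch
import torch.nn as nn

class BoundaryClassifier(nn.Module):
    """Single-layer feedforward network scoring each character vector as a boundary."""
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)
        self.act = nn.GELU()

    def forward(self, z):                              # z: [seq_len, hidden_size]
        return self.act(self.linear(z)).squeeze(-1)    # one score per character

def max_pool_span(z, i, j):
    """Fuse word vectors z_i..z_j (inclusive) into one span representation by max pooling."""
    return z[i:j + 1].max(dim=0).values

hidden = 768
start_clf, end_clf = BoundaryClassifier(hidden), BoundaryClassifier(hidden)
z = torch.randn(32, hidden)                            # stand-in for the Step-2 word vectors
delta = 0.5                                            # threshold, a tunable hyperparameter
pos_start = (start_clf(z) > delta).nonzero(as_tuple=True)[0].tolist()
pos_end = (end_clf(z) > delta).nonzero(as_tuple=True)[0].tolist()
# pos_start and pos_end are then paired by the proximity principle (see the pairing sketch
# in the detailed description) and each [i, j] pair is fused with max_pool_span(z, i, j).
```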
Step 4: Input the character span representations obtained in Step 3 into a single-layer feedforward neural network to predict the entity type of the span corresponding to each start-end index pair [i, j]. The entity types include specific geographic entity types (mountain, river, administrative division and the like), spatial-relationship feature words, and a null type indicating that the character span belongs to no geographic entity or spatial-relationship feature word type.
Step 5: Based on the geographic entity predictions of Step 4, the model automatically adds geographic entity markers before the start position and after the end position of each recognized entity in the source text data; the markers indicate where the geographic entities recognized by the model lie in the text, and the start and end position information of the spatial-relationship feature words is updated accordingly. After the markers are added, the newly generated text is input into the pre-trained language model used for relation extraction to generate the corresponding low-dimensional dense word vectors. The model represents each geographic entity by fusing the word vectors of its start and end markers through average pooling (Average Pooling), and represents each spatial-relationship feature word by fusing its corresponding word vectors through max pooling (Max Pooling).
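The following sketch shows one way Step 5 could be realized, assuming the entity markers are inserted as ordinary tokens (the marker strings <e> and </e> are illustrative, not specified by the patent) and the pooled representations are built from the re-encoded word vectors.

```python
import torch

def insert_entity_markers(tokens, entity_spans, start_tok="<e>", end_tok="</e>"):
    """Insert entity markers around each predicted [start, end] span (inclusive indices)."""
    marked = list(tokens)
    # Insert from right to left so that earlier offsets remain valid.
    for i, j in sorted(entity_spans, reverse=True):
        marked.insert(j + 1, end_tok)
        marked.insert(i, start_tok)
    return marked

def entity_repr(z, i_mark, j_mark):
    """Average pooling: fuse the start- and end-marker vectors to represent the entity."""
    return (z[i_mark] + z[j_mark]) / 2

def feature_word_repr(z, k, l):
    """Max pooling: fuse the feature-word span vectors to represent the relation feature word."""
    return z[k:l + 1].max(dim=0).values

# Word-level tokens are used here only for readability; the method itself works per character.
tokens = ["[CLS]", "Town", "A", "is", "in", "County", "B", "[SEP]"]
print(insert_entity_markers(tokens, [(1, 2), (5, 6)]))
```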
Step 6: The model first concatenates the vector representations of any pair of geographic entities and the spatial-relationship feature words, and fuses the concatenated representation into a text feature vector through a self-attention mechanism; the text feature vector is then input into a feedforward neural network for spatial relationship classification; finally, the model determines the spatial relationship between the geographic entities from the probability information output by the feedforward neural network.
Furthermore, the pre-trained language model is trained on large-scale text data from the geographic domain, learning grammar rules and mining latent semantics through self-supervised learning. Taking text segmented at character granularity as input, the model encodes the text from three aspects, namely the characters themselves, their positions and their semantics, to generate a word vector matrix whose dimensions are the output dimension set for the pre-trained language model and the length of the input text in characters.
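For reference, this character-level encoding can be reproduced with a standard BERT implementation; the sketch below uses the Hugging Face transformers library with the public bert-base-chinese checkpoint (the checkpoint choice is an assumption, since the patent only specifies a BERT-style pre-trained language model), whose tokenizer is effectively character-level for Chinese text.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

texts = ["地理文本示例"]  # any Chinese sentence; tokenized character by character
enc = tokenizer(texts, padding=True, return_tensors="pt")  # adds [CLS]/[SEP], pads with [PAD]
with torch.no_grad():
    out = model(**enc)

word_vectors = out.last_hidden_state
# Shape: [batch_size, sequence_length, 768] for bert-base, i.e. a word vector matrix whose
# dimensions are the model's output size and the (padded) input text length in characters.
```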
Furthermore, during spatial relationship extraction, two independent pre-trained language models are used for the two subtasks of geographic entity recognition and spatial relationship classification. During training the two models do not affect each other and update their parameters independently, so each generates word vector representations better suited to its subtask. The word vector representation Z of text data T generated by the pre-trained language model can be written as Z = BERT(T), with T = {t1, t2, …, tN} and Z = {z1, z2, …, zN}, where N denotes the number of characters in each piece of text data.
Furthermore, the two classifiers predict the start and end positions of the geographic entities and spatial-relationship feature words in the text data. They take the word vectors generated by the pre-trained language model as input, output the result of an affine transformation followed by the GeLU activation function, and judge whether the current character is a start or end position of a geographic entity or spatial-relationship feature word by comparing the output with a preset threshold. The process can be written as POS_start = GeLU(W_start·Z + b_start), POS_end = GeLU(W_end·Z + b_end), with the decision rule if POS_start > δ then 1 else 0 (and likewise for POS_end).
Further, a character span representation is generated by fusing the word vector representations between the start index and the end index with a pooling method. Max pooling fully considers every dimension of every word vector and fuses the maximum value selected in each dimension into the final vector representation; average pooling focuses on the characteristics of the boundary word vectors, representing the geographic entity by the average of its boundary-marker word vectors so that the model better learns the boundary and type features of the entity. The two pooling methods can be written as S[i-j] = Max([zi; zi+1; …; zj]) and S[i-j] = Average([zi; zj]) = (zi + zj)/2.
furthermore, the model forms a text feature matrix by splicing word vectors of geographic entities and spatial feature words, and based on a self-attention mechanism, the text feature matrix is formed by a parameter Wq、WkAnd WvRespectively generating a query matrix Q, a key matrix K and a value matrix V, then further fusing the three types of matrices by utilizing a softmax function to generate a text with a specified dimension sizeThe feature vector.
Beneficial effects:
1. The invention adopts a pre-trained language model in place of the word2vec model, obtaining word vector representations with more complete contextual information.
2. The invention constructs a pair of binary classifiers based on feedforward neural networks to determine the start and end positions of geographic entities and spatial-relationship feature words in the text, further reducing the time cost of locating these positions.
3. The invention fuses word vectors by average pooling (Average Pooling) and max pooling (Max Pooling) to generate character span representations that characterize geographic entities and spatial feature words. Compared with existing methods that label single characters in sequence, character span representations better match the way people think, effectively reduce recognition errors caused by the overly discrete meaning of individual characters, and further improve the recognition accuracy of geographic entities and spatial feature words.
4. The invention fuses the vector representations of geographic entity pairs and spatial-relationship feature words based on a self-attention mechanism, thereby combining two key text features, the geographic entity type and the spatial-relationship feature words, into a vector representation with more complete semantics.
Drawings
FIG. 1 is a technical flowchart of a spatial relationship extraction method based on pre-training language model and text feature fusion according to the present invention.
Fig. 2 is a schematic diagram of a text data preprocessing process used in the example.
FIG. 3 is a diagram of an example process for generating a representation of a character span.
Fig. 4 is a schematic diagram of a text feature fusion process of geographic entity types and spatial relationship feature words.
Detailed Description
The spatial relationship extraction method based on the fusion of a pre-trained language model and text features is described in detail below in conjunction with the accompanying drawings, and comprises the following steps:
(1) Preprocess the original text data: remove meaningless characters and spaces from the text with regular expressions, and add the [CLS] and [SEP] marks at the beginning and end of the text data. The preprocessed text data T is input into a pre-trained language model (by default, the BERT pre-trained language model) to generate the word vector representation Z corresponding to the input data.
Z = BERT(T), T = {t1, t2, …, tN}, Z = {z1, z2, …, zN}
If the text data are input in batches, the model ensures that all inputs have the same length; shorter texts are padded with the [PAD] symbol.
(2) The vector representations of the text characters generated by the pre-trained language model are input into two independent classifiers, which predict the start and end positions of the geographic entities and spatial-relationship feature words in the text, respectively. The predicted start and end position indexes are paired according to the proximity principle to construct the [start, end] index pairs of the geographic entities and spatial-relationship feature words.
POS_start = GeLU(W_start·Z + b_start), POS_end = GeLU(W_end·Z + b_end)
W_start and W_end denote the parameter matrices of the two classifiers, and b_start and b_end denote their bias terms. POS_start and POS_end denote the start and end positions of a geographic entity or spatial-relationship feature word, respectively.
1) The start-position and end-position classifiers are built from single-layer feedforward neural networks. Each network maps the vector representation of a single character to a one-dimensional tensor, and whether the character is the start or end position of a geographic entity or spatial-relationship feature word is decided against a preset hyperparameter threshold.
2) The method provided by the invention for constructing [start, end] index pairs matches start and end positions according to the proximity principle. Specifically, the start positions and end positions are each sorted in ascending order; taking the start-position sequence as the reference, all start positions are traversed and the end position satisfying the rule is selected for matching. Under this matching rule, no [start, end] pair contains any other start or end position.
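A sketch of this proximity-based pairing rule, assuming both index lists are already sorted in ascending order: each start index is matched to the nearest unused end index at or after it, and a candidate pair that would enclose the next start index is rejected. The function name and the example indices are illustrative.

```python
def pair_boundaries(starts, ends):
    """Match start and end indices by proximity; no [start, end] pair encloses another boundary."""
    pairs, used_ends = [], set()
    for s_pos, start in enumerate(starts):
        for end in ends:
            if end < start or end in used_ends:
                continue
            # Reject a candidate pair that would contain the next start index.
            next_start = starts[s_pos + 1] if s_pos + 1 < len(starts) else None
            if next_start is not None and next_start <= end:
                break
            pairs.append((start, end))
            used_ends.add(end)
            break
    return pairs

print(pair_boundaries([1, 4, 8], [2, 6, 10]))   # -> [(1, 2), (4, 6), (8, 10)]
```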
(3) Based on the [start, end] index pairs constructed in step (2), the word vector representations from the start position to the end position are fused by max pooling (Max Pooling) to generate the corresponding character span representation. Each generated character span representation is then classified by a feedforward neural network to determine its entity type (geographic entity, spatial feature word, or none).
S[i-j] = Max([zi; zi+1; …; zj])
Entity Class = softmax(W_entity·S[i-j] + b_entity)
W_entity denotes the parameter matrix of the feedforward neural network in the entity recognition process, and b_entity denotes its bias term.
(4) Combining the geographic entity type recognition results of step (3), entity-start and entity-end markers are added at the corresponding positions of the source text data, and the start and end positions predicted as geographic entities or spatial-relationship feature words are updated. The text with the added start and end markers is then input into another pre-trained language model to generate the corresponding word vector representations. Finally, the word vectors of the start and end markers are fused by average pooling (Average Pooling) to represent each geographic entity, and the word vectors from the start to the end position are fused by max pooling (Max Pooling) to represent each spatial-relationship feature word.
Z′ = BERT′(T′), T′ = {t′1, t′2, …, t′M}, Z′ = {z′1, z′2, …, z′M}
S_Entity-[i-j] = Average([z′i; z′j]) = (z′i + z′j)/2, S_Featureword-[k-l] = Max([z′k; z′k+1; …; z′l])
Here i and j denote the positions of the start and end markers of a geographic entity predicted by the model, and k and l denote the start and end positions of a spatial feature word predicted by the model.
(5) The model first pairs the recognized geographic entities combinatorially to form a set of candidate geographic entity pairs. It then selects a pair of geographic entities from the set together with the word vector representations of the corresponding spatial-relationship feature words and concatenates them; the concatenated vector representation is fused into a text feature vector through a self-attention mechanism (self-attention); finally, the text feature vector is input into a feedforward neural network for spatial relationship classification, and the spatial relationship between the geographic entities is determined from the probability information output by the feedforward neural network.
S = concat(S_Entity-sub; S_Entity-obj; S_Featureword_1; …; S_Featureword_p)
S_Entity-sub and S_Entity-obj denote the character span representations of the subject and object entities, respectively, and S_Featureword_i denotes the vector of each spatial-relationship feature word recognized by the model. W_q, W_k and W_v denote the parameter matrices that generate the query, key and value vectors, respectively, and W_r and b_r denote the parameter matrix and bias term of the feedforward neural network for spatial relationship classification.
As shown in FIG. 1, the spatial relationship extraction method of the invention, based on the fusion of a pre-trained language model and text features, mainly comprises the following three parts:
1. text word vector generation based on a pre-trained language model.
2. Character span representation generation based on the start and end position indices and pooling methods.
3. Text feature fusion considering geographic entity types and spatial relationship feature words.
The detailed flow of the spatial relationship extraction method of the present invention is described below, taking Chinese text data from the Chinese Encyclopedia (geography) as an example.
(1) Chinese text data preprocessing and word vector generation based on a pre-training language model.
As shown in Fig. 2, the selected text data is "Town A is located in the northeast of County B." Following the data preprocessing step, the text is segmented character by character and the [CLS] and [SEP] symbols are added at the beginning and end of the data, respectively. The preprocessed text data is then input into the pre-trained language model to generate a character vector representation matrix of uniform dimensionality.
(2) Recognition of the start and end positions of geographic entities and spatial-relationship feature words based on the two classifiers.
The vector representations of the text characters generated by the pre-trained language model are input into the two independent classifiers, which predict the start and end positions of the geographic entities and spatial-relationship feature words in the text, respectively. For the example "[CLS] Town A is located in the northeast of County B. [SEP]", one of the predictions of the two classifiers is start position 1 and end position 3.
(3) A character span representation based on the start and end position pairs is generated.
As shown in Fig. 3, based on the start and end positions of the geographic entity predicted by the two classifiers and the word vector representation of the text data, a character span representation is generated by fusing the word vectors from the start position to the end position through max pooling (Max Pooling). In the example, the word vectors of the two characters "A" and "town" are fused to generate a character span representation characterizing "Town A".
(4) Recognition of the geographic entities and spatial-relationship feature words in the text data.
Each generated character span representation is input into the feedforward neural network, which determines its entity type (geographic entity, spatial feature word, or none). In the example, the model's recognition results are "Town A", "located", "County B" and "northeast", where "Town A" and "County B" belong to the administrative division type of geographic entity, and "located" and "northeast" are spatial-relationship feature words. Based on the recognition results, the model adds entity-start and entity-end markers at the corresponding positions of the source text data and updates the start and end positions of the predicted geographic entities and spatial-relationship feature words. The model then inputs the text with the added start and end markers into another pre-trained language model and regenerates the corresponding word vector representations. Finally, the word vectors of the start and end markers are fused by average pooling (Average Pooling) to represent each geographic entity, and the word vectors from the start to the end position are fused by max pooling (Max Pooling) to represent each spatial-relationship feature word.
(5) Text feature fusion and spatial relationship extraction based on the self-attention mechanism.
Based on the geographic entity recognition results, the model first pairs the geographic entities combinatorially to form a set of candidate geographic entity pairs. Then, as shown in Fig. 4, the model selects an entity pair from the set ("Town A", "County B") together with the spatial-relationship feature words, and concatenates the word vector representations corresponding to these elements; the concatenated vector representation is fused into a text feature vector through a self-attention mechanism (self-attention); finally, the text feature vector is input into a feedforward neural network for spatial relationship classification, and the spatial relationship between the geographic entities is determined from the probability information output by the feedforward neural network.
As the embodiment shows, the method provided by the invention uses a pre-trained language model to generate word vectors while taking into account the association between geographic entity types, spatial-relationship feature words and the spatial relationship, and achieves good extraction performance and interpretability.
The above description is only one embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or substitution that can readily be conceived by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.
Claims (6)
1. A spatial relationship extraction method based on the fusion of a pre-trained language model and text features, characterized by comprising the following steps:
Step 1: preprocessing original text data, removing meaningless characters from the text with regular expressions and ensuring that opening and closing quotation marks in the text are correctly paired; segmenting the processed text data character by character and adding the [CLS] and [SEP] identifiers at the beginning and end of the segmentation result; if the text data are input in batches, ensuring that every piece of text data has the same length, with shorter texts padded by the [PAD] identifier;
Step 2: inputting the preprocessed text data into a pre-trained language model, which converts the character-by-character segmentation result T = {t1, t2, …, tN} into dense real-valued vectors Z = {z1, z2, …, zN};
Step 3: inputting the word vectors obtained in Step 2 into two classifiers built from single-layer feedforward neural networks to predict whether each word vector zi is the start or the end of a geographic entity or spatial-relationship feature word, the prediction results of the two classifiers being recorded in the POS_start and POS_end index sets, respectively, and sorted in ascending index order,
POS_start = GeLU(W_start·Z + b_start), POS_end = GeLU(W_end·Z + b_end)
based on the word vectors Z = {z1, z2, …, zN} and the POS_start and POS_end index sets, selecting a pair of start and end indexes [i, j] according to the proximity principle and fusing zi to zj in Z by max pooling to generate a character span representation;
Step 4: inputting the character span representations generated in Step 3 into an entity recognizer built from a single-layer feedforward neural network to predict the entity type represented by each character span,
S[i-j] = Max([zi; zi+1; …; zj])
Entity Class = softmax(W_entity·S[i-j] + b_entity)
the entity types comprising specific geographic entity types, spatial-relationship feature words, and a null type indicating that the character span representation belongs to no geographic entity or spatial-relationship feature word type;
Step 5: according to the prediction results of the geographic entities in the text, the model automatically adding geographic entity markers before and after the start and end positions in the source text data and updating the start and end position information of the spatial-relationship feature words in the source text data; after the geographic entity markers are added, inputting the newly generated text data into another pre-trained language model to generate the corresponding text word vectors, the model fusing the word vectors of the start and end markers by average pooling to represent each geographic entity and fusing the corresponding word vectors by max pooling to represent each spatial-relationship feature word;
Step 6: the model pairing the geographic entities combinatorially to form a set of candidate geographic entity pairs, selecting any pair of geographic entities in the set together with the word vector representations corresponding to the spatial-relationship feature words, and concatenating them; fusing the concatenated vector representation into a text feature vector through a self-attention mechanism; inputting the text feature vector into a feedforward neural network for spatial relationship classification, and determining the spatial relationship between the geographic entities from the probability information output by the feedforward neural network,
S = concat(S_Entity-sub; S_Entity-obj; S_Featureword_1; …; S_Featureword_p)
Relation Class = softmax(W_r·S′ + b_r).
2. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features according to claim 1, characterized in that the pre-trained language model is trained on large-scale geographic-domain text data, learning grammar rules and mining latent semantics from the text data through self-supervised learning; taking text segmented at character granularity as input, the model encodes the text from three aspects, namely the characters themselves, their positions and their semantics, to generate a word vector matrix whose dimensions are the output dimension set for the pre-trained language model and the length of the input text in characters.
3. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features according to claim 1, characterized in that in the spatial relationship extraction process two independent pre-trained language models are used for the two subtasks of geographic entity recognition and spatial relationship classification; during model training the two pre-trained language models do not affect each other and update their parameters independently, so as to better generate word vector representations meeting the requirements of each subtask; the word vector representation Z of text data T generated by the pre-trained language model can be expressed as Z = BERT(T), with T = {t1, t2, …, tN} and Z = {z1, z2, …, zN}, where N denotes the number of characters in each piece of text data.
4. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features according to claim 1, characterized in that the two classifiers are built from single-layer feedforward neural networks and are used to predict the start and end positions of the geographic entities and spatial-relationship feature words in the text data, respectively; the two classifiers take the word vectors generated by the pre-trained language model as input, output the result of an affine transformation followed by the GeLU activation function, and judge whether the current character is a start or end position of a geographic entity or spatial-relationship feature word according to a preset threshold and the output result, a process that can be expressed as POS_start = GeLU(W_start·Z + b_start), POS_end = GeLU(W_end·Z + b_end), if POS_start > δ then 1 else 0.
5. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features according to claim 1, characterized in that the character span representation is generated by fusing the word vector representations between the start and end indexes with a pooling method; max pooling fully considers every dimension of every word vector and fuses the maximum value selected in each dimension into the final vector representation; average pooling focuses on the characteristics of the boundary word vectors, representing the geographic entity by the average of its boundary-marker word vectors so that the model better learns the boundary and type features of the entity; the two pooling methods can be expressed as S[i-j] = Max([zi; zi+1; …; zj]) and S[i-j] = Average([zi; zj]) = (zi + zj)/2.
6. The spatial relationship extraction method based on the fusion of a pre-trained language model and text features according to claim 1, characterized in that the model forms a text feature matrix by concatenating the word vectors of the geographic entities and the spatial feature words; based on a self-attention mechanism, the parameter matrices W_q, W_k and W_v respectively generate a query matrix Q, a key matrix K and a value matrix V, and the three matrices are then further fused with a softmax function to generate a text feature vector of the specified dimension.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111338542.6A CN114528368B (en) | 2021-11-12 | 2021-11-12 | Spatial relation extraction method based on fusion of pre-training language model and text features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111338542.6A CN114528368B (en) | 2021-11-12 | 2021-11-12 | Spatial relation extraction method based on fusion of pre-training language model and text features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114528368A true CN114528368A (en) | 2022-05-24 |
CN114528368B CN114528368B (en) | 2023-08-25 |
Family
ID=81618545
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111338542.6A Active CN114528368B (en) | 2021-11-12 | 2021-11-12 | Spatial relation extraction method based on fusion of pre-training language model and text features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114528368B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120078503A1 (en) * | 2009-06-10 | 2012-03-29 | Ancestralhunt Partners, Llc | System and method for the collaborative collection, assignment, visualization, analysis, and modification of probable genealogical relationships based on geo-spatial and temporal proximity |
CN110377686A (en) * | 2019-07-04 | 2019-10-25 | 浙江大学 | A kind of address information Feature Extraction Method based on deep neural network model |
CN111680122A (en) * | 2020-05-18 | 2020-09-18 | 国家基础地理信息中心 | Space data active recommendation method and device, storage medium and computer equipment |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN113190655A (en) * | 2021-05-10 | 2021-07-30 | 南京大学 | Spatial relationship extraction method and device based on semantic dependence |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114881038A (en) * | 2022-07-12 | 2022-08-09 | 之江实验室 | Chinese entity and relation extraction method and device based on span and attention mechanism |
CN114881038B (en) * | 2022-07-12 | 2022-11-11 | 之江实验室 | Chinese entity and relation extraction method and device based on span and attention mechanism |
CN116402055A (en) * | 2023-05-25 | 2023-07-07 | 武汉大学 | Extraction method, device, equipment and medium for patent text entity |
CN116402055B (en) * | 2023-05-25 | 2023-08-25 | 武汉大学 | Extraction method, device, equipment and medium for patent text entity |
Also Published As
Publication number | Publication date |
---|---|
CN114528368B (en) | 2023-08-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |