
CN114626425A - Multi-view interactive matching method for noise text and electronic device - Google Patents

Multi-view interactive matching method for noise text and electronic device

Info

Publication number
CN114626425A
Authority
CN
China
Prior art keywords
noise
vector
sections
interaction
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011456860.8A
Other languages
Chinese (zh)
Other versions
CN114626425B (en)
Inventor
井雅琪
李扬曦
佟玲玲
任博雅
段东圣
段运强
胡燕林
方芳
尹鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Information Engineering of CAS
Priority to CN202011456860.8A priority Critical patent/CN114626425B/en
Publication of CN114626425A publication Critical patent/CN114626425A/en
Application granted granted Critical
Publication of CN114626425B publication Critical patent/CN114626425B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multi-view interactive matching method and an electronic device for noise text. The method encodes two sections of noise text to be matched into two encoding vector sequences and adds position information to each encoding vector of the two sequences; performs internal interaction on the two position-encoded vector sequences to obtain two sections of internal interaction results; performs external interaction on the two sections of internal interaction results to construct two bidirectional noise text interaction matrices; and concatenates the two interaction matrices to judge whether the two sections of noise text to be matched match. The method uses an attention mechanism to capture the bidirectional matching pattern between noise texts; it is little affected by the logical order of sentences in the noise text, increases the influence of the semantically meaningful words of the text, improves both the time efficiency of the model and the matching quality on noise text, and avoids the transitive-matching problem.

Description

Multi-view interactive matching method for noise text and electronic device
Technical Field
The invention relates to the field of computers, in particular to a multi-view interactive matching method and an electronic device for a noise text.
Background
On today's internet there is a large amount of noise text, that is, text containing fragments with no practical meaning or with a disordered grammatical structure. Noise text poses two main problems. In content, the semantics expressed by the noise are unrelated to the original text and are usually ambiguous and repetitive. In form, noise text has a relatively complex grammatical structure, and its sequential structure varies widely. Given these two problems, a matching model insensitive to noise and word order is needed to solve the noise text matching problem. The current mainstream noise text matching methods first filter the noise in the text before matching, using rule-based methods and feature engineering. The filtered noise text is then fed into a time-sequence matching model, chiefly Markov conditional random fields, recurrent neural networks and the like. Finally, the model reads the input texts in order and produces a matching score, which is used to judge whether the two noise sentences match.
However, rules and feature engineering have a limited filtering effect: they can hardly cover all noise instances or correctly identify all noise, because noise takes many forms that are difficult to enumerate and generalize, and some noise still carries practical meaning in particular contexts. In addition, because the sequential structure of a text largely determines its true meaning, the disordered word order in noise text can degrade conventional time-sequence models.
Disclosure of Invention
The invention aims to provide a multi-view interactive matching method and an electronic device for noise text, in which attention weights are calculated by a scaled cosine mechanism and the interference of noise and word order on the matching result is suppressed by attention weighting, so that a good text matching result is still obtained when the text is noisy and its word order is scrambled.
The technical scheme of the invention is as follows:
a multi-view interactive matching method for noise text comprises the following steps:
1) encoding two sections of noise text to be matched to obtain two sections of encoding vector sequences, and adding position information to each encoding vector of the two sequences;
2) performing internal interaction on the two position-encoded vector sequences to obtain two sections of internal interaction results, whose dimensions are consistent with those of the encoding vector sequences;
3) performing external interaction on the two sections of internal interaction results by calculating bidirectional attention distributions, and constructing two bidirectional noise text interaction matrices;
4) concatenating the two noise text interaction matrices and judging whether the two sections of noise text to be matched match.
Further, the two sections of noise text to be matched are preprocessed before encoding; the preprocessing comprises removing punctuation marks, stop words and low-frequency words.
Further, the two sections of noise text are encoded using a pre-trained Word2vec or Bert model.
Further, position information is added to each encoding vector of the two encoding vector sequences using the position-vector generation method of the Bert model.
Further, before the internal interaction, the two position-encoded vector sequences are mapped to a unified semantic space by the following steps:
1) feeding the two position-encoded vector sequences into a bidirectional LSTM neural network for secondary encoding to obtain two final vector encoding sequences;
2) mapping each vector encoding of the two final sequences through the same residual network, so that the two position-encoded vector sequences are mapped to a unified semantic space.
Further, the vector dimension of the final vector encoding sequences is determined by the number of hidden-layer units of the second LSTM encoding layer in the bidirectional LSTM neural network.
Further, the internal interaction results are obtained by the following steps:
1) feeding the two position-encoded vector sequences into a first residual network;
2) taking each position-encoded noise encoding vector as a query item, performing internal interaction with the other vector encodings of the same noise text through a scaled cosine attention algorithm, and calculating self-attention weights;
3) combining the position-encoded vector sequence with the attention weights to obtain a weighted encoding vector sequence;
4) feeding the weighted encoding vector sequence into a second residual network to obtain an abstract vector representation of the noise text;
5) performing an L2 regularization operation on each abstract vector representation to obtain the internal interaction result of the corresponding position-encoded vector sequence.
Further, a noise text interaction matrix is constructed by the following steps:
1) taking each encoding vector of one section's internal interaction result as a query and performing cosine-similarity-attention external interaction with the internal interaction result of the other section of text, obtaining the attention weight distribution of the current encoding vector over the other encoding sequence;
2) weighting the corresponding internal interaction result with the attention weight distribution to obtain an external interaction vector for each encoding vector;
3) feeding the vector sequence obtained from the external interaction into a third residual network and performing an L2 regularization operation on its output to obtain a bidirectional noise interaction matrix.
Further, whether the two sections of noise text to be matched match is judged by the following steps:
1) obtaining the concatenation of the two noise text interaction matrices;
2) feeding the concatenation into a scorer to obtain a matching score;
3) judging from the matching score whether the two sections of noise text to be matched match.
Further, the scorer comprises a scoring network consisting of fully connected layers.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method described above.
Compared with the prior art, the invention has the following advantages:
1. the method mainly adopts an attention mechanism to capture the matching mode between the noise texts, and can improve the time efficiency of the model through parallel computation.
2. The model is not a traditional time-sequence model: it is little influenced by the logical order of sentences in the noise text and remains effective on noise text pairs with disordered word order.
3. The scaled cosine attention mechanism effectively suppresses the interference of noise in the text through weighting and increases the influence of the semantically meaningful words, improving the matching quality on noise text.
4. In long-text matching, the attention mechanism avoids the representation difficulty and long-distance dependency problems that long documents cause in time-sequence models, and the matching effect is clearly superior to matching methods based on document representation.
5. A bidirectional matching mode is adopted: the matching pattern from noise text q to d is calculated while the matching degree from d to q is also considered, which avoids the transitive-matching problem.
Drawings
Fig. 1 is a flowchart of the multi-view interactive matching method for noise text according to the present invention.
Fig. 2 is a framework diagram of the multi-view interactive matching method for noise text according to the present invention.
Detailed Description
For the purpose of promoting an understanding of the principles, solutions and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings.
The method provided by the invention is suitable for matching tasks over text pairs containing noise and disordered word order. Its main idea is to use several attention mechanisms to increase the weight of key information, reduce the interference of noise and sentence order, capture the matching patterns of sentences during interaction, and finally score the matching result with a scoring network.
As shown in fig. 1 and fig. 2, the method mainly comprises three processes: internal interaction, external interaction and matching scoring. Before these three processes, the text pair must first be encoded. Given a noise query text q and a corresponding text d to be matched, the words in the noise text pair are first encoded with a pre-trained Word2vec model to obtain word vector sequences. The word vector sequences are then encoded a second time by a bidirectional LSTM neural network, producing vector encoding sequences whose dimension is determined by the number of LSTM hidden-layer units. Finally, each vector encoding of a sequence is mapped through the same residual network, so that the vector encoding sequences of the two texts are mapped to a unified semantic space.
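For illustration only, the following PyTorch sketch shows one way this encoding pipeline could be realized. The class name, the hidden size, and the two-layer residual mapping are assumptions, not part of the original disclosure; the pre-trained Word2vec weights are assumed to be exported as an embedding matrix.

```python
import torch
import torch.nn as nn

class NoiseTextEncoder(nn.Module):
    """Word2vec lookup -> BiLSTM second encoding -> shared residual mapping."""
    def __init__(self, embed_weights: torch.Tensor, hidden: int = 128):
        super().__init__()
        # embed_weights: (vocab, d_e) matrix exported from pre-trained Word2vec
        self.embed = nn.Embedding.from_pretrained(embed_weights, freeze=False)
        d_e = embed_weights.size(1)
        self.bilstm = nn.LSTM(d_e, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden  # output dim fixed by the number of LSTM hidden units
        self.residual = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        e = self.embed(token_ids)       # (batch, seq_len, d_e)
        h, _ = self.bilstm(e)           # (batch, seq_len, 2 * hidden)
        return h + self.residual(h)     # shared residual net maps q and d
                                        # into one semantic space

# The same encoder instance is applied to both texts q and d.
```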
The internal interaction module of the model mainly uses a scaled cosine similarity attention mechanism to capture the key information inside a sentence, suppressing the expression of secondary information by weighting the vector encoding sequences. After the internal interaction, the external interaction module, also using scaled cosine similarity, captures the interaction matching patterns between sentences from the results of the internal interaction module. Finally, the scorer gives a matching score according to the results of the internal and external interactions.
The internal interaction process is defined as a Self-Attention process; the model adopts a scaled cosine attention mechanism whose input is the vector encoding sequence of a text. For the noise text q, the internal interaction takes each word of the vector encoding sequence of q as the query item of the attention algorithm and weights the vector encodings of the other words in the sequence, yielding the self-interaction attention weights of q. The vector encoding of each word in the sequence is then weighted by these attention weights, so that the model attends differently to different word vectors, reducing the weight of noise words and raising the weight of key words. In the same way, the invention weights the vector encoding sequence of the candidate document d with the same scaled cosine attention mechanism. This way of computing attention avoids, to a certain extent, the influence of the sequential structure of the text: even if the positions of words within the same layer are permuted, the weights computed by attention for those positions remain unchanged. The specific calculation process is as follows:
q = [q_1; q_2; ...; q_n] = W_q · E_q
k = [k_1; k_2; ...; k_n] = W_k · E_q
α_ij = softmax_j( λ · (q_i · k_j) / (‖q_i‖ ‖k_j‖) )
I_q = [ Σ_j α_1j e_j ; Σ_j α_2j e_j ; ... ; Σ_j α_nj e_j ]
where E_q = [e_1; e_2; ...; e_n] is the vector encoding sequence of the noise text q, q and k are the query and key values of the scaled cosine attention, W_q and W_k are trainable parameter matrices, λ is the scaling factor, α_ij is the degree of correlation between the i-th and the j-th word, and I_q, the stacked and weighted word vector sequence of the noise text, is the final result of the self-interaction process. The self-interaction result I_d of the candidate document d is obtained in the same manner.
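As a minimal sketch of this self-interaction step, assuming PyTorch and a softmax-normalized scaled cosine similarity (the exact normalization and the value of the scaling factor λ are not spelled out in the source and are assumptions here):

```python
import torch
import torch.nn.functional as F

def scaled_cosine_self_attention(E: torch.Tensor,
                                 W_q: torch.Tensor,
                                 W_k: torch.Tensor,
                                 lam: float = 5.0) -> torch.Tensor:
    """E: (n, d) vector encoding sequence of one text; returns I of shape (n, d)."""
    q = E @ W_q                                   # projected queries, (n, d)
    k = E @ W_k                                   # projected keys,    (n, d)
    # pairwise cosine similarity between every query and every key
    sim = F.cosine_similarity(q.unsqueeze(1), k.unsqueeze(0), dim=-1)  # (n, n)
    alpha = F.softmax(lam * sim, dim=-1)          # scaled, row-normalized weights
    return alpha @ E                              # weighted, stacked sequence I
```

Because the similarity is computed pairwise over all positions, permuting the word order permutes rows and columns of `alpha` consistently but leaves each word's weights unchanged, which matches the order-insensitivity argument above.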
The external interaction module of the model is defined as a bidirectional attention interaction process, likewise realized with the scaled cosine attention algorithm. The input of the bidirectional interaction layer is the output of the self-interaction layer: the model computes bidirectional attention distributions over the two vector encoding sequences to construct the bidirectional interaction matrices, thereby finding the key words of the current word vector within the other noise text. Assuming the noise text q has length n and the corresponding candidate document d has length m, the model first uses each word vector of q to attend over the word vector sequence of d and takes the weighted sum of the attention results, obtaining n fixed-dimension interaction vectors whose dimension equals that of the vector encoding sequence of d. In the same way, using the word vectors of d to attend over and weight the word vector encoding sequence of q yields m interaction vectors with the same dimension as the vector encoding sequence of q. The detailed calculation process is as follows:
q = [q_1; q_2; ...; q_n] = W_q · I_q
k = [k_1; k_2; ...; k_m] = W_k · I_d
α_ij = softmax_j( λ · (q_i · k_j) / (‖q_i‖ ‖k_j‖) )
O = [o_1; o_2; ...; o_n],  o_i = Σ_j α_ij d_j
where I_q and I_d = [d_1; d_2; ...; d_m] are the self-interaction results of the text q and the candidate document d respectively, α_ij is the correlation score between the i-th word of the noise text q and the j-th word of the text d, and O represents the interaction result of the noise text q toward the candidate document d. The interaction result of the candidate document d toward the noise text q is obtained in the same manner.
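A sketch of the two directions, reusing the scaled cosine attention above; the weight matrices and λ are again illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def bidirectional_interaction(I_q, I_d, W_q, W_k, lam=5.0):
    """Returns q's view of d (n vectors) and d's view of q (m vectors)."""
    def attend(src, tgt):
        q = src @ W_q                             # (len_src, d)
        k = tgt @ W_k                             # (len_tgt, d)
        sim = F.cosine_similarity(q.unsqueeze(1), k.unsqueeze(0), dim=-1)
        alpha = F.softmax(lam * sim, dim=-1)      # (len_src, len_tgt)
        return alpha @ tgt                        # one weighted sum per src word

    O_qd = attend(I_q, I_d)                       # interaction vectors of q -> d
    O_dq = attend(I_d, I_q)                       # interaction vectors of d -> q
    return O_qd, O_dq
```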
After the bidirectional interaction results are obtained, they are concatenated, and the final matching score is produced by a fully connected scoring network, as follows:
f(p, e) = softmax( W_y [u_m; u_e] + b_y )
where u_m and u_e are fixed-dimension global feature vectors of dimension d_e, extracted by a single-layer convolutional neural network from the external interaction results of the noise text pair q and d. Here d_e denotes the total number of convolution kernels: the width of each kernel is consistent with the width of the external interaction result vectors, the kernel heights are 3, 5 and 7 with the same number of kernels per height, and a scalar is obtained by max-pooling the feature vector produced by each kernel. W_y and b_y are the weight matrix and the bias term of the fully connected network, and the scoring function computes the probability distribution over the labels of the noise text pair (q, d), where the label 0 or 1 indicates whether the two sections of noise text match.
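A minimal sketch of such a scorer, assuming PyTorch; the number of kernels per height is an assumption (the source fixes only the heights 3, 5 and 7 and that the counts are equal):

```python
import torch
import torch.nn as nn

class ConvScorer(nn.Module):
    """Multi-height 1-D convolutions + max pooling give a fixed global feature
    per text; a fully connected softmax layer scores the concatenated pair."""
    def __init__(self, d: int, kernels_per_height: int = 50,
                 heights: tuple = (3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, kernels_per_height, h) for h in heights])
        d_e = kernels_per_height * len(heights)   # total number of kernels
        self.fc = nn.Linear(2 * d_e, 2)           # labels: no-match / match

    def pool(self, O: torch.Tensor) -> torch.Tensor:
        x = O.t().unsqueeze(0)                    # (seq_len, d) -> (1, d, seq_len)
        feats = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)            # (1, d_e), one scalar per kernel

    def forward(self, O_qd, O_dq):
        u_m, u_e = self.pool(O_qd), self.pool(O_dq)
        return torch.softmax(self.fc(torch.cat([u_m, u_e], dim=1)), dim=-1)
```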
In the training process, the invention uses a hinge loss function to compute the error and back-propagates it through the whole network. Given noise text pairs, the training objective minimizes the hinge loss between positive and negative examples, with the margin between them set by a manually chosen threshold. The loss function is defined as:
L = Σ_{p ∈ P} max( 0, γ − f(p, e⁺) + f(p, e⁻) )
where P is the set of noise query texts, E is the set of texts to be matched, e⁺ ∈ E is the positive-example document corresponding to the noise query text p, e⁻ ∈ E is the corresponding negative-example document, and γ is the margin hyperparameter of the hinge loss function.
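In code, the pairwise hinge term is straightforward; the default margin below is an assumed value:

```python
import torch

def hinge_loss(score_pos: torch.Tensor,
               score_neg: torch.Tensor,
               gamma: float = 1.0) -> torch.Tensor:
    """Push each positive pair's score above its negative pair's by >= gamma."""
    return torch.clamp(gamma - score_pos + score_neg, min=0).mean()

# Usage with the ConvScorer sketch: take the 'match' probability as the score.
# loss = hinge_loss(scorer(Oq_pos, Od_pos)[:, 1], scorer(Oq_neg, Od_neg)[:, 1])
# loss.backward()   # back-propagate through the whole network
```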
In an embodiment of the present invention, given a noise text q and a candidate document d, the matching score between the two texts is computed as follows:
(1) Segment the noise texts q and d into words; after punctuation marks, stop words and low-frequency words are removed, the text sequences have lengths n and m respectively.
(2) Encode each word of the noise texts q and d with the Word2vec model, converting each word into a fixed vector of length d_e.
(3) Generate position vectors for the texts q and d using the position-vector generation method of the existing Bert model, with dimension d_m; concatenate the position vector with each word vector, adding position information to each word vector to obtain word vector sequences containing position information (a sketch of this step follows step (15) below).
(4) (Optional) Feed the position-concatenated word vectors into a bidirectional LSTM network and use the LSTM hidden state of each step as the encoding of the word vector. Adding this step improves matching accuracy on shorter texts.
(5) Feed each position-concatenated word vector of q and d (the result of step 3) or each LSTM output encoding (the result of step 4) into the same residual network, which maps words from different texts into the same vector representation space; the output of the residual network is two fixed-dimension vector sequences of lengths n and m, the lengths of the vector encoding sequences of q and d respectively.
(6) Using the scaled cosine attention algorithm, compute the Self-Attention weights of the fixed-dimension vector corresponding to each word in q and d.
(7) Take the dot product of the Self-Attention weights from the previous step with the residual-mapped fixed-dimension vector sequences to obtain weighted word vectors.
(8) Feed the weighted word vectors obtained by self-interaction into a new residual network as the abstract representation of the self-interaction layer.
(9) Perform an L2 regularization operation on the output of the self-interaction residual network to obtain the output of the self-interaction layer, which is still two vector sequences of lengths n and m; the dimension of the output vectors stays consistent with that of the word vectors, (d_e + d_m), where d_m is the dimension of the position-encoding vector.
(10) Take each output vector that the noise text q obtains from the self-interaction layer as a query over the output vector sequence of text d and attend, obtaining the attention weight of each word of q over the word vector sequence of d.
(11) Take the dot product of the attention weights from the previous step with the self-interacted word sequence of the candidate document d, weighting the word sequence of d, and accumulate the weighted results; each word vector of the noise text q thus yields a fixed-length interaction vector.
(12) Feed each interaction vector into the same residual network and perform an L2 regularization operation on its output to obtain a vector sequence of length n as the external interaction result from the noise text q to the candidate document d.
(13) Likewise, take each word vector of the candidate document d as a query over the noise text q and attend; the external interaction result from d to q is then a vector sequence of length m.
(14) Encode the vector sequences obtained through the external interaction with a convolutional neural network, finally obtaining two fixed-length interaction vectors.
(15) Concatenate the two interaction vectors, feed them into a fully connected neural network, compute the probability distribution over the class labels with a Softmax function, and take the label with the largest probability as the final matching result; the probability of that label is the confidence of the matching result.
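As referenced in step (3), a sketch of concatenating position vectors onto word vectors; using a learned position-embedding table, as Bert does, is an assumption here (the source does not spell the generation method out):

```python
import torch
import torch.nn as nn

class ConcatPositionEncoding(nn.Module):
    """Look up a learned position vector of dim d_m for each position and
    concatenate it onto the word vector (dim d_e), giving d_e + d_m."""
    def __init__(self, max_len: int, d_m: int):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_m)

    def forward(self, word_vecs: torch.Tensor) -> torch.Tensor:
        batch, n, _ = word_vecs.shape
        idx = torch.arange(n, device=word_vecs.device)
        p = self.pos(idx).unsqueeze(0).expand(batch, n, -1)
        return torch.cat([word_vecs, p], dim=-1)  # (batch, n, d_e + d_m)
```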
The above embodiments are intended only to better illustrate the objects, principles, technical solutions and advantages of the present invention. It should be understood that they are merely exemplary and are not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (10)

1. A multi-view interactive matching method for noise text comprises the following steps:
1) encoding two sections of noise text to be matched to obtain two sections of encoding vector sequences, and adding position information to each encoding vector of the two sequences;
2) performing internal interaction on the two position-encoded vector sequences to obtain two sections of internal interaction results, whose dimensions are consistent with those of the encoding vector sequences;
3) performing external interaction on the two sections of internal interaction results by calculating bidirectional attention distributions, and constructing two bidirectional noise text interaction matrices;
4) concatenating the two noise text interaction matrices and judging whether the two sections of noise text to be matched match.
2. The method of claim 1, wherein the two sections of noise text to be matched are preprocessed before encoding, the preprocessing comprising removing punctuation marks, stop words and low-frequency words.
3. The method of claim 1, wherein the two sections of noise text are encoded using a pre-trained Word2vec or Bert model, and position information is added to each encoding vector of the two encoding vector sequences using the position-vector generation method of the Bert model.
4. The method of claim 1, wherein before the internal interaction, the two position-encoded vector sequences are mapped to a unified semantic space by:
1) feeding the two position-encoded vector sequences into a bidirectional LSTM neural network for secondary encoding to obtain two final vector encoding sequences;
2) mapping each vector encoding of the two final sequences through the same residual network, so that the two position-encoded vector sequences are mapped to a unified semantic space.
5. The method of claim 4, wherein the vector dimension of the final vector encoding sequences is determined by the number of hidden-layer units of the second LSTM encoding layer in the bidirectional LSTM neural network.
6. The method of claim 1, wherein the internal interaction results are obtained by:
1) feeding the two position-encoded vector sequences into a first residual network;
2) taking each position-encoded noise encoding vector as a query item, performing internal interaction with the other vector encodings of the same noise text through a scaled cosine attention algorithm, and calculating self-attention weights;
3) combining the position-encoded vector sequence with the attention weights to obtain a weighted encoding vector sequence;
4) feeding the weighted encoding vector sequence into a second residual network to obtain an abstract vector representation of the noise text;
5) performing an L2 regularization operation on each abstract vector representation to obtain the internal interaction result of the corresponding position-encoded vector sequence.
7. The method of claim 1, wherein a noise text interaction matrix is constructed by:
1) taking each encoding vector of one section's internal interaction result as a query and performing cosine-similarity-attention external interaction with the internal interaction result of the other section of text, obtaining the attention weight distribution of the current encoding vector over the other encoding sequence;
2) weighting the corresponding internal interaction result with the attention weight distribution to obtain an external interaction vector for each encoding vector;
3) feeding the vector sequence obtained from the external interaction into a third residual network and performing an L2 regularization operation on its output to obtain a bidirectional noise interaction matrix.
8. The method of claim 1, wherein whether the two sections of noise text to be matched match is judged by:
1) obtaining the concatenation of the two noise text interaction matrices;
2) feeding the concatenation into a scorer to obtain a matching score, the scorer comprising a scoring network consisting of fully connected layers;
3) judging from the matching score whether the two sections of noise text to be matched match.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202011456860.8A 2020-12-10 2020-12-10 Multi-view interactive matching method for noise text and electronic device Active CN114626425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011456860.8A CN114626425B (en) 2020-12-10 2020-12-10 Multi-view interactive matching method for noise text and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011456860.8A CN114626425B (en) 2020-12-10 2020-12-10 Multi-view interactive matching method for noise text and electronic device

Publications (2)

Publication Number Publication Date
CN114626425A true CN114626425A (en) 2022-06-14
CN114626425B CN114626425B (en) 2024-11-08

Family

ID=81894894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011456860.8A Active CN114626425B (en) 2020-12-10 2020-12-10 Multi-view interactive matching method for noise text and electronic device

Country Status (1)

Country Link
CN (1) CN114626425B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190349321A1 (en) * 2018-05-10 2019-11-14 Royal Bank Of Canada Machine natural language processing for summarization and sentiment analysis
KR20200071821A (en) * 2018-11-30 2020-06-22 고려대학교 산학협력단 Detection metohd of fake news using grammatic transformation on neural network, computer readable medium and apparatus for performing the method
CN110532353A (en) * 2019-08-27 2019-12-03 海南阿凡题科技有限公司 Text entities matching process, system, device based on deep learning
CN111160568A (en) * 2019-12-27 2020-05-15 北京百度网讯科技有限公司 Machine reading understanding model training method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pang Liang; Lan Yanyan; Xu Jun; Guo Jiafeng; Wan Shengxian; Cheng Xueqi: "A Survey on Deep Text Matching", Chinese Journal of Computers, no. 04, 20 September 2019 (2019-09-20) *

Also Published As

Publication number Publication date
CN114626425B (en) 2024-11-08

Similar Documents

Publication Publication Date Title
CN111783462B (en) Chinese named entity recognition model and method based on double neural network fusion
US11631007B2 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN112800776B (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN109977861B (en) Off-line handwriting mathematical formula recognition method
CN112989834A (en) Named entity identification method and system based on flat grid enhanced linear converter
CN112232053B (en) Text similarity computing system, method and storage medium based on multi-keyword pair matching
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
KR102379660B1 (en) Method for utilizing deep learning based semantic role analysis
CN111581392B (en) Automatic composition scoring calculation method based on statement communication degree
CN111125367A (en) Multi-character relation extraction method based on multi-level attention mechanism
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN113434682B (en) Text emotion analysis method, electronic device and storage medium
CN114742069A (en) Code similarity detection method and device
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
CN114881042A (en) Chinese emotion analysis method based on graph convolution network fusion syntax dependence and part of speech
CN110276396A (en) Picture based on object conspicuousness and cross-module state fusion feature describes generation method
CN114780677B (en) Chinese event extraction method based on feature fusion
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN112528168A (en) Social network text emotion analysis method based on deformable self-attention mechanism
CN114626425B (en) Multi-view interactive matching method for noise text and electronic device
CN113536797B (en) Method and system for extracting key information sheet model of slice document
CN112507081A (en) Similar sentence matching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant