CN110765768A - Optimized text abstract generation method - Google Patents
Optimized text abstract generation method
- Publication number: CN110765768A
- Application number: CN201910981470.3A
- Authority: CN (China)
- Prior art keywords: cnn, text, decoder, extracted, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
An optimized text abstract generation method belongs to the field of natural language generation, and particularly relates to sequence-to-sequence text summary generation. First, the Chinese data is preprocessed by cleaning and similar steps; the article is fed to an AS-CNN model at the Encoder end to extract features, and the features are then fed to a Decoder end composed of Transformer layers. The network not only exploits the parallelism of the CNN and the Transformer, making full use of the hardware and accelerating training, but also, by using a CNN at the Encoder end, reduces the model parameters, avoids over-fitting, and widens the model's range of application.
Description
The technical field is as follows:
The invention belongs to the field of natural language generation, and particularly relates to a sequence-to-sequence text summary generation method.
Background art:
With the rapid development of information technology, an explosion of information is reshaping people's lives. On the one hand, the internet now hosts a huge number of web pages and texts, but documents on related topics contain a great deal of redundant content, and reading through such repetition costs readers considerable time and energy. On the other hand, social development has quickened the pace of life, and increasingly fragmented time drives people to obtain content through the internet rather than from traditional paper materials such as books. How to extract the main content from large volumes of text has therefore become a hot topic of academic research.
Many scholars at home and abroad have deep insight into text summarization, and a number of usable text summarization technologies have been proposed. The earliest work proposed Extractive Text Summarization (ETS), which mainly uses traditional statistical methods to extract the passages that best summarize the subject matter of the content. Although this approach can capture the primary content to some extent, its main problem is that the extracted summary may be semantically incoherent. Subsequently, researchers proposed Abstractive Text Summarization (ATS), which effectively solves the semantic incoherence of summaries generated by the ETS method. ATS uses deep learning (DL), simulating human writing habits with a neural network that is trained to generate the text summary. The classic architecture in neural network technology is Sequence-to-Sequence (Seq2Seq), first proposed by Cho et al., which consists of an Encoder that encodes the source text input and a Decoder that decodes and outputs the target text. This architecture was originally based on a Recurrent Neural Network (RNN); because input and output are sequential, training cannot be parallelized and is time-consuming. Jonas et al. therefore proposed a Seq2Seq model based on a Convolutional Neural Network (CNN) to speed up training. However, convolutional networks are weaker at encoding sequential language information; in 2017, the Transformer model proposed by Ashish et al. was shown to both process language information well and train in parallel. Yet the Transformer is a self-attention model with a 6-layer Encoder and a 6-layer Decoder; it has many parameters and an oversized overall structure, making it ill-suited to efficient laboratory research.
Disclosure of Invention
The invention mainly solves the technical problem of reducing the parameters of the Encoder module and increasing training speed without affecting performance. A CNN model suited to text summarization is provided: a CNN for abstractive summarization (AS-CNN), improved from the TextCNN proposed by Yoon, whose encoding result is sent to the Decoder module of a Transformer for summary generation.
The invention provides a method for quickly training a summary generation model on massive text data. Spaces and special characters are removed from the text data, the text is cleaned according to word frequency, and a dictionary is then built, whose keys are words and whose values are the corresponding word ids. The article to be processed is converted to ids according to this dictionary; a word vector matrix is initialized at the model's Embedding layer, and the word vector of each word is then looked up by its id. The word vectors are sent to the Encoder end of the model for feature extraction. Different models generate very different numbers of parameters at this stage, and for some models the parameter count grows exponentially, placing heavy demands on computing hardware; the invention therefore replaces the feature extraction method at the Encoder stage, reducing the number of model parameters while still obtaining rich features.
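A minimal sketch may make this preprocessing stage concrete. The cleaning regex, the frequency threshold, the toy corpus, and all function names below are illustrative assumptions rather than details fixed by the invention:

```python
import re
from collections import Counter

import torch.nn as nn

def build_vocab(texts, min_freq=5):
    """Remove special characters, count word frequencies, and build the
    dictionary whose keys are words and whose values are word ids."""
    counter = Counter()
    for text in texts:
        cleaned = re.sub(r"[^\w]+", " ", text)  # strip spaces / special characters
        counter.update(cleaned.split())
    vocab = {"<pad>": 0, "<unk>": 1}            # reserved ids
    for word, freq in counter.most_common():
        if freq >= min_freq:                    # clean the text by frequency
            vocab[word] = len(vocab)
    return vocab

def text_to_ids(text, vocab):
    """Convert a (pre-segmented) article into ids via the dictionary."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

# Embedding layer: a randomly initialized word vector matrix, indexed by id.
vocab = build_vocab(["今天 天气 很 好"] * 6)     # toy corpus of segmented text
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)
```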
In order to achieve this purpose, the invention adopts the following technical scheme. To avoid an excessive parameter count at the Encoder end during the feature extraction stage while preserving parallel training, the AS-CNN algorithm is adopted as the Encoder end of the model, and effective text features are extracted with convolution kernels of different sizes chosen according to article length. The extracted text features are then input to the Decoder end, which adopts the self-attention mechanism of the Transformer model, retaining the Transformer's strength in text generation while reducing the parameter count. This yields a text abstract generation framework based on the AS-CNN and the Transformer architecture.
A method for optimized text summary generation comprises the following steps:
Step 1, acquiring the text data for which a summary is to be generated, and performing the necessary text processing and word segmentation.
Step 2, constructing a dictionary for the processed text, setting the word vector dimension, and randomly initializing all word vectors, with each word assigned a unique id.
Step 3, feeding the article's input vectors to the AS-CNN at the model's Encoder end for feature extraction.
Step 4, sending the feature vectors extracted by the AS-CNN to the Decoder end of a Transformer for decoding, to generate an abstract of the article.
Preferably, step 3 specifically comprises the following steps (a code sketch of these steps follows the list):
Step 3.1, setting the convolution kernel sizes and the number of kernels of each size according to the article length;
Step 3.2, extracting sentence features of different lengths with the different convolution kernels;
Step 3.3, padding the sentence features of different lengths so that the sentence lengths agree, generally taking the longest sentence length as the standard;
Step 3.4, fusing the features extracted by the different convolution kernels;
Step 3.5, mapping the fused feature vectors through a fully connected network.
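A minimal PyTorch sketch of steps 3.1-3.5 is given below. The class name, the ReLU activation, and the default sizes are assumptions for illustration; the patent fixes only the overall structure (multi-size convolutions, padding, fusion, fully connected mapping):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASCNNEncoder(nn.Module):
    def __init__(self, embed_dim=300, kernel_sizes=(2, 3, 4), num_filters=512, out_dim=512):
        super().__init__()
        # Step 3.1: one 1-D convolution per kernel size, num_filters kernels each
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        # Step 3.5: fully connected mapping of the fused features
        self.fc = nn.Linear(num_filters * len(kernel_sizes), out_dim)

    def forward(self, x):                       # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, channels, seq)
        # Step 3.2: features of different lengths from different kernels
        feats = [F.relu(conv(x)) for conv in self.convs]
        # Step 3.3: pad every feature map to the longest length
        max_len = max(f.size(2) for f in feats)
        feats = [F.pad(f, (0, max_len - f.size(2))) for f in feats]
        # Step 3.4: fuse by concatenating along the channel dimension
        fused = torch.cat(feats, dim=1)         # (batch, filters * n_sizes, max_len)
        # Step 3.5: map back to the model dimension
        return self.fc(fused.transpose(1, 2))   # (batch, max_len, out_dim)
```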
Preferably, step 4 specifically comprises the following steps (a code sketch follows the list):
Step 4.1, converting the dimensions of the text feature vectors extracted by the AS-CNN so that they can be input to the Decoder end of the Transformer;
Step 4.2, using the AS-CNN feature vectors as the key and value matrices in the Decoder-end self-attention mechanism, computing the attention weights, and applying them to the query matrix input at the Decoder end;
Step 4.3, from the semantic vector produced by the Decoder's decoding, finding the word to be generated through a Softmax layer.
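The sketch below reduces steps 4.1-4.3 to their core: cross-attention with the AS-CNN features as keys and values, followed by the Softmax output layer. A full Transformer Decoder also contains masked self-attention and feed-forward sublayers, which are omitted here; names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class DecoderHead(nn.Module):
    def __init__(self, d_model=512, num_heads=8, vocab_size=50000):
        super().__init__()
        # Step 4.2: attention whose keys/values come from the AS-CNN features
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Step 4.3: projection to the vocabulary, followed by Softmax
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_input, encoder_feats):
        # Queries come from the Decoder-end input (step 4.2)
        ctx, _ = self.cross_attn(decoder_input, encoder_feats, encoder_feats)
        return torch.softmax(self.out(ctx), dim=-1)  # per-position word probabilities
```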
Compared with the prior art, the invention has the following obvious advantages:
When generating the text abstract, AS-CNN is used to extract the text feature information, and the summary is then generated by a self-attention mechanism. Compared with other methods, this has two advantages. First, the Encoder end extracts features with AS-CNN rather than with a Transformer self-attention stack or a recurrent neural network; this change can cut the parameter count to one-hundredth or even one-thousandth of the original, saving hardware memory and markedly raising the iteration speed, while also making full use of the hardware to accelerate training. Second, the convolution kernel sizes of the AS-CNN can be chosen freely, which helps address long-text dependency. In summary, the abstract generation method based on AS-CNN and Transformer provided by the invention accelerates training, reduces model parameters, and handles long-text dependency.
Description of the drawings:
FIG. 1 is a flow chart of the method according to the present invention.
FIG. 2 is a schematic diagram of the AS-CNN module.
FIG. 3 is a schematic diagram of the interaction between the AS-CNN and the Transformer Decoder module.
The specific implementation mode is as follows:
The invention is described in further detail below with reference to specific network model examples and the accompanying drawings.
The hardware used by the invention comprises one PC (personal computer) and one 1080 graphics card.
In this section, we conducted extensive experiments to investigate the effect of the proposed method. The operation flow of the network architecture designed by the invention is shown in FIG. 1 and specifically comprises the following steps:
Step 1, processing the text data set: removing special symbols, removing low-frequency words by word frequency, and constructing the dictionary used for training; the keys of the dictionary are words and the values are the word ids.
Step 2, randomly initializing the Embedding layer matrix and selecting the word vector of each word according to its id in the dictionary.
Step 3, as shown in FIG. 2, selecting convolution kernels of different sizes to extract text features, with 512 kernels of each size.
Step 3.1, text of dimension 7 × 300 is input, where the sentence length is 7 and the word vector dimension is 300.
Step 3.2, convolution kernels of three sizes are selected, namely 4 × 300, 3 × 300, and 2 × 300, with 512 kernels of each size.
Step 3.3, taking the 4 × 300 kernels as an example, one kernel extracts a feature of dimension 4 × 1, so 512 kernels extract features of dimension 4 × 512; the 3 × 300 kernels extract features of dimension 5 × 512; the 2 × 300 kernels extract features of dimension 6 × 512.
Step 3.4, the features extracted by kernels of different sizes are padded to the same dimension, 6 × 512, and fused to obtain feature vectors of dimension 6 × 1536.
Step 3.5, the convolution-extracted features are mapped to dimension 6 × 512 with a fully connected network (a shape check for this example follows these steps).
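Running the ASCNNEncoder sketch shown after the step-3 list above on this worked example reproduces the stated dimensions (shapes only, the weights being random):

```python
import torch

enc = ASCNNEncoder(embed_dim=300, kernel_sizes=(2, 3, 4), num_filters=512, out_dim=512)
x = torch.randn(1, 7, 300)          # one sentence: length 7, 300-d word vectors
assert enc(x).shape == (1, 6, 512)  # 4x512 / 5x512 / 6x512 -> pad -> 6x1536 -> FC -> 6x512
```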
Step 4, the features extracted by the AS-CNN are sent to the Decoder end of the Transformer, where they serve as the keys and values of the self-attention model for computing the attention weights.
Step 5, the network model is trained, the quality of the generated summaries is evaluated with the BLEU metric, and both summary quality and the number of model parameters are compared against a native Transformer to reach the final conclusion.
Step 5.1, the network model is trained until the loss converges on the validation set; the loss function used is the cross-entropy loss (Cross Entropy Loss):
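The formula itself does not survive in this text. For sequence generation, the token-level cross-entropy loss takes the standard form

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T}\log p\left(y_t \mid y_{<t}, x\right),$$

where $x$ is the source article, $y_t$ is the t-th token of the reference summary, and $T$ is the summary length.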
As shown in FIG. 3, which illustrates the interaction between the AS-CNN and the Decoder end: the AS-CNN extracts text features that serve as the keys and values of the self-attention model and sends them to the Decoder end, the Decoder-end input serves as the query, and attention is computed over the three to form the final decoding vector.
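The attention calculation referred to here is, in the standard Transformer formulation, scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$ is formed from the Decoder-end input, $K$ and $V$ from the AS-CNN features, and $d_k$ is the key dimension.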
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit it; the scope of the present invention is defined by the claims. Those skilled in the art may make various modifications and equivalents within the spirit and scope of the present invention, and such modifications and equivalents should also be considered to fall within its scope.
Claims (3)
1. A method for optimizing text summary generation, comprising the steps of:
step 1, acquiring related text data needing to generate an abstract, and processing the text data;
step 2, constructing a relevant dictionary for the processed text, setting word vector dimensions and randomly initializing all word vectors, wherein each word corresponds to a unique id;
step 3, feeding the article's input vectors to the AS-CNN at the model's Encoder end for feature extraction;
and 4, sending the feature vectors extracted by the AS-CNN to the Decoder end of a Transformer for decoding, to generate an abstract of the article.
2. The method according to claim 1, characterized in that step 3 specifically comprises the following steps:
step 3.1, setting the convolution kernel sizes and the number of kernels of each size according to the article length;
step 3.2, extracting sentence features of different lengths with the different convolution kernels;
step 3.3, padding the sentence features of different lengths so that the sentence lengths agree, taking the longest sentence length as the standard;
step 3.4, fusing the features extracted by the different convolution kernels;
and 3.5, mapping the fused feature vectors through a fully connected network.
3. The method according to claim 1, characterized in that step 4 specifically comprises the following steps:
step 4.1, converting the dimensions of the text feature vectors extracted by the AS-CNN so that they can be input to the Decoder end of the Transformer;
step 4.2, using the AS-CNN feature vectors as the key and value matrices in the Decoder-end self-attention mechanism, computing the attention weights, and applying them to the query matrix input at the Decoder end;
and 4.3, from the semantic vector produced by the Decoder's decoding, finding the word to be generated through a Softmax layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910981470.3A CN110765768A (en) | 2019-10-16 | 2019-10-16 | Optimized text abstract generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110765768A true CN110765768A (en) | 2020-02-07 |
Family
ID=69331275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910981470.3A Pending CN110765768A (en) | 2019-10-16 | 2019-10-16 | Optimized text abstract generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110765768A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180300400A1 (en) * | 2017-04-14 | 2018-10-18 | Salesforce.Com, Inc. | Deep Reinforced Model for Abstractive Summarization |
CN109492232A (en) * | 2018-10-22 | 2019-03-19 | 内蒙古工业大学 | A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer |
CN109885673A (en) * | 2019-02-13 | 2019-06-14 | 北京航空航天大学 | A kind of Method for Automatic Text Summarization based on pre-training language model |
Non-Patent Citations (1)
Title |
---|
SHENGLI SONG et al.: "Abstractive text summarization using LSTM-CNN based deep learning" |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733498A (en) * | 2020-11-06 | 2021-04-30 | 北京工业大学 | Method for improving automatic Chinese text summarization self-attention calculation |
CN112733498B (en) * | 2020-11-06 | 2024-04-16 | 北京工业大学 | Method for improving self-attention calculation of Chinese automatic text abstract |
CN113449489A (en) * | 2021-07-22 | 2021-09-28 | 深圳追一科技有限公司 | Punctuation mark marking method, punctuation mark marking device, computer equipment and storage medium |
CN113449489B (en) * | 2021-07-22 | 2023-08-08 | 深圳追一科技有限公司 | Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium |
CN117763140A (en) * | 2024-02-22 | 2024-03-26 | 神州医疗科技股份有限公司 | Accurate medical information conclusion generation method based on computing feature network |
CN117763140B (en) * | 2024-02-22 | 2024-05-28 | 神州医疗科技股份有限公司 | Accurate medical information conclusion generation method based on computing feature network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||

Application publication date: 20200207