CN110765768A - Optimized text abstract generation method - Google Patents
Optimized text abstract generation method
- Publication number: CN110765768A
- Application number: CN201910981470.3A
- Authority: CN (China)
- Prior art keywords: cnn, text, decoder, extracted, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
An optimized text abstract generation method belongs to the field of natural language generation, and particularly relates to sequence-to-sequence text summary generation. First, the Chinese data is preprocessed by cleaning and similar steps; the article is fed to an AS-CNN model at the Encoder end to extract features, and the features are then fed to a Decoder end composed of Transformer layers. The network not only exploits the parallelism of the CNN and the Transformer, making full use of the hardware and accelerating training, but also, by using a CNN at the Encoder end, reduces the model parameters, avoids over-fitting, and widens the model's range of application.
Description
The technical field is as follows:
The invention belongs to the field of natural language generation, and particularly relates to a sequence-to-sequence text summary generation method.
Background art:
With the rapid development of information technology, an explosion of information is reshaping people's lives. On the one hand, the internet now hosts a huge number of web pages and texts, but documents on related topics contain a great deal of redundant content, and reading through such repetition costs readers considerable time and energy. On the other hand, social development has quickened the pace of life, and increasingly fragmented time drives people to obtain content through the internet rather than from traditional paper materials such as books. How to extract the main content from large volumes of text has therefore become a hot topic of academic research.
Many scholars at home and abroad have deep insight into text summarization, and a number of usable text summarization technologies have been proposed. The earliest work proposed Extractive Text Summarization (ETS), which mainly uses traditional statistical methods to extract the passages that best summarize the subject matter of the content. Although this approach can capture the primary content to some extent, its main problem is that the extracted summary may be semantically incoherent. Subsequently, researchers proposed Abstractive Text Summarization (ATS), which effectively solves the semantic incoherence of summaries generated by the ETS method. ATS uses deep learning (DL), simulating human writing habits with a neural network that is trained to generate the text summary. The classic architecture in neural network technology is Sequence-to-Sequence (Seq2Seq), first proposed by Cho et al., which consists of an Encoder that encodes the source text input and a Decoder that decodes and outputs the target text. This architecture was originally based on a Recurrent Neural Network (RNN); because input and output are sequential, training cannot be parallelized and is time-consuming. Jonas et al. therefore proposed a Seq2Seq model based on a Convolutional Neural Network (CNN) to speed up training. However, convolutional networks are weaker at encoding sequential language information; in 2017, the Transformer model proposed by Ashish et al. was shown to both process language information well and train in parallel. Yet the Transformer is a self-attention model with a 6-layer Encoder and a 6-layer Decoder; it has many parameters and an oversized overall structure, making it ill-suited to efficient laboratory research.
Disclosure of Invention
The invention mainly solves the technical problem of reducing the parameters of the Encoder module and increasing training speed without affecting performance. A CNN model suited to text summarization is provided: a CNN for abstractive summarization (AS-CNN), improved from the TextCNN proposed by Yoon, whose encoding result is sent to the Decoder module of a Transformer for summary generation.
The invention provides a method for quickly training a summary generation model on massive text data. Spaces and special characters are removed from the text data, the text is cleaned according to word frequency, and a dictionary is then built, whose keys are words and whose values are the corresponding word ids. The article to be processed is converted to ids according to this dictionary; a word vector matrix is initialized at the model's Embedding layer, and the word vector of each word is then looked up by its id. The word vectors are sent to the Encoder end of the model for feature extraction. Different models generate very different numbers of parameters at this stage, and for some models the parameter count grows exponentially, placing heavy demands on computing hardware; the invention therefore replaces the feature extraction method at the Encoder stage, reducing the number of model parameters while still obtaining rich features.
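A minimal sketch may make this preprocessing stage concrete. The cleaning regex, the frequency threshold, the toy corpus, and all function names below are illustrative assumptions rather than details fixed by the invention:

```python
import re
from collections import Counter

import torch.nn as nn

def build_vocab(texts, min_freq=5):
    """Remove special characters, count word frequencies, and build the
    dictionary whose keys are words and whose values are word ids."""
    counter = Counter()
    for text in texts:
        cleaned = re.sub(r"[^\w]+", " ", text)  # strip spaces / special characters
        counter.update(cleaned.split())
    vocab = {"<pad>": 0, "<unk>": 1}            # reserved ids
    for word, freq in counter.most_common():
        if freq >= min_freq:                    # clean the text by frequency
            vocab[word] = len(vocab)
    return vocab

def text_to_ids(text, vocab):
    """Convert a (pre-segmented) article into ids via the dictionary."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

# Embedding layer: a randomly initialized word vector matrix, indexed by id.
vocab = build_vocab(["今天 天气 很 好"] * 6)     # toy corpus of segmented text
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)
```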
In order to achieve this purpose, the invention adopts the following technical scheme. To avoid an excessive parameter count at the Encoder end during the feature extraction stage while preserving parallel training, the AS-CNN algorithm is adopted as the Encoder end of the model, and effective text features are extracted with convolution kernels of different sizes chosen according to article length. The extracted text features are then input to the Decoder end, which adopts the self-attention mechanism of the Transformer model, retaining the Transformer's strength in text generation while reducing the parameter count. This yields a text abstract generation framework based on the AS-CNN and the Transformer architecture.
A method for optimized text summary generation comprises the following steps:
Step 1, acquiring the text data for which a summary is to be generated, and performing the necessary text processing and word segmentation.
Step 2, constructing a dictionary for the processed text, setting the word vector dimension, and randomly initializing all word vectors, with each word assigned a unique id.
Step 3, feeding the article's input vectors to the AS-CNN at the model's Encoder end for feature extraction.
Step 4, sending the feature vectors extracted by the AS-CNN to the Decoder end of a Transformer for decoding, to generate an abstract of the article.
Preferably, step 3 specifically comprises the following steps (a code sketch of these steps follows the list):
Step 3.1, setting the convolution kernel sizes and the number of kernels of each size according to the article length;
Step 3.2, extracting sentence features of different lengths with the different convolution kernels;
Step 3.3, padding the sentence features of different lengths so that the sentence lengths agree, generally taking the longest sentence length as the standard;
Step 3.4, fusing the features extracted by the different convolution kernels;
Step 3.5, mapping the fused feature vectors through a fully connected network.
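A minimal PyTorch sketch of steps 3.1-3.5 is given below. The class name, the ReLU activation, and the default sizes are assumptions for illustration; the patent fixes only the overall structure (multi-size convolutions, padding, fusion, fully connected mapping):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASCNNEncoder(nn.Module):
    def __init__(self, embed_dim=300, kernel_sizes=(2, 3, 4), num_filters=512, out_dim=512):
        super().__init__()
        # Step 3.1: one 1-D convolution per kernel size, num_filters kernels each
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        # Step 3.5: fully connected mapping of the fused features
        self.fc = nn.Linear(num_filters * len(kernel_sizes), out_dim)

    def forward(self, x):                       # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, channels, seq)
        # Step 3.2: features of different lengths from different kernels
        feats = [F.relu(conv(x)) for conv in self.convs]
        # Step 3.3: pad every feature map to the longest length
        max_len = max(f.size(2) for f in feats)
        feats = [F.pad(f, (0, max_len - f.size(2))) for f in feats]
        # Step 3.4: fuse by concatenating along the channel dimension
        fused = torch.cat(feats, dim=1)         # (batch, filters * n_sizes, max_len)
        # Step 3.5: map back to the model dimension
        return self.fc(fused.transpose(1, 2))   # (batch, max_len, out_dim)
```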
Preferably, step 4 specifically comprises the following steps (a code sketch follows the list):
Step 4.1, converting the dimensions of the text feature vectors extracted by the AS-CNN so that they can be input to the Decoder end of the Transformer;
Step 4.2, using the AS-CNN feature vectors as the key and value matrices in the Decoder-end self-attention mechanism, computing the attention weights, and applying them to the query matrix input at the Decoder end;
Step 4.3, from the semantic vector produced by the Decoder's decoding, finding the word to be generated through a Softmax layer.
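The sketch below reduces steps 4.1-4.3 to their core: cross-attention with the AS-CNN features as keys and values, followed by the Softmax output layer. A full Transformer Decoder also contains masked self-attention and feed-forward sublayers, which are omitted here; names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class DecoderHead(nn.Module):
    def __init__(self, d_model=512, num_heads=8, vocab_size=50000):
        super().__init__()
        # Step 4.2: attention whose keys/values come from the AS-CNN features
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Step 4.3: projection to the vocabulary, followed by Softmax
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_input, encoder_feats):
        # Queries come from the Decoder-end input (step 4.2)
        ctx, _ = self.cross_attn(decoder_input, encoder_feats, encoder_feats)
        return torch.softmax(self.out(ctx), dim=-1)  # per-position word probabilities
```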
Compared with the prior art, the invention has the following obvious advantages:
When generating the text abstract, AS-CNN is used to extract the text feature information, and the summary is then generated by a self-attention mechanism. Compared with other methods, this has two advantages. First, the Encoder end extracts features with AS-CNN rather than with a Transformer self-attention stack or a recurrent neural network; this change can cut the parameter count to one-hundredth or even one-thousandth of the original, saving hardware memory and markedly raising the iteration speed, while also making full use of the hardware to accelerate training. Second, the convolution kernel sizes of the AS-CNN can be chosen freely, which helps address long-text dependency. In summary, the abstract generation method based on AS-CNN and Transformer provided by the invention accelerates training, reduces model parameters, and handles long-text dependency.
Description of the drawings:
FIG. 1 is a flow chart of the method according to the present invention.
FIG. 2 is a schematic diagram of the AS-CNN module.
FIG. 3 is a schematic diagram of the interaction between the AS-CNN and the Transformer Decoder module.
The specific implementation mode is as follows:
The invention is described in further detail below with reference to specific network model examples and the accompanying drawings.
The hardware used by the invention comprises one PC (personal computer) and one 1080 graphics card.
In this section, we conducted extensive experiments to investigate the effect of the proposed method. The operation flow of the network architecture designed by the invention is shown in FIG. 1 and specifically comprises the following steps:
Step 1, processing the text data set: removing special symbols, removing low-frequency words by word frequency, and constructing the dictionary used for training; the keys of the dictionary are words and the values are the word ids.
Step 2, randomly initializing the Embedding layer matrix and selecting the word vector of each word according to its id in the dictionary.
Step 3, as shown in FIG. 2, selecting convolution kernels of different sizes to extract text features, with 512 kernels of each size.
Step 3.1, text of dimension 7 × 300 is input, where the sentence length is 7 and the word vector dimension is 300.
Step 3.2, convolution kernels of three sizes are selected, namely 4 × 300, 3 × 300, and 2 × 300, with 512 kernels of each size.
Step 3.3, taking the 4 × 300 kernels as an example, one kernel extracts a feature of dimension 4 × 1, so 512 kernels extract features of dimension 4 × 512; the 3 × 300 kernels extract features of dimension 5 × 512; the 2 × 300 kernels extract features of dimension 6 × 512.
Step 3.4, the features extracted by kernels of different sizes are padded to the same dimension, 6 × 512, and fused to obtain feature vectors of dimension 6 × 1536.
Step 3.5, the convolution-extracted features are mapped to dimension 6 × 512 with a fully connected network (a shape check for this example follows these steps).
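Running the ASCNNEncoder sketch shown after the step-3 list above on this worked example reproduces the stated dimensions (shapes only, the weights being random):

```python
import torch

enc = ASCNNEncoder(embed_dim=300, kernel_sizes=(2, 3, 4), num_filters=512, out_dim=512)
x = torch.randn(1, 7, 300)          # one sentence: length 7, 300-d word vectors
assert enc(x).shape == (1, 6, 512)  # 4x512 / 5x512 / 6x512 -> pad -> 6x1536 -> FC -> 6x512
```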
Step 4, the features extracted by the AS-CNN are sent to the Decoder end of the Transformer, where they serve as the keys and values of the self-attention model for computing the attention weights.
Step 5, the network model is trained, the quality of the generated summaries is evaluated with the BLEU metric, and both summary quality and the number of model parameters are compared against a native Transformer to reach the final conclusion.
Step 5.1, the network model is trained until the loss converges on the validation set; the loss function used is the cross-entropy loss (Cross Entropy Loss):
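The formula itself does not survive in this text. For sequence generation, the token-level cross-entropy loss takes the standard form

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T}\log p\left(y_t \mid y_{<t}, x\right),$$

where $x$ is the source article, $y_t$ is the t-th token of the reference summary, and $T$ is the summary length.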
As shown in FIG. 3, which illustrates the interaction between the AS-CNN and the Decoder end: the AS-CNN extracts text features that serve as the keys and values of the self-attention model and sends them to the Decoder end, the Decoder-end input serves as the query, and attention is computed over the three to form the final decoding vector.
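The attention calculation referred to here is, in the standard Transformer formulation, scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$ is formed from the Decoder-end input, $K$ and $V$ from the AS-CNN features, and $d_k$ is the key dimension.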
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit it; the scope of the present invention is defined by the claims. Those skilled in the art may make various modifications and equivalents within the spirit and scope of the present invention, and such modifications and equivalents should also be considered to fall within its scope.
Claims (3)
1. A method for optimizing text summary generation, comprising the steps of:
step 1, acquiring related text data needing to generate an abstract, and processing the text data;
step 2, constructing a relevant dictionary for the processed text, setting word vector dimensions and randomly initializing all word vectors, wherein each word corresponds to a unique id;
step 3, feeding the article's input vectors to the AS-CNN at the model's Encoder end for feature extraction;
and 4, sending the feature vectors extracted by the AS-CNN to the Decoder end of a Transformer for decoding, to generate an abstract of the article.
2. The method according to claim 1, characterized in that step 3 specifically comprises the following steps:
step 3.1, setting the convolution kernel sizes and the number of kernels of each size according to the article length;
step 3.2, extracting sentence features of different lengths with the different convolution kernels;
step 3.3, padding the sentence features of different lengths so that the sentence lengths agree, taking the longest sentence length as the standard;
step 3.4, fusing the features extracted by the different convolution kernels;
and 3.5, mapping the fused feature vectors through a fully connected network.
3. The method according to claim 1, characterized in that step 4 specifically comprises the following steps:
step 4.1, converting the dimensions of the text feature vectors extracted by the AS-CNN so that they can be input to the Decoder end of the Transformer;
step 4.2, using the AS-CNN feature vectors as the key and value matrices in the Decoder-end self-attention mechanism, computing the attention weights, and applying them to the query matrix input at the Decoder end;
and 4.3, from the semantic vector produced by the Decoder's decoding, finding the word to be generated through a Softmax layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910981470.3A CN110765768A (en) | 2019-10-16 | 2019-10-16 | Optimized text abstract generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110765768A true CN110765768A (en) | 2020-02-07 |
Family
ID=69331275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910981470.3A Pending CN110765768A (en) | 2019-10-16 | 2019-10-16 | Optimized text abstract generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110765768A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180300400A1 (en) * | 2017-04-14 | 2018-10-18 | Salesforce.Com, Inc. | Deep Reinforced Model for Abstractive Summarization |
CN109492232A (en) * | 2018-10-22 | 2019-03-19 | 内蒙古工业大学 | A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer |
CN109885673A (en) * | 2019-02-13 | 2019-06-14 | 北京航空航天大学 | A kind of Method for Automatic Text Summarization based on pre-training language model |
Non-Patent Citations (1)
Title |
---|
SHENGLI SONG et al.: "Abstractive text summarization using LSTM-CNN based deep learning" |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733498A (en) * | 2020-11-06 | 2021-04-30 | 北京工业大学 | Method for improving automatic Chinese text summarization self-attention calculation |
CN112733498B (en) * | 2020-11-06 | 2024-04-16 | 北京工业大学 | Method for improving self-attention calculation of Chinese automatic text abstract |
CN113449489A (en) * | 2021-07-22 | 2021-09-28 | 深圳追一科技有限公司 | Punctuation mark marking method, punctuation mark marking device, computer equipment and storage medium |
CN113449489B (en) * | 2021-07-22 | 2023-08-08 | 深圳追一科技有限公司 | Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium |
CN117763140A (en) * | 2024-02-22 | 2024-03-26 | 神州医疗科技股份有限公司 | Accurate medical information conclusion generation method based on computing feature network |
CN117763140B (en) * | 2024-02-22 | 2024-05-28 | 神州医疗科技股份有限公司 | Accurate medical information conclusion generation method based on computing feature network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||

Application publication date: 20200207