
CN115017404B - Target news topic abstracting method based on compressed space sentence selection - Google Patents

Target news topic abstracting method based on compressed space sentence selection

Info

Publication number
CN115017404B
CN115017404B
Authority
CN
China
Prior art keywords
sentence
sentences
topic
news
document set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210449431.0A
Other languages
Chinese (zh)
Other versions
CN115017404A (en)
Inventor
余正涛
卢天旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202210449431.0A
Publication of CN115017404A
Application granted
Publication of CN115017404B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a target news topic summarization method based on compressed-space sentence selection, and belongs to the field of natural language processing. The method comprises the following steps: constructing a target news topic summarization dataset; filtering information irrelevant to the topic description with a sentence importance evaluation module so as to compress the search space; encoding the screened document set and its sentences with a document set encoder module based on an improved Bert model; calculating the saliency features and repetition features of the sentences in the encoded document set; and, after fusing the two features, extracting the sentences with the highest final scores to form the topic summary. By incorporating the guidance of topic key descriptors and using a sentence selection method that balances saliency features against repetition features, the method improves the quality of the generated topic summary while compressing the search space, and provides support for subsequent tasks such as target supervision.

Description

Target news topic abstracting method based on compressed space sentence selection
Technical Field
The invention relates to a target news topic summary method based on compressed space sentence selection, and belongs to the technical field of natural language processing.
Background
When a hot, case-related event occurs in society, information tends to spread and ferment rapidly on news websites and media, and how to conduct effective target supervision is a key problem. Quickly acquiring this information by technical means and promptly extracting the key content of case-topic news is very important for the relevant departments to monitor network trends and maintain network order. News documents on the same topic often number in the hundreds or thousands; if users search for topic-describing information directly in this news, they must spend considerable time and effort reading it article by article, which is highly inconvenient. Automatic text summarization extracts a topic summary focused on the topic's key information from the news of a topic cluster, greatly reducing users' reading time and effectively reducing the cost of information storage, and it plays an important role when the relevant departments carry out supervision of target topics and acquire key topic content.
At present, there is extensive research on topic summarization in the general domain; deep learning and neural network algorithms have developed greatly in recent years, and their application to topic summarization has produced many results. Cheng et al. use a hierarchical neural network within a data-driven encoder-extractor summarization framework to extract sentence- and word-level features separately and obtain better summaries; Cao et al. select sentences from a document set with a recurrent-neural-network ranking model that casts sentence ranking as a hierarchical regression process while measuring the salience of sentences and of their constituents in the parse tree; Zhang et al. propose an enhanced multi-view convolutional neural network that jointly learns sentence features and ranks sentences; Nallapati et al. propose a recurrent-neural-network model that predicts summaries from features such as sentence content, salience, and redundancy, and that can be trained directly against reference summaries, removing the dependence on sentence-extraction labels. However, target news topic summarization operates on the news documents under the same case topic, whose titles and body text both contain key information about the target topic's key elements. A general automatic summarization method easily misses or buries this key information, may extract sentences irrelevant to the target topic's description, and yields summaries with low sentence saliency. In addition, existing methods use a sentence selection model to extract representative sentences directly from the original text; although simple and practical, this cannot adapt to the many case elements of intra-cluster news data in the target news topic summarization task, and the generated summary sentences are insufficiently key and highly repetitive. Therefore, screening sentences by importance with the target topic's keywords, and using a sentence selection method that balances saliency features against repetition features, can improve the quality of the generated topic summary while compressing the search space.
Disclosure of Invention
The invention provides a target news topic summarization method based on compressed-space sentence selection; it incorporates the guidance of topic key descriptors and uses a sentence selection method that balances saliency features against repetition features, so that the quality of the generated topic summary is improved while the search space is compressed.
The technical scheme of the invention is as follows: the target news topic summarization method based on compressed-space sentence selection comprises the following specific steps:
Step1, crawling target news through a crawler technology, selecting topic-related news, and picking out 30 topic news clusters, wherein the news in each cluster describes the same topic, each cluster contains 20 target news articles, including titles and body text, and a total of 15,343 sentences are used for dataset construction; performing data denoising, cleaning, and preprocessing; analyzing the crawled target news to ensure that each news article belongs to exactly one topic cluster, annotating the news documents under the same topic cluster to obtain labels for the sentences in the documents, and manually writing a reference summary for each topic cluster;
Step2, screening, from all documents and through the defined topic key descriptors, the sentences containing topic words that have the highest importance scores; encoding the screened document set and sentences using an improved pre-training model; obtaining two sentence features through a saliency calculation module and a repetition feature calculation module; and balancing the saliency and repetition features of the generated summary through the sentence selection model, so as to obtain a summary containing accurate topic information.
The Step1 specifically comprises the following steps:
Step1.1, crawling news data related to target topics in recent years from major news websites and official-account platforms through a crawler technology, and selecting 30 topic news clusters, wherein the news in each cluster describes the same topic, each cluster contains 20 target news articles, including titles and body text, and a total of 15,343 sentences form the dataset;
Step1.2, performing data cleaning, denoising, and preprocessing on the crawled data, including removing webpage tags, advertisement links, and special symbols, removing duplicate data, foreign-language data, and traditional Chinese characters, and manually calibrating the relevance of the news data to the case topics;
Step1.3, obtaining the sentence labels in the dataset by manual annotation, and manually writing a reference summary for each topic cluster;
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, defining the keywords of each topic cluster, extracting the set of sentences containing the keywords through regular-expression matching, filtering out sentences containing irrelevant information, calculating the word frequencies of the keywords in the news documents to obtain the importance scores of the sentences, extracting the sentences with the highest importance scores in each document, and merging them into a new topic-cluster document set;
Step2.2, encoding the document set and its sentences with a document set encoder based on an improved Bert model to obtain their representations;
Step2.3, measuring the saliency of the candidate, not-yet-selected sentences through a bilinear mapping function;
Step2.4, calculating, during sentence selection, the n-gram phrase overlap between each candidate sentence and the already selected summary sentences, calculating semantic-representation similarity through cosine similarity, normalizing and discretizing both, converting the two features into one-hot vector representations, and concatenating them to obtain the repetition feature vector;
Step2.5, balancing each candidate sentence's saliency feature against its repetition feature through a bilinear mapping function to obtain a matching vector, feeding the matching vector into a multi-layer perceptron to obtain the candidate sentence's final score, and putting the highest-scoring sentences into the summary sentence set to obtain the topic summary.
As a preferred embodiment of the present invention, the step2.1 specifically includes:
To calculate the importance scores of sentences, the scores of the keywords, i.e., the keywords' word frequencies, are calculated first. The set of sentences containing keywords is extracted with Python regular-expression matching, sentences containing irrelevant information are filtered out, and the word frequency of each keyword in the news document is calculated. Let num(w_i) be the number of occurrences of keyword w_i in a given news article; the score of keyword w_i is then SC(w_i) = num(w_i) / Σ_j num(w_j), where the denominator is the sum of the occurrence counts of all words in the news article;
After the keyword scores of a document are calculated, the scores of the sentences in the document can be calculated; the importance score of the n-th sentence s_mn of the m-th document in the document set is denoted SC(s_mn);
After the importance scores of the sentences in the document set are calculated, the sentences must be screened according to these scores. The document set sentences are encoded with a document set encoder based on an improved Bert model, and the Bert model limits the encoding length, so the sentences with the highest importance scores in each document are extracted before sentence encoding and merged into a new document set L for the subsequent encoding operations;
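As a rough sketch of this sentence importance evaluation, the following Python fragment filters sentences by topic keywords with regular expressions and scores them by keyword word frequency. The aggregation of keyword scores into a sentence score and the number of sentences kept per document (top_k) are assumptions, since the text states only that keyword word frequencies yield the sentence importance scores:

```python
import re
from collections import Counter

def keyword_score(doc_words, keyword):
    # Word-frequency score SC(w_i): occurrences of the keyword divided by
    # the total occurrences of all words in the news article.
    counts = Counter(doc_words)
    total = sum(counts.values())
    return counts[keyword] / total if total else 0.0

def sentence_importance(sentence, doc_words, keywords):
    # Assumed aggregation: sum the scores of the topic keywords the
    # sentence contains (the patent does not spell out this formula).
    return sum(keyword_score(doc_words, w) for w in keywords if w in sentence)

def compress_search_space(documents, keywords, top_k=1):
    # Keep only keyword-matching sentences (regex filter), then take each
    # document's highest-scoring sentences to build the compressed set L.
    pattern = re.compile("|".join(map(re.escape, keywords)))
    compressed = []
    for doc in documents:  # doc: list of sentences (strings)
        doc_words = [w for s in doc for w in s.split()]  # assumes pre-segmented text
        candidates = [s for s in doc if pattern.search(s)]
        candidates.sort(key=lambda s: sentence_importance(s, doc_words, keywords),
                        reverse=True)
        compressed.extend(candidates[:top_k])
    return compressed
```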
as a preferred embodiment of the present invention, the step2.2 specifically includes:
After sentence importance evaluation, the newly merged document set L contains m target news sentences {l_1, l_2, …, l_m}, where l_i denotes the i-th sentence in the set. To obtain high-quality sentence and document-set representations, the Bert pre-training model is applied to the topic summarization task. Because the Bert model encodes at the token level rather than the sentence level, and its segment embeddings exist only to judge whether two sentences are related (so only two segment-embedding types exist), it cannot be applied directly to a topic summarization task whose input is many sentences. A document set encoder based on an improved Bert model is therefore used: a [CLS] token is added before every sentence l_i of the document set to aggregate that sentence's embedding information, and a [SEP] token is added at its end to mark the boundaries between different sentences. To distinguish sentences at different positions, two interval segment embeddings, E_odd and E_even, are introduced: for sentence l_i, the interval segment embedding is E_odd if i is odd and E_even if i is even. This encoding yields, for each sentence, the fusion of three embeddings: the token embedding E_l, the interval segment embedding E_odd/E_even, and the position embedding E_p. After several Transformer encoding layers, the representation T_[CLS] output at the [CLS] token preceding sentence l_i is taken as that sentence's representation, denoted E'_li. Each sentence representation E'_li is fused with its position embedding E'_p in the document set encoder to form an input representation sequence; an embedding E_set representing the document set is prepended to the sequence, producing a complete document-set/sentence input sequence, which is fed through several more Transformer encoding layers to finally obtain the representation r_set of the complete document set L and the sentence encodings r_li;
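A minimal sketch of how this modified input sequence could be assembled is shown below; the tokenizer object (anything exposing a Bert-style tokenize method) and the use of segment ids 0/1 to stand for E_odd/E_even are assumptions:

```python
def build_encoder_input(sentences, tokenizer):
    # Prepend [CLS] to every sentence (to aggregate its embedding) and
    # append [SEP] (to mark sentence boundaries); alternate interval
    # segment ids so adjacent sentences stay distinguishable.
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences, start=1):  # i = 1-based sentence index
        cls_positions.append(len(tokens))          # where T_[CLS] will sit
        piece = ["[CLS]"] + tokenizer.tokenize(sent) + ["[SEP]"]
        tokens.extend(piece)
        seg = 0 if i % 2 == 1 else 1               # 0 ~ E_odd, 1 ~ E_even
        segment_ids.extend([seg] * len(piece))
    return tokens, segment_ids, cls_positions
```

The hidden states at cls_positions would then serve as the sentence representations E'_li, prepended with the document set embedding E_set and re-encoded by the additional Transformer layers to yield r_set and the r_li.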
As a preferred embodiment of the present invention, the step2.3 specifically includes:
The topic summarization task must extract representative sentences, i.e., sentences with high saliency, so a single-step sentence saliency calculation module is designed on top of the document-set and sentence encodings. Let R be the manually written reference summary of the document set L; the goal is to extract from L k sentences that summarize the key information, as the summary sentences. At the t-th selection step, the set of summary sentences generated so far is Ŝ_{t-1} = {l̂_1, …, l̂_{t-1}}. Let l_j be a sentence of L not yet selected; the probability that a selected sentence is contained in the reference summary R is measured by the bilinear mapping function F_pro(l_i) = r_set^T · W_bm · r_li, computed from the set representation r_set and the sentence representations r_li output by the document set encoder;
Here W_bm is the bilinear mapping weight matrix; it applies separate linear transformations to the differently-dimensioned vectors r_set and r_li, mapping them into a common space. The training objective is to maximize the log-likelihood of the sentences contained in the reference summary R over the training samples;
The bilinear mapping function F_pro, as the saliency scoring function relating the current candidate sentence l_i to the not-yet-selected sentences l_j, computes an attention score for each candidate sentence, i.e., the sentence's saliency score;
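The bilinear saliency score might look as follows in PyTorch; the matrix shape, initialization, and the cross-entropy training sketch are assumptions consistent with the stated log-likelihood objective:

```python
import torch
import torch.nn as nn

class SaliencyScorer(nn.Module):
    # Bilinear mapping F_pro = r_set^T W_bm r_li: scores how likely each
    # not-yet-selected sentence belongs to the reference summary R.
    def __init__(self, set_dim, sent_dim):
        super().__init__()
        self.W_bm = nn.Parameter(torch.empty(set_dim, sent_dim))
        nn.init.xavier_uniform_(self.W_bm)

    def forward(self, r_set, r_sents):
        # r_set: (set_dim,); r_sents: (n, sent_dim) -> (n,) saliency scores
        return r_sents @ (self.W_bm.t() @ r_set)

# Training sketch: maximize the log-likelihood of reference-summary sentences,
# e.g. cross-entropy of a softmax over candidate scores against gold indices.
```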
As a preferred embodiment of the present invention, the step2.4 specifically includes:
After the saliency score of a candidate sentence has been calculated, the sentence's repetition features must be calculated. In the t-th selection step, the matching feature of the n-gram model is calculated first; it expresses the degree of n-gram phrase overlap between the candidate sentence l_i and the already selected summary sentence l̂_{t-1};
More overlapping phrases indicate more repetition. To calculate the repetition features accurately, the phrase overlap degrees of the unigram, bigram, and trigram models are calculated separately. To mine deeper similarity between the sentence representations, the maximum semantic similarity F_sim of the sentence representations is fused with the n-gram phrase overlap to calculate the overlap features;
To widen the numerical differences among the overlap features calculated via cosine similarity between the candidate sentence and the selected sentences, the feature values are rescaled into [0, 1] by linear normalization;
The repetition feature calculation module calculates these two kinds of repetition features and fuses them into the overall repetition feature. First the interval [0, 1] is divided into c equal blocks; the unigram, bigram, and trigram phrase-overlap features and the normalized semantic-similarity feature are each dispersed into the corresponding block, converting each partial feature into a one-hot vector of length c; the parts are then concatenated to obtain the module's repetition feature vector F_rep(l_i);
In the formula, the segmented one-hot vectors of the partial repetition feature vectors capture the influence of each partial repetition feature; a selected summary sentence is expected to carry as few repetition features as possible;
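A sketch of the repetition feature vector F_rep(l_i) follows; mapping the cosine range [-1, 1] onto [0, 1] stands in for the linear normalization (the text does not give its exact constants), and c = 20 follows the reported setting:

```python
import torch
import torch.nn.functional as F

def ngram_overlap(cand_tokens, summary_tokens, n):
    # Fraction of the candidate's n-grams already present in the summary.
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    c_set, s_set = grams(cand_tokens), grams(summary_tokens)
    return len(c_set & s_set) / len(c_set) if c_set else 0.0

def one_hot_bucket(value, c=20):
    # Discretize a value in [0, 1] into one of c equal blocks (one-hot).
    vec = torch.zeros(c)
    vec[min(int(value * c), c - 1)] = 1.0
    return vec

def repetition_features(cand_tokens, summary_tokens, cand_vec, summary_vecs, c=20):
    # Concatenate one-hot buckets of 1/2/3-gram overlap with the bucketed
    # maximum cosine similarity to the already selected sentences.
    # Assumes at least one sentence was already selected (the first step
    # of selection uses saliency only).
    parts = [one_hot_bucket(ngram_overlap(cand_tokens, summary_tokens, n), c)
             for n in (1, 2, 3)]
    sims = F.cosine_similarity(cand_vec.unsqueeze(0), summary_vecs, dim=-1)
    sim01 = (sims.max().item() + 1) / 2   # assumed normalization to [0, 1]
    parts.append(one_hot_bucket(sim01, c))
    return torch.cat(parts)               # F_rep(l_i), length 4 * c
```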
As a preferred embodiment of the present invention, the step2.5 specifically includes:
After the saliency score and the repetition features have been obtained from the sentence saliency calculation module and the repetition feature calculation module, the sentence selection module must balance the two, so that a selected summary sentence has sufficient saliency without carrying excessive repetition features. In the first selection step, only the sentence with the highest saliency score is extracted, as the first sentence of the summary. Thereafter the two features of each candidate sentence l_i are balanced by computing a bilinear mapping of the saliency feature F_pro(l_i) and the repetition feature F_rep(l_i), yielding a d-dimensional matching vector, which is fed into the MLP to obtain the final score SC(l_i);
Here W'_bm is the bilinear mapping matrix over the two features, and W_h is the weight matrix of the MLP. During training, the sentence selection module randomly samples sentences from the reference summary R, so that the model learns the context information and learns to find the next salient, non-repeated sentence; the objective is the softmax likelihood P(l_i | Ŝ_{t-1}) = exp(SC(l_i)) / Σ_{l_j ∈ L \ Ŝ_{t-1}} exp(SC(l_j));
That is, at the t-th step the probability of selecting any sentence l_i is a softmax of the sentence score SC(l_i) over the sentences l_j remaining in L. The loss of the sentence selection module is independent of the order of sentence selection, because the sentences supplied during training are an unordered set and the module's target is always the next salient, non-repeated sentence; finally a sentence set Ŝ = {l̂_1, …, l̂_k} is obtained as the generated topic summary;
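A sketch of the balancing and greedy selection is given below; torch's nn.Bilinear plays the role of W'_bm, the MLP layer sizes are assumptions, and scores_fn is a hypothetical callback wiring the saliency and repetition modules together:

```python
import torch
import torch.nn as nn

class SentenceSelector(nn.Module):
    # Balance the scalar saliency F_pro(l_i) against the repetition vector
    # F_rep(l_i) via a bilinear map into a d-dimensional matching vector,
    # then score it with an MLP to obtain SC(l_i).
    def __init__(self, rep_dim, d=10):
        super().__init__()
        self.bilinear = nn.Bilinear(1, rep_dim, d)   # role of W'_bm
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, f_pro, f_rep):
        # f_pro: (n, 1) saliency scores; f_rep: (n, rep_dim) repetition vectors
        match = self.bilinear(f_pro, f_rep)          # (n, d) matching vectors
        return self.mlp(match).squeeze(-1)           # (n,) final scores SC(l_i)

def select_summary(scores_fn, sentences, k):
    # Greedy extraction: step 1 takes the most salient sentence; each later
    # step re-scores the remaining sentences against the partial summary.
    summary, remaining = [], list(sentences)
    for _ in range(k):
        scores = scores_fn(summary, remaining)       # tensor of SC(l_i)
        summary.append(remaining.pop(int(torch.argmax(scores))))
    return summary
```

During training the same scores would feed the softmax over the remaining sentences, matching the objective above.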
Further, a document set encoder based on an improved Bert model is used to encode the topic-cluster news documents and sentences; it comprises 12 hidden layers with 12 attention heads each, a hidden size of 768, and a vocabulary size of 30,522. The document set encoder encodes the document set with 2 Transformer layers, each with dropout set to 0.1. The training batch size is 128, training runs for 20 epochs, the learning rate is 2e-3, and the optimizer is Adam with β1 = 0.9 and β2 = 0.999. The length c of each partial repetition feature's one-hot vector in formula 8 is 20, the dimension d of the bilinear mapping output over the saliency and repetition features in formula 9 is 10, and model training uses unordered sentences randomly sampled from the reference summary as context information.
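For reference, the reported hyperparameters collected into a single Python configuration (the values restate the text; the key names are ours):

```python
CONFIG = {
    "bert_hidden_layers": 12, "attention_heads": 12, "hidden_size": 768,
    "vocab_size": 30522,                 # improved-Bert document set encoder
    "docset_transformer_layers": 2, "dropout": 0.1,
    "batch_size": 128, "epochs": 20, "learning_rate": 2e-3,
    "optimizer": "Adam", "beta1": 0.9, "beta2": 0.999,
    "one_hot_block_count_c": 20,         # formula 8
    "bilinear_match_dim_d": 10,          # formula 9
}
```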
The beneficial effects of the invention are as follows:
(1) For target news topic summarization, many news documents share the same topic cluster, the search space is large, and many sentences are irrelevant to the key information. To build a better sentence selection model, so that the generated topic summary reflects the key information of the target news while reducing repeated and unimportant sentences, a target news topic summarization method based on compressed-space sentence selection is proposed, and a sentence importance evaluation module is designed to filter out irrelevant information and compress the search space;
(2) The invention provides two parallel modules that calculate the saliency features and the repetition features respectively;
(3) The invention provides a sentence selection model that, after balancing the two features, extracts the sentences with the highest final scores to form the topic summary, solving the problems of a large search space, low saliency of the generated summary, and excessive repeated information in the target news topic summarization task.
Drawings
FIG. 1 is a flow chart of the target news topic summarization method based on compressed-space sentence selection proposed by the present invention;
FIG. 2 is a model diagram of a document collection encoder module based on an improved Bert model in the process flow of the present invention.
Detailed Description
Example 1: as shown in fig. 1 and 2, a target news topic summary method based on compressed space sentence selection includes the following specific steps:
Step1, crawling target news through a crawler technology, selecting topic-related news, and picking out 30 topic news clusters, wherein the news in each cluster describes the same topic, each cluster contains 20 target news articles, including titles and body text, and a total of 15,343 sentences are used for dataset construction; performing data denoising, cleaning, and preprocessing; analyzing the crawled target news so that each news article belongs to exactly one topic cluster, annotating the news documents under the same topic cluster to obtain labels for the sentences in the documents, and manually writing a reference summary for each topic cluster.
Step1.1, crawling key target news of recent years from major news websites and official-account platforms through a crawler technology, totaling 17,889 articles covering more than ten case topics that attracted high attention from netizens, such as a certain rights-protection case;
Step1.2, performing data cleaning, denoising, and preprocessing on the crawled data, including removing webpage tags, advertisement links, and special symbols, removing duplicate data, foreign-language data, and traditional Chinese characters, and manually calibrating the relevance of the news data to the case topics;
Step1.3, obtaining the sentence labels in the dataset by manual annotation, and manually writing a reference summary for each topic cluster. The scale of the experimental dataset is shown in Table 1:
Table 1 Experimental dataset statistics
Step2, screening, from all documents and through the defined topic key descriptors, the sentences containing topic words that have the highest importance scores; encoding the screened document set and sentences using an improved pre-training model; obtaining two sentence features through a saliency calculation module and a repetition feature calculation module; and balancing the saliency and repetition features of the generated summary through the sentence selection model, calculating the sentence scores, and extracting the highest-scoring sentences, so as to obtain a summary containing accurate topic information.
Step2.1, calculating the importance scores of sentences, for which the scores of the keywords, i.e., the keywords' word frequencies, are calculated first. The set of sentences containing keywords is extracted with Python regular-expression matching, sentences containing irrelevant information are filtered out, and the word frequency of each keyword in the news document is calculated. Let num(w_i) be the number of occurrences of keyword w_i in a given news article; the score of keyword w_i is then SC(w_i) = num(w_i) / Σ_j num(w_j), where the denominator is the sum of the occurrence counts of all words in the news article;
After the keyword scores of a document are calculated, the scores of the sentences in the document can be calculated; the importance score of the n-th sentence s_mn of the m-th document in the document set is denoted SC(s_mn);
After the importance scores of the sentences in the document set are calculated, the sentences must be screened according to these scores. The document set sentences are encoded with a document set encoder based on an improved Bert model, and the Bert model limits the encoding length, so the sentences with the highest importance scores in each document are extracted before sentence encoding and merged into a new document set L for the subsequent encoding operations;
Step2.2, after sentence importance evaluation, the newly merged document set L contains m target news sentences {l_1, l_2, …, l_m}, where l_i denotes the i-th sentence in the set. To obtain high-quality sentence and document-set representations, the Bert pre-training model is applied to the topic summarization task. Because the Bert model encodes at the token level rather than the sentence level, and its segment embeddings exist only to judge whether two sentences are related (so only two segment-embedding types exist), it cannot be applied directly to a topic summarization task whose input is many sentences. A document set encoder module based on an improved Bert model is therefore used: a [CLS] token is added before every sentence l_i of the document set to aggregate that sentence's embedding information, and a [SEP] token is added at its end to mark the boundaries between different sentences. To distinguish sentences at different positions, two interval segment embeddings, E_odd and E_even, are introduced: for sentence l_i, the interval segment embedding is E_odd if i is odd and E_even if i is even. This encoding yields, for each sentence, the fusion of three embeddings: the token embedding E_l, the interval segment embedding E_odd/E_even, and the position embedding E_p. After several Transformer encoding layers, the representation T_[CLS] output at the [CLS] token preceding sentence l_i is taken as that sentence's representation, denoted E'_li. Each sentence representation E'_li is fused with its position embedding E'_p in the document set encoder to form an input representation sequence; an embedding E_set representing the document set is prepended to the sequence, producing a complete document-set/sentence input sequence, which is fed through several more Transformer encoding layers to finally obtain the representation r_set of the complete document set L and the sentence encodings r_li;
Step2.3, the topic summarization task must extract representative sentences, i.e., sentences with high saliency, so a single-step sentence saliency calculation module is designed on top of the document-set and sentence encodings. Let R be the manually written reference summary of the document set L; the goal is to extract from L k sentences that summarize the key information, as the summary sentences. At the t-th selection step, the set of summary sentences generated so far is Ŝ_{t-1} = {l̂_1, …, l̂_{t-1}}. Let l_j be a sentence of L not yet selected; the probability that a selected sentence is contained in the reference summary R is measured by the bilinear mapping function F_pro(l_i) = r_set^T · W_bm · r_li, computed from the set representation r_set and the sentence representations r_li output by the document set encoder;
Here W_bm is the bilinear mapping weight matrix; it applies separate linear transformations to the differently-dimensioned vectors r_set and r_li, mapping them into a common space. The training objective is to maximize the log-likelihood of the sentences contained in the reference summary R over the training samples;
The bilinear mapping function F_pro, as the saliency scoring function relating the current candidate sentence l_i to the not-yet-selected sentences l_j, computes an attention score for each candidate sentence, i.e., the sentence's saliency score;
Step2.4, after the saliency score of a candidate sentence has been calculated, the sentence's repetition features are calculated. In the t-th selection step, the matching feature of the n-gram model is calculated first; it expresses the degree of n-gram phrase overlap between the candidate sentence l_i and the already selected summary sentence l̂_{t-1};
More overlapping phrases indicate more repetition. To calculate the repetition features accurately, the phrase overlap degrees of the unigram, bigram, and trigram models are calculated separately. To mine deeper similarity between the sentence representations, the maximum semantic similarity F_sim of the sentence representations is fused with the n-gram phrase overlap to calculate the overlap features;
To widen the numerical differences among the overlap features calculated via cosine similarity between the candidate sentence and the selected sentences, the feature values are rescaled into [0, 1] by linear normalization;
The repetition feature calculation module calculates these two kinds of repetition features and fuses them into the overall repetition feature. First the interval [0, 1] is divided into c equal blocks; the unigram, bigram, and trigram phrase-overlap features and the normalized semantic-similarity feature are each dispersed into the corresponding block, converting each partial feature into a one-hot vector of length c; the parts are then concatenated to obtain the module's repetition feature vector F_rep(l_i);
In the formula, the segmented one-hot vectors of the partial repetition feature vectors capture the influence of each partial repetition feature; a selected summary sentence is expected to carry as few repetition features as possible;
Step2.5, after the saliency score and the repetition features have been obtained from the sentence saliency calculation module and the repetition feature calculation module, the sentence selection module balances the two, so that a selected summary sentence has sufficient saliency without carrying excessive repetition features. In the first selection step, only the sentence with the highest saliency score is extracted, as the first sentence of the summary. Thereafter the two features of each candidate sentence l_i are balanced by computing a bilinear mapping of the saliency feature F_pro(l_i) and the repetition feature F_rep(l_i), yielding a d-dimensional matching vector, which is fed into the MLP to obtain the final score SC(l_i);
Here W'_bm is the bilinear mapping matrix over the two features, and W_h is the weight matrix of the MLP. During training, the sentence selection module randomly samples sentences from the reference summary R, so that the model learns the context information and learns to find the next salient, non-repeated sentence; the objective is the softmax likelihood P(l_i | Ŝ_{t-1}) = exp(SC(l_i)) / Σ_{l_j ∈ L \ Ŝ_{t-1}} exp(SC(l_j));
That is, at the t-th step the probability of selecting any sentence l_i is a softmax of the sentence score SC(l_i) over the sentences l_j remaining in L. The loss of the sentence selection module is independent of the order of sentence selection, because the sentences supplied during training are an unordered set and the module's target is always the next salient, non-repeated sentence; finally a sentence set Ŝ = {l̂_1, …, l̂_k} is obtained as the generated topic summary;
Step2.6, the model experiments encode the topic-cluster news documents and sentences with a document set encoder based on an improved Bert model, comprising 12 hidden layers with 12 attention heads each, a hidden size of 768, and a vocabulary size of 30,522. The document set encoder encodes the document set with 2 Transformer layers, each with dropout set to 0.1. The training batch size is 128, training runs for 20 epochs, the learning rate is 2e-3, and the optimizer is Adam with β1 = 0.9 and β2 = 0.999. The length c of each partial repetition feature's one-hot vector in formula 8 is 20, the dimension d of the bilinear mapping output over the saliency and repetition features in formula 9 is 10, and model training uses unordered sentences randomly sampled from the reference summary as context information.
To illustrate the effect of the invention, three comparative experiments were set up: the first verifies the improvement in topic summarization performance, the second verifies the effectiveness of the invention's model, and the third verifies the influence of different summary lengths on model effectiveness.
(1) Verification of topic summarization performance improvement
For the baseline comparison, the target news topic summarization dataset constructed in Step1 is used as model input, and 5 reference models are selected: LEAD-3, the LDA topic model, TextRank, BertSum, and RL-MMR. The experimental results are shown in Table 2.
Table 2 Performance comparison with the baseline models
Analysis of Table 2 shows that the proposed method outperforms the other reference models. The LEAD-3 algorithm performs worst at generating target news topic summaries, because it attends only to the beginning of the document set and extracts the first three sentences by hard truncation; those sentences describe much irrelevant information, so the content is unrepresentative and model performance is poor. The LDA topic model relies on statistical features, and the specificity of target news leaves LDA with the problem of inconsistent topic importance. The TextRank algorithm, based on a graph model, has clear advantages in building the association relations among sentences in a document set, but it performs no prior sentence-importance screening and is easily influenced by high-frequency words that are not topic keywords in the target domain, leaving obvious deficits in the ROUGE-2 and ROUGE-L metrics. The DPP model scores better than the preceding comparison methods: it selects representative samples through a determinantal point process and uses a capsule network to filter sentences that share few overlapping words but repeat semantics, performing well at removing repetition; but it lacks end-to-end representation learning, which can accumulate errors, so its effect still needs improvement. The RL-MMR model achieves good results relative to the other comparison models, but compared with the present model its soft-attention sentence ranking is imperfect: with no guidance from topic keyword information, non-topic-keyword information can still appear in highly ranked sentences. After the sentence importance evaluation module is introduced, the present model filters out a large number of sentences irrelevant to the key information and balances the repetition features against the saliency features, so the final sentence scores are biased toward neither feature and the best effect is achieved: compared with the baseline models, the ROUGE-1 value improves by 1.44 to 6.47, the ROUGE-2 value by 0.96 to 6.34, and the ROUGE-L value by 0.91 to 6.42, verifying the effectiveness of the method.
(2) Model validity verification
To verify the effectiveness of each module of the model, the main model is ablated into three sub-models: the main model without sentence importance evaluation, the main model without the saliency features, and the main model without the repetition features. Evaluation metrics are calculated as ROUGE values, with the best result shown in bold; everything else is kept unchanged. The test results are shown in Table 3:
Table 3 Performance analysis of the simplified models
Analysis of Table 3 shows that removing the sentence importance evaluation module degrades every metric the most: the ROUGE-1 value drops by 6.28, the ROUGE-2 value by 6.32, and the ROUGE-L value by 7.71. This is because, with the module removed, the sentence set input to the model is unfiltered and contains many words carrying no key information, and the document set encoder module hard-truncates the surplus sentence encodings, so the description of the topic content is inaccurate and differs greatly from the reference summary. Removing the saliency features performs slightly better than removing the sentence importance evaluation module: the ROUGE-1 value drops by 4.47, the ROUGE-2 value by 3.53, and the ROUGE-L value by 5.94. Since the model still extracts sentences containing topic keywords, once the saliency features are removed the information in the extracted summary sentences is unrepresentative and cannot describe the topic content well. Removing the repetition features causes the smallest drops: the ROUGE-1 value falls by 3.73, the ROUGE-2 value by 2.42, and the ROUGE-L value by 3.77. This is because the saliency calculation module is retained, so the extracted sentences carry more representative information; but with the repetition features removed, the extracted sentence set contains a large amount of repeated information, and although rich in key information, the generated summary is still not the best one, which indirectly verifies the effectiveness of the invention's model.
(3) Verification of the effect of different summary lengths on model effectiveness
To verify the influence of summaries of different lengths on the ROUGE metrics of the model, i.e., to verify whether the model adapts well, the following experiment generates four summaries of different lengths for comparison. The test results are shown in Table 4:
Table 4 Effect of different summary lengths on model effectiveness
Analysis of Table 4 shows that model performance declines markedly when the generated summary length is 50 or 100, because summaries that are too short lose a large amount of relevant information. The model is near its best performance at a generated summary length of 150 and performs best at 200. This is because, when the dataset was constructed, the manually written reference summaries of the topic clusters in the test set have an average length of about 178; the closer the generated summary is to the reference summary's length, the larger the number of co-occurring phrases and the longer the longest common substrings shared with the reference summary, and the better the model performs.
The above experimental data prove that, for the problems of a large topic-cluster search space and many sentences irrelevant to the topic's key information, the invention proposes a target news topic summarization method based on compressed-space sentence selection: sentences containing topic keywords are extracted by the sentence importance evaluation module, and sentences are scored by balancing their saliency and repetition features with a bilinear mapping function, achieving effective extraction of the sentences relevant to the topic's key information. On the constructed target news topic summarization dataset with manually written reference summaries, extensive experiments prove that the proposed topic summarization model extracts key, non-repeated sentences carrying representative information, and the generated summaries are of higher quality. Experiments show that the method achieves the best effect compared with multiple baseline models. For the target news topic summarization task, the proposed method based on compressed-space sentence selection is effective in improving the performance of news topic summarization in the target domain.
While the present invention has been described in detail with reference to the drawings, the invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A target news topic summarization method based on compressed-space sentence selection, characterized by comprising the following specific steps:
Step1, crawling target news through a crawler technology, selecting topic-related news, and constructing a target news topic summarization dataset; performing data denoising, cleaning, and preprocessing; analyzing the crawled target news to ensure that each news article belongs to exactly one topic cluster, annotating the news documents under the same topic cluster to obtain labels for the sentences in the documents, and manually writing a reference summary for each topic cluster;
Step2, screening, from all documents and through the defined topic key descriptors, the sentences containing topic words that have the highest importance scores; encoding the screened document set and sentences using an improved pre-training model; obtaining two sentence features through a saliency calculation module and a repetition feature calculation module; balancing the saliency and repetition features of the generated summary through the sentence selection model, calculating the sentence scores, and extracting the highest-scoring sentences to obtain a summary containing accurate topic information;
the specific steps of Step2 are as follows:
Step2.1, defining the keywords of each topic cluster, extracting the set of sentences containing the keywords through regular-expression matching, filtering out sentences containing irrelevant information, calculating the word frequencies of the keywords in the news documents to obtain the importance scores of the sentences, extracting the sentences with the highest importance scores in each document, and merging them into a new topic-cluster document set;
Step2.2, encoding the document set and its sentences with a document set encoder based on an improved Bert model to obtain their representations;
Step2.3, measuring the saliency of the candidate, not-yet-selected sentences through a bilinear mapping function;
Step2.4, calculating, during sentence selection, the n-gram phrase overlap between each candidate sentence and the already selected summary sentences, calculating semantic-representation similarity through cosine similarity, normalizing and discretizing both, converting the two features into one-hot vector representations, and concatenating them to obtain the repetition feature vector;
Step2.5, balancing each candidate sentence's saliency feature against its repetition feature through a bilinear mapping function to obtain a matching vector, feeding the matching vector into a multi-layer perceptron to obtain the candidate sentence's final score, and putting the highest-scoring sentences into the summary sentence set to obtain the topic summary;
The step2.2 specifically comprises:
After sentence importance evaluation, the newly merged document set L contains m target news sentences {l_1, l_2, …, l_m}, where l_i denotes the i-th sentence in the set; to obtain high-quality sentence and document-set representations, the Bert pre-training model is applied to the topic summarization task. Because the Bert model encodes at the token level rather than the sentence level, and its segment embeddings exist only to judge whether two sentences are related (so only two segment-embedding types exist), it cannot be applied directly to a topic summarization task whose input is many sentences. A document set encoder based on an improved Bert model is therefore used: a [CLS] token is added before every sentence l_i of the document set to aggregate that sentence's embedding information, and a [SEP] token is added at its end to distinguish the boundaries of different sentences. To distinguish sentences at different positions, two interval segment embeddings, E_odd and E_even, are introduced: for sentence l_i, the interval segment embedding is E_odd if i is odd and E_even if i is even. This encoding yields, for each sentence, the fusion of three embeddings: the token embedding E_l, the interval segment embedding E_odd/E_even, and the position embedding E_p. After several Transformer encoding layers, the representation T_[CLS] output at the [CLS] token preceding sentence l_i is taken as that sentence's representation, denoted E'_li. Each sentence representation E'_li is fused with its position embedding E'_p in the document set encoder to form an input representation sequence; an embedding E_set representing the document set is prepended to the sequence, producing a complete document-set/sentence input sequence, which is fed through several more Transformer encoding layers to finally obtain the representation r_set of the complete document set L and the sentence encodings r_li;
the step2.3 specifically comprises:
The topic summarization task must extract representative sentences, i.e., sentences with high saliency, so a single-step sentence saliency calculation module is designed on top of the document-set and sentence encodings. Let R be the manually written reference summary of the document set L; the goal is to extract from L k sentences that summarize the key information, as the summary sentences. At the t-th selection step, the set of summary sentences generated so far is Ŝ_{t-1} = {l̂_1, …, l̂_{t-1}}. Let l_j be a sentence of L not yet selected; the probability that a selected sentence is contained in the reference summary R is measured by the bilinear mapping function F_pro(l_i) = r_set^T · W_bm · r_li, computed from the set representation r_set and the sentence representations r_li output by the document set encoder;
Here W_bm is the bilinear mapping weight matrix; it applies separate linear transformations to the differently-dimensioned vectors r_set and r_li, mapping them into a common space. The training objective is to maximize the log-likelihood of the sentences contained in the reference summary R over the training samples;
The bilinear mapping function F_pro, as the saliency scoring function relating the current candidate sentence l_i to the not-yet-selected sentences l_j, computes an attention score for each candidate sentence, i.e., the sentence's saliency score;
the step2.4 specifically comprises:
After the saliency score of a candidate sentence has been calculated, the sentence's repetition features must be calculated. In the t-th selection step, the matching feature of the n-gram model is calculated first; it expresses the degree of n-gram phrase overlap between the candidate sentence l_i and the already selected summary sentence l̂_{t-1};
More overlapping phrases indicate more repetition. To calculate the repetition features accurately, the phrase overlap degrees of the unigram, bigram, and trigram models are calculated separately. To mine deeper similarity between the sentence representations, the maximum semantic similarity F_sim of the sentence representations is fused with the n-gram phrase overlap to calculate the overlap features;
To widen the numerical differences among the overlap features calculated via cosine similarity between the candidate sentence and the selected sentences, the feature values are rescaled into [0, 1] by linear normalization;
The repetition feature calculation module calculates these two kinds of repetition features and fuses them into the overall repetition feature. First the interval [0, 1] is divided into c equal blocks; the unigram, bigram, and trigram phrase-overlap features and the normalized semantic-similarity feature are each dispersed into the corresponding block, converting each partial feature into a one-hot vector of length c; the parts are then concatenated to obtain the module's repetition feature vector F_rep(l_i);
In the formula, each component is the one-hot vector of the corresponding segmented partial repetition feature;
The step2.5 specifically comprises:
After the saliency score and the repetition features have been obtained from the sentence saliency calculation module and the repetition feature calculation module, the sentence selection module must balance the two, so that a selected summary sentence has sufficient saliency without carrying excessive repetition features. In the first selection step, only the sentence with the highest saliency score is extracted, as the first sentence of the summary. Thereafter the two features of each candidate sentence l_i are balanced by computing a bilinear mapping of the saliency feature F_pro(l_i) and the repetition feature F_rep(l_i), yielding a d-dimensional matching vector, which is fed into the MLP to obtain the final score SC(l_i);
Here W'_bm is the bilinear mapping matrix over the two features, and W_h is the weight matrix of the MLP. During training, the sentence selection module randomly samples sentences from the reference summary R, so that the model learns the context information and learns to find the next salient, non-repeated sentence; the objective is the softmax likelihood P(l_i | Ŝ_{t-1}) = exp(SC(l_i)) / Σ_{l_j ∈ L \ Ŝ_{t-1}} exp(SC(l_j));
That is, at the t-th step the probability of selecting any sentence l_i is a softmax of the sentence score SC(l_i) over the sentences l_j remaining in L. The loss of the sentence selection module is independent of the order of sentence selection, because the sentences supplied during training are an unordered set and the module's target is always the next salient, non-repeated sentence; finally a sentence set Ŝ = {l̂_1, …, l̂_k} is obtained as the generated topic summary.
2. The target news topic summary method based on compressed space sentence selection according to claim 1, wherein: the Step1 specifically comprises the following steps:
Step1.1, crawling news data related to the target topic from major news websites and public-account platforms with web-crawler technology, to form topic clusters;
Step1.2, cleaning, denoising, and preprocessing the crawled data: removing web-page tags, advertisement links, and special symbols; removing duplicate data, foreign-language data, and traditional-Chinese text; and manually verifying the relevance of the news data to the case topic;
Step1.3, obtaining the sentence labels of the data set through manual annotation, and manually writing a reference summary for each topic cluster.
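For illustration, a minimal cleaning pass for Step1.2 might look like the sketch below (Python; the regular expressions are assumptions, and the foreign-language and traditional-Chinese filters are omitted for brevity):

```python
import re

def clean_news(raw_texts):
    """Remove web-page tags, advertisement links, and special symbols, then
    drop exact duplicates, mirroring the cleaning steps of Step1.2."""
    seen, cleaned = set(), []
    for text in raw_texts:
        text = re.sub(r"<[^>]+>", " ", text)        # web-page tags
        text = re.sub(r"https?://\S+", " ", text)   # advertisement links
        text = re.sub(r"[#*@\\|~^]+", " ", text)    # special symbols
        text = re.sub(r"\s+", " ", text).strip()
        if text and text not in seen:               # duplicate data
            seen.add(text)
            cleaned.append(text)
    return cleaned
```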
3. The target news topic summary method based on compressed space sentence selection according to claim 1, wherein: the Step2.1 specifically comprises:
Calculating the importance scores of sentences begins with calculating the scores of the keywords, i.e., their word frequencies. The set of sentences containing keywords is extracted with Python regular-expression matching, filtering out sentences that carry irrelevant information, and the word frequency of each keyword in the news document is then computed. Let num(w_i) be the number of occurrences of keyword w_i in a given news article, and let N be the sum of the occurrence counts of all words in that article; the score of keyword w_i is then SC(w_i) = num(w_i) / N;
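A sketch of this keyword-scoring step (Python; the tokenizer and sentence splitter are stand-ins, and SC(w_i) = num(w_i)/N follows the word-frequency reading given above):

```python
import re

def keyword_scores(doc: str, keywords: list) -> dict:
    """SC(w_i) = num(w_i) / N, where N is the total number of word
    occurrences in the news document."""
    tokens = re.findall(r"\w+", doc)
    total = max(len(tokens), 1)
    return {w: tokens.count(w) / total for w in keywords}

def keyword_sentences(doc: str, keywords: list) -> list:
    """Regex matching keeps only sentences containing at least one keyword,
    filtering out sentences with irrelevant information."""
    pattern = re.compile("|".join(map(re.escape, keywords)))
    return [s.strip() for s in re.split(r"[。！？.!?]", doc)
            if s.strip() and pattern.search(s)]
```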
After the keyword scores of a document are calculated, the scores of the sentences in the document can be computed; the importance score of the n-th sentence s_mn of the m-th document in the document set is denoted SC(s_mn) and is accumulated from the scores of the keywords that the sentence contains;
After the importance scores of the sentences in the document set are calculated, a round of sentence screening is performed according to these scores before document-set sentence encoding, which is carried out by a document-set encoder module based on an improved BERT model. Because the BERT model limits the encoding length, the sentences with the highest importance scores in each document are extracted before sentence encoding and combined into a new document set L for the subsequent encoding operation.
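Continuing the previous sketch, the sentence scoring and pre-encoding screening could be realized as follows (Python; the per-document quota and the accumulation of keyword scores are assumptions consistent with the description, not the patent's exact formula):

```python
def sentence_score(sentence: str, kw_scores: dict) -> float:
    """SC(s_mn): accumulate the scores of the keywords the sentence contains."""
    return sum(sc for w, sc in kw_scores.items() if w in sentence)

def screen_document_set(docs, kw_scores, per_doc=3):
    """Keep each document's highest-scoring sentences and merge them into a
    new document set L, so that L fits the BERT encoder's length limit."""
    L = []
    for sentences in docs:                      # docs: list of sentence lists
        ranked = sorted(sentences,
                        key=lambda s: sentence_score(s, kw_scores),
                        reverse=True)
        L.extend(ranked[:per_doc])
    return L
```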
CN202210449431.0A 2022-04-27 2022-04-27 Target news topic abstracting method based on compressed space sentence selection Active CN115017404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210449431.0A CN115017404B (en) 2022-04-27 2022-04-27 Target news topic abstracting method based on compressed space sentence selection


Publications (2)

Publication Number Publication Date
CN115017404A CN115017404A (en) 2022-09-06
CN115017404B (en) 2024-10-18

Family

ID=83066510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210449431.0A Active CN115017404B (en) 2022-04-27 2022-04-27 Target news topic abstracting method based on compressed space sentence selection

Country Status (1)

Country Link
CN (1) CN115017404B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687628A (en) * 2022-12-30 2023-02-03 Beijing Sohu New Media Information Technology Co., Ltd. News quality judging method, system, computer equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101330350B (en) * 2007-06-21 2011-09-14 Huawei Technologies Co., Ltd. Method for transmitting data adapting load bandwidth, receive processing method and apparatus
CN102254011A (en) * 2011-07-18 2011-11-23 哈尔滨工业大学 Method for modeling dynamic multi-document abstracts
CN110378409B (en) * 2019-07-15 2020-08-21 Kunming University of Science and Technology Chinese-Yue news document abstract generation method based on element association attention mechanism
FR3102276A1 (en) * 2019-10-17 2021-04-23 Amadeus METHODS AND SYSTEMS FOR SUMMARIZING MULTIPLE DOCUMENTS USING AN AUTOMATIC LEARNING APPROACH
CN111274816B (en) * 2020-01-15 2021-05-18 Hubei Ecarx Technology Co., Ltd. Named entity identification method based on neural network and vehicle machine
US11580975B2 (en) * 2020-06-01 2023-02-14 Salesforce.Com, Inc. Systems and methods for response selection in multi-party conversations with dynamic topic tracking
CN113468290A (en) * 2021-06-05 2021-10-01 Zhejiang Huaxun Technology Co., Ltd. Abstract extraction method based on information decomposition
CN113822076A (en) * 2021-07-12 2021-12-21 Tencent Technology (Shenzhen) Co., Ltd. Text generation method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Case-involved news topic summarization based on compressed-space sentence selection; Lu Tianxu; Communication Technology (通信技术); 2022-12-31; Vol. 55 (No. 009); 1136-1145 *


Similar Documents

Publication Publication Date Title
CN112801010B (en) Visual rich document information extraction method for actual OCR scene
CN107229668B (en) Text extraction method based on keyword matching
CN111160031A (en) Social media named entity identification method based on affix perception
CN111914062B (en) Long text question-answer pair generation system based on keywords
CN111291188B (en) Intelligent information extraction method and system
CN112035658B (en) Enterprise public opinion monitoring method based on deep learning
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN112464656A (en) Keyword extraction method and device, electronic equipment and storage medium
CN114911917B (en) Asset meta-information searching method and device, computer equipment and readable storage medium
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN115186665B (en) Semantic-based unsupervised academic keyword extraction method and equipment
CN114647715A (en) Entity recognition method based on pre-training language model
CN111159342A (en) Park text comment emotion scoring method based on machine learning
CN111814477A (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN114048354A (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN117634615A (en) Multi-task code retrieval method based on mode irrelevant comparison learning
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN112926340A (en) Semantic matching model for knowledge point positioning
CN115017404B (en) Target news topic abstracting method based on compressed space sentence selection
CN114579729B (en) FAQ question-answer matching method and system fusing multi-algorithm models
CN116010552A (en) Engineering cost data analysis system and method based on keyword word library
CN113641788B (en) Unsupervised long and short film evaluation fine granularity viewpoint mining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant