
CN114328895A - News abstract generation method and device and computer equipment - Google Patents

Info

Publication number
CN114328895A
Authority
CN
China
Prior art keywords
sentence
abstract
news
digest
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111349428.3A
Other languages
Chinese (zh)
Inventor
喻燕君
杨洋
李锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pudong Development Bank Co Ltd filed Critical Shanghai Pudong Development Bank Co Ltd
Priority to CN202111349428.3A priority Critical patent/CN114328895A/en
Publication of CN114328895A publication Critical patent/CN114328895A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of language processing, and particularly discloses a news abstract generating method and device and computer equipment. The method for generating the news abstract comprises the following steps: preprocessing news data to obtain a single sentence text set; reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set; extracting a plurality of keywords from the single sentence text set, and sequencing the plurality of keywords according to a second preset importance sequencing algorithm to obtain a keyword set; extracting an abstract first sentence from the single sentence text set; and performing logic splicing on the basis of the first abstract candidate sentence set, the keyword set and the abstract first sentence to obtain the news abstract. The news abstract formed by the method has good generality, coherence and logicality.

Description

News abstract generation method and device and computer equipment
Technical Field
The invention relates to the technical field of language processing, in particular to a news abstract generating method and device and computer equipment.
Background
Text summarization is a common basic task in natural language processing (NLP). It refers to extracting and refining a text or a text collection through various techniques or methods in order to summarize its main idea.
Text summarization algorithms are mainly extractive or generative. An extractive method directly selects representative phrases, sentences or paragraphs from the original text to form its abstract. Conventional extractive algorithms are often designed only from the viewpoint of how well sentences generalize the text, and the resulting abstracts often lack logicality and coherence.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device, a computer readable storage medium, and a computer program product for generating a news digest.
A method for generating a news abstract comprises the following steps:
preprocessing news data to obtain a single sentence text set;
reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set;
extracting a plurality of keywords from the single sentence text set, and sequencing the plurality of keywords according to a second preset importance sequencing algorithm to obtain a keyword set;
extracting an abstract first sentence from the single sentence text set;
and performing logic splicing on the basis of the first abstract candidate sentence set, the keyword set and the abstract first sentence to obtain the news abstract.
In one embodiment, the step of preprocessing the news data to obtain a single sentence text set includes:
filtering the news data according to a preset rule;
and carrying out single sentence division on the filtered news data to obtain a plurality of single sentence texts and form a single sentence text set.
In one embodiment, the step of filtering the news data according to the preset rule is implemented by any one or more of the following manners:
filtering all levels of title data in the news data;
filtering useless information in the news data;
filtering logic words at the head of each single sentence in the news data;
and filtering preset sentence patterns in the news data.
In one embodiment, the step of reordering each sentence text in the set of sentence texts according to a first preset importance ranking algorithm to obtain a first abstract candidate sentence set includes:
establishing a graph model by using a summary extraction algorithm and taking each single sentence text in the single sentence text set as a node;
and determining importance degree index data of each single sentence text in the news data according to the relation between the single sentence texts, and sequencing the single sentence texts according to the importance degree index data to form the first abstract candidate sentence set.
In one embodiment, the step of extracting the abstract first sentence from the single sentence text set comprises:
forming a first sentence classification model;
and sequentially inputting each single sentence text in the single sentence text set to the first sentence classification model, and taking the first single sentence text which is judged as the abstract first sentence by the first sentence classification model as the extracted abstract first sentence.
In one embodiment, the step of obtaining the news digest by logically splicing the first digest candidate sentence set, the keyword set, and the digest first sentence includes:
screening abstract candidate sentences in the first abstract candidate sentence set through keywords in the keyword set to obtain a second abstract candidate sentence set;
and splicing the abstract candidate sentences in the second abstract candidate sentence set with the abstract first sentences to obtain the news abstract.
In one embodiment, the step of screening the candidate sentences in the first abstract candidate sentence set by the keywords in the keyword set to obtain a second abstract candidate sentence set includes:
sequentially extracting a preset number of target keywords from the keyword set according to the sequence of the importance degree index data from high to low;
and sequentially extracting single sentence texts containing any one target keyword from the first abstract candidate sentence set according to the sequence of the importance degree index data from high to low to form the second abstract candidate sentence set.
An apparatus for generating a news digest, comprising:
the preprocessing module is used for preprocessing news data to obtain a single sentence text set;
the first generation module is used for reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set;
the second generation module is used for extracting a plurality of keywords from the single sentence text set and sequencing the keywords according to a second preset importance sequencing algorithm to obtain a keyword set;
the extraction module is used for extracting an abstract first sentence from the single sentence text set;
and the third generation module is used for carrying out logic splicing on the basis of the first abstract candidate sentence set, the keyword set and the abstract first sentence to obtain the news abstract.
A computer device comprising a memory storing a computer program and a processor executing the steps of the method of generating a news digest described above.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method of generating a news digest.
A computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method of generating a news digest.
The above method for generating a news abstract preprocesses the news data before extracting the abstract candidate sentences, which to a certain extent avoids incoherent sentences and logic errors in the subsequent abstract. In addition to extracting the abstract candidate sentences, the method extracts keywords from the news data and extracts a suitable single sentence text as the abstract first sentence, and finally splices the news abstract by combining the keywords, the abstract candidate sentences and the abstract first sentence. Because keywords are highly general, forming the abstract in combination with the keywords compensates for the limited generality of abstracts obtained by traditional extraction. Because the abstract first sentence is extracted in advance and the abstract candidate sentences are appended directly after it, splicing the abstract becomes less difficult, and the fluency and front-to-back logic between sentences are further improved. The resulting news abstract therefore has good generality, coherence and logicality.
Drawings
Fig. 1 is a flowchart of a method for generating a news digest according to an embodiment of the present application;
Fig. 2 is a flowchart of step S100 in a method for generating a news digest according to an embodiment of the present application;
Fig. 3 is a flowchart of step S200 in a method for generating a news digest according to an embodiment of the present application;
Fig. 4 is a flowchart of step S400 in a method for generating a news digest according to an embodiment of the present application;
Fig. 5 is a flowchart of step S500 in a method for generating a news digest according to an embodiment of the present application;
Fig. 6 is a flowchart of step S510 in a method for generating a news digest according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a news digest generation apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Text summarization is a common basic task in natural language processing (NLP). It refers to extracting and refining a text or a text collection through various techniques or methods in order to summarize its main idea.
With the rapid development of the internet, people pay more and more attention to information, and each news website generates a large amount of news every day. But with it, the problem of text information overload is becoming more serious. When daily news is recommended and broadcasted for users, news contents need to be extracted and refined to obtain abstracts corresponding to texts in order to reduce the reading cost of the users.
Text summarization algorithms are divided into extractive and generative types. An extractive method directly selects representative phrases, sentences or paragraphs from the original text to form its abstract. For example, the traditional experience-based lead-3 algorithm directly takes the first three sentences of the original text as the abstract; the feature-based word-significance algorithm selects keywords in the text, then, in order of keyword weight, selects the first sentence containing each keyword and splices these sentences into an abstract; and neural-network-based algorithms use the model's understanding of text semantics to judge whether a sentence is an abstract sentence.
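As a point of comparison, the lead-3 baseline mentioned above can be sketched in a few lines. This is only an illustration: the function name `lead_3` and the punctuation-based sentence splitter are assumptions, not part of the application.

```python
import re


def lead_3(text: str) -> str:
    """Return the first three sentences of `text` as a baseline abstract."""
    # Split right after a sentence-ending mark (Western or Chinese),
    # swallowing any whitespace that follows it.
    pieces = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [p.strip() for p in pieces if p.strip()]
    return " ".join(sentences[:3])


print(lead_3("A one. B two. C three. D four."))
```

Because lead-3 ignores content entirely, it serves mainly as a reference against which importance-based selection can be judged.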
The above extraction algorithm, whether based on experience, keywords or neural network, is designed from the viewpoint of generalization of sentences to text, neglects consistency and logicality among sentences, and is easy to cause problems of logic inaccuracy, incoherence and the like.
In view of the above, the present application provides a news digest generation method, a generation apparatus, a computer device, a computer-readable storage medium, and a computer program product.
In an embodiment, a method for generating a news digest is provided, and this embodiment is illustrated by applying the method to a terminal, and it is to be understood that the method may also be applied to a server, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server.
Referring to fig. 1, in the present embodiment, the method includes the following steps:
Step S100, preprocessing news data to obtain a single sentence text set.
After the news data, that is, an edited and finalized news article, is acquired, it is first preprocessed. Preprocessing can take many forms: content that does not help form the abstract can be filtered out, obvious errors in the news data can be corrected, and so on. The specific preprocessing can be determined according to actual requirements and is not limited here.
The preprocessing process also comprises splitting the news data, namely, the news data can be split into a plurality of single sentences according to the end sign of each single sentence, so that a single sentence text set is formed.
Step S200, reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set.
After the single sentence text set is determined, the single sentence texts can be sequenced according to a first preset importance sequencing algorithm to obtain a first abstract candidate sentence set, and the single sentence texts in the first abstract candidate sentence set are sequentially sequenced according to the importance degree index data.
The first preset importance ranking algorithm generally refers to calculating importance index data of each single sentence text in a single sentence text set according to a preset rule, and ranking each single sentence text according to the importance index data of each single sentence text. The importance degree index data is used for representing the importance degree of the single sentence text in the single sentence text set. Generally, the importance index data may be set according to actual requirements, and may be a degree of correlation with other single sentence texts in a single sentence text set, or a level of occurrence frequency of words included in each single sentence text in the set may be used as the importance index data, or may be other data types, which is not specifically limited herein.
Step S300, extracting a plurality of keywords from the single sentence text set, and sequencing the plurality of keywords according to a second preset importance sequencing algorithm to obtain a keyword set.
In this embodiment, on the basis of a traditional abstract candidate sentence extraction, keywords are extracted from a single sentence text set, and the extracted keywords are ranked according to a second preset importance ranking algorithm to form a keyword set, and the keywords in the keyword set are sequentially ranked according to the importance degree.
The second preset importance ranking algorithm generally refers to calculating importance index data of each keyword in a single sentence text set according to a preset rule, and ranking each keyword according to the importance index data of each keyword. The importance indicator data is used to characterize the importance of the keyword in the single sentence text set. Generally, the importance index data may be set according to actual requirements, and may include the frequency of occurrence of the keywords in the single sentence text set, and may also be other data types, which are not specifically limited herein.
Step S400, extracting an abstract first sentence from the single sentence text set.
The first sentence of an abstract usually determines the fluency and front-to-back logic of the whole abstract, so extracting a suitable sentence from the single sentence text set to serve as the abstract first sentence can improve the fluency and logicality of the whole abstract.
Step S500, performing logic splicing based on the first abstract candidate sentence set, the keyword set and the abstract first sentence to obtain a news abstract.
After the first abstract candidate sentence set, the keyword set and the abstract first sentence are determined, logic splicing can be performed on them. Specifically, because the keywords in the keyword set reflect the gist of the news data, the first abstract candidate sentence set can be further screened and narrowed according to those keywords to obtain more general abstract candidate sentences. These candidate sentences are then spliced with the abstract first sentence, which preserves the logic between preceding and following sentences, to obtain the news abstract.
The above method for generating a news abstract preprocesses the news data before extracting the abstract candidate sentences, which to a certain extent avoids incoherent sentences and logic errors in the subsequent abstract. In addition to extracting the abstract candidate sentences, the method extracts keywords from the news data and extracts a suitable single sentence text as the abstract first sentence, and finally splices the news abstract by combining the keywords, the abstract candidate sentences and the abstract first sentence. Because keywords are highly general, forming the abstract in combination with the keywords compensates for the limited generality of abstracts obtained by traditional extraction. Because the abstract first sentence is extracted in advance and the abstract candidate sentences are appended directly after it, splicing the abstract becomes less difficult, and the fluency and front-to-back logic between sentences are further improved. The resulting news abstract therefore has good generality, coherence and logicality.
In one embodiment, referring to fig. 2, step S100, namely, preprocessing the news data to obtain a single sentence text set, includes:
Step S110, filtering the news data according to a preset rule.
That is, filtering rules can be preset to filter out content that does not help form the abstract. For example, titles at all levels in the news data can be extracted and filtered; content irrelevant to the news event can be filtered; sentences obviously unsuitable for an abstract, such as questions, exclamations and subtitles, can be filtered; and logical connectives contained in sentences, such as transition words and continuation words, can be filtered, which reduces front-to-back logic errors in the subsequent logic splicing. These are only some examples of preprocessing; the preprocessing is not limited to them and may be determined as needed.
Step S120, performing single sentence division on the filtered news data to obtain a plurality of single sentence texts, and forming a single sentence text set.
After the news data has been filtered, single sentence division can be performed. Specifically, the end mark of each sentence, such as a period, question mark, exclamation mark or ellipsis, can be identified, and the sentences are divided according to the end marks to form the single sentence text set.
After a plurality of single sentence texts are obtained, word segmentation processing can be performed on each single sentence text by using a word segmentation tool.
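The single sentence division of step S120 can be sketched as follows. This is a minimal illustration under stated assumptions: the set of end marks in `END_MARKS` and the function name `split_sentences` are hypothetical, and a run of consecutive end marks (for example a Chinese ellipsis written as two characters) is treated as one sentence ending.

```python
END_MARKS = "。！？!?.…"  # assumed set of sentence end marks


def split_sentences(text: str) -> list[str]:
    """Split text into single sentences at end marks, keeping the marks."""
    sentences = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        if text[i] in END_MARKS:
            # Absorb a run of consecutive end marks such as "……" or "?!".
            while i + 1 < n and text[i + 1] in END_MARKS:
                i += 1
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
        i += 1
    # Keep any trailing text that has no end mark.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Rates rose. Why? Markets fell!"))
```

Word segmentation of each resulting sentence would follow as a separate step with whatever segmentation tool is in use.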
In one embodiment, step S110, namely, the step of filtering the news data according to the preset rule, is implemented by any one or more of the following manners:
(1) Filtering the titles at all levels in the news data. To give the text a clearer structure, news data often contains titles in various forms. Title data is not suitable as abstract content, so the titles can be filtered. Because title data is characterized by short text length, the absence of an end mark, and a leading sequence number, it can be extracted and filtered according to these characteristics.
(2) Filtering useless information in the news data. News data usually contains a lot of information about editors, media outlets, prompts and the like. This information is irrelevant to the news event itself and can be filtered as useless information.
(3) Filtering the logic words at the head of each single sentence in the news data. The logic words may include transition words appearing at the beginning of a single sentence, for example "but" and "however", and conjunctions, for example "therefore", "but", "still" and "next". Such words easily cause logic errors between preceding and following sentences in the subsequently spliced abstract, so the logic words at the head of each single sentence are filtered during preprocessing.
(4) Filtering preset sentence patterns in the news data. Because the length of the abstract is limited and an abstract consists mainly of declarative sentences, sentence patterns such as subtitles, exclamatory sentences and interrogative sentences in the news data are not suitable for the abstract. These can be set as the preset sentence patterns to be filtered, and filtering is performed according to the end marks. Subtitles can be identified by whether an end mark is present: when no end mark is identified, the text is judged to be title data and is filtered.
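The four filtering rules above might be combined as in the following sketch. Everything concrete here is an assumption made for illustration: the word list `LOGIC_WORDS`, the pattern `USELESS_PATTERN`, the function names, and the simplification that the input has already been cut into lines or sentence-like segments.

```python
import re

END_MARKS = "。！？!?.…"
QUESTION_OR_EXCLAMATION = "？！?!"
# Hypothetical word list and pattern, chosen only for this example.
LOGIC_WORDS = ("however", "but", "therefore", "still", "next")
USELESS_PATTERN = re.compile(r"^(editor|source|reporter)\s*:", re.IGNORECASE)


def strip_logic_word(sentence: str) -> str:
    """Rule (3): drop a logic word at the head of a sentence."""
    lowered = sentence.lower()
    for word in LOGIC_WORDS:
        if lowered.startswith(word):
            rest = sentence[len(word):]
            # Require a separator so "button" is not mistaken for "but".
            if rest[:1] in (",", " "):
                rest = rest.lstrip(", ")
                return rest[:1].upper() + rest[1:]
    return sentence


def filter_segments(segments: list[str]) -> list[str]:
    """Apply rules (1), (2) and (4), then rule (3) to what remains."""
    kept = []
    for segment in segments:
        text = segment.strip()
        # Rules (1)/(4): titles and subtitles carry no end mark.
        if not text or text[-1] not in END_MARKS:
            continue
        # Rule (2): editor/source lines are unrelated to the news event.
        if USELESS_PATTERN.match(text):
            continue
        # Rule (4): questions and exclamations are unsuitable for an abstract.
        if text[-1] in QUESTION_OR_EXCLAMATION:
            continue
        kept.append(strip_logic_word(text))
    return kept


print(filter_segments([
    "1. Market overview",
    "Editor: Zhang San.",
    "However, rates rose on Monday.",
    "Why did they rise?",
    "Banks reported gains.",
]))
```

A production rule set would need language-specific word lists and more robust title detection (length and sequence-number checks), which this sketch leaves out.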
In one embodiment, referring to fig. 3, in step S200, the step of reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set includes:
Step S210, establishing a graph model by using a summary extraction algorithm, with each single sentence text in the single sentence text set as a node.
Step S220, determining importance index data of each single sentence text in news data according to the relation between the single sentence texts, and sequencing the single sentence texts according to the importance index data to form a first abstract candidate sentence set.
In this embodiment, the TextRank algorithm may be used to use each single-sentence text in the single-sentence text set as a node, establish a graph model, determine importance index data of each single-sentence text in news data by using a relationship between the single-sentence texts, and sort the single-sentence texts in the single-sentence text set according to the importance index data of each single-sentence text.
Specifically, the single sentence text set $\langle S_1, S_2, \ldots, S_n \rangle$ is represented as a directed weighted graph $G = (V, E)$, consisting of a point set $V$ and an edge set $E$. The point set is the sentence set, and every two sentences are connected by an edge. The weight between any two points $V_i$ and $V_j$ is $w_{ij}$. For a given point $V_i$, $In(V_i)$ is the set of points pointing to $V_i$, and $Out(V_i)$ is the set of points that $V_i$ points to. The score of point $V_i$, written $WS(V_i)$, is calculated with the standard TextRank update:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where $d$ is a damping coefficient with a value ranging from 0 to 1. The scores of all points are initialized to the same value, and the weight between two points is the sentence similarity between them. After initialization, the score of each point is calculated iteratively until convergence, at which point the scores are stable; the stabilized score is the importance degree index data. The size of the score determines how important the sentence is in the overall news data. Sorting the sentences by score yields $\langle S_1', S_2', \ldots, S_n' \rangle$, i.e., the first abstract candidate sentence set.
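The iterative scoring described above can be sketched in plain Python. Several details are assumptions not fixed by the text: sentences are taken to be already segmented into word lists, the sentence similarity is the word-overlap measure from the original TextRank paper, and the damping value 0.85, the tolerance and the iteration cap are conventional defaults rather than values from the application.

```python
import math


def similarity(a: list[str], b: list[str]) -> float:
    """Word-overlap similarity, normalised by log sentence lengths (assumed)."""
    if len(a) < 2 or len(b) < 2:  # log(1) = 0 would make the denominator 0
        return 0.0
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences: list[list[str]], d: float = 0.85,
             tol: float = 1e-6, max_iter: int = 200) -> list[float]:
    """Return one importance score per sentence."""
    n = len(sentences)
    if n == 0:
        return []
    # w[j][i] is the weight of the edge from sentence j to sentence i.
    w = [[similarity(sentences[j], sentences[i]) if i != j else 0.0
          for i in range(n)] for j in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n  # identical initial score for every point
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - d) + d * rank)
        converged = max(abs(x - y) for x, y in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


def rank_sentences(sentences: list[list[str]]) -> list[int]:
    """Return sentence indices ordered from most to least important."""
    scores = textrank(sentences)
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)


docs = [
    ["bank", "raises", "rates", "today"],
    ["bank", "rates", "rise", "again"],
    ["weather", "is", "sunny", "today"],
]
print(rank_sentences(docs))
```

In this toy input the first sentence shares words with both of the others, so it receives the highest score; any other similarity measure (for example cosine similarity of sentence vectors) could be substituted in `similarity`.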
In one embodiment, in step S300, that is, the step of extracting a plurality of keywords from the single sentence text set and ranking them according to a second preset importance ranking algorithm to obtain a keyword set, the keywords may be extracted by using the TF-IDF (term frequency-inverse document frequency) algorithm.
Specifically, the core idea of the TF-IDF algorithm is to assign corresponding weight values to different words according to the occurrence frequency from the viewpoint of a statistical method. This method can be used to assess how important a word is in a particular corpus. The method comprises the following steps:
For the single sentence text set, $TF_{i,j}$ denotes the term frequency of the $i$-th word $t_i$ in the $j$-th document after the data set has been segmented into words. It is calculated as follows:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of times the $i$-th word $t_i$ appears in the $j$-th document, and the denominator is the total number of occurrences of all words in the $j$-th document.
The inverse document frequency measures the general importance of a term. For the word $t_i$ above, $IDF_i$ is the logarithm of the total number of documents divided by the number of documents containing the term:
$$IDF_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}$$
where $|D|$ is the total number of documents and the denominator is the number of documents, among all documents, that contain the term $t_i$.
The term frequency-inverse document frequency index can then be expressed as:
$$TFIDF_{i,j} = TF_{i,j} \times IDF_i$$
Finally, all words are sorted by TF-IDF value to obtain the importance ranking of all words in the single sentence text set, $\langle W_1, W_2, \ldots, W_m \rangle$, i.e., the keyword set.
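A small sketch of the TF-IDF ranking follows. Two points are assumptions, since the text does not fix them: each single sentence text is treated as one document, and a word that occurs in several documents is ranked by its highest TF-IDF value. The function name `tfidf_keywords` is likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs: list[list[str]]) -> list[str]:
    """Rank all words by TF-IDF, highest first (ties broken alphabetically)."""
    n_docs = len(docs)
    # Number of documents that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    best: dict[str, float] = {}
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        for word, n in counts.items():
            tf = n / total                          # TF_{i,j}
            idf = math.log(n_docs / doc_freq[word])  # IDF_i
            # Assumption: keep the word's best score over all documents.
            best[word] = max(best.get(word, 0.0), tf * idf)
    return sorted(best, key=lambda word: (-best[word], word))


docs = [["bank", "bank", "rates"], ["rates", "rise"], ["rates", "fall"]]
print(tfidf_keywords(docs))
```

Here "rates" appears in every document, so its IDF is zero and it falls to the bottom of the ranking, while "bank", frequent in one document only, rises to the top.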
In one embodiment, referring to fig. 4, step S400, extracting the first sentence of the abstract from the single sentence text set, includes:
Step S410, forming a first sentence classification model.
Step S420, sequentially inputting the single sentence texts in the single sentence text set to the first sentence classification model, and taking the first single sentence text judged by the model to be an abstract first sentence as the extracted abstract first sentence.
Generally, single sentence texts suitable as the abstract first sentence share consistent lexical, sentence-pattern and structural features: lexically, they generally contain no pronouns with unclear referents; semantically, they do not depend on the previous sentence through continuation, transition and the like; and in sentence pattern, interrogative and similar forms are not suitable as the abstract first sentence. In addition, news writing places important information in a prominent part of the article, so the first few sentences of most news data contain a suitable abstract first sentence. Only the first few sentences of the news data therefore need to be labeled when labeling positive samples, which effectively reduces the data labeling cost. In view of this, a first sentence classification model can be obtained through training and used to screen out the first sentence in the news data that is suitable as the abstract first sentence.
In this embodiment, a fastText model may be adopted, which has only three layers: an input layer, a hidden layer, and an output layer. The input of the model is the character vectors and n-gram vectors of the sentence, and the output is the category to which the sentence belongs. The first half of the model, from the input layer to the hidden layer, mainly generates the vector representation of the sentence; the second half, from the hidden layer to the output layer, mainly performs classification by means of hierarchical softmax. In classification tasks, the model offers a good balance between accuracy and training time.
For any sentence in the single sentence text set, assume the number of input feature vectors is N. The input vectors can be obtained by word2vec or by other training methods. The feature vectors are mapped to an intermediate layer through a linear transformation to form the vector representation of the sentence, which is finally mapped to a label through a softmax layer. The sentences in the single sentence text set are input into the model in sequence in this way, and the first sentence that the model judges to be an abstract first sentence, $\langle S_i \rangle$, is taken as the abstract first sentence.
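The selection logic of step S420 is independent of the classifier itself and can be sketched with any callable that returns a yes/no decision. In the sketch below, `rule_based_stub` is a hypothetical stand-in for the trained first sentence classification model, included only so the example runs; it is not the fastText model and its word list is invented for illustration.

```python
from typing import Callable, Optional

# Hypothetical sentence openers that signal dependence on earlier context.
UNCLEAR_STARTS = ("this", "that", "he", "she", "it", "they", "however", "but")


def rule_based_stub(sentence: str) -> bool:
    """Stand-in classifier: reject unclear openers, questions, exclamations."""
    words = sentence.lower().split()
    if not words:
        return False
    if words[0].strip(",") in UNCLEAR_STARTS:
        return False
    return not sentence.rstrip().endswith(("?", "!"))


def pick_first_sentence(sentences: list[str],
                        is_first_sentence: Callable[[str], bool]) -> Optional[str]:
    """Return the first sentence, in document order, that the classifier accepts."""
    for sentence in sentences:
        if is_first_sentence(sentence):
            return sentence
    return None  # no sentence was judged suitable


sentences = [
    "It rose sharply.",
    "Why?",
    "The central bank raised rates on Monday.",
    "Markets fell.",
]
print(pick_first_sentence(sentences, rule_based_stub))
```

Passing the classifier in as a parameter keeps the selection loop unchanged when the stub is replaced by a trained model's prediction function.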
In one embodiment, referring to fig. 5, in step S500, the step of performing logical concatenation based on the first abstract candidate sentence set, the keyword set, and the abstract first sentence to obtain the news abstract includes:
step S510, a second abstract candidate sentence set is obtained by screening the abstract candidate sentences in the first abstract candidate sentence set through the keywords in the keyword set.
After the first abstract candidate sentence set is obtained, the abstract candidate sentences in it can be further screened by the keywords, so that the abstract candidate sentences that are highly relevant to the news event are obtained and form the second abstract candidate sentence set. This improves the coverage of news information by the subsequently formed abstract. Specifically, sentences containing keywords can be screened out from the first abstract candidate sentence set and used as the abstract candidate sentences that are subsequently spliced with the abstract first sentence.
And step S520, splicing the abstract candidate sentences in the second abstract candidate sentence set with the abstract first sentences to obtain news abstracts.
Once the second abstract candidate sentence set is determined, the abstract candidate sentences in it can be spliced with the abstract first sentence to obtain the news abstract. In practical application, the length of the abstract affects its quality: if the abstract is too long, the order of and logical relationship between sentences are difficult to control; if it is too short, the news subject is difficult to cover. An appropriate number of abstract candidate sentences therefore needs to be selected for splicing with the abstract first sentence. Specifically, the two sentences of highest importance may be selected from the second abstract candidate sentence set as the splicing objects, or three or four sentences may be selected, as determined by actual requirements.
In one embodiment, referring to fig. 6, in step S510, the step of filtering the abstract candidate sentences in the first abstract candidate sentence set by the keywords in the keyword set to obtain a second abstract candidate sentence set includes:
and step S511, sequentially extracting a preset number of target keywords from the keyword set according to the sequence of the importance degree index data from high to low.
Suppose that the keyword set is <W1, W2, ..., Wm>, where W1 to Wm are ranked from high to low according to the importance index data of each keyword; the top 5 keywords can then be selected as the target keywords. Of course, the number of extracted keywords is not fixed and may be 4, 6, 7, 10 and so on, depending on actual requirements. For example, when the information amount of the news data is large, the number of target keywords can be increased to improve the coverage of key information by the abstract; when the information amount of the news data is small, the number of target keywords can be reduced to simplify the abstract generation process.
It should be noted that several keywords may have identical importance index data. For example, suppose the maximum value of the importance index data is a and the preset number is 5. If 4 keywords have importance index data a, those 4 keywords are extracted together with the one keyword ranked immediately after them. If exactly 5 keywords have importance index data a, only those 5 keywords are extracted. If 6 keywords have importance index data a, 5 of them may be extracted at random. That is, once the preset number is determined, only that number of keywords needs to be extracted.
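The tie handling described above might be sketched as follows. The function name `select_target_keywords`, the `(keyword, score)` input format and the injectable random generator are assumptions made for illustration; the description specifies only that exactly the preset number of keywords is extracted and that ties at the cutoff may be broken at random.

```python
import random
from typing import List, Optional, Sequence, Tuple


def select_target_keywords(
    scored: Sequence[Tuple[str, float]],
    preset_number: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return exactly `preset_number` keywords, highest importance first."""
    if preset_number <= 0:
        return []
    rng = rng or random.Random()
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if len(ranked) <= preset_number:
        return [word for word, _ in ranked]
    cutoff = ranked[preset_number - 1][1]  # score of the last admitted rank
    above = [word for word, score in ranked if score > cutoff]
    tied = [word for word, score in ranked if score == cutoff]
    # Keywords strictly above the cutoff are always kept; the remaining
    # slots are filled by drawing at random among those tied at the cutoff.
    return above + rng.sample(tied, preset_number - len(above))
```

Passing a seeded `random.Random` makes the random draw reproducible, which can be convenient when the generated abstracts need to be compared across runs.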
And S512, sequentially extracting single sentence texts containing any one target keyword from the first abstract candidate sentence set according to the sequence of the importance degree index data from high to low to form a second abstract candidate sentence set.
After the target keywords are obtained, the abstract candidate sentences in the first abstract candidate sentence set can be screened according to the target keywords to obtain abstract candidate sentences containing key news information. Specifically, the abstract candidate sentences in the first abstract candidate sentence set may be examined in order of importance index data from high to low, and those containing a target keyword are extracted. For example, suppose there are 5 target keywords and the first abstract candidate sentence set is <S1′, S2′, ..., Sn′>, where S1′ to Sn′ are sorted from high to low according to the importance index data of each single sentence text. The first 5 abstract candidate sentences in <S1′, S2′, ..., Sn′> that each contain any one of the 5 target keywords are screened out and combined to form a second abstract candidate sentence set <S1″, S2″, ..., S5″>. The abstract first sentence <Si> and the second abstract candidate sentence set <S1″, S2″, ..., S5″> are then spliced with the abstract first sentence in the leading position, and duplicates are removed. In practical application, considering that the abstract should not be too long, a small number of abstract candidate sentences, for example two or three, can be extracted from the second abstract candidate sentence set <S1″, S2″, ..., S5″> and spliced with the abstract first sentence to control the length of the abstract.
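Steps S512 and S520 can be illustrated with the following minimal sketch. It assumes that "containing a target keyword" means plain substring matching and that duplicates are exact string matches; the function names, the `limit` and `max_candidates` parameters and the separator are hypothetical.

```python
from typing import List, Sequence


def screen_candidates(
    ranked_sentences: Sequence[str],
    target_keywords: Sequence[str],
    limit: int,
) -> List[str]:
    """Walk the first candidate set in importance order and keep the first
    `limit` sentences that contain any target keyword."""
    selected: List[str] = []
    if limit <= 0:
        return selected
    for sentence in ranked_sentences:
        if any(keyword in sentence for keyword in target_keywords):
            selected.append(sentence)
            if len(selected) == limit:
                break
    return selected


def splice_summary(
    first_sentence: str,
    candidates: Sequence[str],
    max_candidates: int = 2,
    separator: str = " ",
) -> str:
    """Put the abstract first sentence in the leading position, then append up
    to `max_candidates` candidates, skipping exact duplicates."""
    parts = [first_sentence]
    for sentence in candidates:
        if len(parts) - 1 >= max_candidates:
            break
        if sentence not in parts:
            parts.append(sentence)
    return separator.join(parts)
```

For Chinese news text an empty separator would likely be more appropriate than a space; the default here is chosen only so that the English example reads naturally.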
It should be noted that the method for generating a news abstract provided in this embodiment not only gives the formed abstract good generality, coherence and logicality, but also requires classified data annotation only for first sentence classification. This reduces the original chapter-level annotation requirement to a sentence-level classification annotation requirement, and thus balances abstract quality against data annotation cost.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are displayed in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times; the execution order of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a device for generating a news abstract, which is used for implementing the above-mentioned method for generating a news abstract. The implementation scheme for solving the problem provided by the device for generating the news abstract is similar to the implementation scheme recorded in the method, so specific limitations in the following embodiment of one or more devices for generating the news abstract can be referred to the limitations on the method for generating the news abstract, and details are not repeated here.
Referring to fig. 7, the apparatus for generating a news digest provided in this embodiment includes a preprocessing module 100, a first generating module 200, a second generating module 300, an extracting module 400, and a third generating module 500. Wherein:
the preprocessing module 100 is configured to preprocess news data to obtain a single sentence text set;
the first generating module 200 is configured to reorder the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set;
the second generating module 300 is configured to extract a plurality of keywords from the single sentence text set, and sort the plurality of keywords according to a second preset importance sorting algorithm to obtain a keyword set;
the extraction module 400 is configured to extract a first abstract sentence from the single-sentence text set;
the third generating module 500 is configured to perform logic concatenation on the first abstract candidate sentence set, the keyword set, and the abstract first sentence to obtain a news abstract.
The device for generating the news abstract preprocesses the news data before the abstract candidate sentences are extracted, which avoids sentence incoherence and logic errors in the subsequent abstract to a certain extent. In addition to extracting the abstract candidate sentences, the device extracts keywords from the news data and extracts a suitable single sentence text as the abstract first sentence, and finally splices the news abstract from the keywords, the abstract candidate sentences and the abstract first sentence. Because the keywords are highly general, forming the abstract in combination with the keywords compensates for the limited generality of an abstract obtained by the traditional extraction approach. Because the abstract first sentence is extracted in advance and the abstract candidate sentences are appended directly after it, the splicing of the abstract is simplified, and the fluency of and logical relationship between sentences are further improved. The formed news abstract therefore has good generality, coherence and logicality.
In one embodiment, the preprocessing module is configured to: extracting and filtering the news data according to a preset rule; and carrying out single sentence division on the extracted and filtered news data to obtain a plurality of single sentence texts and form a single sentence text set.
In one embodiment, the pre-processing module performs the extraction and filtering of the news data by any one or more of the following methods:
extracting all levels of title data in news data;
filtering useless information in news data;
filtering logic words at the head of each single sentence in the news data;
and filtering preset sentence patterns in the news data.
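Two of the filtering operations listed above, together with single sentence division, might be sketched as follows. The logic-word list, the choice of questions as the preset sentence pattern to drop, and the punctuation-based splitting rule are all assumptions made for illustration; the embodiments do not fix these rules.

```python
import re
from typing import List

# Hypothetical head-of-sentence logic words.
LEADING_LOGIC_WORDS = ("However", "Therefore", "In addition", "Meanwhile", "But")
_LEADING = re.compile(
    r"^(?:" + "|".join(re.escape(w) for w in LEADING_LOGIC_WORDS) + r")\b[\s,，]*"
)
# A sentence is taken to be a run of text up to and including one terminator.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")
# Hypothetical preset sentence pattern to filter out: questions.
_DROP = re.compile(r"[?？]$")


def split_sentences(text: str) -> List[str]:
    """Divide text into single sentence texts at sentence-ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def strip_leading_logic_word(sentence: str) -> str:
    """Remove a logic word at the head of a sentence and re-capitalize."""
    rest = _LEADING.sub("", sentence, count=1)
    if rest == sentence:
        return sentence  # no leading logic word was present
    return rest[:1].upper() + rest[1:]


def preprocess(text: str) -> List[str]:
    """Split, strip leading logic words, and drop the preset sentence pattern."""
    cleaned = (strip_leading_logic_word(s) for s in split_sentences(text))
    return [s for s in cleaned if s and not _DROP.search(s)]
```

The word boundary in the pattern keeps words such as "Button" from being mistaken for the logic word "But". Splitting on every period is a simplification and would, for instance, break decimal numbers apart.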
In one embodiment, the first generating module is configured to: establish a graph model by using a summary extraction algorithm, with each single sentence text in the single sentence text set as a node; determine importance degree index data of each single sentence text in the news data according to the relations among the single sentence texts; and sort the single sentence texts according to the importance degree index data to form a first abstract candidate sentence set.
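One way to realize such a graph model is a TextRank-style ranking, sketched below. This passage states only that sentences are nodes and that importance is derived from the relations among them; the word-overlap similarity, the damping factor of 0.85 and the fixed number of iterations are assumptions of this sketch rather than features confirmed here.

```python
import math
import re
from typing import List, Sequence, Tuple


def similarity(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences (an assumed relation)."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoid a zero denominator for one-word sentences
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def rank_sentences(
    sentences: Sequence[str], damping: float = 0.85, iterations: int = 50
) -> List[Tuple[str, float]]:
    """Score each sentence node by weighted PageRank and sort high to low."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [(sentences[i], scores[i]) for i in order]
```

The returned scores play the role of the importance degree index data, and the returned order corresponds to the first abstract candidate sentence set. Whitespace-free languages such as Chinese would need a word segmenter in place of the regular-expression tokenizer used here.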
In one embodiment, the extraction module is configured to: forming a first sentence classification model; and sequentially inputting each single sentence text in the single sentence text set into the first sentence classification model, and taking the single sentence text which is judged as the abstract first sentence by the first sentence classification model as the extracted abstract first sentence.
In one embodiment, the third generating module is configured to: sequentially extracting a preset number of target keywords from the keyword set according to the sequence of the importance degree index data from high to low; according to the sequence of the importance degree index data from high to low, single sentence texts containing any one target keyword are sequentially extracted from the first abstract candidate sentence set to form a second abstract candidate sentence set; and splicing the single sentence text in the second abstract candidate sentence set with the abstract first sentence to obtain the news abstract.
The modules in the above-mentioned news digest generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, and can also be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device may be a server, and an internal structural diagram of the computer device may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing various data related to the generation method of the news abstract. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of generating a news digest.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (11)

1. A method for generating a news digest is characterized by comprising the following steps:
preprocessing news data to obtain a single sentence text set;
reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set;
extracting a plurality of keywords from the single sentence text set, and sequencing the plurality of keywords according to a second preset importance sequencing algorithm to obtain a keyword set;
extracting an abstract first sentence from the single sentence text set;
and performing logic splicing on the basis of the first abstract candidate sentence set, the keyword set and the abstract first sentence to obtain the news abstract.
2. The method for generating a news digest of claim 1, wherein the step of preprocessing the news data to obtain a set of single-sentence texts comprises:
filtering the news data according to a preset rule;
and carrying out single sentence division on the filtered news data to obtain a plurality of single sentence texts and form a single sentence text set.
3. The method for generating a news digest of claim 2, wherein the step of filtering the news data according to the preset rule is implemented by any one or more of the following methods:
filtering all levels of title data in the news data;
filtering useless information in the news data;
filtering logic words at the head of each single sentence in the news data;
and filtering preset sentence patterns in the news data.
4. The method for generating a news digest of claim 1, wherein the step of reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set comprises:
establishing a graph model by using a summary extraction algorithm and taking each single sentence text in the single sentence text set as a node;
and determining importance degree index data of each single sentence text in the news data according to the relation between the single sentence texts, and sequencing the single sentence texts according to the importance degree index data to form the first abstract candidate sentence set.
5. The method for generating a news digest of claim 1, wherein the step of extracting the abstract first sentence from the single sentence text set comprises:
forming a first sentence classification model;
and sequentially inputting each single sentence text in the single sentence text set to the first sentence classification model, and taking the first single sentence text which is judged as the abstract first sentence by the first sentence classification model as the extracted abstract first sentence.
6. The method for generating a news digest according to claim 1, wherein the step of obtaining the news digest by logically concatenating the first digest candidate sentence set, the keyword set, and the digest first sentence includes:
screening abstract candidate sentences in the first abstract candidate sentence set through keywords in the keyword set to obtain a second abstract candidate sentence set;
and splicing the abstract candidate sentences in the second abstract candidate sentence set with the abstract first sentences to obtain the news abstract.
7. The method for generating a news digest of claim 6, wherein the step of filtering the digest candidate sentences in the first digest candidate sentence set by the keywords in the keyword set to obtain a second digest candidate sentence set comprises:
sequentially extracting a preset number of target keywords from the keyword set according to the sequence of the importance degree index data from high to low;
and sequentially extracting single sentence texts containing any one target keyword from the first abstract candidate sentence set according to the sequence of the importance degree index data from high to low to form the second abstract candidate sentence set.
8. An apparatus for generating a news digest, comprising:
the preprocessing module is used for preprocessing news data to obtain a single sentence text set;
the first generation module is used for reordering the single sentence texts in the single sentence text set according to a first preset importance ordering algorithm to obtain a first abstract candidate sentence set;
the second generation module is used for extracting a plurality of keywords from the single sentence text set and sequencing the keywords according to a second preset importance sequencing algorithm to obtain a keyword set;
the extraction module is used for extracting an abstract first sentence from the single sentence text set;
and the third generation module is used for carrying out logic splicing on the basis of the first abstract candidate sentence set, the keyword set and the abstract first sentence to obtain the news abstract.
9. A computer device comprising a memory and a processor, said memory storing a computer program, characterized in that said processor, when executing said computer program, implements the steps of the method for generating a news digest according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of generating a news digest of any one of claims 1 to 7.
11. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method for generating a news digest of any one of claims 1 to 7.
CN202111349428.3A 2021-11-15 2021-11-15 News abstract generation method and device and computer equipment Pending CN114328895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111349428.3A CN114328895A (en) 2021-11-15 2021-11-15 News abstract generation method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111349428.3A CN114328895A (en) 2021-11-15 2021-11-15 News abstract generation method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN114328895A true CN114328895A (en) 2022-04-12

Family

ID=81044941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111349428.3A Pending CN114328895A (en) 2021-11-15 2021-11-15 News abstract generation method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN114328895A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501861A (en) * 2023-06-25 2023-07-28 知呱呱(天津)大数据技术有限公司 Long text abstract generation method based on hierarchical BERT model and label migration

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501861A (en) * 2023-06-25 2023-07-28 知呱呱(天津)大数据技术有限公司 Long text abstract generation method based on hierarchical BERT model and label migration
CN116501861B (en) * 2023-06-25 2023-09-22 知呱呱(天津)大数据技术有限公司 Long text abstract generation method based on hierarchical BERT model and label migration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination