CN103853834B

CN103853834B - Text structure analysis-based Web document abstract generation method

Info

Publication number: CN103853834B
Application number: CN201410090200.0A
Authority: CN
Inventors: 沈怡涛; 顾君忠; 林晨
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2014-03-12
Filing date: 2014-03-12
Publication date: 2017-02-08
Anticipated expiration: 2034-03-12
Also published as: CN103853834A

Abstract

The invention discloses a method for generating Web document summaries based on text structure analysis. The method takes a URL as input, extracts the body text of the web page by combining visual features and text features, partitions the body text into several semantic paragraphs, and summarizes each semantic paragraph, so that the generated summary has a higher coverage rate. The method produces a good-quality text summary from a web page even though web page structure is complex, the body text is hard to identify, and Chinese automatic summarization is still at an exploratory stage.

Description

Method for generating Web document summaries based on text structure analysis

Technical field

The present invention relates to the technical fields of web page body text extraction, natural language processing, and Chinese automatic summarization, and specifically to a method for generating Web document summaries based on text structure analysis.

Background art

At present, the Internet has become the main source from which people obtain information. In recent years in particular, with the rapid development of user-generated content (UGC), the amount of information on the Internet has been growing explosively. Although a search engine can return search results in response to a user's request, the user still has to find, in the result list, the web pages that best meet his or her needs. The large amount of search engine optimization and reposting on the Internet, in particular, makes it very difficult for users to find information quickly and accurately.

An automatic summarization system uses a computer to process Web documents quickly and extracts the core content of a Web document at a given compression ratio. From the summary the user can obtain the topic information and judge the value of the Web document, which improves the efficiency of information search.

Web documents contain a large amount of noise, that is, information unrelated to the topic such as advertisements, navigation bars, user function bars, related recommendations, and copyright notices. A Web document is semi-structured information: it has a certain structure, but its semantics cannot be determined from that structure. The representation of the content in the HTML source code differs greatly from the finally rendered page. With the wide use of JS and AJAX techniques in recent years, web page data is no longer static HTML code but is generated dynamically, and can even change in response to the user's actions. Extracting the correct, topic-related content and structure from a Web document is therefore difficult.

Research on Chinese automatic summarization systems has a history of about two decades, but it is still at an exploratory stage, and the results of automatic summarization are far from satisfactory. Automatic summarization methods fall into two broad classes: understanding-based summarization and extraction-based summarization. Because natural language processing has not yet achieved an important breakthrough, understanding-based methods cannot yet truly realize automatic summarization.

Research on automatic summarization for Web documents has an even shorter history. “Compared with traditional text, the text of a web page is loosely structured, titles are named less rigorously, sentences may end without a terminating punctuation mark, and there is a large amount of content unrelated to the body text, which makes summary generation difficult.”

Summary of the invention

The object of the present invention is to provide a method for generating Web document summaries based on text structure analysis. The method combines techniques such as visual feature analysis, natural language analysis, and text structure analysis to generate, for each web page in the search results, a semantics-based web page summary of good quality for the user's reference.

The object of the present invention is achieved as follows:

A method for generating Web document summaries based on text structure analysis, comprising the following steps:

1) input the URL of the web page to be summarized;

2) extract the body text from the web page to be summarized based on visual analysis, specifically including:

2.1) parse and render the Web document using a browser core;

2.2) partition the web page into blocks using the visual tree (VIPS) algorithm, obtaining the position and area of each block;

2.3) perform word segmentation on each block;

2.4) analyze the text features of each block;

2.5) score each block on whether it contains body text;

2.6) concatenate, in order, the text of the blocks whose score is higher than a certain threshold;

2.7) output the body text of the Web document;

3) perform automatic summarization based on text structure analysis on the extracted body text, specifically including:

3.1) obtain the web page body text from step 2);

3.2) perform word segmentation and part-of-speech tagging on the body text;

3.3) preprocess the text: identify the basic structure of the text, that is, identify the article title, sentence boundaries, and paragraph divisions;

3.4) divide the body text into semantic sections, using text structure analysis to identify the positions where the semantics change as the markers for semantic section division;

3.5) for each semantic section, use an improved TF-IDF method to measure the importance of each sentence within its semantic section, and then, according to the required summary length, extract several sentences that can represent the topic of that semantic section;

3.6) concatenate the sentences in order and output the summary.

The text features in step 2.4) are word count, font size, number of declarative sentences, number of non-declarative sentences, and number of text paragraphs.

In step 2.5), the score for whether each block contains body text is calculated using the following formula:

V(S) = \frac{S^{2} \cdot P(x_{1}, y_{1}, x_{2}, y_{2})}{N + 1}

where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value calculated from the block size and position, x₁, y₁ are the coordinates of the upper-left corner of the block, and x₂, y₂ are the coordinates of the lower-right corner of the block.

In step 3.4), the positions where the semantics change are identified as follows:

3.4-1) split document D into sentences; the point between every two adjacent sentences is a candidate cut-point;

3.4-2) score each candidate cut-point using the following formula:

Q(p_{i}) = \sum_{i + 1 < j \leq i + a} R(s_{i}, s_{j}) - \sum_{i - a \leq j < i} R(s_{i}, s_{j})

where R(s_i, s_j) is the semantic relevance between sentence s_i and sentence s_j, and p_i is the cut-point between sentences s_{i-1} and s_i. If Q(p_i) > Q(p_{i-1}) and Q(p_i) > Q(p_{i+1}), then p_i is a local maximum of the cut-point weight and is therefore a cut-point between semantic sections in the text. a is an adjustable empirical parameter giving the range of the semantic analysis used when identifying a cut-point, that is, a sentences before and a sentences after the cut-point are considered.

3.4-3) if the score of a cut-point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the preceding and following cut-points, the cut-point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).

In step 3.4-2) of identifying the positions where the semantics change, the calculation of the inter-sentence semantic relevance includes the following steps:

3.4-2-1) split each sentence into a set of words;

3.4-2-2) calculate the inter-sentence semantic relevance using the following formula:

R(s_{1}, s_{2}) = \sum_{w_{i} \in s_{1}} \max_{w_{j} \in s_{2}} R(w_{i}, w_{j})

where R(w_i, w_j) is the semantic relevance between word w_i and word w_j.
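The inter-sentence relevance formula can be illustrated with a minimal Python sketch. The function names and the `exact_match` stand-in for the word-level relevance R(w_i, w_j) are assumptions made for illustration only; the invention itself derives word relevance from semantic knowledge rather than from string equality.

```python
from typing import Callable, Sequence


def exact_match(w1: str, w2: str) -> float:
    """Stand-in word relevance (assumption): 1.0 for identical words, else 0.0."""
    return 1.0 if w1 == w2 else 0.0


def sentence_relevance(
    s1: Sequence[str],
    s2: Sequence[str],
    word_rel: Callable[[str, str], float] = exact_match,
) -> float:
    """R(s1, s2): for every word in s1, add the relevance of its best match in s2."""
    if not s2:
        return 0.0
    return sum(max(word_rel(wi, wj) for wj in s2) for wi in s1)


# Sentences are assumed to be already segmented into words.
first = ["web", "page", "summary"]
second = ["summary", "of", "web", "document"]
relevance = sentence_relevance(first, second)
```

Because the outer sum runs over the words of the first sentence only, the measure is not symmetric in general: `sentence_relevance(["a", "a"], ["a"])` and `sentence_relevance(["a"], ["a", "a"])` give different values.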

In step 3.5), the importance of each sentence within its semantic section is calculated using the following formula:

V(S_{1}) = \sum_{w \in S_{1}} \mathrm{TFIDF}(w)

where, when TFIDF(w) is calculated, each paragraph is treated as an independent document and the paragraphs that make up the whole article are treated as the document set.
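A small Python sketch of this sentence score follows. The text does not spell out the exact TF-IDF variant, so the sketch assumes a common one (term frequency normalized by paragraph length, inverse document frequency as the logarithm of paragraph count over paragraph frequency); the function names and the toy article are hypothetical.

```python
import math
from collections import Counter


def paragraph_tfidf(paragraphs):
    """Return one {word: tf-idf} dict per paragraph.

    paragraphs: list of paragraphs; a paragraph is a list of sentences;
    a sentence is a list of words. Each paragraph plays the role of a
    document and the whole article is the document set.
    """
    bags = [Counter(w for sent in para for w in sent) for para in paragraphs]
    n_docs = len(bags)
    # Number of paragraphs that contain each word.
    doc_freq = Counter(w for bag in bags for w in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            w: (count / total) * math.log(n_docs / doc_freq[w])
            for w, count in bag.items()
        })
    return weights


def sentence_score(sentence, weights):
    """V(S1): sum of TFIDF(w) over the words w of the sentence."""
    return sum(weights.get(w, 0.0) for w in sentence)


# Toy article (made up): two paragraphs of pre-segmented sentences.
article = [
    [["web", "summary"], ["web", "page"]],
    [["news", "summary"]],
]
weights = paragraph_tfidf(article)
scores = [sentence_score(s, weights[0]) for s in article[0]]
```

In this toy article the word "summary" occurs in every paragraph, so its weight is zero and the second sentence of the first paragraph scores higher than the first.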

The present invention can filter out the words, links, and other content in a web page that are unrelated to the topic and identify the article body text contained in the web page, with high accuracy and good robustness. The automatic summarization flow uses automatic summarization based on text structure analysis, so the generated summary has high coverage and reads more fluently.

For a Web document and a compression ratio specified by the user, the present invention needs only the URL of the web page to be summarized as input and, within a few seconds, forms a reasonably accurate and fluent summary that covers the meaning of the original text, helping users find information on the Internet quickly and accurately.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a flow chart of web page preprocessing in the present invention;

Fig. 3 is a flow chart of automatic summarization in the present invention.

Detailed description of the embodiments

The invention discloses a search-engine-oriented method for generating Web document summaries, which can automatically analyze a Web page and generate a text summary reflecting the topic of the page.

The present invention comprises a web page body text extraction method that combines visual features and text features, and an automatic summarization method that divides the text into sub-topics by text structure analysis.

The present invention takes a URL as input and, through the two stages of web page body text extraction and automatic summarization, finally generates a text summary.

The specific algorithms of the two stages are further described below, taking the summarization of a news web page as an example:

Fig. 1 shows the overall flow from the URL to be summarized to the generated summary, which comprises a web page preprocessing flow and an automatic summarization flow.

Specifically, in the embodiment, the URL input step of the web page preprocessing flow (see Fig. 2) obtains the URL of the news web page to be summarized. By analyzing visual features, the web page preprocessing flow can locate the body text portion of a web page more accurately and is more robust than other methods. It also takes into account other features such as text features, text Controlling UEP, HTML tag features, and semantic features, further improving the accuracy of body text extraction.

The web page rendering step reads the web page corresponding to the input URL; in this embodiment, the IE11 browser core is used to process the HTML tags and render the page. On the basis of the rendered page, the visual tree analysis step applies the VIPS algorithm to perform visual tree analysis on the page, obtaining the position and area of each block. In this embodiment, this step divides the news web page to be summarized into 6 blocks: one top block, one bottom block, one navigation block, one advertisement block, and two blocks containing body text. The word segmentation step performs word segmentation on each block. The text feature analysis step then analyzes the text features of the segmentation results. Finally, the comprehensive analysis step jointly analyzes the block features obtained from the visual tree analysis and the text features, and outputs the body text.

In this embodiment, P(x₁, y₁, x₂, y₂) is calculated using the following formula:

P(x_{1}, y_{1}, x_{2}, y_{2}) = (x_{2} - x_{1}) \cdot (y_{2} - y_{1}) - x_{1} \cdot y_{1}

where x₁, y₁ are the coordinates of the upper-left corner of the block and x₂, y₂ are the coordinates of the lower-right corner of the block. The V(S) value of each block is then calculated:

V(S) = \frac{S^{2} \cdot P(x_{1}, y_{1}, x_{2}, y_{2})}{N + 1}

The V(S) values of the 6 blocks above are, from largest to smallest, 3.7 × 10⁶, 2.3 × 10⁶, 7.5 × 10⁵, 5.4 × 10⁶, 3.7 × 10⁵, 1.6 × 10⁵, and 1.2 × 10⁴.

In this embodiment the threshold used is 10⁶, so the blocks with V(S) greater than 10⁶ are chosen, that is, the two blocks with the largest V(S) values. In this embodiment these two blocks are exactly the two blocks containing body text, so the body text is extracted correctly.
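The block scoring and threshold selection described above can be sketched in Python as follows. The `Block` fields, the function names, and the three-block layout are all hypothetical and are not the news page of this embodiment; only the two formulas and the 10⁶ threshold come from the text.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One visual block; all field names are illustrative."""
    name: str
    x1: float  # upper-left corner
    y1: float
    x2: float  # lower-right corner
    y2: float
    declarative: int      # S: number of declarative sentences
    non_declarative: int  # N: number of non-declarative sentences


def position_value(b: Block) -> float:
    """P = (x2 - x1) * (y2 - y1) - x1 * y1."""
    return (b.x2 - b.x1) * (b.y2 - b.y1) - b.x1 * b.y1


def block_score(b: Block) -> float:
    """V(S) = S^2 * P / (N + 1)."""
    return b.declarative ** 2 * position_value(b) / (b.non_declarative + 1)


def select_body_blocks(blocks, threshold=1e6):
    """Keep the blocks scoring above the threshold, in their original order."""
    return [b for b in blocks if block_score(b) > threshold]


# Made-up layout in pixels.
blocks = [
    Block("navigation", 0, 0, 1000, 50, 0, 12),
    Block("body", 0, 100, 800, 900, 10, 1),
    Block("advertisement", 800, 100, 1000, 900, 1, 4),
]
body = select_body_blocks(blocks)
```

With these made-up numbers only the "body" block passes the threshold: a block with no declarative sentences scores zero, and a small block far from the upper-left corner is penalized through P.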

After the body text is extracted, the automatic summarization flow (see Fig. 3) is carried out, which comprises the steps of text preprocessing, inter-word relevance calculation, inter-sentence relevance calculation, semantic section segmentation, and summary generation.

The text preprocessing step identifies the basic structure of the text, that is, the article title, sentence boundaries, and paragraph divisions. In this embodiment the body text contains 8 paragraphs and 23 sentences in total.

The inter-word relevance calculation step is based on the computational semantic knowledge provided by HowNet: the relevance of two words is obtained by calculating the similarity of their sememes, using the following formula:

R(w_{1}, w_{2}) = \max_{C_{i} \in w_{1},\, C_{j} \in w_{2}} \mathrm{Rele}(C_{i}, C_{j})

where R(w₁, w₂) is the semantic relevance between the two words and Rele(C_i, C_j) is the relevance between two sememes; the maximum over all sememe pairs is taken to represent the semantic relevance of the two words.
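The maximum over sememe pairs can be shown with a toy Python sketch. It does not use HowNet data or any HowNet API: the `SEMEMES` dictionary, the `SEMEME_RELEVANCE` table, and all values in them are invented purely to make the formula concrete.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the sememes that describe it.
SEMEMES = {
    "car": ["vehicle"],
    "bus": ["vehicle", "public"],
    "apple": ["fruit"],
}

# Hypothetical relevance between distinct sememes; unlisted pairs count as 0.
SEMEME_RELEVANCE = {
    frozenset(("vehicle", "public")): 0.4,
    frozenset(("fruit", "public")): 0.1,
}


def sememe_relevance(c1: str, c2: str) -> float:
    """Rele(C_i, C_j): 1.0 for identical sememes, else a table lookup."""
    if c1 == c2:
        return 1.0
    return SEMEME_RELEVANCE.get(frozenset((c1, c2)), 0.0)


def word_relevance(w1: str, w2: str) -> float:
    """R(w1, w2): the largest relevance over all sememe pairs of the two words."""
    pairs = product(SEMEMES.get(w1, []), SEMEMES.get(w2, []))
    return max((sememe_relevance(a, b) for a, b in pairs), default=0.0)


car_bus = word_relevance("car", "bus")
bus_apple = word_relevance("bus", "apple")
```

Words missing from the toy lexicon get relevance 0 here; how unknown words are actually handled is not stated in the text.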

The inter-sentence relevance step obtains the relevance of two sentences by analyzing the relevance between the words in the two sentences:

R(s_{1}, s_{2}) = \sum_{w_{i} \in s_{1}} \max_{w_{j} \in s_{2}} R(w_{i}, w_{j})

where R(s₁, s₂) is the relevance between the two sentences. For each word in sentence 1, the word in sentence 2 with the greatest relevance to it is found and the relevance between these two words is calculated. Finally these maxima are summed to give the relevance between the two sentences.

The semantic section segmentation step performs text structure analysis with reference to the document "Research on Text Structure Analysis Based on Content Relevance Calculation". A cut-point between semantic sections is characterized by the first sentence after the cut-point having very low relevance to the several sentences before it and higher relevance to the several sentences after it. In this embodiment, the following formula is used to calculate the score of each of the 22 cut-points between the 23 sentences and to find the local maxima of the function Q(p_i):

Q(p_{i}) = \sum_{i + 1 < j \leq i + a} R(s_{i}, s_{j}) - \sum_{i - a \leq j < i} R(s_{i}, s_{j})

In this embodiment, Q(p_i) has 2 local maxima, and according to these two maxima the news article is divided into 3 semantic sections. Each semantic section contains one sub-topic of the news article; in this embodiment, the first semantic section is an overview of the news event, and the latter two are comments on the event from two sides.
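The cut-point scoring and the search for local maxima can be sketched in Python. The sketch uses 0-based sentence indices, follows the printed summation bounds literally (including i + 1 < j for the forward sum), clips both sums at the ends of the text, and treats a missing neighbouring cut-point as minus infinity; the clipping, the end-point handling, the function names, and the toy relevance matrix are assumptions.

```python
def cut_scores(rel, a):
    """Return {i: Q(p_i)} for the cut-point just before sentence i.

    rel[i][j] holds R(s_i, s_j); a is the window size in sentences.
    """
    n = len(rel)
    scores = {}
    for i in range(1, n):
        # Forward sum over i + 1 < j <= i + a, clipped to the text.
        after = sum(rel[i][j] for j in range(i + 2, min(i + a, n - 1) + 1))
        # Backward sum over i - a <= j < i, clipped to the text.
        before = sum(rel[i][j] for j in range(max(i - a, 0), i))
        scores[i] = after - before
    return scores


def boundaries(scores, threshold=0.0):
    """Cut-points whose score exceeds the threshold and both neighbours."""
    result = []
    for i, q in scores.items():
        left = scores.get(i - 1, float("-inf"))
        right = scores.get(i + 1, float("-inf"))
        if q > threshold and q > left and q > right:
            result.append(i)
    return result


# Toy data: six sentences, the first three on one topic, the rest on another.
topics = [0, 0, 0, 1, 1, 1]
rel = [[1.0 if ti == tj else 0.0 for tj in topics] for ti in topics]
scores = cut_scores(rel, a=2)
cuts = boundaries(scores)
```

For this toy matrix the only boundary found is the cut-point before sentence 3, where the topic changes.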

The summary generation step extracts a summary, at a given ratio, from the plain-text body according to the user's requirements.

In this embodiment, the summary generation step uses the inter-sentence relevance calculation step to compute the sum of the relevance between the sentences in each sub-topic and the word sequence of the article title, thereby determining the value of each sub-topic. The number of sentences extracted from a sub-topic is proportional to the relevance between that sub-topic and the article title.

In this embodiment the ratio specified by the user is 0.2, that is, 5 of the 23 sentences are extracted to form the summary. By calculating the values of the 3 sub-topics, it is determined that 2, 1, and 1 sentences are extracted from the 3 semantic sections respectively. Finally, the summary generation step concatenates the 5 chosen summary sentences in order to form and output the summary.
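One possible way to turn sub-topic values into per-section sentence counts is sketched below in Python. The text only states that the counts are proportional to the sub-topic values; the largest-remainder rounding, the function name, and the sub-topic values used here are assumptions and are not taken from the embodiment.

```python
def allocate(total, values):
    """Split `total` sentences across sub-topics in proportion to `values`.

    Uses largest-remainder rounding (an assumption) so the counts sum to total.
    """
    weight = sum(values)
    if weight == 0:
        return [0] * len(values)
    quotas = [total * v / weight for v in values]
    counts = [int(q) for q in quotas]
    leftover = total - sum(counts)
    # Hand the remaining sentences to the sub-topics with the largest fractions.
    order = sorted(
        range(len(values)),
        key=lambda k: quotas[k] - counts[k],
        reverse=True,
    )
    for k in order[:leftover]:
        counts[k] += 1
    return counts


summary_length = round(0.2 * 23)  # ratio times number of sentences
counts = allocate(summary_length, [3.0, 1.5, 1.5])  # made-up sub-topic values
```

Other rounding rules would give different splits for the same values, so the result should be read as one illustration of proportional allocation rather than the rule used in the embodiment.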

Claims

1. A method for generating Web document summaries based on text structure analysis, characterized in that the method comprises the following steps:

1) inputting the URL of the web page to be summarized;

2.1) parsing and rendering the Web document using a browser core;

2.2) partitioning the web page into blocks using the visual tree algorithm, obtaining the position and area of each block;

2.3) performing word segmentation on each block;

2.4) analyzing the text features of each block;

2.5) scoring each block on whether it contains body text, the score being calculated using the following formula:

V(S) = \frac{S^{2} \cdot P(x_{1}, y_{1}, x_{2}, y_{2})}{N + 1}

where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value calculated from the block size and position, x₁, y₁ are the coordinates of the upper-left corner of the block, and x₂, y₂ are the coordinates of the lower-right corner of the block;

2.7) outputting the body text of the Web document;

3.1) obtaining the web page body text from step 2);

3.2) performing word segmentation and part-of-speech tagging on the body text;

3.4) dividing the body text into semantic sections, using text structure analysis to identify the positions where the semantics change as the markers for semantic section division;

3.5) for each semantic section, using an improved TF-IDF method to measure the importance of each sentence within its semantic section, and then, according to the required summary length, extracting several sentences that can represent the topic of that semantic section;

3.6) concatenating the sentences in order and outputting the summary.

2. The method according to claim 1, characterized in that the text features in step 2.4) are word count, font size, number of declarative sentences, number of non-declarative sentences, and number of text paragraphs.

3. The method according to claim 1, characterized in that the positions where the semantics change in step 3.4) are identified as follows:

3.4-1) splitting document D into sentences, the point between every two adjacent sentences being a candidate cut-point;

3.4-2) scoring each candidate cut-point using the following formula:

Q(p_{i}) = \sum_{i + 1 < j \leq i + a} R(s_{i}, s_{j}) - \sum_{i - a \leq j < i} R(s_{i}, s_{j})

where R(s_i, s_j) is the semantic relevance between sentence s_i and sentence s_j; p_i is the cut-point between sentences s_{i-1} and s_i; if Q(p_i) > Q(p_{i-1}) and Q(p_i) > Q(p_{i+1}), then p_i is a local maximum of the cut-point weight and is therefore a cut-point between semantic sections in the text; a is an adjustable empirical parameter giving the range of the semantic analysis used when identifying a cut-point, that is, a sentences before and a sentences after the cut-point are considered;

3.4-3) if the score of a cut-point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the preceding and following cut-points, the cut-point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).

4. The method according to claim 3, characterized in that the calculation of the inter-sentence semantic relevance in step 3.4-2) comprises the following steps:

3.4-2-1) splitting each sentence into a set of words;

R(s_{1}, s_{2}) = \sum_{w_{i} \in s_{1}} \max_{w_{j} \in s_{2}} R(w_{i}, w_{j})

5. The method according to claim 1, characterized in that the importance of each sentence within its semantic section in step 3.5) is calculated using the following formula:

V(S_{1}) = \sum_{w \in S_{1}} \mathrm{TFIDF}(w)

where, when TFIDF(w) is calculated, each paragraph is treated as an independent document and the paragraphs that make up the whole article are treated as the document set.