CN103530429B - Webpage content extracting method

Info

Publication number: CN103530429B
Application number: CN201310538575.4A
Authority: CN
Inventors: 涂波
Original assignee: Beijing Zhongsou Network Technology Co ltd
Current assignee: Beijing Zhongsou Cloud Business Network Technology Co ltd
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2017-01-18
Anticipated expiration: 2033-11-04
Also published as: CN103530429A

Abstract

The invention provides a method for extracting web page body text. The method comprises the following steps: I, preprocessing the web page; II, finding the longest string in the web page; III, creating a DOM tree and locating the node that corresponds to the longest string in the DOM tree; IV, determining a start node and an end node according to the tag of the node corresponding to the longest string; V, checking and filtering the start node and the end node; and VI, outputting the text in the filtered start node and end node. The method addresses the shortcomings of template-based and block-based techniques in news content extraction by finding a seed paragraph from the longest string, and improves the efficiency and accuracy of web page content extraction.

Description

A method for extracting web page body text

Technical field

The present invention relates to a method in the computer field, and in particular to a method that extracts the body content of a news web page by finding the "longest string" and using it to locate landmark nodes.

Background art

In the field of news (or information) search, body text extraction is an indispensable step; the quality of the extraction determines the quality of the news search and the user experience.

Body text extraction methods take many forms. At present they fall into two broad classes according to whether a template is used: template-based (or wrapper-based) extraction and template-free extraction.

Template-based extraction: a template is defined first, and then code parses and executes the template to obtain the data. Depending on how the template is generated, this class divides further into manual template extraction and automatic template extraction. In manual template extraction, a template is hand-written for each target site; the template may use regular-expression matching or simple string matching on head and tail markers. In automatic template extraction, a machine learning algorithm first fetches a portion of the target site's pages for training to obtain a template, and a program then uses the template to extract data.

Template-free extraction is mostly based on statistics and learning. The main algorithms at present are rule-based, block-based, vision-based, and so on. A representative example is Microsoft's vision-based page segmentation algorithm, which determines the main semantic blocks of a web page in three steps: page block extraction, separator extraction, and semantic block reconstruction.

The drawback of hand-written templates is that writing them consumes enormous human resources, and as target sites change, the cost of maintaining the templates is also high. The drawback of automatic templates is that the algorithms are complex and the target sites still need to be monitored periodically to keep the templates up to date. Whether a template is produced manually or automatically, the approach assumes that the site's data is generated from a template. For some large sites this is basically not a problem, although different entry points may use different templates. For the many small and medium-sized sites, however, templating is poorly done, so template-based extraction can recover only most of the information and is more likely to include junk. The vision-based page segmentation algorithm has complex rules and low performance, and is not suitable for news search engines.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides an effective method for extracting Internet news content. To address the shortcomings of template-based and block-based approaches in news content extraction, the method finds a seed paragraph based on the "longest string" and extracts the news content with a tag clustering algorithm, avoiding the drawbacks of hand-written rules and templates.

The solution adopted to achieve the above purpose is:

A method for extracting web page body text, the improvement being that the method comprises the following steps:

i, preprocessing the web page;

ii, finding the longest string in the web page;

iii, creating a DOM tree and locating the node corresponding to the longest string in the DOM tree;

iv, determining a start node and an end node according to the tag of the node corresponding to the longest string;

v, checking and filtering the start node and the end node;

vi, outputting the text in the filtered start node and end node.

Further, step i includes: determining whether the web page contains ignorable tags, namely "comment", "script", and "meta"; and obtaining and deleting the ignorable tags together with their content.

Further, step ii comprises the following steps: 1) finding the longest string in the web page by scanning the web page text line by line;

2) obtaining and recording the length of the longest string, and processing the longest string further: when the longest string lies inside a specific tag, the obtained length is increased or decreased.

Further, in step iii, a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, and the nodes are stored in an array; the array is then searched for the node that contains the longest string;

The node information includes the word count, the Chinese character count, and the link count.

Further, similar nodes are found by a tag clustering method according to the tag of the longest-string node stored in the array, thereby determining the start node and the end node in step iv.

Further, the tag clustering method includes searching forward, backward, and in both directions on the tag feature.

Further, the start node and end node selected in step iv are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output.

Further, the body text is located by taking the paragraph with the most continuous text as part of the body; a seed node is looked up in the DOM tree, and the region is extended forward and backward from the seed node to find the whole body text region.

Compared with the prior art, the present invention has the following advantages:

(1) The method finds a seed paragraph based on the longest string and extracts the news content with a tag clustering algorithm. This addresses the shortcomings of template-based and block-based approaches in news content extraction and avoids the drawbacks of hand-written rules and templates.

(2) The method does not need to create a DOM tree in the early stage of looking for the "longest string"; it searches for the longest string directly in the web page text, line by line. A line need not end at a natural line break, since processing of the current line can be forcibly terminated. This improves efficiency while keeping accuracy high.

(3) The method analyzes a single web page and needs no template, which saves a great deal of manual labor; the seed-string search algorithm is simple and the analysis is efficient; the method is also highly flexible, which makes abnormal cases easier to handle.

(4) The method uses template-free tag clustering on a single web page to extract news content, so its results are more precise; it provides high-quality data for subsequent fingerprint calculation, content clustering, and news event clustering.

(5) The method finds the body text region relatively simply and quickly, and because most of its operations are not performed on the DOM tree, it is flexible: filtering rules and head and tail location rules are easy to add. The method applies not only to Chinese but also to Western languages.

Brief description of the drawings

Fig. 1 is a flow chart of the web page body text extraction method.

Specific embodiments

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing.

A web page contains information such as the article title, source, publication time, body text, and author, and may also include a large amount of advertising and junk information. In news web pages, the "longest string" mostly occurs in the body text. The method uses this feature to find one paragraph in the body text region and obtain its tag feature, and then uses that tag feature to search forward, backward, and in both directions for nodes with similar tags. This process is called "tag clustering".

A method for extracting web page body text locates landmark nodes by finding the "longest string" and thereby extracts the body content of a news web page. The method comprises the following steps: i, deleting the ignorable tags in the web page together with their content; ii, finding the longest string in the web page; iii, creating a DOM tree and locating the node corresponding to the longest string in the DOM tree; iv, determining a start node and an end node according to the tag of the node corresponding to the longest string; v, checking and filtering the start node and the end node; vi, outputting the text in the filtered start node and end node.

As shown in Fig. 1, which is a flow chart of the web page body text extraction method, the method specifically comprises the following steps:

Step 1: delete the ignorable tags in the web page together with their content.

The source file of the web page is collected, for example by a crawling system.

The source file of the HTML web page is preprocessed. Because the data in web pages varies widely, the HTML code in the source file needs to be normalized in a uniform way; that is, preprocessing comprises the following steps:

First, determine whether the tags in the source file are paired; if any tag is unpaired, correct it so that every opening tag has a matching closing tag.

Second, determine whether the web page contains ignorable tags, and delete the ignorable tags together with their content.

Ignorable tags are tags whose content is unrelated to the body text, such as "comment", "script", and "meta".
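
The deletion of ignorable tags can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function name `strip_ignorable` and the regular expressions are assumptions, and the tag-pairing repair described above is not covered.

```python
import re

# Hypothetical patterns for the three ignorable tags named in the text.
_IGNORABLE = [
    re.compile(r"<!--.*?-->", re.DOTALL),  # HTML comments
    re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<meta\b[^>]*>", re.IGNORECASE),  # meta has no closing tag
]


def strip_ignorable(html: str) -> str:
    """Delete ignorable tags together with their content."""
    for pattern in _IGNORABLE:
        html = pattern.sub("", html)
    return html
```

Regular expressions are used here only because this step runs before any DOM tree exists; a production system would likely need a more tolerant scanner for malformed markup.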

Step 2: find the longest string in the web page.

1) Scan the web page text line by line, finding and recording the lengths of the continuous strings in the web page.

A continuous string does not include tags. When a tag is encountered, the current length is recorded if it exceeds the longest string length of the current line (which is initialized to 0 at the start of the line), and the length counter is reset to 0 so that counting starts for a new string. The length is adjusted according to the enclosing tag: for example, it is increased when the string lies inside a paragraph tag <p></p> (the length is adjusted by multiplying by a proportionality coefficient), and similarly reduced when it lies inside <strong></strong>.

2) Obtain the longest string from the recorded continuous string lengths of all lines.

A continuous string is a run of consecutive Chinese characters (2 or more) or consecutive words (2 or more Western-language words separated by spaces).
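
The line-by-line scan can be sketched as follows, assuming the page has already been preprocessed. The coefficient values in `WEIGHTS`, the character-based length measure, and all names are assumptions made for illustration; the text gives no concrete values. For brevity the sketch keeps one running maximum instead of recording a per-line maximum first.

```python
import re

TAG_RE = re.compile(r"(<[^>]+>)")                    # splits a line into tags and text
NAME_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")
# A run: 2+ consecutive Chinese characters, or 2+ Latin words separated by spaces.
RUN_RE = re.compile(r"[\u4e00-\u9fff]{2,}|[A-Za-z]+(?: [A-Za-z]+)+")
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags that never close
WEIGHTS = {"p": 1.5, "strong": 0.5}                  # hypothetical coefficients


def longest_string(html: str) -> tuple[str, float]:
    """Return (run, weighted_length) of the best run, scanning line by line."""
    best, best_score = "", 0.0
    open_tags: list[str] = []                        # carried across lines
    for line in html.splitlines():
        for piece in TAG_RE.split(line):
            if piece.startswith("<"):
                m = NAME_RE.match(piece)
                if not m:
                    continue                         # not a recognizable tag
                closing, name = m.group(1), m.group(2).lower()
                if closing:
                    if open_tags and open_tags[-1] == name:
                        open_tags.pop()
                elif name not in VOID and not piece.endswith("/>"):
                    open_tags.append(name)
                continue
            # Text piece: a tag boundary always ends the current run.
            weight = WEIGHTS.get(open_tags[-1], 1.0) if open_tags else 1.0
            for run in RUN_RE.finditer(piece):
                score = len(run.group()) * weight
                if score > best_score:
                    best, best_score = run.group(), score
    return best, best_score
```

With these assumed weights, a short run inside <p> can outrank a longer run inside <strong>, which is the effect the length adjustment is meant to achieve.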

Step 3: create a DOM tree and locate the node corresponding to the longest string in the DOM tree.

1) Create a DOM tree from the obtained web page source file, compute the information of each node (including word count, Chinese character count, link count, and so on), and store the nodes in an array.

2) Using the longest string obtained in step 2, find the corresponding node and its position in the array; a simple substring search is enough to find the node.

For example, given "<td><div>my test, language</div></td>", the search checks whether the node contains the longest string obtained. If the longest string is "my test", the node is found; if it is "catching a duck", it is not. The node found is the matching node, that is, the seed node.
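
The node array and the substring lookup can be sketched with Python's standard `html.parser` module. The `Node` fields, the class `NodeCollector`, and the helpers `build_nodes` and `find_seed` are hypothetical names chosen for this illustration, and the element tree is flattened into an array directly rather than built as a full DOM.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


@dataclass
class Node:
    path: str        # tag feature string, e.g. "html:body:div:p"
    attrs: dict      # attributes such as class and id
    text: str = ""   # text directly inside this node
    links: int = 0   # number of <a> elements nested inside this node

    @property
    def words(self) -> int:
        return len(re.findall(r"[A-Za-z]+", self.text))

    @property
    def chinese_chars(self) -> int:
        return len(re.findall(r"[\u4e00-\u9fff]", self.text))


class NodeCollector(HTMLParser):
    """Flattens the element tree into an array of nodes in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # every element seen so far
        self._open = []   # (tag, node) pairs for the currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "a":
            for _, node in self._open:
                node.links += 1
        path = ":".join([t for t, _ in self._open] + [tag])
        node = Node(path, dict(attrs))
        self.nodes.append(node)
        self._open.append((tag, node))

    def handle_endtag(self, tag):
        # Tolerate unbalanced markup: pop back to the matching open tag, if any.
        if any(t == tag for t, _ in self._open):
            while self._open.pop()[0] != tag:
                pass

    def handle_data(self, data):
        if self._open:
            self._open[-1][1].text += data


def build_nodes(html: str) -> list:
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def find_seed(nodes: list, longest: str) -> int:
    """Return the array index of the first node whose text contains `longest`, or -1."""
    for i, node in enumerate(nodes):
        if longest in node.text:
            return i
    return -1
```

Because a continuous string never spans a tag, matching against the text held directly by each node is enough to locate the seed node in this sketch.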

Step 4: find the start node and end node according to the array position of the node corresponding to the longest string and its tag feature string (for example, html:body:div:p).

Because the nodes of a web page's body text share a common parent or grandparent node, similar nodes can be found from the tag of the longest-string node stored in the array, which determines the start node and end node.
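
One possible reading of tag clustering is sketched below: starting at the seed position, the search walks backward and forward through the node array and keeps extending the region while it meets nodes with the same tag feature string. The exact similarity test is not specified in the text, so the equality comparison, the `max_gap` tolerance, and the function name `cluster_by_tag` are all assumptions.

```python
def cluster_by_tag(paths: list[str], seed: int, max_gap: int = 2) -> tuple[int, int]:
    """Return inclusive (start, end) array indices of the region around `seed`.

    `paths` holds one tag feature string per node, in document order.
    `max_gap` (hypothetical) is how many consecutive non-matching nodes,
    such as inline or image nodes, may be skipped before the search stops.
    """
    feature = paths[seed]

    def extend(step: int) -> int:
        edge, i, gap = seed, seed + step, 0
        while 0 <= i < len(paths) and gap <= max_gap:
            if paths[i] == feature:
                edge, gap = i, 0      # similar node: move the boundary
            else:
                gap += 1              # dissimilar node: tolerate a short gap
            i += step
        return edge

    return extend(-1), extend(+1)
```

The backward pass yields the start node and the forward pass yields the end node; running both corresponds to the two-way search.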

Step 5: check and filter the nodes obtained (including the start node and end node).

The start node and end node selected in step 4 are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output; the content includes text, pictures, and so on.

Filtering is based on features of the nodes concerned: for example, a node whose class equals a particular value is deleted, a node whose id equals a particular value is deleted, and a node whose content matches a certain pattern and is followed by a URL is deleted. Filtering is mainly used to clean out advertising in the middle of the content.
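
The filtering rules can be sketched as a predicate over nodes. The block lists, the promotional-text pattern, and the dictionary-based node representation are all hypothetical; the text says only that particular class values, id values, and content patterns followed by a URL trigger deletion.

```python
import re

BLOCKED_CLASSES = {"ad", "advert", "share"}   # hypothetical class values
BLOCKED_IDS = {"related", "comments"}         # hypothetical id values
# Hypothetical content rule: a promotional phrase followed by a trailing URL.
PROMO_RE = re.compile(r"(?:click here|more)\W*https?://\S+\s*$", re.IGNORECASE)


def keep(node: dict) -> bool:
    """Return False for nodes that the filtering rules delete."""
    if set(node.get("class", "").split()) & BLOCKED_CLASSES:
        return False
    if node.get("id") in BLOCKED_IDS:
        return False
    if PROMO_RE.search(node.get("text", "")):
        return False
    return True


def filter_nodes(nodes: list[dict]) -> list[dict]:
    """Keep only the nodes between start and end that survive every rule."""
    return [node for node in nodes if keep(node)]
```

Keeping the rules in plain data structures reflects the flexibility claimed for the method: a new rule is one more entry in a set or one more pattern.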

Step 6: output the text in the filtered start node and end node.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present application and do not limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, they may still make various changes, modifications, or equivalent substitutions to the specific embodiments, and that such changes, modifications, or equivalent substitutions all fall within the scope of the pending claims.

Claims

1. A method for extracting web page body text, characterized in that the method comprises the following steps:

i, preprocessing the web page;

ii, finding the longest string in the web page;

v, checking and filtering the start node and the end node;

vi, outputting the text in the filtered start node and end node;

Step ii comprises the following steps: 1) finding the longest string in the web page by scanning the web page text line by line;

2) obtaining and recording the length of the longest string, and processing the longest string further: when the longest string lies inside a specific tag, the obtained length is increased or decreased;

In step iii, a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, and the nodes are stored in an array; the array is searched for the node that contains the longest string.

2. The method for extracting web page body text according to claim 1, characterized in that step i includes: determining whether the web page contains ignorable tags, namely "comment", "script", and "meta"; and obtaining and deleting the ignorable tags together with their content.

3. The method for extracting web page body text according to claim 1, characterized in that similar nodes are found by a tag clustering method according to the tag of the longest-string node stored in the array, thereby determining the start node and the end node in step iv.

4. The method for extracting web page body text according to claim 3, characterized in that the tag clustering method includes searching forward, backward, and in both directions on the tag feature.

5. The method for extracting web page body text according to claim 1, characterized in that the start node and end node selected in step iv are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output.

6. The method for extracting web page body text according to claim 1, characterized in that the body text is located by taking the paragraph with the most continuous text as part of the body; a seed node is looked up in the DOM tree, and the region is extended forward and backward from the seed node to find the whole body text region.