A method for extracting the main text of a web page
Technical field
The present invention relates to a method in the computer field, and in particular to a method for extracting the body content of news web pages by finding the "longest string" and using it to locate a landmark node.
Background technology
In the field of news (or information) search, body-text extraction is an indispensable step, and the quality of the extracted text determines the quality of the news search and the user experience.
Body-text extraction methods take many forms. At present they fall into two broad classes, depending on whether a template is used: template-based (or wrapper-based) extraction and template-free extraction.
Template-based extraction: a template is first defined, and code then parses and executes the template to obtain the data. By how the template is generated, this class divides further into manual-template extraction and automatic-template extraction. Manual-template extraction: for each target site to be extracted, a template is written by hand; the template may use regular-expression matching or simple string matching on the start and end of the content. Automatic-template extraction: a machine-learning algorithm first obtains a portion of the web pages from the target website for training and learns a template, and a program then uses the template to extract data.
Template-free extraction is mostly based on statistics and learning. The main algorithms at present are rule-based, block-based, vision-based, and so on. A representative example is Microsoft's vision-based page segmentation algorithm, which determines the main semantic blocks of a web page in three steps: page block extraction, separator extraction, and semantic block reconstruction.
The drawback of hand-written templates is that writing them consumes enormous human resources, and as the target website changes, the cost of maintaining the templates is also high. The drawback of automatic templates is that the algorithm is complex, and the target website must also be monitored periodically to keep the templates up to date. Whether templates are produced manually or automatically, the underlying assumption is that the website's data is generated from templates. For some large websites this is largely unproblematic, although different entry points may use different templates. For the many small and medium-sized websites, however, templating is poor, so template-based extraction recovers only most of the information and is more likely to include junk information. The vision-based page segmentation algorithm has complex rules and low performance, which makes it unsuitable for news search engines.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides an effective method for extracting internet news content. Addressing the shortcomings of template-based and block-based extraction when applied to news content, the method finds a seed paragraph based on the "longest string" and extracts the news content using a tag-clustering algorithm, avoiding the drawbacks of manual rules and templates.
The solution adopted to achieve the above object is:
A method for extracting the main text of a web page, characterized in that the method comprises the following steps:
i. preprocessing the web page;
ii. finding the longest string in the web page;
iii. creating a DOM tree and searching the DOM for the node corresponding to the longest string;
iv. determining a start node and an end node according to the tag of the node corresponding to the longest string;
v. checking and filtering the start node and the end node;
vi. outputting the text within the filtered start node and end node.
Further, step i comprises: determining whether the web page contains ignorable tags such as comments, "script", and "meta"; and obtaining and deleting the ignorable tags together with their content.
Further, step ii comprises the following steps:
1) searching the web page text line by line for the longest string in the web page;
2) obtaining and recording the length of the longest string, and processing the obtained longest string further: when the longest string lies inside particular tags, the obtained length is increased or decreased.
Further, in step iii, a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, and the nodes are stored in an array; the array is then searched for the node that contains the longest string. The node information includes the word count, the Chinese character count, and the link count.
Further, similar nodes are found by tag clustering according to the tag of the longest-string node stored in the array, thereby determining the start node and the end node of step iv.
Further, the tag clustering comprises searching forward, backward, and in both directions for the tag feature.
Further, the start node and end node selected in step iv are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output.
Further, the method takes the paragraph with the most continuous text to be part of the main text, searches the DOM tree for the seed node, and extends forward and backward from the seed node to find the whole main-text region.
Compared with the prior art, the present invention has the following advantages:
(1) The method finds a seed paragraph based on the longest string and extracts news content using a tag-clustering algorithm. It addresses the shortcomings of template-based and block-based extraction for news content and avoids the drawbacks of manual rules and templates.
(2) The method does not need to create a DOM tree in the early stage of looking for the "longest string"; it searches the web page text directly, line by line. It does not depend on natural line breaks, since processing of the current line can be forcibly terminated. This improves efficiency while keeping accuracy high.
(3) The method is based on analysis of a single web page and needs no template, saving a great deal of manual labor. The seed-string finding algorithm is simple and analysis is efficient. The method is also highly flexible and handles abnormal cases conveniently.
(4) The method extracts news web page content by template-free tag clustering on a single web page, which gives more precise results and provides high-quality data for subsequent fingerprint computation, content clustering, and news event clustering.
(5) The method finds the main-text region relatively simply and quickly, and because most of its work is not done on the DOM tree, it is flexible: filtering rules and head and tail locating rules are easy to add. The method is applicable not only to Chinese but also to Western languages.
Brief description of the drawings
Fig. 1 is a flow chart of the web page main-text extraction method.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing.
A web page contains information such as the article title, source, publication time, body text, and author, and may also include a large amount of advertising, junk information, and the like. In news web pages, the "longest string" mostly occurs in the body text. Using this feature, the method finds one paragraph in the main-text region and obtains its tag feature, then uses that tag feature to search forward, backward, or in both directions for nodes with similar tags. This process is called "tag clustering".
A web page main-text extraction method locates a landmark node by finding the "longest string" and thereby extracts the body content of a news web page. The method comprises the following steps: i. deleting the ignorable tags in the web page together with their content; ii. finding the longest string in the web page; iii. creating a DOM tree and searching the DOM for the node corresponding to the longest string; iv. determining a start node and an end node according to the tag of the node corresponding to the longest string; v. checking and filtering the start node and the end node; vi. outputting the text within the filtered start node and end node.
As shown in Fig. 1, the method specifically comprises the following steps:
Step 1: delete the ignorable tags in the web page together with their content.
The source file of the web page is collected, for example by a crawling system.
The source file of the HTML web page is then preprocessed. Because the data in web pages is varied, the HTML code in the source file must be normalized into a uniform form. The preprocessing comprises the following steps:
First, determine whether the tags in the source file are paired; if any tag is unpaired, repair it so that every tag has matching opening and closing tags.
Second, determine whether the web page contains ignorable tags, and delete the ignorable tags together with their content.
Ignorable tags are tags whose content is unrelated to the body text, such as comments, "script", and "meta".
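The deletion of ignorable tags described above can be sketched in Python as follows. This is only a minimal illustration: the function name strip_ignorable and the regular expressions are assumptions, not part of the described method, and a real page may need a more tolerant parser.

```python
import re

# Patterns for the ignorable tags named in the text (comments, script, meta).
# The regular expressions themselves are assumptions made for this sketch.
IGNORABLE_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),
    re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<meta\b[^>]*>", re.IGNORECASE),
]


def strip_ignorable(html: str) -> str:
    """Remove ignorable tags together with their content."""
    for pattern in IGNORABLE_PATTERNS:
        html = pattern.sub("", html)
    return html


page = '<meta charset="utf-8"><p>Hi</p><!-- ad --><script>var a=1;</script>'
cleaned = strip_ignorable(page)
```

Further tags could be added to the list in the same way if they prove unrelated to the body text on a given site.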
Step 2: find the longest string in the web page.
1) Search the web page text line by line and record the lengths of the continuous strings in the web page.
A continuous string does not include tags. When a tag is encountered, the current length is recorded if it exceeds the longest string length of the current line (which is initialized to 0 at the start of each line), and the length counter is reset to 0 to begin counting a new string. The length is adjusted according to the enclosing tag: for example, it is increased (by multiplying by a proportionality coefficient) when the string lies inside a paragraph tag <p></p>, and similarly decreased when it lies inside <strong></strong>.
2) Obtain the longest string from the recorded per-line continuous string lengths.
A continuous string is a run of consecutive Chinese characters (2 or more) or consecutive words (2 or more Western-language words separated by spaces).
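The line-by-line search of step 2 might look like the following Python sketch. Several points are assumptions made for illustration only: the coefficients 1.5 and 0.5, the name longest_string, measuring length as a character count, and treating every tag-free run of text as one continuous string. The per-line maximum and the overall maximum are folded into a single comparison.

```python
import re

TAG_RE = re.compile(r"(<[^>]*>)")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
# Hypothetical coefficients: the text only says the length is scaled up
# inside <p> and scaled down inside <strong>.
TAG_WEIGHTS = {"p": 1.5, "strong": 0.5}


def longest_string(html):
    """Return (text, weighted_length) of the longest tag-free run of text."""
    best_text, best_len = "", 0.0
    open_tags = []  # stack of currently open tag names
    for line in html.splitlines():
        # Splitting on a capturing group puts tags at the odd indices.
        for i, piece in enumerate(TAG_RE.split(line)):
            if i % 2 == 1:
                match = TAG_NAME_RE.match(piece)
                if match:
                    closing, name = match.group(1), match.group(2).lower()
                    if closing:
                        if name in open_tags:
                            # Pop up to and including the matching open tag.
                            while open_tags.pop() != name:
                                pass
                    elif name not in VOID_TAGS and not piece.endswith("/>"):
                        open_tags.append(name)
                continue
            # A run of text between two tags: weight it by the enclosing tag.
            text = piece.strip()
            weight = TAG_WEIGHTS.get(open_tags[-1], 1.0) if open_tags else 1.0
            length = len(text) * weight
            if length > best_len:
                best_text, best_len = text, length
    return best_text, best_len


sample = (
    "<div><strong>Breaking news headline here</strong></div>\n"
    "<div><p>Short body text.</p></div>\n"
    "<a>Home</a>"
)
seed_text, seed_len = longest_string(sample)
```

In this sample the headline is longer in raw characters, but the paragraph wins once the tag coefficients are applied, which is the effect the length adjustment is meant to have.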
Step 3: create a DOM tree and search the DOM for the node corresponding to the longest string.
1) Create a DOM tree from the obtained web page source file, compute the information for each node (including word count, Chinese character count, link count, and so on), and store the nodes in an array.
2) Using the longest string obtained in step 2, search out the corresponding node and its position in the array; a simple substring search is used to find the node.
For example, given "<td><div>my test, language</div></td>", the node is checked for whether it contains the obtained longest string: if the longest string is "my test" it is found, whereas if it is "catching a duck" it is not. The node found is the matching node, that is, the seed node.
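A possible sketch of step 3 in Python, using the standard html.parser module in place of a full DOM library, is given below. The names Node, NodeCollector, and find_seed are hypothetical, and storing only the direct text of each node (so that the innermost matching node becomes the seed) is an assumption of this sketch.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Node:
    path: str        # tag feature string such as "html:body:div:p"
    text: str = ""   # direct text of the node (descendant text excluded)
    links: int = 0   # number of <a> elements inside the node

    @property
    def words(self):
        return len(re.findall(r"[A-Za-z0-9]+", self.text))

    @property
    def han(self):
        return len(re.findall(r"[\u4e00-\u9fff]", self.text))


class NodeCollector(HTMLParser):
    """Collect all element nodes into a flat array in document order."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.nodes = []  # the node array
        self.stack = []  # indices of the currently open nodes

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for i in self.stack:
                self.nodes[i].links += 1
        if tag in self.VOID:
            return
        parent = self.nodes[self.stack[-1]].path if self.stack else ""
        self.nodes.append(Node(f"{parent}:{tag}" if parent else tag))
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open node.
        if self.stack and self.nodes[self.stack[-1]].path.rsplit(":", 1)[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            self.nodes[self.stack[-1]].text += data


def find_seed(nodes, longest):
    """Return the array position of the first node containing the string."""
    for index, node in enumerate(nodes):
        if longest in node.text:
            return index
    return None


collector = NodeCollector()
collector.feed("<html><body><td><div>my test, language</div></td></body></html>")
seed = find_seed(collector.nodes, "my test")
```

With the example from the text, the search for "my test" lands on the div node, while a search for "catching a duck" returns None.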
Step 4: find the start node and the end node according to the array position of the node corresponding to the longest string and its tag feature string (for example, html:body:div:p).
Because the nodes of the main text share a common parent or grandparent node, similar nodes can be found from the tag of the longest-string node stored in the array, which determines the start node and the end node.
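The bidirectional search around the seed node can be illustrated with the short Python sketch below. The similarity rule used here (a node is similar when its tag feature string equals the seed's, or descends from it) is an assumption chosen for simplicity, and the name cluster is hypothetical; the text does not fix the exact rule.

```python
def cluster(paths, seed):
    """Return (start, end) array positions of the similar nodes around the seed.

    `paths` holds the tag feature string of every node in document order.
    """
    feature = paths[seed]

    def similar(path):
        # Same tag feature as the seed, or a descendant of such a node.
        return path == feature or path.startswith(feature + ":")

    start = seed
    while start > 0 and similar(paths[start - 1]):  # search backward
        start -= 1
    end = seed
    while end + 1 < len(paths) and similar(paths[end + 1]):  # search forward
        end += 1
    return start, end


paths = [
    "html",
    "html:body",
    "html:body:div",
    "html:body:div:h1",
    "html:body:div:p",
    "html:body:div:p",
    "html:body:div:p:a",
    "html:body:div:p",
    "html:body:div:span",
]
start, end = cluster(paths, 5)
```

Here the seed at position 5 grows into the range from position 4 to position 7, which covers the three sibling paragraphs and the link nested in one of them.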
Step 5: check and filter the obtained nodes (including the start node and the end node).
The start node and end node selected in step 4 are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output; the content includes text, pictures, and so on.
Filtering is based on features of the nodes concerned: for example, nodes whose class equals a particular value are deleted, nodes whose id equals a particular value are deleted, and nodes whose content matches a certain feature and is followed by a URL are deleted. Filtering is mainly used to clean advertising-type information out of the content.
Step 6: output the text within the filtered start node and end node.
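A filtering pass of this kind could be sketched in Python as follows. The blocked class and id values, the promotional pattern, and the dictionary representation of a node are all hypothetical; the text states only that such features are used, not which values.

```python
import re

# Hypothetical rule values chosen for illustration only.
BLOCKED_CLASSES = {"ad", "share"}
BLOCKED_IDS = {"related"}
PROMO_RE = re.compile(r"(more at|visit)\s*:?\s*https?://\S+", re.IGNORECASE)


def keep(node):
    """Decide whether a node survives the check."""
    if node.get("class") in BLOCKED_CLASSES:
        return False
    if node.get("id") in BLOCKED_IDS:
        return False
    # Content matching a known phrase and followed by a URL is treated as an ad.
    if PROMO_RE.search(node.get("text", "")):
        return False
    return True


def filter_nodes(nodes):
    return [node for node in nodes if keep(node)]


candidates = [
    {"text": "First paragraph."},
    {"class": "ad", "text": "Buy now"},
    {"id": "related", "text": "Related stories"},
    {"text": "More at: http://example.com/x"},
    {"text": "Second paragraph."},
]
remaining = filter_nodes(candidates)
```

Because each rule is an independent check, new rules can be added without touching the earlier steps.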
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the application and not to limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, they may still make various changes, modifications, or equivalent substitutions to its specific embodiments, and that all such changes, modifications, or equivalent substitutions fall within the scope of the pending claims.