
CN103530429B - Webpage content extracting method - Google Patents

Webpage content extracting method

Info

Publication number
CN103530429B
Authority
CN
China
Prior art keywords
node
label
text
web page
longest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310538575.4A
Other languages
Chinese (zh)
Other versions
CN103530429A (en)
Inventor
涂波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Cloud Business Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN201310538575.4A
Publication of CN103530429A
Application granted
Publication of CN103530429B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for extracting web page content. The method comprises the following steps: I, preprocessing a web page; II, searching for the longest string in the web page; III, building a DOM tree and searching it for the node corresponding to the longest string; IV, determining a start node and an end node according to the tag of the node corresponding to the longest string; V, checking and filtering the start node and the end node; and VI, outputting the text in the filtered start node and end node. The method overcomes the shortcomings of template and block-partitioning techniques in news content extraction, finds a seed paragraph based on the longest string, and improves the efficiency and accuracy of web page content extraction.

Description

A method for extracting the main text of a web page
Technical field
The present invention relates to a method in the computer field, and in particular to a method that extracts the body content of news web pages by finding the "longest string" and using it to locate a landmark node.
Background
In the field of news (or information) search, body-text extraction is an indispensable step, and its quality directly determines the quality of news search and the user experience.
Body-text extraction methods take many forms. They currently fall into two broad classes according to whether a template is used: template-based (or wrapper-based) extraction and template-free extraction.
Template-based extraction: a template is defined first, and code then parses and executes the template to obtain the data. Depending on how the template is generated, this class divides further into manual-template extraction and automatic-template extraction. Manual-template extraction: for each target site, a template is written by hand; the template may use regular-expression matching or simple string matching such as first-position (prefix) matching. Automatic-template extraction: a machine learning algorithm first obtains part of the page data from the target website and trains on it to obtain a template, and a program then uses the template to extract data.
Template-free extraction is mostly implemented with statistical and learning-based approaches. The main algorithms at present are rule-based, block-based, vision-based, and so on. A representative example is Microsoft's vision-based page segmentation algorithm, which determines the main semantic blocks of a web page in three steps: page block extraction, separator extraction, and semantic block reconstruction.
The drawback of hand-written templates is that writing them consumes enormous human resources, and as the target website changes, the cost of maintaining the templates is also high. The drawback of automatic templates is that the algorithm is complex and the target website must also be monitored periodically so that the templates can be kept up to date. Whether templates are produced manually or automatically, the approach assumes that the site's data is generated from templates. For some large websites this poses little problem, apart from different entry points possibly using different templates. For the many small and medium-sized websites, however, templating is not done well, so template-based extraction recovers only most of the information and is more likely to include junk information. The vision-based page segmentation algorithm has complex rules and low performance, which makes it unsuitable for news search engines.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides an effective method for extracting internet news content. To address the shortcomings of template and block-partitioning techniques in news content extraction, it designs an algorithm that finds a seed paragraph based on the "longest string" and extracts the news content using tag clustering, avoiding the drawbacks of manual rules and templates.
The solution adopted to achieve the above object is:
A method for extracting the main text of a web page, the improvement being that the method comprises the following steps:
I. preprocessing the web page;
II. finding the longest string in the web page;
III. creating a DOM tree and searching the DOM tree for the node corresponding to the longest string;
IV. determining a start node and an end node according to the tag of the node corresponding to the longest string;
V. checking and filtering the start node and the end node;
VI. outputting the text in the filtered start node and end node.
Further, step I comprises: determining whether the web page contains ignorable tags, namely comments, "script", and "meta"; and obtaining the ignorable tags together with their content and deleting them.
Further, step II comprises the following steps: 1) searching the web page text line by line for the longest string in the web page;
2) obtaining and recording the length of the longest string and processing it further: when the longest string lies inside particular tags, the obtained length is increased or decreased.
Further, in step III, a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, and the nodes are stored in an array; the array is then searched for the node that contains the longest string;
The information of each node includes the word count, the Chinese character count, and the link count.
Further, the tag-clustering method is used to find similar nodes according to the tag of the longest-string node stored in the array, thereby determining the start node and end node of step IV.
Further, the tag-clustering method includes searching forward, backward, and in both directions for the tag features.
Further, the start node and end node selected in step IV are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output.
Further, the method locates the main text on the basis that the paragraph with the most continuous text is part of the main text: the seed node is looked up in the DOM tree, and the search extends forward and backward from the seed node to find the whole main-text region.
Compared with the prior art, the present invention has the following advantages:
(1) The method finds a seed paragraph based on the longest string and extracts the news content with a tag-clustering algorithm, which addresses the shortcomings of template and block-partitioning techniques in news content extraction and avoids the drawbacks of manual rules and templates.
(2) The method does not need to build a DOM tree in the early stage of finding the "longest string". It searches for the longest string directly in the page text, line by line; a line need not end at a natural line break, since processing of the current line can be forcibly terminated. This improves efficiency, and the accuracy is high.
(3) The method analyses a single web page and requires no template, which saves a great deal of manual work; the seed-string search algorithm is simple and the analysis is efficient; the method is also highly flexible, which makes abnormal cases easier to handle.
(4) The method extracts news page content from a single page by template-free tag clustering, so its results are more precise, and it supplies high-quality data for subsequent fingerprint computation, content clustering, and news-event clustering.
(5) The method finds the main-text region relatively simply and quickly. Because most of the time it does not operate on the DOM tree, it is flexible and makes it easy to add filtering rules and head and tail positioning rules. The method is applicable not only to Chinese but also to Western languages.
Brief description of the drawings
Fig. 1 is a flow chart of the web page main-text extraction method.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing.
A web page contains information such as the article title, source, publication time, body text, and author, and may also include a large amount of advertising and junk information. In news web pages, the "longest string" mostly occurs in the body text. This characteristic is used to find one paragraph in the body-text region and obtain its corresponding tag features; those tag features are then used in turn to search forward, backward, and in both directions for nodes with similar tags. This process is called "tag clustering".
A web page main-text extraction method locates a landmark node by finding the "longest string" and thereby extracts the body content of a news web page. The method comprises the following steps: I. deleting the ignorable tags in the web page together with the content inside them; II. finding the longest string in the web page; III. creating a DOM tree and searching the DOM tree for the node corresponding to the longest string; IV. determining a start node and an end node according to the tag of the node corresponding to the longest string; V. checking and filtering the start node and the end node; VI. outputting the text in the filtered start node and end node.
As shown in Fig. 1, which is a flow chart of the web page main-text extraction method, the method specifically comprises the following steps:
Step 1: delete the ignorable tags in the web page together with the content inside them.
Obtain the source file of the web page, for example by collecting it with a crawling system;
Preprocess the source file of the HTML page. Because the data in web pages is varied, the HTML code in the source file must be normalized into a uniform form; that is, preprocessing comprises the following steps:
First, determine whether the tags in the source file are paired; if any are unpaired, correct them so that the opening and closing of every tag match;
Second, determine whether the web page contains ignorable tags, obtain the ignorable tags and their content, and delete them.
Ignorable tags: tags whose content is unrelated to the body content, such as comments, "script", "meta", and so on.
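The deletion of ignorable content in step 1 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes regular expressions are adequate for the page at hand, the function name strip_ignorable is hypothetical, and the tag-pairing repair described above is not covered.

```python
import re

# Patterns for the content the text calls "ignorable": HTML comments,
# <script> elements together with their bodies, and <meta> tags.
# Using regular expressions here is an assumption of this sketch.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE)
_META = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def strip_ignorable(html: str) -> str:
    """Return html with comments, script elements and meta tags removed."""
    for pattern in (_COMMENT, _SCRIPT, _META):
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    page = '<head><meta charset="utf-8"></head><body><!-- ad --><p>Hello</p></body>'
    print(strip_ignorable(page))
```

Removing the script bodies before any further scanning matters because script source often contains long runs of text that would otherwise compete with the article body.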
Step 2: find the longest string in the web page.
1) Working line by line through the web page text, find and record the lengths of the continuous strings in the web page.
A continuous string does not include tags. When a tag is encountered, the current length is recorded if it exceeds the longest string length of the current line (which is initialized to 0 at the start of each line), and the length counter is reset to 0 to start counting a new string. The length is adjusted according to the enclosing tag: for example, it is increased when the string lies inside a paragraph tag <p></p> (the length is adjusted by multiplying it by a proportionality coefficient), and similarly decreased when it lies inside <strong></strong>.
2) From the recorded continuous-string lengths of each line, obtain the longest string;
A continuous string is a run of consecutive Chinese characters (2 or more) or of consecutive words (2 or more Western-language words separated by spaces).
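A hedged Python sketch of step 2 follows. The weights 1.5 for <p> and 0.5 for <strong> are assumed values, since the text only says the length is multiplied by a proportionality coefficient; the names longest_string, WEIGHTS and VOID are hypothetical. The sketch also assumes that comments and scripts were already removed, that no tag spans a line break, and that a Western-language word consists of ASCII letters only.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
# A continuous string: 2+ consecutive Chinese characters, or 2+ words
# separated by single spaces. Punctuation and tags end a run.
RUN = re.compile(r"[\u4e00-\u9fff]{2,}|[A-Za-z]+(?: [A-Za-z]+)+")
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag
WEIGHTS = {"p": 1.5, "strong": 0.5}  # assumed coefficients, not from the text


def longest_string(html: str) -> tuple[str, float]:
    """Return the best continuous string and its weighted length."""
    best_text, best_score = "", 0.0
    open_tags: list[str] = []  # currently open tags, innermost last
    for line in html.splitlines():
        # Splitting on a capturing group puts the tags at odd indices.
        for i, piece in enumerate(re.split(r"(<[^>]*>)", line)):
            if i % 2 == 1:
                tag = TAG.fullmatch(piece)
                if tag is None:
                    continue  # doctype or similar: not a text run
                closing, name = tag.group(1), tag.group(2).lower()
                if closing:
                    if name in open_tags:
                        # Pop back to the matching opener.
                        while open_tags.pop() != name:
                            pass
                elif name not in VOID:
                    open_tags.append(name)
                continue
            weight = WEIGHTS.get(open_tags[-1], 1.0) if open_tags else 1.0
            for run in RUN.finditer(piece):
                score = len(run.group()) * weight
                if score > best_score:
                    best_text, best_score = run.group(), score
    return best_text, best_score
```

Keeping one running best across all lines is equivalent to recording a per-line maximum and then taking the largest, and it avoids storing the per-line results.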
Step 3: create a DOM tree and search it for the node corresponding to the longest string.
1) Create a DOM tree from the obtained page source file, compute the information of each node (including word count, Chinese character count, link count, etc.), and store the nodes in an array.
2) Using the longest string obtained in step 2, find the corresponding node and its position in the array; a simple substring search is enough to find the node.
For example, given "<td><div>my test, language</div></td>", search whether it contains the longest string obtained. If the longest string is "my test", it is found; if it is "catching a duck", it is not. The node that is found is the matching node, i.e. the seed node.
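Step 3 can be sketched with Python's standard html.parser module. This is an illustration under assumptions rather than the patent's own data structure: the Node fields, the NodeCollector class, and find_seed are hypothetical names, each node stores only its own direct text, and the link count is taken to be the number of <a> descendants.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


@dataclass
class Node:
    tag: str
    path: str                      # tag feature string, e.g. "html:body:div:p"
    attrs: dict = field(default_factory=dict)
    text: str = ""                 # text directly inside this node
    links: int = 0                 # number of <a> descendants

    @property
    def words(self) -> int:
        return len(re.findall(r"[A-Za-z]+", self.text))

    @property
    def hanzi(self) -> int:
        return len(re.findall(r"[\u4e00-\u9fff]", self.text))


class NodeCollector(HTMLParser):
    """Flattens the element tree into an array in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.nodes: list[Node] = []
        self._open: list[Node] = []  # stack of currently open nodes

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "a":
            for ancestor in self._open:
                ancestor.links += 1
        prefix = self._open[-1].path + ":" if self._open else ""
        node = Node(tag=tag, path=prefix + tag, attrs=dict(attrs))
        self.nodes.append(node)
        self._open.append(node)

    def handle_endtag(self, tag):
        if any(n.tag == tag for n in self._open):
            while self._open.pop().tag != tag:
                pass

    def handle_data(self, data):
        if self._open:
            self._open[-1].text += data


def find_seed(nodes: list[Node], longest: str) -> int:
    """Return the array index of the first node containing longest, or -1."""
    for i, node in enumerate(nodes):
        if longest in node.text:
            return i
    return -1
```

Because continuous strings never cross a tag boundary, matching against each node's own direct text is enough to land on the innermost node rather than on an ancestor such as <body>.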
Step 4: find the start node and end node according to the array position of the node corresponding to the longest string and its tag feature string (for example: html:body:div:p).
Because the nodes in the body text of a web page share a common parent or grandparent node, similar nodes can be found according to the tag of the longest-string node stored in the array, and the start node and end node can thus be determined.
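One possible reading of the forward and backward search in step 4 is sketched below in Python. The similarity rule is an assumption of this sketch: a node counts as similar when its tag feature string equals the seed's, or when it is a descendant of such a node. The text does not spell out the exact rule, and the function name cluster_by_tag is hypothetical.

```python
def cluster_by_tag(paths: list[str], seed: int) -> tuple[int, int]:
    """Expand from the seed index to the (start, end) indices of the run of
    nodes whose tag feature string matches the seed's.

    paths holds the tag feature string of every node in document order,
    for example "html:body:div:p".
    """
    target = paths[seed]

    def similar(path: str) -> bool:
        # Same feature string, or a descendant of a node that has it.
        return path == target or path.startswith(target + ":")

    start = seed
    while start > 0 and similar(paths[start - 1]):
        start -= 1  # backward search
    end = seed
    while end + 1 < len(paths) and similar(paths[end + 1]):
        end += 1    # forward search
    return start, end


if __name__ == "__main__":
    paths = [
        "html", "html:body", "html:body:div", "html:body:div:h1",
        "html:body:div:p", "html:body:div:p", "html:body:div:p:a",
        "html:body:div:p", "html:body:div:ul", "html:body:div:ul:li",
    ]
    print(cluster_by_tag(paths, seed=5))
```

A stricter or looser rule, such as tolerating a single dissimilar sibling between paragraphs, would change only the similar helper.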
Step 5: check and filter the obtained nodes (including the start node and end node).
The start node and end node selected in step 4 are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output; the content includes text, pictures, and so on.
Filtering is carried out according to certain features of the nodes concerned: for example, nodes whose class equals a particular value are deleted, nodes whose id equals a particular value are deleted, and nodes whose content matches a certain pattern followed by some URLs are deleted. This is mainly used to clean advertisement-type information out of the middle of the content.
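The filtering rules of step 5 might look like the following Python sketch. All concrete values are assumptions made for illustration: the blocked class and id values, the promotional text pattern, and the representation of a node as a dictionary with optional "class", "id", and "text" keys are not taken from the patent.

```python
import re

BLOCKED_CLASSES = {"ad", "related"}   # assumed example values
BLOCKED_IDS = {"share-bar"}           # assumed example values
# Assumed pattern: a short promotional lead-in followed by a URL.
PROMO = re.compile(r"(?:more at|source)\s*:?\s*https?://\S+", re.IGNORECASE)


def keep_node(node: dict) -> bool:
    """Return False for nodes that match one of the deletion rules."""
    if node.get("class") in BLOCKED_CLASSES:
        return False
    if node.get("id") in BLOCKED_IDS:
        return False
    if PROMO.search(node.get("text", "")):
        return False
    return True


def filter_nodes(nodes: list[dict]) -> list[dict]:
    """Keep only the nodes between start and end that survive the rules."""
    return [node for node in nodes if keep_node(node)]


if __name__ == "__main__":
    candidates = [
        {"class": "content", "text": "First paragraph."},
        {"class": "ad", "text": "Buy now"},
        {"text": "More at: http://example.com/x"},
    ]
    print([n["text"] for n in filter_nodes(candidates)])
```

Keeping the rules as plain data (two sets and one pattern) makes it easy to add site-specific entries without touching the filtering logic.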
Step 6: output the text in the filtered start node and end node.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the application and not to limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, they may still make various changes, modifications, or equivalent substitutions to its specific embodiments, and that all such changes, modifications, or equivalent substitutions fall within the scope of the pending claims.

Claims (6)

1. A method for extracting the main text of a web page, characterized in that the method comprises the following steps:
I. preprocessing the web page;
II. finding the longest string in the web page;
III. creating a DOM tree and searching the DOM tree for the node corresponding to the longest string;
IV. determining a start node and an end node according to the tag of the node corresponding to the longest string;
V. checking and filtering the start node and the end node;
VI. outputting the text in the filtered start node and end node;
step II comprises the following steps: 1) searching the web page text line by line for the longest string in the web page;
2) obtaining and recording the length of the longest string and processing it further: when the longest string lies inside particular tags, the obtained length is increased or decreased;
in step III, a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, and the nodes are stored in an array; the array is then searched for the node that contains the longest string;
the information of each node includes the word count, the Chinese character count, and the link count.
2. The method for extracting the main text of a web page as claimed in claim 1, characterized in that step I comprises: determining whether the web page contains ignorable tags, namely comments, "script", and "meta"; and obtaining the ignorable tags together with their content and deleting them.
3. The method for extracting the main text of a web page as claimed in claim 1, characterized in that a tag-clustering method is used to find similar nodes according to the tag of the longest-string node stored in the array, thereby determining the start node and end node of step IV.
4. The method for extracting the main text of a web page as claimed in claim 3, characterized in that the tag-clustering method includes searching forward, backward, and in both directions for the tag features.
5. The method for extracting the main text of a web page as claimed in claim 1, characterized in that the start node and end node selected in step IV are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output.
6. The method for extracting the main text of a web page as claimed in claim 1, characterized in that the main text is located on the basis that the paragraph with the most continuous text is part of the main text: the seed node is looked up in the DOM tree, and the search extends forward and backward from the seed node to find the whole main-text region.
CN201310538575.4A 2013-11-04 2013-11-04 Webpage content extracting method Expired - Fee Related CN103530429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538575.4A CN103530429B (en) 2013-11-04 2013-11-04 Webpage content extracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310538575.4A CN103530429B (en) 2013-11-04 2013-11-04 Webpage content extracting method

Publications (2)

Publication Number Publication Date
CN103530429A CN103530429A (en) 2014-01-22
CN103530429B true CN103530429B (en) 2017-01-18

Family

ID=49932438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310538575.4A Expired - Fee Related CN103530429B (en) 2013-11-04 2013-11-04 Webpage content extracting method

Country Status (1)

Country Link
CN (1) CN103530429B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942335B (en) * 2014-05-07 2017-04-26 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN104376061B (en) * 2014-11-10 2018-01-19 武汉传神信息技术有限公司 A kind of method for extracting Web page text
CN104573097B (en) * 2015-01-30 2018-07-24 湖南蚁坊软件有限公司 A method of extraction Web page text
CN106802899B (en) * 2015-11-26 2020-11-24 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN107203527B (en) * 2016-03-16 2019-06-28 北大方正集团有限公司 The text extracting method and system of news web page
CN107229668B (en) * 2017-03-07 2020-04-21 桂林电子科技大学 A text extraction method based on keyword matching
CN110390038B (en) * 2019-07-25 2021-10-15 中南民族大学 Page blocking method, device and equipment based on DOM tree and storage medium
CN110377796B (en) * 2019-07-25 2021-11-02 中南民族大学 Text extraction method, device and equipment based on DOM tree and storage medium
CN111046302A (en) * 2019-12-30 2020-04-21 珠海趣印科技有限公司 Method and device for extracting webpage content

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
CN102314520A (en) * 2011-10-24 2012-01-11 莫雅静 Webpage text extraction method and device based on statistical backtracking positioning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
CN102314520A (en) * 2011-10-24 2012-01-11 莫雅静 Webpage text extraction method and device based on statistical backtracking positioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
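Research on Methods for Extracting the Main Text of Web Pages (Web网页正文抽取方法研究); Wan Jing (万晶); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 20100815 (No. 8); pp. 27-28 *
Research and Application of Web Page Similarity Based on the DOM Tree (基于DOM树的网页相似度研究与应用); Zhang Ruixue (张瑞雪); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 20111015 (No. 10); see pp. 11-14, 17-19, 36-39 and Fig. 4.1 *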

Also Published As

Publication number Publication date
CN103530429A (en) 2014-01-22

Similar Documents

Publication Publication Date Title
CN103530429B (en) Webpage content extracting method
CN108154395B (en) Big data-based customer network behavior portrait method
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
CN109857956B (en) Automatic extraction of key information from news web pages based on label and block features
CN102541937B (en) Webpage information detection method and system
CN105843965B (en) A deep web crawler form filling method and device based on URL subject classification
CN104715064B (en) It is a kind of to realize the method and server that keyword is marked on webpage
CN112667940B (en) Webpage text extraction method based on deep learning
CN104035972B (en) A kind of knowledge recommendation method and system based on microblogging
CN108268539A (en) Video matching system based on text analyzing
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN104869009A (en) Website data statistics system and method
CN104462532B (en) The method and apparatus that Web page text is extracted
CN102270234A (en) Image search method and search engine
CN107423391A (en) The information extracting method of Web page structural data
CN103118007A (en) Method and system of acquiring user access behavior
CN104598536B (en) A kind of distributed network information structuring processing method
CN105574200A (en) User interest extraction method based on historical record
CN108491512A (en) The method of abstracting and device of headline
CN108399265A (en) Real-time hot news providing method based on search and device
CN108460150A (en) The processing method and processing device of headline
CN106528068A (en) Webpage content reconstruction method and system
CN108363700A (en) The method for evaluating quality and device of headline
CN108874870A (en) A kind of data pick-up method, equipment and computer can storage mediums
Berlingerio et al. Evolving networks: Eras and turning points

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170426

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Patentee after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY Co.,Ltd.

Address before: Room 0902, Shou Heng Technology Building, No. 51 Xueyuan Road, Haidian District, Beijing 100191

Patentee before: BEIJING ZHONGSOU NETWORK TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170118

Termination date: 20211104
