CN107346335B - A method for web page topic block recognition based on combined features - Google Patents
A method for web page topic block recognition based on combined features
- Publication number
- CN107346335B
- Authority
- CN
- China
- Prior art keywords
- block
- webpage
- blocks
- subject
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for identifying web page topic blocks based on combined features. After a web page is segmented into blocks, a support vector machine predicts from the visual features of each web page block whether it is a topic block. An improved BM25 algorithm then computes a relevance weight between the content of each web page block and the page topic, and this weight is compared with a searched optimal threshold to judge whether the block is a topic block. Finally, the two results are combined, so that both the visual features and the text features of a web page block are used to decide whether it is a topic block. Topic block identification therefore considers structure and content at the same time, avoids the deviation that a single type of feature may introduce, and identifies topic-related content in the web page more accurately.
Description
Technical Field
The invention belongs to the technical field of Web information extraction, and particularly relates to a method for identifying web page topic blocks based on combined features.
Background
The rapid development of the Internet has created the current "big data" era of information explosion, and research in every industry has become inseparable from big data. Web pages are an important medium for transmitting big data, and the information they contain covers all trades. However, Web information also contains a large amount of noise, such as advertisements and navigation bars, which hinders the automatic mining and collection of Web information. It is therefore crucial to locate the topic-related information in a page quickly and accurately and to identify it.
At present there are various methods for identifying the topic information of a web page. Analyzing the structure of the web page and analyzing its text content are both effective ways to judge whether a web page block is a topic block, but each has the defect that using only text features or only visual features to identify topic information may introduce a certain deviation.
Disclosure of Invention
In view of the above, the present invention provides a method for identifying web page topic blocks based on combined features. By combining the text features and the visual features of a web page block, the method judges the relevance between the block and the page topic from several aspects and identifies topic-related content in the web page more accurately.
A method for identifying web page topic blocks based on combined features comprises the following steps:
(1) collecting a large number of web pages on various topics, analyzing the web page structure, segmenting each web page according to its visual features so that different content is divided into different web page blocks, and dividing all resulting web page blocks into a training set and a test set;
(2) extracting the visual feature data and text feature data of the web page blocks, normalizing the visual feature data to the range 0 to 1, and then manually labeling the category of each web page block in the training set: 0 denotes a non-topic block and 1 denotes a topic block;
(3) training a classification model for web page blocks with a support vector machine on the visual feature data of the web page blocks of known category in the training set, and using the classification model to identify whether the web page blocks in the test set are topic blocks;
(4) analyzing the text feature data of the web page blocks, calculating the relevance weight between the content of each web page block and the web page topic, and judging from the relevance weight whether the web page blocks in the test set are topic blocks;
(5) combining the classification results of step (3) and step (4) and re-judging whether each web page block is a topic block.
Further, in step (1), the VIPS (Vision-based Page Segmentation) algorithm, a web page segmentation algorithm based on visual information, is used to segment the web page.
Further, before training with the support vector machine in step (3), the visual feature data of the web page blocks of known category in the training set are converted into the format required by the support vector machine, producing a data set that serves as the training input. Each row of the data set represents one web page block: the first column of the row is the category of the block, and the remaining columns hold a sequence number and the visual feature values of the block.
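The normalization of step (2) and the row format described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the "format required by the support vector machine" is a LIBSVM-style text line (`label index:value ...`), which the source does not name, and it reads "sequence number" as a 1-based feature index. The feature names and values (width, height, font size) are hypothetical.

```python
def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def to_svm_line(label, features):
    """Format one block as 'label index:value ...' with 1-based indices."""
    pairs = " ".join(f"{i}:{v:.4f}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"


# Hypothetical visual features per block: width, height, font size.
raw = [[960.0, 120.0, 12.0], [640.0, 800.0, 16.0], [320.0, 460.0, 14.0]]
labels = [0, 1, 0]  # 0 = non-topic block, 1 = topic block

normalized = min_max_normalize(raw)
lines = [to_svm_line(y, x) for y, x in zip(labels, normalized)]
for line in lines:
    print(line)
```

In practice the minimum and maximum of each feature would be taken from the training set only and reused for the test set, so that both sets are scaled consistently.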
Further, the relevance weight between the content of a web page block and the web page topic in step (4) is calculated as follows: first, the keywords of the current web page are extracted according to the web page block title; then the relevance weight Score(Q, B) between the content of any web page block B in the current web page and the web page topic is calculated by the following formula:
[formula image not reproduced in the source text]
wherein: $Q$ is the keyword set of the current web page, $q_i$ is any keyword in the keyword set $Q$, $d_i$ is the number of web page blocks in the current web page that contain the keyword $q_i$, $f_i$ is the number of occurrences of the keyword $q_i$ in web page block $B$, $k_1$ and $b$ are empirically set parameters, $l_B$ is the text content length of web page block $B$, and $l_{avg}$ is the average text content length of all web page blocks in the current web page.
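To show how the quantities defined above interact, the sketch below scores one block with the classic Okapi BM25 formula. This is an assumption made for illustration only: the patent's improved formula is not reproduced in this text, so the standard form is swapped in and should not be read as the patented one. The defaults `k1=1.2` and `b=0.75` are common textbook values, not values from the source, and the token lists are hypothetical.

```python
import math


def bm25_score(keywords, block_tokens, all_blocks, k1=1.2, b=0.75):
    """Classic Okapi BM25 score of one block against the page keywords."""
    n_blocks = len(all_blocks)
    l_avg = sum(len(blk) for blk in all_blocks) / n_blocks
    l_b = len(block_tokens)
    score = 0.0
    for q in keywords:
        d_i = sum(1 for blk in all_blocks if q in blk)  # blocks containing q
        f_i = block_tokens.count(q)                     # occurrences of q in B
        if f_i == 0:
            continue
        # Smoothed inverse block frequency; the +1 keeps it non-negative.
        idf = math.log((n_blocks - d_i + 0.5) / (d_i + 0.5) + 1.0)
        # Length normalization: longer-than-average blocks are damped.
        norm = f_i + k1 * (1.0 - b + b * l_b / l_avg)
        score += idf * f_i * (k1 + 1.0) / norm
    return score


# Hypothetical tokenized blocks of one page and its title keywords.
blocks = [
    ["svm", "topic", "block", "svm", "web"],  # article body
    ["home", "about", "contact"],             # navigation bar
    ["topic", "ads", "buy", "now"],           # advertisement
]
keywords = ["svm", "topic"]
scores = [bm25_score(keywords, blk, blocks) for blk in blocks]
```

With these inputs the article body receives the highest score, the advertisement a small positive score, and the navigation bar a score of zero, which is the ordering a relevance threshold is meant to separate.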
Further, in step (4), a weight threshold for topic relevance is found by jointly considering the common evaluation metrics of accuracy, recall, F1 value, and the ROC (receiver operating characteristic) curve. If the relevance weight of a web page block in the test set is greater than the weight threshold, the block is classified as a topic block; otherwise it is classified as a non-topic block.
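One simple way to search for such a threshold is sketched below. It is a simplification under stated assumptions: the source combines accuracy, recall, F1, and the ROC curve without saying how, whereas this sketch maximizes F1 alone over the observed weights. The function names and the sample weights and labels are hypothetical.

```python
def f1_at(threshold, scores, labels):
    """F1 of the rule 'weight > threshold => topic block (label 1)'."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each observed weight as a cut-off and keep the one with best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Hypothetical relevance weights and manual labels of labeled blocks.
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05]
labels = [0, 0, 1, 1, 1, 0]
threshold = best_threshold(scores, labels)
```

A fuller search could also trace the ROC curve by sweeping the same candidate thresholds and recording the true-positive and false-positive rates at each one.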
Further, the specific criteria for re-judging whether a web page block is a topic block in step (5) are as follows:
if the classification results of step (3) and step (4) are consistent, the web page block is finally determined to be a topic block or a non-topic block according to their common result;
if the web page block is judged to be a topic block in step (3) and a non-topic block in step (4), the web page blocks adjacent to the current web page block are analyzed; if at least 2/3 of these adjacent blocks are topic blocks, the current web page block is finally judged to be a topic block and its relevance weight is modified to be greater than the weight threshold; otherwise the current web page block is finally judged to be a non-topic block;
if the web page block is judged to be a non-topic block in step (3) and a topic block in step (4), the web page blocks adjacent to the current web page block are analyzed; if at least 2/3 of these adjacent blocks are non-topic blocks, the current web page block is finally judged to be a non-topic block and its relevance weight is modified to be smaller than the weight threshold; otherwise the current web page block is finally judged to be a topic block.
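The three criteria above reduce to a short decision function, sketched here. The function name `fuse` and its arguments are hypothetical; which blocks count as adjacent is left to the caller, and the adjustment of the relevance weight that accompanies an overridden text decision is omitted for brevity.

```python
def fuse(svm_is_topic, text_is_topic, neighbor_labels):
    """Combine the visual (SVM) and text (relevance weight) decisions.

    neighbor_labels holds the labels of the adjacent web page blocks,
    True for topic block and False for non-topic block.
    """
    if svm_is_topic == text_is_topic:
        return svm_is_topic  # both classifiers agree
    total = len(neighbor_labels)
    agree = sum(1 for lab in neighbor_labels if lab == svm_is_topic)
    # Follow the SVM only if at least 2/3 of the neighbors share its
    # label; the integer comparison avoids floating-point rounding.
    if total > 0 and 3 * agree >= 2 * total:
        return svm_is_topic
    # Otherwise the text-based decision stands.
    return text_is_topic


print(fuse(True, False, [True, True, False]))   # neighbors back the SVM
print(fuse(True, False, [True, False, False]))  # neighbors do not
```

Note that a block with no neighbors can never pass the 2/3 test, so in this sketch the text-based decision wins in that case; the source does not address this edge case.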
In the method for identifying web page topic blocks based on combined features, after the web page is segmented into blocks, a support vector machine first predicts from the visual features of each web page block whether it is a topic block. An improved BM25 algorithm then computes the relevance weight between the content of each web page block and the topic, and this weight is compared with the searched optimal threshold to judge whether the block is a topic block. Finally, the two results are combined, so that both the visual features and the text features of a web page block are used to decide whether it is a topic block. Topic block identification therefore considers structure and content at the same time, avoids the deviation that a single type of feature may introduce, and identifies topic-related content in the web page more accurately.
Drawings
FIG. 1 is a flowchart of the web page topic block identification method of the present invention.
Detailed Description
To describe the present invention more specifically, the technical solution of the invention is explained in detail below with reference to the accompanying drawing and specific embodiments.
As shown in FIG. 1, the method for identifying web page topic blocks based on combined features of the present invention includes the following steps:
(1) First, the structure of the web page is analyzed, and the page is segmented with an improved VIPS algorithm according to its visual features, so that different content is divided into different web page blocks. The visual feature data and text feature data of the web page blocks are then extracted; these are raw data that require further processing. The web page is segmented in the following steps:
1.1 Extracting web page blocks: a DOM tree is built from the HTML structure of the web page and traversed in order, collecting visual information for each node such as its tag, content, font, and color. This information is checked against the relevant rules to judge whether the node can be split off as a web page block. If so, the web page block is output to the web page block pool; otherwise the child nodes of the node are examined iteratively.
1.2 Detecting separators: first, a list is defined to store all separators. The list initially contains a single separator formed by the boundaries of the entire web page. Then the web page blocks are traversed: each block is compared with every separator, and the separator is resized according to the separator adjustment rules and placed at the appropriate position. Finally, after all separators have been processed, the separators on the four boundaries of the web page block are removed.
1.3 Reconstructing web page blocks: after steps 1.1 and 1.2 have produced all separators between the web page blocks, the separators are sorted in order of increasing weight. The separator with the smallest current weight is selected and the web page blocks on its two sides are merged; the DoC (degree of coherence) value of the new web page block is then calculated. If the DoC value is smaller than PDoC (the permitted degree of coherence), the web page block is processed again; otherwise the block is stored as an independent visual block and processing of it stops. Merging of the web page blocks on the two sides of the remaining separators then continues iteratively, in order of increasing separator weight, until the DoC value of the merged web page block is no less than the threshold PDoC.
(2) The visual feature data of the web page blocks are normalized, and the categories of part of the web page blocks are labeled: 0 denotes a non-topic block and 1 denotes a topic block. The category of each web page block and its visual features are then converted into the data format required by the support vector machine, in which each row represents one web page block, the first column of the row is the category of the block, and the remaining columns hold a sequence number and the visual feature values of the corresponding block.
(3) The visual feature data of the web page blocks of known category are used to train a support vector machine and build a classification prediction model. The model then predicts the category of each web page block in the test set from its visual feature data, thereby judging whether the block is a topic block.
(4) The text features of the web page blocks are analyzed, and the relevance weight between the content of each web page block and the topic is calculated with an improved version of the BM25 probabilistic retrieval model. An optimal threshold for the topic relevance weight is then searched for using the common text evaluation metrics of accuracy, recall, F1 value, and the ROC curve. Web page blocks whose relevance weight is greater than the threshold are classified as topic blocks, and those whose relevance weight is smaller than the threshold are classified as non-topic blocks.
This embodiment improves the traditional BM25 algorithm. To calculate the relevance between the content of a web page block and the topic, the keywords of the current web page are first extracted according to the web page block title and the weight of each keyword is calculated; the relevance weight between the web page block and the topic is then calculated from the number of occurrences of the keywords in the title and in the current web page block, the text content length of the current web page block, and the related empirical parameters. The improved relevance weight is calculated by the following formula:
[formula image not reproduced in the source text]
wherein: $tf_i$ is the number of web page blocks containing the keyword $q_i$, $f_i$ is the number of occurrences of the keyword $q_i$ in web page block $B$, $k_1$ and $b$ are empirical parameters, $bl$ is the text content length of web page block $B$, and $avgbl$ is the average text content length of all web page blocks.
(5) For a web page block of unknown category, the model built in step (3) first predicts whether it is a topic block; the improved BM25 algorithm of step (4) then calculates the relevance weight between the block and the topic, and the block is classified as a topic block or a non-topic block according to the threshold. Finally, the two results, obtained by classifying the block on its visual features and on its text features, are compared. Let A denote the result based on visual features and B the result based on text features, each being true when the block is judged to be a topic block. The specific flow is as follows:
If A and B are both true or both false, the web page block is considered a topic block or a non-topic block accordingly. If A is true and B is false, the sibling web page blocks of the block are analyzed: if at least 2/3 of them are topic blocks, the block is classified as a topic block and its text-feature weight is set greater than the threshold; otherwise the block is classified as a non-topic block. Conversely, if A is false and B is true, the sibling web page blocks of the block are likewise analyzed: if at least 2/3 of them are non-topic blocks, the block is classified as a non-topic block and its text-feature weight is set smaller than the threshold; otherwise the block is classified as a topic block. Topic block identification therefore considers structure and content at the same time and avoids the deviation that a single type of feature may introduce.
The embodiments described above are presented to enable a person of ordinary skill in the art to make and use the invention. It will be readily apparent to those skilled in the art that various modifications may be made to the above embodiments and that the generic principles described herein may be applied to other embodiments without inventive effort. The present invention is therefore not limited to the above embodiments, and improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the protection scope of the present invention.
Claims (2)
1. A method for identifying web page topic blocks based on combined features, comprising the following steps:
(1) collecting a large number of web pages on various topics, analyzing the web page structure, segmenting each web page according to its visual features so that different content is divided into different web page blocks, and dividing all resulting web page blocks into a training set and a test set;
(2) extracting the visual feature data and text feature data of the web page blocks, normalizing the visual feature data to the range 0 to 1, and then manually labeling the category of each web page block in the training set: 0 denotes a non-topic block and 1 denotes a topic block;
(3) training a classification model for web page blocks with a support vector machine on the visual feature data of the web page blocks of known category in the training set, and using the classification model to identify whether the web page blocks in the test set are topic blocks;
before training with the support vector machine, converting the visual feature data of the web page blocks of known category in the training set into the format required by the support vector machine, and establishing a data set as the input for training the support vector machine model; each row of the data set represents one web page block, the first column of the row is the category of the block, and the remaining columns hold a sequence number and the visual feature values of the block;
(4) analyzing the text feature data of the web page blocks and calculating the relevance weight between the content of each web page block and the web page topic: first, the keywords of the current web page are extracted according to the web page block title; then the relevance weight Score(Q, B) between the content of any web page block B in the current web page and the web page topic is calculated by the following formula:
[formula image not reproduced in the source text]
wherein: $Q$ is the keyword set of the current web page, $q_i$ is any keyword in the keyword set $Q$, $d_i$ is the number of web page blocks in the current web page that contain the keyword $q_i$, $f_i$ is the number of occurrences of the keyword $q_i$ in web page block $B$, $k_1$ and $b$ are empirically set parameters, $l_B$ is the text content length of web page block $B$, and $l_{avg}$ is the average text content length of all web page blocks in the current web page;
further judging from the relevance weights whether the web page blocks in the test set are topic blocks, namely: a weight threshold for topic relevance is found by jointly considering the common evaluation metrics of accuracy, recall, F1 value, and the ROC curve; if the relevance weight of a web page block in the test set is greater than the weight threshold, the block is classified as a topic block, and otherwise as a non-topic block;
(5) combining the classification results of step (3) and step (4) and re-judging whether each web page block is a topic block, the specific criteria being as follows:
if the classification results of step (3) and step (4) are consistent, the web page block is finally determined to be a topic block or a non-topic block according to their common result;
if the web page block is judged to be a topic block in step (3) and a non-topic block in step (4), the web page blocks adjacent to the current web page block are analyzed; if at least 2/3 of these adjacent blocks are topic blocks, the current web page block is finally judged to be a topic block and its relevance weight is modified to be greater than the weight threshold; otherwise the current web page block is finally judged to be a non-topic block;
if the web page block is judged to be a non-topic block in step (3) and a topic block in step (4), the web page blocks adjacent to the current web page block are analyzed; if at least 2/3 of these adjacent blocks are non-topic blocks, the current web page block is finally judged to be a non-topic block and its relevance weight is modified to be smaller than the weight threshold; otherwise the current web page block is finally judged to be a topic block.
2. The method for identifying web page topic blocks according to claim 1, wherein in step (1) the VIPS algorithm is used to segment the web page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710509023.9A CN107346335B (en) | 2017-06-28 | 2017-06-28 | A method for web page topic block recognition based on combined features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710509023.9A CN107346335B (en) | 2017-06-28 | 2017-06-28 | A method for web page topic block recognition based on combined features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107346335A (en) | 2017-11-14 |
CN107346335B (en) | 2020-04-14 |
Family
ID=60257105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710509023.9A Active CN107346335B (en) | 2017-06-28 | 2017-06-28 | A method for web page topic block recognition based on combined features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107346335B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284392B (en) * | 2018-12-07 | 2021-04-06 | 达闼机器人有限公司 | Text classification method, device, terminal and storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663023A (en) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web content |
CN103399951A (en) * | 2013-08-19 | 2013-11-20 | 山东大学 | Semi-supervised image reordering method with self-feedback characteristic based on heterogeneous diagram |
CN103853834A (en) * | 2014-03-12 | 2014-06-11 | 华东师范大学 | Text structure analysis-based Web document abstract generation method |
Non-Patent Citations (2)
Title |
---|
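Wu Jiehua et al. Web topic classification using an improved multi-classifier ensemble AdaBoost algorithm. Computer Applications and Software. 2013, * |
Han Xianpei et al. Discovering main content blocks of web pages based on layout features and language features. Journal of Chinese Information Processing. 2008, * |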
Also Published As
Publication number | Publication date |
---|---|
CN107346335A (en) | 2017-11-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107229668B (en) | A text extraction method based on keyword matching | |
CN108073568B (en) | Keyword extraction method and device | |
CN110929038B (en) | Knowledge graph-based entity linking method, device, equipment and storage medium | |
US7783642B1 (en) | System and method of identifying web page semantic structures | |
KR101312770B1 (en) | Information classification paradigm | |
CN103914478B (en) | Webpage training method and system, webpage Forecasting Methodology and system | |
WO2019128124A1 (en) | Text quality index obtaining method and apparatus | |
KR20170123331A (en) | Information extraction method and apparatus | |
US20170091318A1 (en) | Apparatus and method for extracting keywords from a single document | |
US20120221602A1 (en) | Method and apparatus for word quality mining and evaluating | |
US20180357302A1 (en) | Method and device for processing a topic | |
RU2016107443A (en) | METHOD AND DEVICE FOR RECOMMENDING REFERENCE DOCUMENTS | |
KR20070102035A (en) | Document classification system and method | |
CN101661482A (en) | Method and device for recognizing similar subgraph in network | |
CN105574047A (en) | Website main page feature analysis based Chinese website sorting method and system | |
CN108804421A (en) | Text similarity analysis method, device, electronic equipment and computer storage media | |
CN112434163A (en) | Risk identification method, model construction method, risk identification device, electronic equipment and medium | |
CN113468339A (en) | Label extraction method, system, electronic device and medium based on knowledge graph | |
CN106294797B (en) | A method and device for generating a video gene | |
CN113807073B (en) | Text content anomaly detection method, device and storage medium | |
JP2009015796A (en) | Apparatus and method for extracting multiplex topics in text, program, and recording medium | |
CN106815209B (en) | Uygur agricultural technical term identification method | |
CN107346335B (en) | A method for web page topic block recognition based on combined features | |
CN115130601A (en) | Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion | |
CN112417322B (en) | Type discrimination method and system for interest point name text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||