CN107346335B

CN107346335B - A method for web page topic block recognition based on combined features

Info

Publication number: CN107346335B
Application number: CN201710509023.9A
Authority: CN
Inventors: 姜晓红; 张思; 付钊; 陈广; 杜定益; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2017-06-28
Filing date: 2017-06-28
Publication date: 2020-04-14
Anticipated expiration: 2037-06-28
Also published as: CN107346335A

Abstract

The invention discloses a web page topic block identification method based on combined features. After a web page is segmented into blocks, a support vector machine predicts whether each web page block is a topic block from the block's visual features; an improved BM25 algorithm computes a relevance weight between the content of each web page block and the page topic, and this weight is compared with an optimal threshold found by search to judge whether the block is a topic block; finally, the two judgments are combined, so that both the visual features and the text features of the web page block are used to decide whether it is a topic block. Because the structure and the content of a block are considered together, the bias that a single kind of feature may introduce is avoided, and topic-related content in a web page is identified more accurately.

Description

Web page topic block identification method based on combined features

Technical Field

The invention belongs to the technical field of Web information extraction, and particularly relates to a web page topic block identification method based on combined features.

Background

The rapid development of the Internet has created the current "big data" era of information explosion, and research in every industry now depends on big data. Web pages are an important medium for carrying big data, and the information they contain covers all industries. However, Web information contains a large amount of noise, such as advertisements and navigation bars, which hinders the automatic mining and collection of Web information. Quickly and accurately locating the topic-related information in a page and identifying it is therefore crucial.

At present, various methods exist for identifying topic information in a web page. Judging whether a web page block is a topic block by analyzing either the structure of the web page or its text content is an effective approach, but identifying topic information from text features alone or from visual features alone may introduce a certain bias.

Disclosure of Invention

In view of the above, the present invention provides a web page topic block identification method based on combined features. By combining the text features and the visual features of a web page block, the method judges the relevance between the web page block and the topic from multiple aspects and identifies topic-related content in the web page more accurately.

A web page topic block identification method based on combined features comprises the following steps:

(1) collecting a large number of web pages on various topics, analyzing the web page structure, segmenting each web page according to its visual features so that different contents fall into different web page blocks, and dividing all the resulting web page blocks into a training set and a test set;

(2) extracting visual feature data and text feature data of the web page blocks, normalizing the visual feature data to the range 0 to 1, and then manually labeling the category of each web page block in the training set: 0 denotes a non-topic block and 1 denotes a topic block;

(3) training a classification model for web page blocks with a support vector machine on the visual feature data of the labeled web page blocks in the training set, and using the classification model to identify whether each web page block in the test set is a topic block;

(4) analyzing the text feature data of the web page blocks, calculating the relevance weight between the content of each web page block and the web page topic, and judging from the relevance weight whether each web page block in the test set is a topic block;

(5) combining the classification results of step (3) and step (4) and re-judging whether the web page block is a topic block.
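The normalization in step (2) can be illustrated with a minimal sketch. The text only states that the visual feature data are normalized to the range 0 to 1; min-max scaling per feature column is assumed here as one common way to do that, and the function name `min_max_normalize` and the three example features are hypothetical.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale each feature column of `rows` into the range [0, 1] (assumed min-max scaling)."""
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column carries no information; map it to 0.0.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical visual features per block: [width_px, height_px, font_size_pt]
blocks = [
    [960.0, 120.0, 12.0],
    [640.0, 800.0, 14.0],
    [320.0, 460.0, 16.0],
]
for row in min_max_normalize(blocks):
    print(row)
```

In practice the minimum and maximum would be taken from the training set and reused for the test set, so that both sets are scaled consistently.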

Further, in step (1), the VIPS (Vision-based Page Segmentation) algorithm, which segments web pages based on visual information, is used to segment the web page.

Further, before training with the support vector machine in step (3), the visual feature data of the labeled web page blocks in the training set are converted into the format required by the support vector machine, so as to build a data set that serves as the training input; each row of the data set represents one web page block, the first column of each row is the category of the web page block, and the remaining columns are sequence numbers and the corresponding visual feature values of the web page block.

Further, the relevance weight between the content of a web page block and the web page topic in step (4) is calculated as follows: first, keywords of the current web page are extracted according to the web page block title; then, for any web page block B in the current web page, the relevance weight Score(Q, B) between its content and the web page topic is calculated by the following formula:
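The row layout described above resembles the sparse text format read by LIBSVM-style tools, `label index:value index:value ...`. The sketch below assumes that reading, treating the "sequence number" as a 1-based feature index; this is an interpretation, not something the text confirms. The helper names `to_svm_line` and `to_svm_dataset` are hypothetical.

```python
from typing import List, Sequence


def to_svm_line(label: int, features: Sequence[float]) -> str:
    """Format one web page block as `label index:value ...` with 1-based indices."""
    pairs = [f"{i}:{value:g}" for i, value in enumerate(features, start=1)]
    return " ".join([str(label)] + pairs)


def to_svm_dataset(labels: Sequence[int], rows: Sequence[Sequence[float]]) -> List[str]:
    """Build one text row per web page block; labels are 0 (non-topic) or 1 (topic)."""
    if len(labels) != len(rows):
        raise ValueError("labels and rows must have the same length")
    return [to_svm_line(label, row) for label, row in zip(labels, rows)]


# Two hypothetical blocks with three normalized visual features each.
labels = [1, 0]
rows = [[0.5, 1.0, 0.25], [1.0, 0.0, 0.0]]
for line in to_svm_dataset(labels, rows):
    print(line)
```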

wherein: Q is the keyword set of the current web page, q_i is any keyword in the keyword set Q, d_i is the number of web page blocks in the current web page that contain keyword q_i, f_i is the number of occurrences of keyword q_i in web page block B, k₁ and b are preset empirical parameters, l_B is the text content length of web page block B, and l_avg is the average text content length of all web page blocks in the current web page.
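For orientation, the quantities defined above can be plugged into the classic Okapi BM25 form, sketched below. This is not the improved formula of the invention, which is not reproduced in this text; the smoothed IDF term, the total block count, the default values k1 = 1.2 and b = 0.75, and the function name `bm25_score` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def bm25_score(
    keywords: Sequence[str],
    block: Sequence[str],
    all_blocks: Sequence[Sequence[str]],
    k1: float = 1.2,   # assumed default, not taken from the source
    b: float = 0.75,   # assumed default, not taken from the source
) -> float:
    """Classic BM25 relevance of one tokenized block to the page keyword set."""
    n_blocks = len(all_blocks)
    total_len = sum(len(blk) for blk in all_blocks)
    if n_blocks == 0 or total_len == 0:
        return 0.0
    l_avg = total_len / n_blocks   # average block length
    l_b = len(block)               # length of this block
    score = 0.0
    for q in keywords:
        d_i = sum(1 for blk in all_blocks if q in blk)  # blocks containing q
        f_i = block.count(q)                            # occurrences of q in this block
        # Smoothed IDF; the +1 keeps the weight non-negative (an assumed variant).
        idf = math.log((n_blocks - d_i + 0.5) / (d_i + 0.5) + 1.0)
        denom = f_i + k1 * (1.0 - b + b * l_b / l_avg)
        score += idf * f_i * (k1 + 1.0) / denom
    return score


# Hypothetical page of three tokenized blocks and two page keywords.
page = [
    ["svm", "classifies", "web", "page", "blocks"],
    ["buy", "cheap", "ads", "now"],
    ["web", "page", "topic", "blocks", "and", "svm"],
]
keywords = ["svm", "topic"]
scores = [bm25_score(keywords, blk, page) for blk in page]
print(scores)
```

A block that contains none of the keywords scores zero, and a block that contains more of them scores higher, which is the behaviour the thresholding step relies on.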

Further, in step (4), a weight threshold for topic relevance is found by jointly considering the common evaluation indexes of accuracy, recall, F1 value, and the ROC (receiver operating characteristic) curve; if the relevance weight of a web page block in the test set is greater than the weight threshold, the web page block is classified as a topic block, and otherwise it is classified as a non-topic block.

Further, the specific criteria for re-judging whether the web page block is a topic block in step (5) are as follows:

if the classification results of step (3) and step (4) agree, the web page block is finally judged to be a topic block or a non-topic block according to their common result;

if the web page block is judged to be a topic block in step (3) and a non-topic block in step (4), the adjacent web page blocks around the current web page block are analyzed: if at least 2/3 of the surrounding web page blocks are topic blocks, the current web page block is finally judged to be a topic block and its relevance weight is modified to be greater than the weight threshold; otherwise, the current web page block is finally judged to be a non-topic block;

if the web page block is judged to be a non-topic block in step (3) and a topic block in step (4), the adjacent web page blocks around the current web page block are analyzed: if at least 2/3 of the surrounding web page blocks are non-topic blocks, the current web page block is finally judged to be a non-topic block and its relevance weight is modified to be smaller than the weight threshold; otherwise, the current web page block is finally judged to be a topic block.
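One simple way to search for such a threshold is to try every observed relevance weight as a candidate and keep the one that scores best on labeled data. The sketch below uses only the F1 value as the selection criterion, which is a simplification of the combined indexes named above; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 of the rule `score > threshold => topic block (label 1)`."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try each distinct score as a threshold; return (threshold, F1) of the best one."""
    if not scores:
        raise ValueError("scores must not be empty")
    candidates = sorted(set(scores))
    best = (candidates[0], f1_at_threshold(scores, labels, candidates[0]))
    for t in candidates[1:]:
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best[1]:  # ties keep the lower threshold
            best = (t, f1)
    return best


# Hypothetical relevance weights and manual labels for six blocks.
weights = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05]
truth = [0, 0, 1, 1, 1, 0]
threshold, f1 = best_threshold(weights, truth)
print(threshold, round(f1, 3))
```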
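The three criteria above reduce to a compact rule: when the two classifiers disagree, the visual verdict of step (3) stands only if at least 2/3 of the neighbouring blocks share it; otherwise the text verdict of step (4) stands. The sketch below encodes that rule. The function name `combine` is hypothetical, the fallback for a block with no neighbours is an assumption (the text does not cover that case), and the accompanying adjustment of the relevance weight is omitted.

```python
from typing import Sequence


def combine(
    visual_is_topic: bool,
    text_is_topic: bool,
    neighbor_is_topic: Sequence[bool],
) -> bool:
    """Final topic/non-topic decision from the step (3) and step (4) results."""
    if visual_is_topic == text_is_topic:
        return visual_is_topic  # both classifiers agree
    total = len(neighbor_is_topic)
    if total == 0:
        # Assumed fallback: with no neighbours the 2/3 test cannot pass,
        # so the "otherwise" branch applies and the text verdict stands.
        return text_is_topic
    agreeing = sum(1 for n in neighbor_is_topic if n == visual_is_topic)
    # Integer form of agreeing / total >= 2/3, avoiding float rounding.
    if 3 * agreeing >= 2 * total:
        return visual_is_topic
    return text_is_topic


# Visual says topic, text says non-topic, two of three neighbours are topic blocks.
print(combine(True, False, [True, True, False]))
```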

In the web page topic block identification method based on combined features of the present invention, after the web page is segmented into blocks, a support vector machine first predicts whether each web page block is a topic block from its visual features; an improved BM25 algorithm then calculates the relevance weight between the content of each web page block and the topic, and this weight is compared with an optimal threshold found by search to judge whether the web page block is a topic block; finally, the two judgments are combined, so that both the visual features and the text features of the web page block are used to decide whether it is a topic block. Because the structure and the content of a block are considered together, the bias that a single kind of feature may introduce is avoided, and topic-related content in the web page is identified more accurately.

Drawings

FIG. 1 is a flowchart of the web page topic block identification method of the present invention.

Detailed Description

To describe the present invention more specifically, the technical solution is described in detail below with reference to the accompanying drawing and a specific embodiment.

As shown in FIG. 1, the web page topic block identification method based on combined features of the present invention includes the following steps:

(1) First, the structure of the web page is analyzed and the web page is segmented with an improved VIPS algorithm according to its visual features, so that different contents of the web page fall into different web page blocks. Visual feature data and text feature data of the web page blocks are then extracted; these are raw data that require further processing. The specific steps for segmenting the web page are as follows:

1.1 Extracting web page blocks: a DOM tree is constructed from the HTML structure of the web page and traversed in order, and visual information such as the tag, content, font, and color of each node is obtained. The relevant rules are applied to this information to judge whether a web page block can be segmented out at that node; if so, the web page block is output to the web page block pool; otherwise, the child nodes of the node are examined iteratively.

1.2 Detecting segmentation bars: first, a list is defined to store all the segmentation bars; initially the list contains only one bar, formed by the boundaries of the entire web page. Then the web page blocks are traversed: each web page block is compared with every segmentation bar, and the bar is resized according to the segmentation bar adjustment rules and placed at the appropriate position. Finally, after all the segmentation bars have been processed, the segmentation bars at the four boundaries of the web page block are removed.

1.3 Reconstructing web page blocks: after all the segmentation bars between the web page blocks have been obtained through steps 1.1 and 1.2, the segmentation bars are sorted in ascending order of weight. The segmentation bar with the smallest current weight is selected and the web page blocks on its two sides are merged; the DoC (degree of coherence) value of the resulting new web page block is then calculated. If the DoC value is smaller than the threshold PDoC (permitted degree of coherence), the web page block is processed again; otherwise, the block is stored as an independent visual block and processing stops. The web page blocks on the two sides of the remaining segmentation bars continue to be merged iteratively, in ascending order of weight, until the DoC value of the merged web page block is no longer smaller than the threshold PDoC.

(2) The visual feature data of the web page blocks are normalized, and the categories of some of the web page blocks are labeled: 0 denotes a non-topic block and 1 denotes a topic block. The category of each web page block and its visual features are then converted into the data format required by the support vector machine, in which each row represents one web page block, the first column of each row is the category of the web page block, and the remaining columns are sequence numbers and the corresponding visual feature values of the web page block.

(3) A support vector machine is trained on the visual feature data of the web page blocks of known category to build a classification prediction model; the model then predicts the category of each web page block in the test set from its visual feature data, thereby judging whether the block is a topic block.

(4) The text features of the web page blocks are analyzed, and the relevance weight between the content of each web page block and the topic is calculated with an improved form of the probabilistic retrieval model BM25. An optimal threshold for the topic relevance weight is then sought using the common text evaluation indexes of accuracy, recall, F1 value, and the ROC curve; web page blocks whose relevance weight is greater than the threshold are classified as topic blocks, and web page blocks whose relevance weight is smaller than the threshold are classified as non-topic blocks.

This embodiment improves the traditional BM25 algorithm. When calculating the relevance between the content of a web page block and the topic, the keywords of the current web page are first extracted according to the web page block title and the weight of each keyword is calculated; the relevance weight between the web page block and the topic is then calculated from the number of occurrences of the keywords in the title and in the current web page block, the text content length of the current web page block, and the related empirical parameters. The improved relevance weight is calculated by the following formula:

wherein: tf_i is the number of web page blocks containing keyword q_i, f_i is the number of occurrences of keyword q_i in web page block B, k₁ and b are empirical parameters, bl is the text content length of web page block B, and avgbl is the average text content length of all web page blocks.

(5) For a web page block of unknown category, the model established in step (3) first predicts whether the web page block is a topic block; the relevance weight between the web page block and the topic is then calculated with the improved BM25 algorithm according to step (4), and the web page block is classified as a topic block or a non-topic block according to the threshold. Finally, the two results obtained from the visual features and the text features of the web page block are compared and analyzed. The specific flow is as follows:

Let A denote the result obtained from the visual features in step (3) and B the result obtained from the text features in step (4), each being true when the web page block is judged to be a topic block. If A and B are both true or both false, the web page block is considered to be a topic block or a non-topic block accordingly. If A is true and B is false, the sibling web page blocks of the web page block are analyzed: if at least 2/3 of them are topic blocks, the web page block is classified as a topic block and its text feature weight is set to be greater than the threshold; otherwise, the web page block is classified as a non-topic block. Conversely, if A is false and B is true, the sibling web page blocks are likewise analyzed: if at least 2/3 of them are non-topic blocks, the web page block is classified as a non-topic block and its text feature weight is set to be smaller than the threshold; otherwise, the web page block is classified as a topic block. In this way, both the structure and the content of a block are considered when identifying topic blocks, and the bias that a single kind of feature may introduce is avoided.

The embodiment described above is presented to enable a person having ordinary skill in the art to understand and use the invention. It will be readily apparent to those skilled in the art that various modifications may be made to the above embodiment, and that the generic principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiment, and improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the protection scope of the present invention.

Claims

1. A web page topic block identification method based on combined features, comprising the following steps:

before training with a support vector machine, converting the visual feature data of the web page blocks of known category in the training set into the format required by the support vector machine, and building a data set as the input for training the support vector machine model; each row of the data set represents one web page block, the first column of each row is the category of the web page block, and the remaining columns are sequence numbers and the corresponding visual feature values of the web page block;

(4) analyzing the text feature data of the web page blocks and calculating the relevance weight between the content of each web page block and the web page topic: first, keywords of the current web page are extracted according to the web page block title; then, for any web page block B in the current web page, the relevance weight Score(Q, B) between its content and the web page topic is calculated by the following formula:

wherein: Q is the keyword set of the current web page, q_i is any keyword in the keyword set Q, d_i is the number of web page blocks in the current web page that contain keyword q_i, f_i is the number of occurrences of keyword q_i in web page block B, k₁ and b are preset empirical parameters, l_B is the text content length of web page block B, and l_avg is the average text content length of all web page blocks in the current web page;

further judging from the relevance weight whether each web page block in the test set is a topic block, namely: a weight threshold for topic relevance is found by jointly considering the common evaluation indexes of accuracy, recall, F1 value, and the ROC curve; if the relevance weight of a web page block in the test set is greater than the weight threshold, the web page block is classified as a topic block, and otherwise it is classified as a non-topic block;

(5) combining the classification results of step (3) and step (4) and re-judging whether the web page block is a topic block, wherein the specific criteria are as follows:

2. The web page topic block identification method according to claim 1, wherein: in step (1), the VIPS algorithm is used to segment the web page.