CN104462582B

CN104462582B - A kind of web data similarity detection method based on structure and content secondary filtration

Info

Publication number: CN104462582B
Application number: CN201410843460.0A
Authority: CN
Inventors: 李石君; 吴岳廷; 范珊珊; 张健; 李宇轩
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2014-12-30
Filing date: 2014-12-30
Publication date: 2017-07-11
Anticipated expiration: 2034-12-30
Also published as: CN104462582A

Abstract

The invention discloses a kind of web data similarity detection method based on structure and content secondary filtration, on the basis of traditional general similarity detection method, the characteristics of excavating out web data structure and distribution of content, the document sets to detection carry out double-filtration；First order filtering in double-filtration is structural similarity filtering, Tag tree constructions are modeled as to each Web document, so as to reject dissimilar document sets in structure, and key content extraction is carried out to remaining document, it is expressed as the form of tuple vector, key message is connected into generation character trail；The character trail that second level filtering in double-filtration is generated after then being filtered to the first order carries out Trie tree construction modelings, and similar character string is attached, and obtains final result.Proved by many experiments, the effect of web sphere data similarity detection can be significantly improved using method proposed by the present invention.

Description

Web data similarity detection method based on structure and content secondary filtering

Technical Field

The invention belongs to the field of similarity detection of Web data structures and contents, and particularly relates to a Web data similarity detection method based on structure and content secondary filtering.

Background

Web data is typically stored and exchanged in XML or HTML, etc., and the data is organized by various tags. Because the Web data has high transmission speed and the format of the html webpage is flexible, the probability that the webpage with wide transmission is transferred or changed is higher, and the data organization forms which belong to the same theme may be different from each other, but the similarity is higher. Therefore, the detection and the exclusion of the similarity data are very significant in the fields of data classification and clustering, data mining, focused search engines, pattern recognition, article detection and the like.

The traditional Web data similarity detection method mainly focuses on two fields of Web data structure similarity detection and content character string similarity detection. For structural similarity detection, an index tree is mainly established first based on data contained in documents. Currently proposed index tree based K-nearest neighbor search algorithms such as R-tree (see document 3), K-D tree (see document 4), SR tree (see document 5) return accurate results, but are not time efficient for high dimensional data. Document 2 proposes a method of a location sensitive hash function LSH, and the basic method is to use a location sensitive hash function family to hash an object in a high dimensional space into an adjacent bucket. When similarity search is performed, the indexing method hashes a query object into a bucket, takes the value in the hashed bucket as a candidate result set, and then sorts the candidate result set based on similarity distance measurement.

Web documents in the candidate dataset may be modeled as an ordered tagged root tree (see documents 1, 2). The elements and attributes in a document can be viewed as structural nodes or internal nodes. The tree edit distance is used as a standard for measuring the similarity between TAG trees, wherein the tree edit distance refers to the minimum node operated by converting one tree into another tree by using operations such as deleting, inserting, renaming labels and the like. Two trees are considered similar when their edit distance is within a certain threshold.

In recent studies, the scholars Panigrahy proposed an entropy-based LSH method (see document 6) that randomly entropy-processes objects near a query object in addition to the query object, and then uses the intersection of all the results as a candidate set. This method trades time for space. However, in the case of a large Web data set, the time is not practical and the practicability is not high. Although various improved LSH algorithms exist, the LSH algorithm itself has the defects that it is difficult to find a hash function that varies with the scale and distribution characteristics of given data, and a large amount of memory resources are required to be consumed, and the hash function cannot be released in time, so that the solving process is inefficient, and the final result is not likely to be an absolute approximation, and it is difficult to obtain the expected effect only by improving the LSH algorithm.

The existing algorithm generally adopts a framework of filtering and then verifying the character string similarity of the document content (see document 7), and in the filtering step, a signature is generally generated for each character string in a character string set, and an unverified approximate pair is generated by using the characteristics between the signatures seen by the character strings. Then, in the subsequent verification process, the approximate pair is verified for correctness. After the given conditions are satisfied, the final similar results are output, and the Binary branch (see reference 7), the Min-Hashing based LSH (see reference 8) and the All-Pairs-ED (see reference 9) methods are similarity detection methods in this framework. Although the algorithms can meet the requirement of partial character string similarity detection, the algorithms cannot adapt to the condition that the average character string length in a character string set is small or large, so that the generated signature is inaccurate, the similarity detection range is easy to be too wide, a large number of similar pairs are generated, the pressure of the subsequent verification process is increased, no verification method which is suitable for any condition exists, and finally the generated result is too much and inaccurate.

The existing similarity detection methods are generally improved in one aspect of algorithm efficiency and result accuracy, and do not deeply explore the application field. For example, for documents and data in a Web space, if similarity mining is performed according to a general existing similarity detection method, it is difficult to efficiently find out approximate content blocks because structural features of the Web data and distribution region characteristics of character string contents are not fully utilized.

Document 1: andrea Tagarelli, Sergio greco.

Document 2: giovanni Mella, Elena Ferrari, Elisa Bertini, YunhuaKoglin. controlled and collaborative updates of XML documents in byzantine and failure-protein distributed systems. in TISSEC, 2006.

Document 3: a dynamic index structure for spatial search in SIGMOD, 1984.

Document 4: L.Bentley.K-D trees for semi-dynamic points sets. in SCG, 1990.

Document 5: N.Katayama and S.Satoh.the SR-tree: An index structure for high-dimensional neighbor queries.in SIGMOD, 1997.

Document 6: r. panigrahy. control based near neighbor search in high dimensions. in SODA, 2006.

Document 7: rui Yang, Panos Kalnis, and Anthony K.H.Tung.Simmilitiyevaluation on tree-structured data. in SIGMOD Conference, 2005.

Document 8: r. panigrahy. control based near neighbor search in high dimensions. in SODA, 2006.

Document 9: r.j.bayardo, y.ma, and r.srikant.scaling up all pages similarity WWW, 2007.

Disclosure of Invention

Aiming at the problems of the existing algorithm, the invention provides a Web data similarity detection method based on structure and content secondary filtering.

The technical scheme adopted by the invention is as follows: a Web data similarity detection method based on structure and content secondary filtering is characterized in that: on the basis of a traditional general similarity detection method, the characteristics of Web data structure and content distribution are discovered, and two-stage filtering is carried out on a detected document set; the first stage of filtering in the two-stage filtering is structural similarity filtering, each Web document is modeled into a Tag tree structure, so that document sets which are dissimilar in structure are removed, the key contents of the rest documents are extracted and expressed into a tuple vector form, and key information is connected to generate a character string set; and the second-stage filtration in the two-stage filtration is used for carrying out Trie tree structure modeling on the character string set generated after the first-stage filtration, and connecting similar character strings to obtain a final result.

Preferably, the first stage filtering specifically comprises the following steps:

step A1: preprocessing an input Web document set to be detected, extracting a Web document structure model tag.tree, and establishing an effective node index;

step A2: constructing Binary branch vectors (Binary Branchvector) of all tags in parallel by combining MapReduce;

step A3: establishing a multi-pointer LSH index for a binary branch vector of the tag.tree;

step A4: structure similarity Web document set S for inquiring given inquiry q₁；

Step A5: obtaining a set of Web documents S₁Based on a content six-tuple vector set W extracted by a Tag tree, and based on the W, a character string connection algorithm is used for obtaining a similarity result set S of a query q₂。

Preferably, the specific implementation of step a1 includes the following sub-steps:

step A1.1: combining MapReduce, processing a Web document data set stored in a distributed document system (HDFS) in parallel, and extracting a Web document structure model tag.tree by using a Tag tree modeling mode based on a path Tag;

step A1.2: newly building a MapReduce task, wherein the input key of the Map task is a file name, the value is tag.

Preferably, the specific implementation of step a2 includes the following sub-steps:

step A2.1: newly building a MapReduce task, calculating binary node vector sets S of all tags, and then distributing the universe set S to each DataNode in the system by using a distributedPath function of Hadoop;

step A2.2: constructing a binary branch vector BBV by using a MapReduce task according to the binary node vector of the S and each tag.tree; key of map task is file name, value is binary node vector of tag.tree; the key of the reduce task is a file name, and the value is BBV corresponding to tag.tree;

step A2.3: the task writes the BBV of each tag.

Preferably, the specific implementation of step a3 includes the following sub-steps:

step A3.1: initializing L hash tables for establishing a multi-pointer LSH index, and initializing M hash functions and r hash buckets in each table; calculating indexes of BBV of all tags by using a MapReduce task;

step A3.2: assuming that a binary branch vector of a certain Web document X is M, firstly, M times of hash functions are calculated for M, and g is set_i(m) is the hash value of query m in the ith hash table, i.e. g (m) ═ h₁(m),…,h_M(m))；

Step A3.3: initializing L-dimensional perturbation vectorsΔ, wherein Δ ═ Δ (Δ)₁,…,Δ_L)，g_i(m)+Δ₁Is to interfere with the vector delta₁Is applied to g_i(m) obtaining a first-order pointer to the hash bucket, the first-order pointer pointing to another hash bucket in the hash table, each perturbation vector being mapped to a unique set of hash values, and the pointer pointing to a hash bucket at most once;

step A3.4: applying the perturbation vector in delta to g in turn_i(m), obtaining i (i ═ 1,2, …, L) order pointers, pointing to different hash buckets, and establishing a multi-pointer index;

step A3.5: according to the pointer construction condition of all tag.tree, a hash bucket within the second order gives a group number BandID; keys of the MapReduce task are a group number BandID and a barrel number BucketID, and values are corresponding Web file names.

Preferably, the specific implementation of step a4 includes the following sub-steps:

step A4.1: extracting a binary branch vector of a given query q, quickly retrieving a top-k proximity point of the q based on a multi-pointer LSH index, and screening out a binary branch distance BDist (T) in the binary branch vector_i,T_q) Satisfies the conditionsT tree set S₁As a secondary index input;

step A4.2: quickly retrieving a top-k proximity point of q based on the index s;

computing M hash functions of query q, let g (q) be the hash value of query q in the ith hash table, i.e., where g (q) ═ h₁(q),…,h_M(q))，g_i(q)+Δ₁Is to interfere with the vector delta₁Is applied to g_iAnd (q) obtaining all hash buckets within two orders away from the given point q according to the index s by the first-order pointer of the hash bucket, namely the top-k adjacent point of the query q.

Preferably, the specific implementation of step a5 includes the following sub-steps:

step A5.1: based on step A1, a Web document set S is acquired₁Tree set;

step A5.2: obtaining the contents in the leaves of all effective nodes according to the effective index of the tag of the Web document X;

step A5.3: and after judging the leaf node type according to the father node label, adding a six-tuple vector of X, wherein the six-tuple vector consists of the number (id) of X, a title (title), content (content), a KeyWord (KeyWord), a link (URL) and release time (PageTime).

Preferably, the second-stage filtration is realized by the following steps:

step B1: extracting a character string set U (U) from the six-element group vector generated by the first-stage filtration₁,u₂,u₃,…u_k) And V (V)₁,v₂,v₃,…v_l) The two sets of strings have the exact same possibility; each character string is a result of serial connection of a number (id), a title (title), content (content), a KeyWord (KeyWord), a link (URL) and release time (PageTime) in each tuple vector; determining a character string editing distance threshold tau according to an actual application situation;

step B2: sorting the character strings in the character string sets U and V according to a dictionary to form an ordered set, and establishing a Trie tree H based on the ordered set_u∪vThe Trie tree consists of nodes corresponding to each character in the U and the V, and the nodes of the Trie tree store node numbers, mapping relations between the nodes and the characters, character string sets to which the nodes belong and structural relations between the nodes;

for example, Trie Tree H_u∪vIn the method, each node is numbered according to the sequencing traversal, and each node is labeled (only belongs to U, or only belongs to V, or belongs to U ∪ V) based on the set to which the node belongs.

Step B3: creating an accessed node stack data structure for storing accessed Trie tree nodes; the stack is initialized to be only a root node and an active node set corresponding to the root node;

for example, a length | H is created from the maximum depth of the Trie tree_U∪VEach element in the "visited node stack" ST, ST of | includes the number of the node and the active node set of the node.

Step B4: establishing a dynamic Trie tree H' only with root node root, solving an active node set of the root node root, and determining a next node for solving the active node set, which is called a destination node;

step B5: creating a virtual Trie-Tree H 'for the next destination node'_u∪v；

Step B6: solving an active node set of a destination node through an active node set of an element positioned at the uppermost layer of a stack ST based on the theorem 1;

wherein the introduction 1: given Trie Tree H_xOf a node l and its parent p, and a string edit distance threshold τ, for any node l' ∈ Q_τ(l) There must be one node p ' which is the ancestor of l ' and satisfies p ' ∈ Q_τ(p)；

Step B7: updating an active node set of elements, the distance between the elements and the topmost element of the stack ST is within tau, based on the description of the symmetry of the active nodes in the Trie tree active node definition;

step B8: judging whether the node p currently processed is a certain leaf node of the Trie tree structure, if so, determining whether the node p currently processed is a certain leaf node of the Trie tree structureOutputting similar string pairings<o,p>；

Step B9: in Trie Tree H'_u∪vThe newest destination node of the child node or the brother node of the destination node belonging to the same character set is taken out, and the steps B5 to B8 are repeated until the stack ST is emptyUntil now.

Preferably, the specific implementation of step B4 includes the following sub-steps:

step B4.1: active node set Q for solving root node based on character string editing distance_τ(root), pushing the root node and the corresponding set of active nodes into the stack ST;

the character string editing distance specifies two character strings s(s)₁s₂s₃…s_m) And t (t)₁t₂t₃…t_n) The string edit distance of s and t is represented as ED (s, t), and is represented as the minimum operation times of converting the string s into t by three operation modes of replacing, inserting and deleting with a single character as a unit; if s is empty, it is denoted ED (s, 0);

for ED (s, t), there is a degree of incremental:

①ED(s_i0)＝i,ED(t_j,0)＝j,1≤i≤m,1≤j≤n；

②ED(s_i,t_j)≥ED(s_i-1,t_j)+1,1<i≤m,1≤j≤n；

③ED(s_i,t_j)≥ED(s_i,t_j-1)+1,1≤i≤m,1<j≤n；

④ED(s_i,t_j)＝ED(s_i-1,t_j-1),1<i≤m,1<j≤n,s_i＝t_j；

⑤ED(s_i,t_j)≥ED(s_i-1,t_j-1)+1,1<i≤m,1<j≤n,s_i≠t_j；

step B4.2: and extracting a node set N only belonging to the character string set U from the active node set of the root node, and selecting the node N' closest to the root node.

Preferably, the specific implementation of step B5 includes the following sub-steps:

step B5.1: in Trie tree H based on lemma A and lemma B_u∪vScreening out a node area irrelevant to the active node set of the solving node n';

the theory A: given a string s and t, and a string edit distance threshold τ, if | | s | - | t | > τ, then there are: ED (s, t) > τ, where | s | and | t | represent the number of characters contained in the character strings s and t, respectively;

the theory B is as follows: given Trie tree structure H_xAnd two of the nodes u and v, and a string edit distance threshold τ, if one of the following conditions exists:

(1) v is the ancestor node of u, and subtrees with v and u as root nodes respectively contain the same leaf node;

(2) if only one string exists and v and u are simultaneously used as prefix strings, Q is calculated_τ(u) all subsequent nodes having v as parent or ancestor are not considered;

step B5.2: gray marking is carried out on the node area to form a tree H based on the original Trie_u∪vIs dedicated to a virtual Trie-Tree H 'of n'_u∪v。

Preferably, the specific implementation of step B6 includes the following sub-steps:

step B6.1: if the destination node belongs to the two sets, the destination node is set to belong to one set and is in a virtual Trie tree H'_u∪vExcluding nodes belonging only to the set; if the destination node only belongs to a certain set, the same is carried out in the virtual Trie tree H'_u∪vExcluding nodes belonging only to the set;

step B6.2: h'_u∪vAccording to an active node set of a parent node at the uppermost layer on a stack ST, an active node set of a target node is obtained;

step B6.3: and pressing the destination node into the stack ST, wherein the uppermost element is the number of the destination node and the corresponding active node set.

The method has the advantage that the structural characteristics and distribution characteristics of the Internet semi-structured data are fully utilized in the similarity detection process. On one hand, the structure modeling and the content processing are respectively carried out on the input Web document set based on the extracted similarity characteristics, and the Web data is detected from the two aspects of structure construction and content distribution through two-stage filtering, so that the precision of Web data similarity detection is improved; on the other hand, some impurity data which cannot belong to the similarity set can be removed in advance through the first-level structure similarity detection filtering, and the number of elements of a subsequent similarity matching candidate set is reduced. Meanwhile, a multi-pointer LSH index and a Tree modeling mechanism are respectively adopted in the similarity process, so that the efficiency and the practicability of similarity detection are improved. Multiple experiments prove that the method provided by the invention can obviously improve the data similarity detection effect in the Web field.

Drawings

FIG. 1: the invention is a framework diagram of the parallel processing flow of the Web document similarity detection.

FIG. 2: the invention discloses a Tag tree modeling diagram based on path tags, wherein part (a) is a Web document, and part (b) shows a Tag which is converted from the Web document.

FIG. 3: is a diagram illustrating an example of the conversion between the Tag tree and the binary tree according to the embodiment of the present invention.

FIG. 4: is an example of a binary node vector of an embodiment of the present invention.

FIG. 5: is a flow framework diagram of the multi-pointer LSH algorithm of the embodiment of the invention.

FIG. 6: the distribution diagram of the distance of the closest barrel to top-K in the embodiment of the invention is shown, wherein part (a) is a statistical diagram of single different hash values, and part (b) is a statistical diagram of the number indicated by different order disturbance vectors.

FIG. 7: the present invention is a Trie structure established based on an example string set.

FIG. 8: is an edit distance matrix diagram of two example character strings of an embodiment of the present invention.

FIG. 9: the Trie tree structure established based on the two character string sets is disclosed by the embodiment of the invention.

FIG. 10: the method is characterized by comprising the following steps of cutting a node set area irrelevant to solving of an active node set.

FIG. 11: the invention discloses a schematic diagram for showing the accessed sequence of Trie tree nodes.

FIG. 12: the method and the device for performing string similarity connection between different string sets in the embodiment of the invention are shown in the drawing.

FIG. 13: the method based on different parameters in the embodiment of the invention is a comparison graph of the operation effect of the traditional method.

FIG. 14: the embodiment of the invention utilizes the method in different data sets to compare the operation effect with the traditional method, wherein part (a) is the method operation effect graph provided by the invention, and part (b) is the traditional method operation effect graph.

FIG. 15: the accuracy comparison graph of the method based on different parameters and the traditional method is shown in the embodiment of the invention.

FIG. 16: the invention is a flow chart of a similarity detection method based on structure and content secondary filtering.

Detailed Description

In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.

Referring to fig. 1, the technical scheme adopted by the invention is as follows: a Web data similarity detection method based on structure and content two-stage filtering is characterized in that on the basis of a traditional general similarity detection method, the characteristics of Web data structure and content distribution are discovered, and two-stage filtering is carried out on a detected document set; the invention considers that the first structures of the documents containing similar Web data should be similar, and if the structures of the two documents are greatly different, the significance of further solving the similarity of the contents is not great. Therefore, the first-stage filtering in the two-stage filtering is structural similarity filtering, each Web document is modeled into a Tag tree structure, so that document sets which are dissimilar in structure are removed, key contents of the rest documents are extracted and expressed into a tuple vector form, and key information is connected to generate a character string set; and the second-stage filtration in the two-stage filtration is used for carrying out Trie tree structure modeling on the character string set generated after the first-stage filtration, and connecting similar character strings to obtain a final result.

The Tag tree is an important data structure invented herein for tree modeling of the structure of a Web document, and decomposes a Web document according to the path of elements or attributes in the Web document.

Under a large-scale Web document data set, a non-text area in the tag.tree is converted into a binary tree, so that a large amount of storage space is wasted, and the accuracy of judging the structural similarity of the document is influenced. In order to solve the problem, a method for establishing an effective node index for the tag.tree is adopted, only an index area of the tag.tree is converted, and the conversion efficiency of the binary tree is improved.

Definition 1(Web document tag node tree): a tree is an ordered tagged root tree of a Web document X, with each node of the tree being associated with an element or attribute of X. The id of each node is related to its order in the X document, and the id is edited in a sequence number hierarchical manner. For each node, there is a sequence number path (tag)₀,tag₁,…,tag_n)。

Referring to fig. 2, part (a) of fig. 2 is a Web document, and the tags and the content between the tags are abstracted into simple letters to emphasize the distribution of the tag structure. (b) Partially showing the tag of the Web document, through a path hierarchical mode, the parent node sequence of a certain id can be found quickly. Tree can be seen to have the following properties:

leaf nodes have no sibling nodes.

The leaf nodes are corresponding elements in the label, and are marked with NULL when no element exists.

Definition 2(tag. tree valid node index): and traversing the tag, and if the element in the leaf node accessed under a certain path is not empty and is not noise, all the nodes on the path are effective nodes. All valid nodes constitute the valid node index of the tag.

Please see table 1, all valid nodes in fig. 2 are marked, and the marking process is shown in table 1.

TABLE 1 example of creating valid node index for Tag tree

Key	Traversed node id	Active node
			(abc,w)	1，1.1，1.1.1	a,b,c
(abde,null)	1，1.1，1.1.2，1.1.2.1
			(abdf,null)	1，1.1，1.1.2，1.1.2.2
(agh,null)	1，1.2，1.2.1
			(agik,null)	1，1.2，1.2.2，1.2.2.1	a,g,i,k
(agil,null)	1，1.2，1.2.2，1.2.2.2	a,g,i,l
			(agim,null)	1，1.2，1.2.2，1.2.2.3	a,g,i,m
(agil,null)	1，1.2，1.2.2，1.2.2.4	a,g,i,n
			(agj,null)	1，1.2，1.2.3

Definition 3 (path edit distance): let T be tag, pre (T) and post (T) of document X, which respectively represent the first-order traversal operation and the last-order traversal operation of tree T. The traversal paths of pre (T) and post (T) are represented as strings s1, s2, then ed (s1, s2) is the path edit distance between s1, s 2. The tree edit distance TD (T1, T2) has the following relationship with the path edit distance ed (s1, s 2):

ed(pre(T1),pre(T2))≤TD(T1,T2)

ed(post(T1),post(T2))≤TD(T1,T2)

a tree and its binary tree may be transformed into each other, with any node in the tree, in the process of transforming into a binary tree, its left child becoming the left node of the binary tree and its right sibling becoming the right node of the left child.

Please see fig. 3, let T₁₂、T₂₁Conversion to binary tree B (T)₁₂)、B(T₂₁) In the invention, B (T)₁₂)、B(T₂₁) And converting into a full binary tree, and replacing the full binary tree with a star when the nodes are insufficient. As can be seen from the figure, T₁₂Root node u₁Left child u₂Is B (T)₁₂) Left child of u₂The right brother of (A) is B (T)₁₂) Left child u₂The right node of (c).

Definition 4 (Binary Branch): let B be a binary tree and let B be a binary tree,there is one binary branch, node u, and its two child nodes.

Referring to FIG. 3, a binary tree B (T) is shown₁₂)、B(T₂₁) All binary branches of (1). As can be seen from the figure, for B (T)₁₂)、B(T₂₁) There is a binary branch for each node in, e.g., B (T)₁₂) Parent node with binary branch as u₁And child points u₂,*。

Definition 5 (Binary Node Vector): let B be a binary tree, representing all binary branches of B as a vector V ═ V (V)₁,v₂,…,v_n)，v_iIs the ith branch of the tree T, falseV. the_iLabel is a, its child node label is a, then vi ═ a, a.

Referring to FIG. 4, a binary tree B (T) is shown₁₂)、B(T₂₁) And a binary node vector corresponding to all the binary nodes. For example B (T)₁₂) Parent node with binary branch as u₁And child points u₂And the corresponding binary node vector is (a, a).

Definition 6 (Binary Branch Vector): let the binary branch vector BBV (T) of tree T be (b)₁,b₂,…,b_|B|) Wherein b is_iRefers to the sequence number of the ith node in the binary tree B of the tree T, and | B | represents the size of the binary tree B.

Please refer to table 2, which lists the construction method of the binary branch vector in fig. 5, where a binary node vector exists in T in the binary tree and is marked as 1, and when the binary node vector does not exist, is marked as 0.

Table 2 constructs binary branch vectors

BBV(T1)

BBV(T2)

(a,a,*)	1	1
			(a,*,e)	1	0
(a,f,b)	0	1
			(b,*,c)	1	0
(b,*,X)	0	1
			(b,*,y)	0	1
(c,*,g)	1	0
			(e,*,b)	1	0
(f,*,b)	0	1
			(g,,)	1	0
(x,,)	0	1
			(y,,)	0	1

Definition 7 (Binary Branch Distance): let BBV (T1) ═ B1, B2, …, B |, BBV (T2) ═ B1 ', B2 ', …, B | ') be the binary branch vectors of trees T1, T2, whose binary branch distances areFor any two trees T1, T2 satisfies

Define 8 (Hash interference Vectors): let m dimension vector delta ═ great face₁,…,_m}，g_i(q) is the hash value of query q in the ith hash table. Applying the interference vector delta to the hash bucket to obtain a hash bucket pointer g_i(q) + Δ, which points to a set of hash values. By using multiple interference vectors, the hash bucket closest to the query object hash value can be quickly located.

For a given query q, the hash bucket is retrieved using the basic LSH algorithm to get the hash value g (q) ═ h₁(q),…,h_M(q)). The interference vector Δ is applied to g (q) resulting in a set of hash buckets with pointers g (q) + Δ. Wherein the LSH function is of the formIf W is large enough, there is a high probability that similar objects have the same or adjacent values, and therefore the interference vector Δ may be used to point to those buckets that contain adjacent values, where W is large enough_i∈{-1,0,1}。

Applying the perturbation vector directly to the hash value of the query object avoids point perturbations and the overhead of computing the hash value using entropy-based LSH methods. A set of perturbation vector sequences is designed, where each perturbation vector maps to a unique set of hash values and does not point to a hash bucket more than once.

Please see fig. 5, set g_i(q) is the hash value of query q in the ith hash table, (Δ)₁,…,Δ_L) Is a sequence of pointers, g_i(q)+Δ₁Is to interfere with the vector delta₁Is applied to g_i(q) the new hash value obtained, which points to another hash bucket in the hash table. According to the principle, the n-order perturbation vector points to all first-order buckets first, then points to second-order buckets and so on. Based on L hash tables and LSH index of M hash functions in each table, the total number of n-order buckets isThe total number of buckets in s order isBy using multiple interference vectors, the present invention locates more hash buckets whose hash values approximate the hash value of the query object.

The core idea of the multi-pointer LSH algorithm is to use a sequence of pointers to examine a large number of buckets that may contain points nearest to the query object q. Based on the nature of the location sensitive hash function, the present invention understands that if an object is close to query point q, but is not hashed to q in the same bucket, but in a bucket adjacent to q, i.e. the hash values of the two buckets are only slightly different. The goal of the present invention is to locate these adjacent buckets, thus increasing the chance of finding the adjacent points of q.

Definition 9(top-K nearest neighbor distribution): let n-th order hash interference vector delta not be zero, delta is used to detect the bucket closest to the given query hash value. Based on the attributes of the location sensitive hash function, the first order probe pointer points to a bucket that is closer to the query point than the second order probe pointer.

FIG. 6 shows the distribution of the nearest bucket distances to top-K, where (a) is a histogram of different hash values, (b) is a histogram of the numbers of perturbation vectors of different orders, and (a) is a curve showing a single hash value_iIn contrast, part of the curve (b) shows that the number of hash values is different from the hash value of the query point. It can be seen from the graph that the hash values of all K nearest neighbors in the set are either the same (c:)_i0) or different_i+1 or_i-1. Similarly, most K nearest neighbors are hashed into hash buckets that are within second order of the hash bucket from a given point.

Please refer to fig. 16, which is an overall flow chart of the two-stage filtering based on structure and content proposed by the present invention, wherein the left half illustrates the algorithm process of the first-stage structure filtering. The method specifically comprises the following steps:

step 1: decomposing the Web document according to the paths of the elements and the attributes in the Web document, extracting a structure model tag of the Web document, and establishing an effective index.

Substep 1: referring to fig. 2, a tree is constructed according to the order of nodes in a Web document by using a sequence number hierarchical mode.

Substep 2: calculating effective nodes of the tag.

Step 2: and acquiring a binary branch vector of the tag.

Substep 1: see fig. 3, which converts the index region structure tree of the tag.

Substep 2: referring to fig. 4, all binary branches of the binary tree B are obtained, and all binary branches of the binary tree B are converted into binary node vectors.

Substep 3: and acquiring a binary branch vector of the binary tree B according to the binary node vector set S of all the Web documents.

And step 3: a multi-pointer LSH index is established for the binary branch vector of the tag.

Substep 1: according to the step 1, the binary branch vector m of the Web document X is obtained in the step 2.

Substep 2: initializing L hash tables, initializing M hash functions and r hash buckets in each table, calculating M hash functions for binary branch vector M of X, and setting g_i(m) is the hash value of query m in the ith hash table, i.e. g (m) ═ h₁(m),…,h_M(m))。

Substep 3: initializing an n-dimensional perturbation vector Δ, wherein Δ ═ Δ₁,…,Δ_L)。

Substep 4: applying the perturbation vector in delta to g in turn_i(m) to obtain an i (i ═ 1,2, …, n) order index, e.g. g_i(m)+Δ₁Is to interfere with the vector delta₁Is applied to g_i(m) obtaining a first order pointer to the hash bucket that points to another set of hash buckets in the hash table, each perturbation vector mapping to a unique set of hash values, and the pointer pointing to a hash bucket at most once.

Substep 5: for each document X in the Web document set, a multi-pointer LSH index is built according to substeps 1,2, 3, 4.

And 4, step 4: structure similarity Web document set S for inquiring given inquiry q₁。

Substep 1: and extracting a binary branch vector of the given query q, and quickly retrieving a top-k proximity point of the q based on a multi-pointer LSH index retrieval algorithm.

Referring to fig. 5, the detailed process of quickly retrieving the top-k proximity point of q based on the index s includes the following steps:

① calculating a hash function of query q M times, where g (q) is the hash value of query q in the ith hash table, i.e., where g (q) ═ h₁(q),…,h_m(q))。

② initialQuantizing the L-dimensional interference vector Δ, wherein Δ ═ Δ (Δ)₁,…,Δ_L) And Δ_i＝{₁,…,_mWill interfere with the vector Δ₁Is applied to g_i(q) the first order pointer g of the hash bucket obtained_i(q)+Δ₁And by analogy, acquiring each-order bucket pointer of q.

And thirdly, obtaining all vectors in the hash bucket within two orders of q from the given point according to the LSH multi-pointer index, wherein the vectors are top-k adjacent points of the query q.

Substep 2: screening out binary branch distance BDist (T) from top-k proximity point of q_i,T_q) Satisfies the conditionsT tree set S₁And is input as a secondary index.

A second level of content string similarity filtering follows, and before elaborating, a number of key definitions that play a key role in the algorithmic specification are first given.

Definition 10 (string edit distance): specifying two character strings s(s)₁s₂s₃…s_m) And t (t)₁t₂t₃…t_n) The string edit distance of s and t is denoted as ED (s, t). The minimum operation times of converting the character string s into t by three operation modes of replacement, insertion and deletion in a single character unit are represented.

For example, the string edit distance ED (tobe, buffer) between the string "tobe" and the string "buffer" is 4, that is, the string "tobe" can be changed into the string "buffer" by three character replacing operations and one character deleting operation, which are t- > u, o- > f, b- > f, respectively, and the character f is deleted.

If s is empty, it is denoted ED (s, 0). For ED (s, t), it is easy to conclude that there is a certain degree of incremental of ED (s, t):

①ED(s_i0)＝i,ED(t_j,0)＝j,1≤i≤m,1≤j≤n),

②ED(s_i,t_j)≥ED(s_i-1,t_j)+1,1<i≤m,1≤j≤n

③ED(s_i,t_j)≥ED(s_i,t_j-1)+1,1≤i≤m,1<j≤n

④ED(s_i,t_j)＝ED(s_i-1,t_j-1),1<i≤m,1<j≤n,s_i＝t_j

⑤ED(s_i,t_j)≥ED(s_i-1,t_j-1)+1,1<i≤m,1<j≤n,s_i≠t_j

definition 11 (string set similarity connection is similar to string): given two character string sets S and T and a character string editing distance threshold tau, the invention considers that the criterion for judging the similarity of the character S and the character string T is that the character string editing distance of the S and the T is not more than the set threshold tau, namely, the proposition is determinedED(s,t)<A conclusion s ≈ t can be derived, "≈" indicates string similarity. The similarity connection of the character string sets S and T can be understood as that one character string is taken from each of the two sets to form a matching pair of similar character strings. Noting that the result of the connection isIs shown as

For example, the character strings included in the two character string sets S and T are: s { "interlace", "ideology", "align", "expand", "primary", "association", T {, "immig range", "interactive", "interaction", "thermal", "deletion", "primary", "extension", "compensation", and if a string edit distance threshold τ { "1 is specified, S and T similarity are connectedAs a result of the welding If τ is 2, the result of the affinity join is

In the present invention, if the string edit distance threshold τ is not specifically described, τ is set to 1 as a default.

Definition 12 (string prefix): let s be s₁s₂s₃…s_nThen, thens₁s₂s₃...s_m(1≤m<n) is the character s in s_iThe prefix string before character c (c ∈ s) in string s is denoted as s, if string s is considered as a set of characters in the original order, s' is a subset of s^＜c。

For example, "pre", "prereq", and "prerequis" are all prefixes of s for a string s ═ prerequisite ".

Definition 13(Trie structure): the Trie structure is a tree-like data structure proposed by the present invention for second-level content string filtering. Wherein, the root of the tree is at the top layer, and the middle node and the leaf node are below and are in a tree structure from top to bottom. The specific meanings of each moiety in the Tie structure are shown below

The root node is root and represents an empty character string phi.

The intermediate node represents a prefix character string of a character string and at least comprises one character.

And the node only has one father node (the root node has no father node), and 0 or more child nodes.

And the path from the root to the node represents a substring or a complete character string formed by the characters from top to bottom. The path corresponds to its last node.

The leaf node represents a path from the root node root to the leaf node and corresponds to a complete character string.

Sixthly, all child nodes of a certain node in the tree have the same prefix character string.

Referring to fig. 7, the Trie structure of several strings is shown, which are "barren", "beam", "bean", "bear", "exend", "instance" and "thorough", respectively. The number marked next to each node represents its number ID, where the ID of the root node root is 0. The number ID of the middle node represents the character substring corresponding to the node, and the number ID of the leaf node represents the complete character string corresponding to the node. The intermediate node numbered 15 in fig. 7 represents the substring "exde", while the leaf node numbered 6 is denoted "barren". Conversely, a character sub-string or complete character string can uniquely identify its ID, for example, the sub-string "bea" of "bean" corresponds to node a with ID 8, and the character string "instant" uniquely identifies node t with ID 21. I.e. the nodes represented by the IDs correspond to character sub-strings or complete character strings. Specifying the structure of the string set S, TrieIs represented by H_x(x ∈ S) by | H'_xI represents the depth of a node x in the Trie structure, | H_xL represents the maximum depth of the Trie structure, and defines a substring or a complete character string represented by a path from a root node root to an intermediate node or a leaf node behind a node x as h^x＜。

Wherein S ═ { S ═ S₁,s₂,s₃,…s_kThe expression string set S is composed of k character strings, the length of each character string, namely the number of contained characters, is expressed as | S_iI (1 ≦ i ≦ k), S ≦ represents the number of character strings included in the character string set S, and S ≦ k, forAll can be in Trie structure H_xCorresponds to a node h_xThe Trie structure is formed by the nodes according to a tree structure and has H_x＝{h_xI x ∈ S }, letIs the maximum length of each character string in S, having At this time

Let | x | denote as a character string in which a character corresponding to the x node is locatedThe length of a character string consisting of all characters from the first character to the character x according to the original sequence, or the path length from the root node to the x. With | x | ═ J^xIf x is the last character in J, then | x | ═ J |.

For example, in the Trie shown in fig. 7, the longest path is 0(root) → 22(t) → 23(H) → 24(o) → 25(r) → 26(o) → 27(u) → 28(g) → 29(H), and the length of the path is 8, so | H → 22(t) → 23(H), and therefore | H → 27(u) → 28(g) → 29(H)_xAnd 8. If x represents a node numbered 9, | H'_9(m)|＝|beam|＝4。

In order to distinguish the nodes of the Trie tree structure from the represented character strings, the representation of the nodes is defined as follows: if the node number ID is known, if the node is represented by the character c, representing a character string w composed from the root node to the character c, it is represented as ID (c). In other cases where confusion is not likely to arise, node c is used to represent it.

Definition 14 (edit distance matrix): given two strings s and t, the value s ═ s, respectively₁s₂s₃…s_m,t＝t₁t₂t₃…t_nA data structure is constructed to store values of string edit distances between all the substrings of s and all the substrings of t, i.e., an edit distance matrix M_m+1,n+1,For any character string r ═ r₁r₂r₃…r_nIts substring r_sub＝{φ∪r₁r₂...r_lL is more than or equal to 1 and less than or equal to n, and the total number is 1+ n.

Edit distance matrix M_m+1,n+1The dimensions of (a) respectively correspond to the total number of sub-strings of the character strings s and t. The total number of substrings of s is 1+ M, the total number of substrings of t is 1+ n, M_m+1,n+1Is (1+ m) (1+ n), the total number of edit distances calculated for each substring of s and for each substring of t, and each element of the matrix is ED(s)_sub,t_sub). FIG. 8 is a matrix of edit distances between the character string "immatural" and the character string "latitude".

The calculation of each element of the edit distance matrix can be derived from the associated formula defining the edit distance in 14, see formula:

wherein ξ is represented as s_iCharacter string composed of single character and t_jThe string edit distance between strings composed of individual characters is naturally:

for example, in FIG. 8, M [0, k ]]＝M[k,0]＝k,(0≤k≤8),M[2,3]＝3,M[1,3]＝3,M[2,2]＝2,M[1,2]2 in the illustrated edit distance matrix, s₂＝m,t₃T due to s₂≠t₃According to the calculation formula of ξ, ξ is 1, and M [2,3 ] is satisfied]＝min(M[1,3]+1,M[2,2]+1,M[1,2]+1). For s_i＝t_jIn the case of (1), M6, 5 is satisfied]＝min(M[5,5]+1,M[6,4]+1,M[5,4]) And ξ is 0.

Definition 15 (matrix approximation and Trie tree active nodes): according to the definitions 13 and 14, the matrix approximation term is defined as M [ k, r ] in the present invention, and M [ k, r ] is equal to or less than τ, where k is equal to or less than 0 and is equal to or less than M +1, and r is equal to or less than 0 and is equal to or less than n +1, as can be seen from the definition 14 of the string edit distance, due to the incremental character edit distance, M [ k '> k, r ] ≧ τ, and M [ k, r' > r ] > τ exist, and the active node set of all nodes of the Trie tree of FIG. 11 is shown in Table 3.

Table 3 active node set for all nodes of the Trie of fig. 11

The shaded portion shown in fig. 8 indicates the matrix approximation term when τ is 1, and the solid circled portion indicates the matrix approximation term when τ is 3. It is readily seen that the number of matrix approximation entries grows with the increase in τ. That is, the wider the range of the determination approximation, the easier the value of the approximation result is taken.

Since the strings s and t can be expressed as both the edit distance matrix and the Trie structure, a string set is givenTrie tree H corresponding to character string set S_x(x ∈ S) and a character string t, wherein the Trie tree path corresponding to the character string S is h_υ(. nu ∈ s), forIf ED (upsilon, t) is less than or equal to tau, a node corresponding to upsilon is called as a Trie tree active node of the character string t. All active nodes of Trie tree of character string t form active node set of Trie tree, and is marked as Q_τ(t), it can be concluded that if ED (t, e) ≦ τ, then e ∈ Q may be inferred_τ(t) of (d). The direction of the arrows marked in fig. 11 shows the sequence of solving the active node sets of the Trie in the general case, i.e., preferentially solving the active node sets of the parent nodes and then solving the active node sets of the child nodes or the sibling nodes. Wherein, the set marked by the root node is an active node set with root being tau-1, Q₁(root) {0,1,9,13 }. Active node sets of nodes with the Trie tree number range of 0-17 under two conditions of tau being 1 and tau being 2 are shown in a table 3.

Given a node n and a string s, t, defining s as follows:

if it is notThen h is^n＜S is not always true. Wherein s is^＜cExpressed as c character before in the string sPrefix string (by definition 13). h is^n＜Expressed as a substring or string represented by a node in the Trie in which the string t is located after the node n (see definition 13), this conclusion indicates that if the node n is not an active node of the Trie of any prefix string of the string s, the substring or string represented by the node in the Trie after the node n is unlikely to be similar to s.

The active nodes of the Trie tree have symmetry, i.e. for any two nodes u and v of the Trie tree structure, if u ∈ Q_τ(v) Then there must be v ∈ Q_τ(u). For example, active node set Q of node 4(b) in FIG. 11₁(4(b)) {3,4,5}, and an active node set Q for node 5(i)₁(5(i)) {4,5}, and an active node set Q of node 3(m)₁(3(m)) {2,3,4,8,11 }. since 3 ∈ Q₁(4(b)),5∈Q₁(4(b)), so there is 4 ∈ Q₁(3(m)),4∈Q₁(5(i))。

Definition 16 (node belongs to string set): the definition that the node n belongs to the character string set is described as follows:

given a set of strings S (S)₁,s₂,s₃,..s_m) Trie Tree H_x(x ∈ S ∪ delta) and a node n, delta represents other character string sets corresponding to the node set in the Trie tree, if a character string S exists, S ∈ S is satisfied, and a substring S (n) corresponding to the node n is a prefix character string of the character string S, then n ∈ S can be obtained.

The active node set plays a very crucial role in solving similar string pairs. To find the active node sets of all nodes of the Trie, the active node sets of root nodes root of the Trie may be first computed, and then the active node sets of the children nodes of the root nodes may be computed. If traversing all nodes to find the corresponding active node set is very resource-consuming, some rules should be explored to improve the efficiency of solving the active node sets of all nodes in the Trie. Therefore, in the invention, several directions for improving the solving efficiency are given firstly, then corresponding reasoning is proposed, and finally, examples are given and the reasoning is actually applied to the calculation process of the active node set.

Several ways to improve the computational efficiency are as follows:

the structural characteristics of the Trie tree are fully utilized. The Trie tree is to arrange all characters from top to bottom according to the original sequence of all characters of the character string, and the path from the root node root to each leaf node is unique and represents a complete character string. A group of strings with the same prefix shares nodes in a Trie. The active node set of the intermediate nodes or leaf nodes in the tree may be computed using the active node set of the parent node. In this way, the number of computations for certain common substrings is reduced by using a common prefix.

From the definition 15 of the matrix approximation item and the active nodes of the Trie, it can be seen that the active nodes of the Trie for solving the character string t only use the Trie to map the nodes of each character of the character string s, and do not perform the transformation of the Trie on the character string t. For the character string sets S and T, for any character string therein, the advantage of sharing the prefix should be fully utilized, and the calculation amount of the active node is relatively small. Therefore, a Trie structure should be constructed separately for both string sets S and T. Fig. 9 shows a Trie established based on S and T, and all the strings are represented as paths of the Trie, so that the characteristics of the dual-set expressed in the Trie can be fully utilized to improve the calculation efficiency of the active node set.

And thirdly, discovering the mode characteristics of a node set irrelevant to solving the active node set of a certain node in the Trie tree structure, and cutting a node area set conforming to certain characteristics by utilizing the mode characteristics in the process of calculating the active node set of each node from top to bottom, thereby reducing the number of nodes of the Trie tree and reducing the calculation of the unorthodous nodes.

And fourthly, solving the active node set of the child node according to the obtained active node set of the father node, deleting the active node sets of some unnecessary nodes in a certain program for calculation, and reducing the calculation times, but if the active node sets known for all the nodes spend resources such as memory and the like for storage, a large amount of memory space is occupied, and the method is not suitable for the condition that the character string is long or the Trie tree is large in scale. Therefore, the characteristic that the calculation of some node active node sets only depends on part of nodes is utilized to timely release part of node sets which are invalid for solving the follow-up nodes, the use amount of system resources is greatly reduced, and the practicability of the algorithm is improved.

Some reasoning is specifically proposed below for the above proposed method for improving the solution efficiency of the active node set.

Introduction 1: given Trie Tree H_xOf a node l and its parent p, and a string edit distance threshold τ, for any node l' ∈ Q_τ(l) There must be one node p ' which is the ancestor of l ' and satisfies p ' ∈ Q_τ(p)。

For example, in fig. 11, it is assumed that the character string corresponding to the node 14(o) is "so", and the character string corresponding to the node 15(a) is "wa", and w is a child node of e. To calculate Q_τ(w), it is only necessary to judge an arbitrary node t ∈ Q_τ(e) And whether all child nodes of t are active nodes of the node w or not is sufficient. As can be seen from table 3, when τ is 1, Q is present_τ(e) Q is calculated as {13,14,15}_τ(w) whenever it is checked whether the nodes 13,14,15 and all their children belong to Q_τ(w), the node set passing the final check is Q_τ(w) has Q_τ(w)＝{14,15,16,17}。

2, leading: given Trie tree structure H_xAnd two of the nodes u and v, and a string edit distance threshold τ, if for any ancestor node v' of v, there isAnd for any ancestor node u' of u, all satisfyThe substring or character string s (u) represented by the node after the node u and the substring represented by the node after the node v in the TrieOr string t (v) cannot be similar, i.e., s (v) t (v) does not hold.

For example: in fig. 11, looking at the node 1(b), the node 6(e), the node 13(s) and the node 14(o), the character strings represented by the four nodes are "b", "be", "s" and "so", respectively, since the four nodes are represented by the characters "b", "be", "s" and "so" respectivelyAnd is also provided withAll subsequent nodes with node 6(e) and node 14(o) may not represent similar strings, i.e., "bed" and "soak", "bem" and "soak", "bed" and "soap", and "bem" and "soap" may not be similar.

And 3, introduction: given a string s and t, and a string edit distance threshold τ, if | | s | - | t | > τ, then there are: ED (s, t) > τ. Where | s | and | t | represent the number of characters contained in the character strings s and t, respectively.

In part (a) of fig. 10, a character string length section is marked next to the node s and the node tWherein,indicating the shortest path length in the sub Trie tree structure with node e as the "root node",the longest path length. In the figure [1,4 ]]The length of the character string representing the subsequent node representation of the node s is in the range of 1-4, and [7, 9%]The length of the character string represented by the nodes subsequent to the node s is in the range of 7-9. Due to the fact thatThat is, the length difference between the two groups of character strings is greater than the character string editing distance threshold, so that Q can be solved_τIn the process of (t), all subsequent nodes of the node s may be ignored. In a sub-Trie structure with the root node root of the actual Trie tree as the root node and the s and t nodes as the leaf nodes,there are any subsequent nodes μ to the node, there are

For example, in calculating the active node set of node 1(b) of FIG. 11; using the lemma to delete a portion and calculate Q_τ(1(b)) an unrelated node. The character string length intervals of the nodes 1(b), 12(e) and 22(t) are [4,6 ] respectively],[6,6]And [8,8 ]]Since the difference between the string lengths of node 12(e) and node 1(b) is [0,2 ]]And node 22(t) differs from node 1(b) by [2,4 ]]When τ is 1, the node 22(t) may be deleted, but the node 12(e) may not be deleted.

And (4) introduction: given Trie tree structure H_xAnd two of the nodes u and v, and a string edit distance threshold τ, if one of the following conditions exists:

(2) there is only one string prefixing v and u simultaneously,

then Q is calculated_τ(u) all successor nodes with v as parent or ancestor are not considered.

Fig. 10 shows two embodiments of the conditions in part (b) and part (c). (b) In the case that the nodes t and s have the same leaf node and t is the parent node of s, Q is calculated_τ(s) subsequent nodes to node t will not be considered. (c) Since there is only one character string"sto" has both a node s (representing the character string "s") and a node t (representing the character string "st") as prefix strings, and thus calculates Q_τ(s) node o is not considered.

Lemma 3 and lemma 4 can be combined as an important method to improve the computational efficiency of the active node set. Because these two arguments can reduce the amount of computation of the active node set by deleting irrelevant nodes. For example, solving for the active node set Q of node 4(r) in FIG. 7_τ(4(r)), Q can be obtained by theorem 3_τ(4(r)) {3,4,5}, and Q was obtained using theorem 4_τ(4(r)) - {4,5}, and the final result is Q_τ(4(r))＝{3,4,5}。

For the comparison of the similarity of character strings between two character string sets, the influence of the relationship of the character string sets on the algorithm process should be considered, and the method should be applicable to different types of character string sets. The invention firstly creates a Trie tree structure based on the shared prefix of the character string and the structural characteristics among the character strings. Preprocessing is performed before similarity calculation, namely, in order to avoid repeated calculation, a node area which is not related to the calculation of the active node set is deleted by using the interaction between the nodes of the Trie tree and the lemma related to the active node set. In order to improve the efficiency of the algorithm, the characteristics of the Trie tree structure are fully utilized, the active node set of the child node is solved through the known active node set of the father node, and meanwhile, the unknown node of the active node set is subjected to dynamic incremental access, so that on one hand, each node is not required to be accessed and calculated, and on the other hand, the whole Trie tree structure is prevented from being traversed for solving the active node set of a certain node. In this way, the computational efficiency can be significantly increased in a manner that reduces the amount of access. Meanwhile, in order to avoid the algorithm from occupying a large amount of memory resources, the invention stores the accessed nodes which are only effective to solve the active node set of the nodes to be accessed instead of storing all the accessed node information by introducing the data structure of the access node stack. By improving the calculation efficiency and limiting the use of algorithm system resources, the algorithm provided by the invention has good practicability and has advantages compared with the existing method.

Based on the above-mentioned text description of the overall process of the algorithm, the following is further illustrated with specific algorithm flows and accompanying drawings.

Fig. 16 is an overall flow chart of the two-stage filtering based on structure and content, wherein the right half part illustrates an algorithm process of the second-stage content filtering, and the specific implementation includes the following steps:

step 1: extracting a character string set U (U) from the six-element group vector generated by the first-stage structural similarity filtering₁,u₂,u₃,…u_k) And V (V)₁,v₂,v₃,…v_l) There is a possibility that both sets of character strings are identical. Each character string is the result of the concatenation of the title (title), the KeyWord (KeyWord) and the content (content) in each tuple vector. And determining the character string editing distance threshold tau according to the practical application situation.

Step 2: sorting the character strings in the character string sets U and V according to a dictionary to form an ordered set, and establishing a Trie tree H based on the ordered set_u∪vThe Trie tree is composed of nodes corresponding to each character in the U and the V, and the nodes of the Trie tree store node numbers, mapping relations between the nodes and the characters, character string sets to which the nodes belong and structural relations between the nodes.

And step 3: an "accessed node stack" data structure ST is created for storing the Trie nodes that have been accessed. The stack is initialized to have only the root node and its corresponding set of active nodes.

For example, a length | H is created from the maximum depth of the Trie tree_U∪V' visited node of |Each element in the stack "ST, ST includes the number of the node and the active node set of the node.

And 4, step 4: and establishing a dynamic Trie tree H' only with root nodes and roots, solving an active node set of the root nodes and roots, and determining the next node for solving the active node set, which is called a destination node. Comprising the following substeps.

Substep 1. solving for an active node set Q of a root node based on the incremental ① in the definition 10 of string edit distance_τ(root), the root node and corresponding set of active nodes are pushed onto the stack ST.

Substep 2: and extracting a node set N only belonging to the character string set U from the active node set of the root node, and selecting the node N' closest to the root node.

And 5: creating a virtual Trie-Tree H 'for the next destination node'_u∪v. Comprises the following steps.

Substep 1: exploring in Trie tree H based on theory 3 and theory 4_u∪vThe node area of (a) that is not relevant to solving the active node set of node n',

substep 2: gray marking is carried out on the node area to form a tree H based on the original Trie_u∪vIs dedicated to a virtual Trie-Tree H 'of n'_u∪v。

Step 6: the active node set of the destination node is found from the active node set of the uppermost element of the stack ST based on lemma 1.

Substep 1: if the destination node belongs to the two sets, the destination node is set to belong to one set and is in a virtual Trie tree H'_u∪vExcluding nodes belonging only to the set; if the destination node only belongs to a certain set, the same is carried out in the virtual Trie tree H'_u∪vExcluding nodes that belong only to the set.

Substep 2: h'_u∪vIn the method, the activity of the destination node is obtained according to the active node set of the parent node at the uppermost layer on the stack STNode set

Substep 3: the destination node is pushed onto the stack ST. The uppermost layer element is the number of the destination node and its corresponding active node set.

And 7: and updating the active node set of the elements within the distance tau from the uppermost element of the stack in the stack ST based on the description of the symmetry of the active nodes in the Trie tree active node definition.

And 8: judging whether the node p currently processed is a certain leaf node of the Trie tree structure, if so, determining whether the node p currently processed is a certain leaf node of the Trie tree structureOutputting similar string pairings<o,p>And pops up the uppermost node in the stack ST.

And step 9: in Trie Tree H'_u∪vAnd (4) taking out the newest destination node of the child node or the brother node of the destination node belonging to the same character set, and repeating the steps from 5 to 8 until the stack ST is empty.

Fig. 12 is an example of the implementation of the above algorithm flow, and the following first analyzes the influence of the relationship between the character string sets U and V on the algorithm, and then gives a specific algorithm execution example in conjunction with fig. 12:

let U and V be two sets of strings, where U ═ U₁,u₂,u₃,…u_m},V＝{v₁,v₂,v₃,…v_n}. Tau is string edit distance threshold, R is output similar string pairing set, H_U∪VRepresenting the Trie structure built on the character string sets U and V, and root representing the root node of the Trie structure.

The relationship of U and V affects the execution process and the result of the algorithm. If U and V are the same set, i.e., U { < U, w > | U, w ∈ U ^ ED (U, w) ≦ τ }. It can be understood that if a similar character string pairing set is solved in the same set, the result set is composed of tuples formed by two similar character strings in the same character string set. Otherwise, R { < U, V > | U ∈ U, V ∈ V, and ED (U, V) ≦ τ }. The specific implementation of the algorithm is as follows:

(1) the number of a node o is ID (o), ST represents an accessed node stack, and the length | ST | ≦ H of the node o is set in the Trie tree_U∪VEach element storing the number ID of the accessed Trie-tree node o and the corresponding extended active node set Q_τ' (o) is represented by<ID(o),Q_τ'(o)>ST is initialized to<ID(root),Q_τ'(root)>Is initialized to Q_τ' (root) ← phi, wait for step (4) to pair Q_τ' (root) update.

(2) Each node o in the Trie may be denoted as o (id (o), char (o), charset, string fo, bel). Wherein char (o) represents the character corresponding to the node o, charset represents the character string set to which the node o belongs, strucfo represents the structural relationship, such as father, son or brother, between the node o and other adjacent nodes, stored by the node o, bel represents the relationship between the node o and the character sets U and V,dictionary ordering for arbitrary strings s ∈ U ∪ V generates an ordered set U Θ V ═ r | r₁＞r₂＞...＞r_m+nR ∈ U ∪ V ">" indicates that the character string preceding the symbol precedes the character string following the symbol in the lexicographic ordering.

(3) Building Trie tree H based on ordered set U theta V_U∪VFor nodes belonging to different sets, bel is initialized to different values respectively.

(4) Establishing a dynamic Trie tree H 'based on a root node root, initializing as H' ← root, and traversing H_U∪VSolving for Q_τ(root) ← { q | q ∈ U ∪ V ^ | q | ≦ τ }, and element ^ q | ≦ τ }, and its formula<ID(root),Q_τ'(root)>The stack ST is pushed, marked as visited node.

Wherein, for node r, Q'_τ(r) Trie Tree active node set Q for r_τ(r) extended. Q'_τ(r) at Q_τ(r) including the relation of the initial character string setIs the influencing factor of the algorithm process. Solving for Q_τThe method of' (r) is as follows:

(5) defining dn as destination node, and dn 'as next destination node, initializing as dn ← root.1stchild, dn' ← dn.1stchild, satisfyingWherein C is_S ^PRepresents the complement of P in corpus S, S ← { U, V, U ∪ V }.

(6) If stack ST is empty, go to (15), otherwise go to (7).

(7) If dn is not null and dn ∈ U, go (8), otherwise go to (13).

(8) Let element tp be the uppermost element of stack ST, tp (ID (tp), Q_τ' (tp)) ═ st.tope (); clipping the nodes conforming to the lemma to form a virtual Trie tree H'_U∪V,H'_U∪V←H_U∪VDs, ds is the set of nodal regions that conform to the lemma.

(9) Active node set (dn, Q) by element tp_τ' (tp)) solving for Q_τ' (dn) if dn is a leaf node, go to (10), otherwise go to (12).

(10) Let s (node) be a substring or complete character string represented by a node, and for any Trie tree node l ∈ Q_τ' (dn) ^ l ≠ n ^ l.child ← phi, has s (dn) s (l), combines the output result into a result set R, calculates R ∪ { (dn, l) };

(11) in stack ST, let | t | be the position of the stack element corresponding to t, and the node corresponding to the uppermost element of stack ST be dn, with | dn |>If the node t meets the condition that the value of | dn | - | t | ≦ tau, dn is added to Q_τ' (t) calculating Q_τ'(t)∪＝{dn}。

(12) Push(s) to push the destination node dn onto the stack ST<dn,Q_τ'(dn)>) Solving for the next destination node dn',

(13) if dn is not null, derive dn' ← dn. next finding.dn. next finding that represents the next sibling of node dn, go (7); otherwise, go to (14).

(14) Post ST pops up the top element, tp (ID (tp), Q_τ'(tp)) ═ st.pop (), and define the sibling node of the node corresponding to the element as the object to be accessed next, dn' ← tp.nextsabling.

(15) And outputting a final result R, observing an actual matching program of the character string similarity pair in the R, and ending the algorithm.

The experimental data of the invention come from the Yahoo official website and the webpage data in the China Daily, crawl in the set theme through tools such as web crawlers, and form the Yahoo data sets and the China Daily data sets with different character string lengths through processing. Fig. 13,14 and 15 are comparative graphs of new and old methods plotted based on experimental result data.

Referring to fig. 13, the new method based on the two-stage filtering is obviously superior to the conventional method in comparison of the operation efficiency of the same data set. In fig. 13, it can be seen that the time for the operation of the conventional method and the method proposed in the present invention is first decreased, then slowly increased, and finally gradually decreased with the increase of the average string length in the string set. In the case of short average string length, the old method takes more runtime than the new method because the new method has one more layer of structural similarity filtering. The advantages of the new method are shown as the average string length increases. When the string edit distance threshold τ is 10, the drop is greater than τ 20, because the smaller the string edit distance threshold τ is, the more strict the definition of the similarity is, the higher the hierarchy of deletion of the nodes that do not satisfy the definition of the similarity is, and the less the nodes that need to be traversed is, the shorter the running time is. It can also be seen from the figure that the new method is more efficient than the old method both when τ is 10 and τ is 20.

Referring to fig. 14, the new method based on the two-stage filtering is obviously superior to the conventional method in comparison of the operating efficiency of different data sets. Fig. 14 is a graph showing a comparison of results plotted in the case where the method of the present invention and the conventional method are performed in the case where the average string lengths are different in two different data sets, respectively, the Yahoo data set and the China daisy data set, which are divided into 4 groups, and the average string lengths are 300, 500, 700, and 900, respectively. It can be found that the same regularity is present in both data sets, presenting a similar trend to that of fig. 14.

As shown in fig. 15, the new method based on the two-stage filtering proposed by the present invention is significantly superior to the conventional method in terms of accuracy. The three fold lines in fig. 15 are: the new method provided by the invention is a traditional similarity detection method and the relation between the number of the similar character string pairs and the size of the character string set under the actual condition. It can be seen from the figure that the method of secondary filtering proposed in the present invention has a smaller gap from the actual situation than the conventional method, and therefore, the new method has higher accuracy.

It should be understood that parts of the present invention not specifically set forth are within the prior art.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A Web data similarity detection method based on structure and content secondary filtering is characterized in that: on the basis of a traditional general similarity detection method, the characteristics of Web data structure and content distribution are discovered, and two-stage filtering is carried out on a detected document set; the first stage of filtering in the two-stage filtering is structural similarity filtering, each Web document is modeled into a Tag tree structure, so that document sets which are dissimilar in structure are removed, the key contents of the rest documents are extracted and expressed into a tuple vector form, and the key contents are connected to generate a character string set; the second filtering in the two-stage filtering is to perform Trie tree structure modeling on the character string set generated after the first filtering, and connect similar character strings to obtain a final result;

the first-stage filtering specifically comprises the following steps:

the specific implementation of the step a1 includes the following sub-steps:

step A1.2: newly building a MapReduce task, wherein the input key of the Map task is a file name, the value is tag.tree, the key of the reduce task is a file name, and the value is a binary node vector of a binary tree B correspondingly converted by the tag.tree, and inputting the result of the reduce into the HDFS;

step A2: constructing binary branch vectors BBV of all tag.tree in parallel by combining MapReduce; the specific implementation of the step a2 includes the following sub-steps:

step A2.3: the task writes the BBV of each tag.tree into an HDFS;

the specific implementation of the step a3 includes the following sub-steps:

Step a3.3, initialize the L-dimensional perturbation vector Δ, where △ ═ △₁,…,△_L)，g_i(m)+△₁Is to interfere with vector △₁Is applied to g_i(m) obtaining a first-order pointer to the hash bucket, the first-order pointer pointing to another hash bucket in the hash table, each perturbation vector being mapped to a unique set of hash values, and the pointer pointing to a hash bucket at most once;

step A3.5: according to the pointer construction condition of all tag.tree, a hash bucket within the second order gives a group number BandID; keys of the MapReduce task are a group number BandID and a barrel number BucketID, and values are corresponding Web file names;

Step A5: obtaining a set of Web documents S₁Based on a content six-tuple vector set W extracted by a Tag tree, and based on the W, a character string connection algorithm is used for obtaining a similarity result set S of a query q₂；

The specific implementation of the step a5 includes the following sub-steps:

step A5.1: based on step A1, a Web document set S is acquired₁Tree set;

step A5.3: according to the father node label, after judging the leaf node type, adding a six-element vector of X, wherein the six-element vector consists of the number (id) of X, a title (title), content (content), a KeyWord (KeyWord), a link (URL) and release time (PageTime);

the second-stage filtration specifically comprises the following steps:

the specific implementation of the step a4 includes the following sub-steps:

computing M hash functions of query q, let g (q) be the hash value of query q in the ith hash table, i.e., where g (q) ═ h₁(q),…,h_M(q))，g_i(q)+△₁Is to interfere with vector △₁Is applied to g_i(q) obtaining all hash buckets within two orders from the given point q according to the index s by the first-order pointer of the hash bucket, namely, the hash buckets are the top-k adjacent points of the query q;

Step B9: in Trie Tree H'_u∪vAnd taking out the newest destination node of the child node or the brother node of the destination node belonging to the same character set, and repeating the steps B5 to B8 until the stack ST is empty.

2. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 1, wherein the specific implementation of step B4 includes the following sub-steps:

step B4.1: active node set Q for solving root node based on character string editing distance_τ(root) ofThe root node and the corresponding active node set are pushed onto the stack ST;