KR20100022565A

KR20100022565A - Method for searching an url using hash tree

Info

Publication number: KR20100022565A
Application number: KR1020080081135A
Authority: KR
Inventors: 김형식; 유진형
Original assignee: 충남대학교산학협력단
Priority date: 2008-08-20
Filing date: 2008-08-20
Publication date: 2010-03-03
Also published as: KR100999408B1

Abstract

PURPOSE: A method for searching a URL using a hash tree is provided to quickly and effectively search a search target URL using a hash tree. CONSTITUTION: A search object URL(Uniform Resource Locator) is inputted. The search object URL is divided into a host name and a route. A hash value tuple is generated from the host name using a hash function. A list is found from the hash tree using the hash value tuple. Information related to host name and route saved in the list are compared.

Description

Method for searching an usingＲＬ using hash tree}

The present invention relates to a website content filtering system, and more particularly, to generate a hashlist for a URL list and to utilize the same to effectively search (filter) whether a URL input by a user is included in a predefined URL list. It is about a method.

Recently, with the popularization of the web and the Internet, its dysfunction has become a problem, and the necessity of a website filtering system has emerged, and various methods for efficiently filtering website contents have been used. In addition, web service objects are represented as uniform resource locators (URLs) in accordance with the trend of serving various applications on the web. Thus, a method of filtering website content by comparing a user's request URL with a predefined list of URLs (URL list) Based filtering method) is widely used.

Meanwhile, the size of predefined URL lists is growing rapidly. This increase in the size of the URL list causes a significant decrease in the search speed, which leads to an increase in response time and a decrease in network service quality.

It is important to define the inclusion relationship between URLs in the URL retrieval method. In the present invention, the URL is divided into a hostname and a path, and the path is again defined as a directory and a file. In this case, when the host names of URLs A and B are the same and the path of A includes B in the file hierarchy, 'URL A includes URL B'. For example, if URL A is "www.hostname / directory1 /" and URL B is "www.hostname / directory1 / file2", then URL A contains URL B.

A single hash table method is known as a conventional method of retrieving such an inclusion relationship. According to the single hash table method, a list of URLs is defined, a hash table having a proper size is created according to the number of URLs present in the URL list, and the hash table positions of the respective URLs are determined through a hash function to store the URLs. This single hash table method has a problem in that the search time is long depending on the following two attributes.

In the single hash table method, the path of the search target URL is separated and added to the host name one by one in order to find the inclusion relationship of the URL. That is, connect the separated paths one by one with the host name and compare them with reference to the hash table. For example, for the URL list and single hash table as shown in Fig. 1, assume that the search target URL is "www.hostname.com/dir2/a.html". According to this method, "www.hostname.com/" is first obtained through an hash function and compared with an index. However, since the URL is not found in the hash table, it searches for "www.hostname.com/dir2/" with "dir2 /" in the hash table and succeeds. If the search fails again, the search continues with "a.html" appended. Therefore, if the search fails or the path of the search target URL is composed of a large number of directories, the search time of the single hash table method is long because each directory needs to be compared with the host name. That is, according to the conventional URL search method, in order to search the search target URL, the directories and files must be connected to the host name one by one and the search using the hash function is repeated, and all directories and files must be added. do. Therefore, if the search target URL cannot be searched, the number of comparisons is inevitably increased by the number of directories. In general, the prior art is an inefficient search method because the rate of failure to search in a URL search application is much larger than the rate of success.

The second problem is index conflict. According to the hash method, there is a possibility of index conflict. When constructing a hash table, if a hash index conflict occurs, a single hash table uses open linear addressing, which stores it at the next slot location. Therefore, when searching for a target URL in a hash table, it must be found by comparing it with the next index until not only the index but also the URL stored in the hash table, so the search time when the hash table is not large enough or there are many collisions of hash values There is a problem with this lengthening. If you make the index bigger to reduce the conflict, you need to increase the size of the hash table and free up unused memory space. This problem occurs because the memory size of the hash table is determined by the size of the index regardless of the number of entries stored in a single hash table structure.

An object of the present invention is to solve the above-mentioned disadvantages of the prior art, and to provide a method for generating a hashtree from a predefined URL list and quickly and effectively searching for a search target URL using the same.

In the search method using a hash table, the larger the range of hash values, the higher the probability of avoiding collisions, the shorter the search time, but the limited memory space cannot increase the range of hash values. In addition, although the hash value generated by the ideal hash function is distributed evenly, the hash value generated by the actual hash function is concentrated, and even if the total number of entries in the hash table is larger than the inserted entry, collision may still occur. The present invention devises a method of reducing the memory usage of entries that have a wide range of hash values but do not use them. You can effectively use memory by hierarchically using a relatively small hash table rather than a single large hash table structure and allocating only the hash tables that are needed each time an entry is inserted. For example, if a 24-bit hash consists of a single hash table, all ²⁴ entries must be allocated from the beginning. However, if you configure a three-stage hash and split a 24-bit hash into three 8-bit hashes, you can reduce the amount of allocated memory space while allowing 2 ²⁴ entries, just like a single hash table. Therefore, even if the same size memory is used, the range of hash value can be increased.

The present invention utilizes this principle.

In order to achieve the above object, the present invention provides a process for gradually generating a hashsheet from a predefined URL list ( Hattry generation process, A ), and quickly and effectively searching for a search target URL using the hattry. It relates to a URL search method consisting of a process ( URL search process, B ).

In the present invention, the hatstry generating process ( A ) comprises: (a) extracting one URL to be inserted into the hatstry from the predefined URL list; (b) separating the extracted URL into a host name and a path; (c) generating a hash value tuple (I ₀ , I ₁ ,..., I _n-1 ) from the host name using the hash function; (d) The hash value tuple (I ₀ , I ₁ ,..., I _n-2 ) is used as an index of each step of the hash _matrix to find the corresponding end node, and if the end node does not exist, the internal node and the end node are used. Generating; (e) storing the information about the host name and the information about the path as a list at the index I _n-1 of the end node; and (f) completing the hash creation process after performing steps (a) to (e) for all URLs in the URL list.

In the present invention, a search process ( B ) of searching for a search target URL using a hashtree includes: (a) inputting a search target URL; (b) separating the search target URL into a host name and a path; (c) generating a hash value tuple (J ₀ , J ₁ , ..., J _n-1 ) from the host name using the same hash function as the hash function used in the hash creation process; (d) finding a corresponding list in the hash using a hash value tuple (J ₀ , J ₁ ,..., J _n-1 ) as each step index; and (e) comparing the host name stored in the list with information related to the path information.

As mentioned above, the search method using the URL hashsheet checks the existence of the corresponding list using the hash node and the index, and searches again using the list only if the list exists, so the given URL is not included in the list. If you can find out quickly.

In addition, according to the present invention, since nodes of an index that are not used in the hash table are not allocated, less memory is used as compared to a single hash table method having an index having the same size. The single hash table method has the disadvantage of allocating a hash table of an unused index by using a sufficiently large hash table to prevent index collisions.However, since a hash table only generates a hash table of an index that is used The advantage is that virtual index spaces can be obtained without using memory.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings. However, these are for illustrative purposes only and the present invention is not limited thereto.

2 is a diagram illustrating a general structure of a URL hashtree. A hash function is used to generate a tuple of hash values calculated from the host names of the URLs in the list and use them as indices (I ₀ ~ I _n-1 ) of the hashtree.

There are several ways to generate a tuple of hash values from a hostname. For example, multiple hash functions can be used to obtain multiple hash values, or multiple hash values can be obtained by taking part of a single hash value obtained by a single hash function. In a hash value tuple, the number of hash values is equal to the total number of steps in the hashtree, and the number of bits of the hash value is related to the hash table size of the step.

Internal nodes of the hashtree store memory pointers of lower-level hash tables, and end-point nodes of the hashtree store memory pointers of lists that store information about host names and path information.

If different URLs have the same hash value tuples, the list can be used so that multiple URLs can be stored on the same end node.

3 is a flowchart illustrating a process of generating a hashtree by receiving a URL. To create a URL hashlist, enter a list of URLs and gradually insert the URLs from the URL list into the hashlist using the following steps: (a) Extract the URL to insert and (b) With the hostname and path (C) Create a hash value tuple with a separate hostname using the hash function. (d) Hash value tuple is used as index of each step of hash matrix to find the corresponding end node and if there is no end node, internal node and end node are created. (e) The information on the host name and the path information are stored in a list at the index I _n-1 of the end node. If you want to save a directory, do not specify it as a directory, but add the directory name with '/'. The list can also be sorted by a specific value for quick retrieval later. (f) Repeat the above procedure for all URLs in the URL list to store all URLs in the hashtree. This process inserts all URLs in the URL list into the hashtree.

FIG. 4 is a diagram illustrating three URLs being stored at the front and inserting a fourth URL in the process of generating a URL hashtree using the URL list. Two URLs with the same host name were stored as lists in the same location in the hashtree, and the remaining URLs were stored in different locations in the hash tree.

5 is a flowchart illustrating a method of searching a search target URL in a hash store storing a list of URLs: (a) receiving a search target URL, (b) separating a host name and a path, and (c) a hash function. Create a hash value tuple from the hostname using. (d) Find the list using the generated hash value tuple as the index of the hash table. If the list of generated hash value tuples is not found, the search fails and terminates. (e) If you find a list, search the list and compare the stored hostname and path information to the end of the list. If you are sorting the list using specific information when creating the list, you do not have to compare to the end of the node. When comparing paths, the string comparison is used to find the inclusion relationship of the URL. The length of the string comparison is determined by the length of the path stored in the tree node. Therefore, if the path stored in the tree node is longer, you can move to the next node without comparing strings.

FIG. 6 is a diagram for explaining a method of searching for a search target URL by using a URL tree.

Inputs "www.hostname.com/dir2/file.html" and separates it into host name and path, and creates hash value tuple with separated host name by using hash function and generated hash value tuple (4, 2,. Find the list using .2) as an index. Since the list exists at the index, the host name and path related information are compared with the first node of the list. Since the hostname information is the same but the path is longer, it moves to the next node in the list. In this node, the search is successful because the hostname-related information is the same and the path stored in the node contains the path of the search target URL.

1 is a conceptual diagram illustrating an example of conversion from a URL list to a hash table according to the related art.

2 is a conceptual diagram showing the general structure of the URL hashtree of the present invention.

3 is a flowchart illustrating a process of converting a URL list into a hash list according to the present invention.

4 is a conceptual diagram illustrating an example of a process of converting a URL list into a URL hashery according to the present invention.

5 is a flowchart illustrating a process of searching for a search target URL in a URL hashtree.

6 is a conceptual diagram illustrating an example of a process of searching for a search target URL in a URL hashtree;

Claims

(A) (a) extracting one URL to be inserted into the hashlist from a predefined URL list;

(b) separating the extracted URL into a host name and a path;

(c) generating a hash value tuple (I ₀ , I ₁ ,..., I _n-1 ) from the host name using the hash function;

(d) The hash value tuple (I ₀ , I ₁ ,..., I _n-2 ) is used as an index of each step of the hash _matrix to find the corresponding end node, and if the end node does not exist, the internal node and the end node are used. Generating;

(e) storing information about a host name and information about a path in a list at an index I _n-1 of the end node;

(f) performing all steps of the URL list after completing steps (a) to (e) to complete the process of creating a hash recipe ;

(B) (a) inputting a search target URL;

(b) separating a search target URL from a host name;

(c) generating a hash value tuple J ₀ , J ₁ ,..., J _n-1 from the host name using the hash function;

(d) finding a list in the hash using the hash value tuples (J ₀ , J ₁ ,..., J _n-1 );

(e) comparing the information related to the host name and the path information stored in the list; URL search process comprising: consisting of a URL search method using a hash.

The method of claim 1,

In step (e) of the step (A), the URL retrieval method using a hash, characterized in that the URL is divided into the host name-related information and the path-related information and stored.

The method of claim 1,

In the step (e) of the step (A), when the information about the path is a directory when storing the URL in the list, the URL retrieval method using a hash, characterized in that the end of the information stored by adding '/'.

The method of claim 1,

In steps (c) and (d) of step (A), and steps (c) and (d) of step (B), a hash value tuple is generated with a host name using a hash function of URL, and each hash A method for searching URLs using hatstry, characterized in that the value is used as the index of each step of the URL data.

The method of claim 1,

In the process (B), the search target URL is classified by first searching by using a node and a hash value tuple of the hashtree, and searching by using a list. URL search method.