KR20100022565A - Method for searching an url using hash tree - Google Patents
Method for searching an url using hash tree Download PDFInfo
- Publication number
- KR20100022565A KR20100022565A KR1020080081135A KR20080081135A KR20100022565A KR 20100022565 A KR20100022565 A KR 20100022565A KR 1020080081135 A KR1020080081135 A KR 1020080081135A KR 20080081135 A KR20080081135 A KR 20080081135A KR 20100022565 A KR20100022565 A KR 20100022565A
- Authority
- KR
- South Korea
- Prior art keywords
- url
- hash
- list
- host name
- search
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Software Systems (AREA)
Abstract
Description
The present invention relates to a website content filtering system, and more particularly, to generate a hashlist for a URL list and to utilize the same to effectively search (filter) whether a URL input by a user is included in a predefined URL list. It is about a method.
Recently, with the popularization of the web and the Internet, its dysfunction has become a problem, and the necessity of a website filtering system has emerged, and various methods for efficiently filtering website contents have been used. In addition, web service objects are represented as uniform resource locators (URLs) in accordance with the trend of serving various applications on the web. Thus, a method of filtering website content by comparing a user's request URL with a predefined list of URLs (URL list) Based filtering method) is widely used.
Meanwhile, the size of predefined URL lists is growing rapidly. This increase in the size of the URL list causes a significant decrease in the search speed, which leads to an increase in response time and a decrease in network service quality.
It is important to define the inclusion relationship between URLs in the URL retrieval method. In the present invention, the URL is divided into a hostname and a path, and the path is again defined as a directory and a file. In this case, when the host names of URLs A and B are the same and the path of A includes B in the file hierarchy, 'URL A includes URL B'. For example, if URL A is "www.hostname / directory1 /" and URL B is "www.hostname / directory1 / file2", then URL A contains URL B.
A single hash table method is known as a conventional method of retrieving such an inclusion relationship. According to the single hash table method, a list of URLs is defined, a hash table having a proper size is created according to the number of URLs present in the URL list, and the hash table positions of the respective URLs are determined through a hash function to store the URLs. This single hash table method has a problem in that the search time is long depending on the following two attributes.
In the single hash table method, the path of the search target URL is separated and added to the host name one by one in order to find the inclusion relationship of the URL. That is, connect the separated paths one by one with the host name and compare them with reference to the hash table. For example, for the URL list and single hash table as shown in Fig. 1, assume that the search target URL is "www.hostname.com/dir2/a.html". According to this method, "www.hostname.com/" is first obtained through an hash function and compared with an index. However, since the URL is not found in the hash table, it searches for "www.hostname.com/dir2/" with "dir2 /" in the hash table and succeeds. If the search fails again, the search continues with "a.html" appended. Therefore, if the search fails or the path of the search target URL is composed of a large number of directories, the search time of the single hash table method is long because each directory needs to be compared with the host name. That is, according to the conventional URL search method, in order to search the search target URL, the directories and files must be connected to the host name one by one and the search using the hash function is repeated, and all directories and files must be added. do. Therefore, if the search target URL cannot be searched, the number of comparisons is inevitably increased by the number of directories. In general, the prior art is an inefficient search method because the rate of failure to search in a URL search application is much larger than the rate of success.
The second problem is index conflict. According to the hash method, there is a possibility of index conflict. When constructing a hash table, if a hash index conflict occurs, a single hash table uses open linear addressing, which stores it at the next slot location. Therefore, when searching for a target URL in a hash table, it must be found by comparing it with the next index until not only the index but also the URL stored in the hash table, so the search time when the hash table is not large enough or there are many collisions of hash values There is a problem with this lengthening. If you make the index bigger to reduce the conflict, you need to increase the size of the hash table and free up unused memory space. This problem occurs because the memory size of the hash table is determined by the size of the index regardless of the number of entries stored in a single hash table structure.
An object of the present invention is to solve the above-mentioned disadvantages of the prior art, and to provide a method for generating a hashtree from a predefined URL list and quickly and effectively searching for a search target URL using the same.
In the search method using a hash table, the larger the range of hash values, the higher the probability of avoiding collisions, the shorter the search time, but the limited memory space cannot increase the range of hash values. In addition, although the hash value generated by the ideal hash function is distributed evenly, the hash value generated by the actual hash function is concentrated, and even if the total number of entries in the hash table is larger than the inserted entry, collision may still occur. The present invention devises a method of reducing the memory usage of entries that have a wide range of hash values but do not use them. You can effectively use memory by hierarchically using a relatively small hash table rather than a single large hash table structure and allocating only the hash tables that are needed each time an entry is inserted. For example, if a 24-bit hash consists of a single hash table, all 24 entries must be allocated from the beginning. However, if you configure a three-stage hash and split a 24-bit hash into three 8-bit hashes, you can reduce the amount of allocated memory space while allowing 2 24 entries, just like a single hash table. Therefore, even if the same size memory is used, the range of hash value can be increased.
The present invention utilizes this principle.
In order to achieve the above object, the present invention provides a process for gradually generating a hashsheet from a predefined URL list ( Hattry generation process, A ), and quickly and effectively searching for a search target URL using the hattry. It relates to a URL search method consisting of a process ( URL search process, B ).
In the present invention, the hatstry generating process ( A ) comprises: (a) extracting one URL to be inserted into the hatstry from the predefined URL list; (b) separating the extracted URL into a host name and a path; (c) generating a hash value tuple (I 0 , I 1 ,..., I n-1 ) from the host name using the hash function; (d) The hash value tuple (I 0 , I 1 ,..., I n-2 ) is used as an index of each step of the hash matrix to find the corresponding end node, and if the end node does not exist, the internal node and the end node are used. Generating; (e) storing the information about the host name and the information about the path as a list at the index I n-1 of the end node; and (f) completing the hash creation process after performing steps (a) to (e) for all URLs in the URL list.
In the present invention, a search process ( B ) of searching for a search target URL using a hashtree includes: (a) inputting a search target URL; (b) separating the search target URL into a host name and a path; (c) generating a hash value tuple (J 0 , J 1 , ..., J n-1 ) from the host name using the same hash function as the hash function used in the hash creation process; (d) finding a corresponding list in the hash using a hash value tuple (J 0 , J 1 ,..., J n-1 ) as each step index; and (e) comparing the host name stored in the list with information related to the path information.
As mentioned above, the search method using the URL hashsheet checks the existence of the corresponding list using the hash node and the index, and searches again using the list only if the list exists, so the given URL is not included in the list. If you can find out quickly.
In addition, according to the present invention, since nodes of an index that are not used in the hash table are not allocated, less memory is used as compared to a single hash table method having an index having the same size. The single hash table method has the disadvantage of allocating a hash table of an unused index by using a sufficiently large hash table to prevent index collisions.However, since a hash table only generates a hash table of an index that is used The advantage is that virtual index spaces can be obtained without using memory.
Hereinafter, the present invention will be described in detail with reference to the accompanying drawings. However, these are for illustrative purposes only and the present invention is not limited thereto.
2 is a diagram illustrating a general structure of a URL hashtree. A hash function is used to generate a tuple of hash values calculated from the host names of the URLs in the list and use them as indices (I 0 ~ I n-1 ) of the hashtree.
There are several ways to generate a tuple of hash values from a hostname. For example, multiple hash functions can be used to obtain multiple hash values, or multiple hash values can be obtained by taking part of a single hash value obtained by a single hash function. In a hash value tuple, the number of hash values is equal to the total number of steps in the hashtree, and the number of bits of the hash value is related to the hash table size of the step.
Internal nodes of the hashtree store memory pointers of lower-level hash tables, and end-point nodes of the hashtree store memory pointers of lists that store information about host names and path information.
If different URLs have the same hash value tuples, the list can be used so that multiple URLs can be stored on the same end node.
3 is a flowchart illustrating a process of generating a hashtree by receiving a URL. To create a URL hashlist, enter a list of URLs and gradually insert the URLs from the URL list into the hashlist using the following steps: (a) Extract the URL to insert and (b) With the hostname and path (C) Create a hash value tuple with a separate hostname using the hash function. (d) Hash value tuple is used as index of each step of hash matrix to find the corresponding end node and if there is no end node, internal node and end node are created. (e) The information on the host name and the path information are stored in a list at the index I n-1 of the end node. If you want to save a directory, do not specify it as a directory, but add the directory name with '/'. The list can also be sorted by a specific value for quick retrieval later. (f) Repeat the above procedure for all URLs in the URL list to store all URLs in the hashtree. This process inserts all URLs in the URL list into the hashtree.
FIG. 4 is a diagram illustrating three URLs being stored at the front and inserting a fourth URL in the process of generating a URL hashtree using the URL list. Two URLs with the same host name were stored as lists in the same location in the hashtree, and the remaining URLs were stored in different locations in the hash tree.
5 is a flowchart illustrating a method of searching a search target URL in a hash store storing a list of URLs: (a) receiving a search target URL, (b) separating a host name and a path, and (c) a hash function. Create a hash value tuple from the hostname using. (d) Find the list using the generated hash value tuple as the index of the hash table. If the list of generated hash value tuples is not found, the search fails and terminates. (e) If you find a list, search the list and compare the stored hostname and path information to the end of the list. If you are sorting the list using specific information when creating the list, you do not have to compare to the end of the node. When comparing paths, the string comparison is used to find the inclusion relationship of the URL. The length of the string comparison is determined by the length of the path stored in the tree node. Therefore, if the path stored in the tree node is longer, you can move to the next node without comparing strings.
FIG. 6 is a diagram for explaining a method of searching for a search target URL by using a URL tree.
Inputs "www.hostname.com/dir2/file.html" and separates it into host name and path, and creates hash value tuple with separated host name by using hash function and generated hash value tuple (4, 2,. Find the list using .2) as an index. Since the list exists at the index, the host name and path related information are compared with the first node of the list. Since the hostname information is the same but the path is longer, it moves to the next node in the list. In this node, the search is successful because the hostname-related information is the same and the path stored in the node contains the path of the search target URL.
1 is a conceptual diagram illustrating an example of conversion from a URL list to a hash table according to the related art.
2 is a conceptual diagram showing the general structure of the URL hashtree of the present invention.
3 is a flowchart illustrating a process of converting a URL list into a hash list according to the present invention.
4 is a conceptual diagram illustrating an example of a process of converting a URL list into a URL hashery according to the present invention.
5 is a flowchart illustrating a process of searching for a search target URL in a URL hashtree.
6 is a conceptual diagram illustrating an example of a process of searching for a search target URL in a URL hashtree;
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20080081135A KR100999408B1 (en) | 2008-08-20 | 2008-08-20 | Method for searching an ??? using hash tree |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20080081135A KR100999408B1 (en) | 2008-08-20 | 2008-08-20 | Method for searching an ??? using hash tree |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20100022565A true KR20100022565A (en) | 2010-03-03 |
KR100999408B1 KR100999408B1 (en) | 2010-12-09 |
Family
ID=42175027
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR20080081135A KR100999408B1 (en) | 2008-08-20 | 2008-08-20 | Method for searching an ??? using hash tree |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR100999408B1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140039696A (en) * | 2012-09-25 | 2014-04-02 | 삼성전자주식회사 | Method and apparatus for searching url address in url list in a communication system |
KR20160069707A (en) * | 2014-12-09 | 2016-06-17 | 홍익대학교 산학협력단 | Apparatus and method for searching address based on hashing using temporal locality |
KR102000627B1 (en) * | 2019-01-04 | 2019-07-16 | (주)공간인소프트 | Data updating method and apparatus |
KR102112079B1 (en) * | 2019-11-06 | 2020-05-18 | (주)모니터랩 | Apparatus and method for processing URL |
CN112632360A (en) * | 2020-12-30 | 2021-04-09 | 北京安博通科技股份有限公司 | Matching method and device for big data URL (Uniform resource locator) library and storage medium |
KR20230167947A (en) | 2022-06-03 | 2023-12-12 | (주)엔토빌소프트 | Method for generating url map, url examination system including url map and method for examining url using url map |
-
2008
- 2008-08-20 KR KR20080081135A patent/KR100999408B1/en active IP Right Grant
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140039696A (en) * | 2012-09-25 | 2014-04-02 | 삼성전자주식회사 | Method and apparatus for searching url address in url list in a communication system |
KR20160069707A (en) * | 2014-12-09 | 2016-06-17 | 홍익대학교 산학협력단 | Apparatus and method for searching address based on hashing using temporal locality |
KR102000627B1 (en) * | 2019-01-04 | 2019-07-16 | (주)공간인소프트 | Data updating method and apparatus |
KR102112079B1 (en) * | 2019-11-06 | 2020-05-18 | (주)모니터랩 | Apparatus and method for processing URL |
CN112632360A (en) * | 2020-12-30 | 2021-04-09 | 北京安博通科技股份有限公司 | Matching method and device for big data URL (Uniform resource locator) library and storage medium |
KR20230167947A (en) | 2022-06-03 | 2023-12-12 | (주)엔토빌소프트 | Method for generating url map, url examination system including url map and method for examining url using url map |
Also Published As
Publication number | Publication date |
---|---|
KR100999408B1 (en) | 2010-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6490592B1 (en) | Method of and apparatus for generating a tree data structure supporting longest match lookup | |
CN107153647B (en) | Method, apparatus, system and computer program product for data compression | |
CN102122285B (en) | Data cache system and data inquiry method | |
US6564211B1 (en) | Fast flexible search engine for longest prefix match | |
US8938459B2 (en) | System and method for distributed index searching of electronic content | |
KR101467589B1 (en) | Dynamic fragment mapping | |
CN101315640B (en) | Directory management method and apparatus | |
CN111868710B (en) | Random extraction forest index structure for searching large-scale unstructured data | |
US8099421B2 (en) | File system, and method for storing and searching for file by the same | |
US7526497B2 (en) | Database retrieval apparatus, retrieval method, storage medium, and program | |
CN111801665B (en) | Hierarchical Locality Sensitive Hash (LSH) partition index for big data applications | |
US9020951B2 (en) | Methods for indexing and searching based on language locale | |
KR100999408B1 (en) | Method for searching an ??? using hash tree | |
CN102333036B (en) | Method and system for realizing high-speed routing lookup | |
US7756859B2 (en) | Multi-segment string search | |
US6735600B1 (en) | Editing protocol for flexible search engines | |
US20080133494A1 (en) | Method and apparatus for searching forwarding table | |
CN106302172A (en) | Support Hash lookup and the storage of route querying, lookup method and device simultaneously | |
EP1063827A2 (en) | Method for address lookup | |
CN105912696A (en) | DNS (Domain Name System) index creating method and query method based on logarithm merging | |
CN106302178B (en) | Route query method and device | |
US9509757B2 (en) | Parallel sorting key generation | |
CN111819552B (en) | Access control list management method and device | |
KR20210028576A (en) | Network Key Value Indexing Design | |
CN112231400A (en) | Distributed database access method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment |
Payment date: 20131125 Year of fee payment: 4 |
|
FPAY | Annual fee payment |
Payment date: 20141201 Year of fee payment: 5 |
|
FPAY | Annual fee payment |
Payment date: 20151201 Year of fee payment: 6 |
|
FPAY | Annual fee payment |
Payment date: 20161125 Year of fee payment: 7 |
|
FPAY | Annual fee payment |
Payment date: 20171122 Year of fee payment: 8 |
|
FPAY | Annual fee payment |
Payment date: 20191202 Year of fee payment: 10 |