[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

KR20100022565A - Method for searching an url using hash tree - Google Patents

Method for searching an url using hash tree Download PDF

Info

Publication number
KR20100022565A
KR20100022565A KR1020080081135A KR20080081135A KR20100022565A KR 20100022565 A KR20100022565 A KR 20100022565A KR 1020080081135 A KR1020080081135 A KR 1020080081135A KR 20080081135 A KR20080081135 A KR 20080081135A KR 20100022565 A KR20100022565 A KR 20100022565A
Authority
KR
South Korea
Prior art keywords
url
hash
list
host name
search
Prior art date
Application number
KR1020080081135A
Other languages
Korean (ko)
Other versions
KR100999408B1 (en
Inventor
김형식
유진형
Original Assignee
충남대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 충남대학교산학협력단 filed Critical 충남대학교산학협력단
Priority to KR20080081135A priority Critical patent/KR100999408B1/en
Publication of KR20100022565A publication Critical patent/KR20100022565A/en
Application granted granted Critical
Publication of KR100999408B1 publication Critical patent/KR100999408B1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Software Systems (AREA)

Abstract

PURPOSE: A method for searching a URL using a hash tree is provided to quickly and effectively search a search target URL using a hash tree. CONSTITUTION: A search object URL(Uniform Resource Locator) is inputted. The search object URL is divided into a host name and a route. A hash value tuple is generated from the host name using a hash function. A list is found from the hash tree using the hash value tuple. Information related to host name and route saved in the list are compared.

Description

Method for searching an usingRL using hash tree}

The present invention relates to a website content filtering system, and more particularly, to generate a hashlist for a URL list and to utilize the same to effectively search (filter) whether a URL input by a user is included in a predefined URL list. It is about a method.

Recently, with the popularization of the web and the Internet, its dysfunction has become a problem, and the necessity of a website filtering system has emerged, and various methods for efficiently filtering website contents have been used. In addition, web service objects are represented as uniform resource locators (URLs) in accordance with the trend of serving various applications on the web. Thus, a method of filtering website content by comparing a user's request URL with a predefined list of URLs (URL list) Based filtering method) is widely used.

Meanwhile, the size of predefined URL lists is growing rapidly. This increase in the size of the URL list causes a significant decrease in the search speed, which leads to an increase in response time and a decrease in network service quality.

It is important to define the inclusion relationship between URLs in the URL retrieval method. In the present invention, the URL is divided into a hostname and a path, and the path is again defined as a directory and a file. In this case, when the host names of URLs A and B are the same and the path of A includes B in the file hierarchy, 'URL A includes URL B'. For example, if URL A is "www.hostname / directory1 /" and URL B is "www.hostname / directory1 / file2", then URL A contains URL B.

A single hash table method is known as a conventional method of retrieving such an inclusion relationship. According to the single hash table method, a list of URLs is defined, a hash table having a proper size is created according to the number of URLs present in the URL list, and the hash table positions of the respective URLs are determined through a hash function to store the URLs. This single hash table method has a problem in that the search time is long depending on the following two attributes.

In the single hash table method, the path of the search target URL is separated and added to the host name one by one in order to find the inclusion relationship of the URL. That is, connect the separated paths one by one with the host name and compare them with reference to the hash table. For example, for the URL list and single hash table as shown in Fig. 1, assume that the search target URL is "www.hostname.com/dir2/a.html". According to this method, "www.hostname.com/" is first obtained through an hash function and compared with an index. However, since the URL is not found in the hash table, it searches for "www.hostname.com/dir2/" with "dir2 /" in the hash table and succeeds. If the search fails again, the search continues with "a.html" appended. Therefore, if the search fails or the path of the search target URL is composed of a large number of directories, the search time of the single hash table method is long because each directory needs to be compared with the host name. That is, according to the conventional URL search method, in order to search the search target URL, the directories and files must be connected to the host name one by one and the search using the hash function is repeated, and all directories and files must be added. do. Therefore, if the search target URL cannot be searched, the number of comparisons is inevitably increased by the number of directories. In general, the prior art is an inefficient search method because the rate of failure to search in a URL search application is much larger than the rate of success.

The second problem is index conflict. According to the hash method, there is a possibility of index conflict. When constructing a hash table, if a hash index conflict occurs, a single hash table uses open linear addressing, which stores it at the next slot location. Therefore, when searching for a target URL in a hash table, it must be found by comparing it with the next index until not only the index but also the URL stored in the hash table, so the search time when the hash table is not large enough or there are many collisions of hash values There is a problem with this lengthening. If you make the index bigger to reduce the conflict, you need to increase the size of the hash table and free up unused memory space. This problem occurs because the memory size of the hash table is determined by the size of the index regardless of the number of entries stored in a single hash table structure.

An object of the present invention is to solve the above-mentioned disadvantages of the prior art, and to provide a method for generating a hashtree from a predefined URL list and quickly and effectively searching for a search target URL using the same.

In the search method using a hash table, the larger the range of hash values, the higher the probability of avoiding collisions, the shorter the search time, but the limited memory space cannot increase the range of hash values. In addition, although the hash value generated by the ideal hash function is distributed evenly, the hash value generated by the actual hash function is concentrated, and even if the total number of entries in the hash table is larger than the inserted entry, collision may still occur. The present invention devises a method of reducing the memory usage of entries that have a wide range of hash values but do not use them. You can effectively use memory by hierarchically using a relatively small hash table rather than a single large hash table structure and allocating only the hash tables that are needed each time an entry is inserted. For example, if a 24-bit hash consists of a single hash table, all 24 entries must be allocated from the beginning. However, if you configure a three-stage hash and split a 24-bit hash into three 8-bit hashes, you can reduce the amount of allocated memory space while allowing 2 24 entries, just like a single hash table. Therefore, even if the same size memory is used, the range of hash value can be increased.

The present invention utilizes this principle.

In order to achieve the above object, the present invention provides a process for gradually generating a hashsheet from a predefined URL list ( Hattry generation process, A ), and quickly and effectively searching for a search target URL using the hattry. It relates to a URL search method consisting of a process ( URL search process, B ).

In the present invention, the hatstry generating process ( A ) comprises: (a) extracting one URL to be inserted into the hatstry from the predefined URL list; (b) separating the extracted URL into a host name and a path; (c) generating a hash value tuple (I 0 , I 1 ,..., I n-1 ) from the host name using the hash function; (d) The hash value tuple (I 0 , I 1 ,..., I n-2 ) is used as an index of each step of the hash matrix to find the corresponding end node, and if the end node does not exist, the internal node and the end node are used. Generating; (e) storing the information about the host name and the information about the path as a list at the index I n-1 of the end node; and (f) completing the hash creation process after performing steps (a) to (e) for all URLs in the URL list.

In the present invention, a search process ( B ) of searching for a search target URL using a hashtree includes: (a) inputting a search target URL; (b) separating the search target URL into a host name and a path; (c) generating a hash value tuple (J 0 , J 1 , ..., J n-1 ) from the host name using the same hash function as the hash function used in the hash creation process; (d) finding a corresponding list in the hash using a hash value tuple (J 0 , J 1 ,..., J n-1 ) as each step index; and (e) comparing the host name stored in the list with information related to the path information.

As mentioned above, the search method using the URL hashsheet checks the existence of the corresponding list using the hash node and the index, and searches again using the list only if the list exists, so the given URL is not included in the list. If you can find out quickly.

In addition, according to the present invention, since nodes of an index that are not used in the hash table are not allocated, less memory is used as compared to a single hash table method having an index having the same size. The single hash table method has the disadvantage of allocating a hash table of an unused index by using a sufficiently large hash table to prevent index collisions.However, since a hash table only generates a hash table of an index that is used The advantage is that virtual index spaces can be obtained without using memory.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings. However, these are for illustrative purposes only and the present invention is not limited thereto.

2 is a diagram illustrating a general structure of a URL hashtree. A hash function is used to generate a tuple of hash values calculated from the host names of the URLs in the list and use them as indices (I 0 ~ I n-1 ) of the hashtree.

There are several ways to generate a tuple of hash values from a hostname. For example, multiple hash functions can be used to obtain multiple hash values, or multiple hash values can be obtained by taking part of a single hash value obtained by a single hash function. In a hash value tuple, the number of hash values is equal to the total number of steps in the hashtree, and the number of bits of the hash value is related to the hash table size of the step.

Internal nodes of the hashtree store memory pointers of lower-level hash tables, and end-point nodes of the hashtree store memory pointers of lists that store information about host names and path information.

If different URLs have the same hash value tuples, the list can be used so that multiple URLs can be stored on the same end node.

3 is a flowchart illustrating a process of generating a hashtree by receiving a URL. To create a URL hashlist, enter a list of URLs and gradually insert the URLs from the URL list into the hashlist using the following steps: (a) Extract the URL to insert and (b) With the hostname and path (C) Create a hash value tuple with a separate hostname using the hash function. (d) Hash value tuple is used as index of each step of hash matrix to find the corresponding end node and if there is no end node, internal node and end node are created. (e) The information on the host name and the path information are stored in a list at the index I n-1 of the end node. If you want to save a directory, do not specify it as a directory, but add the directory name with '/'. The list can also be sorted by a specific value for quick retrieval later. (f) Repeat the above procedure for all URLs in the URL list to store all URLs in the hashtree. This process inserts all URLs in the URL list into the hashtree.

FIG. 4 is a diagram illustrating three URLs being stored at the front and inserting a fourth URL in the process of generating a URL hashtree using the URL list. Two URLs with the same host name were stored as lists in the same location in the hashtree, and the remaining URLs were stored in different locations in the hash tree.

5 is a flowchart illustrating a method of searching a search target URL in a hash store storing a list of URLs: (a) receiving a search target URL, (b) separating a host name and a path, and (c) a hash function. Create a hash value tuple from the hostname using. (d) Find the list using the generated hash value tuple as the index of the hash table. If the list of generated hash value tuples is not found, the search fails and terminates. (e) If you find a list, search the list and compare the stored hostname and path information to the end of the list. If you are sorting the list using specific information when creating the list, you do not have to compare to the end of the node. When comparing paths, the string comparison is used to find the inclusion relationship of the URL. The length of the string comparison is determined by the length of the path stored in the tree node. Therefore, if the path stored in the tree node is longer, you can move to the next node without comparing strings.

FIG. 6 is a diagram for explaining a method of searching for a search target URL by using a URL tree.

Inputs "www.hostname.com/dir2/file.html" and separates it into host name and path, and creates hash value tuple with separated host name by using hash function and generated hash value tuple (4, 2,. Find the list using .2) as an index. Since the list exists at the index, the host name and path related information are compared with the first node of the list. Since the hostname information is the same but the path is longer, it moves to the next node in the list. In this node, the search is successful because the hostname-related information is the same and the path stored in the node contains the path of the search target URL.

1 is a conceptual diagram illustrating an example of conversion from a URL list to a hash table according to the related art.

2 is a conceptual diagram showing the general structure of the URL hashtree of the present invention.

3 is a flowchart illustrating a process of converting a URL list into a hash list according to the present invention.

4 is a conceptual diagram illustrating an example of a process of converting a URL list into a URL hashery according to the present invention.

5 is a flowchart illustrating a process of searching for a search target URL in a URL hashtree.

6 is a conceptual diagram illustrating an example of a process of searching for a search target URL in a URL hashtree;

Claims (5)

(A) (a) extracting one URL to be inserted into the hashlist from a predefined URL list; (b) separating the extracted URL into a host name and a path; (c) generating a hash value tuple (I 0 , I 1 ,..., I n-1 ) from the host name using the hash function; (d) The hash value tuple (I 0 , I 1 ,..., I n-2 ) is used as an index of each step of the hash matrix to find the corresponding end node, and if the end node does not exist, the internal node and the end node are used. Generating; (e) storing information about a host name and information about a path in a list at an index I n-1 of the end node; (f) performing all steps of the URL list after completing steps (a) to (e) to complete the process of creating a hash recipe ; (B) (a) inputting a search target URL; (b) separating a search target URL from a host name; (c) generating a hash value tuple J 0 , J 1 ,..., J n-1 from the host name using the hash function; (d) finding a list in the hash using the hash value tuples (J 0 , J 1 ,..., J n-1 ); (e) comparing the information related to the host name and the path information stored in the list; URL search process comprising: consisting of a URL search method using a hash. The method of claim 1, In step (e) of the step (A), the URL retrieval method using a hash, characterized in that the URL is divided into the host name-related information and the path-related information and stored. The method of claim 1, In the step (e) of the step (A), when the information about the path is a directory when storing the URL in the list, the URL retrieval method using a hash, characterized in that the end of the information stored by adding '/'. The method of claim 1, In steps (c) and (d) of step (A), and steps (c) and (d) of step (B), a hash value tuple is generated with a host name using a hash function of URL, and each hash A method for searching URLs using hatstry, characterized in that the value is used as the index of each step of the URL data. The method of claim 1, In the process (B), the search target URL is classified by first searching by using a node and a hash value tuple of the hashtree, and searching by using a list. URL search method.
KR20080081135A 2008-08-20 2008-08-20 Method for searching an ??? using hash tree KR100999408B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR20080081135A KR100999408B1 (en) 2008-08-20 2008-08-20 Method for searching an ??? using hash tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR20080081135A KR100999408B1 (en) 2008-08-20 2008-08-20 Method for searching an ??? using hash tree

Publications (2)

Publication Number Publication Date
KR20100022565A true KR20100022565A (en) 2010-03-03
KR100999408B1 KR100999408B1 (en) 2010-12-09

Family

ID=42175027

Family Applications (1)

Application Number Title Priority Date Filing Date
KR20080081135A KR100999408B1 (en) 2008-08-20 2008-08-20 Method for searching an ??? using hash tree

Country Status (1)

Country Link
KR (1) KR100999408B1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140039696A (en) * 2012-09-25 2014-04-02 삼성전자주식회사 Method and apparatus for searching url address in url list in a communication system
KR20160069707A (en) * 2014-12-09 2016-06-17 홍익대학교 산학협력단 Apparatus and method for searching address based on hashing using temporal locality
KR102000627B1 (en) * 2019-01-04 2019-07-16 (주)공간인소프트 Data updating method and apparatus
KR102112079B1 (en) * 2019-11-06 2020-05-18 (주)모니터랩 Apparatus and method for processing URL
CN112632360A (en) * 2020-12-30 2021-04-09 北京安博通科技股份有限公司 Matching method and device for big data URL (Uniform resource locator) library and storage medium
KR20230167947A (en) 2022-06-03 2023-12-12 (주)엔토빌소프트 Method for generating url map, url examination system including url map and method for examining url using url map

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140039696A (en) * 2012-09-25 2014-04-02 삼성전자주식회사 Method and apparatus for searching url address in url list in a communication system
KR20160069707A (en) * 2014-12-09 2016-06-17 홍익대학교 산학협력단 Apparatus and method for searching address based on hashing using temporal locality
KR102000627B1 (en) * 2019-01-04 2019-07-16 (주)공간인소프트 Data updating method and apparatus
KR102112079B1 (en) * 2019-11-06 2020-05-18 (주)모니터랩 Apparatus and method for processing URL
CN112632360A (en) * 2020-12-30 2021-04-09 北京安博通科技股份有限公司 Matching method and device for big data URL (Uniform resource locator) library and storage medium
KR20230167947A (en) 2022-06-03 2023-12-12 (주)엔토빌소프트 Method for generating url map, url examination system including url map and method for examining url using url map

Also Published As

Publication number Publication date
KR100999408B1 (en) 2010-12-09

Similar Documents

Publication Publication Date Title
US6490592B1 (en) Method of and apparatus for generating a tree data structure supporting longest match lookup
CN107153647B (en) Method, apparatus, system and computer program product for data compression
CN102122285B (en) Data cache system and data inquiry method
US6564211B1 (en) Fast flexible search engine for longest prefix match
US8938459B2 (en) System and method for distributed index searching of electronic content
KR101467589B1 (en) Dynamic fragment mapping
CN101315640B (en) Directory management method and apparatus
CN111868710B (en) Random extraction forest index structure for searching large-scale unstructured data
US8099421B2 (en) File system, and method for storing and searching for file by the same
US7526497B2 (en) Database retrieval apparatus, retrieval method, storage medium, and program
CN111801665B (en) Hierarchical Locality Sensitive Hash (LSH) partition index for big data applications
US9020951B2 (en) Methods for indexing and searching based on language locale
KR100999408B1 (en) Method for searching an ??? using hash tree
CN102333036B (en) Method and system for realizing high-speed routing lookup
US7756859B2 (en) Multi-segment string search
US6735600B1 (en) Editing protocol for flexible search engines
US20080133494A1 (en) Method and apparatus for searching forwarding table
CN106302172A (en) Support Hash lookup and the storage of route querying, lookup method and device simultaneously
EP1063827A2 (en) Method for address lookup
CN105912696A (en) DNS (Domain Name System) index creating method and query method based on logarithm merging
CN106302178B (en) Route query method and device
US9509757B2 (en) Parallel sorting key generation
CN111819552B (en) Access control list management method and device
KR20210028576A (en) Network Key Value Indexing Design
CN112231400A (en) Distributed database access method, device, equipment and storage medium

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant
FPAY Annual fee payment

Payment date: 20131125

Year of fee payment: 4

FPAY Annual fee payment

Payment date: 20141201

Year of fee payment: 5

FPAY Annual fee payment

Payment date: 20151201

Year of fee payment: 6

FPAY Annual fee payment

Payment date: 20161125

Year of fee payment: 7

FPAY Annual fee payment

Payment date: 20171122

Year of fee payment: 8

FPAY Annual fee payment

Payment date: 20191202

Year of fee payment: 10