[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

US7979418B1 - System, method, and computer program product for processing a prefix tree file utilizing a selected agent - Google Patents

System, method, and computer program product for processing a prefix tree file utilizing a selected agent Download PDF

Info

Publication number
US7979418B1
US7979418B1 US11/963,589 US96358907A US7979418B1 US 7979418 B1 US7979418 B1 US 7979418B1 US 96358907 A US96358907 A US 96358907A US 7979418 B1 US7979418 B1 US 7979418B1
Authority
US
United States
Prior art keywords
file
agent
prefix
prefix tree
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US11/963,589
Inventor
Andreas Schlemmer
Thomas C. H. Steiner
Peter Blaimschein
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
McAfee LLC
Original Assignee
McAfee LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by McAfee LLC filed Critical McAfee LLC
Priority to US11/963,589 priority Critical patent/US7979418B1/en
Assigned to MCAFEE, INC. reassignment MCAFEE, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BLAIMSCHEIN, PETER, SCHLEMMER, ANDREAS, STEINER, THOMAS C.H.
Priority to US13/150,947 priority patent/US8560521B2/en
Application granted granted Critical
Publication of US7979418B1 publication Critical patent/US7979418B1/en
Assigned to MCAFEE, LLC reassignment MCAFEE, LLC CHANGE OF NAME AND ENTITY CONVERSION Assignors: MCAFEE, INC.
Assigned to JPMORGAN CHASE BANK, N.A. reassignment JPMORGAN CHASE BANK, N.A. SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MCAFEE, LLC
Assigned to MORGAN STANLEY SENIOR FUNDING, INC. reassignment MORGAN STANLEY SENIOR FUNDING, INC. SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MCAFEE, LLC
Assigned to JPMORGAN CHASE BANK, N.A. reassignment JPMORGAN CHASE BANK, N.A. CORRECTIVE ASSIGNMENT TO CORRECT THE REMOVE PATENT 6336186 PREVIOUSLY RECORDED ON REEL 045055 FRAME 786. ASSIGNOR(S) HEREBY CONFIRMS THE SECURITY INTEREST. Assignors: MCAFEE, LLC
Assigned to MORGAN STANLEY SENIOR FUNDING, INC. reassignment MORGAN STANLEY SENIOR FUNDING, INC. CORRECTIVE ASSIGNMENT TO CORRECT THE REMOVE PATENT 6336186 PREVIOUSLY RECORDED ON REEL 045056 FRAME 0676. ASSIGNOR(S) HEREBY CONFIRMS THE SECURITY INTEREST. Assignors: MCAFEE, LLC
Assigned to MCAFEE, LLC reassignment MCAFEE, LLC RELEASE OF INTELLECTUAL PROPERTY COLLATERAL - REEL/FRAME 045055/0786 Assignors: JPMORGAN CHASE BANK, N.A., AS COLLATERAL AGENT
Assigned to MCAFEE, LLC reassignment MCAFEE, LLC RELEASE OF INTELLECTUAL PROPERTY COLLATERAL - REEL/FRAME 045056/0676 Assignors: MORGAN STANLEY SENIOR FUNDING, INC., AS COLLATERAL AGENT
Assigned to JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT AND COLLATERAL AGENT reassignment JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT AND COLLATERAL AGENT SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MCAFEE, LLC
Assigned to JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT reassignment JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT CORRECTIVE ASSIGNMENT TO CORRECT THE THE PATENT TITLES AND REMOVE DUPLICATES IN THE SCHEDULE PREVIOUSLY RECORDED AT REEL: 059354 FRAME: 0335. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT. Assignors: MCAFEE, LLC
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees

Definitions

  • the present invention relates to prefix trees, and more particularly to storing data in prefix trees.
  • relational databases For providing access to such data.
  • use of relational databases for hierarchical data has exhibited various limitations.
  • Particularly, relational databases have been limited with respect to storing and accessing very large amounts of data.
  • queries for retrieving data for web crawling purposes, storing data, modifying data, etc. have conventionally been unable to be a part of a distributed system.
  • selecting data from a traditional relational database usually results in a high number of cross-table joins and constraints to be processed in order to define crawling behavior and to impose limitations on the amount of data to be stored.
  • constraints include prevention of target server flooding by keeping an amount of parallel requests low, Internet protocol (IP) address information tracking to avoid parallel visits of target servers on multiple crawler nodes through usage of mutually exclusive IP ranges, as well as focused selections of hyperlinks to follow via user-definable strategies (e.g. only follow links that could point to potentially interesting content).
  • IP Internet protocol
  • a system, method, and computer program product are provided for processing a prefix tree file utilizing a selected agent.
  • a file including a prefix tree is identified.
  • an agent is selected from a plurality of agents to process the file. Further, the file is processed utilizing the agent.
  • FIG. 1 illustrates a network architecture, in accordance with one embodiment.
  • FIG. 2 shows a representative hardware environment that may be associated with the servers and/or clients of FIG. 1 , in accordance with one embodiment.
  • FIG. 3 shows a method for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment.
  • FIG. 4 shows a system for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment.
  • FIG. 5 shows a method for selecting an agent from a plurality of agents for processing a prefix tree file, in accordance with yet another embodiment.
  • FIG. 6 shows a system for storing new data in a plurality of distributed prefix trees, in accordance with still yet another embodiment.
  • FIG. 7 shows a system for querying a plurality of distributed tries, in accordance with another embodiment.
  • FIG. 8 shows uniform resource locators stored in a prefix tree, in accordance with yet another embodiment.
  • FIG. 1 illustrates a network architecture 100 , in accordance with one embodiment.
  • a plurality of networks 102 is provided.
  • the networks 102 may each take any form including, but not limited to a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, peer-to-peer network, etc.
  • LAN local area network
  • WAN wide area network
  • peer-to-peer network etc.
  • servers 104 which are capable of communicating over the networks 102 .
  • clients 106 are also coupled to the networks 102 and the servers 104 .
  • Such servers 104 and/or clients 106 may each include a desktop computer, lap-top computer, hand-held computer, mobile phone, personal digital assistant (PDA), peripheral (e.g. printer, etc.), any component of a computer, and/or any other type of logic.
  • PDA personal digital assistant
  • peripheral e.g. printer, etc.
  • any component of a computer and/or any other type of logic.
  • at least one gateway 108 is optionally coupled therebetween.
  • FIG. 2 shows a representative hardware environment that may be associated with the servers 104 and/or clients 106 of FIG. 1 , in accordance with one embodiment.
  • Such figure illustrates a typical hardware configuration of a workstation in accordance with one embodiment having a central processing unit 210 , such as a microprocessor, and a number of other units interconnected via a system bus 212 .
  • a central processing unit 210 such as a microprocessor
  • the workstation shown in FIG. 2 includes a Random Access Memory (RAM) 214 , Read Only Memory (ROM) 216 , an I/O adapter 218 for connecting peripheral devices such as disk storage units 220 to the bus 212 , a user interface adapter 222 for connecting a keyboard 224 , a mouse 226 , a speaker 228 , a microphone 232 , and/or other user interface devices such as a touch screen (not shown) to the bus 212 , communication adapter 234 for connecting the workstation to a communication network 235 (e.g., a data processing network) and a display adapter 236 for connecting the bus 212 to a display device 238 .
  • a communication network 235 e.g., a data processing network
  • display adapter 236 for connecting the bus 212 to a display device 238 .
  • the workstation may have resident thereon any desired operating system. It will be appreciated that an embodiment may also be implemented on platforms and operating systems other than those mentioned.
  • One embodiment may be written using JAVA, C, and/or C++ language, or other programming languages, along with an object oriented programming methodology.
  • Object oriented programming (OOP) has become increasingly used to develop complex applications.
  • FIG. 3 shows a method 300 for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment.
  • the method 300 may be carried out in the context of the architecture and environment of FIGS. 1 and/or 2 . Of course, however, the method 300 may be carried out in any desired environment.
  • the prefix tree may include any ordered hierarchical data structure which stores arrays of data with a common prefix. For example, each of a plurality of descendants of a node may include a common prefix. In one embodiment, the prefix tree may be a trie.
  • the prefix tree may store a plurality of uniform resource locators (URLs).
  • each of the URLs stored in the prefix tree may include a common prefix.
  • the common prefix may include the first two characters (e.g. letters, etc.) of a domain name server (DNS) domain name of each of the URLs.
  • DNS domain name server
  • the common prefix of URLs capable of being stored in the prefix tree may be predefined (e.g. by a user, etc.).
  • the characters representative of the common prefix may be predefined.
  • the number of characters representative of the common prefix may also be predefined.
  • the file may include any resource capable of storing the prefix tree.
  • the file may be dedicated to storing the prefix tree.
  • the file may only store the prefix tree, as an option.
  • the file may be identified in any desired manner.
  • the file may be identified based on a selection of the file, as set forth in the example explained below.
  • the file may be selected from a plurality of files, each file including a different prefix tree.
  • the file may be selected from a queue of files, where each file in the queue of files includes a different prefix tree.
  • the file may be selected based on a time of a last access to the file.
  • the queue of files may optionally store the files in order of a time each of the files were last accessed, where more recently accessed files are stored in the queue after less recently accessed files.
  • an agent is selected from a plurality of agents to process the file.
  • the agents may be running on the client computer and/or server computer on which the file is stored. In this way, the selected agent may access the file locally for processing the file.
  • each of the agents may include software.
  • each of the agents may include a command line executable.
  • each of the agents may include hardware with respect to another embodiment.
  • a predefined number of the agents e.g. predefined by a user, predefined based on capabilities and/or resources of the computer on which the agents are located, etc. may optionally execute in parallel.
  • each of the agents may perform different functions.
  • Such functions may include a crawling function (e.g. for crawling sites, such as web sites, indicated by data stored in the prefix tree, etc.), a merging function (e.g. for merging data into the prefix tree, etc.), a separating function (e.g. for separating data within the prefix tree, etc.), a reorganization function (e.g. for reorganizing data stored in the prefix tree, etc.) and/or any other function capable of being performed with respect to the prefix tree and/or any information associated with such prefix tree.
  • the agent may optionally be selected based on the type of processing to be performed on the file.
  • the agent may be selected based on a determination of whether the agent is in an idle state. For example, only an agent in an idle state may be selected for processing the file. As yet another option, the agent may be selected based on the type of prefix tree included in the file. Thus, an agent capable of categorizing data may be selected for processing a prefix tree containing uncategorized data, just by way of example.
  • the file is processed utilizing the agent, as shown in operation 306 .
  • the processing may include performing any functions in association with the file.
  • the processing may include executing any functions capable of being performed by the agent.
  • the processing may include downloading a web page identified using a URL stored in the prefix tree.
  • the processing may include analyzing the web page for gathering information associated with the URL. Such information may include, for example, information indicating whether the web page includes unwanted content (e.g. malware, etc.) and/or a type of any unwanted content discovered in the web page, a categorization of the web page, a type of language included in the web page, etc.
  • the processing may include categorizing the URL (e.g. based on a categorization of the web page, etc.).
  • the prefix tree file may be processed utilizing an agent selected from a plurality of agents.
  • the identification of the file (operation 302 ), the selection of the agent (operation 304 ), and the processing of the file (operation 306 ) may be controlled utilizing a controller application.
  • the controller application may schedule execution of one of the agents for processing the prefix tree file.
  • the controller application may be located on the computer on which the file and/or agents are located, for example.
  • FIG. 4 shows a system 400 for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment.
  • the system 400 may be implemented in the context of the architecture and environment of FIGS. 1-3 . Of course, however, the system 400 may be implemented in any desired environment. It should also be noted that the aforementioned definitions may apply during the present description.
  • a first processing node 402 A is in communication with a second processing node 402 B.
  • the first processing node 402 A may communicate with the second processing node 402 B over a network (e.g. such as any of the networks described above with respect to FIG. 1 ), but of course the first processing node 402 A may also communicate with the second processing node 402 B via a local connection.
  • the first processing node 402 A may communicate with the second processing node 402 B using a server message block (SMB) protocol, a node specific protocol and/or any other protocol capable of being used for communication purposes.
  • SMB server message block
  • the first processing node 402 A and the second processing node 402 B may optionally operate as a peer network of independent nodes.
  • the first processing node 402 A and the second processing node 402 B may each include any of the devices described above with respect to FIGS. 1 and/or 2 .
  • the first processing node 402 A and the second processing node 402 B may each include the Intel® CoreTM 2 Duo processor with 250 gigabytes of hard disk space.
  • the first processing node 402 A and the second processing node 402 B may optionally include client computers instead of being consolidated on expensive high-end servers. While only two processing nodes 402 A and 402 B are shown, it should be noted that any desired number of processing nodes may be included in the system 400 , thus allowing scalability of the system 400 .
  • the first processing node 402 A and the second processing node 402 B each include a respective queue of prefix trees 404 A and 404 B (hereinafter referred to as tries).
  • the queues of tries 404 A and 404 B may each include a plurality of trie files, where each trie file includes a different trie (e.g. a different type of trie, etc.) that stores different data (e.g. URLs), for example.
  • tries may each trie file includes a different trie (e.g. a different type of trie, etc.) that stores different data (e.g. URLs), for example.
  • tries may each trie file includes a different trie (e.g. a different type of trie, etc.) that stores different data (e.g. URLs), for example.
  • tries may be distributed across a plurality of processing nodes for providing distributed processing and load-balancing of such trie files, as described in more detail below.
  • the trie files may be stored locally on a hard disk of a respective one of the first processing node 402 A and the second processing node 402 B.
  • the queues of tries 404 A and 404 B may access trie files in an order based on a time in which the files were last accessed (e.g. modified, etc.). For example, the least recently accessed trie file may be stored first in the queue 404 A and 404 B, whereas the most recently accessed trie file may be stored last in the queue 404 A and 404 B. Accordingly, after a top trie file in the queue 404 A and 404 B is accessed (e.g. for processing, etc.), such trie file may be returned to the end of the queue 404 A and 404 B.
  • each trie may only store data with a common prefix.
  • the data may include URLs
  • the common prefix may include the first two characters of a DNS domain name of such URLs.
  • Table 1 illustrates exemplary trie file names that may be stored in the queues 404 A and 404 B of the first processing node 402 A and the second processing node 402 B.
  • such example is set forth for illustrative purposes only, and thus should not be construed as limiting in any manner.
  • Trie_1 aa.ct-stores data with the common prefix aa Trie_2: cx.ct-stores data with the common prefix cx Trie_3: de.ct-stores data with the common prefix de Processing Node_2 Trie_1: ab.ct-stores data with the common prefix ab Trie_2: mi.ct-stores data with the common prefix mi
  • the URL msdn.microsoft.com would be stored in Trie — 2 of Processing Node — 2 since the first two characters of the DNS domain name (i.e. “microsoft”) of such URL share the common prefix “mi”.
  • the URL www.dell.com would be stored in Trie — 3 of Processing Node — 1 since the first two characters of the DNS domain name (i.e. “dell”) of such URL share the common prefix “de”.
  • each trie file may be limited to a predetermined size (e.g. 100 megabytes, etc.), such that multiple tries may be capable of being loaded into random access memory (RAM).
  • RAM random access memory
  • a new trie may be created and added to the queue of tries 404 A and 404 B in which the original trie is stored.
  • the new trie may store data with a three character common prefix which includes the two character common prefix.
  • the respective tree is split into multiple files with a three character prefix.
  • the first processing node 402 A and the second processing node 402 B may also each include a crawler controller 406 A and 406 B.
  • the crawler controller 406 A and 406 B may include a controller application, in one embodiment.
  • the crawler controller 406 A and 406 B may select the top trie file from a respective queue of tries 404 A and 404 B and may send the selected trie file to one of a plurality of different types of agents 410 A, 410 B, 412 A, 412 B, 414 A, and 414 B located in an agent pool 408 A and 408 B on the same processing node 402 A and 402 B.
  • the agent 410 A, 410 B, 412 A, 412 B, 414 A, and 414 B to which the selected trie file is sent may be selected by the crawler controller 406 A and 406 B.
  • the agent pool 408 A and 408 B may include a first crawler agent 410 A and 410 B, a second crawler agent 412 A and 412 B, and a maintenance agent 414 A and 414 B.
  • the first crawler agent 410 A and 410 B and/or second crawler agent 412 A and 412 B may each categorize data stored in trie files stored in the trie file queue 404 A and 404 B, download content (e.g. web pages, etc.) identified by the data stored in the trie files, analyze such content (e.g. for malware, etc.) and/or gather any other information associated with the data stored in the trie files. New data will be stored in an extra trie file (seed trie).
  • the maintenance agent 414 A and 414 B may sort data stored in such trie files that do not have a common prefix (e.g. determine a common prefix of the data), separate such data from such trie files, and optionally merge the data into various other trie files based on a common prefix between the other trie files and the data.
  • the crawler controller 406 A of the first processing node 402 A may select a trie file from the top of the queue of tries 404 A stored on the first processing node 402 A.
  • the crawler controller 406 A may determine the type of the selected trie file for determining which processing is to be performed on the trie file.
  • a trie file storing data without a common prefix may optionally require processing for separating the data and storing the data by common prefix.
  • the crawler controller 406 A may select an agent 410 A, 412 A and 414 A from the agent pool 408 A of the first processing node 402 A based on the type of the selected trie file.
  • the crawler controller 406 A may select the maintenance agent 414 A for sorting and merging the data into other trie files with prefixes common to the data.
  • the crawler controller 406 A may select an agent 410 A, 412 A and 414 A based on which agent 410 A, 412 A and 414 A is idle.
  • the crawler controller 406 A of the first processing node 402 A is in communication with the second processing node 402 B.
  • the crawler controller 406 B of the second processing node 402 B may also be in communication with the first processing node 402 A.
  • the crawler controller 406 A may control the copying of data (e.g. via a file share) stored in trie files of the first processing node 402 A to a trie file in the queue of tries 404 B stored on the second processing node 402 B (e.g. utilizing an SMB protocol, etc.).
  • the crawler controller 406 B of the second processing node 402 B may control a merging of the received data into trie files based on an associated common prefix of the data.
  • the crawler controller 406 A of the first processing node 402 A may be synchronized with the crawler controller 406 B of the second processing node 402 B via the node specific protocol.
  • FIG. 5 shows a method 500 for selecting an agent from a plurality of agents for processing a prefix tree file, in accordance with yet another embodiment.
  • the method 500 may be carried out in the context of the architecture and environment of FIGS. 1-4 .
  • the method 500 may be carried out utilizing the crawler controller 406 A and 406 B of FIG. 4 .
  • the method 500 may be carried out in any desired environment.
  • the aforementioned definitions may apply during the present description.
  • the idle state may include any state in which an agent is not being utilized (e.g. executed for processing purposes, etc.).
  • each of a plurality of agents within a pool of agents may be monitored for determining whether any of such agents are idle.
  • each of the plurality of agents may broadcast a signal indicating an associated state, such that an agent with an idle state may be determined based on a signal broadcasted therefrom.
  • a prefix tree stored in a queue with an oldest access time is identified. Note operation 504 .
  • the prefix tree may be included in a file stored in the queue.
  • the prefix tree with the oldest access time may be identified based on an ordering of prefix trees within the queue.
  • the queue may store prefix trees in an order based on a latest access time for each prefix tree.
  • a prefix tree at a top of the queue may be the prefix tree with the oldest access time.
  • the identified prefix tree is sent to the idle agent, as shown in operation 506 .
  • the agent may be selected for processing the prefix tree.
  • the prefix tree may only be sent to an agent capable of processing the type of the prefix tree.
  • the type of the prefix tree may be indicated by the file extension of the file in which the prefix tree is included, in one embodiment.
  • the prefix tree may only be sent to an agent capable of categorizing data within the prefix tree.
  • the method 500 may wait to identify an idle agent capable of processing the prefix tree prior to sending the prefix tree.
  • FIG. 6 shows a system 600 for storing new data in a plurality of distributed prefix trees, in accordance with still yet another embodiment.
  • the system 600 may be implemented in the context of the architecture and environment of FIGS. 1-5 .
  • the system 600 may be implemented in any desired environment.
  • the aforementioned definitions may apply during the present description.
  • a first computer 602 is in communication with a second computer 604 A, a third computer 604 B, and a fourth computer 604 C.
  • the first computer 602 , second computer 604 A, third computer 604 B, and fourth computer 604 C may each include a processing node (e.g. such as one of the processing nodes 402 A and 402 B of FIG. 4 ), for example.
  • the first computer 602 may be in communication with the second computer 604 A, third computer 604 B, and fourth computer 604 C over a network. While not shown, it should be noted that any of the second computer 604 A, third computer 604 B, and fourth computer 604 C may also be in communication with one another.
  • the first computer 602 includes a seed trie 606 .
  • the seed trie may be stored in a file with a unique extension (e.g. *.st).
  • the seed trie may store newly discovered data (e.g. URLs discovered via a crawler crawling web pages on the Internet, etc.).
  • the seed trie 606 may store data which is unsorted according to common prefixes of such data. While only a single seed trie 606 is shown, it should be noted that the first computer 602 may also include a plurality of seed tries, in another embodiment. For example, each time a module (e.g. crawler) is finished discovering new data, the crawler may produce a new seed trie containing the newly discovered data.
  • a module e.g. crawler
  • the first computer 602 also includes a plurality of sorted seed tries 608 A-C.
  • the sorted seed tries 608 A-C may also each be included in a file with a unique extension (e.g. *.sst). Additionally, the sorted seed tries 608 A-C may store data which has been sorted from the seed trie according to common prefixes of such data. For example, each sorted seed trie 608 A-C may store data with a single common prefix.
  • a maintenance agent may process the data stored in the seed trie 606 by sorting (e.g. grouping, etc.) the data according to common prefixes, separating the data based on the sorting, and storing the sorted data in the sorted seed tries 608 A-C based on common prefixes.
  • the sorting may include identifying a name of a target sorted seed trie 608 A-C by extracting a registered domain name from a URL stored in the seed trie 606 and utilizing the first n letters (e.g. where n is predefined) as the name (or part of the name) of the target sorted seed trie 608 A-C.
  • Each of the second computer 604 A, third computer 604 B, and fourth computer 604 C also include a sorted seed trie 610 A-C.
  • the maintenance agent of the first computer 602 may copy each of the sorted seed tries 608 A-C to a remote computer as a separate file.
  • each of the sorted seed tries 608 A-C of the first computer 602 may only be copied to one of the second computer 604 A, third computer 604 B, and fourth computer 604 C if the name of such sorted seed trig 608 A-C matches the names of tries already existing on the second computer 604 A, third computer 604 B, and/or fourth computer 604 C. In this way, each of the second computer 604 A, third computer 604 B, and fourth computer 604 C may store a different sorted seed tree 610 A-C.
  • each of the second computer 604 A, third computer 604 B, and fourth computer 604 C include a crawl tree 612 A-C.
  • the crawl trees 612 A-C may also each be included in a file with a unique extension (e.g. *.ct).
  • the crawl trees 612 -C may be utilized for permanently storing data, as an option.
  • the maintenance agent of each of the second computer 604 A, third computer 604 B, and fourth computer 604 C may identify data stored in a respective sorted seed tree 610 A-C and merge the data into an associated crawl tree 612 A-C.
  • the merging may include copying data from the sorted seed trie 610 A-C to the associated crawl tree 612 A-C.
  • the maintenance agent may also perform operations (e.g. based on user-configured constraints) such as splitting the crawl trie 612 A-C into multiple crawl tries (e.g. due to size limits of the crawl trie 612 A-C), removal of data in the crawl trie 612 A-C that is marked for deletion, etc.
  • the maintenance agent may delete data stored in the crawl trie 612 A-C by reorganizing the structure of the crawl trie 612 A-C. Data marked for deletion may be deleted in bulk during other maintenance operations, such as merging, for limiting a performance impact on the crawl trie 612 A-C, as an option.
  • the crawl tries 612 A-C may be utilized as input for crawlers.
  • each crawler may pull a set of URLs to be visited (e.g. according to defined crawling strategies) from one of the crawl tries 612 A-C, may download an associated web page and may update such crawl trie 612 A-C with metadata.
  • the metadata may include, for example, a category of the web page, a main language used by the web page, etc.
  • FIG. 7 shows a system 700 for querying a plurality of distributed tries, in accordance with another embodiment.
  • the system 700 may be implemented in the context of the architecture and environment of FIGS. 1-6 .
  • the system 700 may be implemented in any desired environment.
  • the aforementioned definitions may apply during the present description.
  • tries may be queried for data stored in such tries.
  • a central control facility e.g. user interface
  • source code 708 for each of the queries is sent to each processing node that stores a trie file 701 .
  • a query agent 706 local to each processing node compiles and binds the source code 708 as dynamic link library (DLL) files 710 and iterates over data in the trie.
  • DLL dynamic link library
  • Multiple queries may optionally be processed with a single in-order run of a selected trie file 701 .
  • Information may be collected during the processing and returned in a normalized format (e.g. query results 712 ) to a central database server, such as the SQL server 702 shown, via a database connector 714 .
  • the information may be returned to the central database server for the purpose of collation and reporting, for example.
  • the central database server may communicate results of the query to the initiator of the query (e.g. the user interface).
  • FIG. 8 shows uniform resource locators (URLs) 800 stored in a prefix tree, in accordance with yet another embodiment.
  • the URLs 800 may be implemented in the context of the architecture and environment of FIGS. 1-7 .
  • the URLs 800 may be implemented in any desired environment.
  • the aforementioned definitions may apply during the present description.
  • the URLs 800 include arrays which are stored in the prefix tree.
  • the prefix tree may include keys represented as strings, as an option.
  • Each element of an array may be a node within the prefix tree.
  • descendants of a node may share a common prefix with the node for minimizing storage space used by the prefix tree.
  • the URLs http://www.mcafe* and http://earth.goo* may share the common prefix “http://”.
  • the URL http://earth.goo* may be a descendant of http://www.mcafe* at the first “w” node within the URL http://www.mcafe*.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system, method, and computer program product are provided for processing a prefix tree file utilizing a selected agent. In use, a file including a prefix tree is identified. Additionally, an agent is selected from a plurality of agents to process the file. Further, the file is processed utilizing the agent.

Description

FIELD OF THE INVENTION
The present invention relates to prefix trees, and more particularly to storing data in prefix trees.
BACKGROUND
Traditionally, data has been stored in relational databases for providing access to such data. However, use of relational databases for hierarchical data has exhibited various limitations. Particularly, relational databases have been limited with respect to storing and accessing very large amounts of data.
For example, traditional relational databases have limited ability to scale to large amounts of data. In addition, traditional relational databases oftentimes require costly redundant arrays of independent disks (RAID) devices to exhibit adequate query performance. Further, transforming hierarchical data into traditional relational databases has also been limited in performance.
Still yet, queries for retrieving data for web crawling purposes, storing data, modifying data, etc. have conventionally been unable to be a part of a distributed system. For example, selecting data from a traditional relational database usually results in a high number of cross-table joins and constraints to be processed in order to define crawling behavior and to impose limitations on the amount of data to be stored. Examples for such constraints include prevention of target server flooding by keeping an amount of parallel requests low, Internet protocol (IP) address information tracking to avoid parallel visits of target servers on multiple crawler nodes through usage of mutually exclusive IP ranges, as well as focused selections of hyperlinks to follow via user-definable strategies (e.g. only follow links that could point to potentially interesting content).
There is thus a need for overcoming these and/or other issues associated with the prior art.
SUMMARY
A system, method, and computer program product are provided for processing a prefix tree file utilizing a selected agent. In use, a file including a prefix tree is identified. Additionally, an agent is selected from a plurality of agents to process the file. Further, the file is processed utilizing the agent.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a network architecture, in accordance with one embodiment.
FIG. 2 shows a representative hardware environment that may be associated with the servers and/or clients of FIG. 1, in accordance with one embodiment.
FIG. 3 shows a method for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment.
FIG. 4 shows a system for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment.
FIG. 5 shows a method for selecting an agent from a plurality of agents for processing a prefix tree file, in accordance with yet another embodiment.
FIG. 6 shows a system for storing new data in a plurality of distributed prefix trees, in accordance with still yet another embodiment.
FIG. 7 shows a system for querying a plurality of distributed tries, in accordance with another embodiment.
FIG. 8 shows uniform resource locators stored in a prefix tree, in accordance with yet another embodiment.
DETAILED DESCRIPTION
FIG. 1 illustrates a network architecture 100, in accordance with one embodiment. As shown, a plurality of networks 102 is provided. In the context of the present network architecture 100, the networks 102 may each take any form including, but not limited to a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, peer-to-peer network, etc.
Coupled to the networks 102 are servers 104 which are capable of communicating over the networks 102. Also coupled to the networks 102 and the servers 104 is a plurality of clients 106. Such servers 104 and/or clients 106 may each include a desktop computer, lap-top computer, hand-held computer, mobile phone, personal digital assistant (PDA), peripheral (e.g. printer, etc.), any component of a computer, and/or any other type of logic. In order to facilitate communication among the networks 102, at least one gateway 108 is optionally coupled therebetween.
FIG. 2 shows a representative hardware environment that may be associated with the servers 104 and/or clients 106 of FIG. 1, in accordance with one embodiment. Such figure illustrates a typical hardware configuration of a workstation in accordance with one embodiment having a central processing unit 210, such as a microprocessor, and a number of other units interconnected via a system bus 212.
The workstation shown in FIG. 2 includes a Random Access Memory (RAM) 214, Read Only Memory (ROM) 216, an I/O adapter 218 for connecting peripheral devices such as disk storage units 220 to the bus 212, a user interface adapter 222 for connecting a keyboard 224, a mouse 226, a speaker 228, a microphone 232, and/or other user interface devices such as a touch screen (not shown) to the bus 212, communication adapter 234 for connecting the workstation to a communication network 235 (e.g., a data processing network) and a display adapter 236 for connecting the bus 212 to a display device 238.
The workstation may have resident thereon any desired operating system. It will be appreciated that an embodiment may also be implemented on platforms and operating systems other than those mentioned. One embodiment may be written using JAVA, C, and/or C++ language, or other programming languages, along with an object oriented programming methodology. Object oriented programming (OOP) has become increasingly used to develop complex applications.
Of course, the various embodiments set forth herein may be implemented utilizing hardware, software, or any desired combination thereof. For that matter, any type of logic may be utilized which is capable of implementing the various functionality set forth herein.
FIG. 3 shows a method 300 for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment. As an option, the method 300 may be carried out in the context of the architecture and environment of FIGS. 1 and/or 2. Of course, however, the method 300 may be carried out in any desired environment.
As shown in operation 302, a file including a prefix tree is identified. In the context of the present description, the prefix tree may include any ordered hierarchical data structure which stores arrays of data with a common prefix. For example, each of a plurality of descendants of a node may include a common prefix. In one embodiment, the prefix tree may be a trie.
In one embodiment, the prefix tree may store a plurality of uniform resource locators (URLs). As an option, each of the URLs stored in the prefix tree may include a common prefix. Just by way of example, the common prefix may include the first two characters (e.g. letters, etc.) of a domain name server (DNS) domain name of each of the URLs. Of course, it should be noted that any desired data with a common prefix (e.g. identifiers, data paths, strings, etc.) may be stored in the prefix tree.
As another option, the common prefix of URLs capable of being stored in the prefix tree may be predefined (e.g. by a user, etc.). For example, the characters representative of the common prefix may be predefined. As another example, the number of characters representative of the common prefix may also be predefined.
Additionally, the file may include any resource capable of storing the prefix tree. In one embodiment, the file may be dedicated to storing the prefix tree. Thus, the file may only store the prefix tree, as an option.
Further, the file may be identified in any desired manner. In one embodiment, the file may be identified based on a selection of the file, as set forth in the example explained below. To this end, the file may be selected from a plurality of files, each file including a different prefix tree.
For example, the file may be selected from a queue of files, where each file in the queue of files includes a different prefix tree. As another example, the file may be selected based on a time of a last access to the file. Thus, the queue of files may optionally store the files in order of a time each of the files were last accessed, where more recently accessed files are stored in the queue after less recently accessed files.
In addition as shown in operation 304, an agent is selected from a plurality of agents to process the file. In one embodiment, the agents may be running on the client computer and/or server computer on which the file is stored. In this way, the selected agent may access the file locally for processing the file.
In another embodiment, each of the agents may include software. For example, each of the agents may include a command line executable. Of course, each of the agents may include hardware with respect to another embodiment. Furthermore, a predefined number of the agents (e.g. predefined by a user, predefined based on capabilities and/or resources of the computer on which the agents are located, etc.) may optionally execute in parallel.
In yet another embodiment, each of the agents may perform different functions. Such functions may include a crawling function (e.g. for crawling sites, such as web sites, indicated by data stored in the prefix tree, etc.), a merging function (e.g. for merging data into the prefix tree, etc.), a separating function (e.g. for separating data within the prefix tree, etc.), a reorganization function (e.g. for reorganizing data stored in the prefix tree, etc.) and/or any other function capable of being performed with respect to the prefix tree and/or any information associated with such prefix tree. To this end, the agent may optionally be selected based on the type of processing to be performed on the file.
As another option, the agent may be selected based on a determination of whether the agent is in an idle state. For example, only an agent in an idle state may be selected for processing the file. As yet another option, the agent may be selected based on the type of prefix tree included in the file. Thus, an agent capable of categorizing data may be selected for processing a prefix tree containing uncategorized data, just by way of example.
Still yet, the file is processed utilizing the agent, as shown in operation 306. In the context of the present description, the processing may include performing any functions in association with the file. For example, the processing may include executing any functions capable of being performed by the agent.
In one embodiment, the processing may include downloading a web page identified using a URL stored in the prefix tree. In another embodiment, the processing may include analyzing the web page for gathering information associated with the URL. Such information may include, for example, information indicating whether the web page includes unwanted content (e.g. malware, etc.) and/or a type of any unwanted content discovered in the web page, a categorization of the web page, a type of language included in the web page, etc. In yet another embodiment, the processing may include categorizing the URL (e.g. based on a categorization of the web page, etc.).
In this way, the prefix tree file may be processed utilizing an agent selected from a plurality of agents. As an option, the identification of the file (operation 302), the selection of the agent (operation 304), and the processing of the file (operation 306) may be controlled utilizing a controller application. In one embodiment, the controller application may schedule execution of one of the agents for processing the prefix tree file. The controller application may be located on the computer on which the file and/or agents are located, for example.
More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing technique may or may not be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.
FIG. 4 shows a system 400 for processing a prefix tree file utilizing a selected agent, in accordance with one embodiment. As an option, the system 400 may be implemented in the context of the architecture and environment of FIGS. 1-3. Of course, however, the system 400 may be implemented in any desired environment. It should also be noted that the aforementioned definitions may apply during the present description.
As shown, a first processing node 402A is in communication with a second processing node 402B. In one embodiment, the first processing node 402A may communicate with the second processing node 402B over a network (e.g. such as any of the networks described above with respect to FIG. 1), but of course the first processing node 402A may also communicate with the second processing node 402B via a local connection. In another embodiment, the first processing node 402A may communicate with the second processing node 402B using a server message block (SMB) protocol, a node specific protocol and/or any other protocol capable of being used for communication purposes. The first processing node 402A and the second processing node 402B may optionally operate as a peer network of independent nodes.
In the context of the present embodiment, the first processing node 402A and the second processing node 402B may each include any of the devices described above with respect to FIGS. 1 and/or 2. For example, the first processing node 402A and the second processing node 402B may each include the Intel® Core™ 2 Duo processor with 250 gigabytes of hard disk space. Thus, the first processing node 402A and the second processing node 402B may optionally include client computers instead of being consolidated on expensive high-end servers. While only two processing nodes 402A and 402B are shown, it should be noted that any desired number of processing nodes may be included in the system 400, thus allowing scalability of the system 400.
The first processing node 402A and the second processing node 402B each include a respective queue of prefix trees 404A and 404B (hereinafter referred to as tries). The queues of tries 404A and 404B may each include a plurality of trie files, where each trie file includes a different trie (e.g. a different type of trie, etc.) that stores different data (e.g. URLs), for example. Thus, different trie files may be distributed across a plurality of processing nodes for providing distributed processing and load-balancing of such trie files, as described in more detail below.
Further, the trie files may be stored locally on a hard disk of a respective one of the first processing node 402A and the second processing node 402B. As an option, the queues of tries 404A and 404B may access trie files in an order based on a time in which the files were last accessed (e.g. modified, etc.). For example, the least recently accessed trie file may be stored first in the queue 404A and 404B, whereas the most recently accessed trie file may be stored last in the queue 404A and 404B. Accordingly, after a top trie file in the queue 404A and 404B is accessed (e.g. for processing, etc.), such trie file may be returned to the end of the queue 404A and 404B.
In one embodiment, each trie may only store data with a common prefix. For example, the data may include URLs, and the common prefix may include the first two characters of a DNS domain name of such URLs. Table 1 illustrates exemplary trie file names that may be stored in the queues 404A and 404B of the first processing node 402A and the second processing node 402B. Of course, it should be noted that such example is set forth for illustrative purposes only, and thus should not be construed as limiting in any manner.
TABLE 1
Processing Node_1
Trie_1: aa.ct-stores data with the common prefix aa
Trie_2: cx.ct-stores data with the common prefix cx
Trie_3: de.ct-stores data with the common prefix de
Processing Node_2
Trie_1: ab.ct-stores data with the common prefix ab
Trie_2: mi.ct-stores data with the common prefix mi
In the context of Table 1, and just by way of example, the URL msdn.microsoft.com would be stored in Trie 2 of Processing Node 2 since the first two characters of the DNS domain name (i.e. “microsoft”) of such URL share the common prefix “mi”. As another example, the URL www.dell.com would be stored in Trie 3 of Processing Node 1 since the first two characters of the DNS domain name (i.e. “dell”) of such URL share the common prefix “de”.
As an option, each trie file may be limited to a predetermined size (e.g. 100 megabytes, etc.), such that multiple tries may be capable of being loaded into random access memory (RAM). If the limit of any trie file is within a threshold amount of being exceeded, a new trie may be created and added to the queue of tries 404A and 404B in which the original trie is stored. Optionally, in the context of an original trie which stores data with a two character common prefix, the new trie may store data with a three character common prefix which includes the two character common prefix. In one embodiment, the respective tree is split into multiple files with a three character prefix.
The first processing node 402A and the second processing node 402B may also each include a crawler controller 406A and 406B. The crawler controller 406A and 406B may include a controller application, in one embodiment. In another embodiment, the crawler controller 406A and 406B may select the top trie file from a respective queue of tries 404A and 404B and may send the selected trie file to one of a plurality of different types of agents 410A, 410B, 412A, 412B, 414A, and 414B located in an agent pool 408A and 408B on the same processing node 402A and 402B. The agent 410A, 410B, 412A, 412B, 414A, and 414B to which the selected trie file is sent may be selected by the crawler controller 406A and 406B.
As shown, the agent pool 408A and 408B may include a first crawler agent 410A and 410B, a second crawler agent 412A and 412B, and a maintenance agent 414A and 414B. The first crawler agent 410A and 410B and/or second crawler agent 412A and 412B may each categorize data stored in trie files stored in the trie file queue 404A and 404B, download content (e.g. web pages, etc.) identified by the data stored in the trie files, analyze such content (e.g. for malware, etc.) and/or gather any other information associated with the data stored in the trie files. New data will be stored in an extra trie file (seed trie). The maintenance agent 414A and 414B may sort data stored in such trie files that do not have a common prefix (e.g. determine a common prefix of the data), separate such data from such trie files, and optionally merge the data into various other trie files based on a common prefix between the other trie files and the data.
Thus, just by way of example, the crawler controller 406A of the first processing node 402A may select a trie file from the top of the queue of tries 404A stored on the first processing node 402A. The crawler controller 406A may determine the type of the selected trie file for determining which processing is to be performed on the trie file. Just by way of example, a trie file storing data without a common prefix may optionally require processing for separating the data and storing the data by common prefix.
Thus, the crawler controller 406A may select an agent 410A, 412A and 414A from the agent pool 408A of the first processing node 402A based on the type of the selected trie file. In the context of the above example where the trie file includes data without a common prefix may, the crawler controller 406A may select the maintenance agent 414A for sorting and merging the data into other trie files with prefixes common to the data. As another option, the crawler controller 406A may select an agent 410A, 412A and 414A based on which agent 410A, 412A and 414A is idle.
As described above, the crawler controller 406A of the first processing node 402A is in communication with the second processing node 402B. Similarly, the crawler controller 406B of the second processing node 402B may also be in communication with the first processing node 402A. In one embodiment, the crawler controller 406A may control the copying of data (e.g. via a file share) stored in trie files of the first processing node 402A to a trie file in the queue of tries 404B stored on the second processing node 402B (e.g. utilizing an SMB protocol, etc.). In this way, the crawler controller 406B of the second processing node 402B may control a merging of the received data into trie files based on an associated common prefix of the data. In another embodiment, the crawler controller 406A of the first processing node 402A may be synchronized with the crawler controller 406B of the second processing node 402B via the node specific protocol.
FIG. 5 shows a method 500 for selecting an agent from a plurality of agents for processing a prefix tree file, in accordance with yet another embodiment. As an option, the method 500 may be carried out in the context of the architecture and environment of FIGS. 1-4. For example, the method 500 may be carried out utilizing the crawler controller 406A and 406B of FIG. 4. Of course, however, the method 500 may be carried out in any desired environment. Again, it should be noted that the aforementioned definitions may apply during the present description.
As shown in decision 502, it is determined whether any agent is in an idle state. In the context of the present embodiment, the idle state may include any state in which an agent is not being utilized (e.g. executed for processing purposes, etc.). For example, each of a plurality of agents within a pool of agents may be monitored for determining whether any of such agents are idle. As another example, each of the plurality of agents may broadcast a signal indicating an associated state, such that an agent with an idle state may be determined based on a signal broadcasted therefrom.
In response to a determination that an agent is idle, a prefix tree stored in a queue with an oldest access time is identified. Note operation 504. In one embodiment, the prefix tree may be included in a file stored in the queue. In another embodiment, the prefix tree with the oldest access time may be identified based on an ordering of prefix trees within the queue. Just by way of example, the queue may store prefix trees in an order based on a latest access time for each prefix tree. Thus, a prefix tree at a top of the queue may be the prefix tree with the oldest access time.
Further, the identified prefix tree is sent to the idle agent, as shown in operation 506. In this way, the agent may be selected for processing the prefix tree. Optionally, the prefix tree may only be sent to an agent capable of processing the type of the prefix tree. The type of the prefix tree may be indicated by the file extension of the file in which the prefix tree is included, in one embodiment. Just by way of example, if the prefix tree includes an uncategorized prefix tree, the prefix tree may only be sent to an agent capable of categorizing data within the prefix tree. Thus, the method 500 may wait to identify an idle agent capable of processing the prefix tree prior to sending the prefix tree.
FIG. 6 shows a system 600 for storing new data in a plurality of distributed prefix trees, in accordance with still yet another embodiment. As an option, the system 600 may be implemented in the context of the architecture and environment of FIGS. 1-5. Of course, however, the system 600 may be implemented in any desired environment. Yet again, it should be noted that the aforementioned definitions may apply during the present description.
As shown, a first computer 602 is in communication with a second computer 604A, a third computer 604B, and a fourth computer 604C. The first computer 602, second computer 604A, third computer 604B, and fourth computer 604C may each include a processing node (e.g. such as one of the processing nodes 402A and 402B of FIG. 4), for example. Moreover, the first computer 602 may be in communication with the second computer 604A, third computer 604B, and fourth computer 604C over a network. While not shown, it should be noted that any of the second computer 604A, third computer 604B, and fourth computer 604C may also be in communication with one another.
The first computer 602 includes a seed trie 606. In one embodiment, the seed trie may be stored in a file with a unique extension (e.g. *.st). In another embodiment, the seed trie may store newly discovered data (e.g. URLs discovered via a crawler crawling web pages on the Internet, etc.). Thus, the seed trie 606 may store data which is unsorted according to common prefixes of such data. While only a single seed trie 606 is shown, it should be noted that the first computer 602 may also include a plurality of seed tries, in another embodiment. For example, each time a module (e.g. crawler) is finished discovering new data, the crawler may produce a new seed trie containing the newly discovered data.
The first computer 602 also includes a plurality of sorted seed tries 608A-C. The sorted seed tries 608A-C may also each be included in a file with a unique extension (e.g. *.sst). Additionally, the sorted seed tries 608A-C may store data which has been sorted from the seed trie according to common prefixes of such data. For example, each sorted seed trie 608A-C may store data with a single common prefix.
In one embodiment, a maintenance agent may process the data stored in the seed trie 606 by sorting (e.g. grouping, etc.) the data according to common prefixes, separating the data based on the sorting, and storing the sorted data in the sorted seed tries 608A-C based on common prefixes. Optionally, the sorting may include identifying a name of a target sorted seed trie 608A-C by extracting a registered domain name from a URL stored in the seed trie 606 and utilizing the first n letters (e.g. where n is predefined) as the name (or part of the name) of the target sorted seed trie 608A-C.
Each of the second computer 604A, third computer 604B, and fourth computer 604C also include a sorted seed trie 610A-C. In one embodiment, the maintenance agent of the first computer 602 may copy each of the sorted seed tries 608A-C to a remote computer as a separate file. Optionally, each of the sorted seed tries 608A-C of the first computer 602 may only be copied to one of the second computer 604A, third computer 604B, and fourth computer 604C if the name of such sorted seed trig 608A-C matches the names of tries already existing on the second computer 604A, third computer 604B, and/or fourth computer 604C. In this way, each of the second computer 604A, third computer 604B, and fourth computer 604C may store a different sorted seed tree 610A-C.
Furthermore, each of the second computer 604A, third computer 604B, and fourth computer 604C include a crawl tree 612A-C. The crawl trees 612A-C may also each be included in a file with a unique extension (e.g. *.ct). The crawl trees 612-C may be utilized for permanently storing data, as an option.
In one embodiment, the maintenance agent of each of the second computer 604A, third computer 604B, and fourth computer 604C may identify data stored in a respective sorted seed tree 610A-C and merge the data into an associated crawl tree 612A-C. The merging may include copying data from the sorted seed trie 610A-C to the associated crawl tree 612A-C. During the merge, the maintenance agent may also perform operations (e.g. based on user-configured constraints) such as splitting the crawl trie 612A-C into multiple crawl tries (e.g. due to size limits of the crawl trie 612A-C), removal of data in the crawl trie 612A-C that is marked for deletion, etc.
The maintenance agent may delete data stored in the crawl trie 612A-C by reorganizing the structure of the crawl trie 612A-C. Data marked for deletion may be deleted in bulk during other maintenance operations, such as merging, for limiting a performance impact on the crawl trie 612A-C, as an option.
The crawl tries 612A-C may be utilized as input for crawlers. For example, each crawler may pull a set of URLs to be visited (e.g. according to defined crawling strategies) from one of the crawl tries 612A-C, may download an associated web page and may update such crawl trie 612A-C with metadata. The metadata may include, for example, a category of the web page, a main language used by the web page, etc.
FIG. 7 shows a system 700 for querying a plurality of distributed tries, in accordance with another embodiment. As an option, the system 700 may be implemented in the context of the architecture and environment of FIGS. 1-6. Of course, however, the system 700 may be implemented in any desired environment. Yet again, it should be noted that the aforementioned definitions may apply during the present description.
In some embodiments, tries may be queried for data stored in such tries. For example, when a central control facility (e.g. user interface) issues a plurality of query requests 704, source code 708 for each of the queries is sent to each processing node that stores a trie file 701. In one embodiment, a query agent 706 local to each processing node compiles and binds the source code 708 as dynamic link library (DLL) files 710 and iterates over data in the trie. Of course, this could result in any type of executable.
Multiple queries may optionally be processed with a single in-order run of a selected trie file 701. Information may be collected during the processing and returned in a normalized format (e.g. query results 712) to a central database server, such as the SQL server 702 shown, via a database connector 714. The information may be returned to the central database server for the purpose of collation and reporting, for example. To this end, the central database server may communicate results of the query to the initiator of the query (e.g. the user interface).
FIG. 8 shows uniform resource locators (URLs) 800 stored in a prefix tree, in accordance with yet another embodiment. As an option, the URLs 800 may be implemented in the context of the architecture and environment of FIGS. 1-7. Of course, however, the URLs 800 may be implemented in any desired environment. Yet again, it should be noted that the aforementioned definitions may apply during the present description.
As shown, the URLs 800 include arrays which are stored in the prefix tree. The prefix tree may include keys represented as strings, as an option. Each element of an array may be a node within the prefix tree. Thus, descendants of a node may share a common prefix with the node for minimizing storage space used by the prefix tree. Just by way of example, the URLs http://www.mcafe* and http://earth.goo* may share the common prefix “http://”. Thus, the URL http://earth.goo* may be a descendant of http://www.mcafe* at the first “w” node within the URL http://www.mcafe*.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (24)

1. A method, comprising:
identifying a file including a prefix tree, wherein the prefix tree includes an ordered hierarchical data structure which stores arrays of data with a common prefix;
selecting an agent from a plurality of agents to process the file; and
processing the file utilizing this agent;
wherein identifying the file includes selecting the file from a queue of files, each file including a different prefix tree.
2. The method of claim 1, wherein the prefix tree stores at least a portion of each of a plurality of uniform resource locators.
3. The method of claim 2, wherein each of the uniform resource locators include the common prefix.
4. The method of claim 1, wherein the file is stored on a client computer.
5. The method of claim 4, wherein the plurality of agents are stored on the client computer.
6. The method of claim 1, wherein each of the plurality of agents includes a command line executable.
7. The method of claim 1, wherein each of the plurality of agents perform different functions.
8. The method of claim 7, wherein each of the plurality of agents perform at least one of a crawling function, a merging function, a separating function, and a reorganization function.
9. The method of claim 1, wherein a predefined number of the agents are capable of executing in parallel.
10. The method of claim 1, wherein the file is selected based on a time of a last access to the file.
11. The method of claim 1, wherein the agent is selected based on a determination of whether the agent is in an idle state.
12. The method of claim 1, wherein the processing includes downloading a web page identified using a uniform resource locator stored in the prefix tree.
13. The method of claim 12, wherein the processing includes analyzing the web page for gathering information associated with the uniform resource locator.
14. The method of claim 1, wherein the processing includes categorizing a target pointed to by uniform resource locator stored in the prefix tree.
15. The method of claim 1, wherein a plurality of different files, each file including a different prefix tree, are distributed across a plurality of devices.
16. The method of claim 15, wherein the different files include different types of prefix trees.
17. The method of claim 1, wherein the agent is selected based on a type of the processing to be performed on the file.
18. The method of claim 1, wherein processing the file includes sorting the data according to common prefixes, separating the data based on the sorting, and storing the sorted data in sorted seed prefix trees, chosen based on the common prefixes.
19. A computer program product embodied on a tangible computer readable medium, comprising:
computer code for identifying a file including a prefix tree, wherein the prefix tree includes an ordered hierarchical data structure which stores arrays of data with a common prefix;
computer code for selecting an agent from a plurality of agents to process the file; and
computer code for processing the file utilizing this agent;
wherein the computer program product is operable such that identifying the file includes selecting the file from a queue of files, each file including a different prefix tree.
20. A system, comprising:
a processor for identifying a file including a prefix tree, wherein the prefix tree includes an ordered hierarchical data structure which stores arrays of data with a common prefix, the processor further adapted to select an agent from a plurality of agents to process the file and process the file utilizing this agent;
wherein the system is operable such that identifying the file includes selecting the file from a queue of files, each file including a different prefix tree.
21. The system of claim 20, further comprising memory coupled to the processor via a bus.
22. A method, comprising:
identifying a file including a prefix tree wherein the prefix tree includes an ordered hierarchical data structure which stores arrays of data with a common prefix;
selecting an agent from a plurality of agents to process the file; and
processing the file utilizing this agent;
wherein the agent is selected based on a type of the prefix tree included in the file.
23. A computer program product embodied on a tangible computer readable medium, comprising:
computer code for identifying a file including a prefix tree, wherein the prefix tree includes an ordered hierarchical data structure which stores arrays of data with a common prefix;
computer code for selecting an agent from a plurality of agents to process the file; and
computer code for processing the file utilizing this agent;
wherein the computer program product is operable such that the agent is selected based on a type of the prefix tree included in the file.
24. A system, comprising:
a processor for identifying a file including a prefix tree, wherein the prefix tree includes an ordered hierarchical data structure which stores arrays of data with a common prefix, the processor further adapted to select an agent from a plurality of agents to process the file and process the file utilizing this agent;
wherein the system is operable such that the agent is selected based on a type of the prefix tree included in the file.
US11/963,589 2007-12-21 2007-12-21 System, method, and computer program product for processing a prefix tree file utilizing a selected agent Active 2029-07-19 US7979418B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/963,589 US7979418B1 (en) 2007-12-21 2007-12-21 System, method, and computer program product for processing a prefix tree file utilizing a selected agent
US13/150,947 US8560521B2 (en) 2007-12-21 2011-06-16 System, method, and computer program product for processing a prefix tree file utilizing a selected agent

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/963,589 US7979418B1 (en) 2007-12-21 2007-12-21 System, method, and computer program product for processing a prefix tree file utilizing a selected agent

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/150,947 Continuation US8560521B2 (en) 2007-12-21 2011-06-16 System, method, and computer program product for processing a prefix tree file utilizing a selected agent

Publications (1)

Publication Number Publication Date
US7979418B1 true US7979418B1 (en) 2011-07-12

Family

ID=44245636

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/963,589 Active 2029-07-19 US7979418B1 (en) 2007-12-21 2007-12-21 System, method, and computer program product for processing a prefix tree file utilizing a selected agent
US13/150,947 Active US8560521B2 (en) 2007-12-21 2011-06-16 System, method, and computer program product for processing a prefix tree file utilizing a selected agent

Family Applications After (1)

Application Number Title Priority Date Filing Date
US13/150,947 Active US8560521B2 (en) 2007-12-21 2011-06-16 System, method, and computer program product for processing a prefix tree file utilizing a selected agent

Country Status (1)

Country Link
US (2) US7979418B1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014064564A1 (en) * 2012-10-22 2014-05-01 Telefonaktiebolaget L M Ericsson (Publ) Method and system of packet based identifier locator network protocol (ilnp) load balancing and routing
CN103823823B (en) * 2013-07-08 2016-12-28 电子科技大学 Denormalization policy selection method based on Frequent Itemsets Mining Algorithm
US10304086B2 (en) * 2011-06-22 2019-05-28 Skyhook Wireless, Inc. Techniques for estimating demographic information
US10438000B1 (en) * 2017-09-22 2019-10-08 Symantec Corporation Using recognized backup images for recovery after a ransomware attack
US10725870B1 (en) * 2018-01-02 2020-07-28 NortonLifeLock Inc. Content-based automatic backup of images

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11089056B2 (en) * 2018-09-28 2021-08-10 Sophos Limited Intrusion detection with honeypot keys
CN109508422A (en) * 2018-12-05 2019-03-22 南京邮电大学 The height of multithreading intelligent scheduling is hidden crawler system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050192966A1 (en) * 2004-03-01 2005-09-01 Hilbert David M. Remote file management
US20080033998A1 (en) * 2006-08-03 2008-02-07 Yahoo! Inc. Agent for identifying domains with content arranged for display by a mobile device
US20080114765A1 (en) * 2005-06-30 2008-05-15 Fujitsu Limited Computer program, method, and apparatus for data sorting

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6073142A (en) 1997-06-23 2000-06-06 Park City Group Automated post office based rule analysis of e-mail messages and other data objects for controlled distribution in network environments
US5987610A (en) 1998-02-12 1999-11-16 Ameritech Corporation Computer virus screening methods and systems
US6094706A (en) * 1998-03-02 2000-07-25 International Business Machines Corporation Caching in a data processing system using the pigeon hole principle
US6460050B1 (en) 1999-12-22 2002-10-01 Mark Raymond Pace Distributed content identification system
JP3879350B2 (en) * 2000-01-25 2007-02-14 富士ゼロックス株式会社 Structured document processing system and structured document processing method
US6976090B2 (en) * 2000-04-20 2005-12-13 Actona Technologies Ltd. Differentiated content and application delivery via internet
US7716163B2 (en) * 2000-06-06 2010-05-11 Microsoft Corporation Method and system for defining semantic categories and actions
US6901519B1 (en) 2000-06-22 2005-05-31 Infobahn, Inc. E-mail virus protection system and method
US7080073B1 (en) * 2000-08-18 2006-07-18 Firstrain, Inc. Method and apparatus for focused crawling
DE10237875A1 (en) * 2002-08-19 2004-03-04 Siemens Ag Device, in particular automation device, with a file directory structure stored in a file
US7328403B2 (en) * 2003-10-22 2008-02-05 Intel Corporation Device for structured data transformation
US7454429B2 (en) * 2004-02-14 2008-11-18 Alan S Rojer Declarative Dispatch
US20060053097A1 (en) * 2004-04-01 2006-03-09 King Martin T Searching and accessing documents on private networks for use with captures from rendered documents
US7877366B2 (en) * 2004-03-12 2011-01-25 Oracle International Corporation Streaming XML data retrieval using XPath
US7861304B1 (en) * 2004-05-07 2010-12-28 Symantec Corporation Pattern matching using embedded functions
US7711542B2 (en) * 2004-08-31 2010-05-04 Research In Motion Limited System and method for multilanguage text input in a handheld electronic device
US8037527B2 (en) * 2004-11-08 2011-10-11 Bt Web Solutions, Llc Method and apparatus for look-ahead security scanning
US7680791B2 (en) * 2005-01-18 2010-03-16 Oracle International Corporation Method for sorting data using common prefix bytes
US7979368B2 (en) * 2005-07-01 2011-07-12 Crossbeam Systems, Inc. Systems and methods for processing data flows
US8769127B2 (en) * 2006-02-10 2014-07-01 Northrop Grumman Systems Corporation Cross-domain solution (CDS) collaborate-access-browse (CAB) and assured file transfer (AFT)
US7599931B2 (en) * 2006-03-03 2009-10-06 Microsoft Corporation Web forum crawler
CA2675216A1 (en) * 2007-01-10 2008-07-17 Nick Koudas Method and system for information discovery and text analysis
US7720869B2 (en) * 2007-05-09 2010-05-18 Illinois Institute Of Technology Hierarchical structured abstract file system
US20080306949A1 (en) * 2007-06-08 2008-12-11 John Martin Hoernkvist Inverted index processing
US20090063538A1 (en) * 2007-08-30 2009-03-05 Krishna Prasad Chitrapura Method for normalizing dynamic urls of web pages through hierarchical organization of urls from a web site
US20090089278A1 (en) * 2007-09-27 2009-04-02 Krishna Leela Poola Techniques for keyword extraction from urls using statistical analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050192966A1 (en) * 2004-03-01 2005-09-01 Hilbert David M. Remote file management
US20080114765A1 (en) * 2005-06-30 2008-05-15 Fujitsu Limited Computer program, method, and apparatus for data sorting
US20080033998A1 (en) * 2006-08-03 2008-02-07 Yahoo! Inc. Agent for identifying domains with content arranged for display by a mobile device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Trie," Wikipedia, last modified Dec. 1, 2007, http://en.wikipedia.org/wiki/Trie.

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10304086B2 (en) * 2011-06-22 2019-05-28 Skyhook Wireless, Inc. Techniques for estimating demographic information
WO2014064564A1 (en) * 2012-10-22 2014-05-01 Telefonaktiebolaget L M Ericsson (Publ) Method and system of packet based identifier locator network protocol (ilnp) load balancing and routing
CN104718733A (en) * 2012-10-22 2015-06-17 瑞典爱立信有限公司 Method and system of packet based identifier locator network protocol (ILNP) load balancing and routing
CN104718733B (en) * 2012-10-22 2018-02-13 瑞典爱立信有限公司 The method and system of packet-based identifier finger URL procotol (ILNP) load balance and Route Selection
CN103823823B (en) * 2013-07-08 2016-12-28 电子科技大学 Denormalization policy selection method based on Frequent Itemsets Mining Algorithm
US10438000B1 (en) * 2017-09-22 2019-10-08 Symantec Corporation Using recognized backup images for recovery after a ransomware attack
US10725870B1 (en) * 2018-01-02 2020-07-28 NortonLifeLock Inc. Content-based automatic backup of images

Also Published As

Publication number Publication date
US8560521B2 (en) 2013-10-15
US20110246531A1 (en) 2011-10-06

Similar Documents

Publication Publication Date Title
US8560521B2 (en) System, method, and computer program product for processing a prefix tree file utilizing a selected agent
KR100971863B1 (en) System and method for batched indexing of network documents
US9026566B2 (en) Generating equivalence classes and rules for associating content with document identifiers
US6873982B1 (en) Ordering of database search results based on user feedback
US9576024B2 (en) Hierarchy of servers for query processing of column chunks in a distributed column chunk data store
US9443019B2 (en) Optimized web domains classification based on progressive crawling with clustering
CA2511098C (en) Dispersing search engine results by using page category information
US8266134B1 (en) Distributed crawling of hyperlinked documents
US7308643B1 (en) Anchor tag indexing in a web crawler system
CN109885615B (en) Index-based block chain light client-oriented range query verifiable query method
CN107645508A (en) A kind of data handling system, method, client and server
US8027961B2 (en) System and method for composite record keys ordered in a flat key space for a distributed database
US8041893B1 (en) System and method for managing large filesystem-based caches
CA2475328A1 (en) Improved systems and methods for ranking documents based upon structurally interrelated information
US20080133460A1 (en) Searching descendant pages of a root page for keywords
CN113767377A (en) Dynamic hash function combination for change detection in distributed storage systems
WO2011053377A1 (en) System for user driven ranking of web pages
US8930518B2 (en) Processing of write requests in application server clusters
Deka NoSQL web crawler application
JP2019103039A (en) Firewall device
JP5463988B2 (en) Configuration information management apparatus, configuration information management program, and configuration information management method
US20240289333A1 (en) Metadata search via n-gram index
AU2002351296B2 (en) System and method for processing a request using multiple database units
US20210089507A1 (en) Systems and methods for providing an adaptive attention-based bloom filter for tree-based information repositories
Sachdeva et al. A novel focused crawler with anti-spamming approach & fast query retrieval

Legal Events

Date Code Title Description
AS Assignment

Owner name: MCAFEE, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SCHLEMMER, ANDREAS;STEINER, THOMAS C.H.;BLAIMSCHEIN, PETER;REEL/FRAME:020296/0277

Effective date: 20071221

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: MCAFEE, LLC, CALIFORNIA

Free format text: CHANGE OF NAME AND ENTITY CONVERSION;ASSIGNOR:MCAFEE, INC.;REEL/FRAME:043665/0918

Effective date: 20161220

AS Assignment

Owner name: JPMORGAN CHASE BANK, N.A., NEW YORK

Free format text: SECURITY INTEREST;ASSIGNOR:MCAFEE, LLC;REEL/FRAME:045055/0786

Effective date: 20170929

Owner name: MORGAN STANLEY SENIOR FUNDING, INC., MARYLAND

Free format text: SECURITY INTEREST;ASSIGNOR:MCAFEE, LLC;REEL/FRAME:045056/0676

Effective date: 20170929

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8

AS Assignment

Owner name: MORGAN STANLEY SENIOR FUNDING, INC., MARYLAND

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE REMOVE PATENT 6336186 PREVIOUSLY RECORDED ON REEL 045056 FRAME 0676. ASSIGNOR(S) HEREBY CONFIRMS THE SECURITY INTEREST;ASSIGNOR:MCAFEE, LLC;REEL/FRAME:054206/0593

Effective date: 20170929

Owner name: JPMORGAN CHASE BANK, N.A., NEW YORK

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE REMOVE PATENT 6336186 PREVIOUSLY RECORDED ON REEL 045055 FRAME 786. ASSIGNOR(S) HEREBY CONFIRMS THE SECURITY INTEREST;ASSIGNOR:MCAFEE, LLC;REEL/FRAME:055854/0047

Effective date: 20170929

AS Assignment

Owner name: MCAFEE, LLC, CALIFORNIA

Free format text: RELEASE OF INTELLECTUAL PROPERTY COLLATERAL - REEL/FRAME 045055/0786;ASSIGNOR:JPMORGAN CHASE BANK, N.A., AS COLLATERAL AGENT;REEL/FRAME:054238/0001

Effective date: 20201026

AS Assignment

Owner name: MCAFEE, LLC, CALIFORNIA

Free format text: RELEASE OF INTELLECTUAL PROPERTY COLLATERAL - REEL/FRAME 045056/0676;ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS COLLATERAL AGENT;REEL/FRAME:059354/0213

Effective date: 20220301

AS Assignment

Owner name: JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT AND COLLATERAL AGENT, NEW YORK

Free format text: SECURITY INTEREST;ASSIGNOR:MCAFEE, LLC;REEL/FRAME:059354/0335

Effective date: 20220301

AS Assignment

Owner name: JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT, NEW YORK

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE THE PATENT TITLES AND REMOVE DUPLICATES IN THE SCHEDULE PREVIOUSLY RECORDED AT REEL: 059354 FRAME: 0335. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:MCAFEE, LLC;REEL/FRAME:060792/0307

Effective date: 20220301

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 12