
AU2007201222A1 - Method and apparatus for categorizing and presenting documents of a distributed database - Google Patents


Info

Publication number
AU2007201222A1
Authority
AU
Australia
Prior art keywords
pages
resulting
commercial
page
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2007201222A
Inventor
Daniel C. Fain
Paul T. Ryan
Peter Savich
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Altaba Inc
Original Assignee
Yahoo Inc
Yahoo Inc until 2017
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2003204327A external-priority patent/AU2003204327B2/en
Application filed by Yahoo Inc, Yahoo Inc until 2017 filed Critical Yahoo Inc
Priority to AU2007201222A priority Critical patent/AU2007201222A1/en
Publication of AU2007201222A1 publication Critical patent/AU2007201222A1/en
Assigned to YAHOO! INC. reassignment YAHOO! INC. Request for Assignment Assignors: OVERTURE SERVICES, INC.
Abandoned legal-status Critical Current

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

21, MAR. 2007 16:34 SPRUSON FERGUSON 92615486 0NO. 6353 P. 4 S&F Ref: 636898D1
AUSTRALIA
PATENTS ACT 1990
COMPLETE SPECIFICATION FOR A STANDARD PATENT
Name and Address of Applicant: Overture Services, Inc., of 74 North Pasadena Avenue, 3rd Floor, Pasadena, California, 91103, United States of America
Actual Inventor(s): Daniel C. Fain, Paul T. Ryan, Peter Savich
Address for Service: Spruson & Ferguson, St Martins Tower Level 31 Market Street Sydney NSW 2000 (CCN 3710000177)
Invention Title: Method and apparatus for categorizing and presenting documents of a distributed database
The following statement is a full description of this invention, including the best method of performing it known to me/us:-
5545c(725258_1)
COMS ID No: SBMI-06700027 Received by IP Australia: Time 16:49 Date 2007-03-21
METHOD AND APPARATUS FOR CATEGORIZING AND PRESENTING DOCUMENTS OF A DISTRIBUTED DATABASE
BACKGROUND
The transfer of information over computer networks has become an increasingly important means by which institutions, corporations, and individuals do business. Computer networks have grown over the years from independent and isolated entities established to serve the needs of a single group into vast internets which interconnect disparate physical networks and allow them to function as a coordinated system. Currently, the largest computer network in existence is the Internet. The Internet is a worldwide interconnection of computer networks that communicate using a common protocol. Millions of computers, from low end personal computers to high end supercomputers, are connected to the Internet.
The Internet has emerged as a large community of electronically connected users located around the world who readily and regularly exchange vast amounts of information. The Internet continues to serve its original purposes of providing for access to and exchange of information among government agencies, laboratories, and universities for research and education. In addition, the Internet has evolved to serve a variety of interests and forums that extend beyond its original goals. In particular, the Internet is rapidly transforming into a global electronic marketplace of goods and services as well as of ideas and information.
This transformation of the Internet into a global marketplace was driven in large part by the introduction of common protocols such as HTTP (HyperText Transfer Protocol) and TCP/IP (Transmission Control Protocol/Internet Protocol) for facilitating the easy publishing and exchange of information. The Internet is thus a unique distributed database designed to give wide access to a large universe of documents published from an unlimited number of users and sources. The database records of the Internet are in the form of documents known as "pages" or collections of pages known as "sites." Pages and sites reside on servers and are accessible via the common protocols. The Internet is therefore a vast database of information dispersed across seemingly countless individual computer systems that is constantly changing and has no centralized organization.
Computers connected to the Internet may access pages via a program known as a browser, which has a powerful, simple-to-learn user interface, typically graphical, and enables every computer connected to the Internet to be both a publisher and consumer of information. Another powerful technique enabled by browsers is known as hyperlinking, which permits page authors to create links to other pages that users can then retrieve by using simple commands, for example pointing and clicking within the browser. Thus each page exists within a nexus of semantically related pages because each page can be both a target and a source for hyperlinking, and this connectivity can be captured to some extent by mapping and comparing how those hyperlinks interrelate. In addition, the pages may be constructed in any one of a variety of syntaxes, such as HyperText Markup Language (HTML) or eXtensible Markup Language (XML), and may include multimedia information content such as graphics, audio, and still and moving pictures.
Because any person with a computer and a connection to the Internet may publish their own page on the Internet as well as access any other publicly available page, the Internet enables a many-to-many model of information production and consumption that is not possible or practical in the offline world. Effective search services, including search engines, are an important part of the many-to-many model, enabling information consumers to rapidly and reliably identify relevant pages among a mass of irrelevant yet similar pages. Because of the many-to-many model, a presence on the Internet has the capability to introduce a worldwide base of consumers to businesses, individuals, and institutions seeking to advertise their products and services to consumers who are potential customers. Furthermore, the ever increasing sophistication in the design of pages, made possible by the exponential increase in data transmission rates, computer processing speeds and browser functionality, makes the Internet an increasingly attractive medium for facilitating and conducting commercial transactions as well as advertising and enabling such transactions. Because the Internet allows direct identification of and connection between businesses and targeted consumers, it has the potential to be a powerfully effective advertising medium.
The availability of powerful new tools that facilitate the development and distribution of Internet content (this includes information of any kind, in any form or format) has led to a proliferation of information, products, and services offered through the Internet and a dramatic growth in the number and types of consumers using the Internet. International Data Corporation, commonly referred to as IDC, has estimated that the number of Internet users will grow to approximately 320 million worldwide by the end of 2002. In addition, commerce conducted over the Internet has grown and is expected to grow dramatically. IDC estimates that the percentage of Internet users buying goods and services on the Internet will increase to approximately 40% in 2002, and that the total value of goods and services purchased over the Internet will increase to approximately $425.7 billion.
Thus, the Internet has emerged as an attractive new medium for advertisers of information, products and services ("advertisers") to reach not only consumers in general, but also to enable increased capabilities to identify and target specific groups of consumers based on their preferences, characteristics or behaviors. However, the Internet is composed of an unlimited number of sites dispersed across millions of different computer systems all over the world, and so advertisers face the daunting task of locating and targeting the specific groups or subgroups of consumers who are potentially interested in their information, products and/or services.
Advertisers rely on search services to help consumers locate the advertisers' sites. Search services, including directories and search engines, have been developed to index and search the information available on the Internet and thereby help users, including consumers, locate information, products and services of interest. These search services enable users, including consumers, to search the Internet for a listing of sites based on a specific keyword topic, product, or service of interest as described by the users in their own language. Because search services are the most frequently used tool on the Internet after email, sites providing search services offer advertisers significant reach into the Internet audience and create the opportunity to target consumer interests based on keyword or topical search requests.
Search services are generally created by search engine providers who electronically review the pages of the Internet and create an index and database based on that review. The search engine providers may offer the search services directly to consumers or may provide the search services to a third party who then provides the search services to consumers. Usually, the databases are created either by crawling the Internet and making a local copy of every page or aspect thereof into a memory device, or by collecting submissions from the providers of the pages (the "Resulting Pages"). This can include static and/or dynamic content, whether text, image, audio, video or still images. Alternatively, only certain aspects of the pages may be copied such as the URL, title or text. Each Resulting Page is indexed for later reference.
Thus when a search of the Internet is requested by a user, the search engine does not actually search the Internet in real-time, but rather searches its own index and database for the relevant Resulting Pages ("search results" or "listings"). The search results are then presented to the user as either copies of the actual pages or a listing of pages that may be accessed via hyperlink.
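The point that a query is answered from a pre-built index rather than from the live Internet can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `build_index` and `search`, the example URLs, and the choice of a simple word-to-URL inverted index, which the specification does not prescribe.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose stored text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical local copies saved at crawl time (not fetched at query time).
local_copies = {
    "http://example.com/shop": "buy running shoes online",
    "http://example.org/guide": "how running shoes are made",
}
index = build_index(local_copies)
print(sorted(search(index, "running shoes")))
```

Because `search` touches only `index`, a page changed on the live site after the crawl would not be reflected until the index is rebuilt.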
Many known search engines use automated search technology to catalog search results which generally rely on invisible site descriptions known as "meta tags" that are authored by site promoters. Because advertisers may freely tag or have tagged their sites as they choose, many pages are given similar meta tags, which increases the difficulty of providing relevant search results. In addition, most known search engines rely on their own hierarchy of semantic categories into which indexed pages are categorized. This is a top-down categorization approach where the categories are semantically related irrespective of their commercial or non-commercial nature. Therefore, known search engines do not provide a bottom-up, customizable categorization of search results based upon the page or site's commercial nature and relevance.
Additionally, some advertisers and other site promoters insert popular search terms into their site's meta tags which are not relevant to their pages so that these pages may attract additional consumer attention at little to no marginal cost. Such pages yield many undesirable results and are referred to as "spam pages." Generally, pages are referred to as "spam" if they include some mechanism for the purpose of deceiving search engines and/or relevance ordering algorithms and may also redirect users towards sites that are not relevant to the user's original search. Many such mechanisms and techniques exist and include, but are not limited to, including meta tags that do not reflect the true nature of the page. Usually, spam pages are commercial in nature. That is, they attempt to sell something to users.
Many known search engines are simply not equipped to prioritize results in accordance with consumers' preferences. Known search engines also do not provide any way to determine whether each page in a listing is commercial in nature and to categorize the listing on the basis of the commercial nature of each page. When this is done, the search results can be processed to provide a more useful organization according to the consumer's intent (whether it be to carry out a commercial transaction or to seek information) in initiating the search. For example, a consumer seeking information on a given topic may wish to distinguish pages that are primarily informational in nature from pages that are primarily commercial in nature. In another example, a consumer may wish to distinguish pages that are primarily commercial in nature and relevant to the consumer's request from unwanted or spam pages.
Moreover, in known search engines, a consumer attempting to locate a site for purchasing goods or services will also be presented with a vast number of sites that might relate to the item but do not facilitate the purchase of that item. Likewise, consumers interested only in locating informational sites for an item will also be presented with many commercial sites for purchasing the item that may not provide the information they are seeking. Therefore, the consumer's desired result pages are hidden among large numbers of pages that do not correspond with the consumer's ultimate goal because known search engines are not able to distinguish either the consumer's intent for the search or the commercial or non-commercial nature of the search results.
Thus, the known search engines do not provide an effective means for users to categorize the type of search results for which they are looking, informational or commercial, or for advertisers seeking to control their exposure and target their distribution of information to interested consumers. Current paradigms for presenting search results make no page by page distinction between informational and commercial sources of information, and instead mix both types of results depending purely on the relevance assigned to them as responses to the user's original search query.
Known methods used by advertisers to control their exposure and target their distribution, such as banner advertising, follow traditional advertising paradigms and fail to utilize the unique attributes of the Internet's many-to-many publishing model. Furthermore, to the extent that banner ads are found in the search results, they often fail to attract consumer interest because the consumer is looking in a directed manner for search results on that page, not for a banner.
Thus, the traditional paradigms relating to Internet advertising and search engines fail to effectively categorize and deliver relevant information to interested parties in a timely and cost-effective manner. Therefore, consumers must manually sort through all search results to ultimately locate the type of results (commercial or non-commercial) in which they are interested. Because Internet advertising can, however, offer a level of targetability, interactivity, and measurability not generally available in other media, the ability to categorize and clearly present identified sets of commercial and non-commercial results increases consumer satisfaction and facilitates increased economic efficiency by reducing the amount of manual sorting required of users.
Ideally, advertisers should be able to improve their visibility in an Internet search results list so that their pages not only appear prominently in the listing but are not masked by a multitude of other non-commercial pages (see US Patent No. 6,269,361, incorporated herein by reference). Likewise, consumers should be able to have their search results reliably categorized and clearly presented as either informational or commercial. Without a reliable means to distinguish between commercial and non-commercial pages, known search engines cannot exploit the true potential of the targeted market approach made possible by the Internet.
Thus, the search engine functionality of the Internet needs to be focused in a new direction to facilitate an online marketplace which offers consumers quick, relevant and customizable search results while simultaneously providing advertisers with a reliable, verifiable and cost-effective way to target consumers and position the advertisers' products and services within a listing. A consumer utilizing a search engine that facilitates this on-line marketplace will find companies or businesses that offer the products or services that the consumer is seeking without the distraction of non-commercial pages. Additionally, while the user is seeking strictly informational resources, the user will not be bothered by spam pages or irrelevant commercial pages.
SUMMARY
A number of aspects and embodiments of the invention are described herein, including:
A system and method for examining and categorizing records in a distributed database as commercial or non-commercial records and then presenting those records in response to a database query submitted by a user or network-defined settings.
A customizable search engine that permits users to organize search results listings based upon the commercial nature of the search result and to allow users to specify presentation rules based upon categories and user preferences.
A customizable search engine that permits each search engine service customer to organize search results listings based upon the commercial nature of the search result and to allow the search engine service customer to specify presentation rules for the search results based upon categories and search engine service customer preferences.
A system and method for enabling search engine service providers or users to dynamically specify the importance of various transactional criteria and threshold values in order to create a flexible scale of value based on the commercial nature of a record in order to assign a transactional rating and therefore a commercial or non-commercial designation for each record.
A system and method for categorizing and presenting search results by combining a transactional rating with a quality score and a spam score in order to assign a commercial score and then rank or classify such results according to such score.
A system and method for categorizing documents in a distributed database to create categorized documents by initially assuming all documents are non-commercial, filtering out all commercial documents and placing them in a first category and using the first category as a collection of advertiser prospects for a pay for performance search engine.
A cost-effective system and method for managing the operation of a pay for performance search engine by automatically generating advertiser sales leads by initially categorizing pages as commercial or non-commercial and then further categorizing commercial pages as existing customers or sales leads.
A system and method for categorizing records in a distributed database to identify commercial records and compare those records against a pay for performance search engine's listings in order to further categorize commercial records as either participating advertisers or non-participating advertisers.
A system and method of sales lead generation for pay for performance search engine advertisers by organizing and presenting non-participating commercial records to a pay for performance search engine sales staff according to dynamically specified criteria.
Described herein are methods for creating categorized documents, categorizing documents in a distributed database and categorizing Resulting Pages. Also described herein is an apparatus for searching a distributed database.
A first method for creating categorized documents generally comprises: initially assuming all documents are of type 1; filtering out all type 2 documents and placing them in a first category; filtering out all type 3 documents and placing them in a second category; and defining all remaining documents as type 4 documents and placing all type 4 documents in a third category.
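The successive filtering in the first method can be sketched as follows. This is an illustration only: the function name `categorize_by_filters` and the two predicate parameters `is_type2` and `is_type3` are hypothetical, since the text does not say how a document's type is decided.

```python
def categorize_by_filters(documents, is_type2, is_type3):
    """Split documents into three categories by successive filtering.

    is_type2 and is_type3 are caller-supplied predicates (assumptions here).
    """
    remaining = list(documents)                        # every document starts as type 1
    first = [d for d in remaining if is_type2(d)]      # type 2 -> first category
    remaining = [d for d in remaining if not is_type2(d)]
    second = [d for d in remaining if is_type3(d)]     # type 3 -> second category
    third = [d for d in remaining if not is_type3(d)]  # the rest become type 4
    return first, second, third

docs = ["shop", "spam", "wiki"]
print(categorize_by_filters(docs, lambda d: d == "shop", lambda d: d == "spam"))
```

The order of the filters matters: a document matching both predicates lands in the first category because it is removed before the second filter runs.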
A second method for categorizing documents in a distributed database generally comprises: assuming all documents in the distributed database are non-commercial in nature; filtering out all documents that are commercial in nature from the documents, wherein the documents that are commercial in nature are commercial documents; and creating sales leads from the commercial documents. In one embodiment of this method, the documents are pages and the distributed database is the Internet.
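A minimal sketch of the second method follows. The second method itself only says that sales leads are created from the commercial documents; the step that excludes existing customers is borrowed from the aspects listed earlier in the Summary. The names `create_sales_leads`, `is_commercial`, and `existing_customers` are hypothetical.

```python
def create_sales_leads(documents, is_commercial, existing_customers):
    """Return (commercial documents, sales leads).

    Every document is treated as non-commercial until is_commercial says otherwise.
    Leads are commercial documents not already in existing_customers (an assumption
    drawn from the Summary, not from the second method itself).
    """
    commercial = [d for d in documents if is_commercial(d)]
    leads = [d for d in commercial if d not in existing_customers]
    return commercial, leads

docs = ["a.example.com", "b.example.com", "c.example.org"]
commercial, leads = create_sales_leads(
    docs,
    is_commercial=lambda d: d.endswith(".com"),  # toy stand-in for a real classifier
    existing_customers={"a.example.com"},
)
print(commercial, leads)
```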
A third method for categorizing Resulting Pages into categories generally comprises: designating a first category as commercial pages and a second category as informational pages; determining a quality score q(wi) for each Resulting Page; determining a transactional rating for each Resulting Page; deriving a propagation matrix P; determining a commercial score for each Resulting Page; filtering out all Resulting Pages that meet or exceed a commercial score threshold value; wherein the Resulting Pages that meet or exceed the commercial score threshold value are placed in the first category and all remaining Resulting Pages are placed in the second category.
A further method for categorizing a plurality of Resulting Pages into categories generally comprises: determining whether each of the plurality of Resulting Pages is a spam page; determining a quality score q(wi) for each of the plurality of Resulting Pages; determining a transactional rating r(w) for each of the plurality of Resulting Pages; deriving a propagation matrix P; determining a commercial score for each of the plurality of Resulting Pages; filtering out all spam-inclusive commercial pages from the plurality of Resulting Pages; filtering out all spam pages from the spam-inclusive commercial pages; placing all commercial pages in a commercial category and placing all remaining Resulting Pages into an information category.
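The final filtering steps of this method can be sketched as below. The sketch assumes that the commercial score and the spam determination have already been computed for each page; how they follow from the quality score, the transactional rating and the propagation matrix is not stated in this passage, so they appear only as precomputed fields. The `Page` class, its field names, and the threshold value are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    commercial_score: float  # assumed to be computed upstream
    is_spam: bool            # assumed to be determined upstream

def categorize(pages, threshold):
    """Return (commercial, information, spam) lists of pages."""
    # Pages at or above the threshold are commercial, spam pages included.
    spam_inclusive = [p for p in pages if p.commercial_score >= threshold]
    # Spam pages are then filtered out of the spam-inclusive commercial pages.
    commercial = [p for p in spam_inclusive if not p.is_spam]
    spam = [p for p in spam_inclusive if p.is_spam]
    # All remaining Resulting Pages go to the information category.
    information = [p for p in pages if p.commercial_score < threshold]
    return commercial, information, spam

pages = [
    Page("http://shop.example.com", 0.9, False),
    Page("http://spam.example.com", 0.8, True),
    Page("http://wiki.example.org", 0.1, False),
]
commercial, information, spam = categorize(pages, threshold=0.5)
print([p.url for p in commercial], [p.url for p in information], [p.url for p in spam])
```

Returning the spam pages as a separate list, rather than discarding them, is a design choice of this sketch so that a caller can decide whether to exclude them.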
A method for searching a distributed database generally comprises: entering search terms or phrases into a system; generating documents containing keywords that match the search terms or phrases; categorizing search results into categories according to categorization criteria to create categorized documents; and presenting the categorized documents.
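The four steps of this search method (enter terms, match documents, categorize, present) might be sketched end to end as follows. The names `run_search` and `present`, the dictionary layout of a document, and the toy categorization rule are all assumptions made for the example.

```python
def run_search(query, documents, categorize):
    """Match documents against the query terms, then group matches by category."""
    terms = query.lower().split()
    matches = [
        d for d in documents
        if any(t in d["text"].lower().split() for t in terms)
    ]
    categorized = {}
    for d in matches:
        categorized.setdefault(categorize(d), []).append(d["url"])
    return categorized

def present(categorized):
    """Render each category as a heading followed by its indented URLs."""
    lines = []
    for category in sorted(categorized):
        lines.append(category.upper())
        lines.extend("  " + url for url in categorized[category])
    return "\n".join(lines)

docs = [
    {"url": "u1", "text": "buy shoes now"},
    {"url": "u2", "text": "history of shoes"},
    {"url": "u3", "text": "cat pictures"},
]
# Toy categorization criterion: the word "buy" marks a page as commercial.
rule = lambda d: "commercial" if "buy" in d["text"].split() else "informational"
result = run_search("shoes", docs, rule)
print(present(result))
```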
Also described herein is a search engine and database for a distributed database, generally comprising: at least one memory device comprising at least one Internet cache and an Internet index; a computing apparatus comprising a crawler in communication with the Internet cache and the Internet, an indexer in communication with the Internet index and the Internet cache, a transactional score generator in communication with the Internet cache, and a category assignor in communication with the Internet cache; a search server in communication with the Internet cache and the Internet index; and a user interface in communication with the search server.
The system provides numerous embodiments that will be understood by those skilled in the art based on the present disclosure. Some of these are described below and are represented in the drawings by means of several figures, in which:
BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS
FIG. 1A is a block diagram of page categorization, according to an embodiment of the present invention;
FIG. 1B is a block diagram of page categorization, according to another embodiment of the present invention;
FIG. 2 is a flow chart of a system for determining whether a page is a Commercial Page, according to an embodiment of the present invention;
FIG. 3 is a flow chart of a system for determining a transactional rating for a page, according to an embodiment of the present invention;
FIG. 4 is a flow chart of a system for creating a propagation matrix, according to an embodiment of the present invention;
FIG. 5 is a flow chart of a system for providing customized categorization of search results, according to an embodiment of the present invention;
FIG. 6 is a flow chart of a system for providing customized search results and the presentation of the customized search results, according to an embodiment of the present invention;
FIG. 7 is a flow chart of a system for automating the collection of sales leads for a pay for performance search engine sales staff, according to an embodiment of the present invention; and
FIG. 8 is a diagram of an apparatus for categorizing and displaying search results, according to an embodiment of the present invention.
DETAILED DESCRIPTION
Described herein is a method and apparatus for identifying documents in a distributed database. One embodiment comprises a heuristic for identifying pages that are commercial in nature and providing a system and method for the dynamic categorization and presentation of both commercial pages and informational pages in real-time to an advertiser, search engine provider or user. This system may be used in any context where it is useful to categorize search results based upon the commercial nature of those pages, and can be utilized in a multitude of forms from a browser plug-in to a stand-alone application to a back-end search engine or search engine tool. In addition, the system can be used to provide unique operational benefits to a pay for performance search engine provider by automating a portion of the sales cycle and enabling a collaborative account management environment between advertisers and the pay for performance search engine provider.
Distinct sets of search results for commercial pages and informational pages, returned in response to a user-defined query, are provided to advertisers, search engine service providers and users. The system distinguishes pages according to the commercial nature of each page, and thereby provides more relevant results to users seeking either information or a commercial transaction, without confusing the two categories of search results. The system also enables complete customization with regard to the set of criteria used to categorize search results, the importance of each such criterion in the determination of such categorization, and the categorization and presentation of such search results to the user.
Methods and apparatuses for statically and dynamically categorizing and presenting the records of a distributed database are disclosed. Descriptions of specific embodiments are provided only as examples. Various modifications will be readily apparent to those skilled in the art, and the invention is not intended to be limited to the embodiments described. Identical features are marked with identical reference symbols in the indicated drawings.
Described herein is a customizable system for identifying and categorizing the records in, or the results of a search of the records in, a distributed database, and for categorizing and presenting the records or search results according to the commercial nature of the record in a more organized, more easily understood, and therefore more useful manner. The following descriptions detail how the pages of or the results of a search of the Internet may be identified and categorized as commercial and non-commercial (informational), but it is readily understood that the records of a distributed database, including the Internet, may be categorized into a limitless variety of categories, including sub-categories of the commercial and non-commercial categories. Other categories may include on-line shopping and advertisements for traditional stores and services. Alternatively, or additionally, the records in or the search results of the records in a distributed database may be categorized and presented geographically, via price range, and by many other criteria according to a variety of user-specified variables.
Additionally, the methods disclosed herein may be used across any distributed database coupled in any manner to any kind of network, including Local Area Networks (LAN) and Wide Area Networks (WAN), and not just the Internet.
Referring now to the drawings, FIGs. 1A and 1B show how the search results of a search of the Internet can be categorized. A search of the Internet is actually a search of a database of the contents of the Internet that can be generated through the use of a crawler. The crawler crawls the Internet and saves to a local database either a duplicate of each page found or a duplicate of a portion thereof (the portion may include any of the following features of each Internet page found: the URL, titles, content, brief description of the content, hyperlinks or any combination thereof). The local copies of the pages or portions thereof may then be searched using a search engine. The local copies of the pages, portions thereof or any pages or portions thereof that are the result of a search of the foregoing are all considered "Resulting Pages".
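Saving only a portion of each crawled page (for example the URL, title and hyperlinks) could look like the following sketch. It uses Python's standard `html.parser` module and works on an HTML string, so no network access is involved. The class name `PageSummarizer`, the function `summarize`, and the shape of the stored record are assumptions, not details taken from the specification.

```python
from html.parser import HTMLParser

class PageSummarizer(HTMLParser):
    """Collect the <title> text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def summarize(url, html):
    """Return the portion of a page that a crawler might store locally."""
    parser = PageSummarizer()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "links": parser.links}

sample = (
    "<html><head><title> Shoe Shop </title></head>"
    '<body><a href="/cart">Cart</a><a name="top">Top</a></body></html>'
)
record = summarize("http://example.com", sample)
print(record)
```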
As shown in FIGs. 1A and 1B, the Resulting Pages 50 can generally be categorized as commercial and noncommercial. Resulting Pages in the commercial category ("Commercial Pages") 52, 62 generally include those Resulting Pages that facilitate the buying and/or selling of goods and/or services or that evince an intent to conduct commercial activity by the publisher of that page (are commercial in nature). For example, Commercial Pages 52, 62 include pages that offer goods and/or services via sale, lease, trade, or other such transaction, or that provide contact information for such transactions to be made by some other means such as facsimile, telephone or in person.
Resulting Pages in the noncommercial category ("Non-Commercial Pages") 54, 64 generally include those that are informational in nature and do not facilitate the buying and/or selling of goods and/or services, and hence are not commercial in nature. Non-Commercial Pages may alternately be called "Informational Pages."
Resulting Pages that are spam pages ("Spam Pages") 56 are generally considered to be a subset of the Commercial Pages 52, 62, because Spam Pages 56 are generally commercial in nature. However, it is also possible for Spam Pages to be primarily informational in nature because Spam Pages provide information regarding goods and/or services, but do not themselves facilitate the buying of goods and/or services. Because Spam Pages are designed to deceive or degrade search engines, including relevance-ordering heuristics, they are generally undesirable and may be removed or excluded from the search results. Usually, Spam Pages are considered commercial in nature because they provide a direct link to other pages that are commercial in nature. Spam Pages can be categorized as Commercial Pages, as shown in FIGs. 1A and 1B, or, alternatively, excluded from the commercial category.
In one embodiment of the invention, Resulting Pages may be further categorized in the premium-content containing category ("PCC Pages"). PCC Pages are pages for which payment of a premium is required in order to gain access to the content. In some cases, payment of the premium is governed by an agreement or contract. There are many examples of PCC Pages, such as those found at the following URLs: www.law.com and www.northernlight.com. PCC Pages can be considered either a subset of Commercial Pages and be placed in the Commercial category, or a subset of Non-Commercial Pages and be placed in the Non-Commercial category, depending on the preferences of the user or search engine service customer.
For example, PCC Pages 58 require payment of a premium in order to gain access. Because of the payment requirement, they have a commercial nature and may be considered a subset of the Commercial Pages, as shown in FIG. 1A. On the other hand, PCC Pages generally provide information and do not facilitate the buying and/or selling of goods and/or services other than the information contained on the PCC Pages themselves. Therefore, they also have an informational nature and may be considered a subset of the Non-Commercial Pages, as shown in FIG. 1B.
Yet another embodiment for filtering out the Commercial Pages and placing them in the commercial category generally comprises the steps shown in FIG. 2, indicated by reference numeral 10. These steps include: determining whether each page is a Spam Page 12; determining a quality score for each page 14; determining a transactional rating for each page 16; deriving a propagation matrix 18; determining a commercial score for each page 20; filtering out all pages with a commercial score that meets or exceeds a threshold value (the "Spam-Inclusive Commercial Pages") 22; filtering out the Spam Pages from the Spam-Inclusive Commercial Pages 24; and placing the Commercial Pages into the commercial category 26.
In one embodiment, determining whether a page is a Spam Page involves computing a spam score, $\sigma(w_i)$, for each page and determining whether the spam score meets or exceeds the threshold value assigned to the spam score. The pages that meet or exceed the spam score threshold are identified as Spam Pages. Spam scores may be computed using known techniques, such as having a human assign a score, and using the automated techniques presented in the following papers, which are hereby incorporated by reference: a white paper by ebrandmanagement.com entitled "The Classification of Search Engine Spam" and a paper by Danny Sullivan entitled "Search Engine Spamming." Both documents appear in the Proceedings of Search Engine Strategies, March 4- 2002, organized by Danny Sullivan. The foregoing and other known methods include both manual and automatic evaluation methods. These methods and similar machine-learning techniques could also be applied to computing the initial vector in equation (12) described later herein.
The quality score, $q(w_i)$, is a scalar value that is a measure of the quality of a page. In one embodiment, determining the quality score of the pages includes evaluating a subset of pages against a select group of criteria. Criteria against which the quality of the page may be judged include quality of the content, reputation of the author or source of information, the ease of use of the page and many other such criteria. The quality score may be human-assigned or determined automatically, and a default value may be assigned to pages not explicitly evaluated.
A transactional rating is a scalar value that represents whether or how strongly a page facilitates transactions, such as a sale, lease, rental or auction. In one embodiment, the steps for determining a transactional rating for each page are shown generally in FIG. 3 and indicated by reference number 16. Transactional ratings are determined from a transactional score. A transactional score is a vector that represents whether or how strongly each page meets a specified set of criteria.
Therefore, the first step is to determine whether a page and/or the page's URL meet select criteria 32. There are many characteristics of a page that can be examined in order to ultimately determine whether the page is transactional in nature. These criteria include determining whether the page includes the following: a field for entering credit card information; a field for a username and/or password for an online payment system such as PayPal™ or BidPay™; a telephone number identified for a "sales office," a "sales representative," "for more information call," or any other transaction-oriented phrase; a link or button with text such as "click here to purchase," "One-Click purchase," or similar phrase; text such as "your shopping cart contains" or "has been added to your cart"; and/or a tag such as a one-pixel GIF used for conversion tracking. Any text matching may be either on text strings, such as sequences of characters in the Unicode or ASCII character sets, or on text derived from optical character recognition of text rendered in images, or speech recognition on a sound recording presented in response to an HTTP (Hyper Text Transfer Protocol) request. The criteria can be used in any combination and any individual criterion may be used or not used. Additionally, these criteria are only examples and do not constitute an exhaustive list.
For each page, it must then be determined how strongly the page meets the selected criteria, block 34. Various techniques exist for determining whether pages meet certain criteria, 32, and how strongly they meet these criteria, 34. For instance, each page may be examined by a human editor, evaluated in terms of the criteria and assigned either a Boolean value or a weighted value. This, however, is a very slow and subjective process. Much faster automated techniques include automatically checking for or counting string matches, image matches or matches of string length and/or matches of data entry field type (such as numeric or alphanumeric) and assigning a log-likelihood score using language models. Language models include, for example, n-gram word transition models as described in Statistical Methods for Speech Recognition, Jelinek, 1999. These methods can assign a Boolean number or a weighted value.
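The automated string-matching approach of blocks 32 and 34 might be sketched as follows. This is only an illustration: the criterion names, the regular expressions, and the cap used to scale match counts into a weighted value are assumptions made for the example, not part of the disclosure.

```python
import re

# Hypothetical criteria; the names and patterns are illustrative assumptions.
CRITERIA = [
    ("credit_card_field", re.compile(r"credit\s*card", re.I)),
    ("purchase_link", re.compile(r"click here to purchase|one-click purchase", re.I)),
    ("shopping_cart", re.compile(r"your shopping cart contains|has been added to your cart", re.I)),
    ("sales_phone", re.compile(r"sales (?:office|representative)|for more information call", re.I)),
]

def boolean_scores(page_text):
    """Block 32: 1 if a criterion matches anywhere in the page text, else 0."""
    return [1 if pattern.search(page_text) else 0 for _, pattern in CRITERIA]

def weighted_scores(page_text, cap=3):
    """Block 34: match count per criterion, capped at `cap` and scaled to [0, 1]."""
    scores = []
    for _, pattern in CRITERIA:
        count = sum(1 for _ in pattern.finditer(page_text))
        scores.append(min(count, cap) / cap)
    return scores

page = "Click here to purchase. Your shopping cart contains 2 items. Call our sales office."
print(boolean_scores(page))
print(weighted_scores(page))
```

Counting matches is only one of the weighting options the text mentions; a language-model log-likelihood could be substituted for the capped count without changing the shape of the result.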
Using the results obtained by determining whether each page and/or its URL meet select criteria, 32, and determining how strongly the page and/or its URL meet select criteria, 34, a transactional score is determined. Determining the transactional score 35 for each page includes creating a vector $\alpha_k(w_i)$ or a vector $\beta_k(w_i)$ from the results of blocks 32 and 34, respectively. One of these vectors is created for each page $w_i$, wherein the index "i" represents a particular page and the index "k" represents a particular criterion against which the page was evaluated. The number of elements in the vector ($1 \le k \le n$) is determined by the number of criteria used and the number of vectors is determined by the number of pages. The transactional score $\alpha_k(w_i)$ is a vector of Boolean values wherein a "0" for a given criterion indicates that that criterion is not met (false) and any chosen integer for a given criterion indicates that that criterion is met (true). The transactional score vector $\beta_k(w_i)$ has the same number of elements as $\alpha_k(w_i)$. However, the elements in $\beta_k(w_i)$ can include any range of real numbers wherein each number indicates how strongly a page meets the criteria. For instance, $\beta_k(w_i)$ may include the real numbers between 0 and 1 (although it can include any range of real numbers) wherein 0 represents that a criterion is not met at all and 1 represents that a criterion is completely met. The real numbers between 0 and 1 represent the various degrees to which a criterion is met.
Transactional scores $\alpha_k(w_i)$ and $\beta_k(w_i)$ are used to determine alternate values for the transactional rating $r(w_i)$ for each page, wherein:
$$r(w_i) = \left( \sum_{k=1}^{n} \alpha_k(w_i)^p \right)^{1/p} \quad (1)$$
or alternately:
$$r(w_i) = \left( \sum_{k=1}^{n} \beta_k(w_i)^p \right)^{1/p} \quad (2)$$
The transactional rating $r(w_i)$ is a scalar value that is the p-norm of either the vector $\alpha_k(w_i)$ or the vector $\beta_k(w_i)$. $n$ is the number of criteria used in evaluating each site $w_i$. Generally, $p = 2$ so that no single weighted criterion dominates the others. However, $p$ can be altered to give more weight to the most dominant criteria, if desired. Either formula (1) or (2) may alternately be used to determine the transactional rating. Formula (2) reflects the degree to which individual criteria are met.
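As a small worked illustration, the p-norm rating can be computed as below. The function name and the sample vectors are assumptions made for the example; the same function serves for a Boolean or a weighted score vector.

```python
def transactional_rating(scores, p=2.0):
    """p-norm of a transactional score vector (Boolean or weighted)."""
    return sum(abs(s) ** p for s in scores) ** (1.0 / p)

boolean_vector = [1, 1, 0, 1]       # three of four criteria met
weighted_vector = [0.5, 1.0, 0.0]   # degrees to which three criteria are met

print(transactional_rating(boolean_vector))
print(transactional_rating(weighted_vector, p=2))
print(transactional_rating(weighted_vector, p=10))
```

Raising `p` moves the rating toward the largest single element, which is the sense in which a larger norm level gives more weight to the most dominant criteria.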
The steps for deriving the propagation matrix are shown generally in FIG. 4 as reference numeral 18. The steps comprise creating a hyperlink connectivity matrix 42, calculating transition counts and page views 44, and creating a propagation matrix 46. A hyperlink connectivity matrix is a way of representing the link structure of the Internet, World Wide Web or any set of hyperdocuments and the relative importance or relevance of each page. In this embodiment, the relative importance of each page is determined by examining the number of links from each page, $w_i$, to each page, $w_j$. These links are represented in the hyperlink connectivity matrix. The hyperlink connectivity matrix $C$ has "m" rows and "m" columns. The number of rows and columns equals the number of pages, wherein a specific row is indicated by index "i" and a specific column is indicated by index "j". Each element in this matrix, $C_{ij}$, will contain a value of 1 if and only if a page $w_i$ links to another page $w_j$; otherwise it will contain a 0.
The hyperlink connectivity matrix is then used to calculate two scalar values, the authority score $a_i$ and the hub score $h_i$, for each page $w_i$. In general, a hub is a page with many outgoing links and an authority is a page with many incoming links. The hub and authority scores reflect how heavily a page serves as a reference or is referenced itself. The values for the hub and authority scores are determined as follows, respectively:
$$h_i = \sum_{j} C_{ij}, \qquad a_i = \sum_{j} C_{ji}$$
The next step in determining the propagation matrix is to determine transition counts and page views, block 44. In one embodiment, each transition count, $T_{ij}$, represents actual user behavior on the Internet in terms of how many times a user views a page $w_i$ and then directly views another page $w_j$ (without viewing any intervening pages). All the transition counts are represented in matrix form, wherein $T_{ij}$ represents each individual transition count. Pageviews represent the number of times a page was viewed and are related to the transition counts.
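A minimal sketch of the hub and authority computation, taking a hub score as the number of outgoing links (a row sum of the connectivity matrix) and an authority score as the number of incoming links (a column sum). The three-page matrix is an invented example.

```python
def hub_and_authority(C):
    """Return (hub scores, authority scores) for a 0/1 connectivity matrix C.

    C[i][j] is 1 if page i links to page j, else 0.
    """
    m = len(C)
    hubs = [sum(C[i][j] for j in range(m)) for i in range(m)]         # outgoing links
    authorities = [sum(C[j][i] for j in range(m)) for i in range(m)]  # incoming links
    return hubs, authorities

# Page 0 links to pages 1 and 2; page 1 links to page 2; page 2 links nowhere.
C = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
hubs, authorities = hub_and_authority(C)
print(hubs)
print(authorities)
```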
Then the hyperlink connectivity matrix, hub score, authority score, transition counts, and pageviews are all used to create the propagation matrix, block 46. The propagation matrix $P$ is created using the following formula:
$$P_{ij} = \frac{f(C_{ij}, h_i) + g(C_{ij}, a_i) + h(T_{ij}, v_i)}{F(h_i) + G(a_i) + H(v_i)}$$
The functions $F(h_i)$, $G(a_i)$ and $H(v_i)$ provide weights to the hub scores, authority scores and pageviews. These functions are monotonically increasing scalar functions of the non-negative integers $h_i$, $a_i$ and $v_i$, respectively. Each of these functions corresponds to a weighing function, such as a step function. For example:
$$F(0) = 0; \qquad F(h_i) = F' \text{ if } 0 < h_i \le x; \qquad F(h_i) = F'' \text{ if } h_i > x,$$
wherein $F' < F''$. This gives a lower significance to a hub score if it is below a threshold value $x$, which indicates that insufficient data was accumulated.
$G(a_i)$ and $H(v_i)$ are determined in a similar manner. However, the threshold value for $G(a_i)$ will be a value of $a_i$ and the threshold value for $H(v_i)$ will be a value of $v_i$.
The functions $f(C_{ij}, h_i)$, $g(C_{ij}, a_i)$ and $h(T_{ij}, v_i)$ represent the contributions of the links and transitions. Each function is a weighted quotient of its arguments, except when its denominator is zero. For example,
$$f(C_{ij}, h_i) = \frac{C_{ij}\, F(h_i)}{h_i} \text{ if } h_i \ne 0; \text{ and} \quad (10)$$
$$f(C_{ij}, 0) = 0 \quad (11)$$
The functions $g(C_{ij}, a_i)$ and $h(T_{ij}, v_i)$ are determined in a similar manner.
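The construction of the propagation matrix can be sketched as follows. This is a hedged illustration, not the disclosed implementation: it assumes each element is the sum of the three weighted quotients divided by the sum of the three weights, that a row whose weights are all zero is left at zero, and that all three step functions share one threshold `x` and the two step heights `low` and `high`. Those parameter names and values are hypothetical.

```python
def step_weight(value, threshold, low=0.5, high=1.0):
    """Step weighing function: 0 at 0, `low` up to the threshold, `high` above it."""
    if value == 0:
        return 0.0
    return low if value <= threshold else high

def contribution(count, total, weight):
    """Weighted quotient count * weight / total, defined as 0 when total is 0."""
    return 0.0 if total == 0 else count * weight / total

def propagation_matrix(C, T, hubs, authorities, views, x=2):
    """Assumed form: P[i][j] = (f + g + h) / (F + G + H) for source page i."""
    m = len(C)
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        F = step_weight(hubs[i], x)
        G = step_weight(authorities[i], x)
        H = step_weight(views[i], x)
        denominator = F + G + H
        if denominator == 0:
            continue  # assumption: no data for page i, so its row stays zero
        for j in range(m):
            numerator = (contribution(C[i][j], hubs[i], F)
                         + contribution(C[i][j], authorities[i], G)
                         + contribution(T[i][j], views[i], H))
            P[i][j] = numerator / denominator
    return P

C = [[0, 1], [0, 0]]   # page 0 links to page 1
T = [[0, 3], [1, 0]]   # observed transitions between the two pages
P = propagation_matrix(C, T, hubs=[1, 0], authorities=[0, 1], views=[3, 1])
print(P)
```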
As shown in FIG. 2, the next step in determining whether each page is commercial is determining a commercial score for each page 20. This determination involves not only the propagation matrix, $P$, and the transaction rating $r(w_i)$, but the spam score, $\sigma(w_i)$, and quality score, $q(w_i)$, as well. The transaction rating $r(w_i)$ and the spam score $\sigma(w_i)$ determine the weight of the different components. The commercial score is determined recursively for each page, $w_i$, by equations (12), (13) and (14): equation (12) gives the initial value for each page $w_i$, equation (13) defines the weighted average of the transaction rating, $r(w_i)$, the spam score, $\sigma(w_i)$, and the quality score, $q(w_i)$, and equation (14) gives the vector $\kappa(t)$ at each iteration. $A$ and $B$ are weighing factors that determine the weight given to $\sigma(w_i)$ and $q(w_i)$, respectively. $A$ and $B$ may be selected by the search engine provider or creator. The vector $\kappa(t)$ has an element for every page examined $w_i$. $\eta$ is the propagation matrix weight and may also be set by the search engine provider or creator. $\eta$ determines the degree to which the propagation matrix affects the commercial score in the initial iterations. The symbol "t" indicates an incrementing integer that starts at one and increases by one for each iteration. Each iteration has the potential to affect all $w_i$. The iterations continue for a predetermined number of iterations or until there is little variation in the value of the commercial score:
$$\left\| \kappa(t) - \kappa(t-1) \right\|_p \le \Delta$$
$p$ is the norm-level and $\Delta$ is a commercial score variation value. Once the difference in values obtained from two subsequent iterations equals or is less than the commercial score variation value, the iterations stop and the commercial score is obtained 22.
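The recursive determination of the commercial score might be sketched as below. The explicit forms used here are assumptions chosen only to illustrate the iteration and the stopping rule: the initial value is taken as (r + A·σ + B·q) / (1 + A + B), and each update is taken as η times the scores propagated along incoming links plus (1 − η) times the initial value. The names `eta`, `delta` and `max_iter` are likewise illustrative.

```python
def commercial_scores(P, r, sigma, q, A=1.0, B=1.0, eta=0.5,
                      p=2.0, delta=1e-6, max_iter=100):
    """Iterate commercial scores until the p-norm of the change is <= delta."""
    m = len(P)
    # Assumed initial value: weighted average of rating, spam score and quality score.
    k0 = [(r[i] + A * sigma[i] + B * q[i]) / (1.0 + A + B) for i in range(m)]
    k = list(k0)
    for _ in range(max_iter):
        # Assumed update: propagate scores along links, then mix with the initial value.
        nxt = [eta * sum(P[i][j] * k[i] for i in range(m)) + (1.0 - eta) * k0[j]
               for j in range(m)]
        change = sum(abs(a - b) ** p for a, b in zip(nxt, k)) ** (1.0 / p)
        k = nxt
        if change <= delta:
            break
    return k

P = [[0.0, 1.0], [0.0, 0.0]]   # page 0 propagates its score to page 1
scores = commercial_scores(P, r=[1.0, 0.0], sigma=[0.0, 0.0], q=[0.0, 0.0])
print(scores)
```

In this toy case page 1 has no transactional rating of its own, yet it ends with a nonzero score because page 0 links to it, which is the effect the propagation matrix is meant to provide.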
All pages with a commercial score above or equal to a commercial score threshold value are filtered out and comprise the Spam-Inclusive Commercial Pages 22. Although they may often be considered a subset of the Commercial Pages, the Spam Pages are filtered out from the Spam-Inclusive Commercial Pages 24 to yield the Commercial Pages, because Spam Pages are generally undesirable. The Commercial Pages are then placed into the commercial category 26. Once the Commercial Pages and the Spam Pages are filtered from the pages, the remaining pages are placed in the non-commercial category. The non-commercial category may also include the PCC Pages.
In another embodiment, pages are categorized into Commercial and Non-Commercial categories as described above; however, Spam Pages are not separated into a distinct category. Instead, the Spam Pages are categorized as either Commercial or Non-Commercial Pages depending on the underlying commercial score assigned to that page and the threshold scores specified for each category. Because Spam Pages may in theory be either commercial or non-commercial, and because the inclusion of Spam Pages may be useful for some users and/or in some applications, this embodiment does not include a step for identifying and filtering out Spam Pages. By removing the identification and filtering of Spam Pages, this embodiment is more modularly compatible with existing search engines, because many existing search engines are equipped with their own systems for identifying and eliminating Spam Pages. In yet another embodiment, the Spam Pages are not removed from the commercial category because Spam Pages do have potential value, for instance, as sales leads for a pay for performance search engine.
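The two filtering embodiments above can be sketched with a single function. The function name, the dictionary layout and the `separate_spam` flag are assumptions for illustration; the sketch also assumes that every page meeting the spam threshold is set aside, whatever its commercial score.

```python
def categorize(commercial_score, spam_score, score_threshold,
               spam_threshold, separate_spam=True):
    """Split pages into commercial, spam and non-commercial lists."""
    result = {"commercial": [], "spam": [], "non_commercial": []}
    for page, score in commercial_score.items():
        if separate_spam and spam_score[page] >= spam_threshold:
            result["spam"].append(page)            # filtered out as a Spam Page
        elif score >= score_threshold:
            result["commercial"].append(page)      # meets or exceeds the threshold
        else:
            result["non_commercial"].append(page)  # everything that remains
    return result

commercial_score = {"a": 0.9, "b": 0.2, "c": 0.8}
spam_score = {"a": 0.1, "b": 0.0, "c": 0.95}
print(categorize(commercial_score, spam_score, 0.5, 0.9))
print(categorize(commercial_score, spam_score, 0.5, 0.9, separate_spam=False))
```

With `separate_spam=False` the spam page "c" is categorized purely by its commercial score, which corresponds to the embodiment that leaves spam handling to an existing search engine.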
In another embodiment, categorization of Resulting Pages may be customized by or for the user (including consumers, Site Providers and Advertisers). In the first stage of the process, the user defines their categorization preferences by entering such preferences through the system's user interface and then refining their selections until the desired categorization is achieved. Both the categories themselves and how the Resulting Pages are categorized can be customized. The system can be customized to categorize Resulting Pages into categories specified by the user, using the previously described methods. Into which category a given Resulting Page is categorized can be affected by selecting any of the following, alone or in combination: how PCC Pages are categorized, the threshold levels, the p-norm level, parameters A and B in equation (13), the number of iterations "t" for computing the commercial score, the commercial score variation value $\Delta$, the criteria used to determine which Resulting Pages are Commercial or PCC Pages and how much weight to give each criterion, the criteria used to determine the transaction score, and the transaction score formula used to determine the transaction rating (the "Categorization Criteria"). The Categorization Criteria can all be chosen so that Resulting Pages are categorized and presented in a variety of ways in order to satisfy the user's preferences.
In general, the Categorization Criteria may be chosen empirically by manually seeding the system with pre-selected pages, examining the categories in which the pre-selected pages are categorized, and then adjusting the Categorization Criteria to tune the system until the desired categorizations are achieved. For example, as shown in FIG. 5A, the user hand-seeds the system 200 with pre-selected pages for which the user knows the categories into which the pages should be placed 210. The user then inputs the user's preferences in terms of the categories into which the pages are to be categorized and the format in which the categorized results should be displayed 212. The user then sets the Categorization Criteria 214. The system then categorizes and presents the categorized results to the user 216. The user then determines whether the system has categorized the pre-selected pages into the desired categories 218. If the pre-selected pages are not categorized in the desired categories, any one or combination of the Categorization Criteria may be altered and set in the system 214. Steps 214, 216 and 218 may be repeated until the desired categorization is achieved.
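The seed-and-tune loop of steps 214, 216 and 218 can be illustrated with a single adjustable criterion. The sketch below assumes the only Categorization Criterion being tuned is the commercial score threshold and that candidate values are tried in order; the function and variable names are hypothetical.

```python
def tune_threshold(seed_scores, desired, candidates):
    """Return the first candidate threshold that puts every seed page in its
    desired category, or None if no candidate does."""
    for threshold in candidates:                          # step 214: set the criterion
        assigned = {
            page: "commercial" if score >= threshold else "non_commercial"
            for page, score in seed_scores.items()
        }                                                 # step 216: categorize
        if assigned == desired:                           # step 218: check the result
            return threshold
    return None

seed_scores = {"shop": 0.8, "wiki": 0.3}
desired = {"shop": "commercial", "wiki": "non_commercial"}
print(tune_threshold(seed_scores, desired, [0.2, 0.5, 0.9]))
```

In practice several criteria would be adjusted together, and a person would typically inspect the intermediate results rather than rely on an exact match.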
In step 212, the user may set preferences for the way in which the categorized results are displayed. The results obtained from categorizing the Resulting Pages may be displayed in a variety of ways. For instance, the user may specify that only Resulting Pages matching a keyword search are to be categorized and presented, or that a specific type or category of pages is always to be excluded, e.g. pornography or debt relief advertisements. Additionally or alternatively, the user may view the categorized pages contained in certain categories in a variety of ways, including displaying by category or displaying only particular categories and not others. Additionally or alternatively, the user may specify the order in which the categorized pages are to be displayed. For instance, the categorized pages may be displayed by category with a preferred category appearing first. Additionally or alternately, intermediate values such as the transaction score, transaction rating, hyperlink connectivity matrix, propagation matrix, transaction authority and hub scores, and the commercial, spam and quality scores may also be displayed. Additionally or alternately, the user may also request that the anchor text of the links be examined. If the anchor text contains the keywords, the pages containing any number of the keywords would be given a higher weighting than the links that do not contain any of the keywords. Alternatively, links containing a greater number of keywords can be given a higher weighting than those with a lower number. Customizing the display of categorized pages may be accomplished using known display and presentation techniques.
Once the user has specified the categories, Categorization Criteria and display preferences, a search 250 may be performed. As shown in FIG. 6, a search 250 begins when a user enters a search term or phrase into the system using a user interface 260. The system will then generate the Resulting Pages according to any of a variety of known relevance methods, including returning Resulting Pages that contain a keyword or the keywords that match the search term or phrase (the search results) 262. The system will then categorize the search results into categories specified by the user so that the Categorization Criteria specified by the user are satisfied 264. The system then presents the categorized pages according to the user's presentation preferences 266.
In a further embodiment, the Commercial Pages may be used to generate sales leads. Using the URLs of the Commercial Pages, contact information for the companies hosting the Commercial Pages can be obtained from a domain name registry. The list of companies and their contact information can then be compiled to develop a list of sales leads. As depicted in FIG. 7, a system 270 for categorizing the Resulting Pages generally includes the following steps: (a) assume that each Resulting Page is noncommercial in nature 272; (b) identify and filter out the pages that are commercial in nature into a first category 274; (c) identify and filter out existing advertiser client pages from the pages in the first category 276; (d) gather contact information for the remaining pages ("lead pages") 278; and (e) provide the lead pages and their associated contact information as sales leads 280 to, for instance, a pay for performance search engine provider or any other interested party.
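Steps (c) through (e) of the lead-generation flow might look like the following sketch. The `registry` dictionary is a hypothetical stand-in for a lookup against a domain name registry, and the host names are invented; only `urllib.parse.urlparse` from the standard library is used to extract the host from each URL.

```python
from urllib.parse import urlparse

def sales_leads(commercial_urls, existing_client_hosts, registry):
    """Return one lead per host that is commercial, not already a client,
    and has contact information in the (hypothetical) registry mapping."""
    leads = []
    seen = set()
    for url in commercial_urls:
        host = urlparse(url).hostname
        if host is None or host in existing_client_hosts or host in seen:
            continue                      # step (c): drop existing clients and repeats
        seen.add(host)
        contact = registry.get(host)      # step (d): gather contact information
        if contact is not None:
            leads.append({"host": host, "contact": contact})
    return leads                          # step (e): hand the leads on

urls = [
    "http://shop.example.com/cart",
    "http://client.example.org/",
    "http://shop.example.com/checkout",
]
registry = {"shop.example.com": "sales@shop.example.com"}
print(sales_leads(urls, {"client.example.org"}, registry))
```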
In another embodiment, advertisers are offered the opportunity to pay to have their listings included in, or excluded from, certain categories, using the techniques described in US Patent Number 6,269,361, incorporated by reference herein. The fee paid by the advertisers may be a function of the prominence given their listing in a select category. In a further embodiment, only pages for which a fee has been paid will appear in the commercial (or other designated) category. In one embodiment, a customizable system for categorizing and presenting the records or the results of a search of the records in a distributed database may be configured as an account management server or search engine server associated with a database search apparatus, such as the type disclosed in US Patent Number 6,269,361. The functions described herein and illustrated in FIGS. 1-8 may be implemented in any suitable manner.
One implementation is computer-readable source or object code that controls a processor of a server or other computing device to perform the described functions. The computer-readable code may be implemented as an article including a computer-readable signal-bearing medium. In one embodiment, the medium is a recordable data storage medium such as a floppy disk or a hard disk drive of a computer or a nonvolatile type of semiconductor memory. In another embodiment, the medium is a modulated carrier signal such as data read over a network such as the Internet. The medium includes means in the medium for determining whether a page is transactional, means in the medium for deriving a propagation matrix for the page, and means in the medium for defining a commercial score as a function of the propagation matrix for the page. The various means may be implemented as computer source code, computer-readable object code or any other suitable apparatus for controlling a processing device to perform the described function.
Another embodiment of the present invention, an apparatus for categorizing and presenting the records or the results of a search of the records in a distributed database over a distributed client-server architecture, is shown in FIG. 8. The search engine and database 100 shown in FIG. 8 generally comprises computing apparatuses 110, 114, 118, 120, memory devices 112 and 116, a server 124 and an interface 122. The computing apparatuses 110, 114, 118, 120 may include any processors that can perform computations. The crawler 110 is a computing apparatus that is connected to the Internet via a network, goes to every page and makes a copy of the page (the "Resulting Page"), including the static and/or dynamic content, whether text, image, audio, video or still images, and stores the copy in the Internet cache 112. Alternatively, only a discrete number of parts of each Resulting Page, such as the URL and/or title, are copied and stored in the Internet cache 112. Then the indexer 114 assigns each Resulting Page copy, or portion thereof, an address in the Internet cache 112 (the "Internet cache address"). The indexer also generates search terms for each Resulting Page and stores these search terms, with the associated Internet cache address, in the Internet index 116. The Internet cache and the Internet index would use approximately 30 terabytes and 5 terabytes, respectively, given the current size of the Internet.
The transaction score generator 118 uses the information contained in the copies of each Resulting Page (or portions thereof) stored in the Internet cache 112 to generate the transaction scores. These transaction scores are then stored in the Internet cache 112 with their associated Resulting Pages. The category assignor 120 uses the transaction scores and other information stored in the Internet cache 112 to generate the propagation matrix and assign a category to each Resulting Page. The transaction scores, commercial scores, quality scores, spam scores and categories for each page are stored in the Internet cache 112 with their associated pages.
The customizable threshold values, norm parameter $p$, commercial score variation values $\Delta$, etc., may be stored on the client or server side of the system, as is well known to those skilled in the art. A search server 124 is coupled to the Internet index 116 and the Internet cache 112 and allows the apparatus to connect to the users via the system's user interface 122. The system's user interface 122 may be a browser or it may be agent or application software.
A user desiring to search the Internet may use the system user interface 122 to connect to the search server 124 via the Internet. If the system user interface 122 is a browser, it sends the user's search request to the search server 124 via the Internet. Alternatively, if the user interface 122 is agent software, the agent sends an automated search request over the Internet. Additionally, the user interface 122 may comprise both a browser and agent software and send an automated search request to the search server 124 over the Internet. The search server 124 then uses the Internet index 116 to determine which Resulting Pages are associated with the user's search terms. These Resulting Pages are then retrieved from the Internet cache 112 and presented to the user via the user interface 122 in the manner specified by the user.
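The interplay between the Internet index 116 and the Internet cache 112 can be illustrated with a toy in-memory version. This is a sketch under simplifying assumptions: the cache is a dictionary from cache address to page text, the index maps each whitespace-delimited lowercase term to the set of addresses containing it, and a query returns pages containing all of its terms.

```python
def build_index(cache):
    """Indexer: map each term to the set of cache addresses whose page contains it."""
    index = {}
    for address, text in cache.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(address)
    return index

def search(index, cache, query):
    """Search server: look up addresses in the index, then fetch pages from the cache."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [cache[address] for address in sorted(hits)]

cache = {1: "buy shoes online", 2: "history of shoes", 3: "buy books"}
index = build_index(cache)
print(search(index, cache, "buy shoes"))
print(search(index, cache, "shoes"))
```

A production index would also store the category assigned to each page so that results could be grouped or filtered before being presented.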
From the foregoing, it can be seen that the presently disclosed embodiments provide a method and apparatus for categorizing and presenting select elements of a distributed database. Further advantages include providing advertisers, search service providers and users with a search engine and database that permits the customizable categorization of search results and providing a method and apparatus for filtering search results so that only a desired category or categories of search results are returned or displayed.
Further benefits of the presently disclosed embodiments include providing to users, advertisers, search site providers and search engine providers a method of customizing searches to search and/or display search results according to category or criteria, and providing advertisers with a method for controlling with which other links that advertiser's products and/or services are categorized and displayed. Still further, the present embodiments disclose providing a method of identifying the nature of a site and providing a search engine capable of categorizing search results, as well as providing a search engine that is customizable by users and advertisers.
Although the invention has been described in terms of specific embodiments and applications, persons skilled in the art can, in light of this disclosure, generate additional embodiments without exceeding the scope or departing from the spirit of the claimed invention. For example, the system and methods presented herein may be applied not just to databases accessed over the Internet, but to any distributed database. Furthermore, there is a vast variety in the categories into which the pages or documents may be placed and in the criteria used to place them there. Accordingly, it is to be understood that the drawings and descriptions in this disclosure are proffered to facilitate comprehension of the invention and should not be construed to limit the scope thereof.

Claims (15)

The claims defining the invention are as follows:
  1. A search engine and database for a distributed database, comprising: at least one memory device, comprising: at least one Internet cache; and an Internet index; a crawler in communication with the Internet cache and an Internet; an indexer in communication with the Internet index and the at least one Internet cache; a transactional score generator in communication with the Internet cache; a category assignor in communication with the Internet cache; a search server in communication with the Internet cache and the Internet index; and a user interface in communication with the search server.
  2. A search engine and database for a distributed database, as claimed in Claim 1, wherein the Internet cache is at least approximately [value illegible] terabytes.
  3. A search engine and database for a distributed database, as claimed in Claim 1, wherein the Internet index is at least approximately [value illegible] terabytes.
  4. A method for searching a distributed database, comprising: entering search terms or phrases into a system; generating documents containing keywords that match the search terms or phrases; categorizing search results into categories according to categorization criteria to create categorized documents; and presenting the categorized documents.
  5. A method for searching a distributed database, as claimed in Claim 4, wherein Categorization Criteria are selected by a user.
  6. A method for searching a distributed database, as claimed in Claim 5, wherein the categories are selected by a user.
  7. A method for searching a distributed database, as claimed in Claim 6, wherein Categorization Criteria are selected using steps comprising: manually seeding the system with pre-selected documents; and repeating the steps of Claim 4 while varying the categorization criteria of step [illegible] during each iteration until the categorized documents are categorized into the categories approximately as desired.
  8. A method for searching a distributed database, as claimed in Claim 4, further comprising selecting display preferences, wherein the display preferences affect how the categorized documents are presented in step [illegible].
  9. A method for categorizing documents in a distributed database to create categorized documents, the method comprising: initially assuming all documents are of type 1; filtering out all type 2 documents and placing them in a first category; filtering out all type 3 documents and placing them in a second category; and defining all remaining documents as type 4 documents and placing all type 4 documents in a third category.
  10. A method for categorizing documents in a distributed database, as claimed in Claim 9, wherein the documents are pages and the distributed database is the Internet.
  11. A method for categorizing Resulting Pages into categories, comprising: designating a first category as commercial pages and a second category as informational pages; determining a quality score $q(w_i)$ for each Resulting Page; determining a transactional rating $r(w_i)$ for each Resulting Page; deriving a propagation matrix $P$; determining a commercial score $\kappa$ for each Resulting Page; filtering out all Resulting Pages that meet or exceed a commercial score threshold value; wherein the Resulting Pages that meet or exceed the commercial page threshold value are placed in the first category and all remaining Resulting Pages are placed in the second category.
  12. A method for categorizing Resulting Pages into categories, as claimed in Claim 11, wherein determining the quality score for each Resulting Page comprises evaluating a subset of Resulting Pages against a select group of criteria.
  13. [illegible] use.
  14. A method for categorizing Resulting Pages into categories, as claimed in Claim 12, wherein a default value is assigned to Resulting Pages not included in the subset of Resulting Pages.
  15. A method for categorizing Resulting Pages into categories, as claimed in Claim 11, wherein determining the transactional rating $r(w_i)$ comprises: determining whether each Resulting Page meets select criteria; determining how strongly each Resulting Page meets the select criteria; determining a transactional score for each page; and determining the transactional rating for each page from the transactional score.
  16. A method for categorizing Resulting Pages into categories, as claimed in Claim 15, wherein determining how strongly each Resulting Page meets the select criteria comprises evaluating each Resulting Page in terms of the select criteria and assigning each of the Resulting Pages either a Boolean or weighted value that reflects how strongly each of the Resulting Pages meets each of the select criteria, respectively.
  17. A method for categorizing Resulting Pages into categories, as claimed in Claim 15, wherein determining a transactional score for each page comprises creating a vector for each Resulting Page $\alpha(w_i)$, wherein each vector contains a plurality of elements $\alpha_k(w_i)$, wherein each of the plurality of elements $\alpha_k(w_i)$ is a Boolean value that reflects how strongly each of the Resulting Pages meets each of the select criteria.
  18. A method for categorizing Resulting Pages into categories, as claimed in Claim 15, wherein determining a transactional score for each page comprises creating a vector for each Resulting Page $\beta(w_i)$, wherein each vector contains a plurality of elements $\beta_k(w_i)$, wherein each of the plurality of elements $\beta_k(w_i)$ is a weighted value that reflects how strongly each of the Resulting Pages meets each of the select criteria.
  19. A method for categorizing Resulting Pages into categories, as claimed in Claim 15, wherein determining the transactional rating $r(w_i)$ for each page from the transactional score comprises evaluating a relationship between the transactional rating $r(w_i)$ and a p-norm of a vector for each Resulting Page $\alpha(w_i)$, wherein the relationship is defined by: [equation illegible].
  20. A method for categorizing Resulting Pages into categories, as claimed in Claim 19, wherein $p = 2$.
  21. A method for categorizing Resulting Pages into categories, as claimed in Claim 15, wherein determining the transactional rating $r(w_i)$ for each page from the transactional score comprises evaluating a relationship between the transactional rating $r(w_i)$ and a p-norm of a vector for each Resulting Page $\beta(w_i)$, wherein the relationship is defined by: $\lVert \beta(w_i) \rVert_p$.
  22. A method for categorizing Resulting Pages into categories, as claimed in Claim 21, wherein $p = 2$.
  23. A method for categorizing Resulting Pages into categories, as claimed in Claim 11, wherein deriving a propagation matrix comprises: creating a hyperlink connectivity matrix $C$ containing elements $C_{ij}$; calculating a plurality of authority scores $a_i$ and a plurality of hub scores $h_i$; calculating a plurality of transition counts $T_{ij}$ and a plurality of pageviews $v_i$ for each Resulting Page; and creating the propagation matrix $P$ containing propagation matrix elements $P_{ij}$.
  24. A method for categorizing Resulting Pages into categories, as claimed in Claim 23, wherein creating a hyperlink connectivity matrix $C$ comprises representing a link structure of the Internet in a matrix.
  25. A method for categorizing Resulting Pages into categories, as claimed in Claim 24, wherein the link structure of the Internet is represented by examining a number of links from each Resulting Page to each Resulting Page.
  26. A method for categorizing Resulting Pages into categories, as claimed in Claim 23, wherein the plurality of hub scores $h_i$ and the plurality of authority scores $a_i$ are related to the hyperlink connectivity matrix $C$, and wherein the plurality of authority scores $a_i$ are defined as: [equation illegible]; and wherein the plurality of hub scores $h_i$ are defined as: [equation illegible], respectively.
  27. A method for categorizing Resulting Pages into categories, as claimed in Claim [number illegible], wherein the plurality of pageviews $v_i$ are related to the plurality of transition counts $T_{ij}$ and are defined by: [equation illegible].
  28. A method for categorizing Resulting Pages into categories, as claimed in Claim 27, wherein the propagation matrix is a function of the hyperlink connectivity matrix, the plurality of hub scores, the plurality of authority scores, the plurality of transition counts and the plurality of pageviews.
  29. A method for categorizing Resulting Pages into categories, as claimed in Claim 27, wherein calculating the propagation matrix further comprises weighting the plurality of hub scores, the plurality of authority scores, and the plurality of pageviews.
  30. A method for categorizing Resulting Pages into categories, as claimed in Claim 27, wherein the propagation matrix $P$ is a further function of weighting functions $F(h_i)$, $G(a_i)$ and $H(v_i)$, and wherein the propagation matrix $P$ is defined as: [equation illegible].
  31. A method for categorizing Resulting Pages into categories, as claimed in Claim 30, wherein each of the weighting functions comprises a step function.
  32. A method for categorizing Resulting Pages into categories, as claimed in Claim 31, wherein the commercial score $\kappa$ for each Resulting Page $w_i$ is determined recursively.
  33. A method for categorizing Resulting Pages into categories, as claimed in Claim 32, wherein the commercial score $\kappa$ is recursively determined over $t$ iterations from a transpose of the propagation matrix $P^{T}$, a propagation matrix weight $\eta$, and a commercial score initial value $\kappa^{(0)}$, wherein $\kappa^{(0)}$ is weighted by select quantities $A$ and $B$ and defined as: $\kappa^{(0)}(w_i) = \frac{A\,r(w_i) + B\,q(w_i)}{A + B}$, and a prior iteration of the commercial score $\kappa^{(t-1)}$, wherein $\kappa^{(t)}$ is defined as: $\kappa^{(t)} = \eta P^{T} \kappa^{(t-1)} + \kappa^{(0)}$, and wherein [condition illegible].
  34. A method for categorizing Resulting Pages into categories, as claimed in Claim 11, further comprising designating a third category as spam pages; and determining a spam score $\sigma(w_i)$ for each Resulting Page; wherein determining the commercial score $\kappa$ for each Resulting Page is recursively determined over $t$ iterations from a transpose of the propagation matrix $P^{T}$, a propagation matrix weight $\eta$ and a commercial score initial value $\kappa^{(0)}$, wherein $\kappa^{(0)}$ is weighted by select quantities $A$ and $B$ and defined as: $\kappa^{(0)}(w_i) = \frac{A\,r(w_i) + B\,q(w_i) + \sigma(w_i)}{A + B + 1}$, and a prior iteration of the commercial score $\kappa^{(t-1)}$, wherein $\kappa^{(t)}$ is defined as: $\kappa^{(t)} = \eta P^{T} \kappa^{(t-1)} + \kappa^{(0)}$, and wherein [condition illegible].
  35. A method for categorizing a plurality of Resulting Pages into categories, comprising: determining whether each of the plurality of Resulting Pages is a spam page; determining a quality score $q(w_i)$ for each of the plurality of Resulting Pages; determining a transactional rating $r(w_i)$ for each of the plurality of Resulting Pages; deriving a propagation matrix $P$; determining a commercial score $\kappa$ for each of the plurality of Resulting Pages; filtering out all spam-inclusive commercial pages from the plurality of Resulting Pages; filtering out all spam pages from the spam-inclusive commercial pages; placing all commercial pages in a commercial category; and placing all remaining Resulting Pages into an information category.
  36. A method for categorizing documents in a distributed database, comprising: assuming all documents in the distributed database are non-commercial in nature; filtering out all documents that are commercial in nature from the documents, wherein the documents that are commercial in nature are commercial documents; and creating sales leads from the commercial documents.
  37. A method for categorizing documents in a distributed database, as claimed in Claim 36, wherein filtering out all the commercial documents comprises placing all the commercial documents into a first category.
  38. A method for categorizing documents in a distributed database, as claimed in Claim 37, further comprising, after placing all the documents that are commercial in nature into a first category, filtering out existing advertiser client pages from the commercial pages in the first category, wherein the commercial pages remaining in the first category are lead pages.
  39. A method for categorizing documents in a distributed database, as claimed in Claim 37, wherein creating sales leads from the commercial documents comprises creating sales leads from the lead pages, wherein creating leads from the lead pages comprises: gathering contact information for the lead pages; and providing a list of the lead pages and the contact information.
  40. A search engine substantially as described herein with reference to any one or more of the accompanying drawings.
  41. A method for searching a distributed database, substantially as described herein with reference to any one or more of the accompanying drawings.
  42. A method for categorizing documents in a distributed database, substantially as described herein with reference to any one or more of the accompanying drawings.
  43. A method for categorizing Resulting Pages, substantially as described herein with reference to any one or more of the accompanying drawings.
Dated 21 March, 2007 Overture Services, Inc. Patent Attorneys for the Applicant/Nominated Person SPRUSON & FERGUSON
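As an informal illustration of the scoring recited in the claims on categorizing Resulting Pages, the sketch below computes a transactional rating as a p-norm, forms an initial commercial score from that rating and the quality score, propagates scores through a propagation matrix, and splits pages at a threshold. The formulas used, $\kappa^{(0)} = (A\,r + B\,q)/(A + B)$ and $\kappa^{(t)} = \eta P^{T} \kappa^{(t-1)} + \kappa^{(0)}$, are a reading of partly illegible claim text. The function names, the default values of `a`, `b`, `eta` and `iterations`, and the category labels are assumptions chosen for the example, not values taken from the specification.

```python
def p_norm(vector, p=2.0):
    """p-norm of a vector of per-criterion values for one page."""
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)


def initial_scores(ratings, qualities, a=1.0, b=1.0):
    """Assumed initial commercial score: weighted mean of rating r and quality q."""
    return [(a * r + b * q) / (a + b) for r, q in zip(ratings, qualities)]


def propagate(p_matrix, kappa0, eta=0.5, iterations=10):
    """Iterate kappa(t) = eta * P^T kappa(t-1) + kappa(0) for a fixed number of steps."""
    n = len(kappa0)
    kappa = list(kappa0)
    for _ in range(iterations):
        # (P^T kappa)_i is the sum over j of P[j][i] * kappa[j].
        kappa = [
            eta * sum(p_matrix[j][i] * kappa[j] for j in range(n)) + kappa0[i]
            for i in range(n)
        ]
    return kappa


def categorize(scores, threshold):
    """Pages at or above the threshold are commercial; the rest are informational."""
    return ["commercial" if s >= threshold else "informational" for s in scores]
```

In use, one would build the per-page criterion vectors, call `p_norm` on each to obtain ratings, combine them with quality scores via `initial_scores`, run `propagate` with a propagation matrix derived from link and traffic data, and pass the result to `categorize`. A fixed iteration count is used here for simplicity; a convergence test would be an equally reasonable design.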
AU2007201222A 2002-05-24 2007-03-21 Method and apparatus for categorizing and presenting documents of a distributed database Abandoned AU2007201222A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2007201222A AU2007201222A1 (en) 2002-05-24 2007-03-21 Method and apparatus for categorizing and presenting documents of a distributed database

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US10155290 2002-05-24
AU2003204327A AU2003204327B2 (en) 2002-05-24 2003-05-23 Method and Apparatus for Categorizing and Presenting Documents of a Distributed Database
AU2007201222A AU2007201222A1 (en) 2002-05-24 2007-03-21 Method and apparatus for categorizing and presenting documents of a distributed database

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
AU2003204327A Division AU2003204327B2 (en) 2002-05-24 2003-05-23 Method and Apparatus for Categorizing and Presenting Documents of a Distributed Database

Publications (1)

Publication Number Publication Date
AU2007201222A1 (en) 2007-04-19

Family

ID=38009160

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2007201222A Abandoned AU2007201222A1 (en) 2002-05-24 2007-03-21 Method and apparatus for categorizing and presenting documents of a distributed database

Country Status (1)

Country Link
AU (1) AU2007201222A1 (en)

Legal Events

Date Code Title Description
PC1 Assignment before grant (sect. 113)

Owner name: YAHOO! INC.

Free format text: FORMER APPLICANT(S): OVERTURE SERVICES, INC.

MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application