
A highly scalable and effective method for metasearch

Published: 01 July 2001

Abstract

A metasearch engine is a system that supports unified access to multiple local search engines. Database selection is one of the main challenges in building a large-scale metasearch engine. The problem is to efficiently and accurately determine a small number of potentially useful local search engines to invoke for each user query. In order to enable accurate selection, metadata that reflect the contents of each search engine need to be collected and used. This article proposes a highly scalable and accurate database selection method. This method has several novel features. First, the metadata for representing the contents of all search engines are organized into a single integrated representative. Such a representative yields both computational efficiency and storage efficiency. Second, the new selection method is based on a theory for ranking search engines optimally. Experimental results indicate that this new method is very effective. An operational prototype system has been built based on the proposed approach.
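The database selection step described above can be illustrated with a minimal sketch. It assumes a hypothetical integrated representative that maps each term to (engine, weight) pairs and ranks engines by summed weights. The engine names, the weights, the function name select_engines, and the additive scoring rule are all illustrative assumptions, not the representative or the ranking theory proposed in the article.

```python
from collections import defaultdict

# Hypothetical integrated representative: a single shared structure that maps
# each term to (engine, weight) pairs. All names and weights are made up.
INTEGRATED_REPRESENTATIVE = {
    "metasearch": [("engine_a", 0.9), ("engine_c", 0.4)],
    "database": [("engine_b", 0.8), ("engine_a", 0.3)],
    "retrieval": [("engine_c", 0.7), ("engine_b", 0.2)],
}


def select_engines(query_terms, representative, k=2):
    """Return up to k engines with the highest summed term weights.

    The additive score is an assumption made for illustration only.
    """
    scores = defaultdict(float)
    for term in query_terms:
        # Terms missing from the representative contribute nothing.
        for engine, weight in representative.get(term, []):
            scores[engine] += weight
    # Sort by descending score; break ties by engine name for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [engine for engine, _ in ranked[:k]]


selected = select_engines(["metasearch", "database"], INTEGRATED_REPRESENTATIVE)
print(selected)
```

Because one structure serves all engines, a query touches only the entries for its own terms, which is one plausible reading of the computational and storage efficiency the abstract attributes to a single integrated representative.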


Published In

ACM Transactions on Information Systems, Volume 19, Issue 3
July 2001
119 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/502115

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Database selection
  2. distributed text retrieval
  3. metasearch engine
  4. resource discovery
