research-article

Kosmix: high-performance topic exploration using the deep web

Published: 01 August 2009

Abstract

Kosmix lies at the intersection of two important trends: topic exploration and the Deep Web. Topic exploration is a new approach to information discovery on the web that satisfies certain use cases not served well by conventional web search. The Deep Web, an inhospitable region for web crawlers, is emerging as a significant information resource. We describe the anatomy of Kosmix, the first general-purpose topic exploration engine to harness the Deep Web using a federated search approach. We focus in particular on the Kosmix approach to query transformation and caching, which is essential to ensure reasonable performance.

        Published In

        Proceedings of the VLDB Endowment, Volume 2, Issue 2
        August 2009
        367 pages

        Publisher

        VLDB Endowment
