Downloading textual hidden web content through keyword queries

Published: 07 June 2005
DOI: 10.1145/1065385.1065407

Abstract

An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users.

In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.
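The iterative scheme described in the abstract (one new keyword query per iteration, chosen automatically) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the policies, so the code substitutes a generic frequency-based heuristic (pick the unissued term that appears in the most documents downloaded so far), and the in-memory `site` dictionary, `search`, and `crawl` are hypothetical stand-ins for a real search form and crawler.

```python
from collections import Counter

def search(site, keyword):
    """Hypothetical stand-in for a search form: ids of documents containing the keyword."""
    return {doc_id for doc_id, text in site.items() if keyword in text.split()}

def crawl(site, seed, max_queries):
    """Issue one new query per iteration, choosing the next keyword greedily.

    Assumed heuristic (not taken from the paper): the next query is the
    unissued term found in the largest number of downloaded documents.
    """
    downloaded = {}   # doc_id -> text retrieved so far
    issued = []       # queries already sent, in order
    query = seed
    while query is not None and len(issued) < max_queries:
        issued.append(query)
        for doc_id in search(site, query):
            downloaded[doc_id] = site[doc_id]
        # Count, for each unissued term, how many downloaded documents contain it.
        counts = Counter()
        for text in downloaded.values():
            counts.update(set(text.split()) - set(issued))
        # Most frequent unissued term; ties are broken alphabetically.
        query = min(counts, key=lambda t: (-counts[t], t)) if counts else None
    return downloaded, issued

# Toy "hidden" site: documents are reachable only through search().
site = {
    1: "web crawler search",
    2: "web search engine",
    3: "library search catalog",
    4: "library catalog archive",
}
downloaded, issued = crawl(site, seed="web", max_queries=3)
```

On this toy site, the seed query retrieves two documents, and the terms they contain lead the loop to documents that the seed alone could not reach. A real crawler would also have to account for the cost of each query and for result pagination, which this sketch ignores.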

Published In

JCDL '05: Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
June 2005
450 pages
ISBN: 1581138768
DOI: 10.1145/1065385
General Chair: Mary Marlino
Program Chairs: Tamara Sumner, Frank Shipman
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptive algorithm
  2. deep web crawler
  3. hidden web crawling
  4. keyword queries
  5. query selection

Qualifiers

  • Article

Conference

JCDL05

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%
