DOI: 10.1145/1062745.1062900
Article

Finding the boundaries of information resources on the web

Published: 10 May 2005

Abstract

In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These units include segments of web pages and aggregations of web pages into web communities. Using such logical information units has been shown to improve the performance of many web algorithms. In this paper, we focus on one type of logical information unit, the compound document. We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents that combines structural and content features of the constituent web pages. The framework combines machine learning and clustering algorithms, with the former supervising the latter. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.
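
The abstract does not say which features, learner, or clustering method the framework uses, so the following is only an illustrative sketch of the general idea of a learned model supervising a clustering step. Everything in it is an assumption made for illustration: the page representation (dicts with `url`, `words`, `links`), the three pair features, the use of logistic regression, and the single-link merging rule. It should not be read as the authors' method.

```python
import math
from itertools import combinations

def pair_features(a, b):
    """Hypothetical features for a pair of pages: bias, two structural, one content."""
    same_dir = 1.0 if a["url"].rsplit("/", 1)[0] == b["url"].rsplit("/", 1)[0] else 0.0
    linked = 1.0 if b["url"] in a["links"] or a["url"] in b["links"] else 0.0
    union = a["words"] | b["words"]
    jaccard = len(a["words"] & b["words"]) / len(union) if union else 0.0
    return [1.0, same_dir, linked, jaccard]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    """Predicted probability that a page pair belongs to the same compound document."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train_logistic(samples, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent on (features, label) pairs."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in samples:
            err = score(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad)]
    return w

def cluster(pages, w, threshold=0.5):
    """Single-link clustering: merge two pages when the learned score exceeds threshold."""
    parent = list(range(len(pages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(pages)), 2):
        if score(w, pair_features(pages[i], pages[j])) > threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i, page in enumerate(pages):
        groups.setdefault(find(i), []).append(page["url"])
    return sorted(sorted(g) for g in groups.values())

# Made-up labeled pairs: 1 = same compound document, 0 = different documents.
training = [
    ([1.0, 1.0, 1.0, 0.5], 1), ([1.0, 1.0, 1.0, 0.8], 1),
    ([1.0, 0.0, 0.0, 0.0], 0), ([1.0, 0.0, 0.0, 0.1], 0),
]
w = train_logistic(training)

# Made-up pages on a hypothetical educational site.
base = "http://example.edu"
pages = [
    {"url": base + "/cs101/index.html", "words": {"cs101", "lecture", "syllabus"},
     "links": {base + "/cs101/lec1.html"}},
    {"url": base + "/cs101/lec1.html", "words": {"cs101", "lecture", "sorting"},
     "links": {base + "/cs101/index.html"}},
    {"url": base + "/bio/index.html", "words": {"biology", "cells"},
     "links": {base + "/bio/lab.html"}},
    {"url": base + "/bio/lab.html", "words": {"biology", "cells", "lab"},
     "links": set()},
]
print(cluster(pages, w))
```

In this sketch the classifier is trained on labeled page pairs and its output decides which merges the clustering step performs, which is one simple way a supervised model can steer an otherwise unsupervised grouping.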

Cited By

  • (2010) Web-site boundary detection. Proceedings of the 10th Industrial Conference on Advances in Data Mining: Applications and Theoretical Aspects, pp. 529-543. DOI: 10.5555/1880672.1880721. Online publication date: 12-Jul-2010.
  • (2010) Web-Site Boundary Detection. Advances in Data Mining. Applications and Theoretical Aspects, pp. 529-543. DOI: 10.1007/978-3-642-14400-4_41. Online publication date: 2010.

Published In

WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web
May 2005
454 pages
ISBN: 1595930515
DOI: 10.1145/1062745

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. WWW
  2. clustering
  3. compound documents

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
