Article

As we may perceive: inferring logical documents from hypertext

Authors:

Pavel Dmitriev,

Boris Suchkov

HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia

Pages 66 - 74

https://doi.org/10.1145/1083356.1083370

Published: 06 September 2005

Abstract

In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages and aggregations of web pages into web communities. Such logical information units improve a variety of web algorithms and provide the building blocks for the construction of organized information spaces such as digital libraries. In this paper, we focus on a type of logical information unit called a "compound document". We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of the constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former supervising the latter. We also propose a new method for evaluating the quality of clusterings, based on a user behavior model. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.
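
The abstract does not give the details of the framework, so the following is only an illustrative sketch of the general idea of a learned model supervising a clustering step, not the authors' method. It assumes a small logistic-regression classifier that scores each hyperlink as "within one compound document" or not, followed by a connected-components clustering over the links scored above a threshold. The two features (a same-directory flag as a structural feature and a text-similarity score as a content feature), the training data, the page names, and the threshold are all hypothetical.

```python
import math
from collections import defaultdict


def score(weights, x):
    """Logistic score that a link stays inside one compound document."""
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=500):
    """Fit a tiny logistic-regression model with stochastic gradient ascent."""
    weights = [0.0] * (len(samples[0]) + 1)  # last entry is the bias term
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - score(weights, x)
            for i, xi in enumerate(x):
                weights[i] += lr * err * xi
            weights[-1] += lr * err
    return weights


def cluster_pages(pages, edges, features, weights, threshold=0.5):
    """Merge pages joined by links the classifier scores at or above threshold."""
    parent = {p: p for p in pages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in edges:
        if score(weights, features[(a, b)]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in pages:
        groups[find(p)].add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical labeled links: [same_directory (0 or 1), text_similarity (0..1)].
train_x = [[1, 0.9], [1, 0.8], [0, 0.1], [0, 0.2], [1, 0.7], [0, 0.3]]
train_y = [1, 1, 0, 0, 1, 0]
weights = train_logistic(train_x, train_y)

# Hypothetical site: three pages of one tutorial plus an unrelated page.
pages = ["a/index", "a/ch1", "a/ch2", "b/home"]
edges = [("a/index", "a/ch1"), ("a/index", "a/ch2"), ("a/index", "b/home")]
features = {
    ("a/index", "a/ch1"): [1, 0.85],
    ("a/index", "a/ch2"): [1, 0.75],
    ("a/index", "b/home"): [0, 0.15],
}
clusters = cluster_pages(pages, edges, features, weights)
print(clusters)
```

In this sketch the classifier decides which links the clustering step is allowed to merge across, which is one simple way a supervised model can guide an otherwise unsupervised grouping. A real system would need richer features and a more careful clustering criterion than connected components.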

Information & Contributors

Information

Published In

HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia

September 2005

310 pages

ISBN: 1595931686

DOI: 10.1145/1083356

General Chair: Siegfried Reich, Salzburg Research, Austria

Program Chair: Manolis Tzagarakis, Computer Technology Institute, Greece

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 September 2005

Qualifiers

Article

Conference

HT05

HT05: 16th Conference on Hypertext and Hypermedia

September 6 - 9, 2005

Salzburg, Austria

Acceptance Rates

Overall Acceptance Rate 378 of 1,158 submissions, 33%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 361

Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Lagoze C, Van de Sompel H, Nelson M, Warner S, Sanderson R, Johnston P (2010). A Web-based resource model for scholarship 2.0: object reuse & exchange. Concurrency and Computation: Practice and Experience, 24:18 (2221-2240). DOI: 10.1002/cpe.1594. Online publication date: 7-Jun-2010.
https://doi.org/10.1002/cpe.1594

Dmitriev P, Huai J, Chen R, Hon H, Liu Y, Ma W, Tomkins A, Zhang X (2008). As we may perceive. Proceedings of the 17th international conference on World Wide Web (1029-1030). DOI: 10.1145/1367497.1367640. Online publication date: 21-Apr-2008.
https://dl.acm.org/doi/10.1145/1367497.1367640

Dmitriev P, Lagoze C (2006). Mining Generalized Graph Patterns Based on User Examples. Proceedings of the Sixth International Conference on Data Mining (857-862). DOI: 10.1109/ICDM.2006.108. Online publication date: 18-Dec-2006.
https://dl.acm.org/doi/10.1109/ICDM.2006.108

Dmitriev P, Lagoze C (2006). Automatically constructing descriptive site maps. Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development (201-212). DOI: 10.1007/11610113_19. Online publication date: 16-Jan-2006.
https://dl.acm.org/doi/10.1007/11610113_19
