DOI: 10.1145/1083356.1083370
Article

As we may perceive: inferring logical documents from hypertext

Published: 06 September 2005

Abstract

In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages or aggregations of web pages into web communities. Such logical information units improve a variety of web algorithms and provide the building blocks for the construction of organized information spaces such as digital libraries. In this paper, we focus on a type of logical information unit called a "compound document". We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former supervising the latter. We also propose a new method for evaluating the quality of clusterings, based on a user behavior model. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.
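
As a rough illustration of clustering that is driven by a learned pairwise model, the sketch below merges any two pages whose same-document probability exceeds a threshold. It is not the paper's algorithm: the function names, the 0.5 threshold, the union-find merging rule, and the directory-based `toy_prob` scorer are all assumptions made for this example, standing in for whatever model and clustering procedure the authors actually used.

```python
from itertools import combinations


def cluster_pages(pages, same_doc_prob, threshold=0.5):
    """Group pages by merging every pair scored at or above `threshold`.

    `same_doc_prob(a, b)` is assumed to return the (learned) probability
    that pages a and b belong to the same compound document.
    """
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent pointers to the cluster representative,
        # compressing the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(pages, 2):
        if same_doc_prob(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in pages:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


def toy_prob(a, b):
    # Hypothetical stand-in for a trained model: pages in the same
    # directory are assumed likely to form one compound document.
    return 0.9 if a.rsplit("/", 1)[0] == b.rsplit("/", 1)[0] else 0.1


pages = [
    "/course/index.html",
    "/course/lec1.html",
    "/course/lec2.html",
    "/about/staff.html",
]
clusters = cluster_pages(pages, toy_prob)
print(clusters)
```

Merging on a single high-scoring pair is the simplest possible rule; a real system might instead require agreement among several pairs before joining two groups.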

Reviews

Xiaoya Tang

Most current research on Web page clustering is based on link structure analysis using predefined criteria, such as user queries. The authors propose an approach to grouping Web pages based on features of both link structure and content, with the probability of two pages belonging to the same cluster being determined through machine learning algorithms.

The authors' experiment used several Web sites, together with human-labeled page clusters called compound documents (cDocs), as training examples. Web page features such as content and title, and various link structures were extracted. A real number vector was used to represent the similarity of such features between a pair of pages. Using feature vectors and the actual human-labeled values, machine learning algorithms were used to learn the criteria to compute the probability of the pair of pages belonging to the same cDoc, based on the content and structural features. Clustering quality was evaluated based on the number of clusters a user needs to view to find all of the correct clusters.

This research investigates the feasibility of using Web page content features, as well as link structures, to aggregate Web pages into semantically meaningful units. Furthermore, because it introduces machine learning techniques to research on Web page aggregation, the approach does not need predefined user queries, and has the potential to be generalized for other Web sites. The procedure to evaluate cluster quality could also be of practical use in similar research.

Online Computing Reviews Service
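
The pairwise learning step described in the review can be sketched roughly as follows. This is only an illustration under stated assumptions: the three features (title overlap, content overlap, direct link), the Jaccard measure, the plain logistic regression trained by stochastic gradient descent, and the four toy pages are all hypothetical choices, not details confirmed by the review.

```python
import math


def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(p, q):
    """Real-valued similarity vector for a pair of pages (assumed features)."""
    linked = q["url"] in p["links"] or p["url"] in q["links"]
    return [
        jaccard(tokens(p["title"]), tokens(q["title"])),      # title similarity
        jaccard(tokens(p["content"]), tokens(q["content"])),  # content similarity
        1.0 if linked else 0.0,                               # structural feature
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def same_cdoc_probability(w, b, p, q):
    x = pair_features(p, q)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy pages and human labels (1 = same cDoc); all values are made up.
a = {"url": "/c/index.html", "title": "Intro to Physics",
     "content": "physics course syllabus lectures", "links": {"/c/lec1.html"}}
b_ = {"url": "/c/lec1.html", "title": "Intro to Physics Lecture 1",
      "content": "physics lectures motion", "links": {"/c/index.html"}}
c = {"url": "/staff.html", "title": "Staff Directory",
     "content": "faculty contact email", "links": set()}
d = {"url": "/news.html", "title": "Campus News",
     "content": "campus events contact", "links": set()}

labeled = [(a, b_, 1), (a, c, 0), (b_, c, 0), (c, d, 0), (a, d, 0), (b_, d, 0)]
X = [pair_features(p, q) for p, q, _ in labeled]
y = [t for _, _, t in labeled]
w, bias = train_logistic(X, y)

p_same = same_cdoc_probability(w, bias, a, b_)
p_diff = same_cdoc_probability(w, bias, c, d)
print(round(p_same, 3), round(p_diff, 3))
```

The resulting probabilities could then be handed to a clustering step, which is where the learned model would take over from fixed, predefined criteria.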

Published In

HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
September 2005
310 pages
ISBN: 1595931686
DOI: 10.1145/1083356
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 September 2005

Author Tags

  1. WWW
  2. clustering
  3. compound documents

Qualifiers

  • Article

Conference

HT05: 16th Conference on Hypertext and Hypermedia
September 6 - 9, 2005
Salzburg, Austria

Acceptance Rates

Overall Acceptance Rate 378 of 1,158 submissions, 33%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 02 Dec 2024

Cited By

  • (2010) A Web-based resource model for scholarship 2.0: object reuse & exchange. Concurrency and Computation: Practice and Experience 24:18 (2221-2240). DOI: 10.1002/cpe.1594. Online publication date: 7-Jun-2010
  • (2008) As we may perceive. Proceedings of the 17th international conference on World Wide Web (1029-1030). DOI: 10.1145/1367497.1367640. Online publication date: 21-Apr-2008
  • (2006) Mining Generalized Graph Patterns Based on User Examples. Proceedings of the Sixth International Conference on Data Mining (857-862). DOI: 10.1109/ICDM.2006.108. Online publication date: 18-Dec-2006
  • (2006) Automatically constructing descriptive site maps. Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development (201-212). DOI: 10.1007/11610113_19. Online publication date: 16-Jan-2006
