A Graph-Based Framework for Web Document Mining

Adam Schenker, Horst Bunke, Mark Last & Abraham Kandel

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3163)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

In this paper we describe methods of performing data mining on web documents, where the web document content is represented by graphs. We show how traditional clustering and classification methods, which usually operate on vector representations of data, can be extended to work with graph-based data. Specifically, we give graph-theoretic extensions of the k-Nearest Neighbors classification algorithm and the k-means clustering algorithm that process graphs, and show how retaining structural information can lead to improved performance over the vector model approach. We introduce several different types of web document representations that utilize graphs and compare their performance for clustering and classification.
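The abstract does not spell out the graph construction or the distance measure, so the following Python sketch is illustrative only and should not be read as the authors' implementation. It assumes that each document becomes a graph whose nodes are its unique terms and whose directed edges link adjacent terms, that graph size is counted as nodes plus edges, and that distance is one minus the size of the maximum common subgraph divided by the size of the larger graph. With unique node labels, the maximum common subgraph reduces to set intersections. All function names (`build_graph`, `mcs_distance`, `knn_classify`) are hypothetical.

```python
from collections import Counter


def build_graph(text):
    """Assumed representation: nodes are unique terms, and a directed
    edge links each term to the term that immediately follows it."""
    words = text.lower().split()
    nodes = frozenset(words)
    edges = frozenset(zip(words, words[1:]))
    return nodes, edges


def size(graph):
    """Assumed size measure: number of nodes plus number of edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance based on the maximum common subgraph (MCS).

    Because node labels are unique, the MCS is simply the shared
    nodes together with the shared edges.
    """
    common = (g1[0] & g2[0], g1[1] & g2[1])
    larger = max(size(g1), size(g2))
    if larger == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - size(common) / larger


def knn_classify(query, training, k=3):
    """k-Nearest Neighbors over graphs: majority label among the k
    training graphs closest to the query under mcs_distance."""
    ranked = sorted(training, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    training = [
        (build_graph("graph matching for document clustering"), "graphs"),
        (build_graph("graph distance and graph matching"), "graphs"),
        (build_graph("stock market prices fall"), "finance"),
        (build_graph("stock prices rise on market news"), "finance"),
    ]
    query = build_graph("graph matching distance")
    print(knn_classify(query, training, k=3))
```

Only the distance function touches the graph structure, so the same k-Nearest Neighbors loop would work unchanged with a different graph distance or a different document-to-graph mapping.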

Author information

Authors and Affiliations

University of South Florida, Tampa, FL, 33620, USA
Adam Schenker & Abraham Kandel
University of Bern, CH-3012, Bern, Switzerland
Horst Bunke
Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Mark Last
Tel-Aviv University, Tel-Aviv, 69978, Israel
Abraham Kandel

Editor information

Editors and Affiliations

Dipartimento di Sistemi e Informatica, Università di Firenze, Via di Santa Marta 3, 50139, Firenze, Italy
Simone Marinai
Knowledge Management Department, German Research Center for Artificial Intelligence (DFKI) GmbH, Kaiserslautern, Germany
Andreas R. Dengel

About this paper

Cite this paper

Schenker, A., Bunke, H., Last, M., Kandel, A. (2004). A Graph-Based Framework for Web Document Mining. In: Marinai, S., Dengel, A.R. (eds) Document Analysis Systems VI. DAS 2004. Lecture Notes in Computer Science, vol 3163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28640-0_38

DOI: https://doi.org/10.1007/978-3-540-28640-0_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23060-1
Online ISBN: 978-3-540-28640-0
eBook Packages: Springer Book Archive

Societies and partnerships

The International Association for Pattern Recognition
