Abstract
In this paper we describe methods for mining web documents whose content is represented by graphs. We show how traditional clustering and classification methods, which usually operate on vector representations of data, can be extended to work with graph-based data. Specifically, we give graph-theoretic extensions of the k-Nearest Neighbors classification algorithm and the k-means clustering algorithm, and show how retaining structural information can improve performance over the vector model approach. We introduce several graph-based representations of web documents and compare their performance for clustering and classification.
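As a rough illustration of how k-Nearest Neighbors can operate on graphs instead of vectors, the Python sketch below makes several assumptions that the abstract does not state. Each document becomes a graph whose nodes are its unique terms and whose directed edges link adjacent terms. Graph size is taken as nodes plus edges. Dissimilarity is a maximum-common-subgraph distance, which reduces to set intersection when node labels are unique. The names `build_graph`, `graph_size`, `mcs_distance`, and `knn_classify` are hypothetical and chosen only for this example.

```python
from collections import Counter


def build_graph(text):
    """Turn a document into (nodes, edges).

    Assumption: nodes are unique lowercase terms and a directed edge
    (a, b) exists when term a is immediately followed by term b.
    """
    words = text.lower().split()
    nodes = set(words)
    edges = {(a, b) for a, b in zip(words, words[1:]) if a != b}
    return nodes, edges


def graph_size(graph):
    """Assumed size measure: number of nodes plus number of edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance 1 - |mcs(g1, g2)| / max(|g1|, |g2|).

    With unique node labels, the maximum common subgraph is just the
    intersection of the node sets and of the edge sets.
    """
    common = len(g1[0] & g2[0]) + len(g1[1] & g2[1])
    largest = max(graph_size(g1), graph_size(g2))
    return 1.0 - common / largest if largest else 0.0


def knn_classify(query, training, k=3):
    """Majority vote among the k training graphs closest to the query.

    `training` is a list of (graph, label) pairs.
    """
    ranked = sorted(training, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The only change from the vector-based classifier is the distance function: the voting step is untouched, which is why the same substitution can in principle be applied to other distance-driven methods such as k-means.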
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schenker, A., Bunke, H., Last, M., Kandel, A. (2004). A Graph-Based Framework for Web Document Mining. In: Marinai, S., Dengel, A.R. (eds) Document Analysis Systems VI. DAS 2004. Lecture Notes in Computer Science, vol 3163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28640-0_38
DOI: https://doi.org/10.1007/978-3-540-28640-0_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23060-1
Online ISBN: 978-3-540-28640-0
eBook Packages: Springer Book Archive