Abstract
One way to retrieve information from the Internet is to classify web pages automatically. In almost all published classification methods, feature selection is a very important issue. Although many feature selection methods have been proposed, most of them focus on the features within a category and ignore the fact that the hierarchy of categories also plays an important role in achieving accurate classification results. This paper proposes a new feature selection method that incorporates hierarchical information, which prevents the classification process from going through every node in the hierarchy. Our test results show that our classification algorithm using hierarchical information reduces the search complexity from n to log(n) and increases the accuracy by 6.2% compared to a related algorithm.
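The complexity claim can be illustrated with a toy example. The sketch below is not the authors' algorithm: it only contrasts a generic top-down descent through a category tree with a flat comparison against every leaf category. The `Category` class, the keyword-overlap `score` function, and the synthetic balanced binary hierarchy are all assumptions made for illustration. In a balanced tree with branching factor b and n leaves, the descent scores roughly b·log_b(n) nodes, whereas the flat approach scores all n leaves.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    keywords: set                      # hypothetical per-node feature set
    children: list = field(default_factory=list)


def score(node, tokens):
    """Toy relevance score: how many page tokens appear in the node's features."""
    return len(node.keywords & tokens)


def build_tree(depth, counter=None):
    """Build a balanced binary hierarchy (an assumption for this sketch).
    Each leaf owns one made-up term; an internal node's feature set is the
    union of its children's feature sets."""
    if counter is None:
        counter = [0]
    if depth == 0:
        i = counter[0]
        counter[0] += 1
        return Category(f"leaf{i}", {f"term{i}"})
    kids = [build_tree(depth - 1, counter) for _ in range(2)]
    merged = set().union(*(k.keywords for k in kids))
    return Category(f"node@{depth}", merged, kids)


def classify_top_down(root, tokens):
    """Follow the best-scoring child at each level.
    Returns (leaf name, number of nodes scored)."""
    node, scored = root, 0
    while node.children:
        scored += len(node.children)
        node = max(node.children, key=lambda c: score(c, tokens))
    return node.name, scored


def leaves(node):
    """Collect all leaf categories under a node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def classify_flat(root, tokens):
    """Score every leaf category, ignoring the hierarchy.
    Returns (leaf name, number of nodes scored)."""
    all_leaves = leaves(root)
    best = max(all_leaves, key=lambda c: score(c, tokens))
    return best.name, len(all_leaves)


root = build_tree(4)                   # 16 leaf categories
page_tokens = {"term11"}               # a page that matches only leaf11
print(classify_top_down(root, page_tokens))
print(classify_flat(root, page_tokens))
```

Both functions pick the same leaf here, but the top-down version scores two nodes per level (eight in total for this tree) while the flat version scores all sixteen leaves. Real classifiers can make a wrong choice at an upper level and never recover, which is one reason feature selection at each level of the hierarchy matters.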
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peng, X., Ming, Z., Wang, H. (2008). Text Learning and Hierarchical Feature Selection in Webpage Classification. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_43
DOI: https://doi.org/10.1007/978-3-540-88192-6_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88191-9
Online ISBN: 978-3-540-88192-6
eBook Packages: Computer Science (R0)