Abstract
Automatic text categorization requires the construction of appropriate surrogates for documents within a text collection. The surrogates, often called document vectors, are used to train learning systems for categorizing unseen documents. A comparison of different measures (tfidf and weirdness) for creating document vectors is presented together with two different state-of-the-art classifiers: Kohonen's unsupervised SOFM and Vapnik's supervised SVM. The methods are tested using two 'gold standard' document collections and one data set from a 'real-world' news stream. There appears to be an optimal size both for the set of document vectors and for the dimensionality of each vector that gives the best compromise between categorization accuracy and training time. The performance of each of the classifiers was computed for five different surrogate vector models: the first two surrogates were created with the tfidf and weirdness measures respectively, the third surrogate was created purely on the basis of high-frequency words in the training corpus, and the fourth vector model was created from a standardised terminology database. Finally, the fifth surrogate (used for evaluation purposes) was based on a random selection of words from the training corpus.
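To make the two term-weighting measures concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common textbook form of tfidf (raw term frequency times the log of inverse document frequency) and takes weirdness to be the ratio of a term's relative frequency in the specialist corpus to its relative frequency in a general-language reference corpus. The add-one smoothing on the general-corpus count, the function names, and the toy tokens are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf * idf} dict per document.

    docs is a list of token lists. Assumed form: raw term count
    multiplied by log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def weirdness(special_tokens, general_tokens):
    """Ratio of relative frequencies: specialist corpus over general corpus.

    Assumption: add-one smoothing on the general-corpus count so that
    terms absent from the general corpus do not divide by zero.
    """
    fs = Counter(special_tokens)
    fg = Counter(general_tokens)
    ns, ng = len(special_tokens), len(general_tokens)
    return {t: (fs[t] / ns) / ((fg[t] + 1) / ng) for t in fs}


def top_terms(scores, k):
    """Pick the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def document_vector(tokens, vocabulary):
    """Surrogate for one document: term counts over the chosen vocabulary."""
    tf = Counter(tokens)
    return [tf[t] for t in vocabulary]
```

In this sketch, the length of the vocabulary passed to `document_vector` is the dimensionality of each surrogate, so varying `k` in `top_terms` is one way to explore the accuracy versus training-time trade-off that the abstract describes.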
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Manomaisupat, P., Vrusias, B., Ahmad, K. (2006). Categorization of Large Text Collections: Feature Selection for Training Neural Networks. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2006. IDEAL 2006. Lecture Notes in Computer Science, vol 4224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11875581_120
DOI: https://doi.org/10.1007/11875581_120
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45485-4
Online ISBN: 978-3-540-45487-8
eBook Packages: Computer Science, Computer Science (R0)