Categorization of Large Text Collections: Feature Selection for Training Neural Networks

Pensiri Manomaisupat,
Bogdan Vrusias &
Khurshid Ahmad

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4224)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

Automatic text categorization requires the construction of appropriate surrogates for the documents in a text collection. These surrogates, often called document vectors, are used to train learning systems to categorise unseen documents. A comparison of two measures for creating document vectors (tfidf and weirdness) is presented, together with two state-of-the-art classifiers: Kohonen’s unsupervised SOFM and Vapnik’s supervised SVM. The methods are tested on two ‘gold standard’ document collections and one data set from a ‘real-world’ news stream. There appears to be an optimal size, both for the number of document vectors and for the dimensionality of each vector, that gives the best compromise between categorization accuracy and training time. The performance of each classifier was computed for five surrogate vector models: the first two were created with the tfidf and weirdness measures respectively, the third purely on the basis of high-frequency words in the training corpus, and the fourth from a standardised terminology database. The fifth surrogate, used for evaluation purposes, was based on a random selection of words from the training corpus.
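
The abstract names tfidf and weirdness as term-scoring measures but does not define them, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes tfidf is the raw term frequency times log(N/df), summed over the collection, and that weirdness is the ratio of a term's relative frequency in the specialist (training) corpus to its relative frequency in a general reference corpus. The add-one smoothing, the function names, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokeniser (a deliberately crude assumption)."""
    return text.lower().split()


def tfidf_scores(docs):
    """Sum of tf * log(N / df) over all tokenised documents, per term."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    scores = Counter()
    for doc in docs:
        for term, freq in Counter(doc).items():
            scores[term] += freq * math.log(n_docs / df[term])
    return dict(scores)


def weirdness_scores(special_docs, general_tokens):
    """Relative frequency in the specialist corpus divided by the
    (add-one smoothed) relative frequency in a general corpus."""
    special = Counter(t for doc in special_docs for t in doc)
    general = Counter(general_tokens)
    n_special = sum(special.values())
    n_general = sum(general.values())
    return {
        term: (freq / n_special) / ((general[term] + 1) / n_general)
        for term, freq in special.items()
    }


def top_terms(scores, k):
    """The k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def document_vector(doc, vocabulary):
    """Term-frequency vector of one document over the selected terms."""
    tf = Counter(doc)
    return [tf[term] for term in vocabulary]


if __name__ == "__main__":
    special_docs = [
        tokenize("shares fell as the bank cut rates"),
        tokenize("the bank raised rates and shares rose"),
    ]
    general_tokens = tokenize(
        "the cat sat on the mat and the dog sat on the bank"
    )
    vocabulary = top_terms(weirdness_scores(special_docs, general_tokens), 2)
    print(vocabulary)
    print([document_vector(doc, vocabulary) for doc in special_docs])
```

The length of `vocabulary` corresponds to the dimensionality of each document vector, and the number of documents passed in corresponds to the number of vectors, the two sizes the abstract identifies as trading accuracy against training time. Swapping `weirdness_scores` for `tfidf_scores` changes only the term ranking, which keeps the surrogate models comparable.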

Author information

Authors and Affiliations

Department of Computing, University of Surrey, Guildford, Surrey, UK
Pensiri Manomaisupat & Bogdan Vrusias
Department of Computer Science, O’Reilly Institute, Trinity College, Dublin 2, Ireland
Khurshid Ahmad

Editor information

Editors and Affiliations

Escuela Politécnica Superior, GICAP Research Group, Universidad de Burgos, Calle Francisco de Vitoria S/N, Edificio C, Campus Vena, 09006, Burgos, Spain
Emilio Corchado
School of Electrical and Electronic Engineering, University of Manchester, UK
Hujun Yin
Department of Information Systems and Computation, Technical University of Valencia, Camino de Vera, Valencia, Spain
Vicente Botti
University of the West of Scotland, PA1 2BE, Paisley, Scotland
Colin Fyfe

About this paper

Cite this paper

Manomaisupat, P., Vrusias, B., Ahmad, K. (2006). Categorization of Large Text Collections: Feature Selection for Training Neural Networks. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2006. IDEAL 2006. Lecture Notes in Computer Science, vol 4224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11875581_120

DOI: https://doi.org/10.1007/11875581_120
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45485-4
Online ISBN: 978-3-540-45487-8
eBook Packages: Computer Science, Computer Science (R0)
