
Categorization of Large Text Collections: Feature Selection for Training Neural Networks

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2006 (IDEAL 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4224)

Abstract

Automatic text categorization requires the construction of appropriate surrogates for documents within a text collection. The surrogates, often called document vectors, are used to train learning systems for categorising unseen documents. A comparison of different measures (tfidf and weirdness) for creating document vectors is presented together with two different state-of-the-art classifiers: Kohonen’s unsupervised SOFM and Vapnik’s supervised SVM. The methods are tested using two ‘gold standard’ document collections and one data set from a ‘real-world’ news stream. There appears to be an optimal size both for the number of document vectors and for the dimensionality of each vector that gives the best compromise between categorization accuracy and training time. The performance of each of the classifiers was computed for five different surrogate vector models: the first two surrogates were created with the tfidf and weirdness measures respectively, the third surrogate was created purely on the basis of high-frequency words in the training corpus, and the fourth vector model was created from a standardised terminology database. Finally, the fifth surrogate (used for evaluation purposes) was based on a random selection of words from the training corpus.
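
The two term-scoring measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the common tf × log(N/df) form of tfidf and the usual definition of weirdness as the ratio of a term's relative frequency in a specialist corpus to its relative frequency in a general-language corpus. The add-one smoothing of the general-corpus count and the helper names (tfidf_scores, weirdness, top_terms, document_vector) are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf * idf} dict per tokenized document.

    Assumed form: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def weirdness(special_tokens, general_tokens):
    """Relative frequency in the specialist corpus divided by relative
    frequency in the general corpus.

    Assumption: the general-corpus count is smoothed by +1 so that terms
    absent from the general corpus do not cause a division by zero.
    """
    special = Counter(special_tokens)
    general = Counter(general_tokens)
    n_special, n_general = len(special_tokens), len(general_tokens)
    return {
        term: (count / n_special) / ((general[term] + 1) / n_general)
        for term, count in special.items()
    }


def top_terms(scores, k):
    """Pick the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def document_vector(doc, vocabulary):
    """Represent a tokenized document as term counts over the vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]
```

In this sketch the length of the vocabulary passed to document_vector plays the role of the vector dimensionality discussed in the abstract: a larger k in top_terms gives longer vectors, which would be expected to cost more training time.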




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Manomaisupat, P., Vrusias, B., Ahmad, K. (2006). Categorization of Large Text Collections: Feature Selection for Training Neural Networks. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2006. IDEAL 2006. Lecture Notes in Computer Science, vol 4224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11875581_120


  • DOI: https://doi.org/10.1007/11875581_120

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45485-4

  • Online ISBN: 978-3-540-45487-8

  • eBook Packages: Computer Science, Computer Science (R0)
