Automated Mining of Relevant N-grams in Relation to Predominant Topics of Text Documents

  • Conference paper
Text, Speech, and Dialogue (TSD 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9302)

Abstract

The article describes a method for the automatic analysis of large collections of short Internet text documents, freely written in various natural languages and represented as sparse vectors, to reveal which multi-word phrases are relevant to a given basic categorization. In addition, the revealed phrases help to discover further predominant topics that are not explicitly expressed by the basic categories. The main idea is to look for n-grams, where an n-gram is a collocation of n consecutive words. This leads to a problem of relevant feature selection, in which a feature is an n-gram that provides more information than an individual word. The feature selection is carried out by entropy minimization, which returns a set of combined relevant n-grams that can be used for creating rules, building decision trees, or information retrieval. The results are demonstrated on English, German, Spanish, and Russian customer reviews of hotel services that are publicly available on the web. The most informative output was given by 3-grams.
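The two ideas named in the abstract, n-grams as runs of n consecutive words and feature selection by entropy minimization, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`ngrams`, `entropy`, `rank_ngrams`), the whitespace tokenization, the presence/absence split, and the four toy reviews are all assumptions made for illustration. Each n-gram is scored by how much splitting the documents on its presence reduces the entropy of the class labels (information gain, the criterion used in C4.5-style decision trees).

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return all n-grams (tuples of n consecutive lowercased words) in text."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_ngrams(docs, labels, n):
    """Rank n-grams by the entropy reduction obtained when the documents
    are split into those that contain the n-gram and those that do not."""
    doc_grams = [set(ngrams(d, n)) for d in docs]
    base = entropy(labels)
    vocabulary = set().union(*doc_grams)
    scores = {}
    for gram in vocabulary:
        present = [y for grams, y in zip(doc_grams, labels) if gram in grams]
        absent = [y for grams, y in zip(doc_grams, labels) if gram not in grams]
        # Weighted entropy remaining after the split; lower is better.
        remaining = (len(present) * entropy(present)
                     + len(absent) * entropy(absent)) / len(labels)
        scores[gram] = base - remaining
    # Highest gain first; ties are broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy reviews, invented for this sketch only.
    docs = [
        "the staff was very friendly",
        "very friendly staff and clean room",
        "the room was very dirty",
        "very dirty bathroom and rude staff",
    ]
    labels = ["positive", "positive", "negative", "negative"]
    for gram, gain in rank_ngrams(docs, labels, 2)[:3]:
        print(" ".join(gram), round(gain, 3))
```

In this toy set the 2-grams "very friendly" and "very dirty" each separate the two classes perfectly and so receive the maximum gain of 1 bit, while "was very" occurs once in each class and receives a gain of 0. Calling `rank_ngrams` with `n=3` scores 3-grams in the same way.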

Author information

Corresponding author

Correspondence to Jan Žižka.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Žižka, J., Dařena, F. (2015). Automated Mining of Relevant N-grams in Relation to Predominant Topics of Text Documents. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_52

  • DOI: https://doi.org/10.1007/978-3-319-24033-6_52

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24032-9

  • Online ISBN: 978-3-319-24033-6

  • eBook Packages: Computer Science, Computer Science (R0)
