Automated Mining of Relevant N-grams in Relation to Predominant Topics of Text Documents

Jan Žižka & František Dařena

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9302)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

The article describes a method for the automatic analysis of large collections of short Internet text documents, freely written in various natural languages and represented as sparse vectors, that reveals which multi-word phrases are relevant to a given basic categorization. The revealed phrases also serve to discover additional predominant topics that are not explicitly expressed by the basic categories. The main idea is to look for n-grams, where an n-gram is a collocation of n consecutive words. This leads to a problem of relevant feature selection in which a feature is an n-gram that provides more information than an individual word. Feature selection is carried out by entropy minimization, which returns a set of combined relevant n-grams that can be used for creating rules, decision trees, or information retrieval. The results are demonstrated on English, German, Spanish, and Russian customer reviews of hotel services publicly available on the web. The most informative output was given by 3-grams.
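
The idea of treating n-grams as features and selecting them by entropy can be illustrated with a minimal sketch. It assumes that "entropy minimization" is approximated by information gain over the binary presence of an n-gram in a document; the abstract does not give the exact procedure, so this is an illustration and not the authors' implementation. The function names (`ngrams`, `entropy`, `information_gain`, `rank_ngrams`) and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the n-grams (tuples of n consecutive words) found in text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, gram):
    """Entropy reduction from splitting docs on the presence of gram."""
    n = len(gram)
    present = [lab for doc, lab in zip(docs, labels) if gram in ngrams(doc, n)]
    absent = [lab for doc, lab in zip(docs, labels) if gram not in ngrams(doc, n)]
    total = len(labels)
    remainder = ((len(present) / total) * entropy(present)
                 + (len(absent) / total) * entropy(absent))
    return entropy(labels) - remainder


def rank_ngrams(docs, labels, n):
    """Rank every n-gram seen in docs by information gain, highest first."""
    vocabulary = {g for doc in docs for g in ngrams(doc, n)}
    scored = [(information_gain(docs, labels, g), g) for g in vocabulary]
    # Sort by descending gain; break ties alphabetically for stable output.
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical toy reviews and class labels, invented for illustration only.
reviews = [
    "very friendly staff and clean room",
    "friendly staff and great breakfast",
    "room was not clean at all",
    "staff was not helpful at all",
]
classes = ["positive", "positive", "negative", "negative"]

for gain, gram in rank_ngrams(reviews, classes, 3)[:3]:
    print(f"{gain:.3f}  {' '.join(gram)}")
```

On this toy data the 3-gram "friendly staff and" ranks first, because it occurs in both positive reviews and in neither negative one, so splitting on it removes all class uncertainty. A real collection would need proper tokenization and a sparse representation instead of recomputing n-grams per document.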

Author information

Authors and Affiliations

Department of Informatics, FBE, Mendel University in Brno, Zemědělská 1, 613 00, Brno, Czech Republic
Jan Žižka & František Dařena

Corresponding author

Correspondence to Jan Žižka.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Pavel Král
University of West Bohemia, Pilsen, Czech Republic
Václav Matoušek

About this paper

Cite this paper

Žižka, J., Dařena, F. (2015). Automated Mining of Relevant N-grams in Relation to Predominant Topics of Text Documents. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_52

DOI: https://doi.org/10.1007/978-3-319-24033-6_52
Published: 11 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6
eBook Packages: Computer Science, Computer Science (R0)
