Abstract
This article describes a method for automatically analyzing large collections of short textual documents from the Internet, freely written in various natural languages and represented as sparse vectors, in order to reveal which multi-word phrases are relevant to a given basic categorization. The revealed phrases also serve to discover additional predominant topics that the basic categories do not express explicitly. The main idea is to look for n-grams, where an n-gram is a collocation of n consecutive words. This leads to a problem of relevant feature selection, in which a feature is an n-gram that provides more information than an individual word. Feature selection is carried out by entropy minimization, which returns a set of combined relevant n-grams that can be used for creating rules, building decision trees, or information retrieval. The results are demonstrated on publicly available customer reviews of hotel services written in English, German, Spanish, and Russian. The most informative output was given by 3-grams.
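The idea of selecting n-grams by entropy minimization can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the exact procedure, so the sketch assumes that relevance is measured by information gain (the reduction in class entropy when documents are split by the presence of an n-gram), the criterion commonly used in decision tree induction. The function names `ngrams`, `entropy`, `information_gain`, and `rank_ngrams` are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return all n-grams (tuples of n consecutive words) found in text."""
    words = text.lower().split()  # assumption: plain whitespace tokenization
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, gram):
    """Entropy reduction obtained by splitting docs on the presence of gram."""
    n = len(gram)
    present = [y for d, y in zip(docs, labels) if gram in ngrams(d, n)]
    absent = [y for d, y in zip(docs, labels) if gram not in ngrams(d, n)]
    total = len(labels)
    # Weighted entropy that remains after the split.
    remainder = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - remainder


def rank_ngrams(docs, labels, n):
    """Rank every n-gram in the collection by information gain, best first."""
    candidates = {g for d in docs for g in ngrams(d, n)}
    scored = [(information_gain(docs, labels, g), g) for g in candidates]
    # Ties are broken alphabetically so that the ordering is deterministic.
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Calling `rank_ngrams(docs, labels, 3)` on a list of review texts and their category labels would return the 3-grams ordered by how much each one reduces the uncertainty about the category. Recomputing the n-grams of every document for each candidate keeps the sketch short; a practical implementation would precompute a sparse document-by-n-gram matrix instead.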
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Žižka, J., Dařena, F. (2015). Automated Mining of Relevant N-grams in Relation to Predominant Topics of Text Documents. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_52
DOI: https://doi.org/10.1007/978-3-319-24033-6_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6
eBook Packages: Computer Science, Computer Science (R0)