Abstract
In this paper, we propose an approach to improving text categorization through interaction with the user. The quality of categorization is defined in terms of the distribution of objects belonging to each class when projected onto a self-organizing map (SOM). For the experiments, we use articles and categories from a subset of Simple Wikipedia. We test three different approaches to text representation. As a baseline we use Bag-of-Words with Term Frequency-Inverse Document Frequency (TF-IDF) weighting, against which we evaluate two neural representations of words and documents: Word2Vec and Paragraph Vector. Within each representation, we identify the subsets of features that are most useful for differentiating classes. These are presented to the user, whose selection increases the coherence of articles that belong to the same category and thus brings them closer together on the SOM.
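To make the baseline representation concrete, the following is a minimal sketch of Bag-of-Words with TF-IDF weighting. It assumes one common variant (relative term frequency multiplied by the natural logarithm of inverse document frequency); the exact variant used in the paper is not stated in the abstract. The function name `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical, not taken from Simple Wikipedia).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell on the market".split(),
]
vectors = tfidf(docs)
```

Under this variant, a term that occurs in every document (here `the`) receives weight zero, while terms confined to a single document receive the highest weights. Such terms are natural candidates for features that differentiate classes.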
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Atroszko, J., Szymański, J., Gil, D., Mora, H. (2018). Text Categorization Improvement via User Interaction. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2018. Lecture Notes in Computer Science, vol 10842. Springer, Cham. https://doi.org/10.1007/978-3-319-91262-2_24
DOI: https://doi.org/10.1007/978-3-319-91262-2_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91261-5
Online ISBN: 978-3-319-91262-2
eBook Packages: Computer Science, Computer Science (R0)