Abstract
Word embedding technique is among the most widely known and used representations of text documents vocabulary. It serves to capture word context in a document, but in many applications the need is to understand the content of text, which is longer than just a single word, that’s what we call “Document Embedding”. This paper presents an empirical study that evaluates the morphosyntactic data preprocessing impact on document embedding techniques over textual semantic similarity evaluation task, and that by comparing the impact of the most widely known text preprocessing techniques, such as: (1) Cleaning technique containing stop-words removal, lowercase conversion, punctuation and number elimination, (2) Stemming technique using the most known algorithms in the literature: Porter, Snowball and Lancaster stemmer and (3) Lemmatization technique using Wordnet Lemmatizer. Experimental analysis on MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing techniques improve classifier accuracy, where Stemming methods outperforms other techniques.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Batet, M., Sanchez, D.: A review on semantic similarity. In: Encyclopedia of Information Science and Technology, Third Edition, pp. 7575–7583. IGI Global (2015)
Bengio, Y., et al.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
Camacho-Collados, J., Pilehvar, M.T.: On the role of text preprocessing in neural network architectures: an evaluation study on text categorization and sentiment analysis. arXiv preprint arXiv:1707-1780 (2017)
Chu, H.: Information Representation and Retrieval in the Digital Age. Information Today, Inc., Medford (2003)
Dolan, B., Quirk, C., Brockett, C.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 350. Association for Computational Linguistics (2004)
Duwairi, R., El-Orfali, M.: A study of the effects of preprocessing strategies on sentiment analysis for Arabic text. J. Inform. Sci. 40(4), 501–513 (2014)
Ibrahim, A., Katz, B., Lin, J.: Extracting structural paraphrases from aligned monolingual corpora. In: Proceedings of the Second International Workshop on Paraphrasing, vol. 16, pp. 57–64. Association for Computational Linguistics (2003)
Kenter, T., De Rijke, M.: Short text similarity with word embeddings. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1411–1420. ACM (2015)
Kiela, D., Clark, S.: A systematic study of semantic vector space model parameters. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pp. 21–30 (2014)
Kosala, R., Blockeel, H.: Web mining research: a survey. ACM Sigkdd Explor. Newsl. 2(1), 1–15 (2000)
Le, Q., Mikolov, T.: Distributed representations of sentences and douments. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
Lodhi, H., et al.: Text classification using string kernels. J. Mach. Learn. Res. 2(Feb), 419–444 (2002)
Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751 (2013)
Mikolov, T., et al.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Pak, M.Y., Gunal, S.: The impact of text representation and preprocessing on author identification. Anadolu Üniversitesi Bilim Ve Teknoloji Dergisi A-Uygulamalı Bilimler ve Mühendislik 18(1), 218–224 (2017)
Park, E.-K., Ra, D.-Y., Jang, M.-G.: Techniques for improving web retrieval effectiveness. Inf. Process. Manage. 41(5), 1207–1223 (2005)
Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
Sergienko, R., Shan, M., Minker, W.: A comparative study of text preprocessing approaches for topic detection of user utterances. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 1826–1831 (2016)
Uysal, A.K., Gunal, S.: The impact of preprocessing on text classification. Inform. Process. Manage. 50(1), 104–112 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Yahi, N., Belhadef, H. (2020). Morphosyntactic Preprocessing Impact on Document Embedding: An Empirical Study on Semantic Similarity. In: Saeed, F., Mohammed, F., Gazem, N. (eds) Emerging Trends in Intelligent Computing and Informatics. IRICT 2019. Advances in Intelligent Systems and Computing, vol 1073. Springer, Cham. https://doi.org/10.1007/978-3-030-33582-3_12
Download citation
DOI: https://doi.org/10.1007/978-3-030-33582-3_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33581-6
Online ISBN: 978-3-030-33582-3
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)