Abstract
In recent years, we have seen a steady increase in the number of social networks worldwide. Among them, Twitter has consolidated its position as one of the most influential social platforms, with Brazilian Portuguese speakers holding the fifth position in the number of users. Due to the informal linguistic style of tweets, the discovery of information in such an environment poses a challenge to Natural Language Processing (NLP) tasks such as sentiment analysis. In this work, we frame sentiment analysis as a binary (positive and negative) and multiclass (positive, negative, and neutral) classification task at the level of tweets written in Portuguese. Following a feature extraction approach, embeddings are first gathered for a tweet and then given as input to train a classifier. This study was designed to evaluate the effectiveness of different word representations, from the original pre-trained language model to continued pre-training strategies, in improving the predictive performance of sentiment classification, using three different classifier algorithms and eight Portuguese tweet datasets. Because of the lack of a language model specific to Brazilian Portuguese tweets, we have expanded our evaluation to consider six different embeddings: fastText, GloVe, Word2Vec, BERT-multilingual (mBERT), BERTweet, and BERTimbau. The experiments showed that BERTimbau, whose embeddings are trained from scratch solely on the target Portuguese language, outperforms the static representations (fastText, GloVe, and Word2Vec) and the Transformer-based models mBERT and BERTweet. In addition, we show that extracting the contextualized embedding without any adjustment to the pre-trained language model is the best approach for most datasets.
Acknowledgements
This research was partially financed by CNPq (National Council for Scientific and Technological Development) under Grants 311275/2020-6 and 315750/2021-9 and FAPERJ—Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro, Process SEI-260003/000614/2023, E-26/202.914/2019 and E-26/201.139/2022.
Appendix: Evaluation of static embeddings
An initial evaluation using three static embeddings (fastText, Word2Vec, and GloVe) was performed to identify the best static embedding for our scenario. Tables 38, 39, and 40 present accuracy (Acc) and F1-score (F1) for all three static embeddings with the classifiers LR, SVM, and XGB, respectively. The superiority of fastText is clear on all datasets when combined with LR and SVM, and on almost all datasets when combined with XGB. The same holds for the multiclass datasets, as shown in Tables 41, 42, and 43. This superiority can be explained by fastText's ability to handle subwords, an important feature when working with tweets, which usually follow an informal style with heavy use of abbreviations and misspelled words.
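To illustrate why subword handling may help with informal spelling, the following minimal sketch extracts character n-grams in the style described for fastText by Bojanowski et al. (2017): each word is wrapped in the boundary markers < and >, and n-grams of length 3 to 6 are collected. The function names `char_ngrams` and `shared_ngrams` and the example words are hypothetical choices made for this illustration; the sketch is not the fastText implementation and does not reproduce the experiments reported here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, using < and > as boundary markers.

    The 3-to-6 range follows the setting described in the fastText paper;
    it is an assumption here, not a value taken from this study.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the marked word.
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def shared_ngrams(a, b, min_n=3, max_n=6):
    """Return the set of n-grams that two words have in common."""
    return set(char_ngrams(a, min_n, max_n)) & set(char_ngrams(b, min_n, max_n))


# Hypothetical example: a standard spelling and an elongated, informal variant.
standard = "obrigado"
informal = "obrigadoo"
overlap = shared_ngrams(standard, informal)
print(len(overlap), "n-grams shared between", standard, "and", informal)
```

Because a subword model composes a word vector from the vectors of its n-grams, a variant that never appeared in the training corpus can still receive a representation close to that of the standard form whenever the two share many n-grams. Static models that store one vector per whole word have no comparable fallback for unseen spellings.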
About this article
Cite this article
Vianna, D., Carneiro, F., Carvalho, J. et al. Sentiment analysis in Portuguese tweets: an evaluation of diverse word representation models. Lang Resources & Evaluation 58, 223–272 (2024). https://doi.org/10.1007/s10579-023-09661-4
DOI: https://doi.org/10.1007/s10579-023-09661-4