Abstract
This paper explores methods to disambiguate Part-of-Speech (PoS) tags of closed class words in Brazilian Portuguese corpora annotated according to the Universal Dependencies annotation model. We evaluate disambiguation methods from different paradigms, namely a Markov-based method, a widely adopted parsing tool, and a BERT-based language modeling method. We compare their performance with two baselines and observe a significant increase of more than 10% over the baselines for all proposed methods. We also show that, while the BERT-based model outperforms the others, reaching in the best case 98% accuracy in predicting the correct PoS tag, combining the three methods as an ensemble offers more stable results, as indicated by the smaller variance across the experiments we performed.
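To make the ensemble idea concrete, the following is a minimal sketch of one common way to combine three taggers: a per-token majority vote. The abstract does not state which combination rule the paper uses, so the voting rule, the fallback to a designated tagger when all three disagree, and the names `majority_vote` and `fallback_index` are assumptions made purely for illustration.

```python
from collections import Counter


def majority_vote(predictions, fallback_index=0):
    """Combine the PoS tags proposed by several taggers for one token.

    predictions: list of tags, one per tagger (hypothetical interface).
    fallback_index: which tagger to trust when no tag gets more than
    one vote (an assumption, not the paper's rule).
    """
    counts = Counter(predictions)
    tag, freq = counts.most_common(1)[0]
    if freq == 1:
        # Every tagger proposed a different tag: defer to one of them.
        return predictions[fallback_index]
    return tag


# Hypothetical outputs of three taggers for an ambiguous closed class word.
votes = ["SCONJ", "PRON", "SCONJ"]
print(majority_vote(votes))
```

With three voters, a tag wins as soon as two taggers agree; the fallback only matters in the case of a three-way disagreement.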
Notes
- 1.
Similarly to English with the ending “-ly”, in Portuguese it is possible to turn adjectives into adverbs by adding “-mente” at the end. We disregard these derived adverbs, as only the primitive adverbs form a closed class.
- 2.
The verbs “ser” and “estar” (“to be” in English) are always annotated as AUX, whether they act as true auxiliary verbs or as copula verbs. The verbs “ir”, “haver”, and “ter” (“to go”, “to exist”, and “to have” in English) are sometimes annotated as VERB and sometimes as AUX (like “going to” and “have” + a past participle in English).
- 3.
While adjectives are not a closed class, the adjectives that are ordinal numbers are considered to belong to a closed subset of the ADJ class.
- 4.
- 5.
The values reported as averages are the macro average of the values of each fold; since the folds have about the same size, the micro and macro averages are practically the same (less than 0.01% difference).
- 6.
For reproducibility purposes, all data (including fold splits) and implementation of all methods are available at https://sites.google.com/icmc.usp.br/poetisa/publications.
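As an illustration of note 5, the sketch below contrasts macro and micro averaging of per-fold accuracy. The fold counts are invented for the example and are not the paper's data; the function names are likewise hypothetical.

```python
def macro_average(folds):
    """Mean of the per-fold accuracies; each fold weighs the same."""
    return sum(correct / total for correct, total in folds) / len(folds)


def micro_average(folds):
    """Accuracy over all tokens pooled together; larger folds weigh more."""
    total_correct = sum(correct for correct, _ in folds)
    total_tokens = sum(total for _, total in folds)
    return total_correct / total_tokens


# Hypothetical (correct, total) token counts for folds of similar size.
folds = [(970, 1000), (965, 1000), (981, 1002)]
print(macro_average(folds), micro_average(folds))
```

When every fold has exactly the same number of tokens the two averages coincide, and they drift apart only as fold sizes become unbalanced, which is why near-equal folds yield nearly identical values.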
Acknowledgements
This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant number 2019/07665-4) and from the IBM Corporation. The project was also supported by the Ministry of Science, Technology and Innovation, with resources of Law N. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lopes, L., Fernandes, P., Inacio, M.L., Duran, M.S., Pardo, T.A.S. (2023). Disambiguation of Universal Dependencies Part-of-Speech Tags of Closed Class Words in Portuguese. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_16
DOI: https://doi.org/10.1007/978-3-031-45392-2_16
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45391-5
Online ISBN: 978-3-031-45392-2
eBook Packages: Computer Science, Computer Science (R0)