Abstract
We present a new resource for discourse studies in Portuguese, the CRPC Discourse Bank (CRPC-DB). CRPC-DB follows the Penn Discourse Treebank style of annotation. The annotation is performed on the PAROLE corpus, a free subset of the Reference Corpus of Contemporary Portuguese (CRPC) that includes news, fiction and didactic/scientific texts. The discourse bank covers explicit and implicit relations at the intra- and inter-sentential levels, and currently includes a total of 14,436 discourse relations. We present the main guidelines of our annotation and discuss specific cases. An inter-annotator agreement experiment yielded an F1-score of 0.88 for discourse relation identification, a Cohen’s κ of 0.71 for the classification of discourse relation types, and 0.75 for top-level sense classification. The CRPC-DB will be distributed free of charge through the PORTULAN CLARIN infrastructure.
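For readers unfamiliar with the two agreement measures named above, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation script: the function names, the exact-match criterion for relation identification, and the toy labels and spans are all assumptions made for illustration, and none of the data comes from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def identification_f1(relations_a, relations_b):
    """F1 of annotator B against annotator A, assuming exact-match relations."""
    set_a, set_b = set(relations_a), set(relations_b)
    matched = len(set_a & set_b)
    if matched == 0:
        return 0.0
    precision = matched / len(set_b)
    recall = matched / len(set_a)
    return 2 * precision * recall / (precision + recall)


# Invented toy data: sense labels for four relations found by both annotators.
senses_a = ["Expansion", "Contingency", "Expansion", "Temporal"]
senses_b = ["Expansion", "Contingency", "Comparison", "Temporal"]
kappa = cohen_kappa(senses_a, senses_b)

# Invented toy data: relations identified as (start, end) token offsets.
found_a = [(0, 5), (10, 12), (20, 30)]
found_b = [(0, 5), (10, 12), (40, 45), (50, 52)]
f1 = identification_f1(found_a, found_b)
```

With exact matching, F1 is symmetric in the two annotators, so the choice of which one counts as the reference does not change the score; a more lenient partial-overlap criterion would need a different matching step.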
Acknowledgements
This work was partially supported by PORTULAN CLARIN-Research Infrastructure for the Science and Technology of Language, funded by Lisboa2020, Alentejo2020 and FCT-Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016, and by FCT under the project UIDP/00214/2020. Some of its developments were implemented in the scope of the COST Action TextLink - Structuring Discourse in Multilingual Europe. We wish to thank the anonymous reviewers for their helpful comments.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Mendes, A., Lejeune, P. (2022). CRPC-DB a Discourse Bank for Portuguese. In: Pinheiro, V., et al. Computational Processing of the Portuguese Language. PROPOR 2022. Lecture Notes in Computer Science, vol 13208. Springer, Cham. https://doi.org/10.1007/978-3-030-98305-5_8
DOI: https://doi.org/10.1007/978-3-030-98305-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98304-8
Online ISBN: 978-3-030-98305-5
eBook Packages: Computer Science, Computer Science (R0)