Abstract
The use of Transformers for text processing has attracted a great deal of attention in recent years. This is particularly true for sentence models, whose strong capacity to comprehend and generate text contextually has improved predictive performance across a range of Natural Language Processing tasks compared with previous approaches. Even so, several challenges remain when these models are applied to long documents, especially in knowledge areas with very specific characteristics, such as legislative proposals. This study investigated different strategies for applying BERT-based models to the retrieval of long documents written in Brazilian Portuguese. We used three corpora from the Brazilian Chamber of Deputies to build a dataset and assess the models, incorporating zero-shot and fine-tuning strategies. Five sentence models were evaluated: BERTimbau, LegalBert, LegalBert-pt, LegalBERTimbau, and LaBSE. We also assessed a summarized corpus of bills, given the input size limitation of the sentence models. Finally, we propose a hybrid model, named HIRS, which combines BM25 with a fine-tuned BERTimbau. According to the experimental results, HIRS outperformed all the other models, reaching a Recall of 84.78% for 20 retrieved documents.
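The abstract describes fusing lexical (BM25) and dense (BERTimbau-based) relevance signals and reporting Recall for the top 20 retrieved documents. The snippet below is a minimal illustrative sketch of such a hybrid scoring scheme, not the authors' HIRS implementation: the checkpoint name, the min-max normalization, the fusion weight, and the Recall@k helper are assumptions introduced only for illustration.

```python
# Illustrative sketch of hybrid lexical + dense retrieval (NOT the authors' HIRS code).
# Assumptions: rank_bm25 and sentence-transformers are installed; BERTimbau is wrapped
# as a sentence encoder with default mean pooling; scores are min-max normalized and
# fused with an arbitrary weight alpha.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

documents = [
    "Projeto de lei que dispõe sobre educação ambiental nas escolas.",
    "Proposta que altera o código tributário para pequenas empresas.",
    "Projeto que institui programa de incentivo à energia solar.",
]
query = "incentivo a energia renovável"

# Lexical scores: BM25 over whitespace-tokenized, lowercased text.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
lexical = np.array(bm25.get_scores(query.lower().split()))

# Dense scores: cosine similarity between BERTimbau sentence embeddings.
encoder = SentenceTransformer("neuralmind/bert-base-portuguese-cased")  # BERTimbau base
doc_emb = encoder.encode(documents, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
dense = util.cos_sim(query_emb, doc_emb).cpu().numpy().ravel()

def minmax(scores: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1]; a constant vector maps to zeros."""
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

alpha = 0.5  # fusion weight (hypothetical; it would be tuned in practice)
hybrid = alpha * minmax(lexical) + (1 - alpha) * minmax(dense)
ranking = np.argsort(-hybrid).tolist()  # document indices, best first

def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant documents found among the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

print(ranking, recall_at_k(ranking, relevant_ids=[2], k=20))
```

In this sketch the two score distributions are put on a common scale before fusion; any reranking variant (e.g., BM25 candidate generation followed by dense rescoring) would follow the same pattern.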
Notes
- 1.
- 2.
Translated version of: “Sumarize o texto a seguir. Retorne o sumário em um único parágrafo abrangendo os pontos principais que foram identificados no texto: \n <texto_original> \n RESUMO:”. A minimal sketch of filling this template is shown after these notes.
- 3.
huggingface.co/.
- 4.
- 5.
- 6.
- 7.
- 8.
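Note 2 above documents the prompt template used to summarize bills before indexing. The snippet below is a minimal sketch of how such a template could be filled in before being passed to an instruction-following language model; the helper name, the English gloss, and the character cap are hypothetical and not taken from the paper.

```python
# Hypothetical helper for assembling the summarization prompt from Note 2.
# English gloss of the template: "Summarize the following text. Return the summary
# in a single paragraph covering the main points identified in the text: \n
# <original_text> \n SUMMARY:".
PROMPT_TEMPLATE = (
    "Sumarize o texto a seguir. Retorne o sumário em um único parágrafo "
    "abrangendo os pontos principais que foram identificados no texto:\n"
    "{texto_original}\n"
    "RESUMO:"
)

def build_summarization_prompt(bill_text: str, max_chars: int = 8000) -> str:
    """Insert the bill text into the template, truncating overly long inputs
    (the 8000-character cap is an arbitrary illustration, not the paper's setting)."""
    return PROMPT_TEMPLATE.format(texto_original=bill_text[:max_chars])

if __name__ == "__main__":
    print(build_summarization_prompt("Projeto de lei que dispõe sobre ..."))
```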
References
Bast, H., Buchhold, B., Haussmann, E.: Semantic search on text and knowledge bases. Found. Trends® Inf. Retrieval 10(2–3), 119–271 (2016)
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., Androutsopoulos, I.: LEGAL-BERT: the muppets straight out of law school. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2898–2904. Association for Computational Linguistics, November 2020
Cordeiro, N.P., Dias, J., Santos, P.A.: LeSSE: a semantic search engine applied to Portuguese consumer law. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds.) Progress in Artificial Intelligence, EPIA 2023, LNCS, vol. 14116, pp. 118–130. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-49011-8_10
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., Wang, W.: Language-agnostic BERT sentence embedding. CoRR abs/2007.01852 (2020)
da Fonseca, G.H.G.: Recuperação de informação [Information retrieval] (2020)
Gomes, T., Ladeira, M.: A new conceptual framework for enhancing legal information retrieval at the Brazilian Superior Court of Justice. In: Proceedings of the 12th International Conference on Management of Digital EcoSystems, MEDES 2020, pp. 26–29. Association for Computing Machinery, New York, NY, USA (2020)
José, M.M., José, M.A., Mauá, D.D., Cozman, F.G.: Integrating question answering and text-to-SQL in Portuguese. In: Pinheiro, V., et al. (eds.) Computational Processing of the Portuguese Language, pp. 278–287. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-030-98305-5_26
Kamphuis, C., de Vries, A.P., Boytsov, L., Lin, J.: Which BM25 do you mean? A large-scale reproducibility study of scoring variants. In: Jose, J.M., et al. (eds.) Advances in Information Retrieval, pp. 28–34. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_4
Lee, H.D., Lee, S., Kang, U.: AUBER: automated BERT regularization. PLOS ONE 16(6), 1–16 (2021)
Lin, J., Nogueira, R., Yates, A.: Pretrained transformers for text ranking: BERT and beyond (2021)
Melo, R., Santos, P.A., Dias, J.: A semantic search system for the Supremo Tribunal de Justiça. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds.) Progress in Artificial Intelligence, pp. 142–154. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-49011-8_12
Min, B., et al.: Recent advances in natural language processing via large pre-trained language models: a survey. CoRR abs/2111.01243 (2021)
Paul, S., Mandal, A., Goyal, P., Ghosh, S.: Pre-trained language models for the legal domain: a case study on Indian law. In: Proceedings of the 19th International Conference on Artificial Intelligence and Law, ICAIL 2023 (2023)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks, August 2019
Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retrieval 3(4), 333–389 (2009)
Rosa, G.M., Rodrigues, R.C., de Alencar Lotufo, R., Nogueira, R.: To tune or not to tune? Zero-shot models for legal case entailment. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, ICAIL 2021, pp. 295–300. Association for Computing Machinery, New York, NY, USA (2021)
Savelka, J.: Discovering sentences for argumentation about the meaning of statutory terms, August 2020
Schütze, H., Manning, C.D., Raghavan, P.: Introduction to Information Retrieval, vol. 39. Cambridge University Press, Cambridge (2008)
Silva, N., et al.: Evaluating topic models in Portuguese political comments about bills from Brazil's Chamber of Deputies. In: Anais da X Brazilian Conference on Intelligent Systems. SBC, Porto Alegre, RS, Brasil (2021)
Silveira, R., Ponte, C., Almeida, V., Pinheiro, V., Furtado, V.: LegalBert-PT: A pretrained language model for the Brazilian Portuguese legal domain. In: Naldi, M.C., Bianchi, R.A.C. (eds.) Intelligent Systems, pp. 268–282. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-45392-2_18
Souza, E., et al.: An information retrieval pipeline for legislative documents from the Brazilian Chamber of Deputies, vol. 346, pp. 119–126. IOS Press BV, December 2021
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28
Tüselmann, O., Fink, G.A.: Exploring semantic word representations for recognition-free NLP on handwritten document images. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds.) Document Analysis and Recognition - ICDAR 2023, ICDAR 2023, LNCS, vol. 14190, pp. 85–100. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-41685-9_6
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Yang, Y., Wu, Z., Yang, Y., Lian, S., Guo, F., Wang, Z.: A survey of information extraction based on deep learning. Appl. Sci. 12(19), 9691 (2022)
Zhang, Y., Li, X., Zhang, Z.: Disease-pertinent knowledge extraction in online health communities using GRU based on a double attention mechanism. IEEE Access 8, 95947–95955 (2020)
Acknowledgement
This research was financed in part by CAPES (Brazil) and by the National Institute of Artificial Intelligence (IAIA), using computational resources provided by CeMEAI (funded by FAPESP grant 2013/07375-0).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
dos Santos, J.A. et al. (2025). HIRS: A Hybrid Information Retrieval System for Legislative Documents. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science(), vol 14967. Springer, Cham. https://doi.org/10.1007/978-3-031-73497-7_26
DOI: https://doi.org/10.1007/978-3-031-73497-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73496-0
Online ISBN: 978-3-031-73497-7
eBook Packages: Computer Science, Computer Science (R0)