Abstract
The use of Transformers for text processing has attracted a great deal of attention in recent years. This is particularly true for sentence models, whose strong capacity to comprehend and generate text contextually has improved predictive performance across a range of Natural Language Processing tasks compared with previous approaches. Even so, several challenges remain when these models are applied to long documents, especially in knowledge areas with very specific characteristics, such as legislative proposals. This study investigated different strategies for applying BERT-based models to the retrieval of long documents written in Brazilian Portuguese. We used three corpora from the Brazilian Chamber of Deputies to build a dataset and assess the models, incorporating zero-shot and fine-tuning strategies. Five sentence models were evaluated: BERTimbau, LegalBert, LegalBert-pt, LegalBERTimbau, and LaBSE. We also assessed a summarized corpus of bills, given the input size limitation of the sentence models. Finally, we propose a hybrid model, named HIRS, which combines BM25 with a fine-tuned BERTimbau. According to the experimental results, HIRS outperformed all the other models, reaching a Recall of 84.78% for 20 retrieved documents.
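The abstract describes fusing lexical (BM25) and dense (BERTimbau-based) relevance signals and reporting Recall for the top 20 retrieved documents. The snippet below is a minimal illustrative sketch of such a hybrid scoring scheme, not the authors' HIRS implementation: the checkpoint name, the min-max normalization, the fusion weight, and the Recall@k helper are assumptions introduced only for illustration.

```python
# Illustrative sketch of hybrid lexical + dense retrieval (NOT the authors' HIRS code).
# Assumptions: rank_bm25 and sentence-transformers are installed; BERTimbau is wrapped
# as a sentence encoder with default mean pooling; scores are min-max normalized and
# fused with an arbitrary weight alpha.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

documents = [
    "Projeto de lei que dispõe sobre educação ambiental nas escolas.",
    "Proposta que altera o código tributário para pequenas empresas.",
    "Projeto que institui programa de incentivo à energia solar.",
]
query = "incentivo a energia renovável"

# Lexical scores: BM25 over whitespace-tokenized, lowercased text.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
lexical = np.array(bm25.get_scores(query.lower().split()))

# Dense scores: cosine similarity between BERTimbau sentence embeddings.
encoder = SentenceTransformer("neuralmind/bert-base-portuguese-cased")  # BERTimbau base
doc_emb = encoder.encode(documents, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
dense = util.cos_sim(query_emb, doc_emb).cpu().numpy().ravel()

def minmax(scores: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1]; a constant vector maps to zeros."""
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

alpha = 0.5  # fusion weight (hypothetical; it would be tuned in practice)
hybrid = alpha * minmax(lexical) + (1 - alpha) * minmax(dense)
ranking = np.argsort(-hybrid).tolist()  # document indices, best first

def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant documents found among the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

print(ranking, recall_at_k(ranking, relevant_ids=[2], k=20))
```

In this sketch the two score distributions are put on a common scale before fusion; any reranking variant (e.g., BM25 candidate generation followed by dense rescoring) would follow the same pattern.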
Notes
- 1.
- 2.
Translated version of: “Sumarize o texto a seguir. Retorne o sumário em um único parágrafo abrangendo os pontos principais que foram identificados no texto: \n <texto_original> \n RESUMO:”. A minimal sketch of filling this template is shown after these notes.
- 3.
huggingface.co/.
- 4.
- 5.
- 6.
- 7.
- 8.
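Note 2 above documents the prompt template used to summarize bills before indexing. The snippet below is a minimal sketch of how such a template could be filled in before being passed to an instruction-following language model; the helper name, the English gloss, and the character cap are hypothetical and not taken from the paper.

```python
# Hypothetical helper for assembling the summarization prompt from Note 2.
# English gloss of the template: "Summarize the following text. Return the summary
# in a single paragraph covering the main points identified in the text: \n
# <original_text> \n SUMMARY:".
PROMPT_TEMPLATE = (
    "Sumarize o texto a seguir. Retorne o sumário em um único parágrafo "
    "abrangendo os pontos principais que foram identificados no texto:\n"
    "{texto_original}\n"
    "RESUMO:"
)

def build_summarization_prompt(bill_text: str, max_chars: int = 8000) -> str:
    """Insert the bill text into the template, truncating overly long inputs
    (the 8000-character cap is an arbitrary illustration, not the paper's setting)."""
    return PROMPT_TEMPLATE.format(texto_original=bill_text[:max_chars])

if __name__ == "__main__":
    print(build_summarization_prompt("Projeto de lei que dispõe sobre ..."))
```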
References
Bast, H., Buchhold, B., Haussmann, E.: Semantic search on text and knowledge bases. Found. Trends® Inf. Retrieval 10(2–3), 119–271 (2016)
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., Androutsopoulos, I.: LEGAL-BERT: the muppets straight out of law school. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2898–2904. Association for Computational Linguistics, November 2020
Cordeiro, N.P., Dias, J., Santos, P.A.: LeSSE: a semantic search engine applied to Portuguese consumer law. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds.) Progress in Artificial Intelligence, EPIA 2023, LNCS, vol. 14116, pp. 118–130. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-49011-8_10
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., Wang, W.: Language-agnostic BERT sentence embedding. CoRR abs/2007.01852 (2020)
da Fonseca, G.H.G.: Recuperação de informação [Information retrieval] (2020)
Gomes, T., Ladeira, M.: A new conceptual framework for enhancing legal information retrieval at the Brazilian Superior Court of Justice. In: Proceedings of the 12th International Conference on Management of Digital EcoSystems, MEDES 2020, pp. 26–29. Association for Computing Machinery, New York, NY, USA (2020)
José, M.M., José, M.A., Mauá, D.D., Cozman, F.G.: Integrating question answering and text-to-SQL in Portuguese. In: Pinheiro, V., et al. (eds.) Computational Processing of the Portuguese Language, pp. 278–287. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-030-98305-5_26
Kamphuis, C., de Vries, A.P., Boytsov, L., Lin, J.: Which BM25 do you mean? A large-scale reproducibility study of scoring variants. In: Jose, J.M., et al. (eds.) Advances in Information Retrieval, pp. 28–34. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_4
Lee, H.D., Lee, S., Kang, U.: AUBER: automated BERT regularization. PLOS ONE 16(6), 1–16 (2021)
Lin, J., Nogueira, R., Yates, A.: Pretrained transformers for text ranking: BERT and beyond (2021)
Melo, R., Santos, P.A., Dias, J.: A semantic search system for the Supremo Tribunal de Justiça. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds.) Progress in Artificial Intelligence, pp. 142–154. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-49011-8_12
Min, B., et al.: Recent advances in natural language processing via large pre-trained language models: a survey. CoRR abs/2111.01243 (2021)
Paul, S., Mandal, A., Goyal, P., Ghosh, S.: Pre-trained language models for the legal domain: a case study on Indian law. In: Proceedings of the 19th International Conference on Artificial Intelligence and Law, ICAIL 2023 (2023)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks, August 2019
Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retrieval 3(4), 333–389 (2009)
Rosa, G.M., Rodrigues, R.C., de Alencar Lotufo, R., Nogueira, R.: To tune or not to tune? Zero-shot models for legal case entailment. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, ICAIL 2021, pp. 295–300. Association for Computing Machinery, New York, NY, USA (2021)
Savelka, J.: Discovering sentences for argumentation about the meaning of statutory terms, August 2020
Schütze, H., Manning, C.D., Raghavan, P.: Introduction to Information Retrieval, vol. 39. Cambridge University Press, Cambridge (2008)
Silva, N., et al.: Evaluating topic models in Portuguese political comments about bills from Brazil's Chamber of Deputies. In: Anais da X Brazilian Conference on Intelligent Systems. SBC, Porto Alegre, RS, Brasil (2021)
Silveira, R., Ponte, C., Almeida, V., Pinheiro, V., Furtado, V.: LegalBert-PT: A pretrained language model for the Brazilian Portuguese legal domain. In: Naldi, M.C., Bianchi, R.A.C. (eds.) Intelligent Systems, pp. 268–282. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-45392-2_18
Souza, E., et al.: An information retrieval pipeline for legislative documents from the Brazilian Chamber of Deputies, vol. 346, pp. 119–126. IOS Press BV, December 2021
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28
Tüselmann, O., Fink, G.A.: Exploring semantic word representations for recognition-free NLP on handwritten document images. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds.) Document Analysis and Recognition - ICDAR 2023, ICDAR 2023, LNCS, vol. 14190, pp. 85–100. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-41685-9_6
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Yang, Y., Wu, Z., Yang, Y., Lian, S., Guo, F., Wang, Z.: A survey of information extraction based on deep learning. Appl. Sci. 12(19), 9691 (2022)
Zhang, Y., Li, X., Zhang, Z.: Disease-pertinent knowledge extraction in online health communities using GRU based on a double attention mechanism. IEEE Access 8, 95947–95955 (2020)
Acknowledgement
This research was financed in part by CAPES (Brazil) and by the National Institute of Artificial Intelligence (IAIA), using computational resources provided by CeMEAI (funded by FAPESP grant 2013/07375-0).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
dos Santos, J.A. et al. (2025). HIRS: A Hybrid Information Retrieval System for Legislative Documents. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science(), vol 14967. Springer, Cham. https://doi.org/10.1007/978-3-031-73497-7_26
DOI: https://doi.org/10.1007/978-3-031-73497-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73496-0
Online ISBN: 978-3-031-73497-7
eBook Packages: Computer Science, Computer Science (R0)