Sensitive Topics Retrieval in Digital Libraries: A Case Study of ḥadīṯ collections

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15178)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

The advent of Large Language Models (LLMs) has led to the development of new Question-Answering (QA) systems based on Retrieval-Augmented Generation (RAG), which incorporate query-specific knowledge at inference time. This paper investigates the trustworthiness of RAG systems, focusing on the performance of their retrieval phase when dealing with sensitive topics. The issue is particularly relevant because poor retrieval could hinder a user’s ability to analyze sections of the available corpora, effectively biasing any subsequent research. To mimic a specialised library possibly containing sensitive topics, a ḥadīṯ dataset has been curated using an ad-hoc framework called Question-Classify-Retrieve (QCR), which automatically assesses the performance of document retrieval in three main steps: Question Generation, Passage Classification, and Passage Retrieval. Different sentence embedding models for document retrieval were tested, showing a significant performance gap between sensitive and non-sensitive topics compared to the baseline. In real-world applications, this would mean that relevant documents are placed lower in the retrieval list, leading to the presence of irrelevant information or, with a lower cut-off, the absence of relevant information.
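
The effect of the cut-off described above can be illustrated with a minimal sketch. It is not the QCR framework itself: the function names, the toy passage identifiers, and the choice of recall at a cut-off k as the measure are all assumptions made for illustration, and the data are invented rather than taken from the paper.

```python
from typing import Dict, List


def recall_at_k(ranked_ids: List[str], relevant_id: str, k: int) -> float:
    """Return 1.0 if the relevant passage is within the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_by_group(runs: List[dict], k: int) -> Dict[str, float]:
    """Average recall@k separately for each topic group (hypothetical labels)."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for run in runs:
        group = run["group"]
        totals[group] = totals.get(group, 0.0) + recall_at_k(
            run["ranked"], run["relevant"], k
        )
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Invented toy rankings: one relevant passage per query.
# In the "sensitive" queries the relevant passage sits lower in the list.
runs = [
    {"group": "non-sensitive", "ranked": ["p1", "p2", "p3", "p4"], "relevant": "p1"},
    {"group": "non-sensitive", "ranked": ["p5", "p6", "p7", "p8"], "relevant": "p6"},
    {"group": "sensitive", "ranked": ["p9", "p10", "p11", "p12"], "relevant": "p11"},
    {"group": "sensitive", "ranked": ["p13", "p14", "p15", "p16"], "relevant": "p16"},
]

for k in (2, 4):
    print(k, mean_recall_by_group(runs, k))
```

In this toy data, a cut-off of 2 retrieves every relevant passage for the non-sensitive queries and none for the sensitive ones, while a cut-off of 4 retrieves all of them at the cost of more irrelevant passages per query.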

Notes

1.
The ITSERR research project is funded by the NextGenerationEU program provided by the Italian Ministry of Research to enhance the European Research Infrastructure RESILIENCE, so that it can better meet the needs of the Religious Studies scientific community in terms of technological integration and increased innovative potential.
2.
Chunking refers to the common pre-processing step of dividing the documents into smaller chunks of text. This makes the embedding vectors and the retrieved passages more content-specific.
3.
Accuracy is the base metric used for model evaluation, describing the number of correct predictions over all predictions. Precision measures how many of the positive predictions made are true positives. Recall, sometimes also referred to as Sensitivity, measures how many of the positive cases in the data the classifier correctly predicted. F1 Score combines precision and recall and is generally described as the harmonic mean of the two.
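
The four metrics defined in note 3 can be computed from binary labels as in the following minimal sketch. The function name `binary_metrics`, the 0/1 label encoding, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration; the labels are invented and do not come from the paper.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented example: 1 could stand for a passage labelled "sensitive".
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

Here there are 2 true positives, 1 false negative, 1 false positive and 4 true negatives, so accuracy is 6/8 while precision, recall and F1 are each 2/3.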

Acknowledgments

This work was supported by the PNRR project Italian Strengthening of Esfri RI Resilience (ITSERR) funded by the European Union - NextGenerationEU (CUP:B53C22001770006).

Author information

Authors and Affiliations

Università di Modena e Reggio Emilia (UNIMORE), Modena, Italy
Giovanni Sullutrone, Riccardo Amerigo Vigliermo, Luca Sala & Sonia Bergamaschi
DBGroup, UNIMORE, Modena, Italy
Giovanni Sullutrone, Luca Sala & Sonia Bergamaschi
Fondazione per le Scienze Religiose (FSCIRE), Bologna, Italy
Riccardo Amerigo Vigliermo

Authors

Giovanni Sullutrone
Riccardo Amerigo Vigliermo
Luca Sala
Sonia Bergamaschi

Corresponding author

Correspondence to Giovanni Sullutrone.

Editor information

Editors and Affiliations

University of Salford, Salford, UK
Apostolos Antonacopoulos
University of Waikato, Hamilton, New Zealand
Annika Hinze
Sorbonne University (CNRS), Paris, France
Benjamin Piwowarski
University of La Rochelle (L3i Laboratory), La Rochelle, France
Mickaël Coustaty
University of Padova, Padua, Italy
Giorgio Maria Di Nunzio
University of Hamburg, Hamburg, Germany
Francesco Gelati
University of Waikato, Hamilton, New Zealand
Nicholas Vanderschantz

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

About this paper

Cite this paper

Sullutrone, G., Vigliermo, R.A., Sala, L., Bergamaschi, S. (2024). Sensitive Topics Retrieval in Digital Libraries: A Case Study of ḥadīṯ collections. In: Antonacopoulos, A., et al. Linking Theory and Practice of Digital Libraries. TPDL 2024. Lecture Notes in Computer Science, vol 15178. Springer, Cham. https://doi.org/10.1007/978-3-031-72440-4_5

DOI: https://doi.org/10.1007/978-3-031-72440-4_5
Published: 25 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72439-8
Online ISBN: 978-3-031-72440-4
eBook Packages: Computer Science (R0)
