Sensitive Topics Retrieval in Digital Libraries: A Case Study of ḥadīṯ collections

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2024)

Abstract

The advent of Large Language Models (LLMs) has led to the development of new Question-Answering (QA) systems based on Retrieval-Augmented Generation (RAG), which incorporate query-specific knowledge at inference time. This paper investigates the trustworthiness of RAG systems, focusing on the performance of their retrieval phase when dealing with sensitive topics. The issue matters because poor retrieval could hinder a user’s ability to analyze sections of the available corpora, effectively biasing any subsequent research. To mimic a specialised library possibly containing sensitive topics, a ḥadīṯ dataset has been curated using an ad-hoc framework called Question-Classify-Retrieve (QCR), which automatically assesses the performance of document retrieval by operating in three main steps: Question Generation, Passage Classification, and Passage Retrieval. Different sentence embedding models for document retrieval were tested, showing a significant performance gap between sensitive and non-sensitive topics compared to the baseline. In real-world applications, this would mean relevant documents being placed lower in the retrieval list, leading to the presence of irrelevant information or, with a lower cut-off, the absence of relevant information.
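The effect of a cut-off on a relevant passage that is ranked too low can be illustrated with a minimal sketch. It is not the authors' QCR implementation: the two-dimensional "embeddings", the passage ids, and the helper names (`cosine`, `rank_passages`, `recall_at_k`) are hypothetical and chosen only to make the ranking easy to follow by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted by decreasing similarity to the query."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    return [pid for _, pid in sorted(scored, reverse=True)]


def recall_at_k(ranking, relevant_ids, k):
    """Fraction of relevant passages found in the top-k results."""
    hits = sum(1 for pid in ranking[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


# Toy, hand-made vectors (assumed values, not taken from the paper).
passages = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.0, 1.0],
    "p4": [0.6, 0.8],  # the only relevant passage, embedded far from the query
}
query = [1.0, 0.0]
relevant = {"p4"}

ranking = rank_passages(query, passages)
for k in (2, 3):
    print(k, recall_at_k(ranking, relevant, k))
```

In this toy setting the relevant passage lands in third place, so a cut-off of 2 returns only irrelevant passages while a cut-off of 3 recovers it.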

Notes

  1. The ITSERR research project is funded by the NextGenerationEU program provided by the Italian Ministry of Research, to enhance the European Research Infrastructure RESILIENCE to better meet the needs of the Religious Studies scientific community in terms of technological integration and increased innovative potential.

  2. Chunking refers to the common pre-processing step of dividing the documents into smaller chunks of text. This makes the embedding vectors and the retrieved passages more content-specific.

  3. Accuracy is the base metric used for model evaluation, describing the fraction of correct predictions over all predictions. Precision measures how many of the positive predictions made are true positives. Recall, sometimes also referred to as Sensitivity, measures how many of the positive cases in the data the classifier correctly predicted. F1 Score combines precision and recall and is generally defined as the harmonic mean of the two.
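As an illustration of the chunking step described in note 2, the following is a minimal sketch of fixed-size word chunking with overlap. The function name `chunk_words`, the word-based splitting, and the `chunk_size` and `overlap` values are assumptions made for this example; the source does not state which chunking strategy or sizes were used.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny example with one-letter "words" so the overlap is easy to see.
example_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(example_chunks)
```

Each chunk would then be embedded separately, so a query is matched against a short, focused passage instead of a whole document.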
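The four metrics defined in note 3 can be computed from the counts of true and false positives and negatives. The sketch below assumes binary labels where 1 marks the positive class; the function name `binary_metrics` and the example labels are hypothetical and not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 4 positive and 4 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these labels there are 2 true positives, 2 false negatives, 1 false positive and 3 true negatives, giving an accuracy of 5/8, a precision of 2/3, a recall of 1/2 and an F1 score of 4/7.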

Acknowledgments

This work was supported by the PNRR project Italian Strengthening of Esfri RI Resilience (ITSERR) funded by the European Union - NextGenerationEU (CUP:B53C22001770006).

Author information

Corresponding author

Correspondence to Giovanni Sullutrone.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sullutrone, G., Vigliermo, R.A., Sala, L., Bergamaschi, S. (2024). Sensitive Topics Retrieval in Digital Libraries: A Case Study of ḥadīṯ collections. In: Antonacopoulos, A., et al. Linking Theory and Practice of Digital Libraries. TPDL 2024. Lecture Notes in Computer Science, vol 15178. Springer, Cham. https://doi.org/10.1007/978-3-031-72440-4_5

  • DOI: https://doi.org/10.1007/978-3-031-72440-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72439-8

  • Online ISBN: 978-3-031-72440-4

  • eBook Packages: Computer Science, Computer Science (R0)
