Abstract
Text retrieval using learned dense representations has recently emerged as a promising alternative to “traditional” text retrieval using sparse bag-of-words representations. One foundational work that has garnered much attention is the dense passage retriever (DPR) proposed by Karpukhin et al. for end-to-end open-domain question answering. This work presents a reproduction and replication study of DPR. We first verify the reproducibility of the DPR model checkpoints by training passage and query encoders from scratch using two different implementations: the original code released by the authors and another independent codebase. After that, we conduct a detailed replication study of the retrieval stage, starting with model checkpoints provided by the authors but with an independent implementation from our group’s Pyserini IR toolkit and PyGaggle neural text ranking library. Although our experimental results largely verify the claims of the original DPR paper, we arrive at two important additional findings: First, it appears that the original authors under-report the effectiveness of the BM25 baseline and hence also dense–sparse hybrid retrieval results. Second, by incorporating evidence from the retriever and improved answer span scoring, we manage to improve end-to-end question answering effectiveness using the same DPR models.
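The dense–sparse hybrid retrieval mentioned above is typically realized by linearly interpolating each document's dense (DPR) score with its sparse (BM25) score and re-ranking by the fused score. The sketch below illustrates this fusion step only; the function name `fuse_hybrid`, the weight `alpha`, and the handling of documents missing from one ranked list are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of dense-sparse hybrid retrieval via linear score
# interpolation: fused = dense + alpha * sparse. Names and defaults
# here are illustrative, not the paper's exact formulation.

def fuse_hybrid(dense_scores, sparse_scores, alpha=1.0, k=10):
    """Fuse two score dictionaries (doc id -> score) and return the
    top-k (doc id, fused score) pairs, highest score first.

    A document absent from one list is backfilled with that list's
    minimum score, so it is not unfairly zeroed out.
    """
    min_dense = min(dense_scores.values()) if dense_scores else 0.0
    min_sparse = min(sparse_scores.values()) if sparse_scores else 0.0
    docs = set(dense_scores) | set(sparse_scores)
    fused = {
        d: dense_scores.get(d, min_dense)
           + alpha * sparse_scores.get(d, min_sparse)
        for d in docs
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Example: d2 appears only in the sparse list but wins after fusion.
ranked = fuse_hybrid({"d1": 0.9, "d2": 0.7},
                     {"d2": 12.0, "d3": 10.0}, alpha=0.2, k=2)
```

Because BM25 and inner-product scores live on different scales, the interpolation weight matters in practice; the abstract's finding that the BM25 baseline was under-reported implies the hybrid numbers shift with it.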
References
Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on Freebase from question-answer pairs. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, pp. 1533–1544. Association for Computational Linguistics (2013)
Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading Wikipedia to answer open-domain questions. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, British Columbia, Canada, pp. 1870–1879 (2017)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Linguistics (2019)
Gao, L., Zhang, Y., Han, J., Callan, J.: Scaling deep contrastive learning batch size under memory limited setup. In: Proceedings of the 6th Workshop on Representation Learning for NLP (2021)
Hofstätter, S., Althammer, S., Schröder, M., Sertkan, M., Hanbury, A.: Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv:2010.02666 (2020)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(3), 535–547 (2021)
Joshi, M., Choi, E., Weld, D., Zettlemoyer, L.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1601–1611. Association for Computational Linguistics (2017)
Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781 (2020)
Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 452–466 (2019)
Lin, J., Ma, X., Lin, S.C., Yang, J.H., Pradeep, R., Nogueira, R.: Pyserini: a Python toolkit for reproducible information retrieval research with sparse and dense representations. In: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pp. 2356–2362 (2021)
Lin, S.C., Yang, J.H., Lin, J.: In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pp. 163–173 (2021)
Mao, Y., et al.: Generation-augmented retrieval for open-domain question answering. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4089–4100. Online (2021)
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392 (2016)
Voorhees, E.M., Tice, D.M.: The TREC-8 question answering track evaluation. In: Proceedings of the Eighth Text REtrieval Conference (TREC-8), Gaithersburg, Maryland, pp. 83–106 (1999)
Xie, Y., et al.: Distant supervision for multi-stage fine-tuning in retrieval-based question answering. In: Proceedings of the Web Conference 2020 (WWW 2020), pp. 2934–2940 (2020)
Xiong, L., et al.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021) (2021)
Yang, P., Fang, H., Lin, J.: Anserini: enabling the use of Lucene for information retrieval research. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (2017)
Yang, W., et al.: End-to-end open-domain question answering with BERTserini. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 72–77 (2019)
Zhan, J., Mao, J., Liu, Y., Zhang, M., Ma, S.: RepBERT: contextualized text embeddings for first-stage retrieval. arXiv:2006.15498 (2020)
Acknowledgment
This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada. Computational resources were provided by Compute Ontario and Compute Canada.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ma, X., Sun, K., Pradeep, R., Li, M., Lin, J. (2022). Another Look at DPR: Reproduction of Training and Replication of Retrieval. In: Hagen, M., et al. Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13185. Springer, Cham. https://doi.org/10.1007/978-3-030-99736-6_41
DOI: https://doi.org/10.1007/978-3-030-99736-6_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99735-9
Online ISBN: 978-3-030-99736-6
eBook Packages: Computer Science, Computer Science (R0)