Abstract
Pre-trained Transformers (e.g., BERT) are commonly used to initialize the parameters of dense retrieval models, and recent studies explore more effective pre-training tasks to further improve the quality of dense vectors. Although various novel and effective tasks have been proposed, their differing input formats and learning objectives make them hard to integrate for jointly improving model performance. In this work, we aim to unify a variety of pre-training tasks under the bottlenecked masked autoencoder paradigm and integrate them into a multi-task pre-trained model, namely MASTER. Concretely, MASTER adopts a shared-encoder multi-decoder architecture that constructs a representation bottleneck to compress the abundant semantic information across tasks into dense vectors. On top of this architecture, we integrate three types of representative pre-training tasks: corrupted passage recovery, related passage recovery, and PLM outputs recovery, which respectively capture inner-passage information, inter-passage relations, and the knowledge of PLMs. Extensive experiments show that our approach outperforms competitive dense retrieval methods. Our code and data are publicly available at https://github.com/microsoft/SimXNS.
K. Zhou: This work was done during an internship at MSRA.
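To make the shared-encoder multi-decoder design more concrete, the following is a minimal PyTorch sketch of a bottlenecked masked autoencoder with one shallow decoder per pre-training task. All class names, layer counts, and dimensions here are illustrative assumptions rather than the authors' implementation, which is available in the linked repository.

# Hypothetical sketch of a shared-encoder, multi-decoder bottlenecked masked
# autoencoder in the spirit of MASTER (names and sizes are illustrative).
import torch
import torch.nn as nn

class BottleneckedMAE(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, enc_layers=12,
                 dec_layers=2, n_decoders=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        # Deep shared encoder: compresses each passage into one dense vector.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        # Several shallow decoders, one per pre-training task (corrupted-,
        # related-, and PLM-output recovery), all reading the same bottleneck.
        dec_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.decoders = nn.ModuleList(
            nn.TransformerEncoder(dec_layer, num_layers=dec_layers)
            for _ in range(n_decoders))
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, passage_ids, decoder_inputs):
        # Bottleneck: the first ([CLS]) position of the encoder output.
        enc_out = self.encoder(self.embed(passage_ids))
        bottleneck = enc_out[:, :1, :]                      # (B, 1, H)
        logits = []
        for dec, ids in zip(self.decoders, decoder_inputs):
            # Each decoder sees only the dense vector plus its own (masked)
            # task input, forcing information through the bottleneck.
            dec_in = torch.cat([bottleneck, self.embed(ids)], dim=1)
            logits.append(self.lm_head(dec(dec_in)[:, 1:, :]))
        return bottleneck.squeeze(1), logits

# Illustrative usage: one batch of passages plus three task-specific inputs.
# model = BottleneckedMAE()
# vec, task_logits = model(torch.randint(0, 30522, (2, 64)),
#                          [torch.randint(0, 30522, (2, 48)) for _ in range(3)])

In pre-training, each decoder's output would be scored with a masked-token cross-entropy loss against its own recovery target and the per-task losses summed; at fine-tuning time only the shared encoder would be kept as the dense retriever. Consult the linked repository for the actual implementation details.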
Acknowledgement
Kun Zhou, Wayne Xin Zhao and Ji-Rong Wen were partially supported by National Natural Science Foundation of China under Grant No. 62222215, Beijing Natural Science Foundation under Grant No. 4222027, Beijing Outstanding Young Scientist Program under Grant No. BJJWZYJH012019100020098, and the Outstanding Innovative Talents Cultivation Funded Programs 2021 of Renmin University of China. Xin Zhao is the corresponding author.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, K. et al. (2023). MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders Are Better Dense Retrievers. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14170. Springer, Cham. https://doi.org/10.1007/978-3-031-43415-0_37
DOI: https://doi.org/10.1007/978-3-031-43415-0_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43414-3
Online ISBN: 978-3-031-43415-0
eBook Packages: Computer Science, Computer Science (R0)