Abstract
Pre-trained Transformers (e.g., BERT) are commonly used to initialize the parameters of dense retrieval models, and recent studies explore more effective pre-training tasks to further improve the quality of dense vectors. Although various novel and effective tasks have been proposed, their differing input formats and learning objectives make them hard to integrate for jointly improving model performance. In this work, we aim to unify a variety of pre-training tasks under the bottlenecked masked autoencoder paradigm and integrate them into a multi-task pre-trained model, namely MASTER. Concretely, MASTER adopts a shared-encoder multi-decoder architecture that constructs a representation bottleneck to compress the abundant semantic information across tasks into dense vectors. On top of this architecture, we integrate three types of representative pre-training tasks: corrupted passage recovery, related passage recovery, and PLM outputs recovery, which respectively capture inner-passage information, inter-passage relations, and the knowledge of PLMs. Extensive experiments show that our approach outperforms competitive dense retrieval methods. Our code and data are publicly available at https://github.com/microsoft/SimXNS.
K. Zhou: This work was done during an internship at MSRA.
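To make the shared-encoder multi-decoder design more concrete, the following is a minimal PyTorch sketch of a bottlenecked masked autoencoder with one shallow decoder per pre-training task. All class names, layer counts, and dimensions here are illustrative assumptions rather than the authors' implementation, which is available in the linked repository.

# Hypothetical sketch of a shared-encoder, multi-decoder bottlenecked masked
# autoencoder in the spirit of MASTER (names and sizes are illustrative).
import torch
import torch.nn as nn

class BottleneckedMAE(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, enc_layers=12,
                 dec_layers=2, n_decoders=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        # Deep shared encoder: compresses each passage into one dense vector.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        # Several shallow decoders, one per pre-training task (corrupted-,
        # related-, and PLM-output recovery), all reading the same bottleneck.
        dec_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.decoders = nn.ModuleList(
            nn.TransformerEncoder(dec_layer, num_layers=dec_layers)
            for _ in range(n_decoders))
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, passage_ids, decoder_inputs):
        # Bottleneck: the first ([CLS]) position of the encoder output.
        enc_out = self.encoder(self.embed(passage_ids))
        bottleneck = enc_out[:, :1, :]                      # (B, 1, H)
        logits = []
        for dec, ids in zip(self.decoders, decoder_inputs):
            # Each decoder sees only the dense vector plus its own (masked)
            # task input, forcing information through the bottleneck.
            dec_in = torch.cat([bottleneck, self.embed(ids)], dim=1)
            logits.append(self.lm_head(dec(dec_in)[:, 1:, :]))
        return bottleneck.squeeze(1), logits

# Illustrative usage: one batch of passages plus three task-specific inputs.
# model = BottleneckedMAE()
# vec, task_logits = model(torch.randint(0, 30522, (2, 64)),
#                          [torch.randint(0, 30522, (2, 48)) for _ in range(3)])

In pre-training, each decoder's output would be scored with a masked-token cross-entropy loss against its own recovery target and the per-task losses summed; at fine-tuning time only the shared encoder would be kept as the dense retriever. Consult the linked repository for the actual implementation details.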
Acknowledgement
Kun Zhou, Wayne Xin Zhao and Ji-Rong Wen were partially supported by National Natural Science Foundation of China under Grant No. 62222215, Beijing Natural Science Foundation under Grant No. 4222027, Beijing Outstanding Young Scientist Program under Grant No. BJJWZYJH012019100020098, and the Outstanding Innovative Talents Cultivation Funded Programs 2021 of Renmin University of China. Xin Zhao is the corresponding author.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, K. et al. (2023). MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders Are Better Dense Retrievers. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14170. Springer, Cham. https://doi.org/10.1007/978-3-031-43415-0_37
DOI: https://doi.org/10.1007/978-3-031-43415-0_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43414-3
Online ISBN: 978-3-031-43415-0
eBook Packages: Computer Science, Computer Science (R0)