
Ensemble Distillation for BERT-Based Ranking Models

Published: 31 August 2021

Abstract

Over the past two years, large pretrained language models such as BERT have been applied to text ranking problems and have shown superior performance on multiple public benchmark data sets. Prior work demonstrated that an ensemble of multiple BERT-based ranking models can not only boost performance but also reduce performance variance. However, an ensemble of models is more costly because it requires computing resources and/or inference time proportional to the number of models. In this paper, we study how to retain the performance of an ensemble of models at the inference cost of a single model by distilling the ensemble into a single BERT-based student ranking model. Specifically, we study different designs of teacher labels, various distillation strategies, and multiple distillation losses tailored for ranking problems. We conduct experiments on the MS MARCO passage ranking and TREC-COVID data sets. Our results show that even with these simple distillation techniques, the distilled model can effectively retain the performance gain of the ensemble of multiple models. More interestingly, the performance of the distilled models is also more stable than that of models fine-tuned on the original labeled data. These results reveal a promising direction for capitalizing on the gains achieved by an ensemble of BERT-based ranking models.
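To make the idea concrete, below is a minimal sketch of one way ensemble distillation for ranking can be set up: the relevance scores of several BERT-based teacher rankers are averaged into a soft teacher label per (query, document) pair, and the student is trained with a listwise softmax cross-entropy loss against those soft labels. This is only an illustration in the spirit of the options the paper studies; the function names, the score-averaging choice, and the toy numbers below are assumptions for this sketch, not the authors' implementation or any TF-Ranking API.

```python
# Hedged sketch of ensemble distillation for a ranking model (not the paper's code).
import numpy as np

def ensemble_teacher_scores(teacher_scores: np.ndarray) -> np.ndarray:
    """Average teacher scores of shape (num_teachers, num_docs) into soft labels (num_docs,).
    Score averaging is one possible teacher-label design; others are conceivable."""
    return teacher_scores.mean(axis=0)

def listwise_distillation_loss(student_scores: np.ndarray,
                               teacher_labels: np.ndarray) -> float:
    """Softmax cross-entropy between the teacher and student score distributions
    over one query's candidate list (one example of a ranking-tailored loss)."""
    def softmax(x: np.ndarray) -> np.ndarray:
        z = x - x.max()          # stabilize before exponentiation
        e = np.exp(z)
        return e / e.sum()
    p_teacher = softmax(teacher_labels)               # soft target distribution
    log_p_student = np.log(softmax(student_scores) + 1e-12)
    return float(-(p_teacher * log_p_student).sum())

# Toy example: 3 teachers score 4 candidate documents for one query.
teachers = np.array([[2.1, 0.3, -1.0, 0.8],
                     [1.8, 0.5, -0.7, 1.1],
                     [2.4, 0.1, -1.2, 0.6]])
soft_labels = ensemble_teacher_scores(teachers)
student = np.array([1.5, 0.4, -0.9, 0.9])            # scores from the single student ranker
print(listwise_distillation_loss(student, soft_labels))
```

In such a setup, the student would be a single BERT-based ranker trained to minimize this loss over many query lists, so serving cost matches that of one model while the soft labels carry the ensemble's signal.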


Cited By

  • (2024) A Self-Distilled Learning to Rank Model for Ad Hoc Retrieval. ACM Transactions on Information Systems 42(6), 1-28. DOI: 10.1145/3681784. Online publication date: 25-Jul-2024.
  • (2024) Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers? Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2321-2326. DOI: 10.1145/3626772.3657979. Online publication date: 10-Jul-2024.
  • (2024) QMKD: A Two-Stage Approach to Enhance Multi-Teacher Knowledge Distillation. 2024 International Joint Conference on Neural Networks (IJCNN), 1-7. DOI: 10.1109/IJCNN60899.2024.10650186. Online publication date: 30-Jun-2024.
  • (2023) RD-Suite. Proceedings of the 37th International Conference on Neural Information Processing Systems, 35748-35760. DOI: 10.5555/3666122.3667673. Online publication date: 10-Dec-2023.



    Published In

    ICTIR '21: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval
    July 2021
    334 pages
    ISBN:9781450386111
    DOI:10.1145/3471158
    This work is licensed under a Creative Commons Attribution International 4.0 License.


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. BERT
    2. ensemble distillation
    3. ranker ensemble

    Qualifiers

    • Research-article

    Conference

    ICTIR '21

    Acceptance Rates

    Overall Acceptance Rate 235 of 527 submissions, 45%

