
Ensemble Distillation for BERT-Based Ranking Models

Published: 31 August 2021

Abstract

Over the past two years, large pretrained language models such as BERT have been applied to text ranking problems and have shown superior performance on multiple public benchmark data sets. Prior work demonstrated that an ensemble of multiple BERT-based ranking models can not only boost performance but also reduce performance variance. However, an ensemble of models is more costly because it requires computing resources and/or inference time proportional to the number of models. In this paper, we study how to retain the performance of an ensemble of models at the inference cost of a single model by distilling the ensemble into a single BERT-based student ranking model. Specifically, we study different designs of teacher labels, various distillation strategies, and multiple distillation losses tailored for ranking problems. We conduct experiments on the MS MARCO passage ranking and TREC-COVID data sets. Our results show that even with these simple distillation techniques, the distilled model can effectively retain the performance gain of the ensemble of multiple models. More interestingly, the performance of the distilled models is also more stable than that of models fine-tuned on the original labeled data. These results reveal a promising direction for capitalizing on the gains achieved by an ensemble of BERT-based ranking models.
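To make the idea concrete, below is a minimal sketch of one way ensemble distillation for ranking can be set up: the relevance scores of several BERT-based teacher rankers are averaged into a soft teacher label per (query, document) pair, and the student is trained with a listwise softmax cross-entropy loss against those soft labels. This is only an illustration in the spirit of the options the paper studies; the function names, the score-averaging choice, and the toy numbers below are assumptions for this sketch, not the authors' implementation or any TF-Ranking API.

```python
# Hedged sketch of ensemble distillation for a ranking model (not the paper's code).
import numpy as np

def ensemble_teacher_scores(teacher_scores: np.ndarray) -> np.ndarray:
    """Average teacher scores of shape (num_teachers, num_docs) into soft labels (num_docs,).
    Score averaging is one possible teacher-label design; others are conceivable."""
    return teacher_scores.mean(axis=0)

def listwise_distillation_loss(student_scores: np.ndarray,
                               teacher_labels: np.ndarray) -> float:
    """Softmax cross-entropy between the teacher and student score distributions
    over one query's candidate list (one example of a ranking-tailored loss)."""
    def softmax(x: np.ndarray) -> np.ndarray:
        z = x - x.max()          # stabilize before exponentiation
        e = np.exp(z)
        return e / e.sum()
    p_teacher = softmax(teacher_labels)               # soft target distribution
    log_p_student = np.log(softmax(student_scores) + 1e-12)
    return float(-(p_teacher * log_p_student).sum())

# Toy example: 3 teachers score 4 candidate documents for one query.
teachers = np.array([[2.1, 0.3, -1.0, 0.8],
                     [1.8, 0.5, -0.7, 1.1],
                     [2.4, 0.1, -1.2, 0.6]])
soft_labels = ensemble_teacher_scores(teachers)
student = np.array([1.5, 0.4, -0.9, 0.9])            # scores from the single student ranker
print(listwise_distillation_loss(student, soft_labels))
```

In such a setup, the student would be a single BERT-based ranker trained to minimize this loss over many query lists, so serving cost matches that of one model while the soft labels carry the ensemble's signal.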


Cited By

  • (2024) A Self-Distilled Learning to Rank Model for Ad Hoc Retrieval. ACM Transactions on Information Systems 42(6), 1-28. DOI: 10.1145/3681784. Online publication date: 25-Jul-2024.
  • (2024) Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers? Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2321-2326. DOI: 10.1145/3626772.3657979. Online publication date: 10-Jul-2024.
  • (2024) QMKD: A Two-Stage Approach to Enhance Multi-Teacher Knowledge Distillation. 2024 International Joint Conference on Neural Networks (IJCNN), 1-7. DOI: 10.1109/IJCNN60899.2024.10650186. Online publication date: 30-Jun-2024.
  • (2023) RD-Suite. Proceedings of the 37th International Conference on Neural Information Processing Systems, 35748-35760. DOI: 10.5555/3666122.3667673. Online publication date: 10-Dec-2023.



    Published In

    ICTIR '21: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval
    July 2021
    334 pages
    ISBN:9781450386111
    DOI:10.1145/3471158
    This work is licensed under a Creative Commons Attribution International 4.0 License.


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. BERT
    2. ensemble distillation
    3. ranker ensemble

    Qualifiers

    • Research-article

    Conference

    ICTIR '21

    Acceptance Rates

    Overall Acceptance Rate 235 of 527 submissions, 45%

