Abstract
Efficient and accurate evaluation of Large Language Models (LLMs) is essential for progress in natural language processing. To address this, we introduce Transitive Merge Sort (TMS), a novel method that harnesses merge sort's efficiency, stability, and parallelizability for model ranking in LLM evaluation. TMS applies a divide-and-conquer strategy to pairwise comparisons, streamlining the evaluation process. Our experiments show that TMS not only improves the accuracy of model rankings relative to methods such as Elo rating and SuperCLUE (compared with GPT-3.5), but also reduces the need for annotation resources by up to 70%. In addition, we present an iterated version of TMS that effectively handles scenarios where initial model rankings are unknown.
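The key observation behind a merge-sort-based ranking is that ordering n models requires only O(n log n) pairwise judgments, rather than the O(n²) of a full round-robin tournament. The following Python sketch illustrates this idea under simplified assumptions: the `judge_prefers` callback and the dummy hidden-quality scores are hypothetical stand-ins for an LLM judge and are not the paper's actual TMS implementation.

```python
# A minimal sketch of merge-sort-based model ranking, assuming a pairwise
# judge function. `judge_prefers` and the dummy scores below are illustrative
# placeholders, not the paper's TMS procedure.

from typing import Callable, List

def merge_sort_rank(models: List[str],
                    judge_prefers: Callable[[str, str], bool]) -> List[str]:
    """Rank models from strongest to weakest using merge sort.

    `judge_prefers(a, b)` returns True when model `a` is judged better
    than model `b` (e.g. by an LLM judge over a shared prompt set).
    Merge sort needs only O(n log n) such judgments.
    """
    if len(models) <= 1:
        return models
    mid = len(models) // 2
    left = merge_sort_rank(models[:mid], judge_prefers)
    right = merge_sort_rank(models[mid:], judge_prefers)

    # Merge the two sorted halves; each step costs one pairwise judgment.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if judge_prefers(left[i], right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

if __name__ == "__main__":
    # Hypothetical models with hidden quality scores standing in for a judge.
    hidden_quality = {"model-A": 0.62, "model-B": 0.81,
                      "model-C": 0.47, "model-D": 0.73}
    prefers = lambda a, b: hidden_quality[a] >= hidden_quality[b]
    print(merge_sort_rank(list(hidden_quality), prefers))
    # -> ['model-B', 'model-D', 'model-A', 'model-C']
```

Because merging assumes transitive outcomes, noisy or intransitive judgments from a real LLM judge would motivate the iterated variant described in the abstract; this sketch only shows the comparison-count argument.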
C. Li and L. Shi—These authors contributed equally to this work.
Cite this paper
Li, C. et al. (2024). A Merge Sort Based Ranking System for the Evaluation of Large Language Models. In: Bifet, A., Krilavičius, T., Miliou, I., Nowaczyk, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol. 14949. Springer, Cham. https://doi.org/10.1007/978-3-031-70378-2_15