Computer Science > Computation and Language

arXiv:2410.15553 (cs)

[Submitted on 21 Oct 2024 (v1), last revised 13 Nov 2024 (this version, v2)]

Title: Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following

Authors: Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, Sinong Wang


Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including instruction following, which is crucial for aligning model outputs with user expectations. However, evaluating LLMs' ability to follow instructions remains challenging due to the complexity and subjectivity of human language. Current benchmarks primarily focus on single-turn, monolingual instructions, which do not adequately reflect the complexities of real-world applications that require handling multi-turn and multilingual interactions. To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon IFEval by incorporating multi-turn sequences and translating the English prompts into seven other languages, resulting in a dataset of 4,501 multilingual conversations, each with three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher failure rate in executing instructions correctly with each additional turn. For example, the average accuracy of o1-preview over all languages drops from 0.877 at the first turn to 0.707 at the third turn. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities. We release the Multi-IF prompts and the evaluation code base to encourage further research in this critical area.
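
To put the quoted o1-preview figures in perspective, the following minimal sketch computes the absolute and relative accuracy drop between the first and third turn. Only the two accuracy values come from the abstract; the helper name `accuracy_drop` and the variable names are hypothetical and are not part of the released evaluation code.

```python
# Accuracy values quoted in the abstract for o1-preview
# (average accuracy over all languages).
first_turn_acc = 0.877
third_turn_acc = 0.707


def accuracy_drop(first: float, last: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop) between two accuracy values.

    Hypothetical helper for illustration only.
    """
    absolute = first - last
    relative = absolute / first
    return absolute, relative


absolute, relative = accuracy_drop(first_turn_acc, third_turn_acc)
print(f"absolute drop: {absolute:.3f}")  # 0.170
print(f"relative drop: {relative:.1%}")  # 19.4%
```

Under these numbers, the third-turn accuracy is 0.170 lower in absolute terms, which is roughly a 19.4% relative decrease from the first turn.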

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2410.15553 [cs.CL]
	(or arXiv:2410.15553v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.15553

Submission history

From: Yun He
[v1] Mon, 21 Oct 2024 00:59:47 UTC (12,307 KB)
[v2] Wed, 13 Nov 2024 04:26:13 UTC (12,422 KB)
