Computer Science > Computation and Language

arXiv:2409.18892 (cs)

[Submitted on 27 Sep 2024 (v1), last revised 5 Oct 2024 (this version, v2)]

Title: IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation

Authors: Fan Lin, Shuyi Xie, Yong Dai, Wenlin Yao, Tianjiao Lang, Zishan Xu, Zhichao Hu, Xiao Xiao, Yuhong Liu, Yu Zhang

Abstract: As Large Language Models (LLMs) grow increasingly adept at managing complex tasks, the evaluation set must keep pace with these advancements to ensure it remains sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs to ensure the evaluation set can continually update and refine according to model abilities. Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, allowing for effective discrimination of their relative strengths and weaknesses across various tasks and domains. To produce high-quality data, we incorporate a self-correct mechanism into our generalization framework, and develop two models to predict prompt discrimination and difficulty score to facilitate our data synthesis framework, contributing valuable tools to evaluation data synthesis research. We apply our generated data to evaluate five SOTA models. Our data achieves an average score of 51.92, accompanied by a variance of 10.06. By contrast, previous works (i.e., SELF-INSTRUCT and WizardLM) obtain an average score exceeding 67, with a variance below 3.2. The results demonstrate that the data generated by our framework is more challenging and discriminative compared to previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research of LLMs.
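To make the notion of item discrimination concrete, the following is a minimal sketch of the classical upper-lower discrimination index from educational testing, applied to the LLM setting by treating each model as an examinee and each prompt as an item. It is an illustration only: the abstract does not state the exact formula IDGen uses, and the function name `discrimination_index`, its parameters, and the 27% group fraction (a common convention in test theory) are assumptions.

```python
from typing import Sequence


def discrimination_index(
    item_scores: Sequence[float],
    total_scores: Sequence[float],
    fraction: float = 0.27,
) -> float:
    """Classical upper-lower discrimination index for a single item.

    item_scores[i]  -- score of examinee (model) i on this item (prompt)
    total_scores[i] -- overall score of examinee i on the whole test
    fraction        -- share of examinees in each of the high and low groups
                       (hypothetical default; 27% is a common convention)
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    n = len(total_scores)
    if n == 0:
        raise ValueError("at least one examinee is required")

    # Group size: at least one examinee in each group.
    k = max(1, round(n * fraction))

    # Rank examinees by overall score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:k]
    high_group = order[-k:]

    def mean_item_score(indices):
        return sum(item_scores[i] for i in indices) / len(indices)

    # Positive values mean strong examinees do better on this item than weak ones.
    return mean_item_score(high_group) - mean_item_score(low_group)
```

Under this definition, an item on which every examinee scores the same has an index of 0 and does not separate strong from weak models, whereas an item that only the strongest examinees answer correctly has an index close to 1.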

Comments:	NeurIPS 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2409.18892 [cs.CL]
	(or arXiv:2409.18892v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.18892

Submission history

From: Yong Dai
[v1] Fri, 27 Sep 2024 16:29:12 UTC (734 KB)
[v2] Sat, 5 Oct 2024 06:17:38 UTC (734 KB)
