Computer Science > Computation and Language

arXiv:2404.01067 (cs)

[Submitted on 1 Apr 2024 (v1), last revised 7 Sep 2024 (this version, v2)]

Title: Exploring the Mystery of Influential Data for Mathematical Reasoning

Authors: Xinzhe Ni, Yeyun Gong, Zhibin Gou, Yelong Shen, Yujiu Yang, Nan Duan, Weizhu Chen

Abstract: Selecting influential data for fine-tuning on downstream tasks is a key factor for both performance and computational efficiency. Recent works have shown that training with only limited data can achieve superior performance on general tasks. However, the feasibility of this approach on mathematical reasoning tasks has not been validated. Going further, there are two open questions for mathematical reasoning: how to select influential data, and what constitutes an influential data composition. For the former, we propose a Quality-aware Diverse Selection (QaDS) strategy adaptable to mathematical reasoning. A comparison with other selection strategies validates the superiority of QaDS. For the latter, we first enlarge our setting and explore the influential data composition. We conduct a series of experiments and highlight two findings: scaling up reasoning data is helpful, and so is training with general data selected by QaDS. We then define our optimal mixture as OpenMathMix, an influential data mixture of open-source data selected by QaDS. With OpenMathMix, we achieve a state-of-the-art 48.8% accuracy on MATH with a 7B base model. Additionally, we showcase the use of QaDS in creating efficient fine-tuning mixtures with various selection ratios, and analyze the quality of a wide range of open-source datasets, which can serve as a reference for future work on mathematical reasoning tasks.

Comments:	Accepted by COLM 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2404.01067 [cs.CL]
	(or arXiv:2404.01067v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.01067

Submission history

From: Xinzhe Ni
[v1] Mon, 1 Apr 2024 12:01:06 UTC (3,401 KB)
[v2] Sat, 7 Sep 2024 06:03:06 UTC (3,319 KB)
