[EVAL] Correct way to handle GSM8K in Turkish Evals? #692

mertbozkir · 2025-04-28T12:12:52Z

Turkish Community Evals.

I'm working on a PR that adds several Turkish evaluation sets to lighteval, such as MMLU, ARC, and GSM8K.

My northstar repository: Malhajar/lm-evaluation-harness_turkish

During the implementation I realized that Doc doesn't have a field for answer keys, and I don't know how to handle the numerical answers.

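import re
from typing import Optional

from lighteval.tasks.requests import Doc

def turkish_gsm8k_eval_prompt(line: dict, task_name: Optional[str] = "", instruction: Optional[str] = "") -> Doc:
    question = line["question"]
    answer = line["answer"]
    
    # Extract the numerical answer that follows the "####" marker
    numerical_answer = None
    match = re.search(r"####\s*(-?\d[\d,]*)", answer)
    if match:
        # Drop thousands separators such as "1,000" before converting
        numerical_answer = int(match.group(1).replace(",", ""))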
            
    # ... query building code ...

    return Doc(
        task_name=task_name,
        query=query,
        choices=[],  # Empty list since not multiple choice
        gold_index=-1,  # Using -1 as sentinel
        instruction=instruction,
    )

Should we:

- Use empty choices and gold_index=-1, as now?
- Pass the numerical answer as a single choice with gold_index=0?
- Add the answer to a specific dict?
- Use a different approach entirely?

Questions/Concerns

Can I use the default metrics that the original lighteval GSM8K task uses?
- Metrics.quasi_exact_match_gsm8k
- Metrics.maj_at_8_gsm8k
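
To make the question concrete, here is a rough sketch of what a quasi-exact-match metric and a majority-vote-over-samples metric could do with `####`-style answers. This is not lighteval's implementation: `normalize_gsm8k_answer`, `quasi_exact_match`, and `majority_vote_match` are hypothetical helper names, and the normalization rule (take the number after `####`, drop commas) is an assumption that may not suit Turkish number formatting.

```python
import re
from collections import Counter

INVALID = "[invalid]"


def normalize_gsm8k_answer(text: str) -> str:
    """Return the number after '####' with commas removed, or INVALID."""
    match = re.search(r"####\s*(-?\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return INVALID
    return match.group(1).replace(",", "")


def quasi_exact_match(gold: str, prediction: str) -> int:
    """1 if both texts normalize to the same valid number, else 0."""
    gold_norm = normalize_gsm8k_answer(gold)
    return int(gold_norm != INVALID and gold_norm == normalize_gsm8k_answer(prediction))


def majority_vote_match(gold: str, predictions: list) -> int:
    """1 if the most common valid answer among the samples equals the gold answer."""
    votes = Counter(normalize_gsm8k_answer(p) for p in predictions)
    votes.pop(INVALID, None)  # samples without a parsable answer do not vote
    if not votes:
        return 0
    winner, _ = votes.most_common(1)[0]
    return int(winner == normalize_gsm8k_answer(gold))
```

With helpers like these, the gold answer only needs to be reachable as a string on the document, which is why the way the answer is stored matters for metric reuse.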

Request

Could someone from the team clarify:

- What is the correct way to handle numerical answers in custom-language GSM8K tasks?
- Are there any best practices or examples we should follow?

/cc @malhajar17


NathanHB · 2025-04-30T12:14:25Z

Hi! You can look at the way it is done in the original gsm8k.
Also, for multilingual tasks, we have a lot of examples here
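
For illustration, the second option from the question (the gold answer passed as a single choice with gold_index=0) could look like the sketch below. `DocSketch` is a hypothetical stand-in dataclass that mirrors only the fields used in the snippet above, so the example runs without lighteval installed. The "Soru:"/"Cevap:" query template and the choice to keep the full `#### <number>` answer string are assumptions; check them against the original gsm8k prompt function before relying on them.

```python
from dataclasses import dataclass, field


@dataclass
class DocSketch:
    """Hypothetical stand-in for lighteval's Doc, limited to the fields used above."""
    task_name: str
    query: str
    choices: list = field(default_factory=list)
    gold_index: int = 0
    instruction: str = ""


def turkish_gsm8k_prompt_sketch(line: dict, task_name: str = "", instruction: str = "") -> DocSketch:
    # Assumed query template: instruction, then the question, then an answer cue.
    query = f"{instruction}Soru: {line['question']}\nCevap:"
    return DocSketch(
        task_name=task_name,
        query=query,
        # The full reference answer is the only "choice"; a metric can then
        # normalize it (for example, pull out the number after "####").
        choices=[f" {line['answer']}"],
        gold_index=0,
        instruction=instruction,
    )


doc = turkish_gsm8k_prompt_sketch(
    {"question": "2 + 3 kaç eder?", "answer": "2 + 3 = 5\n#### 5"},
    task_name="gsm8k_tr",
)
print(doc.choices[doc.gold_index])
```

Storing the gold answer this way avoids a sentinel such as gold_index=-1, since the metric can always read `choices[gold_index]`.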

mertbozkir added the new-task label Apr 28, 2025
