[EVAL] Correct way to handle GSM8K in Turkish Evals? #692

Open
mertbozkir opened this issue Apr 28, 2025 · 1 comment

@mertbozkir

Turkish Community Evals.

I'm working on a PR to add several Turkish evaluation sets to lighteval, such as MMLU, ARC, and GSM8K.

My north-star reference repository: Malhajar/lm-evaluation-harness_turkish

  1. During the implementation I realized that Doc doesn't have a field for answer keys, and I don't know how to handle numerical answers.
import re
from typing import Optional

from lighteval.tasks.requests import Doc


def turkish_gsm8k_eval_prompt(line: dict, task_name: Optional[str] = "", instruction: Optional[str] = "") -> Doc:
    question = line["question"]
    answer = line["answer"]

    # Extract the final numerical answer that follows the "####" marker,
    # allowing a leading minus sign and thousands separators (e.g. "1,250")
    numerical_answer = None
    match = re.search(r"####\s*(-?\d[\d,]*)", answer)
    if match:
        numerical_answer = int(match.group(1).replace(",", ""))

    # ... query building code ...

    return Doc(
        task_name=task_name,
        query=query,
        choices=[],  # Empty list since not multiple choice
        gold_index=-1,  # Using -1 as sentinel
        instruction=instruction,
    )

Should we:

  • Keep the empty choices list and the sentinel gold_index=-1, as in the snippet above?
  • Pass the numerical answer as a single choice with gold_index=0?
  • Add the answer to the Doc's specific dict?
  • Use a different approach entirely?

Questions/Concerns

  1. Can I reuse the default metrics from the original lighteval GSM8K task:
    • Metrics.quasi_exact_match_gsm8k
    • Metrics.maj_at_8_gsm8k

Request

Could someone from the team clarify:

  1. What is the correct way to handle numerical answers in custom-language GSM8K tasks?
  2. Are there any best practices or examples we should follow?

/cc @malhajar17

@NathanHB
Member

Hi! You can look at the way it is done in the original gsm8k.
Also, for multilingual tasks, we have a lot of examples here.
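
For readers without the linked code at hand, here is a minimal sketch of the "single gold choice with gold_index=0" option from the question. It is not taken from the lighteval source. The `Doc` class below is a stand-in dataclass that models only the fields used in the snippet above, the Turkish template ("Soru:" / "Cevap:") and the helper `extract_final_answer` are hypothetical, and whether the stock GSM8K metrics expect the full `####` answer string or only the bare number as the gold choice is an assumption to verify against the original gsm8k prompt function.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Stand-in for lighteval's Doc so the sketch runs without lighteval installed;
# only the fields that appear in the issue's snippet are modelled here.
@dataclass
class Doc:
    query: str
    choices: list
    gold_index: int
    task_name: str = ""
    instruction: str = ""


# "#### <number>" with an optional minus sign, thousands separators and decimals.
ANSWER_RE = re.compile(r"####\s*(-?\d[\d.,]*)")


def extract_final_answer(answer: str) -> Optional[str]:
    """Return the number after '####' as a string without commas, or None."""
    match = ANSWER_RE.search(answer)
    if match is None:
        return None
    # Drop thousands separators and a trailing sentence period, if any.
    return match.group(1).replace(",", "").rstrip(".")


def turkish_gsm8k_prompt(line: dict, task_name: str = "", instruction: str = "") -> Doc:
    # Hypothetical Turkish template; adapt it to the wording used in the dataset.
    query = f"{instruction}Soru: {line['question']}\nCevap:"
    return Doc(
        task_name=task_name,
        query=query,
        # Assumption: the gold answer is passed as the only choice, keeping the
        # "#### N" suffix so a GSM8K-style normalizer can still find the number.
        choices=[f" {line['answer']}"],
        gold_index=0,
        instruction=instruction,
    )


example = {
    "question": "Bir kutuda 12 kalem var. 3 kutu alırsam kaç kalemim olur?",
    "answer": "12 * 3 = 36\n#### 36",
}
doc = turkish_gsm8k_prompt(example, task_name="gsm8k_tr")
```

With this shape there is no need for a sentinel such as gold_index=-1: the gold answer travels with the Doc, and `extract_final_answer(doc.choices[doc.gold_index])` recovers the bare number if a custom metric needs it.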
