Add eval command by sgoedecke · Pull Request #54 · github/gh-models · GitHub

Add eval command #54


Merged
merged 9 commits into main on Jun 5, 2025

Conversation

@sgoedecke sgoedecke (Collaborator) commented Jun 4, 2025

Adds a gh models eval command that runs the evaluators defined in a .prompt.yml file (https://docs.github.com/en/github-models/use-github-models/storing-prompts-in-github-repositories) against that file's test data.

It supports string evaluators and the set of built-in evaluators supported in the Models prompting UI.
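For orientation, a minimal `.prompt.yml` of the shape this command consumes might look like the sketch below. This is illustrative only — the exact field names and evaluator schema should be checked against the storing-prompts documentation linked above:

```yaml
name: Sample Evaluation
description: A sample evaluation for testing the eval command
model: openai/gpt-4o
messages:
  - role: system
    content: You are a helpful assistant.
  - role: user
    content: "{{input}}"
testData:
  - input: "Say hello world"
evaluators:
  - name: string evaluator
    string:
      contains: "world"
```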

```
@sgoedecke ➜ /workspaces/gh-models (sgoedecke/evals) $ ./gh-models eval fixtures/sample_prompt.yml
Running evaluation: Sample Evaluation
Description: A sample evaluation for testing the eval command
Model: openai/gpt-4o
Test cases: 2

Running test case 1/2...
  ✗ FAILED
    Model Response: Hello there! How can I assist you today? 😊
    ✗ string evaluator (score: 0.00)
      Expected to contain: 'world'
    ✗ similarity check (score: 0.00)
      LLM evaluation matched choice: '1'

Running test case 2/2...
  ✗ FAILED
    Model Response: Goodbye! Take care and see you next time! 🌍👋
    ✗ string evaluator (score: 0.00)
      Expected to contain: 'world'
    ✓ similarity check (score: 0.25)
      LLM evaluation matched choice: '2'

Evaluation Summary:
Passed: 0/2 (0.0%)
❌ Some tests failed.
```

As part of this change, I've extracted the prompt-yml-parsing code into a common package.

Here's an example GitHub Action that runs this on new PRs:


```yaml
name: Run prompt evals on pull request

# The job only starts when at least one *.prompt.yml file is touched.
on:
  pull_request:
    paths:
      - '**/*.prompt.yml'

jobs:
  evaluate-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Setup gh-models
        run: gh extension install github/gh-models
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Find changed prompt files
        id: find-prompts
        run: |
          # Get the list of changed files that match the *.prompt.yml pattern
          changed_prompts=$(git diff --name-only origin/${{ github.base_ref }}..HEAD | grep '\.prompt\.yml$' | head -1)

          if [[ -z "$changed_prompts" ]]; then
            echo "No prompt files found in the changes"
            exit 1
          fi

          echo "first_prompt=$changed_prompts" >> "$GITHUB_OUTPUT"
          echo "Found changed prompt file: $changed_prompts"

      - name: Run model evaluation
        id: eval
        run: |
          set -e
          PROMPT_FILE="${{ steps.find-prompts.outputs.first_prompt }}"
          echo "## Model Evaluation Results" >> "$GITHUB_STEP_SUMMARY"
          echo "Evaluating: $PROMPT_FILE" >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"

          if gh models eval "$PROMPT_FILE" > eval_output.txt 2>&1; then
            echo "✅ All evaluations passed!"   >> "$GITHUB_STEP_SUMMARY"
            cat  eval_output.txt               >> "$GITHUB_STEP_SUMMARY"
            echo "eval_status=success"         >> "$GITHUB_OUTPUT"
          else
            echo "❌ Some evaluations failed!" >> "$GITHUB_STEP_SUMMARY"
            cat  eval_output.txt               >> "$GITHUB_STEP_SUMMARY"
            echo "eval_status=failure"         >> "$GITHUB_OUTPUT"
            exit 1
          fi
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Comment on PR (if evaluation failed)
        if: failure() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.readFileSync('eval_output.txt', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner:        context.repo.owner,
              repo:         context.repo.repo,
              body: `## 🚨 Model Evaluation Failed

              \`\`\`
              ${output}
              \`\`\`

              Please review the evaluation results and fix any issues before merging.`
            });
```

@sgoedecke sgoedecke marked this pull request as ready for review June 4, 2025 09:11
@Copilot Copilot AI review requested due to automatic review settings June 4, 2025 09:11
@sgoedecke sgoedecke requested a review from a team as a code owner June 4, 2025 09:11
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

Adds a new gh models eval command to run evaluators defined in .prompt.yml files, along with built-in LLM evaluators and sample fixtures for testing.

  • Introduces prompt parsing and templating (LoadFromFile, TemplateString) in pkg/prompt
  • Implements the eval subcommand under cmd/eval and registers it in the root command
  • Provides built-in LLM evaluators (builtins.go) and dozens of YAML fixtures plus comprehensive tests

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/prompt/prompt.go | Adds File, LoadFromFile, and simple TemplateString logic |
| pkg/prompt/prompt_test.go | Tests loading/parsing, templating, and error cases |
| cmd/root.go | Registers the new eval subcommand |
| cmd/root_test.go | Updates root help tests to include eval |
| cmd/run/run.go | Switches the run command to use the shared prompt loader and templater |
| cmd/eval/… | Implements the eval command handler, tests, and built-in evaluators |
| fixtures/*.yml | Sample and test data for string and plugin evaluators |
Comments suppressed due to low confidence (3)

cmd/eval/builtins.go:1

  • [nitpick] The builtins.go file contains very large inline prompt definitions, which can be hard to navigate and maintain. Consider extracting these templates into separate resource files or moving each evaluator into its own file to improve readability and manageability.
package eval

pkg/prompt/prompt.go:82

  • The doc comment for TemplateString is minimal. It would help to explain edge-case behavior (e.g., handling of non-map data or unresolved placeholders) so future maintainers know exactly what to expect.
// TemplateString templates a string with the given data using simple {{variable}} replacement
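The edge-case behavior the reviewer wants documented could be pinned down with a small re-implementation like the one below. This is an illustrative sketch, not the actual pkg/prompt code; it assumes unresolved placeholders are left untouched, which the real implementation may or may not do:

```go
package main

import (
	"fmt"
	"strings"
)

// templateString performs simple {{variable}} replacement: each key in data
// replaces every occurrence of "{{key}}" in the template. Placeholders with
// no matching key are left as-is (assumed behavior for this sketch).
func templateString(tmpl string, data map[string]string) string {
	out := tmpl
	for k, v := range data {
		out = strings.ReplaceAll(out, "{{"+k+"}}", v)
	}
	return out
}

func main() {
	fmt.Println(templateString("Hello, {{name}}! {{missing}}", map[string]string{"name": "world"}))
}
```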

pkg/prompt/prompt_test.go:11

  • We don’t have a test verifying how ModelParameters.TopP is handled when it's omitted from the YAML. Adding a case for missing optional fields would ensure consistent behavior.
func TestPromptFile(t *testing.T) {

@pelikhan pelikhan commented Jun 4, 2025

Can you support a machine friendly output like JSON?

@cheshire137 cheshire137 (Member) left a comment

Thanks for all the test coverage!

Comment on lines +213 to +223
```go
var role azuremodels.ChatMessageRole
switch strings.ToLower(msg.Role) {
case "system":
	role = azuremodels.ChatMessageRoleSystem
case "user":
	role = azuremodels.ChatMessageRoleUser
case "assistant":
	role = azuremodels.ChatMessageRoleAssistant
default:
	return nil, fmt.Errorf("unknown message role: %s", msg.Role)
}
```
Could put this into a function like getAzureChatMessageRole(msg.Role).
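The suggested extraction might look like the following self-contained sketch. The types here are stand-ins for the real azuremodels package (so the snippet compiles on its own), and the function name comes from the reviewer's suggestion, not the merged code:

```go
package main

import (
	"fmt"
	"strings"
)

// ChatMessageRole stands in for azuremodels.ChatMessageRole (assumed string-based here).
type ChatMessageRole string

const (
	ChatMessageRoleSystem    ChatMessageRole = "system"
	ChatMessageRoleUser      ChatMessageRole = "user"
	ChatMessageRoleAssistant ChatMessageRole = "assistant"
)

// getAzureChatMessageRole maps a prompt-file role string to its chat role
// constant, mirroring the switch statement quoted above.
func getAzureChatMessageRole(role string) (ChatMessageRole, error) {
	switch strings.ToLower(role) {
	case "system":
		return ChatMessageRoleSystem, nil
	case "user":
		return ChatMessageRoleUser, nil
	case "assistant":
		return ChatMessageRoleAssistant, nil
	default:
		return "", fmt.Errorf("unknown message role: %s", role)
	}
}

func main() {
	role, err := getAzureChatMessageRole("User")
	fmt.Println(role, err == nil)
}
```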

Comment on lines +239 to +254
```go
req := azuremodels.ChatCompletionOptions{
	Messages: messages,
	Model:    h.evalFile.Model,
	Stream:   false,
}

// Apply model parameters
if h.evalFile.ModelParameters.MaxTokens != nil {
	req.MaxTokens = h.evalFile.ModelParameters.MaxTokens
}
if h.evalFile.ModelParameters.Temperature != nil {
	req.Temperature = h.evalFile.ModelParameters.Temperature
}
if h.evalFile.ModelParameters.TopP != nil {
	req.TopP = h.evalFile.ModelParameters.TopP
}
```

You use h.evalFile a lot here. What about adding a method on the EvalFile type like BuildChatCompletionOptions? Then this could be:

Suggested change

```diff
-req := azuremodels.ChatCompletionOptions{
-	Messages: messages,
-	Model:    h.evalFile.Model,
-	Stream:   false,
-}
-
-// Apply model parameters
-if h.evalFile.ModelParameters.MaxTokens != nil {
-	req.MaxTokens = h.evalFile.ModelParameters.MaxTokens
-}
-if h.evalFile.ModelParameters.Temperature != nil {
-	req.Temperature = h.evalFile.ModelParameters.Temperature
-}
-if h.evalFile.ModelParameters.TopP != nil {
-	req.TopP = h.evalFile.ModelParameters.TopP
-}
+req := h.evalFile.BuildChatCompletionOptions(messages, false)
```
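The suggested method might be sketched as below. The types are simplified stand-ins for the real azuremodels and eval packages so the snippet compiles on its own, and BuildChatCompletionOptions is the reviewer's proposed name, not an existing API:

```go
package main

import "fmt"

// Simplified stand-ins for azuremodels.ChatCompletionOptions and friends.
type ChatMessage struct{ Role, Content string }

type ChatCompletionOptions struct {
	Messages    []ChatMessage
	Model       string
	Stream      bool
	MaxTokens   *int
	Temperature *float64
	TopP        *float64
}

type ModelParameters struct {
	MaxTokens   *int
	Temperature *float64
	TopP        *float64
}

type EvalFile struct {
	Model           string
	ModelParameters ModelParameters
}

// BuildChatCompletionOptions folds the optional model parameters into the
// request, so the handler no longer repeats nil checks at the call site.
func (f *EvalFile) BuildChatCompletionOptions(messages []ChatMessage, stream bool) ChatCompletionOptions {
	req := ChatCompletionOptions{
		Messages: messages,
		Model:    f.Model,
		Stream:   stream,
	}
	if f.ModelParameters.MaxTokens != nil {
		req.MaxTokens = f.ModelParameters.MaxTokens
	}
	if f.ModelParameters.Temperature != nil {
		req.Temperature = f.ModelParameters.Temperature
	}
	if f.ModelParameters.TopP != nil {
		req.TopP = f.ModelParameters.TopP
	}
	return req
}

func main() {
	temp := 0.2
	f := EvalFile{Model: "openai/gpt-4o", ModelParameters: ModelParameters{Temperature: &temp}}
	req := f.BuildChatCompletionOptions([]ChatMessage{{Role: "user", Content: "hi"}}, false)
	fmt.Println(req.Model, *req.Temperature, req.TopP == nil)
}
```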

@Deborah-Digges Deborah-Digges (Member) left a comment

🚢 This looks great, thanks for all the hard work here! 🙇‍♀

@sgoedecke sgoedecke (Collaborator, Author) commented

Thank you for the reviews! @cheshire137, I'm going to address your points in a new PR, given the size (and the fact that I'm going to immediately follow up with a new PR to do JSON mode)

@sgoedecke sgoedecke merged commit 392c27b into main Jun 5, 2025
5 checks passed
@sgoedecke sgoedecke deleted the sgoedecke/evals branch June 5, 2025 00:13