Add eval command by sgoedecke · Pull Request #54 · github/gh-models · GitHub

Add eval command #54


Merged
merged 9 commits into main on Jun 5, 2025

Conversation

@sgoedecke sgoedecke (Collaborator) commented Jun 4, 2025

Adds a gh models eval command that runs the evaluators defined in a .prompt.yml file (https://docs.github.com/en/github-models/use-github-models/storing-prompts-in-github-repositories) against that file's test data.

It supports string evaluators and the set of built-in evaluators supported in the Models prompting UI.
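For orientation, a minimal `.prompt.yml` of the shape this command consumes might look like the sketch below. This is illustrative only — the exact field names and evaluator schema should be checked against the storing-prompts documentation linked above:

```yaml
name: Sample Evaluation
description: A sample evaluation for testing the eval command
model: openai/gpt-4o
messages:
  - role: system
    content: You are a helpful assistant.
  - role: user
    content: "{{input}}"
testData:
  - input: "Say hello world"
evaluators:
  - name: string evaluator
    string:
      contains: "world"
```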

```
@sgoedecke ➜ /workspaces/gh-models (sgoedecke/evals) $ ./gh-models eval fixtures/sample_prompt.yml
Running evaluation: Sample Evaluation
Description: A sample evaluation for testing the eval command
Model: openai/gpt-4o
Test cases: 2

Running test case 1/2...
  ✗ FAILED
    Model Response: Hello there! How can I assist you today? 😊
    ✗ string evaluator (score: 0.00)
      Expected to contain: 'world'
    ✗ similarity check (score: 0.00)
      LLM evaluation matched choice: '1'

Running test case 2/2...
  ✗ FAILED
    Model Response: Goodbye! Take care and see you next time! 🌍👋
    ✗ string evaluator (score: 0.00)
      Expected to contain: 'world'
    ✓ similarity check (score: 0.25)
      LLM evaluation matched choice: '2'

Evaluation Summary:
Passed: 0/2 (0.0%)
❌ Some tests failed.
```

As part of this change, I've extracted the prompt-yml-parsing code into a common package.

Here's an example GitHub Action that runs this on new PRs:


```yaml
name: Run prompt evals on pull request

# The job only starts when at least one *.prompt.yml file is touched.
on:
  pull_request:
    paths:
      - '**/*.prompt.yml'

jobs:
  evaluate-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Setup gh-models
        run: gh extension install github/gh-models
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Find changed prompt files
        id: find-prompts
        run: |
          # Get the list of changed files that match the *.prompt.yml pattern
          changed_prompts=$(git diff --name-only origin/${{ github.base_ref }}..HEAD | grep '\.prompt\.yml$' | head -1)

          if [[ -z "$changed_prompts" ]]; then
            echo "No prompt files found in the changes"
            exit 1
          fi

          echo "first_prompt=$changed_prompts" >> "$GITHUB_OUTPUT"
          echo "Found changed prompt file: $changed_prompts"

      - name: Run model evaluation
        id: eval
        run: |
          set -e
          PROMPT_FILE="${{ steps.find-prompts.outputs.first_prompt }}"
          echo "## Model Evaluation Results" >> "$GITHUB_STEP_SUMMARY"
          echo "Evaluating: $PROMPT_FILE" >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"

          if gh models eval "$PROMPT_FILE" > eval_output.txt 2>&1; then
            echo "✅ All evaluations passed!"   >> "$GITHUB_STEP_SUMMARY"
            cat  eval_output.txt               >> "$GITHUB_STEP_SUMMARY"
            echo "eval_status=success"         >> "$GITHUB_OUTPUT"
          else
            echo "❌ Some evaluations failed!" >> "$GITHUB_STEP_SUMMARY"
            cat  eval_output.txt               >> "$GITHUB_STEP_SUMMARY"
            echo "eval_status=failure"         >> "$GITHUB_OUTPUT"
            exit 1
          fi
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Comment on PR (if evaluation failed)
        if: failure() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.readFileSync('eval_output.txt', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner:        context.repo.owner,
              repo:         context.repo.repo,
              body: `## 🚨 Model Evaluation Failed

              \`\`\`
              ${output}
              \`\`\`

              Please review the evaluation results and fix any issues before merging.`
            });
```

@sgoedecke sgoedecke marked this pull request as ready for review June 4, 2025 09:11
@Copilot Copilot AI review requested due to automatic review settings June 4, 2025 09:11
@sgoedecke sgoedecke requested a review from a team as a code owner June 4, 2025 09:11
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

Adds a new gh models eval command to run evaluators defined in .prompt.yml files, along with built-in LLM evaluators and sample fixtures for testing.

  • Introduces prompt parsing and templating (LoadFromFile, TemplateString) in pkg/prompt
  • Implements the eval subcommand under cmd/eval and registers it in the root command
  • Provides built-in LLM evaluators (builtins.go) and dozens of YAML fixtures plus comprehensive tests

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/prompt/prompt.go | Adds File, LoadFromFile, and simple TemplateString logic |
| pkg/prompt/prompt_test.go | Tests loading/parsing, templating, and error cases |
| cmd/root.go | Registers the new eval subcommand |
| cmd/root_test.go | Updates root help tests to include eval |
| cmd/run/run.go | Switches the run command to use the shared prompt loader and templater |
| cmd/eval/… | Implements the eval command handler, tests, and built-in evaluators |
| fixtures/*.yml | Sample and test data for string and plugin evaluators |
Comments suppressed due to low confidence (3)

cmd/eval/builtins.go:1

  • [nitpick] The builtins.go file contains very large inline prompt definitions, which can be hard to navigate and maintain. Consider extracting these templates into separate resource files or moving each evaluator into its own file to improve readability and manageability.
package eval

pkg/prompt/prompt.go:82

  • The doc comment for TemplateString is minimal. It would help to explain edge-case behavior (e.g., handling of non-map data or unresolved placeholders) so future maintainers know exactly what to expect.
// TemplateString templates a string with the given data using simple {{variable}} replacement
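The edge-case behavior the reviewer wants documented could be pinned down with a small re-implementation like the one below. This is an illustrative sketch, not the actual pkg/prompt code; it assumes unresolved placeholders are left untouched, which the real implementation may or may not do:

```go
package main

import (
	"fmt"
	"strings"
)

// templateString performs simple {{variable}} replacement: each key in data
// replaces every occurrence of "{{key}}" in the template. Placeholders with
// no matching key are left as-is (assumed behavior for this sketch).
func templateString(tmpl string, data map[string]string) string {
	out := tmpl
	for k, v := range data {
		out = strings.ReplaceAll(out, "{{"+k+"}}", v)
	}
	return out
}

func main() {
	fmt.Println(templateString("Hello, {{name}}! {{missing}}", map[string]string{"name": "world"}))
}
```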

pkg/prompt/prompt_test.go:11

  • We don’t have a test verifying how ModelParameters.TopP is handled when it's omitted from the YAML. Adding a case for missing optional fields would ensure consistent behavior.
func TestPromptFile(t *testing.T) {

@pelikhan pelikhan commented Jun 4, 2025

Can you support a machine friendly output like JSON?

@cheshire137 cheshire137 (Member) left a comment

Thanks for all the test coverage!

Comment on lines +213 to +223
```go
var role azuremodels.ChatMessageRole
switch strings.ToLower(msg.Role) {
case "system":
	role = azuremodels.ChatMessageRoleSystem
case "user":
	role = azuremodels.ChatMessageRoleUser
case "assistant":
	role = azuremodels.ChatMessageRoleAssistant
default:
	return nil, fmt.Errorf("unknown message role: %s", msg.Role)
}
```
Could put this into a function like getAzureChatMessageRole(msg.Role).
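The suggested extraction might look like the following self-contained sketch. The types here are stand-ins for the real azuremodels package (so the snippet compiles on its own), and the function name comes from the reviewer's suggestion, not the merged code:

```go
package main

import (
	"fmt"
	"strings"
)

// ChatMessageRole stands in for azuremodels.ChatMessageRole (assumed string-based here).
type ChatMessageRole string

const (
	ChatMessageRoleSystem    ChatMessageRole = "system"
	ChatMessageRoleUser      ChatMessageRole = "user"
	ChatMessageRoleAssistant ChatMessageRole = "assistant"
)

// getAzureChatMessageRole maps a prompt-file role string to its chat role
// constant, mirroring the switch statement quoted above.
func getAzureChatMessageRole(role string) (ChatMessageRole, error) {
	switch strings.ToLower(role) {
	case "system":
		return ChatMessageRoleSystem, nil
	case "user":
		return ChatMessageRoleUser, nil
	case "assistant":
		return ChatMessageRoleAssistant, nil
	default:
		return "", fmt.Errorf("unknown message role: %s", role)
	}
}

func main() {
	role, err := getAzureChatMessageRole("User")
	fmt.Println(role, err == nil)
}
```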

Comment on lines +239 to +254
```go
req := azuremodels.ChatCompletionOptions{
	Messages: messages,
	Model:    h.evalFile.Model,
	Stream:   false,
}

// Apply model parameters
if h.evalFile.ModelParameters.MaxTokens != nil {
	req.MaxTokens = h.evalFile.ModelParameters.MaxTokens
}
if h.evalFile.ModelParameters.Temperature != nil {
	req.Temperature = h.evalFile.ModelParameters.Temperature
}
if h.evalFile.ModelParameters.TopP != nil {
	req.TopP = h.evalFile.ModelParameters.TopP
}
```

You use h.evalFile a lot here. What about adding a method on the EvalFile type like BuildChatCompletionOptions? Then this could be:

Suggested change

```diff
-req := azuremodels.ChatCompletionOptions{
-	Messages: messages,
-	Model:    h.evalFile.Model,
-	Stream:   false,
-}
-
-// Apply model parameters
-if h.evalFile.ModelParameters.MaxTokens != nil {
-	req.MaxTokens = h.evalFile.ModelParameters.MaxTokens
-}
-if h.evalFile.ModelParameters.Temperature != nil {
-	req.Temperature = h.evalFile.ModelParameters.Temperature
-}
-if h.evalFile.ModelParameters.TopP != nil {
-	req.TopP = h.evalFile.ModelParameters.TopP
-}
+req := h.evalFile.BuildChatCompletionOptions(messages, false)
```
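The suggested method might be sketched as below. The types are simplified stand-ins for the real azuremodels and eval packages so the snippet compiles on its own, and BuildChatCompletionOptions is the reviewer's proposed name, not an existing API:

```go
package main

import "fmt"

// Simplified stand-ins for azuremodels.ChatCompletionOptions and friends.
type ChatMessage struct{ Role, Content string }

type ChatCompletionOptions struct {
	Messages    []ChatMessage
	Model       string
	Stream      bool
	MaxTokens   *int
	Temperature *float64
	TopP        *float64
}

type ModelParameters struct {
	MaxTokens   *int
	Temperature *float64
	TopP        *float64
}

type EvalFile struct {
	Model           string
	ModelParameters ModelParameters
}

// BuildChatCompletionOptions folds the optional model parameters into the
// request, so the handler no longer repeats nil checks at the call site.
func (f *EvalFile) BuildChatCompletionOptions(messages []ChatMessage, stream bool) ChatCompletionOptions {
	req := ChatCompletionOptions{
		Messages: messages,
		Model:    f.Model,
		Stream:   stream,
	}
	if f.ModelParameters.MaxTokens != nil {
		req.MaxTokens = f.ModelParameters.MaxTokens
	}
	if f.ModelParameters.Temperature != nil {
		req.Temperature = f.ModelParameters.Temperature
	}
	if f.ModelParameters.TopP != nil {
		req.TopP = f.ModelParameters.TopP
	}
	return req
}

func main() {
	temp := 0.2
	f := EvalFile{Model: "openai/gpt-4o", ModelParameters: ModelParameters{Temperature: &temp}}
	req := f.BuildChatCompletionOptions([]ChatMessage{{Role: "user", Content: "hi"}}, false)
	fmt.Println(req.Model, *req.Temperature, req.TopP == nil)
}
```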

@Deborah-Digges Deborah-Digges (Member) left a comment

🚢 This looks great, thanks for all the hard work here! 🙇‍♀

@sgoedecke sgoedecke (Collaborator, Author) commented

Thank you for the reviews! @cheshire137, I'm going to address your points in a new PR, given the size (and the fact that I'm going to immediately follow up with a new PR to do JSON mode)

@sgoedecke sgoedecke merged commit 392c27b into main Jun 5, 2025
5 checks passed
@sgoedecke sgoedecke deleted the sgoedecke/evals branch June 5, 2025 00:13