Add eval command #54
Conversation
Pull Request Overview

Adds a new `gh models eval` command to run evaluators defined in `.prompt.yml` files, along with built-in LLM evaluators and sample fixtures for testing.

- Introduces prompt parsing and templating (`LoadFromFile`, `TemplateString`) in `pkg/prompt`; a sketch of this templating appears below
- Implements the eval subcommand under `cmd/eval` and registers it in the root command
- Provides built-in LLM evaluators (`builtins.go`) and dozens of YAML fixtures plus comprehensive tests
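For orientation, here is a minimal sketch of what simple `{{variable}}` templating like this could look like. The function name matches the overview above, but the exact signature and semantics in the PR are not shown here, so treat this as an assumption:

```go
package prompt

import (
	"fmt"
	"strings"
)

// TemplateString replaces each {{key}} placeholder in tmpl with the
// matching value from data, leaving unknown placeholders untouched.
// Illustrative sketch only; the PR's actual implementation may differ.
func TemplateString(tmpl string, data map[string]any) string {
	out := tmpl
	for key, value := range data {
		out = strings.ReplaceAll(out, "{{"+key+"}}", fmt.Sprintf("%v", value))
	}
	return out
}
```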
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.
File | Description
---|---
pkg/prompt/prompt.go | Adds `File`, `LoadFromFile`, and simple `TemplateString` logic
pkg/prompt/prompt_test.go | Tests loading/parsing, templating, and error cases
cmd/root.go | Registers the new eval subcommand
cmd/root_test.go | Updates root help tests to include eval
cmd/run/run.go | Switches the run command to use the shared prompt loader and templater
cmd/eval/… | Implements the eval command handler, tests, and built-in evaluators
fixtures/*.yml | Sample and test data for string and plugin evaluators
Comments suppressed due to low confidence (3)

cmd/eval/builtins.go:1 (`package eval`)
- [nitpick] The builtins.go file contains very large inline prompt definitions, which can be hard to navigate and maintain. Consider extracting these templates into separate resource files or moving each evaluator into its own file to improve readability and manageability.

pkg/prompt/prompt.go:82 (`// TemplateString templates a string with the given data using simple {{variable}} replacement`)
- The doc comment for TemplateString is minimal. It would help to explain edge-case behavior (e.g., handling of non-map data or unresolved placeholders) so future maintainers know exactly what to expect.

pkg/prompt/prompt_test.go:11 (`func TestPromptFile(t *testing.T) {`)
- We don't have a test verifying how `ModelParameters.TopP` is handled when it's omitted from the YAML. Adding a case for missing optional fields would ensure consistent behavior.
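As a sketch of that suggested test, assuming `LoadFromFile` takes a path and returns a parsed file whose `ModelParameters.TopP` is a `*float64` (types inferred from the snippets in this review, not confirmed by the PR):

```go
package prompt

import (
	"os"
	"path/filepath"
	"testing"
)

func TestTopPOmittedFromYAML(t *testing.T) {
	// Minimal prompt file with no modelParameters block at all.
	yamlBody := `
name: test
model: openai/gpt-4o
messages:
  - role: user
    content: hello
`
	path := filepath.Join(t.TempDir(), "test.prompt.yml")
	if err := os.WriteFile(path, []byte(yamlBody), 0o644); err != nil {
		t.Fatal(err)
	}
	promptFile, err := LoadFromFile(path)
	if err != nil {
		t.Fatal(err)
	}
	// An omitted optional field should stay nil rather than default to 0.
	if promptFile.ModelParameters.TopP != nil {
		t.Errorf("expected nil TopP when omitted, got %v", *promptFile.ModelParameters.TopP)
	}
}
```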
Can you support a machine-friendly output like JSON?
Thanks for all the test coverage!
```go
var role azuremodels.ChatMessageRole
switch strings.ToLower(msg.Role) {
case "system":
	role = azuremodels.ChatMessageRoleSystem
case "user":
	role = azuremodels.ChatMessageRoleUser
case "assistant":
	role = azuremodels.ChatMessageRoleAssistant
default:
	return nil, fmt.Errorf("unknown message role: %s", msg.Role)
}
```
Could put this into a function like `getAzureChatMessageRole(msg.Role)`.
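A sketch of that helper, under the assumption that `azuremodels.ChatMessageRole` is a named type whose zero value can be returned on error:

```go
// getAzureChatMessageRole maps a role name onto the corresponding
// azuremodels constant. Hypothetical helper from the review
// suggestion above, not code from the PR itself.
func getAzureChatMessageRole(role string) (azuremodels.ChatMessageRole, error) {
	switch strings.ToLower(role) {
	case "system":
		return azuremodels.ChatMessageRoleSystem, nil
	case "user":
		return azuremodels.ChatMessageRoleUser, nil
	case "assistant":
		return azuremodels.ChatMessageRoleAssistant, nil
	default:
		var zero azuremodels.ChatMessageRole
		return zero, fmt.Errorf("unknown message role: %s", role)
	}
}
```

The original switch then collapses to `role, err := getAzureChatMessageRole(msg.Role)` followed by an error check.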
```go
req := azuremodels.ChatCompletionOptions{
	Messages: messages,
	Model:    h.evalFile.Model,
	Stream:   false,
}

// Apply model parameters
if h.evalFile.ModelParameters.MaxTokens != nil {
	req.MaxTokens = h.evalFile.ModelParameters.MaxTokens
}
if h.evalFile.ModelParameters.Temperature != nil {
	req.Temperature = h.evalFile.ModelParameters.Temperature
}
if h.evalFile.ModelParameters.TopP != nil {
	req.TopP = h.evalFile.ModelParameters.TopP
}
```
You use `h.evalFile` a lot here. What about adding a method on the EvalFile type like `BuildChatCompletionOptions`? Then this could be:
```diff
-req := azuremodels.ChatCompletionOptions{
-	Messages: messages,
-	Model:    h.evalFile.Model,
-	Stream:   false,
-}
-
-// Apply model parameters
-if h.evalFile.ModelParameters.MaxTokens != nil {
-	req.MaxTokens = h.evalFile.ModelParameters.MaxTokens
-}
-if h.evalFile.ModelParameters.Temperature != nil {
-	req.Temperature = h.evalFile.ModelParameters.Temperature
-}
-if h.evalFile.ModelParameters.TopP != nil {
-	req.TopP = h.evalFile.ModelParameters.TopP
-}
+req := h.evalFile.BuildChatCompletionOptions(messages, false)
```
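A sketch of what that method could look like, assuming the file type is named `EvalFile` and `messages` is a `[]azuremodels.ChatMessage` (both inferred from the snippets above, not confirmed by the PR):

```go
// BuildChatCompletionOptions assembles a request from the eval file's
// model and optional parameters. Sketch of the review suggestion;
// the real type and field names may differ.
func (f *EvalFile) BuildChatCompletionOptions(messages []azuremodels.ChatMessage, stream bool) azuremodels.ChatCompletionOptions {
	req := azuremodels.ChatCompletionOptions{
		Messages: messages,
		Model:    f.Model,
		Stream:   stream,
	}
	// Only set optional parameters that were present in the YAML.
	if f.ModelParameters.MaxTokens != nil {
		req.MaxTokens = f.ModelParameters.MaxTokens
	}
	if f.ModelParameters.Temperature != nil {
		req.Temperature = f.ModelParameters.Temperature
	}
	if f.ModelParameters.TopP != nil {
		req.TopP = f.ModelParameters.TopP
	}
	return req
}
```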
🚢 This looks great, thanks for all the hard work here! 🙇‍♀️
Thank you for the reviews! @cheshire137, I'm going to address your points in a new PR, given the size (and the fact that I'm going to immediately follow up with a new PR to do JSON mode).
Adds a `gh models eval` command that runs evaluators based on the test data and evaluators in a `.prompt.yml` file (https://docs.github.com/en/github-models/use-github-models/storing-prompts-in-github-repositories). It supports string evaluators and the set of built-in evaluators supported in the Models prompting UI.

As part of this change, I've extracted the prompt-yml-parsing code into a common package.
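For illustration, a hypothetical `.prompt.yml` with test data and a string evaluator might look like the following; the field names follow the docs linked above, but the model ID and evaluator are made up for this example:

```yaml
name: Summarizer smoke test
model: openai/gpt-4o-mini
testData:
  - input: "The quick brown fox jumps over the lazy dog."
    expected: "fox"
messages:
  - role: system
    content: You are a concise summarizer.
  - role: user
    content: "Summarize this text: {{input}}"
evaluators:
  - name: output-mentions-expected
    string:
      contains: "{{expected}}"
```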
Here's an example GitHub Action that runs this on new PRs:
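(The original workflow was behind a collapsed Details block. As a rough, untested sketch, such a workflow might look like the one below; the permissions block, extension install step, and prompt file path are all assumptions, not the author's actual workflow.)

```yaml
name: Evaluate prompts
on:
  pull_request:

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      models: read  # assumed: lets GITHUB_TOKEN call GitHub Models
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt evaluations
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh extension install github/gh-models
          gh models eval my_prompt.prompt.yml  # hypothetical file path
```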