Simple, practical AI evaluations in TypeScript.
Zevals provides utilities for testing AI agents. Unlike the few existing AI eval libraries and frameworks, it:
- Treats AI evals like end-to-end tests, with less focus on metrics and more focus on binary assertions
- Is designed to evaluate full conversations, not just single query/response pairs
- Does not impose any testing framework or test runner
Here is a minimal end-to-end example:

```ts
import { ChatOpenAI } from '@langchain/openai';
import zevals from '@zevals/core';
import { langChainZEvalsJudge } from '@zevals/langchain';

test('Simple example', async () => {
  const agent: zevals.Agent = {
    async invoke(messages) {
      // Run your application logic to generate a response for the user
      return { response: { role: 'assistant', content: 'What kind of vitamin?' } };
    },
  };

  const judge = langChainZEvalsJudge({
    model: new ChatOpenAI({
      modelName: 'gpt-4.1-mini',
      temperature: 0,
    }),
  });

  const followupAssertion = zevals.aiAssertion({
    judge,
    prompt: 'The assistant asked a followup question',
  });

  const { getResultOrThrow } = await zevals.evaluate({
    agent,
    segments: [
      // The user wants vitamins
      zevals.message({ role: 'user', content: 'I want another bottle of the vitamin' }),
      // The agent responds
      zevals.agentResponse(),
      // We judge the above
      zevals.aiEval(followupAssertion),
    ],
  });

  // Run your assertions on type-safe outputs
  expect(getResultOrThrow(followupAssertion).output).toBe(true);
});
```
> [!NOTE]
> You can find more examples in the `examples` directory.
Install the core package, plus any optional integrations you want to use:

```sh
npm install @zevals/core

# To use with LangChain models
# (feel free to use anything other than OpenAI)
npm install @langchain/core @langchain/openai @zevals/langchain

# To use Vercel AI SDK model providers
npm install ai @zevals/vercel

# To use autoevals scorers
npm install autoevals @zevals/autoevals
```
Features at a glance:

- Support for evaluating entire scenarios, with optional user simulation
- Utilities for LLM-as-a-judge, as well as more programmatic assertion functions for tool calling
- Utilities for wrapping your existing user message-handling logic into an agent that can be easily tested
- Very simple, extensible design, with no assumptions about how you will use the library (see the standalone-script sketch after this list)
- Facilities for benchmarking against popular benchmarks such as tau-bench
- (Optionally) integrates with any LLM provider via LangChain or Vercel AI SDK
- (Optionally) integrates with Braintrust's autoevals scorers
- As type-safe as practically possible
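
Because zevals does not assume a particular test runner, you can also drive an evaluation from a plain script and act on the result yourself. The sketch below is illustrative only: it reuses the APIs shown elsewhere in this README (`zevals.evaluate`, `zevals.aiAssertion`, `langChainZEvalsJudge`) with a stubbed agent, and fails the process instead of a test.

```ts
import { ChatOpenAI } from '@langchain/openai';
import zevals from '@zevals/core';
import { langChainZEvalsJudge } from '@zevals/langchain';

// A stub agent standing in for your real application logic
const agent: zevals.Agent = {
  async invoke() {
    return { response: { role: 'assistant', content: 'Hello! How can I help you today?' } };
  },
};

const judge = langChainZEvalsJudge({
  model: new ChatOpenAI({ modelName: 'gpt-4.1-mini', temperature: 0 }),
});

const greetingAssertion = zevals.aiAssertion({
  judge,
  prompt: 'The assistant greeted the user',
});

const { getResultOrThrow } = await zevals.evaluate({
  agent,
  segments: [
    zevals.message({ role: 'user', content: 'Hi there' }),
    zevals.agentResponse(),
    zevals.aiEval(greetingAssertion),
  ],
});

// No test framework involved: report the outcome however you like
if (!getResultOrThrow(greetingAssertion).output) {
  process.exitCode = 1;
}
```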
Zevals provides a few ways to assert that your agent did what it was supposed to do. Most notably, you can run assertions on the tool calls made by your agents:
```ts
import zevals from '@zevals/core';
import { simpleExampleAgent } from './simple-example-agent.js';

test('Tool calls example', { timeout: 20000 }, async () => {
  const agent = simpleExampleAgent();

  // We expect the `get_current_date` tool to be called and to return the current date
  const dateToolCalledAssertion = zevals.aiToolCalls({
    assertion(toolCalls) {
      const date = new Date();
      expect(toolCalls).toEqual(
        expect.arrayContaining([
          {
            name: 'get_current_date',
            args: {},
            result: {
              day: date.getDate(),
              month: date.getMonth(),
              year: date.getFullYear(),
            },
          },
        ]),
      );
    },
  });

  const { getResultOrThrow } = await zevals.evaluate({
    agent,
    segments: [
      zevals.message({ role: 'user', content: 'What day of month are we at?' }),
      zevals.agentResponse(),
      zevals.aiEval(dateToolCalledAssertion),
    ],
  });

  const error = getResultOrThrow(dateToolCalledAssertion).error;
  if (error) throw error;
});
```
> [!NOTE]
> You can very easily create new types of assertions. See the `Criterion` interface.
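
Even without implementing `Criterion` yourself, you can get reusable, programmatic assertions by wrapping the built-in factories in plain functions. The helper below is a hypothetical sketch that uses only `zevals.aiToolCalls` as shown above; the exact shape of the tool-call objects may differ in your setup.

```ts
import zevals from '@zevals/core';

// Hypothetical helper: asserts that a tool with the given name was called.
// Failures are signalled by throwing, just like the expect() call above.
function expectToolCalled(name: string) {
  return zevals.aiToolCalls({
    assertion(toolCalls) {
      if (!toolCalls.some((call) => call.name === name)) {
        throw new Error(`Expected the agent to call the "${name}" tool`);
      }
    },
  });
}

// Usage in a scenario:
// zevals.aiEval(expectToolCalled('get_current_date'))
```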
You can simulate a human user by specifying a prompt for the simulated user and a condition that signals the end of the simulation.
```ts
import { RunnableLambda } from '@langchain/core/runnables';
import { ChatOpenAI } from '@langchain/openai';
import zevals from '@zevals/core';
import { langChainZEvalsJudge, langChainZEvalsSyntheticUser } from '@zevals/langchain';
import { simpleExampleAgent } from './simple-example-agent.js';

test('User simulation example', { timeout: 60000 }, async () => {
  const agent = simpleExampleAgent();

  const model = new ChatOpenAI({ model: 'gpt-4.1-mini', temperature: 0 });
  const judge = langChainZEvalsJudge({ model });
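  // A synthetic user is just a runnable that maps the conversation so far to the
  // next simulated user message; here it is an LLM driven by a scripted system prompt.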
  const user = langChainZEvalsSyntheticUser({
    runnable: RunnableLambda.from((messages) =>
      model.invoke([
        {
          role: 'system',
          content: `You will ask three questions, each in a separate message. Do not repeat the same question.
Question 1) What is the capital of France?
Question 2) What is the capital of Germany?
Question 3) What is the capital of Italy?`,
        },
        ...messages,
      ]),
    ),
  });

  const { messages, success } = await zevals.evaluate({
    agent,
    segments: [
      zevals.userSimulation({
        user,
        until: zevals.aiAssertion({
          judge,
          prompt: 'The assistant has answered THREE questions',
        }),
      }),
    ],
  });

  expect(success).toBe(true);
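  // Three simulated user questions plus three assistant answers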
  expect(messages.length).toBe(6);
});
```
> [!NOTE]
> A `userSimulation` is a type of `Segment`. You can very easily create new types of segments by implementing the interface.
As we've seen in the examples above, you can use any LangChain model as a judge, agent, or synthetic user via the functions exported by `@zevals/langchain`. Similarly, the `@zevals/vercel` package lets you use any model provider from the Vercel AI SDK:
```ts
import { openai } from '@ai-sdk/openai';
import zevals from '@zevals/core';
import { vercelZEvalsAgent, vercelZEvalsJudge } from '@zevals/vercel';
import { generateText } from 'ai';

test('Vercel AI SDK integration', { timeout: 10000 }, async () => {
  const model = openai.chat('gpt-4.1-mini');

  const agent = vercelZEvalsAgent({
    runnable({ messages }) {
      // Put any agent logic here
      return generateText({ model, messages });
    },
  });

  const judge = vercelZEvalsJudge({ model });

  const { messages, success } = await zevals.evaluate({
    agent,
    segments: [
      zevals.message({ role: 'user', content: 'Hello' }),
      zevals.agentResponse(),
      zevals.aiEval(zevals.aiAssertion({ judge, prompt: 'The assistant greeted the user' })),
    ],
  });

  expect(success).toBe(true);
  expect(messages).toHaveLength(2);
});
```