
Testing & Evaluation

Strategies for ensuring your agents are reliable, accurate, and cost-effective before deployment.

The Challenge of Testing AI

Testing non-deterministic systems is harder than traditional software testing. A passing test today might fail tomorrow if the model output shifts slightly.

Deterministic Tests

Test the "plumbing" (Tool execution, Guardrails, Memory) by mocking the LLM response. These should pass 100% of the time.

Probabilistic Evals

Test the "intelligence" by running scenarios and scoring the output using an LLM-as-a-Judge.

1. Unit Testing with Mocks

Use the `MockProvider` to simulate LLM responses. This allows you to test tool calls and logic without making API calls.

agent.test.ts
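// Assumes a test runner such as Vitest or Jest providing `test` and `expect`.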
import { Agent, MockProvider } from '@akios/sdk'
import { weatherTool } from './tools'

test('Agent calls weather tool correctly', async () => {
  // 1. Setup Mock
  const mockLLM = new MockProvider([
    // First response: Tool Call
    { 
      role: 'assistant', 
      tool_calls: [{ name: 'get_weather', arguments: { city: 'Paris' } }] 
    },
    // Second response: Final Answer
    { 
      role: 'assistant', 
      content: 'The weather in Paris is sunny.' 
    }
  ])

  // 2. Init Agent with Mock
  const agent = new Agent({
    name: 'WeatherBot',
    model: mockLLM,
    tools: [weatherTool]
  })

  // 3. Run & Assert
  const result = await agent.run("What's the weather in Paris?")
  
  expect(result.steps[0].toolCalls[0].name).toBe('get_weather')
  expect(result.output).toBe('The weather in Paris is sunny.')
})
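In this example the mock returns its queued responses in order: the first yields the tool call, the second the final answer. If the tool-call record in your SDK version also exposes the parsed arguments (an assumption; check the actual result shape), you can pin those down inside the same test:

// Hypothetical extra assertion: assumes `toolCalls[0].arguments` is exposed.
expect(result.steps[0].toolCalls[0].arguments).toEqual({ city: 'Paris' })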

2. Evaluation (LLM-as-a-Judge)

For quality assurance, create a dataset of "Golden Questions" and use a stronger model (e.g., GPT-4) to grade your agent's responses.

Cost Warning

Running evals can be expensive. Run them on a smaller sample (e.g., 50 examples) during development and the full suite before release.
1. Define Criteria

What makes a good answer? Correctness? Tone? Brevity? Define this in your "Judge" prompt (a minimal judge sketch follows the suite below).

2. Run the Suite

eval.ts
const dataset = [
  { question: "Reset my password", expected_action: "send_reset_email" },
  { question: "Who is the CEO?", expected_fact: "Jane Doe" }
]

for (const item of dataset) {
  const result = await agent.run(item.question)
  const score = await judge.grade(result.output, item)
  console.log(`Score for "${item.question}": ${score}`)
}
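The `judge` used in the loop above is not defined in the snippet. Here is one minimal sketch of what it could look like; everything in it (the `EvalItem` shape, `callStrongModel`, the 0-10 scale) is an illustrative assumption, not an `@akios/sdk` API.

judge.ts
// A minimal LLM-as-a-Judge sketch. Illustrative only, not part of @akios/sdk:
// `callStrongModel` is a hypothetical placeholder for however you invoke your
// grading model (e.g., GPT-4); swap in your own client.

interface EvalItem {
  question: string
  expected_action?: string
  expected_fact?: string
}

declare function callStrongModel(prompt: string): Promise<string>

export const judge = {
  async grade(output: string, item: EvalItem): Promise<number> {
    // The grading criteria live in this prompt (see "Define Criteria" above).
    const prompt = [
      "You are grading an AI agent's answer.",
      `Question: ${item.question}`,
      item.expected_action ? `Expected action: ${item.expected_action}` : '',
      item.expected_fact ? `Expected fact: ${item.expected_fact}` : '',
      `Agent answer: ${output}`,
      'Score the answer from 0 (wrong) to 10 (perfect). Reply with the number only.'
    ].filter(Boolean).join('\n')

    const reply = await callStrongModel(prompt)
    const score = Number.parseInt(reply.trim(), 10)
    return Number.isNaN(score) ? 0 : score
  }
}

Returning a plain number per item makes it easy to aggregate scores across the dataset and gate a release on an average threshold you choose.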