Testing & Evaluation
Strategies for ensuring your agents are reliable, accurate, and cost-effective before deployment.
The Challenge of Testing AI
Testing non-deterministic systems is harder than traditional software testing. A passing test today might fail tomorrow if the model output shifts slightly.
Deterministic Tests
Test the "plumbing" (Tool execution, Guardrails, Memory) by mocking the LLM response. These should pass 100% of the time.
Probabilistic Evals
Test the "intelligence" by running scenarios and scoring the output using an LLM-as-a-Judge.
1. Unit Testing with Mocks
Use the `MockProvider` to simulate LLM responses. This allows you to test tool calls and logic without making API calls.
import { Agent, MockProvider } from '@akios/sdk'
import { weatherTool } from './tools'
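// Note: weatherTool (imported from ./tools above) needs to expose a tool named
// 'get_weather' so the agent can execute the mocked tool call below. Its exact
// shape depends on your SDK version's tool-definition API; roughly something like:
//
//   export const weatherTool = {
//     name: 'get_weather',
//     description: 'Look up the current weather for a city',
//     parameters: { city: 'string' },
//     execute: async ({ city }: { city: string }) => `The weather in ${city} is sunny.`
//   }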
test('Agent calls weather tool correctly', async () => {
  // 1. Setup Mock
  const mockLLM = new MockProvider([
    // First response: Tool Call
    {
      role: 'assistant',
      tool_calls: [{ name: 'get_weather', arguments: { city: 'Paris' } }]
    },
    // Second response: Final Answer
    {
      role: 'assistant',
      content: 'The weather in Paris is sunny.'
    }
  ])

  // 2. Init Agent with Mock
  const agent = new Agent({
    name: 'WeatherBot',
    model: mockLLM,
    tools: [weatherTool]
  })

  // 3. Run & Assert
  const result = await agent.run("What's the weather in Paris?")
  expect(result.steps[0].toolCalls[0].name).toBe('get_weather')
  expect(result.output).toBe('The weather in Paris is sunny.')
})

2. Evaluation (LLM-as-a-Judge)
For quality assurance, create a dataset of "Golden Questions" and use a stronger model (e.g., GPT-4) to grade your agent's responses.
Cost Warning
Every eval run makes real API calls to the judge model, so grading a large golden dataset with GPT-4 can get expensive. Keep the dataset small and run evals on a schedule rather than on every commit.
Define Criteria
What makes a good answer? Correctness? Tone? Brevity? Define this in your "Judge" prompt (see the judge sketch after the suite below).
Run the Suite
const dataset = [
  { question: "Reset my password", expected_action: "send_reset_email" },
  { question: "Who is the CEO?", expected_fact: "Jane Doe" }
]

for (const item of dataset) {
  const result = await agent.run(item.question)
  const score = await judge.grade(result.output, item)
  console.log(`Score for "${item.question}": ${score}`)
}
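The `judge` used in the loop above is not part of the SDK. Below is a minimal sketch of one way to build it, assuming the same Agent / run API from the unit-testing example; the `./providers` import, rubric wording, and score parsing are illustrative placeholders to adapt to your own criteria.

import { Agent } from '@akios/sdk'
import { strongModel } from './providers' // hypothetical: a GPT-4-class provider instance

// The judge is itself an agent, pointed at a stronger model than the agent under test.
const judgeAgent = new Agent({ name: 'Judge', model: strongModel })

const judge = {
  async grade(
    output: string,
    item: { question: string; expected_action?: string; expected_fact?: string }
  ): Promise<number> {
    // Bake your criteria (correctness, tone, brevity) into the judge prompt.
    const rubric = [
      'You are grading an AI support agent.',
      `Question: ${item.question}`,
      `Expected action: ${item.expected_action ?? 'n/a'}`,
      `Expected fact: ${item.expected_fact ?? 'n/a'}`,
      `Agent answer: ${output}`,
      'Score the answer from 0 (wrong) to 5 (correct, on-tone, concise). Reply with the number only.'
    ].join('\n')

    const result = await judgeAgent.run(rubric)
    return Number.parseInt(result.output, 10)
  }
}

Returning a bare number keeps the loop simple; you can also ask the judge for a JSON object with a score and a short justification, then fail the suite if the average score drops below a threshold you choose.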