# AgenticAssure
Test and benchmark LLM-powered AI agents before deployment.
AgenticAssure is an open-source Python SDK that gives you a structured, repeatable way to test AI agents. Define test scenarios in YAML, run them against your agent through a simple adapter interface, and get detailed reports on pass/fail status, tool usage, latency, and more.
## The Problem
Testing AI agents is fundamentally different from testing traditional software:
- Non-deterministic outputs. The same prompt can produce different responses every run. You cannot rely on exact string matching alone.
- Tool use verification. Modern agents call tools (APIs, databases, search), and you need to verify they call the right tools with the right arguments — not just that they return plausible-sounding text.
- No standard test harness. Most teams end up writing ad-hoc scripts, manually spot-checking outputs, or skipping testing entirely. There is no `pytest` equivalent for agent behavior.
- Regression detection is hard. When you change a prompt, swap a model, or update a tool schema, how do you know nothing broke? Without a test suite, you don't.
AgenticAssure solves these problems by providing a structured testing framework purpose-built for AI agents.
## Key Features
- YAML-based test scenarios — Define inputs, expected outputs, expected tool calls, and scoring criteria in declarative YAML files. No test code to maintain.
- Multiple scoring strategies — Built-in scorers for pass/fail checks, exact match, regex pattern matching, and semantic similarity. Use one or combine several per scenario.
- Adapter pattern — A simple protocol-based interface lets you plug in any agent, whether it is built with OpenAI, LangChain, a custom framework, or raw API calls.
- CLI-first workflow — Scaffold projects, validate scenarios, run tests, and generate reports from the command line. Integrates naturally into CI/CD pipelines.
- Rich reporting — View results as formatted CLI tables, structured JSON, or standalone HTML reports.
- Tool call assertions — Verify that your agent calls the expected tools with the expected arguments, not just that it produces the right text.
- Tag-based filtering — Organize scenarios with tags and run subsets on demand.
- Retry and timeout support — Handle flaky LLM responses with configurable retries and per-scenario timeouts.
- JSON Schema validation — Scenario files are validated against a strict schema before execution, catching errors early.
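To illustrate the multiple-scorer feature, a scenario might list several scorers side by side. The snippet below is a sketch only: the `regex` and `semantic` scorer names come from the feature list above, but the surrounding keys and values are illustrative assumptions, not confirmed API.

```yaml
# Hypothetical scenario combining scorers; everything except the
# scorer names is an assumption for illustration.
- name: refund_policy
  input: "What is your refund policy?"
  scorers:
    - regex      # e.g. match a pattern such as "30-day"
    - semantic   # e.g. similarity against a reference answer
  tags:
    - policy
```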
## How It Works
AgenticAssure follows a simple pipeline:
```
YAML Scenarios --> Loader (JSON Schema validation) --> Runner --> Adapter.run() --> Scorers --> Reports
```

- You write test scenarios in YAML, specifying inputs, expected outputs, and which scorers to apply.
- The loader parses and validates your YAML files against a JSON Schema.
- The runner iterates through each scenario, calling your agent adapter's `run()` method.
- Scorers evaluate the agent's response against your expectations and produce scored results.
- A reporter formats the results as CLI output, JSON, or HTML.
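The contract between the runner and your adapter can be sketched as a plain Python protocol. This is a simplified stand-in, not the SDK's actual code: the real `AgentAdapter`, `AgentResult`, and `ToolCall` live in `agenticassure.adapters.base` and `agenticassure.results`, and the dataclass fields here are assumptions based on the example below.

```python
# Simplified stand-in for the adapter contract; field names mirror the
# Quick Example but are assumptions, not the library's real definitions.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgentResult:
    output: str
    tool_calls: list = field(default_factory=list)
    latency_ms: float = 0.0


class AgentAdapter(Protocol):
    """Anything with a matching run() satisfies the protocol."""

    def run(self, input: str, context=None) -> AgentResult: ...


class EchoAgent:
    """A trivial adapter: lowercases its input and calls no tools."""

    def run(self, input: str, context=None) -> AgentResult:
        return AgentResult(output=input.lower(), latency_ms=0.1)


result = EchoAgent().run("Hello")
print(result.output)  # -> "hello"
```

Because the interface is structural (a protocol), `EchoAgent` never imports or inherits anything from the framework; it only needs a compatible `run()` signature.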
## Quick Example
1. Install AgenticAssure:

```shell
pip install agenticassure
```

2. Write an adapter for your agent:
```python
# my_agent.py
from agenticassure.adapters.base import AgentAdapter
from agenticassure.results import AgentResult, ToolCall


class MyAgent(AgentAdapter):
    """Wraps your AI agent so AgenticAssure can test it."""

    def run(self, input: str, context=None) -> AgentResult:
        # Call your real agent here. This is a simplified example.
        response = call_my_llm(input)
        return AgentResult(
            output=response.text,
            tool_calls=[
                ToolCall(name=tc.name, arguments=tc.args)
                for tc in response.tool_calls
            ],
            latency_ms=response.latency,
        )
```

3. Define test scenarios in YAML:
```yaml
# scenarios/search_tests.yaml
suite:
  name: search-agent-tests
  config:
    default_timeout: 30
    retries: 1
  scenarios:
    - name: weather_query
      input: "What is the weather in San Francisco?"
      expected_tools:
        - get_weather
      expected_tool_args:
        get_weather:
          location: "San Francisco"
      scorers:
        - passfail
      tags:
        - tools
        - weather
    - name: greeting
      input: "Hello, how are you?"
      expected_output: "hello"
      scorers:
        - passfail
      tags:
        - basic
```

4. Run your tests:
```shell
agenticassure run scenarios/ --adapter my_agent.MyAgent
```

5. View results:
```
Loaded 2 scenario(s) from 1 suite(s)
Using adapter: my_agent.MyAgent

Suite: search-agent-tests

+-----------------+--------+-------+-----------+
| Scenario        | Passed | Score | Duration  |
+-----------------+--------+-------+-----------+
| weather_query   | PASS   | 1.00  | 245.3ms   |
| greeting        | PASS   | 1.00  | 128.7ms   |
+-----------------+--------+-------+-----------+

Pass rate: 100.0% (2/2)
```

## What's Next
- Installation — System requirements and install options.
- Quickstart — End-to-end walkthrough from zero to first test run.
- Project Setup — Project structure, config files, and environment setup.