Introduction

AgenticAssure

Test and benchmark LLM-powered AI agents before deployment.

AgenticAssure is an open-source Python SDK that gives you a structured, repeatable way to test AI agents. Define test scenarios in YAML, run them against your agent through a simple adapter interface, and get detailed reports on pass/fail status, tool usage, latency, and more.

The Problem

Testing AI agents is fundamentally different from testing traditional software:

  • Non-deterministic outputs. The same prompt can produce different responses every run. You cannot rely on exact string matching alone.
  • Tool use verification. Modern agents call tools (APIs, databases, search), and you need to verify they call the right tools with the right arguments — not just that they return plausible-sounding text.
  • No standard test harness. Most teams end up writing ad-hoc scripts, manually spot-checking outputs, or skipping testing entirely. There is no pytest equivalent for agent behavior.
  • Regression detection is hard. When you change a prompt, swap a model, or update a tool schema, how do you know nothing broke? Without a test suite, you don’t.
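To make the non-determinism point concrete, here is a minimal sketch in plain Python (no AgenticAssure APIs) contrasting a brittle exact-match assertion with a tolerant keyword check:

```python
# Two valid responses to the same prompt; an agent may return either.
run_1 = "The weather in San Francisco is 18C and sunny."
run_2 = "It's sunny in San Francisco right now, around 18 degrees."

# Brittle: exact string matching passes for one run and fails for the other.
expected = "The weather in San Francisco is 18C and sunny."
print(run_1 == expected)  # True
print(run_2 == expected)  # False -- same meaning, different wording

# More tolerant: assert on the facts the answer must contain.
def contains_keywords(text: str, keywords: list) -> bool:
    lowered = text.lower()
    return all(k.lower() in lowered for k in keywords)

print(contains_keywords(run_1, ["san francisco", "sunny"]))  # True
print(contains_keywords(run_2, ["san francisco", "sunny"]))  # True
```

This is the gap that dedicated scorers (pattern matching, semantic similarity) are meant to close.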

AgenticAssure solves these problems by providing a structured testing framework purpose-built for AI agents.

Key Features

  • YAML-based test scenarios — Define inputs, expected outputs, expected tool calls, and scoring criteria in declarative YAML files. No test code to maintain.
  • Multiple scoring strategies — Built-in scorers for pass/fail checks, exact match, regex pattern matching, and semantic similarity. Use one or combine several per scenario.
  • Adapter pattern — A simple protocol-based interface lets you plug in any agent, whether it is built with OpenAI, LangChain, a custom framework, or raw API calls.
  • CLI-first workflow — Scaffold projects, validate scenarios, run tests, and generate reports from the command line. Integrates naturally into CI/CD pipelines.
  • Rich reporting — View results as formatted CLI tables, structured JSON, or standalone HTML reports.
  • Tool call assertions — Verify that your agent calls the expected tools with the expected arguments, not just that it produces the right text.
  • Tag-based filtering — Organize scenarios with tags and run subsets on demand.
  • Retry and timeout support — Handle flaky LLM responses with configurable retries and per-scenario timeouts.
  • JSON Schema validation — Scenario files are validated against a strict schema before execution, catching errors early.
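To illustrate what "combine several scorers per scenario" means in practice, here is a hand-rolled sketch. The function names (exact_match, regex_match, combine) and the averaging strategy are illustrative assumptions, not the SDK's actual API:

```python
import re

# Hypothetical scorer functions -- names are illustrative, not the SDK's API.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

def regex_match(output: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, output) else 0.0

def combine(scores: list) -> float:
    # One plausible combination strategy: average the individual scores.
    return sum(scores) / len(scores)

output = "Order #4521 has shipped."
scores = [
    regex_match(output, r"#\d+"),                     # structural check
    exact_match(output, "Order #4521 has shipped."),  # exact check
]
print(combine(scores))  # 1.0
```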

How It Works

AgenticAssure follows a simple pipeline:

YAML Scenarios --> Loader (JSON Schema validation) --> Runner --> Adapter.run() --> Scorers --> Reports
  1. You write test scenarios in YAML, specifying inputs, expected outputs, and which scorers to apply.
  2. The loader parses and validates your YAML files against a JSON Schema.
  3. The runner iterates through each scenario, calling your agent adapter’s run() method.
  4. Scorers evaluate the agent’s response against your expectations and produce scored results.
  5. A reporter formats the results as CLI output, JSON, or HTML.
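The five steps above can be sketched as a single loop. This is a simplified illustration of the pipeline's control flow, not the SDK's actual internals; the names (Scenario, run_suite, the toy adapter and scorer) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    input: str
    expected_output: str

def run_suite(scenarios, adapter, scorer):
    results = []
    for scenario in scenarios:                 # 3. runner iterates scenarios
        response = adapter(scenario.input)     #    ...calling the adapter
        score = scorer(response, scenario.expected_output)  # 4. score it
        results.append((scenario.name, score))
    return results                             # 5. hand off to a reporter

# Toy adapter and scorer to exercise the loop.
echo_adapter = lambda text: text.lower()
substring_scorer = lambda out, exp: 1.0 if exp in out else 0.0

suite = [Scenario("greeting", "Hello there", "hello")]
print(run_suite(suite, echo_adapter, substring_scorer))  # [('greeting', 1.0)]
```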

Quick Example

1. Install AgenticAssure:

pip install agenticassure

2. Write an adapter for your agent:

# my_agent.py
from agenticassure.adapters.base import AgentAdapter
from agenticassure.results import AgentResult, ToolCall


class MyAgent:
    """Wraps your AI agent so AgenticAssure can test it."""

    def run(self, input: str, context=None) -> AgentResult:
        # Call your real agent here. This is a simplified example.
        response = call_my_llm(input)
        return AgentResult(
            output=response.text,
            tool_calls=[
                ToolCall(name=tc.name, arguments=tc.args)
                for tc in response.tool_calls
            ],
            latency_ms=response.latency,
        )
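If you want to smoke-test the adapter shape before wiring up a real model, a fake `call_my_llm` can stand in. The classes below are purely illustrative stand-ins; their attribute names (text, tool_calls, latency) simply mirror the fields the adapter reads:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for a real LLM client's response objects.
@dataclass
class FakeToolCall:
    name: str
    args: dict

@dataclass
class FakeResponse:
    text: str
    tool_calls: list = field(default_factory=list)
    latency: float = 0.0

def call_my_llm(prompt: str) -> FakeResponse:
    # A canned agent: always "calls" get_weather for weather questions.
    if "weather" in prompt.lower():
        return FakeResponse(
            text="It is sunny in San Francisco.",
            tool_calls=[FakeToolCall("get_weather", {"location": "San Francisco"})],
            latency=12.5,
        )
    return FakeResponse(text="Hello!", latency=3.0)

resp = call_my_llm("What is the weather in San Francisco?")
print(resp.tool_calls[0].name)  # get_weather
```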

3. Define test scenarios in YAML:

# scenarios/search_tests.yaml
suite:
  name: search-agent-tests
  config:
    default_timeout: 30
    retries: 1
  scenarios:
    - name: weather_query
      input: "What is the weather in San Francisco?"
      expected_tools:
        - get_weather
      expected_tool_args:
        get_weather:
          location: "San Francisco"
      scorers:
        - passfail
      tags:
        - tools
        - weather
    - name: greeting
      input: "Hello, how are you?"
      expected_output: "hello"
      scorers:
        - passfail
      tags:
        - basic
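Before any agent is called, the loader validates files like this one against a JSON Schema (step 2 of the pipeline). Conceptually, the check looks like the hand-rolled sketch below; the required keys and error messages are illustrative, not the SDK's actual schema:

```python
# A toy validator mirroring what JSON Schema enforces for a scenario:
# required keys present, fields of the right type.
REQUIRED_SCENARIO_KEYS = {"name", "input", "scorers"}

def validate_scenario(scenario: dict) -> list:
    errors = []
    missing = REQUIRED_SCENARIO_KEYS - scenario.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if not isinstance(scenario.get("scorers", []), list):
        errors.append("'scorers' must be a list")
    return errors

good = {"name": "greeting", "input": "Hello", "scorers": ["passfail"]}
bad = {"name": "greeting", "scorers": "passfail"}

print(validate_scenario(good))  # []
print(validate_scenario(bad))   # two errors: missing 'input', bad 'scorers'
```

Catching these errors at load time means a typo in a scenario file fails fast instead of surfacing mid-run.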

4. Run your tests:

agenticassure run scenarios/ --adapter my_agent.MyAgent

5. View results:

Loaded 2 scenario(s) from 1 suite(s)
Using adapter: my_agent.MyAgent

Suite: search-agent-tests
+-----------------+--------+-------+-----------+
| Scenario        | Passed | Score | Duration  |
+-----------------+--------+-------+-----------+
| weather_query   | PASS   | 1.00  | 245.3ms   |
| greeting        | PASS   | 1.00  | 128.7ms   |
+-----------------+--------+-------+-----------+

Pass rate: 100.0% (2/2)

What’s Next

  • Installation — System requirements and install options.
  • Quickstart — End-to-end walkthrough from zero to first test run.
  • Project Setup — Project structure, config files, and environment setup.