Introduction

AgenticAssure

Test and benchmark LLM-powered AI agents before deployment.

AgenticAssure is an open-source Python SDK that gives you a structured, repeatable way to test AI agents. Define test scenarios in YAML, run them against your agent through a simple adapter interface, and get detailed reports on pass/fail status, tool usage, latency, and more.

The Problem

Testing AI agents is fundamentally different from testing traditional software:

  • Non-deterministic outputs. The same prompt can produce different responses every run. You cannot rely on exact string matching alone.
  • Tool use verification. Modern agents call tools (APIs, databases, search), and you need to verify they call the right tools with the right arguments — not just that they return plausible-sounding text.
  • No standard test harness. Most teams end up writing ad-hoc scripts, manually spot-checking outputs, or skipping testing entirely. There is no pytest equivalent for agent behavior.
  • Regression detection is hard. When you change a prompt, swap a model, or update a tool schema, how do you know nothing broke? Without a test suite, you don’t.
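To make the non-determinism point concrete, here is a minimal sketch in plain Python (no AgenticAssure APIs) contrasting a brittle exact-match assertion with a tolerant keyword check:

```python
# Two valid responses to the same prompt; an agent may return either.
run_1 = "The weather in San Francisco is 18C and sunny."
run_2 = "It's sunny in San Francisco right now, around 18 degrees."

# Brittle: exact string matching passes for one run and fails for the other.
expected = "The weather in San Francisco is 18C and sunny."
print(run_1 == expected)  # True
print(run_2 == expected)  # False -- same meaning, different wording

# More tolerant: assert on the facts the answer must contain.
def contains_keywords(text: str, keywords: list) -> bool:
    lowered = text.lower()
    return all(k.lower() in lowered for k in keywords)

print(contains_keywords(run_1, ["san francisco", "sunny"]))  # True
print(contains_keywords(run_2, ["san francisco", "sunny"]))  # True
```

This is the gap that dedicated scorers (pattern matching, semantic similarity) are meant to close.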

AgenticAssure solves these problems by providing a structured testing framework purpose-built for AI agents.

Key Features

  • YAML-based test scenarios — Define inputs, expected outputs, expected tool calls, and scoring criteria in declarative YAML files. No test code to maintain.
  • Multiple scoring strategies — Built-in scorers for pass/fail checks, exact match, regex pattern matching, and semantic similarity. Use one or combine several per scenario.
  • Adapter pattern — A simple protocol-based interface lets you plug in any agent, whether it is built with OpenAI, LangChain, a custom framework, or raw API calls.
  • CLI-first workflow — Scaffold projects, validate scenarios, run tests, and generate reports from the command line. Integrates naturally into CI/CD pipelines.
  • Rich reporting — View results as formatted CLI tables, structured JSON, or standalone HTML reports.
  • Tool call assertions — Verify that your agent calls the expected tools with the expected arguments, not just that it produces the right text.
  • Tag-based filtering — Organize scenarios with tags and run subsets on demand.
  • Retry and timeout support — Handle flaky LLM responses with configurable retries and per-scenario timeouts.
  • JSON Schema validation — Scenario files are validated against a strict schema before execution, catching errors early.
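To illustrate what "combine several scorers per scenario" means in practice, here is a hand-rolled sketch. The function names (exact_match, regex_match, combine) and the averaging strategy are illustrative assumptions, not the SDK's actual API:

```python
import re

# Hypothetical scorer functions -- names are illustrative, not the SDK's API.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

def regex_match(output: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, output) else 0.0

def combine(scores: list) -> float:
    # One plausible combination strategy: average the individual scores.
    return sum(scores) / len(scores)

output = "Order #4521 has shipped."
scores = [
    regex_match(output, r"#\d+"),                     # structural check
    exact_match(output, "Order #4521 has shipped."),  # exact check
]
print(combine(scores))  # 1.0
```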

How It Works

AgenticAssure follows a simple pipeline:

YAML Scenarios --> Loader (JSON Schema validation) --> Runner --> Adapter.run() --> Scorers --> Reports
  1. You write test scenarios in YAML, specifying inputs, expected outputs, and which scorers to apply.
  2. The loader parses and validates your YAML files against a JSON Schema.
  3. The runner iterates through each scenario, calling your agent adapter’s run() method.
  4. Scorers evaluate the agent’s response against your expectations and produce scored results.
  5. A reporter formats the results as CLI output, JSON, or HTML.
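The five steps above can be sketched as a single loop. This is a simplified illustration of the pipeline's control flow, not the SDK's actual internals; the names (Scenario, run_suite, the toy adapter and scorer) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    input: str
    expected_output: str

def run_suite(scenarios, adapter, scorer):
    results = []
    for scenario in scenarios:                 # 3. runner iterates scenarios
        response = adapter(scenario.input)     #    ...calling the adapter
        score = scorer(response, scenario.expected_output)  # 4. score it
        results.append((scenario.name, score))
    return results                             # 5. hand off to a reporter

# Toy adapter and scorer to exercise the loop.
echo_adapter = lambda text: text.lower()
substring_scorer = lambda out, exp: 1.0 if exp in out else 0.0

suite = [Scenario("greeting", "Hello there", "hello")]
print(run_suite(suite, echo_adapter, substring_scorer))  # [('greeting', 1.0)]
```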

Quick Example

1. Install AgenticAssure:

pip install agenticassure

2. Write an adapter for your agent:

# my_agent.py
from agenticassure.adapters.base import AgentAdapter
from agenticassure.results import AgentResult, ToolCall


class MyAgent:
    """Wraps your AI agent so AgenticAssure can test it."""

    def run(self, input: str, context=None) -> AgentResult:
        # Call your real agent here. This is a simplified example.
        response = call_my_llm(input)
        return AgentResult(
            output=response.text,
            tool_calls=[
                ToolCall(name=tc.name, arguments=tc.args)
                for tc in response.tool_calls
            ],
            latency_ms=response.latency,
        )
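If you want to smoke-test the adapter shape before wiring up a real model, a fake `call_my_llm` can stand in. The classes below are purely illustrative stand-ins; their attribute names (text, tool_calls, latency) simply mirror the fields the adapter reads:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for a real LLM client's response objects.
@dataclass
class FakeToolCall:
    name: str
    args: dict

@dataclass
class FakeResponse:
    text: str
    tool_calls: list = field(default_factory=list)
    latency: float = 0.0

def call_my_llm(prompt: str) -> FakeResponse:
    # A canned agent: always "calls" get_weather for weather questions.
    if "weather" in prompt.lower():
        return FakeResponse(
            text="It is sunny in San Francisco.",
            tool_calls=[FakeToolCall("get_weather", {"location": "San Francisco"})],
            latency=12.5,
        )
    return FakeResponse(text="Hello!", latency=3.0)

resp = call_my_llm("What is the weather in San Francisco?")
print(resp.tool_calls[0].name)  # get_weather
```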

3. Define test scenarios in YAML:

# scenarios/search_tests.yaml
suite:
  name: search-agent-tests
  config:
    default_timeout: 30
    retries: 1
  scenarios:
    - name: weather_query
      input: "What is the weather in San Francisco?"
      expected_tools:
        - get_weather
      expected_tool_args:
        get_weather:
          location: "San Francisco"
      scorers:
        - passfail
      tags:
        - tools
        - weather
    - name: greeting
      input: "Hello, how are you?"
      expected_output: "hello"
      scorers:
        - passfail
      tags:
        - basic
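Before any agent is called, the loader validates files like this one against a JSON Schema (step 2 of the pipeline). Conceptually, the check looks like the hand-rolled sketch below; the required keys and error messages are illustrative, not the SDK's actual schema:

```python
# A toy validator mirroring what JSON Schema enforces for a scenario:
# required keys present, fields of the right type.
REQUIRED_SCENARIO_KEYS = {"name", "input", "scorers"}

def validate_scenario(scenario: dict) -> list:
    errors = []
    missing = REQUIRED_SCENARIO_KEYS - scenario.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if not isinstance(scenario.get("scorers", []), list):
        errors.append("'scorers' must be a list")
    return errors

good = {"name": "greeting", "input": "Hello", "scorers": ["passfail"]}
bad = {"name": "greeting", "scorers": "passfail"}

print(validate_scenario(good))  # []
print(validate_scenario(bad))   # two errors: missing 'input', bad 'scorers'
```

Catching these errors at load time means a typo in a scenario file fails fast instead of surfacing mid-run.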

4. Run your tests:

agenticassure run scenarios/ --adapter my_agent.MyAgent

5. View results:

Loaded 2 scenario(s) from 1 suite(s)
Using adapter: my_agent.MyAgent

Suite: search-agent-tests
+-----------------+--------+-------+-----------+
| Scenario        | Passed | Score | Duration  |
+-----------------+--------+-------+-----------+
| weather_query   | PASS   | 1.00  | 245.3ms   |
| greeting        | PASS   | 1.00  | 128.7ms   |
+-----------------+--------+-------+-----------+

Pass rate: 100.0% (2/2)

What’s Next

  • Installation — System requirements and install options.
  • Quickstart — End-to-end walkthrough from zero to first test run.
  • Project Setup — Project structure, config files, and environment setup.