Runner

The Runner is the execution engine of AgenticAssure. It takes a suite of scenarios and an adapter, then orchestrates the process of sending each scenario’s input to the agent, collecting the response, running scorers, handling retries, and aggregating results.

What the Runner Does

For each scenario in a suite, the runner:

1. Sends the scenario’s input to the adapter’s run() method.
2. Collects the AgentResult returned by the adapter.
3. Looks up each scorer listed in the scenario’s scorers field from the registry.
4. Calls each scorer’s score() method with the scenario and the agent result.
5. Determines pass/fail: the scenario passes only if every scorer passes.
6. If the adapter raised an exception in step 1 and retries are configured, repeats from step 1. A scorer failure on a successful adapter call is not retried.
7. Records the ScenarioRunResult with timing, scores, and error information.

After all scenarios have been executed (or execution is halted by fail_fast), the runner computes aggregate statistics and returns a RunResult.
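
The scoring part of this flow can be sketched without the library. The following is a minimal illustration, not AgenticAssure's implementation: the dataclasses, the `REGISTRY` dictionary, and the `evaluate` helper are stand-ins invented for this example, and the real models carry more fields.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stand-in types for illustration only (assumptions, not the library's models).
@dataclass
class AgentResult:
    output: str

@dataclass
class Score:
    scorer_name: str
    passed: bool

@dataclass
class Scenario:
    name: str
    input: str
    expected_output: str
    scorers: list[str] = field(default_factory=list)

# Hypothetical registry mapping scorer names to scoring functions.
ScorerFn = Callable[[Scenario, AgentResult], Score]
REGISTRY: dict[str, ScorerFn] = {
    "exact": lambda sc, res: Score("exact", res.output == sc.expected_output),
    "nonempty": lambda sc, res: Score("nonempty", bool(res.output.strip())),
}

def evaluate(scenario: Scenario, run: Callable[[str], AgentResult]):
    """Send the input, score the result, and pass only if every scorer passes."""
    result = run(scenario.input)
    scores = [REGISTRY[name](scenario, result) for name in scenario.scorers]
    passed = all(score.passed for score in scores)
    return passed, scores

scenario = Scenario("math", "What is 2 + 2?", "4", ["exact", "nonempty"])
passed, scores = evaluate(scenario, lambda text: AgentResult(output="4"))
print(passed)  # True
```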

Configuration

The Runner accepts a required adapter and three optional configuration parameters at construction time:


from agenticassure.runner import Runner
 
runner = Runner(
    adapter=my_adapter,
    default_timeout=30.0,
    retries=0,
    fail_fast=False,
)

`adapter` (required)

An object implementing the AgentAdapter protocol. See the Adapters documentation.

`default_timeout` (optional, default: `30.0`)

The default timeout in seconds for each scenario. This value is available for adapters and custom integrations. Individual scenarios can override this with their timeout_seconds field.

`retries` (optional, default: `0`)

The number of retry attempts per scenario. A value of 0 means each scenario is attempted exactly once. A value of 2 means a failing scenario will be retried up to 2 additional times (3 total attempts).

`fail_fast` (optional, default: `False`)

When True, the runner stops executing scenarios after the first failure. Scenarios that were not reached are not included in the results. This is useful for fast feedback during development.
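
As a rough illustration of the fail_fast behavior described above, the loop below stops after the first failure and returns results only for the scenarios it reached. `run_all` and `run_one` are hypothetical names for this sketch, not part of the library.

```python
def run_all(scenarios, run_one, fail_fast=False):
    """Run scenarios in order; with fail_fast, stop after the first failure."""
    results = []
    for scenario in scenarios:
        passed = run_one(scenario)
        results.append((scenario, passed))
        if fail_fast and not passed:
            break  # later scenarios are never reached, so they get no result
    return results

# Hypothetical outcomes keyed by scenario name.
outcomes = {"login": True, "checkout": False, "refund": True}
results = run_all(list(outcomes), outcomes.get, fail_fast=True)
print(results)  # [('login', True), ('checkout', False)]
```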

Suite Config vs Runner Config Precedence

Configuration can be set in two places: the Runner constructor and the suite’s config block. When both are set, suite config takes precedence over runner defaults for the following fields:

| Field | Runner default used when… | Suite config used when… |
| --- | --- | --- |
| `retries` | Suite `retries` is `0` | Suite `retries` is non-zero |
| `fail_fast` | Suite `fail_fast` is `False` | Suite `fail_fast` is `True` |

This means:

If your suite YAML sets retries: 3, that overrides the runner’s retries value.
If your suite YAML sets fail_fast: true, that overrides the runner’s fail_fast value.
If your suite YAML does not set these fields (or leaves them at their defaults of 0 and false), the runner’s constructor values are used.

Example:


# Runner configured with 1 retry
runner = Runner(adapter=my_adapter, retries=1)


# Suite overrides to 3 retries
suite:
  name: my-suite
  config:
    retries: 3

In this case, the suite config wins: each scenario is retried up to 3 times (4 attempts in total).
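
The precedence rule can be written as two small functions. This is a sketch of the rule as documented in the table above, with hypothetical helper names; it is not taken from the runner's source.

```python
def effective_retries(runner_retries: int, suite_retries: int = 0) -> int:
    """A non-zero suite value wins; otherwise fall back to the runner's value."""
    return suite_retries if suite_retries != 0 else runner_retries

def effective_fail_fast(runner_fail_fast: bool, suite_fail_fast: bool = False) -> bool:
    """A suite value of True wins; otherwise fall back to the runner's value."""
    return suite_fail_fast or runner_fail_fast

print(effective_retries(runner_retries=1, suite_retries=3))  # 3
print(effective_retries(runner_retries=1))                   # 1
```

One consequence of this rule, as sketched: a suite cannot switch retries off with retries: 0 or disable fail_fast with fail_fast: false when the runner was constructed with other values, because the default is indistinguishable from "not set".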

Tag-Based Filtering

The run_suite method accepts an optional tags parameter to run only scenarios matching one or more tags:


result = runner.run_suite(my_suite, tags=["smoke", "critical"])

Filtering uses set intersection: a scenario is included if it has at least one tag in common with the provided list. A scenario tagged ["smoke", "api"] would match a filter of ["smoke", "critical"] because "smoke" is in both sets.

Scenarios with no tags will never match a tag filter. If tags is None (the default), all scenarios in the suite are executed.
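
The matching rule can be expressed with Python sets. The helper below is a minimal sketch with a hypothetical name (`matches_tags`); it mirrors the documented behavior rather than the runner's actual code.

```python
def matches_tags(scenario_tags, tag_filter=None):
    """Return True if the scenario should run under the given tag filter."""
    if tag_filter is None:
        return True  # no filter: every scenario runs
    # Set intersection: at least one tag in common.
    return bool(set(scenario_tags) & set(tag_filter))

print(matches_tags(["smoke", "api"], ["smoke", "critical"]))  # True
print(matches_tags([], ["smoke"]))                            # False
```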

From the CLI, use the --tag flag (repeatable):


agenticassure run scenarios/ --adapter mymodule.MyAgent --tag smoke --tag critical

Retry Behavior

When retries are configured (either via the runner constructor or the suite config), the runner attempts each scenario up to retries + 1 times.

How It Works

The runner calls the adapter with the scenario input.
If the adapter raises an exception, the error is recorded and the runner moves to the next attempt.
If the adapter returns successfully and all scorers pass, the scenario is marked as passed and no further retries occur.
If the adapter returns successfully but any scorer fails, the scenario is marked as failed on that attempt and the result is returned without retrying (retries only guard against exceptions, not scorer failures on a successful adapter call).
If all attempts are exhausted due to exceptions, the runner returns a failing ScenarioRunResult with the last error message and an empty AgentResult.
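
The steps above can be sketched as a plain loop. This is an illustration under stated assumptions: `run_with_retries`, `call_adapter`, and `score` are hypothetical names, and the tuple it returns stands in for the fields of a ScenarioRunResult.

```python
def run_with_retries(call_adapter, score, retries=0):
    """Return (passed, retry_count, error) following the documented retry rules."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            result = call_adapter()
        except Exception as exc:
            # Adapter exceptions are recorded and trigger the next attempt.
            last_error = f"{type(exc).__name__}: {exc}"
            continue
        # The adapter returned: score once and return, whether it passed or not.
        # Scoring happens outside the try block, so scorer exceptions propagate.
        return score(result), attempt, None
    # Every attempt raised: fail with the last error and the last attempt index.
    return False, retries, last_error

calls = {"n": 0}

def flaky():
    """Hypothetical adapter call that times out twice, then answers."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timed out")
    return "4"

print(run_with_retries(flaky, lambda out: out == "4", retries=2))  # (True, 2, None)
```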

Important Details

Retries only apply to exceptions. If the adapter returns a valid AgentResult but a scorer fails, that result is returned immediately without retrying. Retries are designed to handle transient errors like network timeouts or rate limits, not incorrect agent behavior.
The retry_count field on ScenarioRunResult records the zero-based index of the attempt that produced the final result. A value of 0 means the first attempt produced the returned result, whether it passed or failed.
Each retry is a fresh call. There is no state carried between retry attempts. The adapter receives the same input and context each time.

Example


runner = Runner(adapter=my_adapter, retries=2)
result = runner.run_scenario(my_scenario)
 
if result.error:
    print(f"Failed after {result.retry_count + 1} attempts: {result.error}")
elif result.retry_count > 0:
    print(f"Succeeded on attempt {result.retry_count + 1}")

Error Handling

The runner catches all exceptions raised by the adapter during scenario execution. Exceptions are handled as follows:

The exception type and message are formatted as "ExceptionType: message" and stored in ScenarioRunResult.error.
If retries are configured, the runner attempts the scenario again.
If all retries are exhausted, the final ScenarioRunResult has:
- passed = False
- agent_result set to an AgentResult with an empty output
- error set to the last exception’s formatted message
- scores as an empty list (scorers are not run when the adapter errors)
- retry_count set to the index of the last attempt

Exceptions from scorers are not caught by the runner. If a scorer raises an exception, it will propagate up to the caller. Scorer exceptions indicate a bug in the scorer implementation and should be fixed.
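
To make the shape of an exhausted-retries result concrete, here is a small sketch. The two dataclasses are stand-ins that include only the fields named above (the real Pydantic models have more), and `failure_result` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentResult:  # stand-in for illustration
    output: str = ""

@dataclass
class ScenarioRunResult:  # stand-in with only the fields discussed above
    passed: bool
    agent_result: AgentResult
    error: Optional[str] = None
    scores: list = field(default_factory=list)
    retry_count: int = 0

def failure_result(exc: Exception, last_attempt: int) -> ScenarioRunResult:
    """Build the failing result recorded when every attempt raised."""
    return ScenarioRunResult(
        passed=False,
        agent_result=AgentResult(output=""),
        error=f"{type(exc).__name__}: {exc}",  # "ExceptionType: message"
        scores=[],                              # scorers never ran
        retry_count=last_attempt,
    )

result = failure_result(ConnectionError("rate limit exceeded"), last_attempt=2)
print(result.error)  # ConnectionError: rate limit exceeded
```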

Using Runner from Python

While the CLI is the most common way to use AgenticAssure, the Runner class is designed for direct use from Python code. This is useful for integration into CI pipelines, custom test harnesses, or Jupyter notebooks.

Running a Full Suite


import sys

from agenticassure.loader import load_scenarios
from agenticassure.runner import Runner
 
# Load scenarios from YAML
suite = load_scenarios("scenarios/customer-support.yaml")
 
# Create a runner with your adapter
runner = Runner(adapter=my_adapter, retries=1, fail_fast=False)
 
# Execute the suite
result = runner.run_suite(suite)
 
# Check results
print(f"Pass rate: {result.pass_rate:.0%}")
print(f"Aggregate score: {result.aggregate_score:.3f}")
 
# Fail the CI build if any scenario failed
if result.pass_rate < 1.0:
    sys.exit(1)

Running a Single Scenario

You can run an individual scenario without a suite:


from agenticassure.scenario import Scenario
from agenticassure.runner import Runner
 
scenario = Scenario(
    name="quick_test",
    input="What is 2 + 2?",
    expected_output="4",
    scorers=["passfail", "exact"],
)
 
runner = Runner(adapter=my_adapter)
result = runner.run_scenario(scenario)
 
print(f"Passed: {result.passed}")
for score in result.scores:
    print(f"  {score.scorer_name}: {score.explanation}")

Running with Tag Filters


result = runner.run_suite(suite, tags=["smoke"])

Passing Context

The optional context parameter is forwarded to the adapter’s run() method. Use it to pass session state, authentication tokens, or any other runtime data your agent needs:


context = {
    "user_id": "test-user-123",
    "session_token": "abc...",
    "feature_flags": {"new_tool": True},
}
 
result = runner.run_suite(suite, context=context)
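
On the receiving side, an adapter might read these values as shown below. This is a sketch only: the `run(input, context)` signature and the stand-in `AgentResult` are assumptions made for this example, so check the Adapters documentation for the actual protocol.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class AgentResult:  # stand-in for illustration
    output: str

class GreetingAdapter:
    """Hypothetical adapter; the real run() signature may differ."""

    def run(self, input: str, context: Optional[dict[str, Any]] = None) -> AgentResult:
        ctx = context or {}  # tolerate a missing context
        user = ctx.get("user_id", "anonymous")
        flags = ctx.get("feature_flags", {})
        suffix = " [new_tool enabled]" if flags.get("new_tool") else ""
        return AgentResult(output=f"{user}: {input}{suffix}")

adapter = GreetingAdapter()
reply = adapter.run("hi", {"user_id": "test-user-123", "feature_flags": {"new_tool": True}})
print(reply.output)  # test-user-123: hi [new_tool enabled]
```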

Loading Multiple Suites from a Directory


from agenticassure.loader import load_scenarios_from_dir
 
suites = load_scenarios_from_dir("scenarios/")
 
runner = Runner(adapter=my_adapter)
for suite in suites:
    result = runner.run_suite(suite)
    print(f"{suite.name}: {result.pass_rate:.0%}")

Custom Reporting

Since RunResult and all nested result models are Pydantic models, you can serialize them and build any reporting or monitoring integration you need:


result = runner.run_suite(suite)
 
# Write raw results to a JSON file
with open("results.json", "w") as f:
    f.write(result.model_dump_json(indent=2))
 
# Send summary metrics to your monitoring system
metrics = {
    "suite": result.suite_name,
    "pass_rate": result.pass_rate,
    "aggregate_score": result.aggregate_score,
    "duration_ms": result.total_duration_ms,
    "scenarios_run": len(result.scenario_results),
    "scenarios_passed": sum(1 for sr in result.scenario_results if sr.passed),
}
send_to_monitoring(metrics)  # placeholder: replace with your own reporting function