Results

AgenticAssure uses a layered result model to capture data at every level of a test run — from the raw agent response, through individual scorer evaluations, up to suite-wide aggregates. Understanding these models is essential for interpreting reports and for working with results programmatically.


Result Model Hierarchy

RunResult (one per suite execution)
+-- ScenarioRunResult (one per scenario)
    +-- AgentResult (the agent's raw response)
    +-- ScoreResult (one per scorer applied to that scenario)

Each level adds context. The AgentResult is what your adapter produces. Scorers consume it and produce ScoreResult objects. The runner wraps both into a ScenarioRunResult with timing and status information. Finally, RunResult aggregates everything at the suite level.


AgentResult

AgentResult is the structured response your adapter returns after calling the agent. This is the primary input to all scorers.

class AgentResult(BaseModel):
    output: str
    tool_calls: list[ToolCall] = []
    reasoning_trace: list[str] | None = None
    latency_ms: float = 0.0
    token_usage: TokenUsage | None = None
    raw_response: Any | None = None

Fields

output (required)

The agent’s text response. This is the primary value that most scorers evaluate. Every adapter must populate this field, even if the value is an empty string.

tool_calls (optional, default: [])

A list of ToolCall objects representing the tools (functions) the agent invoked during the scenario. Each ToolCall has:

Field       Type             Description
name        str              The name of the tool that was called.
arguments   dict[str, Any]   The arguments passed to the tool.
result      Any | None       The value returned by the tool, if available.

Tool calls are used by the passfail scorer to verify expected_tools and expected_tool_args.
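The tool-call check can be sketched as follows. This is a simplified illustration over plain dicts rather than the library's ToolCall model, and the subset-matching of arguments is an assumption, not the documented implementation:

```python
def tools_satisfied(tool_calls, expected_tools, expected_tool_args=None):
    """Check that every expected tool was called, optionally with matching args.

    Sketch only: assumes expected_tool_args entries must be a subset of the
    arguments actually passed to each tool.
    """
    called = {tc["name"]: tc["arguments"] for tc in tool_calls}
    for name in expected_tools:
        if name not in called:
            return False
    if expected_tool_args:
        for name, want in expected_tool_args.items():
            got = called.get(name, {})
            if any(got.get(k) != v for k, v in want.items()):
                return False
    return True

calls = [{"name": "lookup_zip", "arguments": {"city": "Beverly Hills"}}]
tools_satisfied(calls, ["lookup_zip"])  # True
tools_satisfied(calls, ["other_tool"])  # False
```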

reasoning_trace (optional, default: None)

A list of strings representing the agent’s chain of thought or intermediate reasoning steps. This is informational — no built-in scorer evaluates it, but it is useful for debugging and for custom scorers that assess reasoning quality.

latency_ms (optional, default: 0.0)

The time in milliseconds that the agent took to produce the response. When your adapter measures this, it appears in reports for performance benchmarking.
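One way an adapter might measure this, using the standard library's monotonic clock (a sketch; `call_agent_with_timing` and `agent_fn` are hypothetical names, not part of the library):

```python
import time

def call_agent_with_timing(agent_fn, prompt):
    """Invoke the agent and record wall-clock latency in milliseconds."""
    start = time.perf_counter()
    output = agent_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    # In a real adapter these values would populate AgentResult fields.
    return {"output": output, "latency_ms": latency_ms}

result = call_agent_with_timing(lambda p: p.upper(), "hello")
```

`time.perf_counter()` is preferred over `time.time()` here because it is monotonic and not affected by system clock adjustments.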

token_usage (optional, default: None)

A TokenUsage object tracking the number of tokens consumed:

Field              Type   Description
prompt_tokens      int    Number of tokens in the prompt sent to the model.
completion_tokens  int    Number of tokens in the model's response.
total_tokens       int    Computed property: prompt_tokens + completion_tokens.

Token usage data is captured in reports and is useful for tracking costs across test runs.
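The computed-property behavior can be illustrated with a simplified stand-in (a dataclass rather than the library's Pydantic model):

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    """Simplified stand-in for the library's TokenUsage model."""
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        # Derived, not stored: always prompt + completion.
        return self.prompt_tokens + self.completion_tokens

usage = TokenUsage(prompt_tokens=120, completion_tokens=45)
print(usage.total_tokens)  # 165
```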

raw_response (optional, default: None)

An arbitrary field for storing the raw, unprocessed response from the underlying API or framework. No scorer reads this field. It exists purely for debugging and post-hoc analysis.


ScoreResult

ScoreResult is what a scorer produces after evaluating an agent’s response against a scenario’s expectations.

class ScoreResult(BaseModel):
    scenario_id: str
    scorer_name: str
    score: float  # between 0.0 and 1.0 inclusive
    passed: bool
    explanation: str = ""
    details: dict[str, Any] | None = None

Fields

scenario_id

The ID of the scenario this score applies to. Automatically linked to Scenario.id.

scorer_name

The name of the scorer that produced this result (e.g., "passfail", "exact", "regex").

score

A numeric score between 0.0 and 1.0 inclusive. The interpretation depends on the scorer:

  • passfail, exact, regex — Binary: 1.0 for pass, 0.0 for fail.
  • similarity — Continuous: the cosine similarity between the expected and actual outputs, clamped to the [0.0, 1.0] range.

This value is enforced by a Pydantic validator. Values outside the range will raise a validation error.
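The range check behaves like this standalone function (a sketch of the constraint, not the library's validator code):

```python
def validate_score(score: float) -> float:
    """Mirror of the [0.0, 1.0] range check that ScoreResult enforces."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    return score

validate_score(0.5)   # fine
# validate_score(1.2) would raise ValueError
```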

passed

A boolean indicating whether the scorer considers this scenario to have passed. For binary scorers, this directly corresponds to score == 1.0. For continuous scorers like similarity, it indicates whether the score met the configured threshold.

explanation

A human-readable string describing what the scorer checked and why it passed or failed. Examples:

  • "Agent produced output; All expected tools were called; Expected output found in response"
  • "Output does not match expected output (normalized=True)"
  • "Pattern '\\d{5}' matched in output"
  • "Cosine similarity: 0.834 (threshold: 0.7)"

This field appears in CLI and HTML reports.

details (optional)

An open dictionary for scorer-specific structured data. For example, the regex scorer includes the matched pattern and substring:

{"pattern": "\\d{5}", "match": "90210"}

The similarity scorer includes the raw cosine similarity and threshold:

{"cosine_similarity": 0.834, "threshold": 0.7}
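Tying these fields together, a regex-style scorer producing a ScoreResult-shaped dict might look like this (a sketch over plain dicts; the library's actual scorer returns a ScoreResult model and may differ in details):

```python
import re

def regex_score(scenario_id, pattern, output):
    """Minimal regex-scorer sketch producing a ScoreResult-shaped dict."""
    m = re.search(pattern, output)
    return {
        "scenario_id": scenario_id,
        "scorer_name": "regex",
        "score": 1.0 if m else 0.0,
        "passed": m is not None,
        "explanation": (f"Pattern {pattern!r} matched in output" if m
                        else f"Pattern {pattern!r} not found in output"),
        "details": {"pattern": pattern, "match": m.group(0)} if m else None,
    }

hit = regex_score("zip-01", r"\d{5}", "The ZIP code is 90210.")
# hit["score"] == 1.0, hit["details"]["match"] == "90210"
```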

ScenarioRunResult

ScenarioRunResult wraps the complete outcome of running a single scenario, including the agent’s response, all scorer results, and execution metadata.

class ScenarioRunResult(BaseModel):
    scenario: Scenario
    agent_result: AgentResult
    scores: list[ScoreResult] = []
    passed: bool = False
    duration_ms: float = 0.0
    error: str | None = None
    retry_count: int = 0

Fields

scenario

The full Scenario object that was executed. This provides access to the scenario’s name, input, expected values, tags, and configuration.

agent_result

The AgentResult returned by the adapter. If the scenario failed due to an exception (after all retries were exhausted), this will be an AgentResult with an empty output string.

scores

A list of ScoreResult objects, one for each scorer applied to the scenario. If the scenario errored out before scoring could occur, this list will be empty.

passed

A boolean indicating whether the scenario passed. This is True only if every scorer in the scores list has passed == True. If the scores list is empty (e.g., due to an error), passed is False.
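This pass logic can be sketched as a one-liner over ScoreResult-shaped dicts (simplified; the runner operates on the Pydantic models):

```python
def scenario_passed(scores):
    """A scenario passes only if it was scored at all and every scorer passed."""
    return bool(scores) and all(s["passed"] for s in scores)

scenario_passed([{"passed": True}, {"passed": True}])   # True
scenario_passed([{"passed": True}, {"passed": False}])  # False
scenario_passed([])                                     # False (errored scenario)
```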

duration_ms

Wall-clock time in milliseconds for the scenario execution, including adapter invocation and scoring. This measures the time from the start of the adapter call to the completion of all scorers.

error

If the scenario failed due to an exception (rather than a scorer failure), this field contains the error message in the format "ExceptionType: message". When this field is not None, the scenario’s passed is False and the scores list is typically empty.

retry_count

The zero-based index of the attempt that produced this result. If the scenario succeeded on the first try, this is 0. If it required one retry, this is 1. The maximum value is equal to the configured retries count.
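The attempt-counting and error-formatting semantics described above can be sketched as a retry loop (an illustration only; `run_with_retries` is a hypothetical helper, not the runner's actual code):

```python
def run_with_retries(call, retries):
    """Attempt `call` up to retries + 1 times.

    Returns (result, error, retry_count), where retry_count is the zero-based
    index of the attempt that produced the result, and error follows the
    "ExceptionType: message" format when all attempts fail.
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            return call(), None, attempt
        except Exception as exc:
            last_err = exc
    return None, f"{type(last_err).__name__}: {last_err}", retries

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise TimeoutError("agent timed out")
    return "ok"

result, error, retry_count = run_with_retries(flaky, retries=2)
# result == "ok", error is None, retry_count == 1 (succeeded on first retry)
```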


RunResult

RunResult is the top-level result object for an entire suite execution. It aggregates all scenario results and computes summary statistics.

class RunResult(BaseModel):
    run_id: str
    timestamp: datetime
    suite_name: str
    scenario_results: list[ScenarioRunResult] = []
    aggregate_score: float = 0.0
    pass_rate: float = 0.0
    total_duration_ms: float = 0.0
    model_info: dict[str, Any] | None = None

Fields

run_id

A UUID automatically generated for each run. Used to uniquely identify the run in reports and file names.

timestamp

The UTC datetime when the run was created.

suite_name

The name of the suite that was executed.

scenario_results

A list of ScenarioRunResult objects, one for each scenario that was executed. If fail_fast was enabled and a scenario failed, the list will contain only the scenarios that were executed up to and including the failed one.

aggregate_score

The mean score across all scenarios. For each scenario, the scorer scores are averaged first (the mean of all ScoreResult.score values for that scenario), and then those per-scenario averages are averaged across all scenarios.

Formula:

aggregate_score = mean(
    mean(score.score for score in scenario.scores)
    for scenario in scenario_results
)

Scenarios with no scores (e.g., those that errored) contribute 0.0 to the average.

pass_rate

The fraction of scenarios that passed, as a float between 0.0 and 1.0.

pass_rate = count(passed scenarios) / count(total scenarios)

total_duration_ms

The wall-clock time for the entire suite run in milliseconds. This is measured by the runner from the start of the first scenario to the completion of the last.

model_info (optional)

An open dictionary for storing model-related metadata (model name, version, provider, etc.). Not populated by default; available for custom use.


How Aggregates Are Computed

The RunResult.compute_aggregates() method is called automatically by the runner after all scenarios have been executed. It performs the following calculations:

  1. Per-scenario average score — For each ScenarioRunResult, the scores from all scorers are averaged. If a scenario has three scorers with scores [1.0, 1.0, 0.0], its average is 0.667.

  2. Aggregate score — The mean of all per-scenario average scores across the suite.

  3. Pass rate — The number of fully passing scenarios (where passed == True) divided by the total number of scenarios.

  4. Total duration — The sum of duration_ms values across all scenario results.
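The four steps above can be sketched over ScoreResult-shaped dicts (simplified; the actual method operates on the Pydantic models):

```python
def compute_aggregates(scenario_results):
    """Sketch of the aggregate computation described above.

    Scenarios with no scores contribute 0.0 to the aggregate score.
    """
    n = len(scenario_results)
    per_scenario = [
        sum(s["score"] for s in sr["scores"]) / len(sr["scores"])
        if sr["scores"] else 0.0
        for sr in scenario_results
    ]
    return {
        "aggregate_score": sum(per_scenario) / n if n else 0.0,
        "pass_rate": sum(1 for sr in scenario_results if sr["passed"]) / n if n else 0.0,
        "total_duration_ms": sum(sr["duration_ms"] for sr in scenario_results),
    }

# Half the scenarios pass fully, half fail one of three scorers:
results = [
    {"scores": [{"score": 1.0}] * 3, "passed": True, "duration_ms": 100.0},
    {"scores": [{"score": 1.0}, {"score": 1.0}, {"score": 0.0}],
     "passed": False, "duration_ms": 100.0},
]
agg = compute_aggregates(results)
# pass_rate = 0.5, aggregate_score = (1.0 + 2/3) / 2 ≈ 0.833
```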

Note that pass_rate and aggregate_score can diverge. A suite where half the scenarios pass fully and half fail on one of three scorers might have a pass_rate of 0.5 but an aggregate_score of 0.833.


Using Results Programmatically

When using AgenticAssure from Python (rather than the CLI), you have direct access to the result objects.

Inspecting a RunResult

from agenticassure.runner import Runner

runner = Runner(adapter=my_adapter)
result = runner.run_suite(my_suite)

print(f"Suite: {result.suite_name}")
print(f"Pass rate: {result.pass_rate:.0%}")
print(f"Aggregate score: {result.aggregate_score:.3f}")
print(f"Duration: {result.total_duration_ms:.0f}ms")

for sr in result.scenario_results:
    status = "PASS" if sr.passed else "FAIL"
    print(f"  [{status}] {sr.scenario.name} ({sr.duration_ms:.0f}ms)")
    if sr.error:
        print(f"    Error: {sr.error}")
    for score in sr.scores:
        print(f"    {score.scorer_name}: {score.score:.2f} - {score.explanation}")

Extracting Specific Data

# Get all failed scenarios
failed = [sr for sr in result.scenario_results if not sr.passed]

# Get total token usage across all scenarios
total_tokens = sum(
    sr.agent_result.token_usage.total_tokens
    for sr in result.scenario_results
    if sr.agent_result.token_usage is not None
)

# Get all tool calls made across the suite
all_tool_calls = [
    tc
    for sr in result.scenario_results
    for tc in sr.agent_result.tool_calls
]

# Filter results by tag
regression_results = [
    sr for sr in result.scenario_results
    if "regression" in sr.scenario.tags
]

Serializing Results

All result models are Pydantic BaseModel instances, so they support standard Pydantic serialization:

# To a Python dictionary
data = result.model_dump()

# To JSON
json_str = result.model_dump_json(indent=2)

This is useful for storing results in databases, sending them to monitoring systems, or building custom report formats.
