Results

AgenticAssure uses a layered result model to capture data at every level of a test run — from the raw agent response, through individual scorer evaluations, up to suite-wide aggregates. Understanding these models is essential for interpreting reports and for working with results programmatically.


Result Model Hierarchy

RunResult (one per suite execution)
+-- ScenarioRunResult (one per scenario)
    +-- AgentResult (the agent's raw response)
    +-- ScoreResult (one per scorer applied to that scenario)

Each level adds context. The AgentResult is what your adapter produces. Scorers consume it and produce ScoreResult objects. The runner wraps both into a ScenarioRunResult with timing and status information. Finally, RunResult aggregates everything at the suite level.


AgentResult

AgentResult is the structured response your adapter returns after calling the agent. This is the primary input to all scorers.

class AgentResult(BaseModel):
    output: str
    tool_calls: list[ToolCall] = []
    reasoning_trace: list[str] | None = None
    latency_ms: float = 0.0
    token_usage: TokenUsage | None = None
    raw_response: Any | None = None

Fields

output (required)

The agent’s text response. This is the primary value that most scorers evaluate. Every adapter must populate this field, even if the value is an empty string.

tool_calls (optional, default: [])

A list of ToolCall objects representing the tools (functions) the agent invoked during the scenario. Each ToolCall has:

Field       Type             Description
name        str              The name of the tool that was called.
arguments   dict[str, Any]   The arguments passed to the tool.
result      Any | None       The value returned by the tool, if available.

Tool calls are used by the passfail scorer to verify expected_tools and expected_tool_args.
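The tool-call check can be sketched as follows. This is a simplified illustration over plain dicts rather than the library's ToolCall model, and the subset-matching of arguments is an assumption, not the documented implementation:

```python
def tools_satisfied(tool_calls, expected_tools, expected_tool_args=None):
    """Check that every expected tool was called, optionally with matching args.

    Sketch only: assumes expected_tool_args entries must be a subset of the
    arguments actually passed to each tool.
    """
    called = {tc["name"]: tc["arguments"] for tc in tool_calls}
    for name in expected_tools:
        if name not in called:
            return False
    if expected_tool_args:
        for name, want in expected_tool_args.items():
            got = called.get(name, {})
            if any(got.get(k) != v for k, v in want.items()):
                return False
    return True

calls = [{"name": "lookup_zip", "arguments": {"city": "Beverly Hills"}}]
tools_satisfied(calls, ["lookup_zip"])  # True
tools_satisfied(calls, ["other_tool"])  # False
```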

reasoning_trace (optional, default: None)

A list of strings representing the agent’s chain of thought or intermediate reasoning steps. This is informational — no built-in scorer evaluates it, but it is useful for debugging and for custom scorers that assess reasoning quality.

latency_ms (optional, default: 0.0)

The time in milliseconds that the agent took to produce the response. When your adapter measures this, it appears in reports for performance benchmarking.
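One way an adapter might measure this, using the standard library's monotonic clock (a sketch; `call_agent_with_timing` and `agent_fn` are hypothetical names, not part of the library):

```python
import time

def call_agent_with_timing(agent_fn, prompt):
    """Invoke the agent and record wall-clock latency in milliseconds."""
    start = time.perf_counter()
    output = agent_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    # In a real adapter these values would populate AgentResult fields.
    return {"output": output, "latency_ms": latency_ms}

result = call_agent_with_timing(lambda p: p.upper(), "hello")
```

`time.perf_counter()` is preferred over `time.time()` here because it is monotonic and not affected by system clock adjustments.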

token_usage (optional, default: None)

A TokenUsage object tracking the number of tokens consumed:

Field              Type   Description
prompt_tokens      int    Number of tokens in the prompt sent to the model.
completion_tokens  int    Number of tokens in the model's response.
total_tokens       int    Computed property: prompt_tokens + completion_tokens.

Token usage data is captured in reports and is useful for tracking costs across test runs.
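The computed-property behavior can be illustrated with a simplified stand-in (a dataclass rather than the library's Pydantic model):

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    """Simplified stand-in for the library's TokenUsage model."""
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        # Derived, not stored: always prompt + completion.
        return self.prompt_tokens + self.completion_tokens

usage = TokenUsage(prompt_tokens=120, completion_tokens=45)
print(usage.total_tokens)  # 165
```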

raw_response (optional, default: None)

An arbitrary field for storing the raw, unprocessed response from the underlying API or framework. No scorer reads this field. It exists purely for debugging and post-hoc analysis.


ScoreResult

ScoreResult is what a scorer produces after evaluating an agent’s response against a scenario’s expectations.

class ScoreResult(BaseModel):
    scenario_id: str
    scorer_name: str
    score: float  # between 0.0 and 1.0 inclusive
    passed: bool
    explanation: str = ""
    details: dict[str, Any] | None = None

Fields

scenario_id

The ID of the scenario this score applies to. Automatically linked to Scenario.id.

scorer_name

The name of the scorer that produced this result (e.g., "passfail", "exact", "regex").

score

A numeric score between 0.0 and 1.0 inclusive. The interpretation depends on the scorer:

  • passfail, exact, regex — Binary: 1.0 for pass, 0.0 for fail.
  • similarity — Continuous: the cosine similarity between the expected and actual outputs, clamped to the [0.0, 1.0] range.

This value is enforced by a Pydantic validator. Values outside the range will raise a validation error.
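The range check behaves like this standalone function (a sketch of the constraint, not the library's validator code):

```python
def validate_score(score: float) -> float:
    """Mirror of the [0.0, 1.0] range check that ScoreResult enforces."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    return score

validate_score(0.5)   # fine
# validate_score(1.2) would raise ValueError
```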

passed

A boolean indicating whether the scorer considers this scenario to have passed. For binary scorers, this directly corresponds to score == 1.0. For continuous scorers like similarity, it indicates whether the score met the configured threshold.

explanation

A human-readable string describing what the scorer checked and why it passed or failed. Examples:

  • "Agent produced output; All expected tools were called; Expected output found in response"
  • "Output does not match expected output (normalized=True)"
  • "Pattern '\\d{5}' matched in output"
  • "Cosine similarity: 0.834 (threshold: 0.7)"

This field appears in CLI and HTML reports.

details (optional)

An open dictionary for scorer-specific structured data. For example, the regex scorer includes the matched pattern and substring:

{"pattern": "\\d{5}", "match": "90210"}

The similarity scorer includes the raw cosine similarity and threshold:

{"cosine_similarity": 0.834, "threshold": 0.7}
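Tying these fields together, a regex-style scorer producing a ScoreResult-shaped dict might look like this (a sketch over plain dicts; the library's actual scorer returns a ScoreResult model and may differ in details):

```python
import re

def regex_score(scenario_id, pattern, output):
    """Minimal regex-scorer sketch producing a ScoreResult-shaped dict."""
    m = re.search(pattern, output)
    return {
        "scenario_id": scenario_id,
        "scorer_name": "regex",
        "score": 1.0 if m else 0.0,
        "passed": m is not None,
        "explanation": (f"Pattern {pattern!r} matched in output" if m
                        else f"Pattern {pattern!r} not found in output"),
        "details": {"pattern": pattern, "match": m.group(0)} if m else None,
    }

hit = regex_score("zip-01", r"\d{5}", "The ZIP code is 90210.")
# hit["score"] == 1.0, hit["details"]["match"] == "90210"
```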

ScenarioRunResult

ScenarioRunResult wraps the complete outcome of running a single scenario, including the agent’s response, all scorer results, and execution metadata.

class ScenarioRunResult(BaseModel):
    scenario: Scenario
    agent_result: AgentResult
    scores: list[ScoreResult] = []
    passed: bool = False
    duration_ms: float = 0.0
    error: str | None = None
    retry_count: int = 0

Fields

scenario

The full Scenario object that was executed. This provides access to the scenario’s name, input, expected values, tags, and configuration.

agent_result

The AgentResult returned by the adapter. If the scenario failed due to an exception (after all retries were exhausted), this will be an AgentResult with an empty output string.

scores

A list of ScoreResult objects, one for each scorer applied to the scenario. If the scenario errored out before scoring could occur, this list will be empty.

passed

A boolean indicating whether the scenario passed. This is True only if every scorer in the scores list has passed == True. If the scores list is empty (e.g., due to an error), passed is False.
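This pass logic can be sketched as a one-liner over ScoreResult-shaped dicts (simplified; the runner operates on the Pydantic models):

```python
def scenario_passed(scores):
    """A scenario passes only if it was scored at all and every scorer passed."""
    return bool(scores) and all(s["passed"] for s in scores)

scenario_passed([{"passed": True}, {"passed": True}])   # True
scenario_passed([{"passed": True}, {"passed": False}])  # False
scenario_passed([])                                     # False (errored scenario)
```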

duration_ms

Wall-clock time in milliseconds for the scenario execution, including adapter invocation and scoring. This measures the time from the start of the adapter call to the completion of all scorers.

error

If the scenario failed due to an exception (rather than a scorer failure), this field contains the error message in the format "ExceptionType: message". When this field is not None, the scenario’s passed is False and the scores list is typically empty.

retry_count

The zero-based index of the attempt that produced this result. If the scenario succeeded on the first try, this is 0. If it required one retry, this is 1. The maximum value is equal to the configured retries count.
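The attempt-counting and error-formatting semantics described above can be sketched as a retry loop (an illustration only; `run_with_retries` is a hypothetical helper, not the runner's actual code):

```python
def run_with_retries(call, retries):
    """Attempt `call` up to retries + 1 times.

    Returns (result, error, retry_count), where retry_count is the zero-based
    index of the attempt that produced the result, and error follows the
    "ExceptionType: message" format when all attempts fail.
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            return call(), None, attempt
        except Exception as exc:
            last_err = exc
    return None, f"{type(last_err).__name__}: {last_err}", retries

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise TimeoutError("agent timed out")
    return "ok"

result, error, retry_count = run_with_retries(flaky, retries=2)
# result == "ok", error is None, retry_count == 1 (succeeded on first retry)
```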


RunResult

RunResult is the top-level result object for an entire suite execution. It aggregates all scenario results and computes summary statistics.

class RunResult(BaseModel):
    run_id: str
    timestamp: datetime
    suite_name: str
    scenario_results: list[ScenarioRunResult] = []
    aggregate_score: float = 0.0
    pass_rate: float = 0.0
    total_duration_ms: float = 0.0
    model_info: dict[str, Any] | None = None

Fields

run_id

A UUID automatically generated for each run. Used to uniquely identify the run in reports and file names.

timestamp

The UTC datetime when the run was created.

suite_name

The name of the suite that was executed.

scenario_results

A list of ScenarioRunResult objects, one for each scenario that was executed. If fail_fast was enabled and a scenario failed, the list will contain only the scenarios that were executed up to and including the failed one.

aggregate_score

The mean score across all scenarios. For each scenario, the scorer scores are averaged first (the mean of all ScoreResult.score values for that scenario), and then those per-scenario averages are averaged across all scenarios.

Formula:

aggregate_score = mean(
    mean(score.score for score in scenario.scores)
    for scenario in scenario_results
)

Scenarios with no scores (e.g., those that errored) contribute 0.0 to the average.

pass_rate

The fraction of scenarios that passed, as a float between 0.0 and 1.0.

pass_rate = count(passed scenarios) / count(total scenarios)

total_duration_ms

The wall-clock time for the entire suite run in milliseconds. This is measured by the runner from the start of the first scenario to the completion of the last.

model_info (optional)

An open dictionary for storing model-related metadata (model name, version, provider, etc.). Not populated by default; available for custom use.


How Aggregates Are Computed

The RunResult.compute_aggregates() method is called automatically by the runner after all scenarios have been executed. It performs the following calculations:

  1. Per-scenario average score — For each ScenarioRunResult, the scores from all scorers are averaged. If a scenario has three scorers with scores [1.0, 1.0, 0.0], its average is 0.667.

  2. Aggregate score — The mean of all per-scenario average scores across the suite.

  3. Pass rate — The number of fully passing scenarios (where passed == True) divided by the total number of scenarios.

  4. Total duration — The sum of duration_ms values across all scenario results.
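The four steps above can be sketched over ScoreResult-shaped dicts (simplified; the actual method operates on the Pydantic models):

```python
def compute_aggregates(scenario_results):
    """Sketch of the aggregate computation described above.

    Scenarios with no scores contribute 0.0 to the aggregate score.
    """
    n = len(scenario_results)
    per_scenario = [
        sum(s["score"] for s in sr["scores"]) / len(sr["scores"])
        if sr["scores"] else 0.0
        for sr in scenario_results
    ]
    return {
        "aggregate_score": sum(per_scenario) / n if n else 0.0,
        "pass_rate": sum(1 for sr in scenario_results if sr["passed"]) / n if n else 0.0,
        "total_duration_ms": sum(sr["duration_ms"] for sr in scenario_results),
    }

# Half the scenarios pass fully, half fail one of three scorers:
results = [
    {"scores": [{"score": 1.0}] * 3, "passed": True, "duration_ms": 100.0},
    {"scores": [{"score": 1.0}, {"score": 1.0}, {"score": 0.0}],
     "passed": False, "duration_ms": 100.0},
]
agg = compute_aggregates(results)
# pass_rate = 0.5, aggregate_score = (1.0 + 2/3) / 2 ≈ 0.833
```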

Note that pass_rate and aggregate_score can diverge. A suite where half the scenarios pass fully and half fail on one of three scorers might have a pass_rate of 0.5 but an aggregate_score of 0.833.


Using Results Programmatically

When using AgenticAssure from Python (rather than the CLI), you have direct access to the result objects.

Inspecting a RunResult

from agenticassure.runner import Runner

runner = Runner(adapter=my_adapter)
result = runner.run_suite(my_suite)

print(f"Suite: {result.suite_name}")
print(f"Pass rate: {result.pass_rate:.0%}")
print(f"Aggregate score: {result.aggregate_score:.3f}")
print(f"Duration: {result.total_duration_ms:.0f}ms")

for sr in result.scenario_results:
    status = "PASS" if sr.passed else "FAIL"
    print(f"  [{status}] {sr.scenario.name} ({sr.duration_ms:.0f}ms)")
    if sr.error:
        print(f"    Error: {sr.error}")
    for score in sr.scores:
        print(f"    {score.scorer_name}: {score.score:.2f} - {score.explanation}")

Extracting Specific Data

# Get all failed scenarios
failed = [sr for sr in result.scenario_results if not sr.passed]

# Get total token usage across all scenarios
total_tokens = sum(
    sr.agent_result.token_usage.total_tokens
    for sr in result.scenario_results
    if sr.agent_result.token_usage is not None
)

# Get all tool calls made across the suite
all_tool_calls = [
    tc
    for sr in result.scenario_results
    for tc in sr.agent_result.tool_calls
]

# Filter results by tag
regression_results = [
    sr for sr in result.scenario_results
    if "regression" in sr.scenario.tags
]

Serializing Results

All result models are Pydantic BaseModel instances, so they support standard Pydantic serialization:

# To a Python dictionary
data = result.model_dump()

# To JSON
json_str = result.model_dump_json(indent=2)

This is useful for storing results in databases, sending them to monitoring systems, or building custom report formats.
