Results
AgenticAssure uses a layered result model to capture data at every level of a test run — from the raw agent response, through individual scorer evaluations, up to suite-wide aggregates. Understanding these models is essential for interpreting reports and for working with results programmatically.
Result Model Hierarchy
```
RunResult (one per suite execution)
+-- ScenarioRunResult (one per scenario)
    +-- AgentResult (the agent's raw response)
    +-- ScoreResult (one per scorer applied to that scenario)
```

Each level adds context. The AgentResult is what your adapter produces. Scorers consume it and produce ScoreResult objects. The runner wraps both into a ScenarioRunResult with timing and status information. Finally, RunResult aggregates everything at the suite level.
AgentResult
AgentResult is the structured response your adapter returns after calling the agent. This is the primary input to all scorers.
```python
class AgentResult(BaseModel):
    output: str
    tool_calls: list[ToolCall] = []
    reasoning_trace: list[str] | None = None
    latency_ms: float = 0.0
    token_usage: TokenUsage | None = None
    raw_response: Any | None = None
```

Fields
output (required)
The agent’s text response. This is the primary value that most scorers evaluate. Every adapter must populate this field, even if the value is an empty string.
tool_calls (optional, default: [])
A list of ToolCall objects representing the tools (functions) the agent invoked during the scenario. Each ToolCall has:
| Field | Type | Description |
|---|---|---|
| `name` | `str` | The name of the tool that was called. |
| `arguments` | `dict[str, Any]` | The arguments passed to the tool. |
| `result` | `Any \| None` | The value returned by the tool, if captured. |
Tool calls are used by the passfail scorer to verify expected_tools and expected_tool_args.
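As an illustration, the containment check described above can be sketched in plain Python. The `ToolCall` stand-in mirrors the table; the passfail scorer's actual implementation may differ:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)
    result: Any = None

def tools_were_called(tool_calls: list[ToolCall], expected_tools: list[str]) -> bool:
    """Check that every expected tool name appears among the recorded calls."""
    called = {tc.name for tc in tool_calls}
    return all(name in called for name in expected_tools)

calls = [ToolCall("search", {"query": "zip code"}), ToolCall("lookup", {"id": 42})]
print(tools_were_called(calls, ["search"]))           # True
print(tools_were_called(calls, ["search", "email"]))  # False
```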
reasoning_trace (optional, default: None)
A list of strings representing the agent’s chain of thought or intermediate reasoning steps. This is informational — no built-in scorer evaluates it, but it is useful for debugging and for custom scorers that assess reasoning quality.
latency_ms (optional, default: 0.0)
The time in milliseconds that the agent took to produce the response. When your adapter measures this, it appears in reports for performance benchmarking.
token_usage (optional, default: None)
A TokenUsage object tracking the number of tokens consumed:
| Field | Type | Description |
|---|---|---|
| `prompt_tokens` | `int` | Number of tokens in the prompt sent to the model. |
| `completion_tokens` | `int` | Number of tokens in the model’s response. |
| `total_tokens` | `int` | Computed property: `prompt_tokens + completion_tokens`. |
Token usage data is captured in reports and is useful for tracking costs across test runs.
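For example, a per-run cost estimate can be derived from these counts. The per-million-token prices below are placeholders, not real provider rates:

```python
# Hypothetical per-million-token prices; substitute your provider's actual rates.
PROMPT_PRICE_PER_M = 3.00
COMPLETION_PRICE_PER_M = 15.00

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of a single agent call from its token counts."""
    return (prompt_tokens * PROMPT_PRICE_PER_M
            + completion_tokens * COMPLETION_PRICE_PER_M) / 1_000_000

print(f"${estimate_cost(1200, 300):.6f}")  # $0.008100
```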
raw_response (optional, default: None)
An arbitrary field for storing the raw, unprocessed response from the underlying API or framework. No scorer reads this field. It exists purely for debugging and post-hoc analysis.
ScoreResult
ScoreResult is what a scorer produces after evaluating an agent’s response against a scenario’s expectations.
```python
class ScoreResult(BaseModel):
    scenario_id: str
    scorer_name: str
    score: float  # between 0.0 and 1.0 inclusive
    passed: bool
    explanation: str = ""
    details: dict[str, Any] | None = None
```

Fields
scenario_id
The ID of the scenario this score applies to. Automatically linked to Scenario.id.
scorer_name
The name of the scorer that produced this result (e.g., "passfail", "exact", "regex").
score
A numeric score between 0.0 and 1.0 inclusive. The interpretation depends on the scorer:
- passfail, exact, regex — Binary: `1.0` for pass, `0.0` for fail.
- similarity — Continuous: the cosine similarity between the expected and actual outputs, clamped to the `[0.0, 1.0]` range.
This value is enforced by a Pydantic validator. Values outside the range will raise a validation error.
passed
A boolean indicating whether the scorer considers this scenario to have passed. For binary scorers, this directly corresponds to score == 1.0. For continuous scorers like similarity, it indicates whether the score met the configured threshold.
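A sketch of the continuous case, assuming the clamping and threshold behavior described above (not the library's actual code):

```python
def make_similarity_score(cosine: float, threshold: float = 0.7) -> tuple[float, bool]:
    """Clamp a raw cosine similarity into [0.0, 1.0] and compare to a threshold."""
    score = min(max(cosine, 0.0), 1.0)
    passed = score >= threshold
    return score, passed

print(make_similarity_score(0.834))  # (0.834, True)
print(make_similarity_score(0.42))   # (0.42, False)
```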
explanation
A human-readable string describing what the scorer checked and why it passed or failed. Examples:
"Agent produced output; All expected tools were called; Expected output found in response""Output does not match expected output (normalized=True)""Pattern '\\d{5}' matched in output""Cosine similarity: 0.834 (threshold: 0.7)"
This field appears in CLI and HTML reports.
details (optional)
An open dictionary for scorer-specific structured data. For example, the regex scorer includes the matched pattern and substring:
{"pattern": "\\d{5}", "match": "90210"}The similarity scorer includes the raw cosine similarity and threshold:
{"cosine_similarity": 0.834, "threshold": 0.7}ScenarioRunResult
ScenarioRunResult wraps the complete outcome of running a single scenario, including the agent’s response, all scorer results, and execution metadata.
```python
class ScenarioRunResult(BaseModel):
    scenario: Scenario
    agent_result: AgentResult
    scores: list[ScoreResult] = []
    passed: bool = False
    duration_ms: float = 0.0
    error: str | None = None
    retry_count: int = 0
```

Fields
scenario
The full Scenario object that was executed. This provides access to the scenario’s name, input, expected values, tags, and configuration.
agent_result
The AgentResult returned by the adapter. If the scenario failed due to an exception (after all retries were exhausted), this will be an AgentResult with an empty output string.
scores
A list of ScoreResult objects, one for each scorer applied to the scenario. If the scenario errored out before scoring could occur, this list will be empty.
passed
A boolean indicating whether the scenario passed. This is True only if every scorer in the scores list has passed == True. If the scores list is empty (e.g., due to an error), passed is False.
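One way to express this rule in plain Python. Note that a bare `all()` returns True for an empty list, so the empty case needs explicit handling to match the documented behavior:

```python
def scenario_passed(scores: list[bool]) -> bool:
    """True only when there is at least one score and every score passed.

    all([]) is True in Python, so the empty list (an errored scenario)
    must be rejected explicitly.
    """
    return bool(scores) and all(scores)

print(scenario_passed([True, True]))   # True
print(scenario_passed([True, False]))  # False
print(scenario_passed([]))             # False (errored scenario: no scores)
```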
duration_ms
Wall-clock time in milliseconds for the scenario execution, including adapter invocation and scoring. This measures the time from the start of the adapter call to the completion of all scorers.
error
If the scenario failed due to an exception (rather than a scorer failure), this field contains the error message in the format "ExceptionType: message". When this field is not None, the scenario’s passed is False and the scores list is typically empty.
retry_count
The zero-based index of the attempt that produced this result. If the scenario succeeded on the first try, this is 0. If it required one retry, this is 1. The maximum value is equal to the configured retries count.
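The numbering can be illustrated with a small retry loop (a sketch of the semantics, not the runner's implementation):

```python
def run_with_retries(call, retries: int):
    """Return (result, retry_count); retry_count is the zero-based attempt index."""
    last_exc = None
    for attempt in range(retries + 1):  # retries + 1 total attempts
        try:
            return call(), attempt
        except Exception as exc:
            last_exc = exc
    raise last_exc

# Simulate an agent that fails once, then succeeds.
attempts = iter([RuntimeError("timeout"), "ok"])
def flaky():
    item = next(attempts)
    if isinstance(item, Exception):
        raise item
    return item

result, retry_count = run_with_retries(flaky, retries=3)
print(result, retry_count)  # ok 1
```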
RunResult
RunResult is the top-level result object for an entire suite execution. It aggregates all scenario results and computes summary statistics.
```python
class RunResult(BaseModel):
    run_id: str
    timestamp: datetime
    suite_name: str
    scenario_results: list[ScenarioRunResult] = []
    aggregate_score: float = 0.0
    pass_rate: float = 0.0
    total_duration_ms: float = 0.0
    model_info: dict[str, Any] | None = None
```

Fields
run_id
A UUID automatically generated for each run. Used to uniquely identify the run in reports and file names.
timestamp
The UTC datetime when the run was created.
suite_name
The name of the suite that was executed.
scenario_results
A list of ScenarioRunResult objects, one for each scenario that was executed. If fail_fast was enabled and a scenario failed, the list will contain only the scenarios that were executed up to and including the failed one.
aggregate_score
The mean score across all scenarios. For each scenario, the scorer scores are averaged first (the mean of all ScoreResult.score values for that scenario), and then those per-scenario averages are averaged across all scenarios.
Formula:
```python
aggregate_score = mean(
    mean(score.score for score in scenario.scores)
    for scenario in scenario_results
)
```

Scenarios with no scores (e.g., those that errored) contribute 0.0 to the average.
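A plain-Python sketch of this computation, including the zero contribution from unscored scenarios:

```python
def compute_aggregate_score(per_scenario_scores: list[list[float]]) -> float:
    """Mean of per-scenario mean scores; empty score lists count as 0.0."""
    if not per_scenario_scores:
        return 0.0
    per_scenario = [
        sum(scores) / len(scores) if scores else 0.0
        for scores in per_scenario_scores
    ]
    return sum(per_scenario) / len(per_scenario)

# Two scored scenarios and one that errored (no scores):
print(round(compute_aggregate_score([[1.0, 1.0], [1.0, 0.0], []]), 3))  # 0.5
```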
pass_rate
The fraction of scenarios that passed, as a float between 0.0 and 1.0.
```
pass_rate = count(passed scenarios) / count(total scenarios)
```

total_duration_ms
The wall-clock time for the entire suite run in milliseconds. This is measured by the runner from the start of the first scenario to the completion of the last.
model_info (optional)
An open dictionary for storing model-related metadata (model name, version, provider, etc.). Not populated by default; available for custom use.
How Aggregates Are Computed
The RunResult.compute_aggregates() method is called automatically by the runner after all scenarios have been executed. It performs the following calculations:
1. Per-scenario average score — For each `ScenarioRunResult`, the scores from all scorers are averaged. If a scenario has three scorers with scores `[1.0, 1.0, 0.0]`, its average is `0.667`.
2. Aggregate score — The mean of all per-scenario average scores across the suite.
3. Pass rate — The number of fully passing scenarios (where `passed == True`) divided by the total number of scenarios.
4. Total duration — The sum of `duration_ms` values across all scenario results.
Note that pass_rate and aggregate_score can diverge. A suite where half the scenarios pass fully and half fail on one of three scorers might have a pass_rate of 0.5 but an aggregate_score of 0.833.
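The figures in that example can be verified directly, assuming four scenarios with three scorers each:

```python
# Two scenarios pass all three scorers; two fail one of three.
scenario_scores = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
per_scenario = [sum(s) / len(s) for s in scenario_scores]
aggregate_score = sum(per_scenario) / len(per_scenario)
# A scenario passes only if every scorer gave 1.0.
pass_rate = sum(all(x == 1.0 for x in s) for s in scenario_scores) / len(scenario_scores)
print(round(aggregate_score, 3), pass_rate)  # 0.833 0.5
```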
Using Results Programmatically
When using AgenticAssure from Python (rather than the CLI), you have direct access to the result objects.
Inspecting a RunResult
```python
from agenticassure.runner import Runner

runner = Runner(adapter=my_adapter)
result = runner.run_suite(my_suite)

print(f"Suite: {result.suite_name}")
print(f"Pass rate: {result.pass_rate:.0%}")
print(f"Aggregate score: {result.aggregate_score:.3f}")
print(f"Duration: {result.total_duration_ms:.0f}ms")

for sr in result.scenario_results:
    status = "PASS" if sr.passed else "FAIL"
    print(f"  [{status}] {sr.scenario.name} ({sr.duration_ms:.0f}ms)")
    if sr.error:
        print(f"    Error: {sr.error}")
    for score in sr.scores:
        print(f"    {score.scorer_name}: {score.score:.2f} - {score.explanation}")
```

Extracting Specific Data
```python
# Get all failed scenarios
failed = [sr for sr in result.scenario_results if not sr.passed]

# Get total token usage across all scenarios
total_tokens = sum(
    sr.agent_result.token_usage.total_tokens
    for sr in result.scenario_results
    if sr.agent_result.token_usage is not None
)

# Get all tool calls made across the suite
all_tool_calls = [
    tc
    for sr in result.scenario_results
    for tc in sr.agent_result.tool_calls
]

# Filter results by tag
regression_results = [
    sr for sr in result.scenario_results
    if "regression" in sr.scenario.tags
]
```

Serializing Results
All result models are Pydantic BaseModel instances, so they support standard Pydantic serialization:
```python
# To a Python dictionary
data = result.model_dump()

# To JSON
json_str = result.model_dump_json(indent=2)
```

This is useful for storing results in databases, sending them to monitoring systems, or building custom report formats.