JSON Report
The JSON report outputs the complete test results as a structured JSON file. It contains every field available in the result models, making it the most detailed output format and the best choice for programmatic consumption, archival, and integration with external tools.
Generating a JSON Report
Use the `--output json` flag with the `run` command:

```bash
agenticassure run scenarios/ --adapter my_agent.MyAgent --output json
```

After the run completes, AgenticAssure writes the report and prints the filename:
```
JSON report written to results_a1b2c3d4-e5f6-7890-abcd-ef1234567890.json
```

File Naming
JSON reports are named using the pattern:

```
results_{run_id}.json
```

The `run_id` is a UUID generated for each run. Files are written to the current working directory.
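Because the run ID is embedded in the filename, it can be recovered without opening the file. A minimal sketch (the helper name is illustrative, not part of AgenticAssure):

```python
from pathlib import Path

def run_id_from_filename(path: str) -> str:
    """Recover the run_id UUID from a results filename (illustrative helper)."""
    stem = Path(path).stem  # "results_<run_id>"
    return stem.removeprefix("results_")

print(run_id_from_filename("results_a1b2c3d4-e5f6-7890-abcd-ef1234567890.json"))
# a1b2c3d4-e5f6-7890-abcd-ef1234567890
```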
JSON Structure
The JSON report is a serialization of the RunResult Pydantic model. Below is the full schema with descriptions of every field.
Top-Level Object
```json
{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "timestamp": "2026-03-10T14:30:00+00:00",
  "suite_name": "search-agent-tests",
  "scenario_results": [ ... ],
  "aggregate_score": 0.85,
  "pass_rate": 0.75,
  "total_duration_ms": 1245.7,
  "model_info": null
}
```

| Field | Type | Description |
|---|---|---|
| `run_id` | string | UUID identifying this run |
| `timestamp` | string (ISO 8601) | UTC timestamp of when the run started |
| `suite_name` | string | Name of the test suite |
| `scenario_results` | array | List of `ScenarioRunResult` objects (see below) |
| `aggregate_score` | float | Average score across all scenarios (0.0 to 1.0) |
| `pass_rate` | float | Fraction of scenarios that passed (0.0 to 1.0) |
| `total_duration_ms` | float | Total wall-clock duration in milliseconds |
| `model_info` | object or null | Optional metadata about the model used |
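Because `pass_rate` is simply the fraction of scenarios that passed, it can be recomputed from `scenario_results` as a sanity check when post-processing a report. A minimal sketch with inline sample data:

```python
def recompute_pass_rate(report: dict) -> float:
    """Recompute pass_rate as the fraction of scenarios with passed == true."""
    results = report["scenario_results"]
    if not results:
        return 0.0
    return sum(1 for sr in results if sr["passed"]) / len(results)

# Inline sample with four scenarios, three passing:
sample = {"scenario_results": [{"passed": True}, {"passed": True},
                               {"passed": False}, {"passed": True}]}
print(recompute_pass_rate(sample))  # 0.75
```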
ScenarioRunResult
Each element of scenario_results has this structure:
```json
{
  "scenario": { ... },
  "agent_result": { ... },
  "scores": [ ... ],
  "passed": true,
  "duration_ms": 245.3,
  "error": null,
  "retry_count": 0
}
```

| Field | Type | Description |
|---|---|---|
| `scenario` | object | The original scenario definition (see below) |
| `agent_result` | object | The agent’s response (see below) |
| `scores` | array | List of `ScoreResult` objects from each scorer |
| `passed` | boolean | Whether the scenario passed overall |
| `duration_ms` | float | Execution time in milliseconds |
| `error` | string or null | Error message if an exception occurred, otherwise null |
| `retry_count` | integer | Number of retries attempted |
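The `error` and `retry_count` fields make it easy to separate hard failures from flaky scenarios when triaging a run. A small sketch (the helper and sample data are illustrative):

```python
def triage(scenario_results: list[dict]) -> dict[str, list[str]]:
    """Group scenario names by whether they errored or needed retries."""
    out = {"errored": [], "retried": []}
    for sr in scenario_results:
        name = sr["scenario"]["name"]
        if sr["error"] is not None:
            out["errored"].append(name)
        if sr["retry_count"] > 0:
            out["retried"].append(name)
    return out

sample = [
    {"scenario": {"name": "weather_query"}, "error": None, "retry_count": 0},
    {"scenario": {"name": "flaky_search"}, "error": None, "retry_count": 2},
    {"scenario": {"name": "timeout_case"}, "error": "TimeoutError", "retry_count": 3},
]
print(triage(sample))
# {'errored': ['timeout_case'], 'retried': ['flaky_search', 'timeout_case']}
```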
Scenario
The `scenario` object as defined in the YAML file:

```json
{
  "name": "weather_query",
  "input": "What is the weather in San Francisco?",
  "expected_output": null,
  "expected_tools": ["get_weather"],
  "expected_tool_args": {
    "get_weather": { "location": "San Francisco" }
  },
  "scorers": ["passfail"],
  "tags": ["tools", "weather"],
  "context": null,
  "metadata": null
}
```

AgentResult
The result returned by your adapter’s `run()` method:

```json
{
  "output": "The weather in San Francisco is 65F and sunny.",
  "tool_calls": [
    {
      "name": "get_weather",
      "arguments": { "location": "San Francisco" },
      "result": { "temp": 65, "condition": "sunny" }
    }
  ],
  "reasoning_trace": null,
  "latency_ms": 230.5,
  "token_usage": {
    "prompt_tokens": 150,
    "completion_tokens": 45
  },
  "raw_response": null
}
```

| Field | Type | Description |
|---|---|---|
| `output` | string | The agent’s text response |
| `tool_calls` | array | List of tool calls made by the agent |
| `reasoning_trace` | array or null | Optional list of reasoning steps |
| `latency_ms` | float | Agent-reported latency in milliseconds |
| `token_usage` | object or null | Token usage breakdown (prompt and completion) |
| `raw_response` | any or null | Optional raw response from the underlying API |
ToolCall
Each entry in tool_calls:
| Field | Type | Description |
|---|---|---|
| `name` | string | Name of the tool that was called |
| `arguments` | object | Arguments passed to the tool |
| `result` | any or null | The result returned by the tool |
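When asserting on tool usage in post-processing, it helps to pull out the arguments for a specific tool. A minimal sketch using the AgentResult example above (the helper name is illustrative):

```python
def tool_call_args(agent_result: dict, tool_name: str) -> list[dict]:
    """Return the arguments object for every call to tool_name."""
    return [tc["arguments"]
            for tc in agent_result["tool_calls"]
            if tc["name"] == tool_name]

agent_result = {
    "tool_calls": [
        {"name": "get_weather",
         "arguments": {"location": "San Francisco"},
         "result": {"temp": 65, "condition": "sunny"}},
    ],
}
print(tool_call_args(agent_result, "get_weather"))  # [{'location': 'San Francisco'}]
```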
ScoreResult
Each entry in scores:
```json
{
  "scenario_id": "weather_query",
  "scorer_name": "passfail",
  "score": 1.0,
  "passed": true,
  "explanation": "Output contains expected text; Tool 'get_weather' called correctly",
  "details": null
}
```

| Field | Type | Description |
|---|---|---|
| `scenario_id` | string | Name of the scenario being scored |
| `scorer_name` | string | Name of the scorer that produced this result |
| `score` | float | Score from 0.0 to 1.0 |
| `passed` | boolean | Whether this scorer considers the scenario passed |
| `explanation` | string | Human-readable explanation of the scoring decision |
| `details` | object or null | Optional additional structured details |
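Since every ScoreResult carries its `scorer_name`, per-scorer averages across a run fall out of a simple grouping. A sketch with inline sample data:

```python
from collections import defaultdict

def mean_score_by_scorer(scenario_results: list[dict]) -> dict[str, float]:
    """Average each scorer's score across all scenarios."""
    grouped = defaultdict(list)
    for sr in scenario_results:
        for s in sr["scores"]:
            grouped[s["scorer_name"]].append(s["score"])
    return {name: sum(v) / len(v) for name, v in grouped.items()}

sample = [
    {"scores": [{"scorer_name": "passfail", "score": 1.0}]},
    {"scores": [{"scorer_name": "passfail", "score": 0.0}]},
]
print(mean_score_by_scorer(sample))  # {'passfail': 0.5}
```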
TokenUsage
| Field | Type | Description |
|---|---|---|
| `prompt_tokens` | integer | Number of tokens in the prompt |
| `completion_tokens` | integer | Number of tokens in the completion |
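The jq example further down sums tokens on the command line; the same rollup in Python must guard against the nullable `token_usage` field. A minimal sketch:

```python
def total_tokens(report: dict) -> int:
    """Sum prompt and completion tokens across all scenarios; token_usage may be null."""
    total = 0
    for sr in report["scenario_results"]:
        usage = sr["agent_result"].get("token_usage")
        if usage is not None:
            total += usage["prompt_tokens"] + usage["completion_tokens"]
    return total

sample = {"scenario_results": [
    {"agent_result": {"token_usage": {"prompt_tokens": 150, "completion_tokens": 45}}},
    {"agent_result": {"token_usage": None}},
]}
print(total_tokens(sample))  # 195
```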
Consuming the JSON Report Programmatically
Python
```python
import json
from pathlib import Path

report = json.loads(Path("results_a1b2c3d4.json").read_text())

print(f"Pass rate: {report['pass_rate']:.0%}")
print(f"Avg score: {report['aggregate_score']:.2f}")

for sr in report["scenario_results"]:
    status = "PASS" if sr["passed"] else "FAIL"
    print(f"  {sr['scenario']['name']}: {status}")
```

jq (command line)
```bash
# Get pass rate
jq '.pass_rate' results_a1b2c3d4.json

# List failed scenarios
jq '.scenario_results[] | select(.passed == false) | .scenario.name' results_a1b2c3d4.json

# Get total token usage across all scenarios
jq '[.scenario_results[].agent_result.token_usage // {prompt_tokens:0, completion_tokens:0} | .prompt_tokens + .completion_tokens] | add' results_a1b2c3d4.json

# Extract all tool calls
jq '[.scenario_results[].agent_result.tool_calls[].name] | unique' results_a1b2c3d4.json
```

JavaScript / Node.js
```javascript
const fs = require("fs");

const report = JSON.parse(fs.readFileSync("results_a1b2c3d4.json", "utf-8"));
const failed = report.scenario_results.filter((sr) => !sr.passed);

console.log(`${failed.length} scenario(s) failed`);
```

Integration with Other Tools
CI/CD Pipelines
Use the JSON report to make pass/fail decisions or extract metrics in CI:
```yaml
# GitHub Actions
- name: Run tests
  run: agenticassure run scenarios/ --adapter my_agent.MyAgent --output json

- name: Check results
  run: |
    PASS_RATE=$(jq '.pass_rate' results_*.json)
    echo "Pass rate: $PASS_RATE"
    if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
      echo "Pass rate below threshold"
      exit 1
    fi
```

Trend Tracking
Store JSON reports over time to build a history of agent performance. Each report contains a run_id and timestamp, making it straightforward to track metrics like pass rate, average score, and duration across builds.
```python
import json
import glob

reports = []
for path in sorted(glob.glob("results_*.json")):
    with open(path) as f:
        reports.append(json.load(f))

# Filenames contain UUIDs, so sort by the timestamp field, not by filename.
reports.sort(key=lambda r: r["timestamp"])

for r in reports:
    print(f"{r['timestamp']}: {r['pass_rate']:.0%} pass rate, {r['aggregate_score']:.2f} avg score")
```

Custom Dashboards
Ingest the JSON into tools like Grafana, Datadog, or a custom web dashboard. The structured format maps directly to time-series metrics:

- `pass_rate` and `aggregate_score` as gauge metrics.
- `total_duration_ms` for performance tracking.
- Per-scenario `duration_ms` for identifying slow scenarios.
- `token_usage` for cost monitoring.
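One way to feed a dashboard is to flatten a report into (metric name, value) pairs before shipping them to a metrics backend. The metric names below are illustrative, not an AgenticAssure convention:

```python
def report_metrics(report: dict) -> list[tuple[str, float]]:
    """Flatten a run report into (metric_name, value) pairs."""
    metrics = [
        ("agent.pass_rate", report["pass_rate"]),
        ("agent.aggregate_score", report["aggregate_score"]),
        ("agent.total_duration_ms", report["total_duration_ms"]),
    ]
    for sr in report["scenario_results"]:
        name = sr["scenario"]["name"]
        metrics.append((f"agent.scenario.{name}.duration_ms", sr["duration_ms"]))
    return metrics

sample = {
    "pass_rate": 0.75, "aggregate_score": 0.85, "total_duration_ms": 1245.7,
    "scenario_results": [{"scenario": {"name": "weather_query"}, "duration_ms": 245.3}],
}
print(report_metrics(sample))
```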
Comparison Between Runs
Compare two JSON reports to detect regressions:
```python
import json

with open("results_baseline.json") as f:
    baseline = json.load(f)
with open("results_current.json") as f:
    current = json.load(f)

# Map scenario name -> passed for each run.
baseline_passed = {sr["scenario"]["name"]: sr["passed"] for sr in baseline["scenario_results"]}
current_passed = {sr["scenario"]["name"]: sr["passed"] for sr in current["scenario_results"]}

for name, passed in current_passed.items():
    if name in baseline_passed and baseline_passed[name] and not passed:
        print(f"REGRESSION: {name} was passing, now failing")
```

Multi-Suite Runs
When running multiple suites, each suite generates its own JSON file:
```
JSON report written to results_a1b2c3d4.json
JSON report written to results_e5f67890.json
```

What’s Next
- CLI Report — Terminal output for local development.
- HTML Report — Shareable HTML reports.
- agenticassure run — Full reference for the run command.
- CI/CD Integration — Using reports in CI pipelines.