
JSON Report

The JSON report outputs the complete test results as a structured JSON file. It contains every field available in the result models, making it the most detailed output format and the best choice for programmatic consumption, archival, and integration with external tools.

Generating a JSON Report

Use the --output json flag with the run command:

```shell
agenticassure run scenarios/ --adapter my_agent.MyAgent --output json
```

After the run completes, AgenticAssure writes the report and prints the filename:

JSON report written to results_a1b2c3d4-e5f6-7890-abcd-ef1234567890.json

File Naming

JSON reports are named using the pattern:

results_{run_id}.json

The run_id is a UUID generated for each run. Files are written to the current working directory.
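Because the run_id is random, scripts cannot predict the filename in advance. One option, sketched below, is to pick the most recently written results_*.json by modification time (load_latest_report is an illustrative helper, not part of AgenticAssure):

```python
import json
from pathlib import Path

# Illustrative helper: load the most recently written results_*.json file.
def load_latest_report(directory: str = ".") -> dict:
    candidates = sorted(
        Path(directory).glob("results_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError("no results_*.json files found")
    return json.loads(candidates[-1].read_text())
```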

JSON Structure

The JSON report is a serialization of the RunResult Pydantic model. Below is the full schema with descriptions of every field.

Top-Level Object

```json
{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "timestamp": "2026-03-10T14:30:00+00:00",
  "suite_name": "search-agent-tests",
  "scenario_results": [ ... ],
  "aggregate_score": 0.85,
  "pass_rate": 0.75,
  "total_duration_ms": 1245.7,
  "model_info": null
}
```

| Field | Type | Description |
| --- | --- | --- |
| run_id | string | UUID identifying this run |
| timestamp | string (ISO 8601) | UTC timestamp of when the run started |
| suite_name | string | Name of the test suite |
| scenario_results | array | List of ScenarioRunResult objects (see below) |
| aggregate_score | float | Average score across all scenarios (0.0 to 1.0) |
| pass_rate | float | Fraction of scenarios that passed (0.0 to 1.0) |
| total_duration_ms | float | Total wall-clock duration in milliseconds |
| model_info | object or null | Optional metadata about the model used |

ScenarioRunResult

Each element of scenario_results has this structure:

```json
{
  "scenario": { ... },
  "agent_result": { ... },
  "scores": [ ... ],
  "passed": true,
  "duration_ms": 245.3,
  "error": null,
  "retry_count": 0
}
```

| Field | Type | Description |
| --- | --- | --- |
| scenario | object | The original scenario definition (see below) |
| agent_result | object | The agent's response (see below) |
| scores | array | List of ScoreResult objects from each scorer |
| passed | boolean | Whether the scenario passed overall |
| duration_ms | float | Execution time in milliseconds |
| error | string or null | Error message if an exception occurred, otherwise null |
| retry_count | integer | Number of retries attempted |
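The error and retry_count fields make it easy to separate genuine failures from infrastructure problems. As a minimal sketch, the helper below (errored_scenarios is illustrative, not part of AgenticAssure) lists scenarios that raised an exception, together with how many retries were attempted:

```python
# Illustrative helper: list scenarios that errored, with their retry counts.
# `report` is assumed to be a parsed JSON report (a dict), as shown above.
def errored_scenarios(report: dict) -> list[tuple[str, str, int]]:
    return [
        (sr["scenario"]["name"], sr["error"], sr["retry_count"])
        for sr in report["scenario_results"]
        if sr["error"] is not None
    ]
```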

Scenario

The scenario object as defined in the YAML file:

```json
{
  "name": "weather_query",
  "input": "What is the weather in San Francisco?",
  "expected_output": null,
  "expected_tools": ["get_weather"],
  "expected_tool_args": {
    "get_weather": { "location": "San Francisco" }
  },
  "scorers": ["passfail"],
  "tags": ["tools", "weather"],
  "context": null,
  "metadata": null
}
```

AgentResult

The result returned by your adapter’s run() method:

```json
{
  "output": "The weather in San Francisco is 65F and sunny.",
  "tool_calls": [
    {
      "name": "get_weather",
      "arguments": { "location": "San Francisco" },
      "result": { "temp": 65, "condition": "sunny" }
    }
  ],
  "reasoning_trace": null,
  "latency_ms": 230.5,
  "token_usage": { "prompt_tokens": 150, "completion_tokens": 45 },
  "raw_response": null
}
```

| Field | Type | Description |
| --- | --- | --- |
| output | string | The agent's text response |
| tool_calls | array | List of tool calls made by the agent |
| reasoning_trace | array or null | Optional list of reasoning steps |
| latency_ms | float | Agent-reported latency in milliseconds |
| token_usage | object or null | Token usage breakdown (prompt and completion) |
| raw_response | any or null | Optional raw response from the underlying API |

ToolCall

Each entry in tool_calls:

| Field | Type | Description |
| --- | --- | --- |
| name | string | Name of the tool that was called |
| arguments | object | Arguments passed to the tool |
| result | any or null | The result returned by the tool |

ScoreResult

Each entry in scores:

```json
{
  "scenario_id": "weather_query",
  "scorer_name": "passfail",
  "score": 1.0,
  "passed": true,
  "explanation": "Output contains expected text; Tool 'get_weather' called correctly",
  "details": null
}
```

| Field | Type | Description |
| --- | --- | --- |
| scenario_id | string | Name of the scenario being scored |
| scorer_name | string | Name of the scorer that produced this result |
| score | float | Score from 0.0 to 1.0 |
| passed | boolean | Whether this scorer considers the scenario passed |
| explanation | string | Human-readable explanation of the scoring decision |
| details | object or null | Optional additional structured details |
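Because each scenario can carry results from several scorers, it is often useful to aggregate per scorer rather than per scenario. A minimal sketch (average_by_scorer is an illustrative helper, not part of AgenticAssure):

```python
# Illustrative helper: average the score of each scorer across all scenarios.
# `report` is assumed to be a parsed JSON report (a dict).
def average_by_scorer(report: dict) -> dict[str, float]:
    buckets: dict[str, list[float]] = {}
    for sr in report["scenario_results"]:
        for score in sr["scores"]:
            buckets.setdefault(score["scorer_name"], []).append(score["score"])
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}
```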

TokenUsage

| Field | Type | Description |
| --- | --- | --- |
| prompt_tokens | integer | Number of tokens in the prompt |
| completion_tokens | integer | Number of tokens in the completion |
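Since token_usage may be null when an adapter does not report it, code that sums usage across scenarios should handle the missing case. A minimal sketch (total_tokens is an illustrative helper, not part of AgenticAssure):

```python
# Illustrative helper: total tokens used across all scenarios,
# skipping scenarios whose adapter reported no token_usage.
def total_tokens(report: dict) -> int:
    total = 0
    for sr in report["scenario_results"]:
        usage = sr["agent_result"].get("token_usage")
        if usage is not None:
            total += usage["prompt_tokens"] + usage["completion_tokens"]
    return total
```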

Consuming the JSON Report Programmatically

Python

```python
import json
from pathlib import Path

report = json.loads(Path("results_a1b2c3d4.json").read_text())

print(f"Pass rate: {report['pass_rate']:.0%}")
print(f"Avg score: {report['aggregate_score']:.2f}")

for sr in report["scenario_results"]:
    status = "PASS" if sr["passed"] else "FAIL"
    print(f"  {sr['scenario']['name']}: {status}")
```

jq (command line)

```bash
# Get pass rate
jq '.pass_rate' results_a1b2c3d4.json

# List failed scenarios
jq '.scenario_results[] | select(.passed == false) | .scenario.name' results_a1b2c3d4.json

# Get total token usage across all scenarios
jq '[.scenario_results[].agent_result.token_usage
     // {prompt_tokens: 0, completion_tokens: 0}
     | .prompt_tokens + .completion_tokens] | add' results_a1b2c3d4.json

# Extract all tool calls
jq '[.scenario_results[].agent_result.tool_calls[].name] | unique' results_a1b2c3d4.json
```

JavaScript / Node.js

```javascript
const fs = require("fs");

const report = JSON.parse(fs.readFileSync("results_a1b2c3d4.json", "utf-8"));
const failed = report.scenario_results.filter((sr) => !sr.passed);
console.log(`${failed.length} scenario(s) failed`);
```

Integration with Other Tools

CI/CD Pipelines

Use the JSON report to make pass/fail decisions or extract metrics in CI:

```yaml
# GitHub Actions
- name: Run tests
  run: agenticassure run scenarios/ --adapter my_agent.MyAgent --output json

- name: Check results
  run: |
    PASS_RATE=$(jq '.pass_rate' results_*.json)
    echo "Pass rate: $PASS_RATE"
    if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
      echo "Pass rate below threshold"
      exit 1
    fi
```

Trend Tracking

Store JSON reports over time to build a history of agent performance. Each report contains a run_id and timestamp, making it straightforward to track metrics like pass rate, average score, and duration across builds.

```python
import glob
import json

reports = []
for path in sorted(glob.glob("results_*.json")):
    with open(path) as f:
        reports.append(json.load(f))

for r in reports:
    print(f"{r['timestamp']}: {r['pass_rate']:.0%} pass rate, "
          f"{r['aggregate_score']:.2f} avg score")
```

Custom Dashboards

Ingest the JSON into tools like Grafana, Datadog, or a custom web dashboard. The structured format maps directly to time-series metrics:

  • pass_rate and aggregate_score as gauge metrics.
  • total_duration_ms for performance tracking.
  • Per-scenario duration_ms for identifying slow scenarios.
  • token_usage for cost monitoring.
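Most metrics backends accept points as (metric name, value, timestamp) tuples, so one way to feed a dashboard is to flatten the report into that shape first. A minimal sketch, with illustrative metric names (report_metrics and the "agent.*" names are assumptions, not part of AgenticAssure); shipping the points to Grafana, Datadog, etc. is then a job for that backend's client library:

```python
# Illustrative helper: flatten a parsed report into (name, value, timestamp)
# metric points, including a per-scenario duration series.
def report_metrics(report: dict) -> list[tuple[str, float, str]]:
    ts = report["timestamp"]
    points = [
        ("agent.pass_rate", report["pass_rate"], ts),
        ("agent.aggregate_score", report["aggregate_score"], ts),
        ("agent.total_duration_ms", report["total_duration_ms"], ts),
    ]
    for sr in report["scenario_results"]:
        name = sr["scenario"]["name"]
        points.append((f"agent.scenario_duration_ms.{name}", sr["duration_ms"], ts))
    return points
```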

Comparison Between Runs

Compare two JSON reports to detect regressions:

```python
import json

with open("results_baseline.json") as f:
    baseline = json.load(f)
with open("results_current.json") as f:
    current = json.load(f)

# Map scenario name -> passed flag for each run
baseline_passed = {sr["scenario"]["name"]: sr["passed"]
                   for sr in baseline["scenario_results"]}
current_passed = {sr["scenario"]["name"]: sr["passed"]
                  for sr in current["scenario_results"]}

for name, passed in current_passed.items():
    if name in baseline_passed and baseline_passed[name] and not passed:
        print(f"REGRESSION: {name} was passing, now failing")
```

Multi-Suite Runs

When running multiple suites, each suite generates its own JSON file:

```
JSON report written to results_a1b2c3d4.json
JSON report written to results_e5f67890.json
```
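To report a single pass rate across all suites, weight each suite by its scenario count rather than averaging the per-suite pass_rate values, which would overweight small suites. A minimal sketch (combined_pass_rate is an illustrative helper, not part of AgenticAssure):

```python
# Illustrative helper: overall pass rate across several parsed reports,
# weighted by the number of scenarios in each suite.
def combined_pass_rate(reports: list[dict]) -> float:
    passed = sum(
        sum(1 for sr in r["scenario_results"] if sr["passed"]) for r in reports
    )
    total = sum(len(r["scenario_results"]) for r in reports)
    return passed / total if total else 0.0
```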
