# Custom Scorers
AgenticAssure’s scoring system is extensible. You can write your own scorer to implement any evaluation logic, register it with the scorer registry, and reference it by name in your YAML scenario files.
## The Scorer Protocol

A scorer is any Python class that satisfies the `Scorer` protocol:

```python
from agenticassure.models import Scenario, AgentResult, ScoreResult

class MyScorer:
    name: str

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        ...
```

The protocol requires two things:

- A `name` attribute (a string). This is the name used to reference the scorer in YAML files.
- A `score` method that takes a `Scenario` and an `AgentResult` and returns a `ScoreResult`.

You do not need to inherit from a base class or use any decorator. Any object with a `name` string and a `score` method with the correct signature will work.
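Because the protocol is structural, even a tiny standalone class qualifies. Here is a minimal sketch; the dataclasses below are stand-ins that only mirror the shapes of `agenticassure.models` so the snippet runs on its own, and `ContainsHelloScorer` is a hypothetical name, not a built-in:

```python
# Stand-ins for agenticassure.models, so this sketch is self-contained.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    input: str

@dataclass
class AgentResult:
    output: str

@dataclass
class ScoreResult:
    score: float
    passed: bool
    details: str

class ContainsHelloScorer:
    # No base class, no decorator: just a `name` and a `score` method.
    name = "contains_hello"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        ok = "hello" in result.output.lower()
        return ScoreResult(
            score=1.0 if ok else 0.0,
            passed=ok,
            details=f"Contains 'hello': {'PASS' if ok else 'FAIL'}.",
        )
```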
## ScoreResult

The `score` method must return a `ScoreResult` with three fields:

```python
from agenticassure.models import ScoreResult

ScoreResult(
    score=0.85,   # float, typically 0.0 to 1.0
    passed=True,  # bool, whether the scenario passed
    details="..." # str, human-readable explanation
)
```

- `score`: A numeric score. By convention, 0.0 means complete failure and 1.0 means complete success, but you can use any scale.
- `passed`: A boolean indicating whether the scenario passed. This is what determines pass/fail in reports.
- `details`: A string explaining the scoring result. This appears in CLI, HTML, and JSON reports.
## Registering Your Scorer

After defining your scorer class, register it with the global scorer registry:

```python
from agenticassure.scorers.base import register_scorer

my_scorer = MyScorer()
register_scorer(my_scorer)
```

Once registered, you can reference it by name in your YAML files:

```yaml
suite:
  name: my-tests
  adapter: my_project.adapters.MyAdapter
  scorer: my_custom_scorer
```

The `scorer` value in YAML must match the `name` attribute of your registered scorer exactly.
### When to register
Registration must happen before the runner loads and executes your suite. The simplest approach is to register in the same module where you define your adapter, or in a setup module that runs before the test suite.
If you are using the CLI, you can register scorers in the adapter module that gets imported when the suite is loaded. Since the adapter’s module path is specified in the YAML file, any module-level code in that module (or its imports) will execute before scoring begins.
## Full Working Example
Here is a complete custom scorer that checks whether the agent’s output is valid JSON:
"""json_scorer.py -- A custom scorer that validates JSON output."""
import json
from agenticassure.models import Scenario, AgentResult, ScoreResult
from agenticassure.scorers.base import register_scorer
class JsonScorer:
"""Scores agent output based on whether it is valid JSON."""
name = "json"
def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
if not result.output or not result.output.strip():
return ScoreResult(
score=0.0,
passed=False,
details="JSON: FAIL. Agent produced no output.",
)
try:
parsed = json.loads(result.output)
except json.JSONDecodeError as e:
return ScoreResult(
score=0.0,
passed=False,
details=f"JSON: FAIL. Invalid JSON: {e}.",
)
# Optionally check for required keys from metadata
required_keys = (scenario.metadata or {}).get("required_keys", [])
if required_keys and isinstance(parsed, dict):
missing = [k for k in required_keys if k not in parsed]
if missing:
return ScoreResult(
score=0.5,
passed=False,
details=f"JSON: FAIL. Valid JSON but missing keys: {missing}.",
)
return ScoreResult(
score=1.0,
passed=True,
details="JSON: PASS. Output is valid JSON.",
)
# Register on import
register_scorer(JsonScorer())Use it in a scenario file:
```yaml
suite:
  name: api-agent-tests
  adapter: my_project.adapters.APIAgent
  scorer: json

scenarios:
  - name: basic-json-output
    input: "Return the user profile as JSON"

  - name: json-with-required-keys
    input: "Return the order details as JSON"
    metadata:
      required_keys:
        - order_id
        - status
        - total
```

## Using Metadata for Custom Configuration
The `metadata` field on a scenario is a free-form dictionary (`dict[str, Any]`). Custom scorers can read any keys from it to drive their behavior. This is the standard way to pass scorer-specific configuration without modifying the core data model.
Examples of metadata usage:
```yaml
scenarios:
  - name: response-length-check
    input: "Summarize this article"
    metadata:
      min_words: 50
      max_words: 200
```

```python
class WordCountScorer:
    name = "word_count"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        word_count = len(result.output.split())
        min_words = (scenario.metadata or {}).get("min_words", 0)
        max_words = (scenario.metadata or {}).get("max_words", float("inf"))
        passed = min_words <= word_count <= max_words
        return ScoreResult(
            score=1.0 if passed else 0.0,
            passed=passed,
            details=(
                f"Word count: {word_count} "
                f"(range: {min_words}-{max_words}). "
                f"{'PASS' if passed else 'FAIL'}."
            ),
        )
```

Metadata keys are entirely up to you. Document them clearly so that anyone writing scenarios for your scorer knows what options are available.
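One lightweight convention for that documentation (a suggestion, not an AgenticAssure requirement) is to list the supported metadata keys in the scorer's docstring, where scenario authors can find them with `help()`:

```python
class WordCountScorer:
    """Scores output length by word count.

    Supported scenario metadata keys:
        min_words (int): minimum acceptable word count (default 0).
        max_words (int): maximum acceptable word count (default unlimited).
    """

    name = "word_count"

    def score(self, scenario, result):
        word_count = len(result.output.split())
        min_words = (scenario.metadata or {}).get("min_words", 0)
        max_words = (scenario.metadata or {}).get("max_words", float("inf"))
        ...  # scoring logic as shown earlier
```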
## Accessing Scenario Fields

Your scorer has full access to the `Scenario` object, which includes:

| Field | Type | Description |
|---|---|---|
| `scenario.name` | `str` | Scenario name |
| `scenario.input` | `str` | The input sent to the agent |
| `scenario.expected_output` | `str \| None` | The expected output, if defined |
| `scenario.expected_tools` | `list[str] \| None` | Expected tool names, if defined |
| `scenario.expected_tool_args` | `dict \| None` | Expected tool arguments, if defined |
| `scenario.metadata` | `dict[str, Any] \| None` | Free-form metadata |
| `scenario.tags` | `list[str] \| None` | Tags for filtering |
And the `AgentResult` object:

| Field | Type | Description |
|---|---|---|
| `result.output` | `str` | The agent's output text |
| `result.tools_called` | `list[str] \| None` | Tools the agent invoked |
| `result.tool_args` | `dict \| None` | Arguments passed to each tool |
| `result.duration` | `float \| None` | Execution time in seconds |
| `result.error` | `str \| None` | Error message, if the agent failed |
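As an illustration of reading these fields, here is a sketch of a scorer that compares `result.tools_called` against `scenario.expected_tools`. The dataclasses are stand-ins mirroring the tables above so the snippet runs standalone, and `ToolCallScorer` is a hypothetical name:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Stand-ins for agenticassure.models, matching the field tables above.
@dataclass
class Scenario:
    name: str
    input: str
    expected_tools: list[str] | None = None
    metadata: dict[str, Any] | None = None

@dataclass
class AgentResult:
    output: str
    tools_called: list[str] | None = None

@dataclass
class ScoreResult:
    score: float
    passed: bool
    details: str

class ToolCallScorer:
    name = "tool_calls"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        expected = scenario.expected_tools or []
        called = result.tools_called or []
        # Every expected tool must appear among the calls the agent made.
        missing = [t for t in expected if t not in called]
        if missing:
            return ScoreResult(0.0, False, f"Tools: FAIL. Missing calls: {missing}.")
        return ScoreResult(1.0, True, f"Tools: PASS. All {len(expected)} expected tools called.")
```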
## Tips for Testing Your Scorer

### Unit test with synthetic data
You do not need a running agent to test a scorer. Create `Scenario` and `AgentResult` objects directly:

```python
from agenticassure.models import Scenario, AgentResult

scenario = Scenario(
    name="test-case",
    input="test input",
    expected_output="expected",
    metadata={"my_key": "my_value"},
)
result = AgentResult(
    output="actual agent output",
    tools_called=["tool_a"],
)

scorer = MyScorer()
score_result = scorer.score(scenario, result)
assert score_result.passed is True
assert score_result.score == 1.0
```

### Test edge cases
Make sure your scorer handles these gracefully:
- Empty output: `result.output` is `""` or `None`.
- Missing metadata: `scenario.metadata` is `None` or does not contain your expected keys.
- Missing expected output: `scenario.expected_output` is `None` (if your scorer needs it).
- Very long output: the agent returns a huge response.
- Unicode content: the output contains non-ASCII characters.
- Error in the agent result: `result.error` is set, indicating the agent itself failed.
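For instance, edge-case tests for the `JsonScorer` from the full example can run without the framework at all. This sketch reproduces its core logic with minimal stand-in models (the stand-in dataclasses are assumptions for this example, not the real `agenticassure.models` types):

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any

# Stand-ins for agenticassure.models.
@dataclass
class Scenario:
    name: str
    input: str
    metadata: dict[str, Any] | None = None

@dataclass
class AgentResult:
    output: str | None = None
    error: str | None = None

@dataclass
class ScoreResult:
    score: float
    passed: bool
    details: str

class JsonScorer:
    """Same core logic as the full example above."""

    name = "json"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        if not result.output or not result.output.strip():
            return ScoreResult(0.0, False, "JSON: FAIL. Agent produced no output.")
        try:
            json.loads(result.output)
        except json.JSONDecodeError as e:
            return ScoreResult(0.0, False, f"JSON: FAIL. Invalid JSON: {e}.")
        return ScoreResult(1.0, True, "JSON: PASS. Output is valid JSON.")

scorer = JsonScorer()
base = Scenario(name="edge", input="x")

# Empty and None output must not raise.
assert scorer.score(base, AgentResult(output="")).passed is False
assert scorer.score(base, AgentResult(output=None)).passed is False
# Unicode content parses fine.
assert scorer.score(base, AgentResult(output='{"name": "café"}')).passed is True
# Missing metadata: the scorer must tolerate metadata=None.
assert scorer.score(Scenario(name="n", input="x", metadata=None),
                    AgentResult(output="{}")).passed is True
```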
### Validate the details message

The `details` string is what users see in reports. Make it clear and actionable. Always indicate whether the check passed or failed, and include enough context to understand why.
## Advanced: Combining Multiple Checks
A single scorer can perform multiple checks and aggregate them. Here is an example that checks both structure and content:
```python
class ComprehensiveScorer:
    name = "comprehensive"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        checks = []
        all_passed = True

        # Check 1: Non-empty output
        if not result.output or not result.output.strip():
            checks.append("Output: FAIL (empty)")
            all_passed = False
        else:
            checks.append("Output: PASS")

        # Check 2: Contains required keyword
        keyword = (scenario.metadata or {}).get("required_keyword")
        if keyword:
            # Guard against result.output being None or empty.
            if result.output and keyword.lower() in result.output.lower():
                checks.append(f"Keyword '{keyword}': PASS")
            else:
                checks.append(f"Keyword '{keyword}': FAIL")
                all_passed = False

        # Check 3: Word count within range
        if result.output:
            word_count = len(result.output.split())
            max_words = (scenario.metadata or {}).get("max_words", 500)
            if word_count <= max_words:
                checks.append(f"Length ({word_count} words): PASS")
            else:
                checks.append(f"Length ({word_count} words): FAIL (max {max_words})")
                all_passed = False

        return ScoreResult(
            score=1.0 if all_passed else 0.0,
            passed=all_passed,
            details=". ".join(checks) + ".",
        )
```

## Scorer Naming Conventions
- Use lowercase, short, descriptive names: `json`, `word_count`, `tone`, `safety`.
- Avoid collisions with built-in scorer names: `passfail`, `exact`, `regex`, `similarity`.
- The name must be a valid string and is used as a lookup key in the registry. Keep it simple.
## See Also
- Scorers Overview — how the scoring system works
- PassFail Scorer — the default built-in scorer
- Exact Match Scorer — strict string equality
- Regex Scorer — pattern-based matching
- Similarity Scorer — semantic comparison
- API Reference: Scorer Registry — `register_scorer`, `get_scorer`, `list_scorers`