# Scorers
Scorers are the evaluation layer of AgenticAssure. After your agent produces a response, one or more scorers examine that response against the scenario’s expectations and determine whether the scenario passes or fails.
## How Scorers Work
Each scorer receives two inputs:
- The Scenario — which contains the expected output, expected tools, metadata, and other expectations.
- The AgentResult — which contains the agent’s actual output, tool calls, token usage, and other response data.
The scorer compares these two and returns a ScoreResult containing a numeric score (0.0 to 1.0), a boolean pass/fail verdict, and an explanation string.
Scenario + AgentResult --> Scorer --> ScoreResult

## The Scorer Protocol
All scorers implement the Scorer protocol:
```python
from typing import Protocol

from agenticassure.results import AgentResult, ScoreResult
from agenticassure.scenario import Scenario

class Scorer(Protocol):
    name: str

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        ...
```

A scorer must have:

- A `name` attribute (a string used to reference it in YAML and the registry).
- A `score` method that accepts a `Scenario` and an `AgentResult` and returns a `ScoreResult`.
Your class does not need to inherit from the Scorer protocol. It only needs to have the correct attributes and method signature (structural subtyping).
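Structural subtyping can be illustrated with a small, self-contained example using `typing.runtime_checkable`. The `ScorerLike` and `WordCountScorer` names here are illustrative stand-ins, not part of AgenticAssure's API:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ScorerLike(Protocol):
    name: str
    def score(self, scenario, result): ...

class WordCountScorer:  # note: no base class
    name = "word_count"

    def score(self, scenario, result):
        return len(result.split())

# The class satisfies the protocol purely by having the right shape.
print(isinstance(WordCountScorer(), ScorerLike))  # True
```

Any object with a matching `name` attribute and `score` method is accepted, whether or not it ever imports the protocol.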
## The Scorer Registry
Scorers are stored in a global registry keyed by their name attribute. The registry provides three functions:
### register_scorer(scorer)
Registers a scorer instance. If a scorer with the same name already exists, it is silently replaced. This allows you to override built-in scorers with custom implementations.
```python
from agenticassure.scorers.base import register_scorer

register_scorer(MyCustomScorer())
```

### get_scorer(name) -> Scorer
Looks up a scorer by name. Raises KeyError if no scorer with that name has been registered.
```python
from agenticassure.scorers.base import get_scorer

scorer = get_scorer("passfail")
```

### list_scorers() -> list[str]
Returns the names of all registered scorers in registration order.
```python
from agenticassure.scorers.base import list_scorers

print(list_scorers())  # e.g., ["passfail", "exact", "regex", "similarity"]
```

Built-in scorers are automatically registered when AgenticAssure is imported.
## Referencing Scorers in Scenarios
Scenarios specify which scorers to apply through the scorers field in YAML:
```yaml
scenarios:
  - name: my_test
    input: "What is 2 + 2?"
    expected_output: "4"
    scorers:
      - passfail
      - exact
```

If the `scorers` field is omitted, the default is `["passfail"]`.
The runner resolves each scorer name through get_scorer() at execution time. If a name does not match any registered scorer, a KeyError is raised with a message listing all available scorers.
## Using Multiple Scorers
You can apply any number of scorers to a single scenario. Each scorer runs independently and produces its own ScoreResult.
```yaml
scenarios:
  - name: structured_response
    input: "Generate a JSON object with a 'name' field."
    expected_output: '{"name":'
    metadata:
      regex_pattern: '"name"\s*:\s*"[^"]+"'
    scorers:
      - passfail
      - regex
```

## Pass/Fail Determination
A scenario passes only if every scorer passes. If any single scorer fails, the entire scenario is marked as failed. This AND-logic lets you layer progressively stricter checks:
- `passfail` — Did the agent produce output and call the right tools?
- `exact` or `similarity` — Was the content of the output correct?
- `regex` — Does the output match a required structural pattern?
Each scorer’s individual result (score, pass/fail, explanation) is preserved in the ScenarioRunResult.scores list, so you can see exactly which scorer caused a failure.
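The AND-logic reduces to a single `all()` over the per-scorer results. A minimal sketch, using a simplified stand-in for `ScoreResult` rather than the real class:

```python
from dataclasses import dataclass

@dataclass
class ScoreResult:  # simplified stand-in for agenticassure.results.ScoreResult
    scorer_name: str
    score: float
    passed: bool
    explanation: str = ""

def scenario_passed(scores: list[ScoreResult]) -> bool:
    # A scenario passes only if every scorer passed.
    return all(s.passed for s in scores)

results = [
    ScoreResult("passfail", 1.0, True),
    ScoreResult("regex", 0.0, False, "pattern not found"),
]
print(scenario_passed(results))  # False
print([s.scorer_name for s in results if not s.passed])  # ['regex']
```

Because each scorer's result is preserved, the failing scorer (here `regex`) remains visible even though the scenario-level verdict is a single boolean.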
## Built-In Scorers
AgenticAssure ships with four built-in scorers.
### passfail
The default scorer. Performs multiple checks:
- Non-empty output — The agent must produce a non-empty response.
- Expected tools — If `expected_tools` is set, every listed tool must appear in the agent's `tool_calls`.
- Expected tool arguments — If `expected_tool_args` is set, each tool must have been called with the specified argument key-value pairs (exact match per key; extra keys are ignored).
- Expected output — If `expected_output` is set, it must appear as a case-insensitive substring of the agent's output.
The score is 1.0 if all applicable checks pass, 0.0 otherwise.
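The checks above could be sketched roughly as follows. This is a simplified model operating on plain dicts; the real scorer's field names and messages may differ:

```python
def passfail_check(scenario: dict, result: dict) -> tuple[float, str]:
    # 1. Non-empty output.
    if not result.get("output", "").strip():
        return 0.0, "empty output"
    # 2. Every expected tool must appear among the actual tool calls.
    called = {c["name"] for c in result.get("tool_calls", [])}
    for tool in scenario.get("expected_tools", []):
        if tool not in called:
            return 0.0, f"missing tool call: {tool}"
    # 3. Expected tool args: exact match per key; extra keys are ignored.
    for tool, want in scenario.get("expected_tool_args", {}).items():
        calls = [c for c in result.get("tool_calls", []) if c["name"] == tool]
        if not any(all(c["args"].get(k) == v for k, v in want.items()) for c in calls):
            return 0.0, f"wrong args for {tool}"
    # 4. Expected output as a case-insensitive substring.
    want_out = scenario.get("expected_output")
    if want_out and want_out.lower() not in result["output"].lower():
        return 0.0, "expected output not found"
    return 1.0, "all checks passed"
```

Note that extra argument keys on a tool call do not cause a failure; only the keys listed in `expected_tool_args` are compared.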
```yaml
- name: tool_test
  input: "Look up order #999"
  expected_tools:
    - lookup_order
  expected_tool_args:
    lookup_order:
      order_id: "999"
  scorers:
    - passfail
```

### exact
Checks whether the agent’s output exactly matches expected_output.
By default, both strings are stripped of leading/trailing whitespace and lowercased before comparison (normalization). You can disable normalization by setting exact_normalize: false in the scenario’s metadata.
Requires expected_output to be set. If it is not, the scorer fails with an explanatory message.
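The normalization step amounts to stripping and lowercasing both sides before comparing. A minimal sketch of that logic (the `exact_match` helper is illustrative, not AgenticAssure's API):

```python
def exact_match(expected: str, actual: str, normalize: bool = True) -> bool:
    # With normalization (the default), compare stripped, lowercased strings.
    if normalize:
        return actual.strip().lower() == expected.strip().lower()
    return actual == expected

print(exact_match("Paris", "  paris\n"))               # True
print(exact_match("Paris", "paris", normalize=False))  # False
```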
```yaml
- name: exact_answer
  input: "What is the capital of France?"
  expected_output: "Paris"
  scorers:
    - exact
```

### regex
Matches the agent’s output against a regular expression pattern defined in the scenario’s metadata under the key regex_pattern.
The pattern is evaluated using Python’s re.search, so it matches anywhere in the output. The score is 1.0 if the pattern matches, 0.0 otherwise. The matched substring is included in the details field of the ScoreResult.
Requires metadata.regex_pattern to be set. If it is missing, the scorer fails with an explanatory message.
```yaml
- name: zip_code_check
  input: "What is the zip code for Beverly Hills?"
  metadata:
    regex_pattern: "\\d{5}"
  scorers:
    - regex
```

### similarity
Computes semantic similarity between the agent's output and `expected_output` using sentence-transformers embeddings. The score is the cosine similarity between the two embeddings, clamped to the range 0.0 to 1.0.
The default similarity threshold is 0.7. You can override it per-scenario by setting similarity_threshold in metadata.
The default embedding model is all-MiniLM-L6-v2. This scorer requires the sentence-transformers package, which is not installed by default:
```shell
pip install agenticassure[similarity]
```

If the package is not installed, the scorer will not be registered and referencing it by name will raise a KeyError.
Requires expected_output to be set.
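Cosine similarity on two embedding vectors reduces to a dot product over the product of their norms. A dependency-free sketch of the scoring math, with the clamping and threshold behavior described above (the real scorer obtains its vectors from sentence-transformers; `cosine_score` is an illustrative helper):

```python
import math

def cosine_score(a: list[float], b: list[float], threshold: float = 0.7):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [0.0, 1.0]: antiparallel vectors would otherwise score -1.
    score = max(0.0, min(1.0, dot / norm))
    return score, score >= threshold

print(cosine_score([1.0, 0.0], [1.0, 0.0]))  # (1.0, True)
print(cosine_score([1.0, 0.0], [0.0, 1.0]))  # (0.0, False)
```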
```yaml
- name: semantic_check
  input: "Explain photosynthesis"
  expected_output: "Photosynthesis is the process by which plants convert sunlight into energy."
  metadata:
    similarity_threshold: 0.75
  scorers:
    - similarity
```

## Custom Scorers
You can create and register your own scorers. A custom scorer is any class with a name attribute and a score method matching the protocol.
```python
from agenticassure.results import AgentResult, ScoreResult
from agenticassure.scenario import Scenario
from agenticassure.scorers.base import register_scorer

class LengthScorer:
    name = "length"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        max_length = scenario.metadata.get("max_length", 500)
        output_length = len(result.output)
        passed = output_length <= max_length
        return ScoreResult(
            scenario_id=scenario.id,
            scorer_name=self.name,
            score=1.0 if passed else 0.0,
            passed=passed,
            explanation=f"Output length: {output_length} (max: {max_length})",
        )

register_scorer(LengthScorer())
```

After registration, you can reference the scorer by name in your YAML scenarios:
```yaml
- name: concise_response
  input: "Summarize this article in one sentence."
  metadata:
    max_length: 200
  scorers:
    - passfail
    - length
```

For a complete guide on building, testing, and distributing custom scorers, see the Custom Scorers Guide.