
Scorers

Scorers are the evaluation layer of AgenticAssure. After your agent produces a response, one or more scorers examine that response against the scenario’s expectations and determine whether the scenario passes or fails.


How Scorers Work

Each scorer receives two inputs:

  1. The Scenario — which contains the expected output, expected tools, metadata, and other expectations.
  2. The AgentResult — which contains the agent’s actual output, tool calls, token usage, and other response data.

The scorer compares these two and returns a ScoreResult containing a numeric score (0.0 to 1.0), a boolean pass/fail verdict, and an explanation string.

Scenario + AgentResult --> Scorer --> ScoreResult
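Concretely, a `ScoreResult` carries the fields described above. The following is a rough sketch assembled from the field names mentioned on this page (the actual class in AgenticAssure may define additional fields or different defaults):

```python
from dataclasses import dataclass, field


@dataclass
class ScoreResult:
    """Sketch of a scorer's verdict, based on the fields described in this page."""
    scenario_id: str
    scorer_name: str
    score: float       # numeric score, 0.0 to 1.0
    passed: bool       # pass/fail verdict
    explanation: str   # human-readable reason
    details: dict = field(default_factory=dict)  # scorer-specific extras
```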

The Scorer Protocol

All scorers implement the Scorer protocol:

```python
from typing import Protocol

from agenticassure.results import AgentResult, ScoreResult
from agenticassure.scenario import Scenario


class Scorer(Protocol):
    name: str

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult: ...
```

A scorer must have:

  • A name attribute (a string used to reference it in YAML and the registry).
  • A score method that accepts a Scenario and an AgentResult and returns a ScoreResult.

Your class does not need to inherit from the Scorer protocol. It only needs to have the correct attributes and method signature (structural subtyping).
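To illustrate structural subtyping in isolation, here is a self-contained sketch: a hypothetical `AlwaysPass` class satisfies a `runtime_checkable` version of the protocol without inheriting from it. (`AlwaysPass` and the dict return value are illustrative only, not part of AgenticAssure.)

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Scorer(Protocol):
    name: str

    def score(self, scenario, result): ...


class AlwaysPass:
    """Satisfies Scorer structurally -- no inheritance needed."""
    name = "always_pass"

    def score(self, scenario, result):
        return {"score": 1.0, "passed": True, "explanation": "always passes"}
```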


The Scorer Registry

Scorers are stored in a global registry keyed by their name attribute. The registry provides three functions:

register_scorer(scorer)

Registers a scorer instance. If a scorer with the same name already exists, it is silently replaced. This allows you to override built-in scorers with custom implementations.

```python
from agenticassure.scorers.base import register_scorer

register_scorer(MyCustomScorer())
```

get_scorer(name) -> Scorer

Looks up a scorer by name. Raises KeyError if no scorer with that name has been registered.

```python
from agenticassure.scorers.base import get_scorer

scorer = get_scorer("passfail")
```

list_scorers() -> list[str]

Returns the names of all registered scorers in registration order.

```python
from agenticassure.scorers.base import list_scorers

print(list_scorers())  # e.g., ["passfail", "exact", "regex", "similarity"]
```

Built-in scorers are automatically registered when AgenticAssure is imported.
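The three registry functions behave like operations on a name-keyed dictionary. The snippet below is a simplified stand-in for how such a registry can work, not the library's actual implementation:

```python
# Hypothetical sketch of a scorer registry keyed by name.
_REGISTRY: dict[str, object] = {}


def register_scorer(scorer):
    # Same name -> silently replaced, which allows overriding built-ins.
    _REGISTRY[scorer.name] = scorer


def get_scorer(name):
    if name not in _REGISTRY:
        raise KeyError(f"No scorer named {name!r}; available: {list_scorers()}")
    return _REGISTRY[name]


def list_scorers():
    # Dicts preserve insertion order, so this is registration order.
    return list(_REGISTRY)
```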


Referencing Scorers in Scenarios

Scenarios specify which scorers to apply through the scorers field in YAML:

```yaml
scenarios:
  - name: my_test
    input: "What is 2 + 2?"
    expected_output: "4"
    scorers:
      - passfail
      - exact
```

If the scorers field is omitted, the default is ["passfail"].

The runner resolves each scorer name through get_scorer() at execution time. If a name does not match any registered scorer, a KeyError is raised with a message listing all available scorers.


Using Multiple Scorers

You can apply any number of scorers to a single scenario. Each scorer runs independently and produces its own ScoreResult.

```yaml
scenarios:
  - name: structured_response
    input: "Generate a JSON object with a 'name' field."
    expected_output: '{"name":'
    metadata:
      regex_pattern: '"name"\s*:\s*"[^"]+"'
    scorers:
      - passfail
      - regex
```

Note that in a single-quoted YAML string, backslashes are literal, so the pattern is written with single backslashes.

Pass/Fail Determination

A scenario passes only if every scorer passes. If any single scorer fails, the entire scenario is marked as failed. This AND-logic lets you layer progressively stricter checks:

  1. passfail — Did the agent produce output and call the right tools?
  2. exact or similarity — Was the content of the output correct?
  3. regex — Does the output match a required structural pattern?

Each scorer’s individual result (score, pass/fail, explanation) is preserved in the ScenarioRunResult.scores list, so you can see exactly which scorer caused a failure.
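The AND logic is straightforward to express. In this sketch, `ScorerVerdict` is a hypothetical stand-in for the per-scorer result objects kept in `ScenarioRunResult.scores`:

```python
from collections import namedtuple

# Hypothetical minimal stand-in for a per-scorer result.
ScorerVerdict = namedtuple("ScorerVerdict", ["scorer_name", "passed"])


def scenario_passed(scores):
    # AND logic: a single failing scorer fails the whole scenario.
    return all(s.passed for s in scores)


def failing_scorers(scores):
    # Each verdict is preserved, so failures can be pinpointed by name.
    return [s.scorer_name for s in scores if not s.passed]
```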


Built-In Scorers

AgenticAssure ships with four built-in scorers.

passfail

The default scorer. Performs multiple checks:

  • Non-empty output — The agent must produce a non-empty response.
  • Expected tools — If expected_tools is set, every listed tool must appear in the agent’s tool_calls.
  • Expected tool arguments — If expected_tool_args is set, each tool must have been called with the specified argument key-value pairs (exact match per key; extra keys are ignored).
  • Expected output — If expected_output is set, it must appear as a case-insensitive substring of the agent’s output.

The score is 1.0 if all applicable checks pass, 0.0 otherwise.
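The checks above can be sketched as one function. This is an illustrative stand-in (`passfail_check`, with tool calls modeled as plain dicts), not the library's actual code:

```python
def passfail_check(output, tool_calls,
                   expected_output=None,
                   expected_tools=None,
                   expected_tool_args=None):
    # Non-empty output is always required.
    if not output or not output.strip():
        return False
    # Every expected tool must appear among the agent's tool calls.
    if expected_tools:
        called = {call["name"] for call in tool_calls}
        if not set(expected_tools).issubset(called):
            return False
    # Expected args: exact match per key; extra keys on the call are ignored.
    if expected_tool_args:
        for tool, want in expected_tool_args.items():
            matches = [c for c in tool_calls if c["name"] == tool]
            if not any(all(c["args"].get(k) == v for k, v in want.items())
                       for c in matches):
                return False
    # Expected output must appear as a case-insensitive substring.
    if expected_output is not None and expected_output.lower() not in output.lower():
        return False
    return True
```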

```yaml
- name: tool_test
  input: "Look up order #999"
  expected_tools:
    - lookup_order
  expected_tool_args:
    lookup_order:
      order_id: "999"
  scorers:
    - passfail
```

exact

Checks whether the agent’s output exactly matches expected_output.

By default, both strings are stripped of leading/trailing whitespace and lowercased before comparison (normalization). You can disable normalization by setting exact_normalize: false in the scenario’s metadata.

Requires expected_output to be set. If it is not, the scorer fails with an explanatory message.
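The comparison, including the default normalization, can be sketched as:

```python
def exact_match(output, expected, normalize=True):
    # With normalization (the default), strip leading/trailing whitespace
    # and lowercase both sides before comparing.
    if normalize:
        output = output.strip().lower()
        expected = expected.strip().lower()
    return output == expected
```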

```yaml
- name: exact_answer
  input: "What is the capital of France?"
  expected_output: "Paris"
  scorers:
    - exact
```

regex

Matches the agent’s output against a regular expression pattern defined in the scenario’s metadata under the key regex_pattern.

The pattern is evaluated using Python’s re.search, so it matches anywhere in the output. The score is 1.0 if the pattern matches, 0.0 otherwise. The matched substring is included in the details field of the ScoreResult.

Requires metadata.regex_pattern to be set. If it is missing, the scorer fails with an explanatory message.
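A minimal sketch of the matching logic, using Python's `re.search` as described (the `(score, details)` return shape is illustrative):

```python
import re


def regex_score(output, pattern):
    # re.search matches anywhere in the output, not just at the start.
    m = re.search(pattern, output)
    if m:
        return 1.0, {"match": m.group(0)}
    return 0.0, {}
```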

```yaml
- name: zip_code_check
  input: "What is the zip code for Beverly Hills?"
  metadata:
    regex_pattern: "\\d{5}"
  scorers:
    - regex
```

similarity

Computes semantic similarity between the agent’s output and expected_output using sentence-transformers embeddings. The score is the cosine similarity between the two embeddings, clamped to the range 0.0 to 1.0.

The default similarity threshold is 0.7. You can override it per-scenario by setting similarity_threshold in metadata.
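The scoring step can be sketched in plain Python. In practice the embeddings come from sentence-transformers; here they are simply lists of floats, and `similarity_verdict` is a hypothetical name:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_verdict(emb_a, emb_b, threshold=0.7):
    # Clamp the cosine similarity into [0.0, 1.0], then apply the threshold.
    score = max(0.0, min(1.0, cosine_similarity(emb_a, emb_b)))
    return score, score >= threshold
```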

The default embedding model is all-MiniLM-L6-v2. This scorer requires the sentence-transformers package, which is not installed by default:

```shell
pip install "agenticassure[similarity]"
```

If the package is not installed, the scorer will not be registered and referencing it by name will raise a KeyError.

Requires expected_output to be set.

```yaml
- name: semantic_check
  input: "Explain photosynthesis"
  expected_output: "Photosynthesis is the process by which plants convert sunlight into energy."
  metadata:
    similarity_threshold: 0.75
  scorers:
    - similarity
```

Custom Scorers

You can create and register your own scorers. A custom scorer is any class with a name attribute and a score method matching the protocol.

```python
from agenticassure.results import AgentResult, ScoreResult
from agenticassure.scenario import Scenario
from agenticassure.scorers.base import register_scorer


class LengthScorer:
    name = "length"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        max_length = scenario.metadata.get("max_length", 500)
        output_length = len(result.output)
        passed = output_length <= max_length
        return ScoreResult(
            scenario_id=scenario.id,
            scorer_name=self.name,
            score=1.0 if passed else 0.0,
            passed=passed,
            explanation=f"Output length: {output_length} (max: {max_length})",
        )


register_scorer(LengthScorer())
```

After registration, you can reference the scorer by name in your YAML scenarios:

```yaml
- name: concise_response
  input: "Summarize this article in one sentence."
  metadata:
    max_length: 200
  scorers:
    - passfail
    - length
```

For a complete guide on building, testing, and distributing custom scorers, see the Custom Scorers Guide.
