
Custom Scorers

AgenticAssure’s scoring system is extensible. You can write your own scorer to implement any evaluation logic, register it with the scorer registry, and reference it by name in your YAML scenario files.


The Scorer Protocol

A scorer is any Python class that satisfies the `Scorer` protocol:

```python
from agenticassure.models import Scenario, AgentResult, ScoreResult


class MyScorer:
    name: str

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        ...
```

The protocol requires two things:

  1. A `name` attribute (a string). This is the name used to reference the scorer in YAML files.
  2. A `score` method that takes a `Scenario` and an `AgentResult` and returns a `ScoreResult`.

You do not need to inherit from a base class or use any decorator. Any object with a `name` string and a `score` method with the correct signature will work.
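If you want a lint-time or runtime conformance check anyway, the contract can be sketched with `typing.Protocol`. The `ScorerProtocol` definition below is an illustrative stand-in written for this example, not something AgenticAssure ships:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ScorerProtocol(Protocol):
    """Stand-in for the duck-typed contract described above."""

    name: str

    def score(self, scenario: Any, result: Any) -> Any: ...


class KeywordScorer:
    """No base class, no decorator -- just the right attribute and method."""

    name = "keyword"

    def score(self, scenario, result):
        ...


# runtime_checkable lets isinstance() verify that both members exist.
assert isinstance(KeywordScorer(), ScorerProtocol)
```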


ScoreResult

The `score` method must return a `ScoreResult` with three fields:

```python
from agenticassure.models import ScoreResult

ScoreResult(
    score=0.85,    # float, typically 0.0 to 1.0
    passed=True,   # bool, whether the scenario passed
    details="..."  # str, human-readable explanation
)
```
  • `score`: A numeric score. By convention, 0.0 means complete failure and 1.0 means complete success, but you can use any scale.
  • `passed`: A boolean indicating whether the scenario passed. This is what determines pass/fail in reports.
  • `details`: A string explaining the scoring result. This appears in CLI, HTML, and JSON reports.
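Note that `passed` does not have to mean `score == 1.0`. For continuous metrics, a common pattern is to derive the verdict from a threshold on the score. A minimal sketch using the standard library's `difflib` (the `threshold` parameter is an assumption of this example, not an AgenticAssure option):

```python
from difflib import SequenceMatcher


def similarity_score(expected: str, actual: str, threshold: float = 0.8):
    """Illustrative: a continuous score plus a derived pass/fail verdict.

    Returns the (score, passed, details) triple a ScoreResult would carry.
    """
    score = SequenceMatcher(None, expected, actual).ratio()  # 0.0 .. 1.0
    passed = score >= threshold
    details = (
        f"Similarity: {score:.2f} (threshold {threshold}). "
        f"{'PASS' if passed else 'FAIL'}."
    )
    return score, passed, details


score, passed, details = similarity_score("hello world", "hello world")
```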

Registering Your Scorer

After defining your scorer class, register it with the global scorer registry:

```python
from agenticassure.scorers.base import register_scorer

my_scorer = MyScorer()
register_scorer(my_scorer)
```

Once registered, you can reference it by name in your YAML files:

```yaml
suite:
  name: my-tests
  adapter: my_project.adapters.MyAdapter
  scorer: my_custom_scorer
```

The `scorer` value in YAML must match the `name` attribute of your registered scorer exactly.

When to register

Registration must happen before the runner loads and executes your suite. The simplest approach is to register in the same module where you define your adapter, or in a setup module that runs before the test suite.

If you are using the CLI, you can register scorers in the adapter module that gets imported when the suite is loaded. Since the adapter’s module path is specified in the YAML file, any module-level code in that module (or its imports) will execute before scoring begins.
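The register-on-import pattern can be sketched with a stand-in registry (a plain dict defined here for illustration; the real registry lives in `agenticassure.scorers.base`, and `get_scorer` is a hypothetical lookup for this example):

```python
# Stand-in registry illustrating the pattern.
_REGISTRY: dict[str, object] = {}


def register_scorer(scorer) -> None:
    """Index the scorer under its `name` attribute."""
    _REGISTRY[scorer.name] = scorer


def get_scorer(name: str):
    """Look a scorer up by the name used in YAML files."""
    return _REGISTRY[name]


class ToneScorer:
    name = "tone"

    def score(self, scenario, result):
        ...


# Module-level statement: executes as a side effect the moment this module
# is imported -- i.e. before the runner resolves the suite's scorer name.
register_scorer(ToneScorer())
```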


Full Working Example

Here is a complete custom scorer that checks whether the agent’s output is valid JSON:

"""json_scorer.py -- A custom scorer that validates JSON output.""" import json from agenticassure.models import Scenario, AgentResult, ScoreResult from agenticassure.scorers.base import register_scorer class JsonScorer: """Scores agent output based on whether it is valid JSON.""" name = "json" def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult: if not result.output or not result.output.strip(): return ScoreResult( score=0.0, passed=False, details="JSON: FAIL. Agent produced no output.", ) try: parsed = json.loads(result.output) except json.JSONDecodeError as e: return ScoreResult( score=0.0, passed=False, details=f"JSON: FAIL. Invalid JSON: {e}.", ) # Optionally check for required keys from metadata required_keys = (scenario.metadata or {}).get("required_keys", []) if required_keys and isinstance(parsed, dict): missing = [k for k in required_keys if k not in parsed] if missing: return ScoreResult( score=0.5, passed=False, details=f"JSON: FAIL. Valid JSON but missing keys: {missing}.", ) return ScoreResult( score=1.0, passed=True, details="JSON: PASS. Output is valid JSON.", ) # Register on import register_scorer(JsonScorer())

Use it in a scenario file:

```yaml
suite:
  name: api-agent-tests
  adapter: my_project.adapters.APIAgent
  scorer: json
  scenarios:
    - name: basic-json-output
      input: "Return the user profile as JSON"
    - name: json-with-required-keys
      input: "Return the order details as JSON"
      metadata:
        required_keys:
          - order_id
          - status
          - total
```

Using Metadata for Custom Configuration

The `metadata` field on a scenario is a free-form dictionary (`dict[str, Any]`). Custom scorers can read any keys from it to drive their behavior. This is the standard way to pass scorer-specific configuration without modifying the core data model.

Examples of metadata usage:

```yaml
scenarios:
  - name: response-length-check
    input: "Summarize this article"
    metadata:
      min_words: 50
      max_words: 200
```
```python
class WordCountScorer:
    name = "word_count"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        word_count = len(result.output.split())
        min_words = (scenario.metadata or {}).get("min_words", 0)
        max_words = (scenario.metadata or {}).get("max_words", float("inf"))
        passed = min_words <= word_count <= max_words
        return ScoreResult(
            score=1.0 if passed else 0.0,
            passed=passed,
            details=(
                f"Word count: {word_count} "
                f"(range: {min_words}-{max_words}). "
                f"{'PASS' if passed else 'FAIL'}."
            ),
        )
```

Metadata keys are entirely up to you. Document them clearly so that anyone writing scenarios for your scorer knows what options are available.


Accessing Scenario Fields

Your scorer has full access to the `Scenario` object, which includes:

| Field | Type | Description |
| --- | --- | --- |
| `scenario.name` | `str` | Scenario name |
| `scenario.input` | `str` | The input sent to the agent |
| `scenario.expected_output` | `str \| None` | The expected output, if defined |
| `scenario.expected_tools` | `list[str] \| None` | Expected tool names, if defined |
| `scenario.expected_tool_args` | `dict \| None` | Expected tool arguments, if defined |
| `scenario.metadata` | `dict[str, Any] \| None` | Free-form metadata |
| `scenario.tags` | `list[str] \| None` | Tags for filtering |

And the `AgentResult` object:

| Field | Type | Description |
| --- | --- | --- |
| `result.output` | `str` | The agent’s output text |
| `result.tools_called` | `list[str] \| None` | Tools the agent invoked |
| `result.tool_args` | `dict \| None` | Arguments passed to each tool |
| `result.duration` | `float \| None` | Execution time in seconds |
| `result.error` | `str \| None` | Error message, if the agent failed |

Tips for Testing Your Scorer

Unit test with synthetic data

You do not need a running agent to test a scorer. Create `Scenario` and `AgentResult` objects directly:

```python
from agenticassure.models import Scenario, AgentResult

scenario = Scenario(
    name="test-case",
    input="test input",
    expected_output="expected",
    metadata={"my_key": "my_value"},
)
result = AgentResult(
    output="actual agent output",
    tools_called=["tool_a"],
)

scorer = MyScorer()
score_result = scorer.score(scenario, result)
assert score_result.passed is True
assert score_result.score == 1.0
```

Test edge cases

Make sure your scorer handles these gracefully:

  • Empty output: `result.output` is `""` or `None`.
  • Missing metadata: `scenario.metadata` is `None` or does not contain your expected keys.
  • Missing expected output: `scenario.expected_output` is `None` (if your scorer needs it).
  • Very long output: The agent returns a huge response.
  • Unicode content: The output contains non-ASCII characters.
  • Error in the agent result: `result.error` is set, indicating the agent itself failed.

Validate the details message

The `details` string is what users see in reports. Make it clear and actionable. Always indicate whether the check passed or failed, and include enough context to understand why.
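A small formatting helper keeps the verdict and the reason present in every message. `format_details` is a hypothetical convention for this example, not an AgenticAssure API:

```python
def format_details(check: str, passed: bool, reason: str = "") -> str:
    """Hypothetical helper: enforce a 'Check: PASS/FAIL. Why.' shape."""
    verdict = "PASS" if passed else "FAIL"
    message = f"{check}: {verdict}."
    return f"{message} {reason}" if reason else message


# Vague:  "failed"
# Better: "Keyword 'refund': FAIL. Output never mentions the keyword."
details = format_details("Keyword 'refund'", False, "Output never mentions the keyword.")
```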


Advanced: Combining Multiple Checks

A single scorer can perform multiple checks and aggregate them. Here is an example that checks both structure and content:

```python
class ComprehensiveScorer:
    name = "comprehensive"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        checks = []
        all_passed = True

        # Check 1: Non-empty output
        if not result.output or not result.output.strip():
            checks.append("Output: FAIL (empty)")
            all_passed = False
        else:
            checks.append("Output: PASS")

        # Check 2: Contains required keyword
        keyword = (scenario.metadata or {}).get("required_keyword")
        if keyword:
            if keyword.lower() in result.output.lower():
                checks.append(f"Keyword '{keyword}': PASS")
            else:
                checks.append(f"Keyword '{keyword}': FAIL")
                all_passed = False

        # Check 3: Word count within range
        if result.output:
            word_count = len(result.output.split())
            max_words = (scenario.metadata or {}).get("max_words", 500)
            if word_count <= max_words:
                checks.append(f"Length ({word_count} words): PASS")
            else:
                checks.append(f"Length ({word_count} words): FAIL (max {max_words})")
                all_passed = False

        return ScoreResult(
            score=1.0 if all_passed else 0.0,
            passed=all_passed,
            details=". ".join(checks) + ".",
        )
```

Scorer Naming Conventions

  • Use lowercase, short, descriptive names: `json`, `word_count`, `tone`, `safety`.
  • Avoid collisions with built-in scorer names: `passfail`, `exact`, `regex`, `similarity`.
  • The `name` is the lookup key in the registry, so keep it short and unambiguous.
