
Custom Scorers

AgenticAssure’s scoring system is extensible. You can write your own scorer to implement any evaluation logic, register it with the scorer registry, and reference it by name in your YAML scenario files.


The Scorer Protocol

A scorer is any Python class that satisfies the `Scorer` protocol:

```python
from agenticassure.models import Scenario, AgentResult, ScoreResult


class MyScorer:
    name: str

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        ...
```

The protocol requires two things:

  1. A `name` attribute (a string). This is the name used to reference the scorer in YAML files.
  2. A `score` method that takes a `Scenario` and an `AgentResult` and returns a `ScoreResult`.

You do not need to inherit from a base class or use any decorator. Any object with a `name` string and a `score` method with the correct signature will work.
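If you want a lint-time or runtime conformance check anyway, the contract can be sketched with `typing.Protocol`. The `ScorerProtocol` definition below is an illustrative stand-in written for this example, not something AgenticAssure ships:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ScorerProtocol(Protocol):
    """Stand-in for the duck-typed contract described above."""

    name: str

    def score(self, scenario: Any, result: Any) -> Any: ...


class KeywordScorer:
    """No base class, no decorator -- just the right attribute and method."""

    name = "keyword"

    def score(self, scenario, result):
        ...


# runtime_checkable lets isinstance() verify that both members exist.
assert isinstance(KeywordScorer(), ScorerProtocol)
```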


ScoreResult

The `score` method must return a `ScoreResult` with three fields:

```python
from agenticassure.models import ScoreResult

ScoreResult(
    score=0.85,    # float, typically 0.0 to 1.0
    passed=True,   # bool, whether the scenario passed
    details="..."  # str, human-readable explanation
)
```
  • `score`: A numeric score. By convention, 0.0 means complete failure and 1.0 means complete success, but you can use any scale.
  • `passed`: A boolean indicating whether the scenario passed. This is what determines pass/fail in reports.
  • `details`: A string explaining the scoring result. This appears in CLI, HTML, and JSON reports.
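Note that `passed` does not have to mean `score == 1.0`. For continuous metrics, a common pattern is to derive the verdict from a threshold on the score. A minimal sketch using the standard library's `difflib` (the `threshold` parameter is an assumption of this example, not an AgenticAssure option):

```python
from difflib import SequenceMatcher


def similarity_score(expected: str, actual: str, threshold: float = 0.8):
    """Illustrative: a continuous score plus a derived pass/fail verdict.

    Returns the (score, passed, details) triple a ScoreResult would carry.
    """
    score = SequenceMatcher(None, expected, actual).ratio()  # 0.0 .. 1.0
    passed = score >= threshold
    details = (
        f"Similarity: {score:.2f} (threshold {threshold}). "
        f"{'PASS' if passed else 'FAIL'}."
    )
    return score, passed, details


score, passed, details = similarity_score("hello world", "hello world")
```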

Registering Your Scorer

After defining your scorer class, register it with the global scorer registry:

```python
from agenticassure.scorers.base import register_scorer

my_scorer = MyScorer()
register_scorer(my_scorer)
```

Once registered, you can reference it by name in your YAML files:

```yaml
suite:
  name: my-tests
  adapter: my_project.adapters.MyAdapter
  scorer: my_custom_scorer
```

The `scorer` value in YAML must match the `name` attribute of your registered scorer exactly.

When to register

Registration must happen before the runner loads and executes your suite. The simplest approach is to register in the same module where you define your adapter, or in a setup module that runs before the test suite.

If you are using the CLI, you can register scorers in the adapter module that gets imported when the suite is loaded. Since the adapter’s module path is specified in the YAML file, any module-level code in that module (or its imports) will execute before scoring begins.
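The register-on-import pattern can be sketched with a stand-in registry (a plain dict defined here for illustration; the real registry lives in `agenticassure.scorers.base`, and `get_scorer` is a hypothetical lookup for this example):

```python
# Stand-in registry illustrating the pattern.
_REGISTRY: dict[str, object] = {}


def register_scorer(scorer) -> None:
    """Index the scorer under its `name` attribute."""
    _REGISTRY[scorer.name] = scorer


def get_scorer(name: str):
    """Look a scorer up by the name used in YAML files."""
    return _REGISTRY[name]


class ToneScorer:
    name = "tone"

    def score(self, scenario, result):
        ...


# Module-level statement: executes as a side effect the moment this module
# is imported -- i.e. before the runner resolves the suite's scorer name.
register_scorer(ToneScorer())
```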


Full Working Example

Here is a complete custom scorer that checks whether the agent’s output is valid JSON:

"""json_scorer.py -- A custom scorer that validates JSON output.""" import json from agenticassure.models import Scenario, AgentResult, ScoreResult from agenticassure.scorers.base import register_scorer class JsonScorer: """Scores agent output based on whether it is valid JSON.""" name = "json" def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult: if not result.output or not result.output.strip(): return ScoreResult( score=0.0, passed=False, details="JSON: FAIL. Agent produced no output.", ) try: parsed = json.loads(result.output) except json.JSONDecodeError as e: return ScoreResult( score=0.0, passed=False, details=f"JSON: FAIL. Invalid JSON: {e}.", ) # Optionally check for required keys from metadata required_keys = (scenario.metadata or {}).get("required_keys", []) if required_keys and isinstance(parsed, dict): missing = [k for k in required_keys if k not in parsed] if missing: return ScoreResult( score=0.5, passed=False, details=f"JSON: FAIL. Valid JSON but missing keys: {missing}.", ) return ScoreResult( score=1.0, passed=True, details="JSON: PASS. Output is valid JSON.", ) # Register on import register_scorer(JsonScorer())

Use it in a scenario file:

```yaml
suite:
  name: api-agent-tests
  adapter: my_project.adapters.APIAgent
  scorer: json
  scenarios:
    - name: basic-json-output
      input: "Return the user profile as JSON"
    - name: json-with-required-keys
      input: "Return the order details as JSON"
      metadata:
        required_keys:
          - order_id
          - status
          - total
```

Using Metadata for Custom Configuration

The `metadata` field on a scenario is a free-form dictionary (`dict[str, Any]`). Custom scorers can read any keys from it to drive their behavior. This is the standard way to pass scorer-specific configuration without modifying the core data model.

Examples of metadata usage:

```yaml
scenarios:
  - name: response-length-check
    input: "Summarize this article"
    metadata:
      min_words: 50
      max_words: 200
```
```python
class WordCountScorer:
    name = "word_count"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        word_count = len(result.output.split())
        min_words = (scenario.metadata or {}).get("min_words", 0)
        max_words = (scenario.metadata or {}).get("max_words", float("inf"))
        passed = min_words <= word_count <= max_words
        return ScoreResult(
            score=1.0 if passed else 0.0,
            passed=passed,
            details=(
                f"Word count: {word_count} "
                f"(range: {min_words}-{max_words}). "
                f"{'PASS' if passed else 'FAIL'}."
            ),
        )
```

Metadata keys are entirely up to you. Document them clearly so that anyone writing scenarios for your scorer knows what options are available.


Accessing Scenario Fields

Your scorer has full access to the `Scenario` object, which includes:

| Field | Type | Description |
| --- | --- | --- |
| `scenario.name` | `str` | Scenario name |
| `scenario.input` | `str` | The input sent to the agent |
| `scenario.expected_output` | `str \| None` | The expected output, if defined |
| `scenario.expected_tools` | `list[str] \| None` | Expected tool names, if defined |
| `scenario.expected_tool_args` | `dict \| None` | Expected tool arguments, if defined |
| `scenario.metadata` | `dict[str, Any] \| None` | Free-form metadata |
| `scenario.tags` | `list[str] \| None` | Tags for filtering |

And the `AgentResult` object:

| Field | Type | Description |
| --- | --- | --- |
| `result.output` | `str` | The agent’s output text |
| `result.tools_called` | `list[str] \| None` | Tools the agent invoked |
| `result.tool_args` | `dict \| None` | Arguments passed to each tool |
| `result.duration` | `float \| None` | Execution time in seconds |
| `result.error` | `str \| None` | Error message, if the agent failed |

Tips for Testing Your Scorer

Unit test with synthetic data

You do not need a running agent to test a scorer. Create `Scenario` and `AgentResult` objects directly:

```python
from agenticassure.models import Scenario, AgentResult

scenario = Scenario(
    name="test-case",
    input="test input",
    expected_output="expected",
    metadata={"my_key": "my_value"},
)
result = AgentResult(
    output="actual agent output",
    tools_called=["tool_a"],
)

scorer = MyScorer()
score_result = scorer.score(scenario, result)
assert score_result.passed is True
assert score_result.score == 1.0
```

Test edge cases

Make sure your scorer handles these gracefully:

  • Empty output: `result.output` is `""` or `None`.
  • Missing metadata: `scenario.metadata` is `None` or does not contain your expected keys.
  • Missing expected output: `scenario.expected_output` is `None` (if your scorer needs it).
  • Very long output: The agent returns a huge response.
  • Unicode content: The output contains non-ASCII characters.
  • Error in the agent result: `result.error` is set, indicating the agent itself failed.

Validate the details message

The `details` string is what users see in reports. Make it clear and actionable. Always indicate whether the check passed or failed, and include enough context to understand why.
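A small formatting helper keeps the verdict and the reason present in every message. `format_details` is a hypothetical convention for this example, not an AgenticAssure API:

```python
def format_details(check: str, passed: bool, reason: str = "") -> str:
    """Hypothetical helper: enforce a 'Check: PASS/FAIL. Why.' shape."""
    verdict = "PASS" if passed else "FAIL"
    message = f"{check}: {verdict}."
    return f"{message} {reason}" if reason else message


# Vague:  "failed"
# Better: "Keyword 'refund': FAIL. Output never mentions the keyword."
details = format_details("Keyword 'refund'", False, "Output never mentions the keyword.")
```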


Advanced: Combining Multiple Checks

A single scorer can perform multiple checks and aggregate them. Here is an example that checks both structure and content:

```python
class ComprehensiveScorer:
    name = "comprehensive"

    def score(self, scenario: Scenario, result: AgentResult) -> ScoreResult:
        checks = []
        all_passed = True

        # Check 1: Non-empty output
        if not result.output or not result.output.strip():
            checks.append("Output: FAIL (empty)")
            all_passed = False
        else:
            checks.append("Output: PASS")

        # Check 2: Contains required keyword
        keyword = (scenario.metadata or {}).get("required_keyword")
        if keyword:
            if keyword.lower() in result.output.lower():
                checks.append(f"Keyword '{keyword}': PASS")
            else:
                checks.append(f"Keyword '{keyword}': FAIL")
                all_passed = False

        # Check 3: Word count within range
        if result.output:
            word_count = len(result.output.split())
            max_words = (scenario.metadata or {}).get("max_words", 500)
            if word_count <= max_words:
                checks.append(f"Length ({word_count} words): PASS")
            else:
                checks.append(f"Length ({word_count} words): FAIL (max {max_words})")
                all_passed = False

        return ScoreResult(
            score=1.0 if all_passed else 0.0,
            passed=all_passed,
            details=". ".join(checks) + ".",
        )
```

Scorer Naming Conventions

  • Use lowercase, short, descriptive names: `json`, `word_count`, `tone`, `safety`.
  • Avoid collisions with built-in scorer names: `passfail`, `exact`, `regex`, `similarity`.
  • The `name` is the lookup key in the registry, so keep it short and unambiguous.
