
Exact Match Scorer

The Exact Match scorer performs a strict string comparison between the agent’s output and the scenario’s expected_output. The score is 1.0 if the strings are identical, and 0.0 otherwise. There is no partial credit.


How It Works

The scorer compares the agent’s output to the expected_output field on the scenario using Python’s standard string equality (==). If the two strings match character-for-character, the scenario passes. If they differ by even a single character, it fails.

If the scenario does not define an expected_output at all, the scorer automatically fails with an explanation indicating that exact match requires an expected output to compare against.
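The behavior described above can be sketched in a few lines of Python. This is an illustrative approximation, not the scorer's actual source: the `ScoreResult` container and the `exact_match` function name are assumptions made for the example, while the message strings follow the formats documented on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreResult:
    """Hypothetical result container; the real scorer's return type may differ."""
    score: float
    details: str


def exact_match(output: str, expected_output: Optional[str]) -> ScoreResult:
    # A scenario without expected_output fails automatically.
    if expected_output is None:
        return ScoreResult(
            0.0, "Exact match: FAIL. No expected_output defined for this scenario."
        )
    # Plain string equality: case, whitespace, and punctuation all count.
    if output == expected_output:
        return ScoreResult(1.0, "Exact match: PASS.")
    return ScoreResult(
        0.0, f'Exact match: FAIL. Expected "{expected_output}", got "{output}".'
    )
```

Calling `exact_match("Positive", "positive")` in this sketch yields a score of 0.0, because the comparison does no normalization of any kind.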


When to Use Exact Match

Use the Exact Match scorer when:

  • The agent is expected to return a deterministic, predictable value: a number, a code, a structured identifier, or a fixed label.
  • You are testing classification or extraction tasks where the output should be an exact known value.
  • You need to verify that formatting, capitalization, and punctuation are all precisely correct.

Prefer PassFail or Similarity over Exact Match when the agent produces natural language. LLMs rarely generate the exact same response twice, even for identical inputs. A minor rewording, extra space, or punctuation change will cause an exact match failure. For natural language output, the PassFail scorer’s substring check or the Similarity scorer’s semantic comparison is almost always a better choice.


Example Scenarios

Classification task

    suite:
      name: sentiment-classifier
      adapter: my_project.adapters.SentimentAdapter
      scorer: exact

    scenarios:
      - name: positive-review
        input: "This product is absolutely wonderful, I love it!"
        expected_output: "positive"
      - name: negative-review
        input: "Terrible experience. Completely broken on arrival."
        expected_output: "negative"
      - name: neutral-review
        input: "The package arrived on Tuesday."
        expected_output: "neutral"

This is an ideal use case. The adapter wraps a sentiment classification agent that is expected to return one of a fixed set of labels. Exact match ensures the label is precisely correct.

Numeric extraction

    suite:
      name: calculator-agent
      adapter: my_project.adapters.CalculatorAdapter
      scorer: exact

    scenarios:
      - name: addition
        input: "What is 42 + 58?"
        expected_output: "100"
      - name: percentage
        input: "What is 15% of 200?"
        expected_output: "30"

Code generation

    suite:
      name: sql-generator
      adapter: my_project.adapters.SQLAdapter
      scorer: exact

    scenarios:
      - name: simple-select
        input: "Get all users older than 30"
        expected_output: "SELECT * FROM users WHERE age > 30;"

Be cautious with this pattern. If the agent produces "SELECT * FROM users WHERE age > 30" (no trailing semicolon) or "select * from users where age > 30;" (lowercase), the test will fail, even though both queries are equivalent to the expected one.


Case Sensitivity

The Exact Match scorer is case-sensitive. The strings "Hello" and "hello" are not equal and will produce a score of 0.0.

If you need case-insensitive comparison, you have several options:

  1. Use the PassFail scorer — its substring check is case-insensitive by default.
  2. Use the Regex scorer with a case-insensitive flag: metadata.regex_pattern: "(?i)^hello$".
  3. Normalize in your adapter — have your adapter convert the agent’s output to lowercase before returning the AgentResult.
  4. Write a custom scorer that performs case-insensitive comparison.
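Options 2 and 3 can be illustrated with plain Python. The helper names below are hypothetical, and the sketch assumes the adapter has access to the agent's raw output as a string; how the Regex scorer applies its pattern internally is not specified here, so `re.search` is used only to show what the `(?i)` flag does.

```python
import re


def normalize_case(raw_output: str) -> str:
    # Option 3: lowercase the agent's output inside the adapter, so that
    # expected_output can be written in lowercase in the suite file.
    return raw_output.lower()


def matches_ignoring_case(pattern: str, output: str) -> bool:
    # Option 2: an inline (?i) flag makes the whole pattern case-insensitive.
    return re.search(pattern, output) is not None
```

With these helpers, `normalize_case("Positive") == "positive"` holds, and `matches_ignoring_case("(?i)^hello$", "HeLLo")` returns `True`, whereas the same pattern without `(?i)` does not match.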

The Details Field

On success:

Exact match: PASS.

On failure when expected_output is defined:

Exact match: FAIL. Expected "positive", got "Positive".

On failure when expected_output is missing:

Exact match: FAIL. No expected_output defined for this scenario.

The details field shows both the expected and actual values, which is particularly helpful for spotting subtle differences like trailing whitespace or capitalization mismatches.
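When the two values in a failure message look identical at a glance, it can help to locate the first differing character. The helper below is a debugging sketch and not part of the scorer; the name `first_difference` is made up for this example. It uses `repr` formatting so that whitespace characters such as newlines become visible.

```python
def first_difference(expected: str, actual: str) -> str:
    # Walk both strings in parallel and report the first mismatching character.
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return f"index {i}: expected {e!r}, got {a!r}"
    # No mismatch in the shared prefix: one string may simply be longer.
    if len(expected) > len(actual):
        return f"index {len(actual)}: expected has extra {expected[len(actual):]!r}"
    if len(actual) > len(expected):
        return f"index {len(expected)}: actual has extra {actual[len(expected):]!r}"
    return "strings are identical"
```

For the failure shown above, `first_difference("positive", "Positive")` points at index 0, and a trailing newline in the agent's output is reported as an extra `'\n'` at the end of the actual value.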


Limitations

Brittle for LLM output. Large language models are inherently non-deterministic. Even with temperature=0, different API versions, model updates, or minor prompt changes can alter the exact output. Exact Match is best reserved for tasks where you have tight control over the output format.

Whitespace matters. A trailing newline, extra space, or tab character will cause a failure. If your agent tends to add trailing whitespace, consider stripping it in your adapter before returning the result.
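One way to apply this is to wrap the agent call so that its output is stripped before the adapter builds its result. The sketch below assumes the agent can be treated as a function from prompt string to output string; `with_stripped_output` is a hypothetical helper, and the real adapter interface may look different.

```python
from typing import Callable


def with_stripped_output(agent: Callable[[str], str]) -> Callable[[str], str]:
    # Wrap an agent callable so leading and trailing whitespace
    # (spaces, tabs, newlines) is removed from every response.
    def wrapped(prompt: str) -> str:
        return agent(prompt).strip()

    return wrapped


def fake_calculator(prompt: str) -> str:
    # Stand-in agent for illustration: always answers with a trailing newline.
    return "100\n"


clean_calculator = with_stripped_output(fake_calculator)
```

Here `fake_calculator("What is 42 + 58?")` returns `"100\n"`, which would fail against an expected output of `"100"`, while `clean_calculator` returns `"100"`. Note that `strip()` only touches the ends of the string; whitespace inside the output is left alone.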

No partial credit. A response that is 99% correct still scores 0.0. For longer outputs where you want to check that key content is present, use PassFail (substring check) or Similarity (semantic comparison).

No expected_output means automatic failure. Unlike PassFail, which can run with just an output-produced check, Exact Match requires expected_output to be defined. Scenarios without it will always fail.

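Multiline output. YAML multiline strings can be tricky. Use the | (literal block) scalar style to define multiline expected output with its line breaks preserved exactly. Avoid the > (folded block) style for this purpose: it joins consecutive lines with spaces, so the parsed string no longer contains the line breaks you wrote.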

    scenarios:
      - name: multiline-response
        input: "List the three primary colors"
        expected_output: |
          red
          blue
          yellow

Be aware that the | style includes a trailing newline. Use |- (strip trailing newline) if you need the string to end without one.
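To make the difference concrete, the constants below spell out, as Python string literals, the values a YAML parser produces for the three-line example under each block scalar style. No YAML library is used in this sketch; the constant names are illustrative only.

```python
# expected_output: |   (literal, default "clip" chomping keeps one trailing newline)
LITERAL_CLIP = "red\nblue\nyellow\n"

# expected_output: |-  (literal, "strip" chomping removes the trailing newline)
LITERAL_STRIP = "red\nblue\nyellow"

# expected_output: >   (folded: single line breaks become spaces)
FOLDED_CLIP = "red blue yellow\n"

# What an agent would typically return if it prints one color per line
# with no newline after the last one.
agent_output = "red\nblue\nyellow"

matches_clip = agent_output == LITERAL_CLIP      # False: expected has a trailing "\n"
matches_strip = agent_output == LITERAL_STRIP    # True
matches_folded = agent_output == FOLDED_CLIP     # False: line breaks were folded
```

Only the |- form matches an output that ends without a newline, which is why it is usually the safer choice for Exact Match scenarios.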

