
Similarity Scorer

The Similarity scorer uses sentence embeddings and cosine similarity to compare the agent’s output against the scenario’s expected_output. Instead of requiring an exact match or substring, it measures how semantically close the two texts are. This makes it well suited for evaluating natural language output where the wording may vary but the meaning should be consistent.


How It Works

  1. The scorer loads a sentence-transformer model (by default, all-MiniLM-L6-v2).
  2. It encodes both the agent’s output and the scenario’s expected_output into dense vector embeddings.
  3. It computes the cosine similarity between the two vectors, producing a value between 0.0 and 1.0.
  4. If the similarity meets or exceeds the threshold (default: 0.7), the scenario passes.

The raw cosine similarity value is used as the score. Unlike PassFail or Exact Match, which return only 0.0 or 1.0, the Similarity scorer returns a continuous value. For example, a score of 0.85 means the output is quite similar to the expected output, while 0.45 means it is only loosely related.
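Steps 3 and 4 can be sketched with plain vectors (a minimal illustration only; the real scorer first encodes both texts with a sentence-transformer model, which is omitted here):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def scenario_passes(output_vec, expected_vec, threshold=0.7):
    """Step 4: pass if the similarity meets or exceeds the threshold."""
    return cosine_similarity(output_vec, expected_vec) >= threshold
```

Vectors pointing the same direction score 1.0 and orthogonal vectors score 0.0, which is why the continuous score carries more information than a binary pass/fail.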


Installation

The Similarity scorer requires the sentence-transformers library, which is not included in the base AgenticAssure installation. Install it with the similarity extra:

pip install "agenticassure[similarity]"

This pulls in sentence-transformers, torch, and their dependencies. The total download size is significant (several hundred MB) due to the PyTorch dependency.
The quotes around the package spec are needed in shells such as zsh, where unquoted square brackets are treated as glob patterns.

If you attempt to use the Similarity scorer without this extra installed, you will receive an ImportError with instructions to install it.


Setting the Threshold

The default threshold is 0.7. A scenario passes if its cosine similarity score is greater than or equal to the threshold.

Per-scenario threshold

Override the threshold for individual scenarios using metadata.similarity_threshold:

suite:
  name: response-quality
  adapter: my_project.adapters.MyAdapter
  scorer: similarity
  scenarios:
    - name: strict-match
      input: "What is our refund policy?"
      expected_output: "You can request a full refund within 30 days of purchase."
      metadata:
        similarity_threshold: 0.9
    - name: loose-match
      input: "Tell me a fun fact about space"
      expected_output: "The sun is a star at the center of our solar system."
      metadata:
        similarity_threshold: 0.5

The threshold value must be a float between 0.0 and 1.0.
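That constraint can be expressed as a small check. This is a hypothetical helper, not part of AgenticAssure; it simply encodes the rule above:

```python
def validate_threshold(value):
    """Check that a similarity threshold is a number in [0.0, 1.0]."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(
            f"similarity_threshold must be a number, got {type(value).__name__}"
        )
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"similarity_threshold must be between 0.0 and 1.0, got {value}"
        )
    return float(value)
```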


Choosing the Right Threshold

The ideal threshold depends on how much variation you expect in the output and how strict your requirements are.

| Threshold | Behavior | Good For |
|---|---|---|
| 0.9–1.0 | Very strict. Output must be nearly identical in meaning and wording. | Paraphrasing tasks, translation quality, outputs with a single correct answer |
| 0.7–0.9 | Moderate. Allows rewording and variation while ensuring the core meaning is preserved. | General-purpose Q&A, summarization, typical agent responses |
| 0.5–0.7 | Loose. Accepts outputs that are topically related even if the details differ. | Creative tasks, open-ended responses, topic verification |
| Below 0.5 | Very loose. Almost anything vaguely related will pass. | Rarely useful; consider whether this test is providing value |

Start at 0.7 (the default) and adjust based on observed results. Run your test suite, review the similarity scores in the report, and tighten or loosen thresholds based on what you see.


The Embedding Model

The default model is all-MiniLM-L6-v2, a lightweight sentence-transformer model. It is:

  • Fast (typically under 100ms per encoding on CPU).
  • Small (approximately 80 MB download, 90 MB in memory).
  • Effective for general English text similarity.
  • Trained on over 1 billion sentence pairs.

The model is downloaded automatically from Hugging Face on first use and cached locally in the default Hugging Face cache directory (~/.cache/huggingface/ on Linux/macOS, C:\Users\<user>\.cache\huggingface\ on Windows).

Hugging Face rate limits

If you are running tests in a CI environment or making many concurrent requests, you may encounter Hugging Face Hub rate limits when downloading the model for the first time. Set the HF_TOKEN environment variable with a Hugging Face access token to increase your rate limit:

export HF_TOKEN=hf_your_token_here
agenticassure run suite.yaml

Once the model is cached locally, the token is not needed for subsequent runs.
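In CI, the token is typically injected from a secret rather than exported inline. A hypothetical GitHub Actions fragment (job and secret names are placeholders, not part of AgenticAssure's documentation):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install "agenticassure[similarity]"
      - run: agenticassure run suite.yaml
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
```

Persisting ~/.cache/huggingface/ between CI runs (for example with a cache step) avoids repeated downloads entirely.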


Performance Considerations

Model loading time

The sentence-transformer model is lazy-loaded on first use. The first scenario scored with the Similarity scorer will take longer (typically 1-3 seconds) as the model loads into memory. Subsequent scenarios reuse the already-loaded model and are fast (milliseconds per comparison).

This means:

  • If you have one Similarity-scored scenario in a suite of 50 PassFail scenarios, only that one scenario pays the model-loading cost.
  • If you have 50 Similarity-scored scenarios, only the first one is slow; the remaining 49 use the cached model.
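The load-once-per-process behavior described above follows a standard memoization pattern. A self-contained sketch with an injected loader (the real scorer constructs a sentence-transformer model instead of the fake loader used here):

```python
_models = {}

def load_once(name, loader):
    """Load a model the first time it is requested; reuse it afterwards."""
    if name not in _models:
        _models[name] = loader(name)  # the slow step, paid once per process
    return _models[name]

# Simulate the expensive load to show it happens only once.
load_count = 0

def fake_loader(name):
    global load_count
    load_count += 1
    return f"model:{name}"

for _ in range(50):
    load_once("all-MiniLM-L6-v2", fake_loader)
```

After the loop, the loader has run exactly once; the other 49 calls returned the cached instance.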

CPU vs. GPU

The default model runs on CPU and is fast enough for typical test suites (hundreds of scenarios). If you are running thousands of scenarios with Similarity scoring, a CUDA-capable GPU will provide a significant speedup. The sentence-transformers library automatically uses GPU when available.

Memory usage

The all-MiniLM-L6-v2 model uses approximately 90 MB of RAM. This is usually not a concern, but be aware of it in memory-constrained CI environments.

Caching

The model is loaded once per process and reused for all subsequent scoring calls. There is no need to configure caching — it is handled automatically.


Example Scenarios

Basic semantic comparison

suite:
  name: faq-agent
  adapter: my_project.adapters.FAQAdapter
  scorer: similarity
  scenarios:
    - name: return-policy
      input: "How do I return an item?"
      expected_output: "To return an item, visit our returns portal within 30 days of delivery and follow the instructions to print a shipping label."

The agent might respond with: “You can return your purchase by going to the returns page on our website. You have 30 days from when it was delivered. The site will generate a return shipping label for you.”

This is a different wording but conveys the same information. With a threshold of 0.7, this would likely score around 0.80-0.90 and pass.

Mixed scorers in one suite

You can use different scorers for different scenarios within the same suite by overriding at the scenario level:

suite:
  name: mixed-evaluation
  adapter: my_project.adapters.MyAdapter
  scorer: passfail
  scenarios:
    - name: tool-check
      input: "Look up order #123"
      expected_tools:
        - query_orders
    - name: response-quality
      input: "Explain our warranty"
      expected_output: "All products come with a one-year limited warranty covering manufacturing defects."
      scorer: similarity
      metadata:
        similarity_threshold: 0.75

The first scenario uses the suite-level PassFail scorer. The second overrides it with Similarity.


The Details Field

On success:

Similarity: PASS. Cosine similarity 0.847 >= threshold 0.700.

On failure:

Similarity: FAIL. Cosine similarity 0.432 < threshold 0.700.

On failure due to missing expected_output:

Similarity: FAIL. No expected_output defined for this scenario.

The reported similarity value is useful for calibrating your thresholds. If you see many scenarios scoring 0.65 and failing at the 0.7 threshold, consider whether lowering the threshold to 0.6 is appropriate or whether the agent’s responses genuinely need improvement.
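One way to use those reported scores: collect the similarity values from a run along with your own judgment of which outputs were acceptable, then check how a candidate threshold would classify them. This is a hypothetical calibration helper, not part of AgenticAssure:

```python
def evaluate_threshold(scored, threshold):
    """scored: list of (similarity, acceptable) pairs from a reviewed run.

    Returns (false_failures, false_passes): acceptable outputs that would
    fail at this threshold, and unacceptable outputs that would pass.
    """
    false_failures = sum(1 for s, ok in scored if ok and s < threshold)
    false_passes = sum(1 for s, ok in scored if not ok and s >= threshold)
    return false_failures, false_passes

# Example: similarity scores paired with a human acceptability judgment.
reviewed = [(0.85, True), (0.65, True), (0.62, True), (0.45, False), (0.72, False)]
```

With this sample, a 0.7 threshold fails two acceptable outputs while passing one unacceptable one; sweeping a few candidate thresholds over such a list shows where the trade-off lies for your suite.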


When to Use Similarity vs. Other Scorers

| Situation | Recommended Scorer |
|---|---|
| Agent should produce a specific known string | Exact Match |
| Agent should mention specific keywords or phrases | PassFail (substring check) |
| Agent should produce output in a specific format | Regex |
| Agent should convey the same meaning as a reference answer | Similarity |
| Agent should produce a high-quality natural language response | Similarity |
| You want to detect semantic drift over time | Similarity |

Similarity is the best choice when you care about what the agent says rather than how it says it. It handles synonyms, paraphrasing, and rewording gracefully.
