Similarity Scorer
The Similarity scorer uses sentence embeddings and cosine similarity to compare the agent’s output against the scenario’s expected_output. Instead of requiring an exact match or substring, it measures how semantically close the two texts are. This makes it well suited for evaluating natural language output where the wording may vary but the meaning should be consistent.
How It Works
- The scorer loads a sentence-transformer model (by default, all-MiniLM-L6-v2).
- It encodes both the agent's output and the scenario's expected_output into dense vector embeddings.
- It computes the cosine similarity between the two vectors, producing a value between 0.0 and 1.0.
- If the similarity meets or exceeds the threshold (default: 0.7), the scenario passes.
The raw cosine similarity value is used as the score. Unlike PassFail or Exact Match which return only 0.0 or 1.0, the Similarity scorer returns a continuous value. For example, a score of 0.85 means the output is quite similar to the expected output, while 0.45 means it is only loosely related.
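The comparison itself reduces to a single formula. Below is a minimal sketch of cosine similarity on plain Python lists; the real scorer applies it to the model's embedding vectors (384-dimensional for all-MiniLM-L6-v2), not to hand-written lists like these:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

Sentence embeddings of natural language text in practice land between these extremes, which is what makes the continuous score meaningful.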
Installation
The Similarity scorer requires the sentence-transformers library, which is not included in the base AgenticAssure installation. Install it with the similarity extra:
```bash
pip install agenticassure[similarity]
```

This pulls in sentence-transformers, torch, and their dependencies. The total download size is significant (several hundred MB) due to the PyTorch dependency.
If you attempt to use the Similarity scorer without this extra installed, you will receive an ImportError with instructions to install it.
Setting the Threshold
The default threshold is 0.7. A scenario passes if its cosine similarity score is greater than or equal to the threshold.
Per-scenario threshold
Override the threshold for individual scenarios using metadata.similarity_threshold:
```yaml
suite:
  name: response-quality
  adapter: my_project.adapters.MyAdapter
  scorer: similarity
  scenarios:
    - name: strict-match
      input: "What is our refund policy?"
      expected_output: "You can request a full refund within 30 days of purchase."
      metadata:
        similarity_threshold: 0.9
    - name: loose-match
      input: "Tell me a fun fact about space"
      expected_output: "The sun is a star at the center of our solar system."
      metadata:
        similarity_threshold: 0.5
```

The threshold value must be a float between 0.0 and 1.0.
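The override logic amounts to a metadata lookup with a fallback and a range check. `resolve_threshold` below is a hypothetical helper sketching that behavior, not AgenticAssure's actual internal API:

```python
DEFAULT_THRESHOLD = 0.7

def resolve_threshold(scenario):
    # Hypothetical sketch: read metadata.similarity_threshold if present,
    # fall back to the suite default, and validate the range.
    value = float(scenario.get("metadata", {}).get("similarity_threshold", DEFAULT_THRESHOLD))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"similarity_threshold must be between 0.0 and 1.0, got {value}")
    return value

print(resolve_threshold({"metadata": {"similarity_threshold": 0.9}}))  # 0.9
print(resolve_threshold({"input": "Tell me a fun fact about space"}))  # 0.7
```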
Choosing the Right Threshold
The ideal threshold depends on how much variation you expect in the output and how strict your requirements are.
| Threshold | Behavior | Good For |
|---|---|---|
| 0.9 — 1.0 | Very strict. Output must be nearly identical in meaning and wording. | Paraphrasing tasks, translation quality, outputs with a single correct answer |
| 0.7 — 0.9 | Moderate. Allows rewording and variation while ensuring the core meaning is preserved. | General-purpose Q&A, summarization, typical agent responses |
| 0.5 — 0.7 | Loose. Accepts outputs that are topically related even if the details differ. | Creative tasks, open-ended responses, topic verification |
| Below 0.5 | Very loose. Almost anything vaguely related will pass. | Rarely useful; consider whether this test is providing value |
Start at 0.7 (the default) and adjust based on observed results. Run your test suite, review the similarity scores in the report, and tighten or loosen thresholds based on what you see.
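One way to structure that review is to pull out the scenarios that land near the threshold. `borderline` below is a hypothetical helper operating on a plain dict of report scores, not part of AgenticAssure:

```python
def borderline(scores, threshold=0.7, margin=0.05):
    # Hypothetical helper: flag scenarios whose similarity score falls
    # within `margin` of the threshold, for manual review.
    return sorted(
        (name, score) for name, score in scores.items()
        if abs(score - threshold) < margin
    )

report = {"return-policy": 0.86, "warranty": 0.68, "shipping": 0.71}
print(borderline(report))  # [('shipping', 0.71), ('warranty', 0.68)]
```

Scenarios well clear of the threshold on either side rarely need attention; the near-misses are where a threshold adjustment or an agent fix pays off.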
The Embedding Model
The default model is all-MiniLM-L6-v2, a lightweight sentence-transformer model. It is:
- Fast (typically under 100ms per encoding on CPU).
- Small (approximately 80 MB download, 90 MB in memory).
- Effective for general English text similarity.
- Trained on over 1 billion sentence pairs.
The model is downloaded automatically from Hugging Face on first use and cached locally in the default Hugging Face cache directory (~/.cache/huggingface/ on Linux/macOS, C:\Users\<user>\.cache\huggingface\ on Windows).
Hugging Face rate limits
If you are running tests in a CI environment or making many concurrent requests, you may encounter Hugging Face Hub rate limits when downloading the model for the first time. Set the HF_TOKEN environment variable with a Hugging Face access token to increase your rate limit:
```bash
export HF_TOKEN=hf_your_token_here
agenticassure run suite.yaml
```

Once the model is cached locally, the token is not needed for subsequent runs.
Performance Considerations
Model loading time
The sentence-transformer model is lazy-loaded on first use. The first scenario scored with the Similarity scorer will take longer (typically 1-3 seconds) as the model loads into memory. Subsequent scenarios reuse the already-loaded model and are fast (milliseconds per comparison).
This means:
- If you have one Similarity-scored scenario in a suite of 50 PassFail scenarios, only that one scenario pays the model-loading cost.
- If you have 50 Similarity-scored scenarios, only the first one is slow; the remaining 49 use the cached model.
CPU vs. GPU
The default model runs on CPU and is fast enough for typical test suites (hundreds of scenarios). If you are running thousands of scenarios with Similarity scoring, a CUDA-capable GPU will provide a significant speedup. The sentence-transformers library automatically uses GPU when available.
Memory usage
The all-MiniLM-L6-v2 model uses approximately 90 MB of RAM. This is usually not a concern, but be aware of it in memory-constrained CI environments.
Caching
The model is loaded once per process and reused for all subsequent scoring calls. There is no need to configure caching — it is handled automatically.
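The once-per-process pattern can be sketched with `functools.lru_cache`. In this sketch a dict stands in for the real SentenceTransformer instance; this is illustrative, not AgenticAssure's actual code:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_scorer_model():
    # In the real scorer the expensive work (importing torch, loading
    # all-MiniLM-L6-v2) would happen here, once per process.
    print("loading model...")  # printed only on the first call
    return {"name": "all-MiniLM-L6-v2"}

first = get_scorer_model()   # slow path: performs the load
second = get_scorer_model()  # fast path: returns the cached instance
print(first is second)       # True
```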
Example Scenarios
Basic semantic comparison
```yaml
suite:
  name: faq-agent
  adapter: my_project.adapters.FAQAdapter
  scorer: similarity
  scenarios:
    - name: return-policy
      input: "How do I return an item?"
      expected_output: "To return an item, visit our returns portal within 30 days of delivery and follow the instructions to print a shipping label."
```

The agent might respond with: “You can return your purchase by going to the returns page on our website. You have 30 days from when it was delivered. The site will generate a return shipping label for you.”
This is a different wording but conveys the same information. With a threshold of 0.7, this would likely score around 0.80-0.90 and pass.
Mixed scorers in one suite
You can use different scorers for different scenarios within the same suite by overriding at the scenario level:
```yaml
suite:
  name: mixed-evaluation
  adapter: my_project.adapters.MyAdapter
  scorer: passfail
  scenarios:
    - name: tool-check
      input: "Look up order #123"
      expected_tools:
        - query_orders
    - name: response-quality
      input: "Explain our warranty"
      expected_output: "All products come with a one-year limited warranty covering manufacturing defects."
      scorer: similarity
      metadata:
        similarity_threshold: 0.75
```

The first scenario uses the suite-level PassFail scorer. The second overrides it with Similarity.
The Details Field
On success:

```
Similarity: PASS. Cosine similarity 0.847 >= threshold 0.700.
```

On failure:

```
Similarity: FAIL. Cosine similarity 0.432 < threshold 0.700.
```

On failure due to a missing expected_output:

```
Similarity: FAIL. No expected_output defined for this scenario.
```

The reported similarity value is useful for calibrating your thresholds. If you see many scenarios scoring 0.65 and failing at the 0.7 threshold, consider whether lowering the threshold to 0.6 is appropriate or whether the agent’s responses genuinely need improvement.
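The shape of these messages can be reproduced with a short formatter. `format_details` is a hypothetical sketch of that shape, not the scorer's actual implementation:

```python
def format_details(score, threshold, has_expected=True):
    # Hypothetical sketch of the details-field message format.
    if not has_expected:
        return "Similarity: FAIL. No expected_output defined for this scenario."
    verdict, op = ("PASS", ">=") if score >= threshold else ("FAIL", "<")
    return f"Similarity: {verdict}. Cosine similarity {score:.3f} {op} threshold {threshold:.3f}."

print(format_details(0.847, 0.7))
# Similarity: PASS. Cosine similarity 0.847 >= threshold 0.700.
```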
When to Use Similarity vs. Other Scorers
| Situation | Recommended Scorer |
|---|---|
| Agent should produce a specific known string | Exact Match |
| Agent should mention specific keywords or phrases | PassFail (substring check) |
| Agent should produce output in a specific format | Regex |
| Agent should convey the same meaning as a reference answer | Similarity |
| Agent should produce a high-quality natural language response | Similarity |
| You want to detect semantic drift over time | Similarity |
Similarity is the best choice when you care about what the agent says rather than how it says it. It handles synonyms, paraphrasing, and rewording gracefully.
See Also
- PassFail Scorer — the default scorer
- Exact Match Scorer — strict string equality
- Regex Scorer — pattern-based matching
- Custom Scorers — building your own
- all-MiniLM-L6-v2 on Hugging Face — model details