
PassFail Scorer

The PassFail scorer is the default scorer in AgenticAssure. It runs up to four sequential checks against an agent’s output and tool usage, producing a binary pass (1.0) or fail (0.0) score. It is the right choice for the vast majority of test scenarios and requires no extra dependencies.


How It Works

When the PassFail scorer evaluates a scenario, it walks through a series of checks in order. All checks that apply must pass for the scenario to receive a score of 1.0. If any check fails, the score is 0.0 and evaluation stops.

The four checks, in order:

| # | Check | Triggered When | What It Verifies |
|---|-------|----------------|------------------|
| 1 | Output produced | Always | The agent returned a non-empty output string |
| 2 | Expected tools called | expected_tools is set on the scenario | Every tool listed in expected_tools appears in the agent result's tools_called list |
| 3 | Tool arguments match | expected_tool_args is set on the scenario | The arguments recorded for each tool call match the expected argument values |
| 4 | Expected output substring | expected_output is set on the scenario | expected_output appears as a substring of the agent's output (case-insensitive) |

If a scenario does not define expected_tools, expected_tool_args, or expected_output, those checks are simply skipped — only the checks that are relevant to the scenario are applied.
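The sequential, short-circuiting check flow can be sketched as a single function. This is an illustrative approximation, not the actual AgenticAssure implementation; the scenario and result field names (output, tools_called, tool_args) are assumptions based on the check descriptions on this page.

```python
def pass_fail(scenario: dict, result: dict) -> float:
    """Sketch of the PassFail check sequence (hypothetical field names)."""
    # Check 1: always runs -- output must be a non-empty, non-whitespace string.
    output = result.get("output") or ""
    if not output.strip():
        return 0.0

    # Check 2: every expected tool must appear in tools_called at least once.
    for tool in scenario.get("expected_tools", []):
        if tool not in result.get("tools_called", []):
            return 0.0

    # Check 3: each explicitly listed argument must match the recorded call;
    # extra arguments the agent passed are ignored.
    for tool, expected_args in scenario.get("expected_tool_args", {}).items():
        recorded = result.get("tool_args", {}).get(tool, {})
        for key, value in expected_args.items():
            if recorded.get(key) != value:
                return 0.0

    # Check 4: case-insensitive substring search over the full output.
    expected = scenario.get("expected_output")
    if expected is not None and expected.lower() not in output.lower():
        return 0.0

    return 1.0
```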


Check Details

Check 1: Output Produced

This check always runs. It verifies that the agent returned something — any non-empty string counts as a pass. An agent that returns "", None, or a whitespace-only string will fail this check.

Why this matters: If an agent silently fails, returns an error, or produces no output at all, that is almost always a test failure. This check catches those cases before any content-level validation happens.

Check 2: Expected Tools Called

This check runs only when the scenario defines an expected_tools list. It verifies that every tool in that list was actually invoked by the agent during execution.

The check compares the scenario’s expected_tools list against the tools_called field on the AgentResult. Every expected tool must appear at least once. The agent is allowed to call additional tools beyond those listed — only the presence of the expected ones is verified.

```yaml
scenarios:
  - name: weather-lookup
    input: "What is the weather in Seattle?"
    expected_tools:
      - get_weather
```

If the agent calls get_weather and format_response, this scenario passes the tool check. If the agent skips get_weather entirely, it fails.

Check 3: Tool Arguments Match

This check runs only when the scenario defines expected_tool_args. It verifies that the arguments passed to specific tool calls match the expected values.

```yaml
scenarios:
  - name: weather-lookup-args
    input: "What is the weather in Seattle?"
    expected_tools:
      - get_weather
    expected_tool_args:
      get_weather:
        location: "Seattle"
```

The scorer checks that the get_weather tool was called with a location argument equal to "Seattle". Only the arguments explicitly listed are checked — the agent may pass additional arguments without causing a failure.

Check 4: Expected Output Substring

This check runs only when the scenario defines expected_output. It performs a case-insensitive substring search — the expected output must appear somewhere within the agent’s full output.

```yaml
scenarios:
  - name: greeting
    input: "Say hello"
    expected_output: "hello"
```

If the agent responds with "Hello there! How can I help you?", this passes because "hello" is found within the output (case-insensitive). If the agent responds with "Hi! What can I do for you?", this fails because "hello" does not appear anywhere in the response.
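The comparison behind this check amounts to a lowercased containment test. A minimal sketch (not the actual implementation) that reproduces the two examples above:

```python
def substring_check(expected: str, output: str) -> bool:
    # Case-insensitive substring search, as described for Check 4.
    return expected.lower() in output.lower()
```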


When to Use PassFail

Use the PassFail scorer when:

  • You want a simple yes/no determination of whether the agent behaved correctly.
  • You need to verify tool usage (which tools were called and with what arguments).
  • You want to check that the output contains certain key phrases or information without requiring an exact match.
  • You are writing your first scenarios and want a sensible default.

PassFail is the best starting point for most test suites. It covers the common case of “did the agent do roughly the right thing?” without being so strict that minor wording changes cause failures.


Example Scenarios

Minimal scenario (output-only check)

```yaml
scenarios:
  - name: basic-response
    input: "Summarize the company's return policy"
```

Only Check 1 runs. The scenario passes as long as the agent produces any non-empty output.

Substring check

```yaml
scenarios:
  - name: refund-info
    input: "How do I get a refund?"
    expected_output: "30 days"
```

Checks 1 and 4 run. The agent must produce output, and that output must contain the substring "30 days" (case-insensitive).

Tool usage check

```yaml
scenarios:
  - name: database-query
    input: "Look up order #12345"
    expected_tools:
      - query_orders
      - format_order_summary
```

Checks 1 and 2 run. The agent must produce output and must call both query_orders and format_order_summary.

Tool arguments check

```yaml
scenarios:
  - name: database-query-with-args
    input: "Look up order #12345"
    expected_tools:
      - query_orders
    expected_tool_args:
      query_orders:
        order_id: "12345"
    expected_output: "order #12345"
```

All four checks run. The agent must:

  1. Produce non-empty output.
  2. Call query_orders.
  3. Pass order_id: "12345" to query_orders.
  4. Include "order #12345" somewhere in its output.

The Details Field

Every ScoreResult includes a details string that explains exactly what happened during scoring. The PassFail scorer builds this explanation incrementally, listing the result of each check that was evaluated.

A passing result might look like:

Output produced: PASS. Expected tools called: PASS (get_weather). Tool arguments match: PASS. Expected output found: PASS ("30 days" found in output).

A failing result stops at the first failure:

Output produced: PASS. Expected tools called: FAIL (missing: format_response).

This field is shown in CLI reports, HTML reports, and JSON output. It is especially useful for debugging why a scenario failed.
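One way such a string could be assembled, stopping at the first failure, is sketched below. The check names and wording mirror the examples above; the function itself is hypothetical, not the AgenticAssure source.

```python
def build_details(checks: list[tuple[str, bool, str]]) -> str:
    # checks: ordered (name, passed, note) tuples; stop after first failure.
    parts = []
    for name, passed, note in checks:
        status = "PASS" if passed else "FAIL"
        suffix = f" ({note})" if note else ""
        parts.append(f"{name}: {status}{suffix}.")
        if not passed:
            break
    return " ".join(parts)
```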


Limitations and Edge Cases

Substring matching is loose. The expected output check only confirms that a substring is present. If expected_output is "yes", a response of "I'm not saying yes to that" would still pass. For strict matching, use the Exact Match scorer or the Regex scorer.

Case insensitivity applies only to the output check. Tool names and tool argument values are compared using their original casing. If your agent reports calling Get_Weather but the scenario expects get_weather, the tool check will fail.

Order of tools is not checked. The scorer verifies that expected tools were called, but not in any particular order. If order matters, consider writing a custom scorer.
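If order does matter, the core of such a custom scorer is a subsequence test: the expected tools must appear in tools_called in the given relative order, with other calls allowed in between. A minimal sketch (the function name and signature are illustrative, not part of the AgenticAssure API):

```python
def tools_called_in_order(expected: list[str], called: list[str]) -> bool:
    # True when expected is a subsequence of called; the agent may
    # interleave other tool calls without failing the check.
    it = iter(called)
    return all(tool in it for tool in expected)
```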

Only one expected_output string is supported. You cannot provide a list of required substrings. If you need to check for multiple phrases, use the Regex scorer with a pattern like (?=.*phrase one)(?=.*phrase two) or write a custom scorer.
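The lookahead pattern above relies on each (?=.*…) assertion matching independently from the start of the text. Note that . does not match newlines by default, so multi-line agent output needs a DOTALL-style flag; in Python terms:

```python
import re

# Both phrases must appear somewhere, in any order. DOTALL lets .* span
# newlines; IGNORECASE mirrors the substring check's case handling.
pattern = re.compile(r"(?=.*phrase one)(?=.*phrase two)",
                     re.IGNORECASE | re.DOTALL)

def contains_all_phrases(output: str) -> bool:
    return pattern.search(output) is not None
```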

Empty expected_output. If you set expected_output: "", the check will trivially pass since an empty string is a substring of any string. Avoid this — omit the field entirely if you do not want to check output content.

Whitespace sensitivity. The output check is case-insensitive but is not whitespace-normalized. Extra spaces, newlines, or tabs in the agent output could cause a mismatch if your expected_output does not account for them.
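When whitespace differences are a problem, one workaround in a custom scorer is to collapse runs of whitespace on both sides before comparing. This is a sketch of that pre-processing step, not something the built-in check does:

```python
def normalized_contains(expected: str, output: str) -> bool:
    # Collapse all whitespace runs to single spaces, then do the
    # case-insensitive substring check.
    def norm(s: str) -> str:
        return " ".join(s.split()).lower()
    return norm(expected) in norm(output)
```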

