Testing Tool Usage
Many AI agents interact with external systems through tool calls (also called function calls). An agent might search a database, call an API, update a record, or perform a calculation. Testing that the agent calls the right tools with the right arguments is just as important as testing its final text output — and in some cases, more important.
Why Tool Testing Matters
Consider a customer support agent that looks up order status. The agent could:
- Actually call the lookup_order tool, get the real status, and relay it to the user.
- Skip the tool call entirely and hallucinate a plausible-sounding status.
Both responses might look correct in the final output. Without tool testing, option 2 would pass your tests even though the agent is fabricating information. Tool testing catches this class of failure.
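To make the difference concrete, the sketch below models two hypothetical agent runs that return the same text. The AgentRun class and both check functions are illustrative assumptions, not part of AgenticAssure's API; they only show why an output-only check cannot tell the two runs apart.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent execution (not an AgenticAssure type)."""
    output: str
    tool_calls: list = field(default_factory=list)  # names of the tools called


def output_check(run: AgentRun, expected_substring: str) -> bool:
    # Output-only check: case-insensitive substring match on the final text.
    return expected_substring.lower() in run.output.lower()


def tool_check(run: AgentRun, expected_tool: str) -> bool:
    # Tool check: the expected tool must appear among the recorded calls.
    return expected_tool in run.tool_calls


honest = AgentRun("Order ORD-789 has shipped.", tool_calls=["lookup_order"])
fabricated = AgentRun("Order ORD-789 has shipped.", tool_calls=[])

# Both runs pass the output check.
print(output_check(honest, "ORD-789"), output_check(fabricated, "ORD-789"))
# Only the run that really called the tool passes the tool check.
print(tool_check(honest, "lookup_order"), tool_check(fabricated, "lookup_order"))
```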
Tool testing is especially critical for:
- Data retrieval agents: Must actually query the data source, not hallucinate answers.
- Action-taking agents: Must call the right API to create, update, or delete resources.
- Multi-step agents: Must execute steps in the correct order with correct parameters.
- Agents with access controls: Must call authorization checks before performing sensitive actions.
expected_tools — Verifying Tool Calls
The expected_tools field is a list of tool names that the agent must call during execution. The passfail scorer checks that every tool in this list was called at least once.
- name: weather_lookup
  input: "What is the weather in Tokyo?"
  expected_tools:
    - get_weather
  scorers:
    - passfail
This scenario passes if the agent calls get_weather at any point during execution. It does not check the order of calls, the number of times the tool was called, or what arguments were passed.
Multiple Expected Tools
When an agent should call more than one tool, list them all:
- name: order_with_shipping
  input: "Place an order for item SKU-100 and calculate shipping to 90210"
  expected_tools:
    - create_order
    - calculate_shipping
  scorers:
    - passfail
Both create_order and calculate_shipping must be called for the scenario to pass. The order in which they are called does not matter.
What expected_tools Does NOT Check
- Call order: Tools can be called in any sequence.
- Call count: Calling a tool multiple times still satisfies the check.
- Extra tools: If the agent calls additional tools not listed in expected_tools, the scenario still passes. The check is for presence, not exclusivity.
- Arguments: Use expected_tool_args separately for argument verification.
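The presence-only rule can be sketched in a few lines of Python. The helper name expected_tools_satisfied is hypothetical and this is not AgenticAssure's implementation; it only mirrors the behavior described in this section.

```python
def expected_tools_satisfied(expected_tools, called_tools):
    """Presence check: every expected tool name appears at least once.

    Illustrative helper only; it mirrors the rules described above and is
    not AgenticAssure's implementation.
    """
    called = set(called_tools)  # a set discards call order and call count
    return all(tool in called for tool in expected_tools)


expected = ["create_order", "calculate_shipping"]

# Call order does not matter.
assert expected_tools_satisfied(expected, ["calculate_shipping", "create_order"])
# Repeated calls and extra tools do not matter.
assert expected_tools_satisfied(
    expected, ["get_product", "create_order", "create_order", "calculate_shipping"]
)
# A missing expected tool fails the check.
assert not expected_tools_satisfied(expected, ["create_order"])
```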
expected_tool_args — Verifying Arguments
The expected_tool_args field is a dictionary mapping tool names to the arguments they should receive. This lets you verify not just that the agent called a tool, but that it extracted the right parameters from the user’s input.
- name: weather_with_args
  input: "What is the weather in Tokyo?"
  expected_tools:
    - get_weather
  expected_tool_args:
    get_weather:
      location: "Tokyo"
  scorers:
    - passfail
How Argument Matching Works
Argument matching is partial (also called “subset matching”). AgenticAssure checks that every key-value pair you specify in expected_tool_args is present in the actual tool call arguments. The agent is free to pass additional arguments not listed in the expected set.
For example, given this expected configuration:
expected_tool_args:
  search_products:
    category: "electronics"
The scenario will pass if the agent calls search_products with:
{"category": "electronics", "sort_by": "price", "limit": 10}
The extra sort_by and limit arguments are ignored. Only the specified category key is checked.
A call with these arguments fails because the value of category does not match:
{"category": "clothing", "sort_by": "price"}
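The subset rule can be sketched as a small Python function. The name args_match is an assumption made for illustration, and the comparison uses plain equality; how AgenticAssure itself handles type differences or nested values is not covered here.

```python
def args_match(expected_args, actual_args):
    """Subset match: each expected key must be present with an equal value.

    Hypothetical helper for illustration. It compares with ==, so it makes
    no claim about how AgenticAssure treats type differences (for example
    500 versus "500") or nested values.
    """
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected_args.items()
    )


expected = {"category": "electronics"}

passing_call = {"category": "electronics", "sort_by": "price", "limit": 10}
failing_call = {"category": "clothing", "sort_by": "price"}

print(args_match(expected, passing_call))  # True: extra keys are ignored
print(args_match(expected, failing_call))  # False: category differs
```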
Multiple Tool Argument Checks
You can specify argument expectations for multiple tools in the same scenario:
- name: transfer_funds
  input: "Transfer $500 from checking to savings"
  expected_tools:
    - validate_account
    - execute_transfer
  expected_tool_args:
    validate_account:
      account_type: "checking"
    execute_transfer:
      amount: 500
      from_account: "checking"
      to_account: "savings"
  scorers:
    - passfail
When a Tool Is Called Multiple Times
If the agent calls the same tool more than once, AgenticAssure checks the arguments against the first matching call. This is important to be aware of when testing agents that iterate or retry tool calls.
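The sketch below shows one reading of this rule, in which "first matching call" means the first recorded call with that tool name. That reading, the check_first_call helper, and the (name, arguments) call format are all assumptions made for illustration, not AgenticAssure's code.

```python
def check_first_call(tool_name, expected_args, tool_calls):
    """Check expected_args against the first recorded call to tool_name.

    tool_calls is a list of (name, arguments) pairs in call order. This is
    an assumed interpretation of "first matching call", not library code.
    """
    for name, actual_args in tool_calls:
        if name == tool_name:
            # Only the first call with this name is inspected (subset match).
            return all(
                key in actual_args and actual_args[key] == value
                for key, value in expected_args.items()
            )
    return False  # the tool was never called


expected = {"query": "two-factor authentication"}

# The agent searches with a poor query first, then retries with a better one.
retry_run = [
    ("search_docs", {"query": "2fa"}),
    ("search_docs", {"query": "two-factor authentication"}),
]
# The agent gets the query right on its first attempt.
direct_run = [("search_docs", {"query": "two-factor authentication"})]

print(check_first_call("search_docs", expected, retry_run))
print(check_first_call("search_docs", expected, direct_run))
```

Under this reading, the retry run fails even though a later call carries the expected arguments, so it may be safest to assert only on arguments the agent should get right on its first attempt.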
Combining Tool Checks with Output Checks
For the most thorough testing, combine tool verification with output verification in a single scenario:
- name: order_lookup_complete
  input: "What is the status of order #ORD-789?"
  expected_output: "ORD-789"
  expected_tools:
    - lookup_order
  expected_tool_args:
    lookup_order:
      order_id: "ORD-789"
  scorers:
    - passfail
The passfail scorer will verify all of the following:
- The agent produced non-empty output.
- The output contains “ORD-789” (case-insensitive substring match).
- The agent called the lookup_order tool.
- The lookup_order call included order_id: "ORD-789" in its arguments.
All four checks must pass for the scenario to pass.
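As a rough illustration of how the four checks combine, the sketch below evaluates them for the scenario above. The run_checks function and the (name, arguments) call format are hypothetical; the real passfail scorer may be structured differently.

```python
def run_checks(output, tool_calls, expected_output, expected_tools, expected_tool_args):
    """Evaluate the four checks described above and return (details, passed).

    tool_calls is a list of (name, arguments) pairs in call order. This is
    an illustrative sketch, not the passfail scorer's actual code.
    """
    first_args = {}
    for name, args in tool_calls:
        first_args.setdefault(name, args)  # keep only the first call per tool

    checks = {
        "non_empty_output": bool(output.strip()),
        # Case-insensitive substring match on the final text.
        "output_contains": expected_output.lower() in output.lower(),
        "tools_called": all(tool in first_args for tool in expected_tools),
        # Subset match of expected arguments against the first call per tool.
        "args_match": all(
            tool in first_args
            and all(k in first_args[tool] and first_args[tool][k] == v
                    for k, v in wanted.items())
            for tool, wanted in expected_tool_args.items()
        ),
    }
    return checks, all(checks.values())


scenario = dict(
    expected_output="ORD-789",
    expected_tools=["lookup_order"],
    expected_tool_args={"lookup_order": {"order_id": "ORD-789"}},
)

# A run that really looked the order up.
real_checks, real_passed = run_checks(
    "Order ord-789 is out for delivery.",
    [("lookup_order", {"order_id": "ORD-789"})],
    **scenario,
)
# A run that produced the same text without calling any tool.
fake_checks, fake_passed = run_checks(
    "Order ord-789 is out for delivery.", [], **scenario
)
print(real_passed, fake_passed)
```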
Example Scenarios for Tool Testing
Lookup / Read Operations
- name: customer_profile_lookup
  input: "Show me the profile for customer jane@example.com"
  expected_tools:
    - get_customer
  expected_tool_args:
    get_customer:
      email: "jane@example.com"
  scorers:
    - passfail
  tags:
    - tools
    - read
- name: search_knowledge_base
  input: "How do I configure two-factor authentication?"
  expected_tools:
    - search_docs
  expected_tool_args:
    search_docs:
      query: "two-factor authentication"
  scorers:
    - passfail
  tags:
    - tools
    - read
Create Operations
- name: create_support_ticket
  input: "I need to report a bug -- the checkout page crashes on mobile"
  expected_tools:
    - create_ticket
  expected_tool_args:
    create_ticket:
      category: "bug"
  expected_output: "ticket"
  scorers:
    - passfail
  tags:
    - tools
    - create
- name: schedule_appointment
  input: "Book me an appointment for next Tuesday at 2pm"
  expected_tools:
    - create_appointment
  scorers:
    - passfail
  tags:
    - tools
    - create
Update Operations
- name: update_shipping_address
  input: "Change my shipping address to 123 Main St, Springfield, IL 62701"
  expected_tools:
    - update_address
  expected_tool_args:
    update_address:
      street: "123 Main St"
      city: "Springfield"
      state: "IL"
      zip: "62701"
  scorers:
    - passfail
  tags:
    - tools
    - update
Delete Operations
- name: cancel_subscription
  input: "Cancel my premium subscription"
  expected_tools:
    - cancel_subscription
  expected_output: "cancelled"
  scorers:
    - passfail
  tags:
    - tools
    - delete
Multi-Tool Workflows
- name: purchase_flow
  input: "Buy 2 units of SKU-WIDGET-01 and ship to my default address"
  expected_tools:
    - get_product
    - get_default_address
    - create_order
  expected_tool_args:
    get_product:
      sku: "SKU-WIDGET-01"
    create_order:
      quantity: 2
  scorers:
    - passfail
  tags:
    - tools
    - multi-step
    - orders
Common Patterns
Verifying the Agent Does NOT Call a Tool
AgenticAssure does not have a built-in “must not call” check, and scenario design alone can only approximate one. If the agent should answer from its own knowledge without calling a tool, omit expected_tools and expected_tool_args, then use passfail or similarity to verify that the output is correct. This confirms the answer but not the absence of tool calls: an agent that incorrectly calls a tool will typically still pass the output check.
For strict “must not call” assertions, consider writing a custom scorer or checking tool call counts in a programmatic test.
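A programmatic check of this kind might look like the sketch below. The forbidden_tools_called helper and the test function are hypothetical, and the list of recorded tool names is a stand-in: how you capture tool calls depends on your agent adapter, which this example does not assume anything about.

```python
def forbidden_tools_called(tool_calls, forbidden):
    """Return the sorted forbidden tool names that appear in tool_calls.

    tool_calls is a list of tool names recorded during one run. This helper
    is a hypothetical building block, not an AgenticAssure scorer.
    """
    return sorted(set(tool_calls) & set(forbidden))


def test_general_knowledge_question_uses_no_tools():
    # Stand-in data: replace with the tool names your agent actually reported.
    recorded_calls = []
    violations = forbidden_tools_called(
        recorded_calls, {"lookup_order", "get_customer"}
    )
    assert not violations, f"agent called forbidden tools: {violations}"


# Example: one forbidden tool was called twice, so it is reported once.
print(forbidden_tools_called(
    ["get_weather", "lookup_order", "lookup_order"],
    {"lookup_order", "get_customer"},
))
```

Returning the offending names rather than a bare boolean keeps the failure message useful when several tools are forbidden at once.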
Testing Error Propagation from Tools
Write scenarios where the expected tool would return an error (e.g., “order not found”) and verify the agent handles it gracefully:
- name: order_not_found
  input: "What is the status of order #NONEXISTENT?"
  expected_tools:
    - lookup_order
  expected_output: "could not find"
  scorers:
    - passfail
  tags:
    - error-handling
    - tools
Testing Argument Extraction Quality
Write scenarios that challenge the agent’s ability to extract parameters from natural language:
- name: date_extraction
  input: "Schedule a meeting for the first Monday in April"
  expected_tools:
    - create_event
  scorers:
    - passfail
  tags:
    - tools
    - extraction
- name: numeric_extraction
  input: "Transfer fifty thousand dollars to account ending in 4567"
  expected_tools:
    - execute_transfer
  expected_tool_args:
    execute_transfer:
      amount: 50000
  scorers:
    - passfail
  tags:
    - tools
    - extraction
Tips
- Always pair expected_tool_args with expected_tools: If you specify expected arguments for a tool, also list that tool in expected_tools for clarity, even though expected_tool_args implicitly checks for the tool’s presence.
- Start with expected_tools only: Begin by verifying the agent calls the right tools. Add expected_tool_args once you have confidence in the basic flow.
- Use partial argument matching to your advantage: You do not need to specify every argument. Focus on the ones derived from user input that the agent must extract correctly.
- Tag tool-related scenarios: Use tags like tools, read, create, update, or delete to organize and selectively run tool tests.