Testing Tool Usage

Many AI agents interact with external systems through tool calls (also called function calls). An agent might search a database, call an API, update a record, or perform a calculation. Testing that the agent calls the right tools with the right arguments is just as important as testing its final text output, and in some cases more important.

Why Tool Testing Matters

Consider a customer support agent that looks up order status. The agent could:

1. Actually call the lookup_order tool, get the real status, and relay it to the user.
2. Skip the tool call entirely and hallucinate a plausible-sounding status.

Both responses might look correct in the final output. Without tool testing, option 2 would pass your tests even though the agent is fabricating information. Tool testing catches this class of failure.

Tool testing is especially critical for:

Data retrieval agents: Must actually query the data source, not hallucinate answers.
Action-taking agents: Must call the right API to create, update, or delete resources.
Multi-step agents: Must execute steps in the correct order with correct parameters.
Agents with access controls: Must call authorization checks before performing sensitive actions.

`expected_tools` — Verifying Tool Calls

The expected_tools field is a list of tool names that the agent must call during execution. The passfail scorer checks that every tool in this list was called at least once.


- name: weather_lookup
  input: "What is the weather in Tokyo?"
  expected_tools:
    - get_weather
  scorers:
    - passfail

This scenario passes if the agent calls get_weather at any point during execution. It does not check the order of calls, the number of times the tool was called, or what arguments were passed.

Multiple Expected Tools

When an agent should call more than one tool, list them all:


- name: order_with_shipping
  input: "Place an order for item SKU-100 and calculate shipping to 90210"
  expected_tools:
    - create_order
    - calculate_shipping
  scorers:
    - passfail

Both create_order and calculate_shipping must be called for the scenario to pass. The order in which they are called does not matter.

What `expected_tools` Does NOT Check

Call order: Tools can be called in any sequence.
Call count: Calling a tool multiple times still satisfies the check.
Extra tools: If the agent calls additional tools not listed in expected_tools, the scenario still passes. The check is for presence, not exclusivity.
Arguments: Use expected_tool_args separately for argument verification.
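
The presence-only behavior described above can be sketched in a few lines of Python. This is an illustration of the semantics, not AgenticAssure's implementation; the helper name `missing_tools` and the flat list of called tool names are assumptions made for the example.

```python
def missing_tools(expected_tools, called_tools):
    """Return the expected tool names that never appear in called_tools.

    Hypothetical helper: it models a presence-only check, so call order,
    repeated calls, and extra tools have no effect on the result.
    """
    called = set(called_tools)
    return [name for name in expected_tools if name not in called]


expected = ["create_order", "calculate_shipping"]

# Different order, a repeated call, and an extra tool: nothing is missing.
calls = ["calculate_shipping", "get_product", "create_order", "create_order"]
print(missing_tools(expected, calls))  # []

# One expected tool was never called, so the check would fail.
print(missing_tools(expected, ["create_order"]))  # ['calculate_shipping']
```

An empty result corresponds to a passing scenario; any name in the list is a tool the agent was expected to call but did not.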

`expected_tool_args` — Verifying Arguments

The expected_tool_args field is a dictionary mapping tool names to the arguments they should receive. This lets you verify not just that the agent called a tool, but that it extracted the right parameters from the user’s input.


- name: weather_with_args
  input: "What is the weather in Tokyo?"
  expected_tools:
    - get_weather
  expected_tool_args:
    get_weather:
      location: "Tokyo"
  scorers:
    - passfail

How Argument Matching Works

Argument matching is partial (also called “subset matching”). AgenticAssure checks that every key-value pair you specify in expected_tool_args is present in the actual tool call arguments. The agent is free to pass additional arguments not listed in the expected set.

For example, given this expected configuration:


expected_tool_args:
  search_products:
    category: "electronics"

The scenario will pass if the agent calls search_products with:


{"category": "electronics", "sort_by": "price", "limit": 10}

The extra sort_by and limit arguments are ignored. Only the specified category key is checked.

This will fail:


{"category": "clothing", "sort_by": "price"}

Because the value of category does not match.
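
As a rough model of subset matching, the following Python sketch reproduces the two outcomes above. The function name `args_match` is hypothetical, and comparing values with plain `==` is an assumption; the comparison AgenticAssure actually performs (for example, how it treats `500` versus `"500"`) is not specified here.

```python
def args_match(expected_args, actual_args):
    """Return True if every expected key is present with an equal value.

    Keys that appear only in actual_args are ignored (subset matching).
    Plain == comparison is an assumption made for this sketch.
    """
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected_args.items()
    )


expected = {"category": "electronics"}

# Extra arguments are ignored.
print(args_match(expected, {"category": "electronics", "sort_by": "price", "limit": 10}))  # True

# The value of category differs.
print(args_match(expected, {"category": "clothing", "sort_by": "price"}))  # False

# The expected key is absent altogether.
print(args_match(expected, {"sort_by": "price"}))  # False
```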

Multiple Tool Argument Checks

You can specify argument expectations for multiple tools in the same scenario:


- name: transfer_funds
  input: "Transfer $500 from checking to savings"
  expected_tools:
    - validate_account
    - execute_transfer
  expected_tool_args:
    validate_account:
      account_type: "checking"
    execute_transfer:
      amount: 500
      from_account: "checking"
      to_account: "savings"
  scorers:
    - passfail

When a Tool Is Called Multiple Times

If the agent calls the same tool more than once, AgenticAssure checks the arguments against the first matching call. This is important to be aware of when testing agents that iterate or retry tool calls.
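
The sketch below shows one possible reading of this rule, in which "first matching call" means the first call whose tool name matches. That reading is an assumption, as are the `name`/`arguments` record shape and the helper `first_call_args`; confirm the behavior against your own runs before relying on it.

```python
def first_call_args(tool_name, tool_calls):
    """Return the arguments of the first call to tool_name, or None if absent."""
    for call in tool_calls:
        if call["name"] == tool_name:
            return call["arguments"]
    return None


# Hypothetical trace: the agent retries search_products with a corrected category.
calls = [
    {"name": "search_products", "arguments": {"category": "electronic"}},
    {"name": "search_products", "arguments": {"category": "electronics"}},
]
expected = {"category": "electronics"}

first = first_call_args("search_products", calls)
matches = all(k in first and first[k] == v for k, v in expected.items())
print(first)    # {'category': 'electronic'}
print(matches)  # False under this reading: only the first call is inspected
```

If this reading holds, a retry with corrected arguments does not rescue a scenario whose first call was wrong, which is worth keeping in mind for agents that self-correct.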

Combining Tool Checks with Output Checks

For the most thorough testing, combine tool verification with output verification in a single scenario:


- name: order_lookup_complete
  input: "What is the status of order #ORD-789?"
  expected_output: "ORD-789"
  expected_tools:
    - lookup_order
  expected_tool_args:
    lookup_order:
      order_id: "ORD-789"
  scorers:
    - passfail

The passfail scorer will verify all of the following:

The agent produced non-empty output.
The output contains “ORD-789” (case-insensitive substring match).
The agent called the lookup_order tool.
The lookup_order call included order_id: "ORD-789" in its arguments.

All four checks must pass for the scenario to pass.
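
To make the combined behavior concrete, here is a small Python sketch of the four checks. It is not the passfail scorer's source code; `run_checks` and the `name`/`arguments` call records are hypothetical names chosen for illustration.

```python
def run_checks(output, tool_calls, expected_output, tool_name, expected_args):
    """Return a dict of check name -> bool for the four conditions listed above."""
    # Find the first recorded call to the expected tool, if any.
    call = next((c for c in tool_calls if c["name"] == tool_name), None)
    actual_args = call["arguments"] if call else {}
    return {
        "non_empty_output": bool(output.strip()),
        # Case-insensitive substring match.
        "output_contains": expected_output.lower() in output.lower(),
        "tool_called": call is not None,
        # Subset match: extra actual arguments are ignored.
        "args_match": call is not None and all(
            k in actual_args and actual_args[k] == v
            for k, v in expected_args.items()
        ),
    }


expected_args = {"order_id": "ORD-789"}

# Real tool call plus a matching answer: every check passes.
calls = [{"name": "lookup_order", "arguments": {"order_id": "ORD-789"}}]
results = run_checks("Order ord-789 has shipped.", calls, "ORD-789", "lookup_order", expected_args)
print(all(results.values()))  # True

# Fabricated answer with no tool call: the output checks pass, the tool checks fail.
hallucinated = run_checks("Order ORD-789 has shipped.", [], "ORD-789", "lookup_order", expected_args)
print(hallucinated["output_contains"], hallucinated["tool_called"])  # True False
```

The second case is the hallucination failure described at the top of this page: the text looks right, but the scenario fails because no lookup happened.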

Example Scenarios for Tool Testing

Lookup / Read Operations


- name: customer_profile_lookup
  input: "Show me the profile for customer jane@example.com"
  expected_tools:
    - get_customer
  expected_tool_args:
    get_customer:
      email: "jane@example.com"
  scorers:
    - passfail
  tags:
    - tools
    - read
 
- name: search_knowledge_base
  input: "How do I configure two-factor authentication?"
  expected_tools:
    - search_docs
  expected_tool_args:
    search_docs:
      query: "two-factor authentication"
  scorers:
    - passfail
  tags:
    - tools
    - read

Create Operations


- name: create_support_ticket
  input: "I need to report a bug -- the checkout page crashes on mobile"
  expected_tools:
    - create_ticket
  expected_tool_args:
    create_ticket:
      category: "bug"
  expected_output: "ticket"
  scorers:
    - passfail
  tags:
    - tools
    - create
 
- name: schedule_appointment
  input: "Book me an appointment for next Tuesday at 2pm"
  expected_tools:
    - create_appointment
  scorers:
    - passfail
  tags:
    - tools
    - create

Update Operations


- name: update_shipping_address
  input: "Change my shipping address to 123 Main St, Springfield, IL 62701"
  expected_tools:
    - update_address
  expected_tool_args:
    update_address:
      street: "123 Main St"
      city: "Springfield"
      state: "IL"
      zip: "62701"
  scorers:
    - passfail
  tags:
    - tools
    - update

Delete Operations


- name: cancel_subscription
  input: "Cancel my premium subscription"
  expected_tools:
    - cancel_subscription
  expected_output: "cancelled"
  scorers:
    - passfail
  tags:
    - tools
    - delete

Multi-Tool Workflows


- name: purchase_flow
  input: "Buy 2 units of SKU-WIDGET-01 and ship to my default address"
  expected_tools:
    - get_product
    - get_default_address
    - create_order
  expected_tool_args:
    get_product:
      sku: "SKU-WIDGET-01"
    create_order:
      quantity: 2
  scorers:
    - passfail
  tags:
    - tools
    - multi-step
    - orders

Common Patterns

Verifying the Agent Does NOT Call a Tool

AgenticAssure does not have a built-in “must not call” check. The closest declarative approximation is to omit expected_tools and expected_tool_args and use passfail or similarity to verify that the output is correct. This verifies the output only: if the agent calls a tool it should not have, the scenario will typically still pass.

For strict “must not call” assertions, consider writing a custom scorer or checking tool call counts in a programmatic test.
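
A programmatic check of this kind might look like the sketch below. It assumes you can obtain the list of tool names the agent called during a run; the function `forbidden_tool_calls` is hypothetical and is not part of AgenticAssure, and wiring it into a custom scorer depends on an interface not described here.

```python
from collections import Counter


def forbidden_tool_calls(forbidden_tools, called_tools):
    """Return {tool_name: call_count} for every forbidden tool that was called."""
    counts = Counter(called_tools)
    return {name: counts[name] for name in forbidden_tools if counts[name] > 0}


# The agent answered from its own knowledge: no violations.
print(forbidden_tool_calls(["get_weather"], []))  # {}

# The agent called a forbidden tool twice alongside an allowed one.
violations = forbidden_tool_calls(
    ["delete_account", "get_weather"],
    ["get_weather", "get_weather", "search_docs"],
)
print(violations)  # {'get_weather': 2}
```

In a test, asserting that the returned dictionary is empty gives a strict “must not call” guarantee, and the counts help when diagnosing a failure.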

Testing Error Propagation from Tools

Write scenarios where the expected tool would return an error (e.g., “order not found”) and verify the agent handles it gracefully:


- name: order_not_found
  input: "What is the status of order #NONEXISTENT?"
  expected_tools:
    - lookup_order
  expected_output: "could not find"
  scorers:
    - passfail
  tags:
    - error-handling
    - tools
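
For a scenario like this to be repeatable, the tool has to fail deterministically. One way to arrange that is a stub tool backed by fixed test data, sketched below in Python. The `FAKE_ORDERS` store, the error payload shape, and the way the stub would be registered with your agent are all assumptions for illustration.

```python
# Hypothetical in-memory order store used only in tests.
FAKE_ORDERS = {"ORD-789": {"status": "shipped"}}


def lookup_order(order_id):
    """Stub tool: return the order record, or a structured error if it is unknown."""
    order = FAKE_ORDERS.get(order_id)
    if order is None:
        return {
            "error": "not_found",
            "message": f"Order {order_id} could not be found.",
        }
    return {"order_id": order_id, **order}


print(lookup_order("ORD-789"))      # {'order_id': 'ORD-789', 'status': 'shipped'}
print(lookup_order("NONEXISTENT"))  # error payload with 'not_found'
```

With the stub in place, the scenario exercises how the agent words the failure rather than whether the backing service happens to be reachable.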

Testing Argument Extraction Quality

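Write scenarios that challenge the agent’s ability to extract parameters from natural language. Where the expected value is unambiguous, assert it with expected_tool_args; a scenario that lists only expected_tools, such as date_extraction below, confirms that the tool was called but not which value the agent extracted.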


- name: date_extraction
  input: "Schedule a meeting for the first Monday in April"
  expected_tools:
    - create_event
  scorers:
    - passfail
  tags:
    - tools
    - extraction
 
- name: numeric_extraction
  input: "Transfer fifty thousand dollars to account ending in 4567"
  expected_tools:
    - execute_transfer
  expected_tool_args:
    execute_transfer:
      amount: 50000
  scorers:
    - passfail
  tags:
    - tools
    - extraction
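
The date_extraction scenario has no argument check because the correct date depends on the year in which the test runs. If you want to assert the extracted date, one option is to compute the expected value when generating the scenario. The Python sketch below does that with the standard library; the helper name is hypothetical, and the argument name and date format your create_event tool expects are not specified in this document, so adapt the output accordingly.

```python
from datetime import date, timedelta

MONDAY = 0  # date.weekday() numbers Monday as 0 and Sunday as 6


def first_weekday_of_month(year, month, weekday):
    """Return the first date in the given month that falls on `weekday`."""
    first = date(year, month, 1)
    # Days to move forward from the 1st to reach the target weekday (0 to 6).
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset)


print(first_weekday_of_month(2025, 4, MONDAY).isoformat())  # 2025-04-07
print(first_weekday_of_month(2024, 4, MONDAY).isoformat())  # 2024-04-01
```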

Tips

Always pair expected_tool_args with expected_tools: If you specify expected arguments for a tool, also list that tool in expected_tools for clarity, even though expected_tool_args implicitly checks for the tool’s presence.
Start with expected_tools only: Begin by verifying the agent calls the right tools. Add expected_tool_args once you have confidence in the basic flow.
Use partial argument matching to your advantage: You do not need to specify every argument. Focus on the ones derived from user input that the agent must extract correctly.
Tag tool-related scenarios: Use tags like tools, read, create, update, or delete to organize and selectively run tool tests.