# CI/CD Integration
Running AgenticAssure tests in your CI/CD pipeline ensures that changes to your agent, its prompts, its tools, or its underlying model do not introduce regressions. This guide covers how to set up AgenticAssure in continuous integration, with a focus on GitHub Actions.
## Why Run Agent Tests in CI
AI agents are affected by changes that traditional test suites do not catch:
- Prompt changes: A small edit to a system prompt can alter behavior across many scenarios.
- Model upgrades: Switching from `gpt-4` to `gpt-4o` (or any model version change) may shift outputs, tool-calling behavior, or response style.
- Tool implementation changes: Modifying a tool's interface or behavior affects how the agent integrates with it.
- Dependency updates: Upgrading LangChain, OpenAI SDK, or other libraries can introduce subtle behavioral changes.
By running AgenticAssure in CI, you get an automated safety net that flags regressions before they reach production.
## GitHub Actions Example
Below is a complete GitHub Actions workflow that runs AgenticAssure tests on every pull request and on every push to main.
```yaml
# .github/workflows/agent-tests.yml
name: Agent Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install agenticassure
          pip install -r requirements.txt

      - name: Validate scenarios
        run: agenticassure validate scenarios/

      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --output cli

      - name: Generate HTML report
        if: always()
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --output html \
            || true

      - name: Upload test report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: agent-test-report
          path: report_*.html
          retention-days: 30
```

## What This Workflow Does
- Validates scenarios: Runs `agenticassure validate` to catch YAML syntax errors before executing any tests. This step is fast and does not require an LLM API key.
- Runs agent tests: Executes all scenarios using the specified adapter. The process exits with code 1 if any scenario fails, which causes the GitHub Actions job to fail.
- Generates an HTML report: Produces a detailed report as a CI artifact for review, even if some tests failed (note the `|| true` that prevents the step from failing, and the `if: always()` condition).
- Uploads the report: Stores the HTML report as a downloadable artifact.
## Setting Up the Adapter in CI
Your adapter class must be importable in the CI environment. There are several approaches:
### Option 1: Adapter in Your Package
If your adapter is part of your application code:
```
myproject/
  agent.py          # contains MyAgent class
  ...
scenarios/
  core_tests.yaml
requirements.txt
```

The CLI command references it by its dotted import path:

```bash
agenticassure run scenarios/ --adapter myproject.agent.MyAgent
```

Make sure your package is installed (or at minimum on `PYTHONPATH`):
```yaml
- name: Install project
  run: pip install -e .
```

### Option 2: Config File
Create an `agenticassure.yaml` in your repository root:

```yaml
adapter: myproject.agent.MyAgent
```

Then you can omit the `--adapter` flag:

```bash
agenticassure run scenarios/
```

### Option 3: Standalone Adapter File
For simpler setups, place a standalone adapter file in your repo:
```python
# tests/adapter.py
from agenticassure.results import AgentResult
from myproject import create_agent


class CITestAgent:
    def __init__(self):
        self.agent = create_agent()

    def run(self, input, context=None):
        response = self.agent.invoke(input)
        return AgentResult(output=response)
```

```bash
agenticassure run scenarios/ --adapter tests.adapter.CITestAgent
```

## Managing Secrets
AI agent tests typically require API keys for the underlying LLM provider. Never commit API keys to your repository.
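One way to make a missing key obvious is to check for it before any test step runs, so the job fails with a clear message instead of a mid-run authentication error. A minimal sketch (the `require_env` helper is illustrative, not part of AgenticAssure):

```python
# Sketch: fail fast in CI when a required API key is not set.
# The helper name and error message are assumptions for illustration;
# the variable name matches the examples in this guide.
import os
import sys


def require_env(name):
    """Return the named environment variable, or exit with a clear message."""
    value = os.environ.get(name)
    if not value:
        sys.exit(f"{name} is not set; add it as a CI secret.")
    return value


# Typical use at the top of a CI helper script:
# api_key = require_env("OPENAI_API_KEY")
```

Exiting via `sys.exit` with a message produces a non-zero exit code, so the CI step fails immediately with the reason in the log.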
### GitHub Actions Secrets
- Go to your repository on GitHub.
- Navigate to Settings > Secrets and variables > Actions.
- Click New repository secret.
- Add your key (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).
Reference secrets in your workflow:
```yaml
env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

### Multiple Environments
If your agent uses different API keys for different environments, use GitHub Environments:
```yaml
jobs:
  agent-tests:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: agenticassure run scenarios/
```

## Using Exit Codes for Pass/Fail Gating
AgenticAssure exits with code 1 if any scenario fails and code 0 if all scenarios pass. This integrates naturally with CI systems that treat non-zero exit codes as failures.
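The same contract can be consumed outside a CI system, for example from a release script. A small sketch that relies only on the exit-code behavior described above (the commented-out invocation is the example adapter path used throughout this guide):

```python
# Sketch: gate a later pipeline step on a test command's exit code,
# relying on the contract that 0 means all scenarios passed and a
# non-zero code means at least one failed.
import subprocess


def agent_tests_passed(cmd):
    """Run the given test command; report pass/fail from its exit code."""
    return subprocess.run(cmd).returncode == 0


# In a release script this would wrap the usual invocation, e.g.:
# if not agent_tests_passed(["agenticassure", "run", "scenarios/",
#                            "--adapter", "myproject.agent.MyAgent"]):
#     raise SystemExit("Agent tests failed; aborting deploy.")
```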
To use this as a required check on pull requests:
- Run your agent tests in a CI job (as shown above).
- In GitHub, go to Settings > Branches > Branch protection rules.
- Enable Require status checks to pass before merging.
- Select the agent test job as a required check.
Now pull requests cannot be merged if any agent test fails.
## Generating Reports as CI Artifacts
### HTML Reports
```yaml
- name: Generate HTML report
  if: always()
  run: agenticassure run scenarios/ --adapter myproject.agent.MyAgent --output html || true

- name: Upload HTML report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: agent-test-report
    path: report_*.html
```

### JSON Reports
JSON reports are useful for downstream processing, dashboards, or trend analysis:
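As one illustration of downstream processing, a script might compute the suite's pass rate from such a report. The schema used here (a top-level `scenarios` list with `name` and `passed` fields) is an assumption made for the sketch, not AgenticAssure's documented report format:

```python
# Sketch: compute a pass rate from a JSON report, e.g. for a dashboard
# or trend tracking. The report schema below is assumed for illustration
# and should be adjusted to the actual JSON output.
import json


def pass_rate(report_path):
    """Return the fraction of scenarios marked as passed in the report."""
    with open(report_path) as f:
        report = json.load(f)
    scenarios = report["scenarios"]
    passed = sum(1 for s in scenarios if s["passed"])
    return passed / len(scenarios)
```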
```yaml
- name: Generate JSON report
  if: always()
  run: agenticassure run scenarios/ --adapter myproject.agent.MyAgent --output json || true

- name: Upload JSON report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: agent-test-results
    path: results_*.json
```

## Running Specific Tags in CI vs Locally
Use tags to run different subsets of tests in different contexts.
### Fast CI Checks
Run only critical, fast scenarios on every PR:
```yaml
- name: Run smoke tests
  run: |
    agenticassure run scenarios/ \
      --adapter myproject.agent.MyAgent \
      --tag critical \
      --tag fast
```

### Full Nightly Suite
Run the complete test suite on a schedule:
```yaml
# .github/workflows/nightly-agent-tests.yml
name: Nightly Agent Tests

on:
  schedule:
    - cron: "0 6 * * *" # 6 AM UTC daily

jobs:
  full-suite:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - run: pip install agenticassure -r requirements.txt

      - name: Run all tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --output html

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: nightly-report
          path: report_*.html
```

### Local Development
Run only scenarios relevant to what you are working on:
```bash
# Run only scenarios tagged "orders"
agenticassure run scenarios/ --adapter myproject.agent.MyAgent -t orders

# Run safety tests before committing
agenticassure run scenarios/ --adapter myproject.agent.MyAgent -t safety
```

## Performance Considerations
### Timeouts
LLM API calls can be slow. Set appropriate timeouts at the suite level and per-scenario:
```yaml
suite:
  name: ci-tests
  config:
    default_timeout: 60

scenarios:
  - name: quick_lookup
    input: "What is my balance?"
    timeout_seconds: 15
    tags: [fast]

  - name: complex_research
    input: "Analyze this data and provide recommendations"
    timeout_seconds: 120
    tags: [slow]
```

Set a job-level timeout in your CI workflow to prevent runaway jobs:

```yaml
jobs:
  agent-tests:
    timeout-minutes: 15
```

### Cost Management
Each scenario run makes at least one LLM API call. For a suite of 50 scenarios using GPT-4, a single CI run can cost several dollars. Strategies to manage costs:
- Tag and filter: Run only `critical` scenarios on every PR. Run the full suite nightly or weekly.
- Use cheaper models in CI: If feasible, test with a less expensive model for basic checks and reserve the production model for nightly runs.
- Limit retries: Set `retries: 0` in CI to avoid multiplying API calls on flaky tests.
- Cache or mock when possible: For checks that do not require live LLM calls, consider a mock adapter that replays recorded responses.
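The last strategy can be sketched as a small replay adapter. This is an illustration, not part of AgenticAssure: the recordings file and its `{input: output}` format are assumptions, and `AgentResult` is stood in here so the sketch is self-contained (a real adapter would import it from `agenticassure.results` as shown earlier):

```python
# Sketch of a replay adapter for CI runs that should not hit a live LLM.
# The recordings file and its format are assumptions for illustration.
import json
from dataclasses import dataclass


@dataclass
class AgentResult:  # stand-in for agenticassure.results.AgentResult
    output: str


class ReplayAdapter:
    def __init__(self, recordings_path="tests/recordings.json"):
        with open(recordings_path) as f:
            # Maps scenario input text to a previously recorded response.
            self.recordings = json.load(f)

    def run(self, input, context=None):
        # Unseen inputs return a sentinel so assertions fail loudly
        # instead of passing on an empty string.
        output = self.recordings.get(input, "<no recording for this input>")
        return AgentResult(output=output)
```

Replays cost nothing per run, but they only verify the pipeline and assertions, not current model behavior, so keep live runs in the nightly suite.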
### Retries
Use retries cautiously in CI. LLM outputs are non-deterministic, so a scenario that fails once might pass on retry. However, retries also increase cost and run time:
```yaml
suite:
  config:
    retries: 1 # One retry on failure
```

## Example: PR Check Workflow
Here is a minimal but complete workflow suitable for gating pull requests:
```yaml
# .github/workflows/pr-agent-check.yml
name: PR Agent Check

on:
  pull_request:
    branches: [main]

jobs:
  check:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install
        run: |
          pip install agenticassure
          pip install -e .

      - name: Validate scenario files
        run: agenticassure validate scenarios/

      - name: Run critical agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --tag critical \
            --timeout 30

      - name: Upload report
        if: always()
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --tag critical \
            --output html \
            || true

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: pr-agent-report
          path: report_*.html
```

This workflow validates YAML files first (cheap, no API calls), runs only critical-tagged scenarios to keep costs and time low, and uploads an HTML report for review regardless of pass/fail status.
## Other CI Systems
While the examples above use GitHub Actions, AgenticAssure works with any CI system that can run shell commands. The core pattern is the same:
- Install Python and dependencies.
- Set LLM API keys as environment variables.
- Run `agenticassure validate` for fast validation.
- Run `agenticassure run` with your adapter.
- Check the exit code (0 = pass, 1 = fail).
- Collect report files as artifacts.
### GitLab CI Example
```yaml
agent-tests:
  image: python:3.11
  variables:
    OPENAI_API_KEY: $OPENAI_API_KEY
  script:
    - pip install agenticassure -r requirements.txt
    - agenticassure validate scenarios/
    - agenticassure run scenarios/ --adapter myproject.agent.MyAgent --output html
  artifacts:
    paths:
      - report_*.html
    when: always
    expire_in: 30 days
```