# CI/CD Integration
Running AgenticAssure tests in your CI/CD pipeline ensures that changes to your agent, its prompts, its tools, or its underlying model do not introduce regressions. This guide covers how to set up AgenticAssure in continuous integration, with a focus on GitHub Actions.
## Why Run Agent Tests in CI
AI agents are affected by changes that traditional test suites do not catch:
- Prompt changes: A small edit to a system prompt can alter behavior across many scenarios.
- Model upgrades: Switching from `gpt-4` to `gpt-4o` (or any model version change) may shift outputs, tool-calling behavior, or response style.
- Tool implementation changes: Modifying a tool's interface or behavior affects how the agent integrates with it.
- Dependency updates: Upgrading LangChain, OpenAI SDK, or other libraries can introduce subtle behavioral changes.
By running AgenticAssure in CI, you get an automated safety net that flags regressions before they reach production.
## GitHub Actions Example
Below is a complete GitHub Actions workflow that runs AgenticAssure tests on every pull request and on every push to main.
```yaml
# .github/workflows/agent-tests.yml
name: Agent Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install agenticassure
          pip install -r requirements.txt

      - name: Validate scenarios
        run: agenticassure validate scenarios/

      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --output cli

      - name: Generate HTML report
        if: always()
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --output html \
            || true

      - name: Upload test report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: agent-test-report
          path: report_*.html
          retention-days: 30
```

## What This Workflow Does
- Validates scenarios: Runs `agenticassure validate` to catch YAML syntax errors before executing any tests. This step is fast and does not require an LLM API key.
- Runs agent tests: Executes all scenarios using the specified adapter. The process exits with code 1 if any scenario fails, which causes the GitHub Actions job to fail.
- Generates an HTML report: Produces a detailed report as a CI artifact for review, even if some tests failed (note the `|| true` that prevents the step from failing, and the `if: always()` condition).
- Uploads the report: Stores the HTML report as a downloadable artifact.
## Setting Up the Adapter in CI
Your adapter class must be importable in the CI environment. There are several approaches:
### Option 1: Adapter in Your Package
If your adapter is part of your application code:
```
myproject/
  agent.py          # contains MyAgent class
  ...
scenarios/
  core_tests.yaml
requirements.txt
```

The CLI command references it by its dotted import path:

```bash
agenticassure run scenarios/ --adapter myproject.agent.MyAgent
```

Make sure your package is installed (or at minimum on `PYTHONPATH`):
```yaml
- name: Install project
  run: pip install -e .
```

### Option 2: Config File
Create an `agenticassure.yaml` in your repository root:

```yaml
adapter: myproject.agent.MyAgent
```

Then you can omit the `--adapter` flag:

```bash
agenticassure run scenarios/
```

### Option 3: Standalone Adapter File
For simpler setups, place a standalone adapter file in your repo:
```python
# tests/adapter.py
from agenticassure.results import AgentResult
from myproject import create_agent


class CITestAgent:
    def __init__(self):
        self.agent = create_agent()

    def run(self, input, context=None):
        response = self.agent.invoke(input)
        return AgentResult(output=response)
```

```bash
agenticassure run scenarios/ --adapter tests.adapter.CITestAgent
```

## Managing Secrets
AI agent tests typically require API keys for the underlying LLM provider. Never commit API keys to your repository.
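One way to make a missing key obvious is to check for it before any test step runs, so the job fails with a clear message instead of a mid-run authentication error. A minimal sketch (the `require_env` helper is illustrative, not part of AgenticAssure):

```python
# Sketch: fail fast in CI when a required API key is not set.
# The helper name and error message are assumptions for illustration;
# the variable name matches the examples in this guide.
import os
import sys


def require_env(name):
    """Return the named environment variable, or exit with a clear message."""
    value = os.environ.get(name)
    if not value:
        sys.exit(f"{name} is not set; add it as a CI secret.")
    return value


# Typical use at the top of a CI helper script:
# api_key = require_env("OPENAI_API_KEY")
```

Exiting via `sys.exit` with a message produces a non-zero exit code, so the CI step fails immediately with the reason in the log.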
### GitHub Actions Secrets
- Go to your repository on GitHub.
- Navigate to Settings > Secrets and variables > Actions.
- Click New repository secret.
- Add your key (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).
Reference secrets in your workflow:
```yaml
env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

### Multiple Environments
If your agent uses different API keys for different environments, use GitHub Environments:
```yaml
jobs:
  agent-tests:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: agenticassure run scenarios/
```

## Using Exit Codes for Pass/Fail Gating
AgenticAssure exits with code 1 if any scenario fails and code 0 if all scenarios pass. This integrates naturally with CI systems that treat non-zero exit codes as failures.
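The same contract can be consumed outside a CI system, for example from a release script. A small sketch that relies only on the exit-code behavior described above (the commented-out invocation is the example adapter path used throughout this guide):

```python
# Sketch: gate a later pipeline step on a test command's exit code,
# relying on the contract that 0 means all scenarios passed and a
# non-zero code means at least one failed.
import subprocess


def agent_tests_passed(cmd):
    """Run the given test command; report pass/fail from its exit code."""
    return subprocess.run(cmd).returncode == 0


# In a release script this would wrap the usual invocation, e.g.:
# if not agent_tests_passed(["agenticassure", "run", "scenarios/",
#                            "--adapter", "myproject.agent.MyAgent"]):
#     raise SystemExit("Agent tests failed; aborting deploy.")
```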
To use this as a required check on pull requests:
- Run your agent tests in a CI job (as shown above).
- In GitHub, go to Settings > Branches > Branch protection rules.
- Enable Require status checks to pass before merging.
- Select the agent test job as a required check.
Now pull requests cannot be merged if any agent test fails.
## Generating Reports as CI Artifacts
### HTML Reports
```yaml
- name: Generate HTML report
  if: always()
  run: agenticassure run scenarios/ --adapter myproject.agent.MyAgent --output html || true

- name: Upload HTML report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: agent-test-report
    path: report_*.html
```

### JSON Reports
JSON reports are useful for downstream processing, dashboards, or trend analysis:
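As one illustration of downstream processing, a script might compute the suite's pass rate from such a report. The schema used here (a top-level `scenarios` list with `name` and `passed` fields) is an assumption made for the sketch, not AgenticAssure's documented report format:

```python
# Sketch: compute a pass rate from a JSON report, e.g. for a dashboard
# or trend tracking. The report schema below is assumed for illustration
# and should be adjusted to the actual JSON output.
import json


def pass_rate(report_path):
    """Return the fraction of scenarios marked as passed in the report."""
    with open(report_path) as f:
        report = json.load(f)
    scenarios = report["scenarios"]
    passed = sum(1 for s in scenarios if s["passed"])
    return passed / len(scenarios)
```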
```yaml
- name: Generate JSON report
  if: always()
  run: agenticassure run scenarios/ --adapter myproject.agent.MyAgent --output json || true

- name: Upload JSON report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: agent-test-results
    path: results_*.json
```

## Running Specific Tags in CI vs Locally
Use tags to run different subsets of tests in different contexts.
### Fast CI Checks
Run only critical, fast scenarios on every PR:
```yaml
- name: Run smoke tests
  run: |
    agenticassure run scenarios/ \
      --adapter myproject.agent.MyAgent \
      --tag critical \
      --tag fast
```

### Full Nightly Suite
Run the complete test suite on a schedule:
```yaml
# .github/workflows/nightly-agent-tests.yml
name: Nightly Agent Tests

on:
  schedule:
    - cron: "0 6 * * *" # 6 AM UTC daily

jobs:
  full-suite:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - run: pip install agenticassure -r requirements.txt

      - name: Run all tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --output html

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: nightly-report
          path: report_*.html
```

### Local Development
Run only scenarios relevant to what you are working on:
```bash
# Run only scenarios tagged "orders"
agenticassure run scenarios/ --adapter myproject.agent.MyAgent -t orders

# Run safety tests before committing
agenticassure run scenarios/ --adapter myproject.agent.MyAgent -t safety
```

## Performance Considerations
### Timeouts
LLM API calls can be slow. Set appropriate timeouts at the suite level and per-scenario:
```yaml
suite:
  name: ci-tests
  config:
    default_timeout: 60

scenarios:
  - name: quick_lookup
    input: "What is my balance?"
    timeout_seconds: 15
    tags: [fast]

  - name: complex_research
    input: "Analyze this data and provide recommendations"
    timeout_seconds: 120
    tags: [slow]
```

Set a job-level timeout in your CI workflow to prevent runaway jobs:

```yaml
jobs:
  agent-tests:
    timeout-minutes: 15
```

### Cost Management
Each scenario run makes at least one LLM API call. For a suite of 50 scenarios using GPT-4, a single CI run can cost several dollars. Strategies to manage costs:
- Tag and filter: Run only `critical` scenarios on every PR. Run the full suite nightly or weekly.
- Use cheaper models in CI: If feasible, test with a less expensive model for basic checks and reserve the production model for nightly runs.
- Limit retries: Set `retries: 0` in CI to avoid multiplying API calls on flaky tests.
- Cache or mock when possible: For checks that do not require live LLM calls, consider a mock adapter that replays recorded responses.
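The last strategy can be sketched as a small replay adapter. This is an illustration, not part of AgenticAssure: the recordings file and its `{input: output}` format are assumptions, and `AgentResult` is stood in here so the sketch is self-contained (a real adapter would import it from `agenticassure.results` as shown earlier):

```python
# Sketch of a replay adapter for CI runs that should not hit a live LLM.
# The recordings file and its format are assumptions for illustration.
import json
from dataclasses import dataclass


@dataclass
class AgentResult:  # stand-in for agenticassure.results.AgentResult
    output: str


class ReplayAdapter:
    def __init__(self, recordings_path="tests/recordings.json"):
        with open(recordings_path) as f:
            # Maps scenario input text to a previously recorded response.
            self.recordings = json.load(f)

    def run(self, input, context=None):
        # Unseen inputs return a sentinel so assertions fail loudly
        # instead of passing on an empty string.
        output = self.recordings.get(input, "<no recording for this input>")
        return AgentResult(output=output)
```

Replays cost nothing per run, but they only verify the pipeline and assertions, not current model behavior, so keep live runs in the nightly suite.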
### Retries
Use retries cautiously in CI. LLM outputs are non-deterministic, so a scenario that fails once might pass on retry. However, retries also increase cost and run time:
```yaml
suite:
  config:
    retries: 1 # One retry on failure
```

## Example: PR Check Workflow
Here is a minimal but complete workflow suitable for gating pull requests:
```yaml
# .github/workflows/pr-agent-check.yml
name: PR Agent Check

on:
  pull_request:
    branches: [main]

jobs:
  check:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install
        run: |
          pip install agenticassure
          pip install -e .

      - name: Validate scenario files
        run: agenticassure validate scenarios/

      - name: Run critical agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --tag critical \
            --timeout 30

      - name: Upload report
        if: always()
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          agenticassure run scenarios/ \
            --adapter myproject.agent.MyAgent \
            --tag critical \
            --output html \
            || true

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: pr-agent-report
          path: report_*.html
```

This workflow validates YAML files first (cheap, no API calls), runs only critical-tagged scenarios to keep costs and time low, and uploads an HTML report for review regardless of pass/fail status.
## Other CI Systems
While the examples above use GitHub Actions, AgenticAssure works with any CI system that can run shell commands. The core pattern is the same:
- Install Python and dependencies.
- Set LLM API keys as environment variables.
- Run `agenticassure validate` for fast validation.
- Run `agenticassure run` with your adapter.
- Check the exit code (0 = pass, 1 = fail).
- Collect report files as artifacts.
### GitLab CI Example
```yaml
agent-tests:
  image: python:3.11
  variables:
    OPENAI_API_KEY: $OPENAI_API_KEY
  script:
    - pip install agenticassure -r requirements.txt
    - agenticassure validate scenarios/
    - agenticassure run scenarios/ --adapter myproject.agent.MyAgent --output html
  artifacts:
    paths:
      - report_*.html
    when: always
    expire_in: 30 days
```