Let Agents Test Your App in a Real Browser with Expect (Open-Source CLI & Agent Skill)
Traditional end-to-end testing has always been brittle by design. You write a Playwright or Selenium script that clicks button[data-testid="submit"], and six weeks later a frontend engineer renames the attribute and your entire test suite collapses. The maintenance overhead is real, the coverage is incomplete, and worst of all — the scripts never behave like actual users.
What if the test runner was a real user? Not a simulated one. An AI agent that perceives the rendered DOM, reasons about intent, and executes browser interactions the same way a human QA engineer would — but at the speed and consistency of a machine.
That is the architectural promise of Expect, an open-source CLI and agent skill that lets AI agents drive real browser sessions for end-to-end testing. In this article I want to go deep on the engineering decisions behind this pattern, when it outperforms traditional E2E tooling, and how to integrate it cleanly into a production agent harness.
Why Traditional E2E Testing Fails at Scale
Before we talk about the solution, it is worth understanding the failure modes we are designing around.
The Selector Fragility Problem
Every Cypress, Playwright, or Selenium test is fundamentally a sequence of selector-based imperatives. The test does not understand the application — it mechanically targets DOM nodes by CSS selectors, XPath expressions, or test IDs. This creates a structural coupling between the test code and the implementation details of the UI.
In a fast-moving team, this coupling becomes a tax. Engineers either spend cycles maintaining tests (defeating the purpose) or disable the flaky ones, and coverage degrades silently. Google's testing engineering blog has long made the case that end-to-end UI tests of this kind are among the highest-maintenance artifacts in a software project.
The Coverage Gap
Because writing selector-based tests is expensive, teams prioritize critical paths and leave large swaths of the application untested. The long tail — edge cases in complex forms, multi-step flows, conditional UI states — is where real user bugs live, and it is precisely where coverage is absent.
The Human QA Bottleneck
For critical releases, teams fall back on manual QA. A human tester opens the browser, works through a test plan, and files bugs. This is effective but does not scale. It creates a release bottleneck, particularly for teams shipping continuously.
Agent-driven browser testing addresses all three of these failure modes simultaneously.
What Expect Actually Does
Expect is an open-source CLI tool and composable agent skill that exposes browser control as a tool surface for AI agents. At its core, it:
- Launches a real Chromium browser (via Playwright under the hood) with full JavaScript execution, cookie handling, and network stack
- Exposes browser state to an LLM — the current URL, rendered DOM snapshot, accessibility tree, and screenshot
- Interprets natural language test descriptions and translates them into browser actions (click, type, navigate, scroll, wait for element)
- Evaluates assertions expressed in plain language against the current browser state
- Produces structured test reports with pass/fail status, step-by-step execution logs, and screenshots at failure points
The key architectural insight is that the agent does not need selectors. It reads the accessibility tree and visual state of the page the same way a sighted user would, and reasons about which element to interact with based on semantic understanding rather than structural coupling.
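To make that concrete, here is a toy sketch of selector-free targeting: resolving an element from an accessibility-tree snapshot by role and accessible name rather than by CSS selector. The node shape and function here are illustrative assumptions, not Expect's internals:

```typescript
// Illustrative only: a toy accessibility-tree node, not Expect's real data model.
interface AXNode {
  role: string;        // e.g. "button", "textbox", "link"
  name: string;        // accessible name, as a screen reader would announce it
  children?: AXNode[];
}

// Depth-first search for the first node whose role matches and whose
// accessible name contains the requested phrase (case-insensitive).
function resolveTarget(tree: AXNode, role: string, phrase: string): AXNode | null {
  const matches =
    tree.role === role &&
    tree.name.toLowerCase().includes(phrase.toLowerCase());
  if (matches) return tree;
  for (const child of tree.children ?? []) {
    const found = resolveTarget(child, role, phrase);
    if (found) return found;
  }
  return null;
}

// The agent asks for "the submit button" and gets the right node even if its
// data-testid, class name, or DOM position changed since the test was written.
const page: AXNode = {
  role: "document",
  name: "Checkout",
  children: [
    { role: "textbox", name: "Email address" },
    { role: "button", name: "Submit order" },
  ],
};
const target = resolveTarget(page, "button", "submit");
```

Because the lookup keys on semantics rather than structure, renaming an attribute or reshuffling the DOM does not break the resolution, which is precisely the fragility the selector-based approach suffers from.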
The CLI Interface
The simplest invocation looks like this:
npx expect-agent test \
  --url https://your-app.com \
  --test "Log in as admin@example.com with password 'testpass', navigate to the billing page, and verify that a Pro plan subscription is shown"
The CLI handles browser lifecycle, passes observations to the configured LLM, and exits with a zero or non-zero status code based on whether the test assertion passed — making it trivially composable with any CI pipeline.
The Agent Skill Interface
For teams building their own agent harnesses, Expect exposes itself as a composable tool definition. This means you can register it as a skill in your orchestration layer and have your agents call it programmatically as part of a larger workflow:
import { ExpectSkill } from "@expect-agent/sdk";

const skill = new ExpectSkill({
  browserOptions: {
    headless: true,
    viewport: { width: 1280, height: 800 }
  },
  llmProvider: "anthropic",
  model: "claude-opus-4-6"
});

const result = await skill.run({
  url: "https://staging.your-app.com",
  assertion: "After adding a product to the cart and proceeding to checkout, the order summary should display the correct item count and subtotal"
});

console.log(result.passed); // true | false
console.log(result.steps);  // step-by-step execution log
This composability is what makes Expect genuinely interesting as infrastructure rather than just a convenience wrapper around Playwright.
Architectural Patterns for Production Integration
Pattern 1: The Deployment Gatekeeper
The most direct integration point is as a post-deployment smoke test in your CD pipeline. After deploying to staging, an orchestrator agent spins up Expect and runs a battery of natural language test cases against the live environment before promoting to production.
# .github/workflows/deploy.yml (excerpt)
- name: Run agent smoke tests
  run: |
    npx expect-agent test-suite \
      --url ${{ env.STAGING_URL }} \
      --suite ./tests/smoke-tests.yaml \
      --fail-fast
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Where smoke-tests.yaml contains natural language test cases:
tests:
  - name: User registration flow
    assertion: "Complete the signup form with a fresh email address and verify the welcome email confirmation screen is shown"
  - name: Core product search
    assertion: "Search for 'wireless headphones', apply a price filter under $100, and confirm at least one result is visible"
  - name: Checkout flow
    assertion: "Add the first product in search results to cart, proceed to checkout, and verify the payment form fields are present and interactive"
This pattern catches regressions that unit and integration tests alone cannot: broken integrations, misconfigured environment variables that change runtime behavior, CSS conflicts that leave buttons invisible or unclickable.
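The --fail-fast behavior used in the workflow above can be sketched as a simple loop. The types and runner here are stand-ins for whatever the real CLI does internally, not its actual implementation:

```typescript
// Illustrative sketch of fail-fast suite execution.
interface TestCase { name: string; assertion: string; }
interface TestResult { name: string; passed: boolean; }

async function runSuite(
  tests: TestCase[],
  runOne: (t: TestCase) => Promise<boolean>,
  failFast: boolean
): Promise<TestResult[]> {
  const results: TestResult[] = [];
  for (const t of tests) {
    const passed = await runOne(t);
    results.push({ name: t.name, passed });
    // --fail-fast: stop at the first failure so CI gets a fast signal
    // instead of paying for the remaining LLM-driven test runs.
    if (failFast && !passed) break;
  }
  return results;
}
```

The trade-off is the usual one: fail-fast gives cheaper, faster red builds, while a full run gives a complete picture of what broke in one pass.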
Pattern 2: The QA Agent in Your Harness
For teams with existing agentic infrastructure, the more powerful pattern is treating Expect as a sub-agent skill that your orchestrator delegates to when it needs to verify UI state.
Consider a deployment orchestrator agent that manages the full release pipeline. When it reaches the verification phase, instead of calling a fixed test script, it invokes the Expect skill with a dynamically generated test specification derived from the diff of what changed in this release:
async function runReleaseVerification(
  releaseNotes: string,
  stagingUrl: string
): Promise<VerificationReport> {
  // Step 1: Ask the LLM to derive test cases from release notes
  const completion = await llm.complete({
    prompt: `Given these release notes, generate 5-10 critical user-journey test assertions:

${releaseNotes}

Format as a JSON array of assertion strings.`
  });

  // The completion is text; parse it into the expected array of assertions
  const testCases: string[] = JSON.parse(completion);

  // Step 2: Execute each test case via the Expect skill
  const results = await Promise.allSettled(
    testCases.map(assertion =>
      expectSkill.run({ url: stagingUrl, assertion })
    )
  );

  return buildReport(results);
}
This is where the pattern becomes genuinely novel: the test suite itself becomes adaptive. The agent reasons about what changed and tests the things most likely to have broken, rather than running the same static test suite on every deploy.
Pattern 3: Continuous Canary Testing
In high-availability systems, you want to verify user-facing functionality on a schedule, not just at deploy time. A monitoring agent can run Expect-powered canary tests against production every few minutes, alerting on-call when a critical user journey fails.
The key engineering consideration here is session isolation. Each canary run must start from a clean browser state to avoid cross-run contamination. Expect handles this by default — each skill.run() invocation gets a fresh browser context — but you need to ensure your canary test accounts are seeded with known state before each run.
async function runCanaryTest(): Promise<void> {
  // Reset test account state before each canary run
  await resetCanaryAccount(process.env.CANARY_EMAIL);

  const result = await expectSkill.run({
    url: "https://app.yourproduct.com",
    assertion: "Log in with canary credentials and verify the dashboard loads with the expected data widgets visible",
    sessionOptions: {
      freshContext: true,
      storageState: undefined
    }
  });

  if (!result.passed) {
    await pagerduty.alert({
      title: "Canary: Dashboard login flow degraded",
      details: result.failureReason,
      screenshot: result.failureScreenshot
    });
  }
}
Reliability Engineering Considerations
LLM Determinism and Test Stability
The most common concern I hear from engineers evaluating agent-driven testing is: “How do I know the agent will do the same thing every time?”
This is a legitimate concern, and it requires a slightly different mental model than traditional deterministic testing. The answer is that you are not trying to make the agent take identical actions — you are trying to make it evaluate identical outcomes. The assertion is about end state, not execution path.
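A toy illustration of that mental model: the assertion is a predicate over the final observed state, so any action sequence that reaches the same state satisfies it. The PageState shape below is invented for illustration and is not Expect's evaluator:

```typescript
// A toy end-state predicate: the assertion cares only about what the final
// page looks like, not which sequence of clicks got the agent there.
interface PageState { url: string; visibleText: string[]; }

const assertion = (s: PageState): boolean =>
  s.url.endsWith("/checkout") && s.visibleText.includes("Order summary");

// Path A: the agent searched for the product, then added it to the cart.
// Path B: the agent followed a deep link straight to checkout.
// Both runs end in the same state, so both satisfy the same assertion.
const finalStateViaSearch: PageState = {
  url: "https://shop.example.com/checkout",
  visibleText: ["Order summary", "Subtotal: $49.00"],
};
const finalStateViaDeepLink: PageState = {
  url: "https://shop.example.com/checkout",
  visibleText: ["Order summary", "Subtotal: $49.00"],
};
```

This is why run-to-run variation in the agent's action sequence does not translate into flaky verdicts: two different paths to the same end state produce the same pass/fail result.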
In practice, Expect achieves high consistency by:
- Using low-temperature inference for action selection (the agent is not being creative, it is being precise)
- Grounding decisions in the accessibility tree rather than visual interpretation, which is more stable across LLM runs
- Retrying transient failures with exponential backoff before reporting a failure
- Caching page state observations to avoid re-querying unchanged DOM sections
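The retry item in the list above can be sketched as a generic backoff wrapper. This is a simplified stand-in for whatever Expect does internally, not its source:

```typescript
// Retry a flaky async operation with exponential backoff before giving up.
// The delay doubles on each attempt: baseDelayMs, 2x, 4x, ...
async function withBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  // Only after all attempts are exhausted does the failure surface to the report
  throw lastError;
}
```

The point of wrapping individual browser actions this way is that a transient hiccup (a slow network response, a late-rendering element) becomes a retried step rather than a failed test.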
In internal testing, teams report >99% consistency on deterministic flows (login, navigation, form submission) and ~95% on more complex multi-step flows, which is competitive with well-maintained selector-based tests.
Handling Authentication
Authentication is the first engineering challenge you will hit in any E2E testing setup. For agent-driven testing, there are three approaches in increasing order of sophistication:
Approach 1: Test credentials in the assertion. Simple but not recommended for anything beyond prototyping. The agent uses literal credentials embedded in the test case. Secrets in test descriptions are an operational hazard.
Approach 2: Pre-authenticated session injection. Before running the test, programmatically authenticate via API, capture the session token, and inject it into the browser context. The agent starts from an already-authenticated state.
const sessionState = await getAuthSession(testUser);

const result = await expectSkill.run({
  url: "https://app.yourproduct.com/dashboard",
  assertion: "Verify the user's account settings page shows the correct subscription tier",
  sessionOptions: {
    storageState: sessionState // inject cookies/localStorage
  }
});
Approach 3: Agent-aware auth flows. The agent handles login as part of its test execution, using credentials injected via environment variables that are never embedded in the test case text. This tests the authentication flow itself and is the most complete approach.
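One way to implement the "credentials never in the test text" property of Approach 3 is placeholder expansion at run time. The placeholder syntax and variable names below are assumptions for illustration, not the Expect API:

```typescript
// Expand {{ENV_NAME}} placeholders in a test assertion from an environment
// map, so the stored test case never contains the secret itself.
function expandSecrets(
  assertion: string,
  env: Record<string, string | undefined>
): string {
  return assertion.replace(/\{\{(\w+)\}\}/g, (_, name) => {
    const value = env[name];
    if (value === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
  });
}

// The stored test case references secrets only by name:
const testCase =
  "Log in as {{QA_USER_EMAIL}} with password {{QA_USER_PASSWORD}} and verify the dashboard loads";
```

The expanded string exists only in the process that launches the browser session, so the checked-in suite, the CI logs of the suite file, and the test reports stay secret-free.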
Cost and Latency Budgets
Each browser test invocation makes multiple LLM calls — typically 5–15 depending on the complexity of the user journey. At current Claude Opus pricing, a complex multi-step test runs roughly $0.05–$0.20 per execution. For a suite of 20 smoke tests on every staging deploy, you are looking at $1–$4 per deployment, which is a rounding error compared to the engineering time saved.
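That budget arithmetic is easy to encode as a pre-deploy sanity check. The per-test figures below are the rough estimates from this article, not published pricing:

```typescript
// Rough per-deploy cost envelope for an agent-driven smoke suite.
function suiteCostRange(
  testCount: number,
  minPerTest: number,   // optimistic cost of one test run, in dollars
  maxPerTest: number    // pessimistic cost of one test run, in dollars
): { low: number; high: number } {
  return { low: testCount * minPerTest, high: testCount * maxPerTest };
}

// 20 smoke tests at an assumed $0.05-$0.20 each
const perDeploy = suiteCostRange(20, 0.05, 0.2);
```

A check like this is most useful as a guardrail in the harness: alert when the observed per-deploy spend drifts above the envelope, which usually means a test has become chatty.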
Latency is the more significant operational concern. A 10-step user journey typically takes 30–90 seconds to execute, compared to 3–10 seconds for a well-written Playwright script. This makes agent-driven testing a complement to, not a replacement for, fast unit and integration tests. Reserve it for the user-journey layer where the semantic understanding advantage is most valuable.
Comparing Expect to the Existing Landscape
| Capability | Playwright | Cypress | Browser-Use | Expect |
|---|---|---|---|---|
| Requires selectors | Yes | Yes | Partial | No |
| Natural language tests | No | No | Partial | Yes |
| CI/CD integration | Yes | Yes | Manual | Yes |
| Agent skill interface | No | No | No | Yes |
| Screenshot on failure | Yes | Yes | No | Yes |
| Structured test reports | Yes | Yes | No | Yes |
| Adaptive test generation | No | No | No | Yes (via harness) |
Expect occupies a specific niche: it is not trying to replace Playwright for unit-level browser tests, and it is not a general-purpose browser agent like Browser-Use. It is specifically designed as a testable, reliable, CI-composable agent skill for user-journey verification.
Getting Started: A Practical Onboarding Path
Week 1: Replace two flaky tests. Identify your two most maintenance-intensive E2E tests — the ones that break every other sprint. Replace them with Expect-based natural language equivalents. Measure: do they break on the same cadence? (They probably won’t.)
Week 2: Add smoke test coverage for untested flows. Use Expect to cover the user journeys you never got around to automating because the selector-based approach was too expensive. The long tail of multi-step flows is where this shines.
Week 3: Integrate into CI. Wire the smoke suite into your staging deploy pipeline as a blocking gate. Calibrate the failure threshold — some teams prefer non-blocking on flaky assertions while building confidence.
Week 4: Instrument and optimize. Add structured logging to capture per-test LLM call counts and latency. Identify which test cases are chatty (>10 LLM calls) and simplify their assertion language to reduce scope.
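The week-4 instrumentation can start as a thin wrapper around the skill invocation. The result fields here (passed, llmCalls) are assumed shapes for illustration, not the SDK's actual types:

```typescript
// Wrap a skill-like runner to record per-test latency and LLM call counts.
interface RunResult { passed: boolean; llmCalls: number; }
interface TestMetrics {
  name: string;
  passed: boolean;
  llmCalls: number;
  durationMs: number;
  chatty: boolean;      // candidate for assertion simplification
}

async function instrumentedRun(
  name: string,
  run: () => Promise<RunResult>
): Promise<TestMetrics> {
  const start = Date.now();
  const result = await run();
  const durationMs = Date.now() - start;
  return {
    name,
    passed: result.passed,
    llmCalls: result.llmCalls,
    durationMs,
    chatty: result.llmCalls > 10, // the >10-call threshold suggested above
  };
}
```

Emitting these records to your existing metrics pipeline gives you the per-test cost and latency series needed to decide which assertions to tighten.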
The Bigger Picture: Agents Eating QA
Expect is one data point in a larger trend: the agentic absorption of traditionally human-intensive software engineering workflows. QA is particularly susceptible to this transition because:
- The task is fundamentally about semantic understanding of application state — something LLMs are genuinely good at
- The output is structured (pass/fail with evidence) rather than creative, which makes consistency achievable
- The maintenance burden of alternative approaches (selector-based scripts, manual QA) is high enough that even imperfect automation is a net positive
What I find most architecturally interesting is the composability story. When browser testing is an agent skill rather than a separate toolchain, it becomes natively integrable with other agent-driven workflows: deployment orchestrators, incident response agents, product monitoring agents. The agent that deploys your code can also verify it behaved correctly, without being handed off to a separate system.
That convergence — agent infrastructure that both acts on and observes the state of your application — is where the most interesting production patterns are emerging in 2026.
Further Reading
- Building Reliable Agent Harnesses: Lessons from Production
- The Agent Skill Registry: Designing for Composability
- Playwright vs. Agent-Driven Testing: A Production Comparison
Dr. Sarah Chen is a Principal Engineer focused on production AI agent systems and reliability engineering. She writes about architectural patterns for agent harnesses, agent skill design, and the operational realities of deploying LLM-powered systems at scale.
Have a production pattern to share? Contribute to our practitioner knowledge base or join the discussion on GitHub.