When the Specification Became the Test: AI as a QA Agent in a Blockchain dApp
AI is everywhere – but what’s it really like on the frontlines of AI implementation? Get into the daily thoughts and challenges faced by AI engineers – the real stuff that happens when AI meets digital products.
Weekly AI Bites is a series that gives you direct access to our day-to-day AI work. Every post comes straight from our team’s meetings and Slack, sharing insights, tests, and experiences we’re applying to real projects. What models are we testing, what challenges are we tackling, and what’s really working in products? You’ll find all of this in our bites. Want to know what’s buzzing in AI? Check out Boldare’s channels every Monday for the latest weekly AI Bite. Let’s dive into the full article.

Introduction
I’m building a blockchain lottery dApp on my own — with smart contracts in Foundry, a React frontend, and an indexer in Ponder. The app has three layers: business logic on-chain, an indexer that monitors events, and a frontend that brings it all together. Any change in one layer can potentially break something in the others.
In this article, I share an experiment using Chrome DevTools MCP + Claude Code as an autonomous QA agent. Early results are promising enough to keep exploring — and along the way, I uncovered an unexpected insight: the way requirements are written directly affects how effectively the agent tests. More on that at the end.
The Problem
After every change, I should verify that the application still works end-to-end. Manually checking every feature took about 90 minutes of clicking through the app. Everything was precisely described in the PRD as User Stories with Acceptance Criteria — the scope was clear, the problem was elsewhere.
It wasn’t just about time. It was about attention. To verify whether a feature worked, I had to interrupt coding, switch context, click through the app, and switch back. It’s a classic trap: the more expensive the feedback, the less often you collect it. The less often you collect it, the more bugs you silently accumulate — and the later you find them, the more expensive they are to fix. My goal wasn’t to replace QA — it was to find someone who would do the clicking for me.
What I Tried Before — and Why Not Playwright?
The natural question: why not Playwright? It’s a mature E2E tool, supports Chromium, and has a great ecosystem.
First problem: MetaMask. Playwright technically supports loading extensions (--load-extension), but MetaMask deliberately makes automation difficult — separate browser context, service worker, dynamic selectors, anti-bot mechanisms. Synpress is essentially a Playwright wrapper created specifically to handle MetaMask in Web3 testing. I tried it — the setup was too much for me :-) I encountered the same issue with the Agent Browser skill in Claude Code — it also can't handle wallet extensions, for the same reason: MetaMask runs as a separate process outside the reach of standard automation tools.
Second problem: I didn’t want to “cement” behavior yet. The UI was constantly evolving. Writing deterministic E2E tests in Playwright would mean encoding specific system behavior: “click button X, expect text Y, check selector Z.” With every UI change — tests to rewrite. At this stage, I didn’t need regression yet as much as smoke tests.
Claude Agent reads the Acceptance Criteria and verifies behavior, not selectors. When the UI changes, the agent simply looks at the new screen and evaluates whether the AC is satisfied — no fixtures to update. That gave me room to experiment: instead of investing in test infrastructure, I could simply see how far I could get.
The Solution
The key insight was simple: the agent needs eyes in the browser and hands on the blockchain. Chrome DevTools MCP gives it the first — it can navigate the dApp, take snapshots and screenshots, verify the UI. cast call and cast send from the Foundry toolkit give it the second — it can inspect contract state and send transactions directly, without clicking through the UI. The missing link was the wallet: the solution turned out to be a custom Chrome profile with a pre-configured test MetaMask — the agent starts every session with a ready wallet.
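As a sketch of what this looks like in practice — the contract address, RPC URL, and function signatures below are placeholders I invented for illustration, not the project's actual interface:

```bash
# Start Chrome from a dedicated profile that already has the test MetaMask
# configured; Chrome DevTools MCP then attaches to a wallet-ready browser.
google-chrome \
  --user-data-dir="$HOME/.qa-agent-profile" \
  --remote-debugging-port=9222

# "Hands on the blockchain": inspect contract state without the UI
# (read-only call, no gas, no wallet popup).
cast call $LOTTERY_ADDRESS "currentRound()(uint256)" \
  --rpc-url http://localhost:8545

# Send a transaction directly — e.g. buy a ticket — bypassing the frontend
# entirely, signed with a throwaway test key on the local Anvil node.
cast send $LOTTERY_ADDRESS "buyTicket()" \
  --value 0.01ether \
  --rpc-url http://localhost:8545 \
  --private-key $TEST_PRIVATE_KEY
```

The split matters: UI checks go through DevTools MCP, while state assertions go through cast — so the agent can distinguish "the contract is wrong" from "the frontend renders it wrong."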
But the most important part isn’t technical. The agent tests autonomously — I get an audio notification when it needs my interaction. In practice: I start a test session and go back to coding. Feedback comes to me; I don’t go looking for it.
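Claude Code supports hooks that fire on lifecycle events, which is one way to wire up the audio notification. The fragment below is a hypothetical settings sketch (macOS sound path, simplified schema — check the current Claude Code hooks documentation before copying):

```json
{
  "hooks": {
    "Notification": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "afplay /System/Library/Sounds/Glass.aiff"
          }
        ]
      }
    ]
  }
}
```

With something like this in place, the agent works silently until it genuinely needs a human — and the sound is the only interruption.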
The 15% That Still Requires My Hand
When testing through the browser UI:
- 2× transaction confirmations in MetaMask — the wallet extension popup isn’t accessible via Chrome DevTools; it requires a human click.
On a local testnet (Anvil):
- 1× round completion (make complete-draw) — on a real network (e.g., Sepolia), this is handled automatically by Chainlink nodes.
Two Agents — Two Contexts
What started as a single Claude Code window quickly evolved into two.
- Developer agent — in one Claude Code window, introduces frontend changes.
- QA agent — in the other window, tests whether nothing was broken and reports bugs to a file.
Workflow: developer fixes → QA retests.
The key advantage of this split isn’t just parallel work, but also preserving each agent’s focus and context. The developer agent operates exclusively in the context of code — it knows the architecture, change history, dependencies. The QA agent operates exclusively in the context of testing — it knows the Acceptance Criteria, the test protocol, previous results. Mixing these two types of tasks in one context window would degrade the quality of both.
An additional effect: each agent consumes a smaller context window because it processes only domain-specific information. This translates into lower costs and a smaller risk of the model “forgetting” important information.
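The bug file is the contract between the two agents. Its exact shape is up to you; here is an illustrative entry (invented for this example, not copied from the project), showing the level of detail that lets the developer agent act without asking questions:

```markdown
## BUG-007: Ticket counter not refreshed after purchase
- Severity: medium
- AC: US-4 / AC-2 ("After buying a ticket, the counter increases by 1")
- Steps: bought 1 ticket via UI, confirmed tx in MetaMask, waited for indexer
- Expected: counter shows 3
- Actual: counter still shows 2 until a full page reload
- Status: OPEN → retest after fix
```

Because the report references the AC by its identifier, the retest step is mechanical: the QA agent re-runs exactly that AC and flips the status.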
Technical Conclusions
- A properly configured Chrome profile with a wallet extension allows the agent to operate smoothly in UI tests.
- cast call and cast send (tools from the Foundry toolkit) enable the agent to interact directly with the blockchain: reading state and sending transactions — independently of the UI.
- You don’t have to automate everything — 85% automation with minimal manual work delivers real time savings and higher productivity.
This is still an experiment — the initial results are convincing, but I’m still building trust in the agent. Will it truly not miss regressions? Does it correctly interpret UI behavior in every case? These are natural questions with any new testing tool, not just AI. Next step: more sessions, more observations.
Non-Obvious Conclusions: The Specification and Protocol Became the Test
There’s something about this approach that I only realized after several sessions.
In a traditional process, you write requirements (Acceptance Criteria in the PRD), and then separately write tests that verify those requirements. These are two artifacts that must stay synchronized — and often they don’t. Tests become outdated, AC evolves, something drifts apart.
Here, the agent directly reads the AC and verifies behavior. There’s no translation layer — no “now I’ll rewrite the requirement into a test case in code.” The Given / When / Then from the User Story is simultaneously a test instruction. One artifact instead of two.
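To make this concrete, here is an illustrative AC in that shape — invented for this article, not taken from the project's PRD:

```markdown
**US-4 / AC-2 — Buying a ticket**
Given an active round with at least one ticket still available
When the user buys a ticket and confirms the transaction in MetaMask
Then the ticket counter increases by 1
And the ticket appears in the user's "My tickets" list
```

Read as a requirement, this is an ordinary acceptance criterion; read by the agent, each Then/And clause is a check it can perform against the screen or against contract state.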
This changes the economics of writing good requirements. Usually, developers treat detailed AC as a formality — “I know what to build anyway, why write it down.” In this model, the quality of AC directly translates into the quality of testing. Well-written, precise AC = the agent tests accurately. Vague, generic AC = the agent guesses and may miss something.
But the specification alone isn’t enough. Alongside AC, you need a QA protocol — a file describing how the agent should test: when to mark AC as PASS vs FAIL, how to report bugs, how to handle edge cases. Traditionally, this knowledge lives in the tester’s head. Here, it’s written down in markdown.
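A condensed sketch of what such a protocol file can contain — the rules below are examples of the genre, not the project's actual protocol:

```markdown
# QA Protocol
- PASS: every Then/And clause of the AC is observably true in the UI,
  or verifiable on-chain via cast call.
- FAIL: any clause is false, the console shows errors, or a transaction
  reverts; record the tx hash and take a screenshot.
- BLOCKED: a step needs human interaction (e.g., a MetaMask popup);
  play the notification and wait instead of guessing.
- Report every FAIL to the bug file using the agreed template; never
  fix code yourself — you are QA, not the developer.
```

The last rule is the important one: it keeps the QA agent's context clean and prevents it from quietly drifting into the developer agent's job.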
The effect is non-obvious: every agent that reads this file becomes “a junior QA who knows your standards.” Expert knowledge is documented, reusable, and accessible to every new developer on the project — independently of AI.
Your PRD defines what to test. Your protocol defines how to test. Together, they replace routine verification — without writing a single line of test code.