How We Test AI-Generated Code

13 min read · By Andrés Cano · Case Studies

Code is shipping faster than anyone can verify it.

That sentence used to be an exaggeration. In April 2026, it's a measurement. Code review activity dropped from 25% to under 10% as AI adoption grew. People are checking AI output less than they checked the code they used to write by hand. That's not a cultural opinion — 41% of production code is now AI-generated (Wikipedia / industry reports, 2026), and 92% of US developers use AI assistants daily (JetBrains / Test-Lab, 2026).

The predictable thing happened. 45% of AI-generated code fails basic security tests (Veracode, 2026). Sherlock Forensics scanned AI-generated codebases from January through April 2026 and found that 92% contained at least one critical vulnerability. Average app: 8.3 exploitable findings.

I built the AI Native QA Protocol to close this specific gap. Not to sell fear about AI tools (I use them every day). To build the verification layer that didn't exist. Here's how it works, phase by phase, and where it draws the line.


Phase 1: Build the Map

The first thing we do isn't analysis. It's orientation.

Clone the repo. Identify what already exists: test suites (usually none), CI pipelines (sometimes), documentation (rarely), existing bug reports. Most AI-built projects don't have a requirements document. The codebase IS the spec, and nobody wrote down what it's supposed to do.

So we extract it. From the code, from the README if there is one, from product docs, from a 30-minute call with the team. Then we build a component inventory: every visual component, every back-end service, every integration point, catalogued and labeled.

From that inventory, we define happy paths and golden paths. Not just UI flows. Database validations. API integration points. The paths actual users take through the system and the paths that keep data correct.

Then we build the artifact that makes the entire rest of the process measurable: a map of components to paths to coverage targets — the traceability matrix. This is the denominator. Without it, "we improved test coverage" is a sentence with no number attached. With it, every iteration moves the numerator.
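
A minimal sketch of that denominator, assuming the matrix is just a list of paths, each tied to the components it touches. The class names, path names, and component names are hypothetical; a real matrix would also carry coverage targets per path.

```python
from dataclasses import dataclass, field


@dataclass
class PathEntry:
    """One happy or golden path traced through the system (illustrative)."""
    name: str
    components: list[str]
    covered: bool = False  # True once an automated test exercises the path


@dataclass
class TraceabilityMatrix:
    paths: list[PathEntry] = field(default_factory=list)

    def coverage(self) -> float:
        """Covered paths divided by all mapped paths (the denominator)."""
        if not self.paths:
            return 0.0
        return sum(p.covered for p in self.paths) / len(self.paths)

    def uncovered_components(self) -> set[str]:
        """Components that do not appear in any covered path."""
        covered = {c for p in self.paths if p.covered for c in p.components}
        everything = {c for p in self.paths for c in p.components}
        return everything - covered


matrix = TraceabilityMatrix([
    PathEntry("sign-up", ["SignupForm", "auth-service", "users-table"], covered=True),
    PathEntry("checkout", ["CartView", "payments-api", "orders-table"]),
    PathEntry("webhook-sync", ["webhook-endpoint", "orders-table"]),
    PathEntry("login", ["LoginForm", "auth-service"], covered=True),
])
print(f"path coverage: {matrix.coverage():.0%}")
print(sorted(matrix.uncovered_components()))
```

Two of four paths are covered here, so the number is 50%, and the uncovered components tell you where the next iteration should go.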

The output of Phase 1: a scoped analysis plan, a component inventory, a traceability matrix, and context-dependent coverage benchmarks. The engagement hasn't run a single scanner yet. That's deliberate. Static analysis without context is noise. It produces findings with no prioritization. The inventory turns the codebase into a map with labeled regions before anything else touches it.

One thing I've learned running this on real projects: the component inventory almost always surfaces integration points the team forgot they built. An auth callback to a third-party service. A webhook endpoint added three months ago and never tested. These show up in the inventory because the inventory doesn't rely on the team's memory of what they shipped.


Phase 2: Custom Agent Analysis

This is where the Protocol stops looking like "run a tool" and starts looking like engineering.

We build custom AI agents per project context. These aren't generic scanners. Each agent goes through its own development lifecycle: designed for the project's stack and patterns, built with research-backed prompt engineering, tested against known findings, and validated before it reviews a single line of client code.

I'm aware of the obvious objection: AI analyzing AI-generated code creates correlated failure risk. Same family of models, same blind spots, same probability distributions. This is exactly why the agents aren't off-the-shelf. They're engineered with the project's specific context: its stack, its domain, its acceptance criteria from Phase 1. And they go through their own QA cycle. The research and validation happen before the first analysis pass, not after.
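
One way to picture "tested against known findings" is a gate that compares what an agent reports with a seeded set of findings. This is a sketch, not the Protocol's actual harness: the function name, the (file, rule) keys, and the thresholds are all assumptions.

```python
def validate_agent(reported, known, min_recall=0.9, min_precision=0.8):
    """Compare an agent's findings with a seeded set of known findings.

    Findings are (file, rule_id) pairs. The thresholds are assumed defaults.
    """
    reported, known = set(reported), set(known)
    true_positives = reported & known
    recall = len(true_positives) / len(known) if known else 1.0
    precision = len(true_positives) / len(reported) if reported else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "missed": sorted(known - reported),
        "passed": recall >= min_recall and precision >= min_precision,
    }


known = [
    ("api/auth.py", "hardcoded-credential"),
    ("api/orders.py", "sql-injection"),
    ("web/cart.tsx", "missing-error-handling"),
]
reported = [
    ("api/auth.py", "hardcoded-credential"),
    ("api/orders.py", "sql-injection"),
    ("web/home.tsx", "insecure-default"),
]

result = validate_agent(reported, known)
print(result["passed"], result["missed"])
```

In this example the agent misses one seeded finding and reports one that was not seeded, so it fails the gate and would go back for another iteration before touching client code. A check like this only measures the blind spots you thought to seed; it narrows the correlated-failure risk without removing it.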

The agents run against the acceptance criteria mapped in Phase 1 and produce:

  • A structured findings report in SARIF format with severity-tagged findings (Critical / High / Medium / Low / Info), each mapped to file and line range with a confidence rating and recommended fix
  • OWASP Top 10 security screening in the same pass: auth gaps, exposed secrets, injection paths (SQL, XSS, log injection), insecure defaults, missing rate limiting, hardcoded credentials
  • A human-readable Markdown report unifying quality and security findings
  • A fix prompt — a structured prompt the dev team feeds to Cursor or Claude Code to apply fixes autonomously
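
To make the first artifact concrete, here is a sketch of one finding serialized as a SARIF 2.1.0 log. SARIF itself only defines the levels none, note, warning, and error, so the mapping from the five severities is an assumption, as are the input field names, the tool name, and the choice to carry confidence and the recommended fix in the property bag.

```python
import json

# Assumed mapping: the five-step severity scale folded into SARIF levels.
SEVERITY_TO_LEVEL = {
    "Critical": "error", "High": "error",
    "Medium": "warning", "Low": "note", "Info": "note",
}


def to_sarif_result(finding: dict) -> dict:
    """Convert one finding (hypothetical shape) into a SARIF result object."""
    return {
        "ruleId": finding["rule_id"],
        "level": SEVERITY_TO_LEVEL[finding["severity"]],
        "message": {"text": finding["message"]},
        "locations": [{
            "physicalLocation": {
                "artifactLocation": {"uri": finding["file"]},
                "region": {"startLine": finding["start_line"],
                           "endLine": finding["end_line"]},
            }
        }],
        # Non-standard fields live in the property bag.
        "properties": {
            "severity": finding["severity"],
            "confidence": finding["confidence"],
            "recommendedFix": finding["fix"],
        },
    }


def to_sarif_log(findings: list[dict], tool_name: str = "example-agent") -> dict:
    """Wrap results in the minimal log structure: version, runs, tool, results."""
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name}},
            "results": [to_sarif_result(f) for f in findings],
        }],
    }


finding = {
    "rule_id": "hardcoded-credential", "severity": "Critical",
    "message": "API key is committed in source.", "file": "api/auth.py",
    "start_line": 12, "end_line": 12, "confidence": "high",
    "fix": "Read the key from an environment variable.",
}
sarif_log = to_sarif_log([finding])
print(json.dumps(sarif_log, indent=2))
```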

That last artifact matters more than people expect. The Protocol doesn't just find problems. It gives the team's AI tooling a structured path to fix them. The analysis finds root causes; the fix prompt translates them into instructions the same tools that wrote the code can execute.
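
A fix prompt can be as simple as the findings rendered in severity order with explicit scope limits. The sketch below assumes a hypothetical finding shape and wording; the real prompt format is not specified here.

```python
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low", "Info"]


def build_fix_prompt(findings: list[dict]) -> str:
    """Render findings as numbered instructions, most severe first."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
    lines = [
        "Apply the fixes below, most severe first.",
        "Change only the listed files and keep public behaviour unchanged.",
        "",
    ]
    for number, f in enumerate(ordered, start=1):
        lines += [
            f"{number}. [{f['severity']}] {f['file']}:{f['start_line']}-{f['end_line']}",
            f"   Problem: {f['message']}",
            f"   Fix: {f['fix']}",
        ]
    return "\n".join(lines)


findings = [
    {"severity": "Medium", "file": "web/cart.tsx", "start_line": 40, "end_line": 52,
     "message": "Fetch error is swallowed.", "fix": "Surface the error state to the user."},
    {"severity": "Critical", "file": "api/auth.py", "start_line": 12, "end_line": 12,
     "message": "Hardcoded API key.", "fix": "Read the key from an environment variable."},
]
prompt = build_fix_prompt(findings)
print(prompt)
```

The point of the structure is that the coding assistant gets a file, a line range, a root cause, and a prescribed change for every item, instead of a vague "fix the security issues."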

What the data shows, project after project: static analysis surfaces roughly 4x more findings than teams knew about before the engagement. The same categories repeat in AI-generated code (state management bugs, auth gaps, missing error handling, insecure defaults) no matter the app or the stack. Teams that filed bugs manually were chasing symptoms. The analysis finds structure.

Security isn't a separate engagement. It's the same pass. The OWASP categories where AI-generated code fails most often are screened by default because there's no reason to separate them. A hardcoded API key and a missing null check are both in the same report, both severity-tagged, both in the fix prompt.


Phase 3: Tests That Evolve With the Code

A one-time audit is a snapshot. If the team ships every week, the snapshot is stale by Friday.

Phase 3 builds a test layer that moves with the codebase. When code logic changes (new PR, new feature, refactor), the test layer detects it. We develop test cases to the point where every new enhancement or release re-verifies the paths it touches through the code.

The cycle: a PR lands. Analysis identifies what changed. Test cases update to cover new paths. The next PR benefits from expanded coverage. Each iteration catches things the previous one didn't, because the analysis has more context about how the codebase evolves.
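
The "analysis identifies what changed" step can be sketched as a lookup from changed files to the paths in the traceability matrix. Everything here is illustrative: the function names, the file-to-component mapping, and the sample data are assumptions, and in practice the changed-file list would come from something like `git diff --name-only`.

```python
def paths_needing_review(changed_files, component_files, path_components):
    """Return the path names whose components own at least one changed file."""
    changed = set(changed_files)
    touched = {c for c, files in component_files.items() if changed & set(files)}
    return sorted(p for p, comps in path_components.items() if touched & set(comps))


def unmapped_files(changed_files, component_files):
    """Changed files that belong to no inventoried component yet."""
    known = {f for files in component_files.values() for f in files}
    return sorted(set(changed_files) - known)


component_files = {
    "auth-service": ["api/auth.py"],
    "payments-api": ["api/payments.py"],
    "CartView": ["web/cart.tsx"],
}
path_components = {
    "login": ["auth-service"],
    "checkout": ["CartView", "payments-api"],
}
changed = ["api/payments.py", "api/refunds.py"]

to_review = paths_needing_review(changed, component_files, path_components)
new_surface = unmapped_files(changed, component_files)
print(to_review, new_surface)
```

Here the change touches the checkout path, so its tests get re-examined, and `api/refunds.py` shows up as a file the inventory has never seen. That second list is how the map keeps growing with the code.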

Plenty of AI-native teams push straight to main. No PR workflow at all. The Protocol adapts. The recursive enhancement triggers on pushes to main instead. Less granular, but the loop doesn't break. We offer PR workflow education as part of the engagement. We don't require it.

Coverage is measured against the benchmarks from the traceability matrix. The metric depends on project context. UI-heavy projects benchmark UI path coverage. Endpoint-heavy projects benchmark API coverage. Database-heavy projects benchmark data validation coverage. Coverage is a number with a denominator from Phase 1, not a feeling.

This is where QA becomes preventive instead of reactive. The gap between "code shipped" and "code verified" shrinks with every iteration instead of growing.


Phase 4: Permanent Automation

Everything before this phase could still depend on us running it. Phase 4 removes that dependency.

Playwright automation covers UI flow testing, regression suites, and critical path verification. It runs against every build candidate and covers the paths that matter most based on the component inventory from Phase 1. The Protocol does not include manual testing. The Playwright layer replaces it entirely.

Cloudflare Workers handle the back-end monitoring:

  • Scheduled health checks for endpoints
  • Database health monitoring
  • Automated endpoint testing triggered on deploy or on schedule
  • Deterministic data validation. Format checks and data-availability checks. If an aggregator fails, the Worker automatically requests a retry for the failed section only.
  • AI data quality scoring. A second layer scores data quality across multiple validation dimensions. Structured JSON output tells the agent exactly where each score comes from. This isn't "is the endpoint up." It's "is the data correct and complete."
  • A production feedback loop: Workers in production surface issues that feed back into the recursive test enhancement cycle from Phase 3. Bugs that reach production don't just get fixed. They expand the Protocol's coverage boundary for the next cycle.
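
The deterministic validation and per-section retry described in the list above can be sketched as plain logic. This is not Worker code (a real Worker would typically express it in JavaScript or TypeScript); the field rules, section names, and the `fetch_section` callback are hypothetical.

```python
import json

# Assumed format rules for one row of aggregated data.
REQUIRED_FIELDS = {"id": str, "price": float, "updated_at": str}


def validate_section(rows) -> list[str]:
    """Deterministic checks: data is present and each field has the expected type."""
    if not rows:
        return ["section is empty"]
    errors = []
    for index, row in enumerate(rows):
        for name, expected in REQUIRED_FIELDS.items():
            if not isinstance(row.get(name), expected):
                errors.append(f"row {index}: {name!r} missing or not {expected.__name__}")
    return errors


def check_with_retry(sections, fetch_section, max_retries=2):
    """Validate every section and re-fetch only the sections that fail."""
    report = {}
    for name, rows in sections.items():
        errors = validate_section(rows)
        attempts = 0
        while errors and attempts < max_retries:
            attempts += 1
            rows = fetch_section(name)  # retry just this section
            errors = validate_section(rows)
        report[name] = {"ok": not errors, "retries": attempts, "errors": errors}
    return report


def fake_fetch(section_name):
    """Stand-in for the aggregator call; returns a valid row on retry."""
    return [{"id": "b2", "price": 3.0, "updated_at": "2026-04-02"}]


sections = {
    "prices": [{"id": "a1", "price": 9.5, "updated_at": "2026-04-01"}],
    "inventory": [],  # the aggregator failed for this section
}
report = check_with_retry(sections, fake_fetch)
print(json.dumps(report, indent=2))
```

The structured report is what makes the second, AI-scored layer useful: it starts from a record of which sections passed, which needed a retry, and exactly which checks failed.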

CI integration ties it together. Static analysis runs on every PR (or push to main). Security screening on every push. Test suites execute automatically. Findings surface in whatever project management tool the team already uses. Linear, Jira, GitHub Issues, doesn't matter.

Contract and integration testing prevent cross-service breakage: does service A still send what service B expects after today's deploy? API contract tests run in the CI pipeline. Integration points flagged during the component inventory in Phase 1 are covered by automated contract checks.
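
At its smallest, a contract check is a comparison between what the provider sends and what the consumer expects. The sketch below assumes a contract expressed as field names and Python types; the contract, the payload, and the function name are hypothetical, and real projects often use a schema language or a dedicated contract-testing tool instead.

```python
def contract_violations(payload: dict, contract: dict) -> list[str]:
    """List the ways a provider payload breaks what the consumer expects."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in payload:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected_type):
            actual = type(payload[field_name]).__name__
            problems.append(
                f"{field_name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems  # extra fields are tolerated on purpose


# What service B expects from service A (illustrative).
ORDER_CONTRACT = {"order_id": str, "total_cents": int, "currency": str}

# What service A sends after today's deploy (illustrative).
payload_after_deploy = {"order_id": "o-17", "total_cents": "1999", "extra": True}

violations = contract_violations(payload_after_deploy, ORDER_CONTRACT)
print(violations)
```

Run in CI, a non-empty list fails the build before the consumer ever sees the changed payload.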

For mobile-specific paths: Playwright covers web and hybrid UI flows. Device cloud testing (BrowserStack, AWS Device Farm) handles native-specific paths: OS version behavior, device-specific rendering, responsive breakpoints. Paths that need hardware-specific testing (Bluetooth, NFC, camera, push notification delivery) get flagged during the component inventory and documented as team-owned. That boundary is visible before the engagement starts.

The exit condition for Phase 4 isn't "we found the bugs." It's "the system keeps finding them without us." Everything deployed in this phase runs whether Academ-ia is involved or not. The team owns the infrastructure.


The Boundary

This is the section that makes the Protocol credible. Any methodology that claims to cover everything is selling you something.

The Protocol owns verification infrastructure, the system that catches problems. The team owns production code and production decisions, what ships and when.

The Protocol covers:

  • Component inventory and traceability matrix
  • Static analysis with custom agents (built, tested, validated per project)
  • Security screening (OWASP Top 10 surface pass)
  • Recursive test enhancement (tests evolve with every PR or push)
  • UI automation (Playwright), endpoint automation (Workers), CI integration
  • Data validation (deterministic + AI scoring)
  • Database health monitoring
  • Contract and integration testing
  • Fix prompts the team can execute
  • Coverage benchmarks set per project context
  • PR workflow education (or push-to-main fallback)
  • Device cloud integration for mobile paths where scoped
  • Dashboards and reporting in the team's existing tool
  • Production feedback loop: issues that reach production expand the Protocol's coverage for the next cycle

The team owns:

  • Applying the fixes. The Protocol finds and prescribes; the team or their AI executes
  • Production deployment decisions. The Protocol provides release signals, not release authority
  • Business logic correctness. The Protocol verifies code against requirements. If the requirements are wrong, the tests pass and the product is still wrong
  • Exploratory testing and edge cases outside the component inventory. The inventory is scoped to known components; unknown unknowns are the team's territory
  • Test data beyond what the custom databases cover (third-party sandbox accounts, production-like data seeding for scenarios the Protocol hasn't mapped)
  • Native mobile paths not coverable by Playwright or device cloud: Bluetooth, NFC, camera, hardware-specific behavior

The boundary is set during Phase 1. It's visible from day one. And it moves. The production feedback loop means that when a bug reaches production, you can trace whether it was inside the Protocol's coverage boundary or outside it. If it was outside, the next cycle expands the boundary to include it. The coverage surface grows with every iteration.
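
As a sketch of that tracing step, assume the boundary is tracked as a set of covered path names. The class, the bug IDs, and the path names are hypothetical; the logic only shows the inside/outside decision and the expansion.

```python
from dataclasses import dataclass, field


@dataclass
class CoverageBoundary:
    covered_paths: set[str] = field(default_factory=set)
    history: list[tuple[str, str]] = field(default_factory=list)

    def record_production_bug(self, bug_id: str, path: str) -> str:
        """Classify a production bug and widen the boundary if it was outside."""
        if path in self.covered_paths:
            verdict = "inside"   # a covered path failed: the tests missed it
        else:
            verdict = "outside"  # never covered: add the path for the next cycle
            self.covered_paths.add(path)
        self.history.append((bug_id, verdict))
        return verdict


boundary = CoverageBoundary({"login", "checkout"})
first = boundary.record_production_bug("BUG-1", "webhook-sync")
second = boundary.record_production_bug("BUG-2", "webhook-sync")
print(first, second, sorted(boundary.covered_paths))
```

The first bug lands outside the boundary and expands it. The second bug on the same path counts as inside, which is the more uncomfortable signal: the path was covered and the tests still missed it.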

I don't guarantee zero bugs in production. No process does, and anyone who claims otherwise is lying or confused. What I guarantee is that the boundary between "covered" and "not covered" is visible and measurable, and that the coverage inside it compounds.


Why the Pattern Holds

The same sequence (inventory, custom agents, recursive testing, automation) produces the same results whether the team is four people or thirty, whether the codebase is React Native or a Python API, whether 20% or 90% of the code came from an AI. It works because it treats QA as an engineering system with a visible boundary, not as a checkbox someone ticks before release.

The QA gap in vibe coding isn't a tooling problem. The tools exist. Playwright, Workers, SARIF, custom agents. It's an orchestration problem. Running those tools inside an engineered process with measurable coverage and a clear boundary is what was missing.

I described each phase in enough detail that a senior engineer could build a version of this themselves. The Protocol's value isn't in secrecy. It's in the compounding effect of running it. The traceability matrix gets richer. The agents get sharper. The coverage boundary gets wider. The first pass finds a lot. The fourth pass finds things the first never could.

If you're building with AI tools and shipping without a QA process, start with the component inventory. Just knowing what you built, every component, every integration point, every path users actually take, is the single most useful step. Everything else builds on that map.


FAQ

What is the AI Native QA Protocol?

A four-phase QA methodology for codebases built with AI tools. It starts with a component inventory and traceability matrix, runs static analysis through custom-built AI agents, builds a recursive test layer that evolves with every PR, and deploys permanent automation (Playwright, Cloudflare Workers, CI) that runs without ongoing involvement. The boundary between what the Protocol covers and what the team owns is defined from day one.

How do you QA vibe-coded applications?

Map every component first: visual, back-end, and integration. Build a traceability matrix. Run scoped static analysis with custom agents engineered per project. Build test cases that evolve with every code change. Automate UI paths, endpoint validation, data quality checks, and contract tests. Measure coverage against context-specific benchmarks.

Does static analysis work on AI-generated code?

AI-generated code follows patterns that static analysis catches well: state management bugs, auth gaps, missing error handling, insecure defaults. Custom agents built per project context surface roughly 4x more findings than teams knew about. The agents go through their own development lifecycle before they review any code.

What security issues does AI-generated code have?

The OWASP categories that show up most: authentication gaps, exposed secrets, injection paths (SQL, XSS, log injection), insecure defaults, missing rate limiting, hardcoded credentials. 45% of AI-generated code fails basic security tests (Veracode, 2026). 92% of AI-generated codebases contain at least one critical vulnerability (Sherlock Forensics, Jan–Apr 2026). The Protocol screens for these in the same pass as quality analysis.

Can QA automation keep up with AI-assisted development speed?

Yes — when the test layer evolves with the code. Static test suites degrade as AI-assisted teams ship fast. Recursive test enhancement (where analysis and test generation run on every PR or push) means coverage compounds instead of eroding. Coverage is measured against a traceability matrix, not assumed.

What's the difference between AI QA tools and the AI Native QA Protocol?

Tools like Mabl, CodeRabbit, or QA Wolf generate tests or review code. The Protocol orchestrates when and how those capabilities run across the full release cycle: component inventory, custom agent analysis, recursive testing, permanent automation, and a production feedback loop. The tool is never the problem; the hard part is running it inside an engineered process.

What doesn't the AI Native QA Protocol cover?

The Protocol builds verification infrastructure. The team owns production code, deployment decisions, business logic correctness, and exploratory testing outside the component inventory. Hardware-specific mobile paths (Bluetooth, NFC, camera) are documented as team-owned during the inventory. The boundary is visible and measurable from day one, and it expands with every cycle as the production feedback loop adds coverage.

Start with the component inventory

If you're building with AI tools and shipping without a QA process, the single most useful first step is knowing what you built. We can help you build that map.

How We Test AI-Generated Code · Academ-ia