Engineering

The Quality Gate Protocol: How We Ship Code That Actually Works

Most AI-built code ships fast and breaks faster. We fixed 100 bugs across 11 projects in one overnight session — autonomously. Here's the testing discipline that made that possible, and the course that teaches it.

March 20, 2026
10 min read
#testing #quality-gates #playwright

Someone told me recently they were "taking a few steps back" from building with AI because every time they shipped something, they'd discover the AI had lied to them. Tests passed but the feature was broken. The code looked clean but the logic was wrong. They were excited to build, then defeated when they found the slop.

I know that feeling. I've been there.

The difference is I built a system to make sure it never happens twice.

DOCTRINE

The finish line is not a merged PR. The finish line is a running system — process starts, logs are clean, data flows, state files are correct. Everything else is a checkpoint, not a destination.


The Problem Nobody Talks About

AI can write code faster than any human. But speed without verification is just shipping bugs at scale.

Here's what happened to us on February 26, 2026. We were building a sync feature for Mission Control — pulling 200+ tickets from Asana into a kanban board. Standard stuff. Four PRs by end of day. Three bug fixes. One root cause connecting them all.

Bug #1: CodeRabbit suggested we use urllib.parse.urljoin for URL construction — "safer, more robust." We applied it. It broke pagination entirely because urljoin("https://app.asana.com/api/1.0", "/sections/123/tasks") silently drops the /api/1.0 prefix. The AI reviewer was wrong. The original string concat was correct.
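
The urljoin pitfall is easy to reproduce in isolation. A minimal sketch using only the standard library; the paths mirror the ones from the bug:

```python
from urllib.parse import urljoin

base = "https://app.asana.com/api/1.0"

# urljoin treats a path starting with "/" as absolute, so it replaces
# the base path entirely and silently drops the /api/1.0 prefix.
broken = urljoin(base, "/sections/123/tasks")

# Plain concatenation keeps the prefix, which is what the API expects.
correct = base + "/sections/123/tasks"

assert broken == "https://app.asana.com/sections/123/tasks"
assert correct == "https://app.asana.com/api/1.0/sections/123/tasks"
```

The "safer, more robust" suggestion was only safer for inputs nobody was sending.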

Bug #2: Our unit tests had 90.67% coverage. Every mock returned clean, single-page responses. The real Asana backlog had 100+ tasks. Pagination fired on the first sync. Every unit test lied by omission.

Bug #3: The sync button showed "Syncing..." then snapped back to "Sync Now" with zero feedback. The API returned a 500 because of bugs #1 and #2. No error handling. No user feedback. The UI just... pretended nothing happened.

Three bugs. All shipped past code review, two AI reviewers, automated CI, and 90% test coverage.

  • Code Coverage: 90.67% (all tests passed)
  • AI Reviewers: 2 (CodeRabbit + Gemini)
  • Bugs Shipped: 3 (past all gates)

The Quality Gate Protocol

After that day, I wrote the rules that now govern every project in the Tesseract Intelligence ecosystem. Not guidelines. Not suggestions. Gates — hard stops that block delivery until satisfied.

Gate 1: Test Strategy Before Code

The most expensive lesson we learned building InDecision Command: we shipped 148 pages without a test strategy. No Playwright. No visual baselines. No edge case planning. Knox caught issues that should have been caught programmatically. We had to retrofit tests after the fact, creating tech debt and missed bugs.

The build order is now non-negotiable:

  1. PRD — what to build
  2. UX Audit — how users experience it
  3. Test Strategy — how to verify it works
  4. Build — write the code
  5. Playwright verification — eyes during development

SIGNAL

Signal: A test strategy forces you to think about user flows, edge cases, error handling, and acceptance criteria BEFORE you write code. It's cheaper to find bugs in a document than in production.

The test strategy must include:

  • User flows (happy path for every feature)
  • Error cases and edge cases
  • Integration testing plan
  • UI testing plan
  • Visual baseline strategy at 3 breakpoints
  • Accessibility requirements
  • Performance budgets
  • Security testing
  • Content integrity
  • Coverage targets (90% lines, 85% functions, 80% branches)
  • CI/CD integration

Gate 2: E2E Against Live Systems

Mocks lie. We proved it. A mocked test passed for months while the real API behaved completely differently. The unit test said "everything works." Production said otherwise.

The rule: E2E testing against live APIs before merge. Mocks are useful for development speed, but the finish line is a real request returning a real response.

# This passes: the mock returns one clean, single-page response
mock_response = {"data": [{"id": 1}], "next_page_token": None}
assert sync(mock_response) == expected  # the pagination path never executes

# This fails in production:
# the real API returns paginated data the mock never simulated
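
What the single-page mock hid is a loop. A minimal sketch of a pagination-aware sync, assuming a hypothetical fetch_page(url, token) helper in place of the real Asana client:

```python
def sync_all(fetch_page, url):
    """Collect every record across pages. fetch_page is injected so a
    live client or a multi-page fake can be swapped in for testing."""
    tasks, token = [], None
    while True:
        page = fetch_page(url, token)        # one API call per page
        tasks.extend(page["data"])
        token = page.get("next_page_token")  # None on the last page
        if token is None:
            return tasks

# A two-page fake exposes the bug a single-page mock never could:
pages = {
    None: {"data": [{"id": 1}], "next_page_token": "p2"},
    "p2": {"data": [{"id": 2}], "next_page_token": None},
}
fake_fetch = lambda url, token: pages[token]
assert [t["id"] for t in sync_all(fake_fetch, "/tasks")] == [1, 2]
```

A fake like this is still not a live API, which is why the gate requires a real request before merge — but it at least forces the pagination path to execute.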

Gate 3: Playwright as Development Eyes

This is the one that changed everything.

Before: build → push → merge → hope it looks right. After: build → Playwright screenshots → visual retro → fix issues → then merge.

On March 2, 2026, we ran a Playwright visual retro on Mission Control. In a single session, we caught four issues that code review never would have found:

  1. Bold markdown rendering was broken (survived 3 docker restart cycles)
  2. Stat card colors were bland and undifferentiated
  3. Category bar was cluttered
  4. Activity tab layout was flat and hard to scan

Code review evaluates logic correctness. It cannot evaluate visual outcomes. Screenshots are the only ground truth for UI work.

INSIGHT

Insight: After EVERY UI delivery: Playwright screenshot at 3 breakpoints, visual retro, file issues, fix in the same session. This is not optional polish — it's the QA gate. Catching 4 issues in one retro cycle beats 4 separate "Knox-reports-a-bug" cycles.
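
The three-breakpoint capture is a short script. A sketch using Playwright's sync API — the breakpoint widths and output paths are our conventions, not Playwright defaults, and the function assumes Playwright and its browsers are installed:

```python
# House convention: three review breakpoints (mobile, tablet, desktop).
BREAKPOINTS = {"mobile": 375, "tablet": 768, "desktop": 1440}

def capture_retro_shots(url, out_dir="shots"):
    """Save one full-page screenshot per breakpoint for the visual retro."""
    from playwright.sync_api import sync_playwright  # needs `playwright install`
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for name, width in BREAKPOINTS.items():
            page = browser.new_page(viewport={"width": width, "height": 900})
            page.goto(url)
            page.screenshot(path=f"{out_dir}/{name}-{width}.png", full_page=True)
            page.close()
        browser.close()
```

Three PNGs per delivery, reviewed in the same session — the screenshots are the ground truth the code review can't provide.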

Gate 4: Multi-Agent Code Audit (The Audit Swarm)

On March 20, 2026, we ran our first portfolio-wide code audit across 11 projects. Not one reviewer. Not two. A swarm of specialized agents — backend, DevOps, and architect — reviewing in parallel.

The results:

  • Projects Audited: 11 (full portfolio)
  • Bugs Found: ~100 (36 P0 + 47 P1 + 18 P2)
  • Fix Time: 3 hours (10 parallel agents, overnight)
  • False Positive Rate: 15% (verify before fix)

10 parallel agents fixed 100 bugs, created 19 PRs, and merged them — all while I was asleep. The key insight: each specialization catches the blind spots the others miss. DevOps found a 0.0.0.0 exposure. The architect found dead code with passing tests. Backend found duplicate event handlers.

But we also learned that 15% of audit findings were false positives. The rule: verify before fix. Grep for the method or pattern before writing code. "The audit says X exists" is not the same as "X exists now."
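
"Verify before fix" can itself be scripted. A minimal sketch using only the standard library: before acting on an audit finding like "method X is unused," confirm the symbol actually appears in the tree today:

```python
from pathlib import Path

def find_symbol(root, symbol, suffix=".py"):
    """Return (path, line_no) for every line mentioning symbol.
    An audit finding with zero hits is stale: verify before fix."""
    hits = []
    for path in Path(root).rglob(f"*{suffix}"):
        for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if symbol in line:
                hits.append((str(path), n))
    return hits
```

If the search comes back empty, the finding describes a codebase that no longer exists — exactly the 15% false-positive trap.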

Gate 5: The Delivery Checklist

A change is ready for production when ALL of the following are true:

  • Unit test coverage at or above 90%
  • All CI checks pass (GitHub Actions green)
  • E2E validation completed on the real system
  • Process starts cleanly, logs are clean, data flows correctly
  • State files and checkpoints validated
  • External prerequisites confirmed (not TODO'd)
  • Regression test exists for every bug fix

That last one is critical. Every bug fix requires a regression test FIRST. Write the test that fails, then fix the bug. This ensures the bug can never silently return.
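
Here is the regression-test-first rule applied to bug #1 from this post. A sketch in pytest style — the function name and module layout are hypothetical — written before the fix, it fails against the urljoin version and passes against the corrected concatenation:

```python
# test_asana_urls.py (hypothetical module layout)
BASE = "https://app.asana.com/api/1.0"

def build_task_url(section_id):
    # The fix: plain concatenation keeps the /api/1.0 prefix that
    # urllib.parse.urljoin silently dropped (bug #1).
    return f"{BASE}/sections/{section_id}/tasks"

def test_task_url_keeps_api_prefix():
    url = build_task_url("123")
    assert url.startswith("https://app.asana.com/api/1.0/")
    assert url == "https://app.asana.com/api/1.0/sections/123/tasks"
```

With this in the suite, any future reviewer who reintroduces urljoin gets a red build instead of a broken sync.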


The Compound Learning Loop

Quality gates are only half the system. The other half is what happens after something breaks.

Every coding session ends with a retro:

  1. What went wrong? (the mistake)
  2. Why? (root cause, not surface symptom)
  3. What's the rule? (specific, actionable, testable)
  4. Detection latency — how long before we noticed?
  5. Detection method — how did we find it?
  6. Alerting gap — what monitoring would have caught it sooner?

These retros get stored in the project's lessons.md and indexed in our knowledge system. When a lesson keeps appearing, the rule isn't strong enough — we escalate it from project-level to system-level.

Be brilliant in the basics. Advanced skills are built on a mastery of fundamentals.

James Mattis · Call Sign Chaos

This is how compound learning works. No lesson captured = no growth. A lesson that doesn't change behavior is just documentation.


The Results

Since implementing the Quality Gate Protocol:

  • Foresight (our prediction engine): 1,970+ tests, --mode conservative enforcement, stop-loss discipline
  • Mission Control: 313+ tests, Playwright visual retros after every UI delivery
  • jeremyknox.ai: 102 E2E tests covering all 7 UX innovations
  • Portfolio-wide: overnight autonomous audit fixing 100 bugs across 11 projects

The person who told me they were "taking steps back" from AI — they don't have a tool problem. They have a verification problem. AI is the most powerful coding partner that has ever existed. But power without discipline is just chaos moving fast.


Learn the Full System

I built an entire course on this in the Academy: Quality Engineering Mastery — 6 lessons covering the test strategy framework, E2E patterns, Playwright workflows, multi-agent audits, visual QA, and the delivery checklist.

The article tells you what we do. The course teaches you how to build it yourself.

DOCTRINE

The quality gate is not where you slow down. It's where you stop shipping bugs and start shipping confidence.
