When the AI Code Reviewer Is Wrong: Lessons from a Day of Agentic Engineering
We shipped four bugs past code review, passing CI, and two AI reviewers in a single day. Here's what that taught me about the real limits of agentic coding — and the one discipline that would have caught all of them.

Yesterday I spent a full day building with AI — Claude Code writing the implementation, CodeRabbit and Gemini reviewing the PRs, automated tests running CI. The whole modern agentic stack.
We shipped four bugs past all of it.
Not catastrophic bugs. Not security holes. But production bugs that required fixes, redeployments, and — most importantly — honest reflection on what the tooling actually protects you from and what it doesn't.
Here's what happened and what I'm taking forward.
The Setup
We were building a sync feature for Mission Control (my internal ops dashboard) to pull 200+ tickets from Asana and populate a kanban board. Standard stuff: a FastAPI endpoint, a paginated REST API call, and a React UI with a Sync button.
Four PRs by end of day. Three bug fixes. One root cause that connected them all.
Bug #1: The AI Reviewer Made It Worse
The original pagination code was straightforward:
url = f"{_ASANA_BASE}{nxt['path']}" if nxt.get("offset") else ""
CodeRabbit flagged it. Said I should use urllib.parse.urljoin for URL construction — safer, more robust, standard library. The suggestion looked clean. I applied it:
url = urllib.parse.urljoin(_ASANA_BASE, path) if path else ""
This broke pagination entirely.
urljoin("https://app.asana.com/api/1.0", "/sections/123/tasks?...") returns "https://app.asana.com/sections/123/tasks?..." — dropping the /api/1.0 prefix because the second argument starts with /. The original string concat was correct. The "improved" version was silently wrong.
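The behavior is easy to confirm in a few lines. This is a minimal repro, not the project's code; the base and path values are illustrative, and `opt_fields=name` is a made-up query string:

```python
from urllib.parse import urljoin

base = "https://app.asana.com/api/1.0"
path = "/sections/123/tasks?opt_fields=name"

# urljoin treats a leading "/" as root-relative, so the /api/1.0 prefix is discarded:
print(urljoin(base, path))
# https://app.asana.com/sections/123/tasks?opt_fields=name

# Plain concatenation keeps the prefix intact:
print(base + path)
# https://app.asana.com/api/1.0/sections/123/tasks?opt_fields=name
```

This is exactly the RFC 3986 resolution rule `urljoin` implements: a relative reference beginning with `/` replaces the entire path of the base URL.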
The unit tests passed because our mocks didn't paginate. CI passed. CodeRabbit gave it a green checkmark on the follow-up review. It shipped.
The lesson: AI reviewers are pattern-matchers. They see "URL construction" and suggest urljoin without running the code. When a reviewer suggests a change to URL/path logic, cryptographic code, or anything where edge cases matter — test the edge case yourself before applying it. The suggestion can be right in principle and wrong in your specific context.
Bug #2: Mocks Can't Tell You What Real APIs Do
Our Asana sync test suite had great coverage. Every code path was exercised. CI hit 90.67%. Every mock returned clean, single-page responses.
The real Asana backlog had 100+ tasks. Pagination fired on the very first sync. The paginated URL was broken (see above). Every unit test lied by omission.
This is the fundamental tension in test-driven development against external APIs: your mocks are only as accurate as your understanding of the API's behavior. We modeled the happy path. We didn't model the "what happens when there are 100+ items" path, because we didn't know the backlog was that large until we hit production.
The lesson: For any feature that calls an external API, unit tests with mocks are necessary but not sufficient. You need at least one live integration test — even a manual one — before calling it done. The question isn't "does the code do what I think it does?" — it's "does the API do what I think it does?"
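Even staying with mocks, the test fixture should at least exercise the pagination loop. A sketch of a two-page fake, assuming a response shape like Asana's documented `next_page` object (`offset`/`path`); `fetch_all`, `get_page`, and the fixture data are hypothetical, not the project's actual client:

```python
def fetch_all(get_page):
    """Follow next_page links until exhausted; get_page is injectable for tests."""
    tasks, path = [], "/tasks?limit=2"
    while path:
        page = get_page(path)
        tasks.extend(page["data"])
        nxt = page.get("next_page") or {}
        path = nxt.get("path", "")  # empty path ends the loop
    return tasks

# Two-page fake: the first response points at the second.
PAGES = {
    "/tasks?limit=2": {
        "data": [{"gid": "1"}, {"gid": "2"}],
        "next_page": {"offset": "abc", "path": "/tasks?limit=2&offset=abc"},
    },
    "/tasks?limit=2&offset=abc": {"data": [{"gid": "3"}], "next_page": None},
}

print([t["gid"] for t in fetch_all(PAGES.__getitem__)])  # ['1', '2', '3']
```

A single-page mock never enters the second loop iteration, which is precisely the code path that broke. This still doesn't replace a live check, but it makes "pagination fired" a tested state instead of an untested one.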
Bug #3: The Done Column Was Empty
After fixing pagination, the Done column showed 2 tickets. There should have been 80+.
The sync code had this logic:
if task.get("completed"):
    continue
Reasonable assumption: skip completed tasks. Except in Asana, completed=True and "in the Done section" are independent concepts. A task's completed flag is set when you check it off; moving it to the Done section just changes its kanban column. Most of our finished work had both properties, so the filter dropped nearly the entire Done column.
The fix was one line: only skip completed tasks in active sections.
if task.get("completed") and status != "done":
    continue
The lesson: Don't assume business logic maps cleanly to API semantics. "Completed" in the Asana data model is not the same as "finished" in your product's data model. When you're integrating a third-party system, read the API docs for the fields you're filtering on, not just the fields you're displaying.
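The corrected filter logic can be isolated into a small predicate and pinned down with the three cases that matter. This is an illustrative helper, not the project's code, and the status string `"done"` is an assumption about the internal column naming:

```python
def should_sync(task: dict, status: str) -> bool:
    """Keep a task unless it is completed AND sits outside the Done column."""
    return not (task.get("completed") and status != "done")

# Finished work in the Done column stays visible on the board:
assert should_sync({"completed": True}, "done")
# Checked-off stragglers in active columns are still skipped:
assert not should_sync({"completed": True}, "in_progress")
# Open work always syncs:
assert should_sync({"completed": False}, "in_progress")
```

Encoding the semantic distinction ("completed flag" vs. "Done column") in a named predicate also makes the assumption reviewable, instead of burying it in an inline `continue`.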
Bug #4: Docker Cached the Old World
After we wrote the initial 88 projects directly to the JSON data file, the API kept returning the old 7 projects. The file had the new data. The container had the new file. But the backend serves from an in-memory cache that loads at startup.
External file write + running container = no effect until restart.
This one's obvious in retrospect. But in a day of fast shipping, the mental model was "file is updated, system is updated." That's only true for stateless processes. Our storage service is not stateless.
The lesson: Know your data path. If your service caches on load, external writes are invisible until restart. Either go through the API, or restart — but don't assume file writes propagate to running processes.
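The failure mode is reproducible in a toy version of a load-at-startup cache. This is a minimal sketch, not the real storage service; the class and file names are invented for illustration:

```python
import json
import pathlib
import tempfile

class ProjectStore:
    """Sketch of a service that loads its data file once, at startup."""

    def __init__(self, path):
        self.path = pathlib.Path(path)
        self._cache = json.loads(self.path.read_text())  # read exactly once

    def projects(self):
        return self._cache  # served from memory; later file writes are invisible

    def reload(self):
        # The "restart the container" equivalent.
        self._cache = json.loads(self.path.read_text())

data_file = pathlib.Path(tempfile.mkdtemp()) / "data.json"
data_file.write_text(json.dumps(["p1"]))
store = ProjectStore(data_file)

data_file.write_text(json.dumps(["p1", "p2"]))  # external write behind the cache's back
print(len(store.projects()))  # 1 -- the running process still serves the old world
store.reload()
print(len(store.projects()))  # 2 -- only after the reload/restart
```

The file and the process are two separate sources of truth; "the file is updated" only implies "the system is updated" when the process rereads the file.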
The Root Cause Behind All Four
Every one of these bugs shared a single failure mode: we validated the implementation, not the behavior.
- The URL test validated that the code called urljoin. It didn't validate that the generated URL actually worked.
- The pagination tests validated that the code handled a next_page field. They didn't validate against a real paginated response.
- The completed-task filter validated that completed=True tasks were skipped. It didn't validate what "done" means in Asana's model.
- The Docker write validated that the file had the right data. It didn't validate that the running system served it.
In every case, tests passed and the behavior was wrong. That's the gap that only E2E validation closes.
What I'm Changing
1. E2E smoke test before writing the PR description.
The PR is the promise. If you write it before testing against the real system, you're promising something you haven't verified. The smoke test is: hit the actual endpoint, with the actual external service, and confirm the behavior you just described. This is not optional.
2. Treat AI reviewer suggestions on URL/path/crypto/parsing logic as hypotheses.
CodeRabbit's urljoin suggestion wasn't wrong in general. It was wrong here. The rule I'm adding: for any suggestion that touches URL construction, string parsing, encoding/decoding, or date/time arithmetic — test the specific case before applying it. These are the domains where "this pattern is usually correct" meets "your specific inputs break it."
3. Read the API docs for every field you filter on.
Not just the fields you display. The filter logic is where semantic mismatches hide. completed, active, archived, published — these words mean different things in different systems.
4. One live integration test per external API surface.
Can be manual. Can be a script. Can be a throwaway curl command. But before merging any feature that touches an external API, run it against the real thing. Mocks are for speed in CI, not for validating API contracts.
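One low-friction way to keep such a test in the repo without slowing CI is an environment-gated test that only runs when a real token is present. A sketch using only the standard library; the endpoint is Asana's real `/users/me`, but the test name and gating variable are choices, not conventions:

```python
import json
import os
import unittest
import urllib.request

@unittest.skipUnless(os.getenv("ASANA_TOKEN"), "live test: set ASANA_TOKEN to run")
class LiveAsanaContract(unittest.TestCase):
    """Runs against the real API only when a token is provided; CI skips it."""

    def test_me_endpoint_shape(self):
        req = urllib.request.Request(
            "https://app.asana.com/api/1.0/users/me",
            headers={"Authorization": f"Bearer {os.environ['ASANA_TOKEN']}"},
        )
        body = json.loads(urllib.request.urlopen(req).read())
        self.assertIn("data", body)  # Asana wraps payloads in a "data" envelope
```

Mock-only CI stays fast and deterministic; a human (or a nightly job with credentials) can validate the real contract on demand.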
On the Value of Agentic Coding
None of this is an argument against AI-assisted development. We shipped 3 PRs, 934 lines changed, 90%+ test coverage, slide-over edit mode across 3 different pages, and a full Asana sync — in a single day with one engineer. That's not possible without the tooling.
But AI coding assistance changes where the leverage is, not what the leverage is. The AI is exceptional at implementation speed, code consistency, test scaffolding, and catching common patterns. It is not a substitute for system-level thinking — understanding how your data moves, how caches work, how third-party APIs behave at scale.
The bugs that slipped through weren't coding errors. They were reasoning errors about system behavior. No amount of AI review catches those until you run the actual system.
The discipline that would have caught all four bugs in one step: deploy, click the button, verify the outcome before calling it done.
Simple. Unglamorous. Non-negotiable.
Running agentic coding workflows at scale? I write about what actually works — and what doesn't — at jeremyknox.ai.