Code Review Is the Wrong Shape for Agentic Development
A year ago, I was mostly building alone. Me and Claude Code, cranking through features, committing when things felt right. The process was simple because the only mental model that needed updating was mine.
Now I’m working with other developers. And the thing I keep bumping into is the process around agentic development.
In the old days (circa 2024), you’d do a body of work, send it to someone else for code review, go back and forth, fix things, merge, move on. That workflow made sense when a batch of changes was a few hundred lines — maybe a thousand on a big day. Humans could read and understand the changes while leaving meaningful feedback.
Now, with agentic engineering, one individual can generate 50,000 lines of code across dozens of files in a few hours. No human can review that. Not meaningfully. You can skim it, you can spot-check, you can run the tests. But the idea that another person is going to read through 50K lines and build a mental model of what changed? That’s fiction. They might be able to do it once, spending a whole day really understanding it. But not two or three times a day. Multiply that by even a small team of two or three people working on the same codebase at once, and the complexity compounds.
So the obvious move people have gravitated toward is: keep the same review process, but have agents do the reviewing. Run an AI code review, flag issues, iterate. This works, kind of. Agents find real things and fix them. It feels good because it carries the illusion of a familiar process, one that has reliably delivered good results in the past.
That is, until you run the review again.
Most of the time, a second review reveals a whole new pile of problems, often more alarming than the first batch. So you fix those. Run it again. More problems. It’s turtles all the way down, and at no point do you have a confident sense that you’ve hit bottom.
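The loop above can be made concrete. Here is a minimal sketch of what "run review passes until one comes back clean" looks like; `run_review` is a hypothetical stand-in for an agentic review step (nothing here is an existing tool), and the per-pass counts are what tell you whether the process is converging or just churning:

```python
def review_until_stable(run_review, max_passes=10):
    """Run review passes until one comes back clean or we hit the cap.

    run_review() is a placeholder for an agentic review step; assume
    fixes are applied between passes. Returns the issue count from each
    pass so you can see whether the process actually bottoms out.
    """
    counts = []
    for _ in range(max_passes):
        issues = run_review()
        counts.append(len(issues))
        if not issues:  # a clean pass is the only stopping signal we have
            break
    return counts


# Simulated reviewer: each pass surfaces fewer issues than the last.
queue = [["null check", "race condition"], ["error handling"], []]
passes = review_until_stable(lambda: queue.pop(0))
print(passes)  # [2, 1, 0]
```

In practice the counts rarely decline this smoothly, which is exactly the problem: without a monotonic trend, "stop when a pass finds nothing" is the only honest termination condition, and you never know how many passes away it is.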
(I’ve started running the same pattern on requirements documents, not just code. Same result. Dozens of review passes before things settle down. The problem isn’t specific to code. I believe it’s a fundamental challenge of verifying AI output at any scale.)
Even with agents, I’m still responsible for every line they write. I own their failures. If a bug ships, it is still 100% my responsibility. I’ve run engineering teams for many years, and I’ve developed a gut sense of where to poke and when something just smells off. I’m trying to develop that same intuition for AI-generated output, but the feedback loop is brutal: every time I think I’ve dialed in my instincts, another pass proves I still have plenty to learn.
The deeper issue is that code review was never just about being a quality gate. It was also a natural context checkpoint.
When someone reviews your work, they’re not just looking for bugs. They’re updating their mental model of the project. They’re learning what you built, how you built it, what tradeoffs you made. The review process was how teams maintained shared understanding. It was slow and it was doing two jobs at once.
Agentic development blows up both of them simultaneously. The quality gate breaks because humans can’t review at the volume agents produce. And the context checkpoint breaks because even if you could read 50K lines, the speed at which they’re generated means your teammates are perpetually behind.
I don’t think the answer is grafting the old model onto the new tools. Code review was designed for a world where humans produced code at human speed. We’re not in that world anymore, and pretending we are just creates a bottleneck that negates the whole point of agentic development.
I think the answer is something closer to a continuous improvement loop. Not checkpoint-based review, but ongoing review that runs alongside development. Code review used to be a gate, but I think artificially gating agents is a mistake. New code needs to merge quickly so every agent thread can load that context and spot conflicts with other work in progress.
That’s great for the robots, but not so much for me. What I think I need is something that produces human-digestible output about what’s being built, how it works, where the risks are. Not a diff. Not a 50K-line code review. Something that actually gives me true visibility into what’s happening without requiring me to read everything.
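One shape that output could take, purely as a sketch: a structured digest that agents emit after each merge, sized for a human to actually read. Every name and field below is a hypothetical illustration of what such a digest might carry, not an existing tool or format:

```python
from dataclasses import dataclass, field


@dataclass
class ChangeDigest:
    """A human-sized summary of a batch of agent-generated changes.

    The fields are assumptions about what a reviewer-of-record needs:
    not a diff, but what changed, how it works, and where to poke first.
    """
    what_changed: str               # one paragraph, not 50K lines
    how_it_works: str               # key design decisions and data flow
    risks: list = field(default_factory=list)  # ranked places to spot-check

    def render(self) -> str:
        lines = ["WHAT CHANGED", self.what_changed, "",
                 "HOW IT WORKS", self.how_it_works]
        if self.risks:
            lines += ["", "RISKS"] + [f"- {r}" for r in self.risks]
        return "\n".join(lines)


digest = ChangeDigest(
    what_changed="Added retry logic to the payment client.",
    how_it_works="Wraps each request in an exponential-backoff loop.",
    risks=["retry storms under sustained outage", "idempotency of POSTs"],
)
print(digest.render())
```

The point of the structure is that the risks list gives me exactly what my gut sense used to: a short, prioritized map of where to poke, generated by the same agents that produced the code.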
If done right, this should follow the Bitter Lesson by getting better as AI gets more capable. A process that depends on humans reading every line is fighting the trajectory, while a process that depends on AI helping humans understand what AI built will scale with the models.
I’ve looked at what various people are doing in this space and nothing great has jumped out at me yet. If you’re working on this or have seen good approaches, I’d genuinely like to hear about them. I think this is going to be an area I tinker with a bit on a few projects.