Stop Reviewing AI Code. Just Not Yet.

There's a day coming when reading AI-generated code will be as pointless as reading Java bytecode. Today is not that day.

You should stop reviewing AI-generated code. I believe that. At some point, going line by line through AI output will be as useful as opening a .class file to check if javac made a mistake. The code will be an intermediate artifact that nobody looks at, and that will be fine.

But not yet.

I've spent the last few weeks writing about what you should be doing with AI coding tools right now. Measure your productivity instead of trusting your gut. Don't unleash AI agents on open source maintainers. Review the code. Scope your usage. Don't vibe code blindly.

The data supports all of that. The METR study, the CodeScene research, the DORA report, and the Veracode security numbers all say the same thing: in February 2026, these tools are good enough to feel productive and bad enough to create real damage if you stop paying attention.

So why am I telling you to stop? Because everything I'm telling you to do today will eventually be wrong. And I think we should be honest about that.

The Bytecode Analogy

When Java came out, you wrote Java source code. The compiler turned it into bytecode. The JVM ran the bytecode.

Nobody reviews bytecode. Nobody debugs bytecode. Nobody opens a .class file to check if javac did its job correctly. The compiler is trusted to produce correct output from source. You fix bugs in the source code, not the compiled output.

This is where software development is heading. Not today. Not next year. But on the trajectory we're on, code is becoming an intermediate artifact. The thing you write and review and version-control won't be Python or TypeScript. It'll be natural language. Specs, intent descriptions, prompts, whatever we end up calling them. The AI will compile those into code the same way javac compiles .java into .class files.

Andrej Karpathy has been saying this for years. He called English "the hottest new programming language" back in 2023 and framed the evolution as Software 3.0: Software 1.0 was hand-written code, Software 2.0 was neural network weights learned from data, and Software 3.0 is natural language prompts that instruct LLMs to produce both. The prompt is the source code.

I think his framing is directionally right, even if the timeline is uncertain.

And it's not just theory. In January, I was at OpenAI's HQ in San Francisco where they demoed an internal project, codenamed Hermes at the time, that did exactly this. You described what you wanted in natural language. The system figured out what tools, MCP servers, and agents it needed, how to route between them, and built the entire workflow for you. It could even eval its own output and optimize itself. No code. No configuration files. Just intent in, working system out.

They launched it publicly on February 5 as OpenAI Frontier. OpenAI positioned it as an enterprise platform for building and managing AI agents, but what I saw in that room was the bytecode compiler I'm describing in this post. The intermediate steps, the agent definitions, the tool configurations, the routing logic: none of that is something a human needs to write or review. You describe the outcome. The platform figures out how to get there.

It's early. Frontier is aimed at enterprise workflows, not software development. But the pattern is the same one that will eventually hit code: natural language becomes the source of truth, and everything between intent and execution becomes an implementation detail you don't need to inspect.

Why We're Not There Yet

Here's the problem: Java bytecode works because javac is deterministic. You feed it the same source, you get the same bytecode. The compilation process is mathematically well-defined. There's a language specification. There's a JVM specification, and the JVM verifies every class file it loads. When javac produces bytecode, you can trust it because the transformation is predictable and checked.

LLMs are not deterministic. You feed the same prompt twice and you might get different code. The "compilation" process is probabilistic. There's no specification and no verifier. The model might produce something brilliant or something that introduces an XSS vulnerability, and the only way to tell is to review the output.
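The contrast is easy to demonstrate. Here's a minimal Python sketch, using Python's own compiler as a stand-in for javac and a random sampler as a stand-in for an LLM's decoding step (the variant list is obviously a toy, not a real model):

```python
import hashlib
import random

source = "def add(a, b):\n    return a + b\n"

# A compiler is deterministic: the same source always yields the same bytecode.
def compile_hash(src: str) -> str:
    code = compile(src, "<input>", "exec")
    return hashlib.sha256(code.co_code).hexdigest()

assert compile_hash(source) == compile_hash(source)  # holds every time

# An LLM "compiler" is probabilistic: the same prompt can yield different code.
def generate(prompt: str) -> str:
    # Stand-in for temperature sampling over next-token probabilities.
    variants = ["return a + b", "return b + a", "return sum([a, b])"]
    return random.choice(variants)

# Two runs on the same prompt are not guaranteed to match.
print(generate("add two numbers"), "|", generate("add two numbers"))
```

All three variants happen to be correct here. In practice, one of them isn't, and the only way to find out which is to look.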

That's why I keep saying review the code right now. In 2026, the AI is a junior developer who sometimes writes great code and sometimes introduces subtle bugs that you won't catch for months. You wouldn't let a junior dev push to production without review. Don't let the AI do it either.

The METR study found experienced developers were 19% slower with AI tools. Veracode found 45% of AI-generated code had OWASP vulnerabilities. The 2025 DORA report showed AI adoption negatively correlating with delivery stability. These numbers are not where you stop reviewing output.

But they're also not where they'll stay forever.

The Inflection Point

Somewhere between here and AGI, there's an inflection point. The moment the AI's error rate drops below a human developer's error rate, the calculus flips. At that point, reviewing AI-generated code line by line becomes like reviewing bytecode. You're adding overhead without improving quality. You're the bottleneck.

We're not close to that inflection point. But we're closer than we were a year ago. And the rate of improvement is accelerating.

Think about what happens as we move along that curve. Every practice I'm recommending today gradually becomes less necessary:

Reviewing AI output line by line makes sense when the error rate is high. When it drops to the level of a senior developer, you shift to spot checks. When it drops below human error rates, you stop reviewing code and start reviewing specs.

Scoping AI to small testable units makes sense when the models can't hold architectural context. When they can reason about entire systems reliably, that constraint becomes artificial.

SDD frameworks that generate detailed specs before coding make sense when the AI needs that structure to produce decent output. When the AI can go from a paragraph of intent to a working system, those frameworks become overhead. They could go from best practice to bottleneck in a single model generation.

What Stays the Same

Even if code becomes bytecode, some things don't change.

Someone still needs to define what to build. "Make a payment system" isn't a spec any more than // TODO: implement payments is code. The skill shifts from implementation to articulation. Describing what you want, precisely enough that the AI builds the right thing, is its own discipline. Karpathy calls it prompt engineering. I think it's closer to product thinking. Understanding the problem well enough to describe it completely is the hard part, and that's always been the hard part.

Someone still needs to verify the output works. Even if you're not reading code line by line, you need integration tests, security scanning, performance benchmarks, user acceptance testing. The verification layer doesn't disappear. It moves up the stack. You stop asking "is this code correct?" and start asking "does this system do what I specified?"
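In test form, that shift looks something like this. A sketch with a hypothetical `process_payment` function standing in for an AI-generated payment system; the names and behavior are illustrative, not a real API:

```python
# Spec-level verification: assert on the behavior the spec promised,
# not on how the (possibly AI-generated) implementation achieves it.

def process_payment(amount_cents: int, currency: str) -> dict:
    # Hypothetical AI-generated implementation; treat it as a black box.
    if amount_cents <= 0:
        return {"status": "rejected", "reason": "non-positive amount"}
    return {"status": "settled", "amount_cents": amount_cents, "currency": currency}

def test_rejects_non_positive_amounts():
    assert process_payment(0, "USD")["status"] == "rejected"

def test_settles_valid_payment_in_full():
    result = process_payment(1999, "EUR")
    assert result["status"] == "settled"
    assert result["amount_cents"] == 1999  # no silent fee, no rounding

test_rejects_non_positive_amounts()
test_settles_valid_payment_in_full()
```

Notice that neither test reads a line of the implementation. They hold whether the body was written by a person, a model, or regenerated from scratch five minutes ago.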

Someone still needs to understand the system when it breaks. When your Java application throws a NullPointerException, you don't debug the bytecode. You look at the source. When your AI-generated system has a production incident, you'll look at the spec, the intent, the natural language that described what the system should do. That becomes your debugging surface.

Architecture still matters. How services communicate, how data flows, where state lives, what the failure modes are. AI might generate the implementation, but the architectural decisions that shape what gets generated are still human decisions. At least for now.

The Roles Change

This is the part that makes people uncomfortable, and I get it.

If code becomes bytecode, the developer role changes. You stop being the person who writes the code and start being the person who defines the intent and verifies the output. That's a different skill set. Some developers will thrive in that world. Some won't.

In my SDD article, I argued that SDD tools are solving the wrong problem because they assume the developer is also the product owner. Most developers aren't. Product requirements come from a product organization.

But zoom forward a few years. If the implementation step collapses from weeks to minutes, the gap between "define what to build" and "have a working system" shrinks dramatically. The value of the role that used to bridge that gap, the developer who translates specs into code, decreases. The value of the roles on either end, defining what to build and verifying it works, increases.

Product owners, architects, QA engineers. Those roles don't go away. They might be the ones that matter more. The developer role doesn't disappear either, but it evolves into something closer to a systems engineer who thinks in terms of constraints, interfaces, and verification rather than implementations.

This evolution won't happen overnight. It tracks the acceleration toward AGI. As the models get better, the roles shift incrementally. The change is continuous, not binary.

The Uncomfortable Middle

We're in the worst part of this transition right now.

The tools are good enough that people think they can skip the hard parts. Vibe code a feature, don't review it, ship it. The METR data says you'll be slower. The Veracode data says you'll be less secure. The DORA data says your delivery stability will suffer. But people do it anyway because it feels fast.

At the same time, the tools are improving quickly enough that any practice you cement into your workflow today might be outdated in eighteen months. Build too many guardrails and you'll be the team that's still doing manual bytecode review while everyone else has moved on.

So how do you build practices that are right for today without calcifying them into dogma?

I think the answer is: tie your practices to the data, not to the tools.

Don't say "we always review AI-generated code line by line." Say "we review AI-generated code until our defect rate from AI contributions drops below our defect rate from human contributions." One is a policy. The other is a metric-driven threshold that automatically adjusts as the tools improve.

Don't say "we never let AI write entire features." Say "we scope AI usage to the level where our integration test pass rate stays above X%." That threshold naturally expands as the models get more capable.

Don't say "we use SDD framework X for all new projects." Say "we use structured specs when the AI's output quality improves with structured input." If the next model generation can go from a paragraph to a working system without a 50-page PRD, your framework just became deadweight.
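Encoded as a check rather than a rule, the idea looks roughly like this. The metrics, rates, and the 95% threshold are placeholders for whatever your team actually measures, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class TeamMetrics:
    ai_defect_rate: float         # defects per 1k lines from AI contributions
    human_defect_rate: float      # defects per 1k lines from human contributions
    integration_pass_rate: float  # fraction of integration tests passing

def review_policy(m: TeamMetrics) -> dict:
    """Derive today's practices from today's data, not from fixed rules."""
    return {
        # Review AI code line by line only while it's buggier than human code.
        "line_by_line_review": m.ai_defect_rate >= m.human_defect_rate,
        # Expand AI scope only while integration tests hold the line.
        "expand_ai_scope": m.integration_pass_rate > 0.95,
    }

# Plausible February 2026 numbers: keep reviewing, keep scope tight.
today = TeamMetrics(ai_defect_rate=4.5, human_defect_rate=2.1,
                    integration_pass_rate=0.91)
print(review_policy(today))  # both gates say: don't relax yet
```

The point isn't the code; it's that the policy re-evaluates itself every time the inputs change. When the model generation flips one of those comparisons, your practices flip with it, and nobody has to win an argument first.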

What I'm Doing Today

Right now, in February 2026, I review AI-generated code. I scope my usage to testable units. I measure my team's actual delivery metrics instead of trusting how productive I feel. I use AI for investigation and proposal, and I make the final calls myself.

I'm also watching the data closely. When the error rates change, my practices will change with them. I'm not going to be the person clinging to manual code review when the models are producing better code than I can. That day isn't today. But I want to see it coming.

The inflection point will arrive at different times for different tasks. Boilerplate and scaffolding are almost there now. Complex architectural work is years away. Security-sensitive code might be the last frontier. The transition won't be uniform, and the people who navigate it well will be the ones who can read the data and adjust accordingly rather than committing to one extreme.

So yes. Stop reviewing AI code. You should. The day is coming when that's the right call, and clinging to line-by-line review past that point will make you the bottleneck, not the safeguard.

Just not yet. The data says not yet. And when the data changes, I'll be the first one to tell you.

When will you stop reviewing AI-generated code? I'm tracking the metrics that would tell me when to stop, and I'm not there yet. Are you?