Vibe Coding Works Until It Doesn't
The Data Says You're Slower. You Just Don't Believe It.
I use AI coding tools every day. Claude Code has access to my Kubernetes cluster. I've given it SSH access to debug hardware driver issues on machines I've never touched before. I'm not anti-AI. I'm probably more bought-in than most of you reading this.
But we need to talk about vibe coding, because the data coming in is ugly and the industry is pretending it doesn't exist.
The METR Study Nobody Wants to Hear About
METR, a Berkeley nonprofit that evaluates frontier AI model capabilities, ran a randomized controlled trial earlier this year. Not a survey. Not vibes. A proper RCT with screen recordings and time tracking.
They took 16 experienced open-source developers. These aren't juniors fumbling through their first codebase. They had an average of 5 years of experience on their assigned projects, working in mature repositories averaging 22,000+ stars and over a million lines of code. Real developers doing real work on real codebases.
Each developer got a set of issues from their own repos. The issues were randomly assigned: some they could use AI on, some they couldn't. 246 tasks total, averaging about two hours each. When AI was allowed, developers used whatever they wanted, mostly Cursor Pro with Claude 3.5 and 3.7 Sonnet.
The result: developers using AI tools were 19% slower.
Not faster. Slower. By nearly a fifth.
The Perception Gap Is the Scary Part
Here's what should bother you. Before starting, developers predicted AI would make them 24% faster. After finishing all 246 tasks, with weeks of experience, they still believed AI had made them 20% faster.
They weren't faster. They were 19% slower. That's a nearly 40-percentage-point gap between what developers believed and what actually happened.
This isn't a rounding error. This is experienced developers being confidently wrong about their own productivity for weeks while recording their screens. The data was right there on video and they still couldn't feel it.
METR found that developers accepted less than 44% of AI-generated suggestions. Think about what that means in practice. You prompt the AI, wait for a response, read the output, think about whether it's right, decide it's wrong, throw it away, and write the code yourself. Or worse: you accept it, realize three minutes later it broke something, and spend ten minutes understanding what the AI did before fixing it.
When the AI generates code instantly, it feels like progress. Fast output creates an illusion of speed. You feel like you're in the flow. But you're not shipping faster. You're reviewing faster, which is not the same thing.
CodeScene Quantified the Damage
If the METR study is about speed, the CodeScene research is about what happens to the code itself.
CodeScene published a peer-reviewed study called "Code for Machines, Not Just Humans" that analyzed what AI assistants do to code quality at scale. The headline finding: healthy codebases see up to 30% less defect risk from AI-generated code compared to unhealthy ones.
Read that the other way around: if your codebase is unhealthy, AI-generated contributions carry significantly higher defect risk. And the study only included codebases above a certain health threshold. Most legacy codebases, the kind that most of us actually work in, weren't even measured because the baselines were too noisy. The real gap is probably worse.
Here's the mechanism: AI tools can't distinguish between code that works and code that's maintainable. They'll generate something that passes your tests today and creates three bugs next quarter. AI doesn't understand technical debt. It doesn't know that the function it just extended was already a mess that the team had been avoiding for months.
In healthy codebases, AI acceleration works fine. The code is clean, the patterns are consistent, the AI can follow the conventions because there are conventions to follow. But most codebases aren't healthy. Most of us are working in systems that have been growing for years with inconsistent patterns and undocumented assumptions stacked on top of workarounds nobody remembers writing.
AI amplifies whatever is already there. Clean code gets cleaner faster. Messy code gets messier faster.
The Security Numbers Are Worse
Veracode tested over 100 large language models across Java, Python, C#, and JavaScript. In 45% of samples, the AI-generated code introduced OWASP Top 10 vulnerabilities. Java was the worst, with a 72% security failure rate.
That is not a typo. Nearly half the code these models generated had known security vulnerabilities baked in.
It gets worse. Analysis of AI co-authored pull requests found they were 2.74x more likely to add XSS vulnerabilities, 1.91x more likely to introduce insecure object references, and 1.88x more likely to include improper password handling.
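To make those multipliers concrete, here's the shape of the most common failure, reflected XSS, sketched in Python: user input interpolated straight into markup, next to the escaped version. Both functions are hypothetical illustrations written for this post, not code from any of the cited studies.

```python
import html

def render_greeting_unsafe(username: str) -> str:
    # The pattern assistants frequently generate: user-controlled
    # input interpolated directly into HTML. Reflected XSS.
    return f"<p>Hello, {username}!</p>"

def render_greeting_safe(username: str) -> str:
    # Same output for benign input, but attacker-controlled
    # characters are entity-encoded before reaching the page.
    return f"<p>Hello, {html.escape(username)}!</p>"

payload = "<script>alert(1)</script>"
print(render_greeting_unsafe(payload))  # script tag survives intact
print(render_greeting_safe(payload))    # rendered inert as text
```

The unsafe version passes any test that only checks benign input, which is exactly why this class of bug survives a quick review.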
Microsoft has said that 30% of code in some of their repos is now AI-generated. If the defect rates hold, we're looking at a wave of CVEs in 2026 and 2027 from code that was vibe-coded into production last year.
The DORA Report Confirms It at Scale
The 2025 DORA report, based on survey responses from nearly 5,000 technology professionals, found that AI adoption has a negative relationship with software delivery stability. Not neutral. Negative.
Their finding is direct: AI accelerates code production, but that acceleration exposes weaknesses downstream. Without strong automated testing, mature version control, and fast feedback loops, more code volume just means more instability.
DORA's framing is useful: AI is an amplifier, not a fixer. Strong teams get stronger. Weak teams get worse, faster. If your organization has poor security practices and adopts AI tools aggressively, you'll generate more vulnerabilities at a higher velocity. You're not improving. You're scaling your dysfunction.
Where Vibe Coding Actually Works
I'm not writing this to say AI tools are useless. I use them constantly and I'm not stopping.
But I've started paying attention to when they actually help versus when I'm just generating work for myself. There's a line, and it's more specific than most people admit.
The sweet spot is small, testable units of work. If you can write a unit test to validate the output, the scope is probably small enough to vibe. Generate a utility function, a data transformation, a well-defined component with clear inputs and outputs. Prompt, review, test, ship. That workflow is real and it saves time.
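As a hypothetical illustration of the scale that works: a transformation small enough that one unit test fully pins down its behavior. The function and its test are invented for this example.

```python
def group_by_status(orders: list[dict]) -> dict[str, list[str]]:
    """Bucket order IDs by their status field."""
    buckets: dict[str, list[str]] = {}
    for order in orders:
        buckets.setdefault(order["status"], []).append(order["id"])
    return buckets

# The test is the point: if the AI's version passes, ship it;
# if it doesn't, throw it away and write it yourself.
def test_group_by_status():
    orders = [
        {"id": "a1", "status": "shipped"},
        {"id": "b2", "status": "pending"},
        {"id": "c3", "status": "shipped"},
    ]
    assert group_by_status(orders) == {
        "shipped": ["a1", "c3"],
        "pending": ["b2"],
    }

test_group_by_status()
```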
The METR study's caveats matter here too. Its developers averaged five years in their repos. They already knew where everything was, how things connected, what the implicit assumptions were. AI couldn't tell them anything they didn't already know.

But if you're onboarding to a new project? AI is legitimately helpful for navigation and getting oriented. Same for boilerplate. Need a new API endpoint that follows the same pattern as the last twelve you wrote? AI nails that. It's pattern matching, and pattern matching is what these models are good at.
Where it falls apart is anything without a template to follow. Novel security implementations, new architectural patterns, systems where the requirements are still being figured out. This is where the 45% vulnerability rate comes from. The AI doesn't understand your security model. It's pattern-matching against training data and hoping for the best.
It also falls apart on unhealthy codebases. That's the CodeScene finding. If you're adding AI-generated code to a system that's already drowning in technical debt, you're taking on significantly more defect risk than a team with a clean codebase. Fix the code first.
And it definitely falls apart at scale. Y Combinator reported that 25% of their Winter 2025 batch had codebases that were 95% AI-generated. Some of those shipped fast and made money. But ask those teams about their maintenance burden in six months. Disposable prototypes are fine. Production systems need engineering judgment that current models don't have.
Why We Believe the Hype Anyway
The perception gap from the METR study is the most important finding in AI developer tooling this year. Developers think they're faster. They're not. And they can't tell even when they have the data.
This is a measurement problem disguised as a productivity problem. Most teams are adopting AI tools based on developer sentiment. "Do you feel more productive?" is the question, and the answer is always yes. Nobody is running controlled experiments. Nobody is comparing cycle times before and after adoption. The METR study is one of the first to actually do this, and the results contradicted everyone's expectations, including the researchers'.
When your metrics are based on feelings, you'll always believe you're winning.
What You Should Actually Do
Measure before you declare victory. If your team adopted AI tools, compare your actual cycle times and defect rates to the period before adoption. Not surveys. Not sentiment. Actual delivery metrics, the DORA four: deployment frequency, lead time for changes, change failure rate, time to restore service. If those haven't improved, your AI tools aren't helping, regardless of how your developers feel about them.
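A minimal sketch of what that before-and-after comparison actually requires, computed from deployment records. All field names, dates, and the record structure are invented for illustration; plug in whatever your CI system exports.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records over a 28-day window.
deploys = [
    {"committed": datetime(2026, 1, 5, 9),
     "deployed": datetime(2026, 1, 5, 15), "failed": False},
    {"committed": datetime(2026, 1, 12, 10),
     "deployed": datetime(2026, 1, 13, 11), "failed": True,
     "restored": datetime(2026, 1, 13, 13)},
    {"committed": datetime(2026, 1, 20, 8),
     "deployed": datetime(2026, 1, 20, 12), "failed": False},
]

window_days = 28
# Deployment frequency: deploys per day over the window.
deploy_frequency = len(deploys) / window_days
# Lead time: commit to production, median across deploys.
lead_time = median(d["deployed"] - d["committed"] for d in deploys)
# Change failure rate: share of deploys that caused a failure.
failures = [d for d in deploys if d["failed"]]
change_failure_rate = len(failures) / len(deploys)
# Time to restore: failure to recovery, median across failures.
time_to_restore = median(d["restored"] - d["deployed"] for d in failures)
```

Run this once on pre-adoption data and once on post-adoption data. If the two snapshots look the same, the sentiment survey is telling you about feelings, not delivery.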
Scope your AI usage. Use AI for well-defined, testable units of work. Stop vibing entire features into existence. The models are good at filling in well-scoped blanks. They're bad at architecture, security, and understanding the implicit context of your system.
Fix your codebase first. If CodeScene's research tells us anything, it's that healthy code gets up to 30% less defect risk from AI than unhealthy code. Investing in code quality before investing in AI tooling will get you better results than the other way around.
Review AI output like you'd review an intern's first PR. Read it line by line. Understand what it did and why. If you're accepting AI suggestions without understanding them, you're accumulating debt you can't see until it breaks.
Stop trusting your gut on this one. The entire point of the METR study is that your intuition about AI productivity is wrong. Not slightly wrong. Nearly 40 percentage points wrong. If you feel faster, prove it with data before you reorganize your team around that assumption.
This Article Has a Shelf Life
The 19% slowdown in the METR study was measured with early-2025 frontier models, and we're already past that. The tools will get better. The defect rates will probably come down. I'll be the first to update this post when the data changes.
But right now, in February 2026, we're in an awkward middle period where the tools are good enough to feel productive and bad enough to create real damage at scale. The security vulnerabilities are real. The defect rates are measured. The perception gap is documented. These aren't theoretical problems. They're showing up in production systems today.
Use AI tools. I do. But stop pretending the data doesn't exist. Measure your outcomes, scope your usage, and be honest about where these tools actually help versus where they just feel like they help.
Feeling fast and being fast are very different things.
Are you measuring the actual impact of AI tools on your team's delivery metrics? I'd love to see data from teams that have done before-and-after comparisons rather than sentiment surveys.