Nobody said the code would look bad. That was never the problem.
A new study from New Relic, conducted by Hanover Research across 200 U.S. technology leaders at the manager level and above, puts hard numbers on something that QA professionals have been sensing for a while. The gap between how AI-generated code performs at review time and how it behaves in production is not a fluke or an edge case. It is a pattern, and it is now well-documented.
The headline contradiction from the report: 94% of technology leaders rate AI-generated code as higher quality than human-authored code. Those same organizations are reporting more production incidents (78%), more senior-engineer firefighting (86%), more rework (74% say at least a quarter of AI code needs significant rework), and more production failures (82% in the last six months).
That is the same code, the same organizations, and two completely different pictures of its quality depending on when you are looking at it.
The Most Useful Data Point in the Report
The New Relic study offers a clean explanation for the contradiction, and it is worth sitting with.
Reviewers see code that reads well: consistent style, clean patterns, fewer obvious surface bugs than most humans produce. That perception is genuine. The code does look good at the moment someone is staring at it in a pull request.
What the AI coding tool did not have access to when it generated that code is how the system actually behaves in production. It can read the source. It cannot read the trace. So it makes assumptions about edge cases, about how services interact, about what data will actually look like at runtime, and some of those assumptions are wrong in ways that only become visible under real load, against real dependencies, with real users on the other end.
The report describes this as AI working with the artifacts a human reviewer can see and being blind to the artifacts only production can produce.
That gap is not a product shortcoming that is going away next quarter. It is structural. And right now, it is landing on quality teams.
Vibe Coding Is Not a Sandbox Practice Anymore
One of the more significant findings in the report is how far vibe coding has moved into formal production policy. Only 5% of the organizations surveyed restrict it to non-production work. Zero ban it outright. The other 95% allow it in production services, with 87.5% having officially written it into their SDLC.
AI-generated code is hitting the same revenue-bearing services, the same customer-facing endpoints, and the same SLAs as code written by senior engineers.
The report notes that most organizations have authorized the practice without consistently leveling up the surrounding controls. The authorization outpaced the governance.
That is not a hypothetical risk. The study found that 62% of tech leaders report their teams often or always ship AI-generated code to production without line-by-line manual verification. The report frames this as the trust problem driving the production crisis: code that looks clean at review gets waved through, and the structural issues it contains do not show up until production finds them.
What's Actually Breaking
The failure modes are not clustered around one type of problem. That is what makes them difficult to plan for with a single control.
The top four production challenges from AI-generated code in the past six months, as reported by the organizations surveyed: integration failures (30%), compliance and governance issues (30%), data integrity problems (29%), and newly introduced security vulnerabilities (28%). Each one hit roughly three in ten organizations in a six-month window.
The report notes that AI-generated code introduces roughly 1.7 times more critical runtime issues than human-authored, peer-reviewed code. It is not breaking production in one signature way. It is breaking it in multiple small ways at once, spread across integration, compliance, data, and security, and most of the organizations in the study have a war story from the last two quarters.
Each failure mode also has a different observable signature, which means a single test pass or a single monitoring alert type is not going to catch all of them. Integration failures look different from data integrity anomalies, which look different from authentication pattern drift.
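As a rough illustration of why one check is not enough, the sketch below routes a handful of made-up production events through separate detectors, one per failure mode. The event fields (`kind`, `status`, `timed_out`, `expected_rows`, `actual_rows`, `auth_method`) and the allowed-auth list are assumptions invented for this example. They do not come from the New Relic report or any particular telemetry product.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]

# Hypothetical policy: auth methods this service is expected to use.
ALLOWED_AUTH_METHODS = {"oauth2", "mtls"}


def integration_failure(e: Event) -> bool:
    # A downstream call returned a server error or timed out.
    return e.get("kind") == "http_call" and (
        e.get("status", 200) >= 500 or bool(e.get("timed_out", False))
    )


def data_integrity_anomaly(e: Event) -> bool:
    # A write touched a different number of rows than the caller expected.
    return e.get("kind") == "db_write" and e.get("expected_rows") != e.get("actual_rows")


def auth_pattern_drift(e: Event) -> bool:
    # An auth event used a method outside the expected set.
    return e.get("kind") == "auth" and e.get("auth_method") not in ALLOWED_AUTH_METHODS


DETECTORS: Dict[str, Callable[[Event], bool]] = {
    "integration_failure": integration_failure,
    "data_integrity_anomaly": data_integrity_anomaly,
    "auth_pattern_drift": auth_pattern_drift,
}


def classify(events: List[Event]) -> Dict[str, List[Event]]:
    """Group events by the detector(s) that flag them."""
    hits: Dict[str, List[Event]] = {name: [] for name in DETECTORS}
    for e in events:
        for name, detect in DETECTORS.items():
            if detect(e):
                hits[name].append(e)
    return hits


# Made-up sample data for the sketch.
SAMPLE_EVENTS: List[Event] = [
    {"kind": "http_call", "status": 200},
    {"kind": "http_call", "status": 503},
    {"kind": "http_call", "status": 200, "timed_out": True},
    {"kind": "db_write", "expected_rows": 10, "actual_rows": 10},
    {"kind": "db_write", "expected_rows": 10, "actual_rows": 7},
    {"kind": "auth", "auth_method": "oauth2"},
    {"kind": "auth", "auth_method": "basic"},
]

if __name__ == "__main__":
    for name, matched in classify(SAMPLE_EVENTS).items():
        print(name, len(matched))
```

No single detector here sees more than its own slice of the events, which is the point: each failure mode needs a check that looks at a different field.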
Why This Is a Testing Problem
The New Relic report is aimed at observability professionals and frames most of its conclusions around monitoring and telemetry. That is fair given who commissioned it. But the production gap it describes is equally a testing problem.
AI coding tools generate code without visibility into runtime behavior. They can read documentation, infer patterns, and produce something that satisfies the stated requirement. What they cannot do is know how the code will behave under real-world conditions that were never described in the prompt.
That is testing's domain. The edge cases. The integration behavior across services that do not always respond the way the AI assumed. The data states that only occur in production. The security assumptions that hold in isolation and fail in context.
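One way to probe "services that do not always respond the way the AI assumed" is to swap the dependency for a fake that misbehaves on purpose. The sketch below is a minimal, assumption-laden example: `FakeInventoryClient`, `available_units`, and the response shapes are all hypothetical names made up for illustration, not part of any real service or library.

```python
from typing import Any, List, Optional, Tuple


class FakeInventoryClient:
    """Hypothetical stand-in for a downstream service; replays one canned response."""

    def __init__(self, response: Any) -> None:
        self._response = response

    def get_stock(self, sku: str) -> Any:
        # If the canned response is an exception, raise it like a real client would.
        if isinstance(self._response, Exception):
            raise self._response
        return self._response


def available_units(client: FakeInventoryClient, sku: str) -> Optional[int]:
    """Return units in stock, or None when the dependency misbehaves."""
    try:
        payload = client.get_stock(sku)
    except (TimeoutError, ConnectionError):
        return None
    if not isinstance(payload, dict):
        return None
    units = payload.get("units")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(units, int) and not isinstance(units, bool) and units >= 0:
        return units
    return None


# (case name, canned dependency response, expected result)
CASES: List[Tuple[str, Any, Optional[int]]] = [
    ("happy path", {"units": 3}, 3),
    ("timeout", TimeoutError("upstream timed out"), None),
    ("connection reset", ConnectionError("reset by peer"), None),
    ("empty body", {}, None),
    ("non-dict body", "Service Unavailable", None),
    ("negative count", {"units": -1}, None),
]


def run_cases() -> List[str]:
    """Return the names of cases whose result differs from the expectation."""
    failures = []
    for name, response, expected in CASES:
        got = available_units(FakeInventoryClient(response), "sku-1")
        if got != expected:
            failures.append(name)
    return failures


if __name__ == "__main__":
    print("failed cases:", run_cases())
```

Only the first case is the one a prompt would typically describe. The other five are the kind of behavior that tends to show up only against a real dependency.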
The organizations that the report identifies as managing the AI coding transition best are the ones treating production telemetry as a first-class input to the development loop, not just a debugging tool after the fact. The parallel for testing is treating the review of AI-generated code as a distinct discipline with its own failure patterns, not just a faster version of reviewing human code.
AI-generated code fails differently. It fails where the assumptions were invisible, not where the mess is visible. Test strategies built around human failure patterns will miss a lot of what AI-generated code gets wrong.
The Scale Has Changed the Math
Two-thirds of the organizations in the New Relic study report that AI now generates or significantly refactors between 51% and 75% of their weekly code output, and 82% say more than half of their weekly code output is AI-driven.
For context, the report notes that Google announced in April 2026 that 75% of their code is AI-generated. The study's data suggests that number is now fairly representative of where U.S. enterprise engineering organizations are operating, not an outlier.
At that volume, the engineer responding to a production incident is statistically unlikely to have written the code that caused it. Reading the source is no longer the primary path to understanding what happened. There is too much of it, generated too fast, by too many different prompts.
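The "statistically unlikely" claim can be put in back-of-the-envelope terms. The sketch below assumes, purely for illustration, that an incident is equally likely to originate in any line of code, that human-written code is split evenly across the team, and that the responder is picked without regard to authorship. The team size of 8 is a made-up number, not a figure from the report.

```python
def chance_responder_wrote_it(ai_share: float, team_size: int) -> float:
    """Rough probability that the on-call engineer authored the failing code.

    Assumes uniform incident risk per line and an even split of
    human-written code across the team (both simplifying assumptions).
    """
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0 and 1")
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    human_share = 1.0 - ai_share
    return human_share / team_size


if __name__ == "__main__":
    # 75% AI-generated code, hypothetical team of 8 engineers.
    print(chance_responder_wrote_it(0.75, 8))
```

Under those assumptions, the responder wrote the offending code in roughly 3% of incidents. Real distributions will be lumpier, but the order of magnitude is what matters for the argument.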
That changes what verification actually requires. It is not slower review of the same artifacts. It is a different approach to what you are looking for, how AI-generated code characteristically fails, and what signals in production telemetry or test output are most likely to surface those failures before users do.
What Testers Need to Know Going Into the Second Half of 2026
The New Relic data quantifies what the testing community has been describing in qualitative terms for the past year. AI-generated code is higher volume, lower friction to ship, and more likely to contain structural issues that are not visible at review time.
The skill set that addresses that combination is not standard QA applied faster. It is QA that understands how AI systems generate code, where those systems characteristically go wrong, and how to design test coverage that accounts for the failure modes the AI itself could not anticipate.
That includes understanding integration testing in codebases where services were written with different AI contexts. It includes data integrity testing for outputs that were generated with assumptions about input shape that production will not always honor. It includes security testing for code that passed review because it looked correct and had vulnerabilities that only appear under real interaction patterns.
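For the data integrity case, a minimal sketch of the idea is to write down the input shape the generated code assumed and then run production-like records against it. Everything here is hypothetical: the `ASSUMED_SHAPE` fields, the `shape_violations` helper, and the sample records are invented for the example.

```python
from typing import Any, Dict, List

# Shape the (hypothetical) generated code assumed for every order record.
ASSUMED_SHAPE = {"order_id": str, "quantity": int, "unit_price": float}


def shape_violations(record: Dict[str, Any]) -> List[str]:
    """List every way a record departs from the assumed shape."""
    problems = []
    for field, expected_type in ASSUMED_SHAPE.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif record[field] is None:
            problems.append(f"{field}: null")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected_type):
            # bool is excluded explicitly because it is a subclass of int.
            got = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {got}")
    return problems


# Made-up records standing in for a sample of production input.
SAMPLE_RECORDS: List[Dict[str, Any]] = [
    {"order_id": "A1", "quantity": 2, "unit_price": 9.99},
    {"order_id": "A2", "quantity": "2", "unit_price": 9.99},
    {"order_id": "A3", "quantity": None, "unit_price": 9.99},
    {"order_id": "A4", "unit_price": 10},
]

if __name__ == "__main__":
    for rec in SAMPLE_RECORDS:
        print(rec["order_id"], shape_violations(rec))
```

Records A2 through A4 are each plausible in real traffic (a stringified number, a null, a missing field, an integer price), and each one breaks an assumption that would look perfectly reasonable in a pull request.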
The report's conclusion is that the observability layer is where engineering organizations will win or lose the AI-assisted era. The testing layer is where those organizations either catch the problems before production does or find out about them the hard way.
Most teams right now are finding out the hard way. That is the opening.
The Credential That Matches the Problem
The gap the New Relic data describes is not a general software quality problem. It is a specific one: AI-generated code failing in ways that standard testing approaches were not designed to catch, at volumes that have outpaced most teams' ability to review it carefully.
The ASTQB AI Assurance Pro designation exists precisely because that problem is real and measurable. It is built on two ISTQB certifications that together cover both sides of the challenge: testing with AI tools, and testing AI systems themselves. Those are the two skill sets the report's data keeps pointing back to, whether it is talking about the trust gap at review time, the integration failures in production, or the senior-engineer firefighting load that is growing at 86% of organizations.
Teams that are shipping 51% to 75% AI-generated code every week, which now describes the typical organization in this study, need testers who understand how that code characteristically fails, not just that it sometimes does. That understanding does not come from general QA experience. It comes from learning it deliberately.
The path to AI Assurance Pro runs through ISTQB's AI testing exams, and a lot of working testers are closer to completing it than they realize. If the production gap in this report describes something your team is already living with, the skills to close it exist and they are certifiable.