The assumption that broke

For most of software testing's history, every method, every framework, every certification, and every tool was built around the same basic idea. You give a system a specific input. You know what the output should be. You check whether the two match. That is deterministic testing, and it worked well for a very long time because most software was deterministic. If you got the same input twice, you got the same output twice. A bug was a consistent failure you could reproduce and trace.

AI systems do not work that way. Give a large language model the same prompt twice and you get two different responses. Both might be perfectly acceptable. One might be better than the other. Or one might be subtly wrong in a way that string comparison will never catch. There is no fixed expected output to check against. The whole testing paradigm breaks down.

By this point it is not some edge case for experimental teams. Stanford HAI's 2025 AI Index says 78% of organizations reported using AI in 2024, up from 55% the year before. Once AI is normal inside delivery work, testing has to move with it whether teams feel ready or not.

Two different problems, often confused for one

When people talk about AI changing software testing, they are usually describing one of two things. Keep them separate because they require different responses.

The first is the problem of testing software that was written with AI help. More code, produced faster, with a higher defect rate than human-written code. The testing problem here is fundamentally about volume and thoroughness. You have more to verify and less time to do it, and the failure modes are different from what your existing review processes were designed to catch.

The second is the problem of testing software that contains AI components. A chatbot, a recommendation engine, a fraud detection model, a document summarizer. The testing problem here is about methodology. Standard testing does not apply. You need different techniques entirely.

Both problems matter. Both are growing. And a lot of teams are dealing with both at once.

To separate those two problems, read What Is Vibe Coding? for the AI-written-code side, What Is AI Testing? for the methodology side, How to Test LLM Applications for the full-system version, and Will AI Replace Software Testers? for the career question that usually comes next.

78%
of organizations reported using AI in 2024, up from 55% the year before
96%
of developers do not fully trust that AI-generated code is functionally correct
48%
say they always check AI-assisted code before committing it
v1
OWASP published the first open community-driven standard for trustworthiness testing of AI systems on November 26, 2025

Why the old testing model does not work for AI

Traditional testing runs into a basic problem with AI systems: the oracle is gone.

In traditional testing, the test oracle is whatever tells you whether the output is correct. It might be a specification, an expected value in your test suite, or a known correct output from a previous version. You run the test, you compare the output to the oracle, and you get pass or fail.

For an AI system, there often is no oracle. Ask an AI to summarize a document and you might get five different responses, all of which are accurate, some of which are better than others, and none of which you could have specified in advance. You cannot write a test that says "the summary must equal this string." You can only evaluate whether the summary is good enough, which is a judgment call, not a check.

Trustworthy AI depends on measurement and evaluation that go beyond accuracy alone. Reliability, robustness, bias, interpretability, and transparency all matter.

NIST AI TEVV

You do not fix that by writing one smarter test. It is part of how these systems behave. The response is a different testing mindset and people who know how to apply it.

How teams are adapting

People are already figuring out practical ways to handle this. None of them are magic replacements for an old test suite. They just ask you to think differently about what testing is supposed to do.

Testing in layers

One of the more useful frameworks that has emerged is the idea of separating what you are testing into distinct layers, because different layers of an AI system actually can be tested differently.

Layer 1
Deterministic logic

Tool routing, argument parsing, response formatting, state machine transitions. Traditional unit tests work fine here. Run them on every commit. This is the part of the system that behaves like normal software.

Standard testing applies
Layer 2
AI output quality

Faithfulness, relevance, hallucination, and accuracy of the model's actual responses. Requires evaluation frameworks that use a judge model or human review. Slower and more expensive to run than unit tests.

New methods required
Layer 3
End-to-end behavior

Scenario-based evaluation that tests the full pipeline across multi-turn conversations or complex tool-use sequences. Validates that the system holds together under real conditions, not just individual outputs in isolation.

New methods required

This way of splitting the work lines up with what NIST is doing in practice. Its ARIA program separates evaluation into model testing, red-teaming, and field testing. That is a good reminder that AI systems need more than one kind of scrutiny.

Acceptance bands instead of pass/fail

Because you cannot define a single correct output for most AI responses, effective AI testing relies on defining a range of acceptable quality rather than a fixed expected value.

An acceptance band is a predefined quality threshold. If a response scores above a certain level on a relevance or faithfulness metric, it passes. Below that threshold, it fails. The model can produce varied responses and still consistently meet the standard. This gives you a workable pass/fail framework without requiring the output to be identical every time.

What changes on the ground: teams stop asking for one exact answer and start defining acceptable ranges, repeatable evaluation sets, and clear thresholds for what counts as good enough.

Hallucination and bias testing

Two failure modes specific to AI systems have no real equivalent in traditional testing: hallucination and bias.

Hallucination is when an AI model produces information that is factually wrong, invented, or untraceable to any source. The model states something with confidence. It sounds plausible. It is wrong. Hallucination testing checks whether model outputs can be traced back to real source material and flags content where the model appears to have fabricated facts, entities, or figures.

Bias testing evaluates whether a model's outputs treat different groups, topics, or inputs fairly and consistently. A model that gives different quality responses based on demographic characteristics of the user, or that systematically mishandles certain topics, has a bias problem that would never appear in a functional test. It requires deliberate testing across demographic slices and edge-case inputs.

Both of these are documented in the OWASP AI Testing Guide, which published version 1 on November 26, 2025. The guide makes a pointed observation: security testing is not sufficient for AI systems. The real objective is AI trustworthiness, which requires a broader set of checks than any security scan can provide.

Continuous monitoring as a testing discipline

Traditional software testing happens before release. You run your tests, you check the results, you decide whether to ship. AI testing cannot be front-loaded that way, because AI systems change over time even when you do not change the code.

Model drift is the phenomenon where a model's behavior changes as the data it processes shifts. A customer service AI that worked well in January may behave noticeably differently in October because the types of questions it has been answering have changed how it responds. This does not show up in pre-release testing. It only shows up through ongoing monitoring of production behavior.

NIST's ARIA work is useful here because it includes field testing alongside model testing and red-teaming. You can learn a lot in a test environment, but some problems only show up once the system meets live users and live data.

The gap between traditional and AI testing

Looking at what changed side by side makes the scope of the shift clearer.

Traditional testing
  • Fixed expected outputs
  • Pass/fail assertions
  • Deterministic behavior
  • Pre-release verification
  • Regression suites
  • Bug is a reproducible failure
  • Coverage measured in code paths
AI testing
  • Acceptable ranges of output
  • Threshold-based quality gates
  • Non-deterministic behavior
  • Continuous production monitoring
  • Scenario-based behavioral evaluation
  • Failure can be a plausible-sounding wrong answer
  • Coverage measured in behavioral scenarios

This is not just a small upgrade to existing skills. It is a different set of things to know and do. The techniques that make someone strong at testing a traditional web application do not automatically carry over to an LLM-powered feature in that same app.

What QA teams are actually doing about this

The organizations getting the best results from AI testing right now share a few characteristics that show up consistently across industry reporting.

They have decided explicitly where human judgment stays in the loop. Not as a catch-all fallback, but as a deliberate design decision. They define the checkpoints in the pipeline where a human reviews AI output before the system proceeds. The accountability for quality does not get handed entirely to an automated framework.

They treat quality as something that gets measured in production, not just in staging. Production logs, performance data, and user telemetry expose things that controlled test environments cannot, especially in distributed or AI-heavy systems where behavior under real load differs from behavior in a controlled test.

And they have started building specific review criteria for AI-assisted code commits, things to look for that standard code review misses: hallucinated API references to functions that do not exist, security patterns the model picked up from low-quality training data, business logic the model guessed at rather than derived from actual requirements.

61%
of developers say AI often produces code that looks correct but is not reliable

That last stat is worth paying attention to. Tooling gets most of the attention in 2026. The basics get less. A lot of testing failures still come down to fuzzy requirements and fuzzy definitions of what good looks like. That gets even harder when the system you are testing is non-deterministic.

What skills this requires

The skill gap in AI testing is real, and it is pretty specific. This is not a vague “learn more about AI” problem. There are concrete things strong AI testers know how to do that most QA people were never taught in the first place.

  • Probabilistic evaluation: assessing whether outputs fall within an acceptable quality range rather than checking for exact matches
  • Hallucination detection: identifying when model outputs contain fabricated or unsupported information
  • Bias and fairness testing: evaluating whether a model behaves consistently across demographic groups and input types
  • Adversarial testing: attempting to manipulate model behavior through prompt injection, jailbreak attempts, and edge-case inputs
  • Behavioral consistency validation: checking whether a model behaves predictably across repeated similar inputs over time
  • Model drift monitoring: tracking changes in model behavior between release cycles even when the underlying code has not changed

These are not concepts from a general software testing course. One response is informal self-study. Another is a more structured path through focused training and certifications such as ISTQB AI Testing and Testing with Generative AI, which sit inside the ASTQB AI Assurance Pro™ designation.

That is also why pages like What Is Hallucination Testing?, AI Assurance Pro for Testers, AI Assurance Pro for Managers, How to Become an AI Tester, and AI Insurance Risk Shows Why AI Testers Are Needed end up mattering. They are all looking at the same shift from different angles.

Where this leaves QA as a profession

There is a version of this story where AI tools reduce the need for software testers. That version is not what the data supports. The version the data supports is that AI increases the volume and complexity of what needs to be tested, changes the methodology required to test it, and creates a genuine shortage of people who know how to do the new version of the job.

That does not mean QA starts over from scratch. A lot of the instincts still transfer: understanding user needs, looking for edge cases, validating against risk, and challenging confident-looking output. What changes is the method, not the basic value of the work.

The QA teams that do well over the next few years will not just be the ones with the fanciest tooling. They will be the ones that got honest about what their current practices were built for, saw where those practices fall short for AI systems, and learned the missing pieces.