
Stop Retrying Flaky Tests. Classify Them Instead.

Qate AI Team · 14 min read

It usually starts small. A test fails on a Friday afternoon. The author re-runs it. It passes. They add retries: 2 and move on. Six months later the same suite has retries: 5 and a paragraph of YAML comments justifying each bump, and somewhere in the noise the team has shipped two real regressions that quietly passed on retry.

This is the modern CI hygiene problem. Not that tests fail — tests should fail when something is wrong — but that the response to failure has collapsed into a single primitive: retry until green. Retries are cheap, fast, and emotionally satisfying. They are also the worst possible signal-management strategy for a CI suite that mixes three fundamentally different kinds of failure, each of which demands a different fix.

This piece is about that taxonomy: what the three classes of test failure are, how to tell them apart, what tooling actually acts on each, and where the whole approach falls down.

The flakiness numbers everyone quotes (but few act on)

The case for taking failure triage seriously starts with the data, and the data is bleak.

Google's much-cited 2016 report found that roughly 1 in 7 of their tests exhibited some level of flakiness, and that 84% of transitions from pass to fail in their CI involved a flaky test. A decade on, the picture has not improved at most companies. Atlassian's engineering team reported in 2025 that approximately 15% of Jira backend repo failures are attributed to flaky tests, wasting over 150,000 hours of developer time each year, and that was after they built a dedicated tool to manage it. Bitrise's 2025 Mobile Insights Report shows the proportion of teams experiencing test flakiness grew from 10% in 2022 to 26% in 2025 across the 10M+ builds they observed.

A peer-reviewed industrial case study quantified the direct cost at 2.5% of total productive developer time, most of it split between investigating (1.1%) and repairing (1.3%) flaky failures. And on the CI bill itself, an analysis of 10,000 GitHub Actions runs estimated that a single test that flakes three times a week costs $5,850 per year in compute. At any non-trivial test count, this becomes a five-to-six-figure line item.
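To see how the per-test figure turns into a five-to-six-figure line item, the arithmetic can be sketched as below. The function names are hypothetical, and the linear scaling (every flaky test costing the same per flake) is a simplifying assumption rather than something the cited analysis claims.

```typescript
// Figures quoted above: one test, flaking 3 times a week, costs $5,850 per year.
const WEEKS_PER_YEAR = 52;

// Back out the cost of a single flake from the annual figure.
function impliedCostPerFlake(annualCostUsd: number, flakesPerWeek: number): number {
  return annualCostUsd / (flakesPerWeek * WEEKS_PER_YEAR);
}

// Assumption: cost scales linearly with the number of equally flaky tests.
function annualSuiteCost(
  flakyTests: number,
  flakesPerWeek: number,
  costPerFlakeUsd: number,
): number {
  return flakyTests * flakesPerWeek * WEEKS_PER_YEAR * costPerFlakeUsd;
}

const perFlakeUsd = impliedCostPerFlake(5850, 3); // 5850 / 156 = 37.5
const tenFlakyTests = annualSuiteCost(10, 3, perFlakeUsd); // 58,500: five figures
const twentyFlakyTests = annualSuiteCost(20, 3, perFlakeUsd); // 117,000: six figures
```

Under these assumptions, roughly twenty tests flaking at that rate are enough to cross into six figures.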

The numbers are getting worse, not better. Modern SPAs hydrate asynchronously, microservices add network surface area, AI features produce non-deterministic outputs, and CI runners themselves have grown noisier as cloud providers oversubscribe. The retry strategy was workable when flaky failures were a rare exception. It is not workable when they are routine.

Why retries are a sedative, not a cure

The reason retries: 2 feels like a fix is that, on the surface, it is. The build goes green. The PR merges. The dashboard looks healthier.

What retries actually do is destroy your ability to tell the three kinds of failure apart. A test that passed on retry could have been a transient network blip — fine. It could also have been a 1-in-3 race condition in your own application code that will hit a real user next Tuesday. Datadog's own knowledge center warns that auto-retry combined with quarantine produces a "re-run and hope" culture in which the cost of misclassifying a real regression as a flake is shipping the regression.

The DeFlaker paper from ICSE 2018 makes the methodological version of this argument concrete: in their evaluation, Maven's standard rerun-based flake detection caught only 23% of confirmed flaky tests, while DeFlaker's coverage-based approach caught 95.5%. Reruns are not just an incomplete signal — they are an actively misleading one.

So what's the alternative? Stop treating the test result as a binary. Treat it as a classification problem.

Three classes of failure

The 2014 Luo et al. taxonomy of 201 flaky-fix commits in Apache projects remains the most-cited root-cause breakdown: async wait 45%, concurrency 20%, test-order dependency 12%, with the remainder split across resource leaks, network, time, I/O, randomness, and unordered collections. A 2025 ICST empirical study of 123 flaky tests in 49 open-source web projects found Event-DOM interactions accounted for 32.5% and Event operations for 22.8% — the modern web variant of the same pattern.

These taxonomies are useful for researchers, but they're too fine-grained for daily triage. For a working engineering team, the practical question is "what should I do next," and there are exactly three answers. Hence three classes.

Class A — Flaky locator or environmental

The test is testing something real, the application works, but the selector or the environment can't find or reach the element reliably. The log usually says something like TimeoutError: locator.click: Timeout 30000ms exceeded or Element is not attached to the DOM.

Common patterns:

  • Dynamic IDs from CSS-in-JS or hashed classes. Selectors like .sc-bdVaJa.kPgwxR or #mui-12345 change every build. Playwright issue #34945 is a representative example — getByText('text').click() times out only on GitHub Actions because the matching node is re-keyed mid-render.
  • Late-mounting components. The selector resolves before the list has hydrated. Playwright issue #17275 walks through nth=5 resolving against a four-item list.
  • A/B variant DOM. The selector silently matches only the control variant, the experiment ships, and 50% of CI runs fail. QA.tech's writeup is a good walkthrough.
  • Shadow DOM and iframe re-renders. The locator resolves, the host re-renders, and the handle detaches. Chromium issue 40671514 tracks the long tail.

The right fix is almost always to switch to semantic locators (role + accessible name, stable data-testid) and to assert on the loaded state of the container before indexing into it. When that's not viable — for example on legacy apps you can't add test IDs to — automated self-healing is the response. We cover the techniques in detail in How Self-Healing Tests Address the Maintenance Problem.
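The brittle selector shapes listed above can often be caught before they reach CI. The sketch below is a hypothetical lint-style check, not part of Playwright or any other tool; the pattern names and regular expressions are illustrative assumptions and far from exhaustive.

```typescript
// Hypothetical heuristic: flag selectors that tend to change between builds.
const BRITTLE_PATTERNS: { name: string; pattern: RegExp }[] = [
  // Generated CSS-in-JS classes such as ".sc-bdVaJa"
  { name: "generated-class", pattern: /\.sc-[A-Za-z0-9]+/ },
  // Auto-numbered ids such as "#mui-12345"
  { name: "generated-id", pattern: /#[A-Za-z][\w-]*-\d{3,}/ },
  // Positional indexing such as "nth=5" or ":nth-child(5)"
  { name: "positional-index", pattern: /nth=\d+|:nth-(child|of-type)\(\d+\)/ },
];

// Returns the names of every brittle pattern the selector matches.
function brittleReasons(selector: string): string[] {
  return BRITTLE_PATTERNS
    .filter(({ pattern }) => pattern.test(selector))
    .map(({ name }) => name);
}

const hashedClass = brittleReasons(".sc-bdVaJa.kPgwxR"); // ["generated-class"]
const stableTestId = brittleReasons('[data-testid="checkout-submit"]'); // []
```

A check like this only flags candidates for review; whether a given selector is actually unstable still depends on how the application generates its markup.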

Class B — Unstable test (your test is wrong)

The application is fine. The test code itself has a bug — usually a timing assumption, an isolation leak, or a hardcoded environment variable.

  • Race on async UI. The canonical Cypress version: cy.visit('/'); cy.get('[data-test=spinner]').should('be.visible'). The API returns in under 50ms in CI and the spinner is gone before the assertion runs. Dai Codes has a thorough walkthrough.
  • Hardcoded sleep. cy.wait(2000) works on a 2.4 GHz dev laptop and dies on a throttled CI container. Mergify lists this as their #1 Cypress flake pattern.
  • Test isolation breaks. Playwright issue #29428 — two specs share a storageState and clobber each other's cookies. Issue #10155 — parallel contexts navigate to the same URL because a module-level page is captured in a closure.
  • Timezone and locale. Naufal Iwel's writeup is the platonic example: a Docker runner with no TZ set, asserting on '2024-03-15', failing for a 1-hour window each night.

These are not failures the application can heal from. The fix has to land in the test code. The good news is that this is also the class where AI-assisted refactoring helps most directly: a coding agent that can read the trace, identify the race, and propose a condition-based wait in place of a fixed sleep (await expect(rows).toHaveCount(10) instead of page.waitForTimeout(2000) in Playwright, or a .should('have.length', 10) assertion instead of cy.wait(2000) in Cypress) is genuinely faster than a human.
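The timezone pattern is a good example of a fix that lives entirely in test code. A minimal sketch, assuming the test needs a calendar date for a known instant: pin the timezone explicitly instead of inheriting whatever the runner has. The helper name isoDateInZone is hypothetical; Intl.DateTimeFormat and formatToParts are standard JavaScript.

```typescript
// Format an instant as YYYY-MM-DD in an explicitly named timezone,
// so the result does not depend on the TZ of the machine running the test.
function isoDateInZone(instant: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).formatToParts(instant);
  const get = (type: string): string =>
    parts.find((p) => p.type === type)?.value ?? "";
  return `${get("year")}-${get("month")}-${get("day")}`;
}

// The same instant falls on different calendar days depending on the zone.
const lateEvening = new Date("2024-03-15T23:30:00Z");
const inUtc = isoDateInZone(lateEvening, "UTC"); // "2024-03-15"
const inBerlin = isoDateInZone(lateEvening, "Europe/Berlin"); // "2024-03-16"
```

An assertion on a hardcoded '2024-03-15' passes or fails here purely on which zone the runner happens to use, which is why the zone should be stated in the test rather than left to the environment.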

Class C — Real bug surfaced by a working test

The test is correct. The application is broken. This is the failure class that retries are most dangerous around, because passing-on-retry is exactly the symptom an intermittent application bug produces.

Recognition signals, in rough order of usefulness:

  1. The failure reproduces in headed mode on a developer laptop.
  2. The same assertion fails across unrelated PRs that touch the same feature area.
  3. The trace shows the application returning the wrong data, not the locator timing out.
  4. Failure rate correlates with traffic or load, not with worker count or CI runner.

Concrete patterns: a backend dependency returns 500 only under connection-pool exhaustion (Datadog's own agent repo has examples); the new code introduces an actual regression that only surfaces with realistic data; a race in the application itself, not the test. Rent the Runway's "Leveraging Flaky Tests to Identify Race Conditions" is the canonical writeup of one of these — a checkout test that everyone assumed was flake turned out to be a double-submit bug in production.

The cheap heuristic: if a test fails on the same step across three or more unrelated PRs in a week, treat it as Class C until proven otherwise.
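That heuristic is simple enough to automate. The sketch below assumes a hypothetical FailureRecord shape exported from CI, and it approximates "unrelated PRs" with distinct PR numbers, since relatedness itself is not something the code can judge.

```typescript
// Hypothetical record of one failed test step in one CI run.
interface FailureRecord {
  testId: string;
  failedStep: string;
  prNumber: number;
  timestampMs: number;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Returns "test @ step" keys that failed in at least minDistinctPrs
// different PRs during the week ending at nowMs.
function suspectedClassC(
  failures: FailureRecord[],
  nowMs: number,
  minDistinctPrs: number = 3,
): string[] {
  const prsByKey = new Map<string, Set<number>>();
  for (const f of failures) {
    const ageMs = nowMs - f.timestampMs;
    if (ageMs < 0 || ageMs > WEEK_MS) continue; // outside the window
    const key = `${f.testId} @ ${f.failedStep}`;
    const prs = prsByKey.get(key) ?? new Set<number>();
    prs.add(f.prNumber);
    prsByKey.set(key, prs);
  }
  const suspects: string[] = [];
  prsByKey.forEach((prs, key) => {
    if (prs.size >= minDistinctPrs) suspects.push(key);
  });
  return suspects.sort();
}
```

Keying on the failed step as well as the test matters: a test that fails at a different step each time looks more like Class A or B than a single application bug.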

How to classify automatically (without false positives)

The naive approach is statistical. Run the same commit twice; if the results disagree, the test is flaky. This is broadly what Trunk.io Flaky Tests, Buildkite Test Engine, CircleCI Test Insights, and BrowserStack Test Observability all do under the hood, with varying degrees of UI on top. It separates the universe into "this test fails sometimes" and "this test fails always", which is useful, but it only sorts Classes A and B together against consistently failing Class C bugs. An intermittent Class C bug, the kind that passes on retry, lands in the "fails sometimes" bucket alongside the flakes.
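The same-commit comparison can be sketched in a few lines. This is an illustration of the idea only, with a hypothetical RunResult shape; it is not how any of the named products is implemented.

```typescript
// Hypothetical record of one test outcome on one commit.
interface RunResult {
  testId: string;
  commitSha: string;
  passed: boolean;
}

// A test is flagged when the same commit produced both a pass and a fail.
function flakyOnSameCommit(results: RunResult[]): string[] {
  const seen = new Map<string, { testId: string; pass: boolean; fail: boolean }>();
  for (const r of results) {
    const key = `${r.testId}\u0000${r.commitSha}`;
    const entry = seen.get(key) ?? { testId: r.testId, pass: false, fail: false };
    if (r.passed) {
      entry.pass = true;
    } else {
      entry.fail = true;
    }
    seen.set(key, entry);
  }
  const flaky: string[] = [];
  seen.forEach((entry) => {
    if (entry.pass && entry.fail && flaky.indexOf(entry.testId) === -1) {
      flaky.push(entry.testId);
    }
  });
  return flaky.sort();
}
```

A pass on one commit and a fail on another is deliberately not counted, because the code changed between them and the disagreement may be a real regression.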

The more interesting question is whether a machine can distinguish Class A (locator) from Class B (test code) from Class C (real bug). This is where the academic results get uncomfortable. A 2024 ICST study of 230,439 failures across 26 Java projects found failure-deduplication classifiers ranged from 100% specificity on some projects to entirely ineffective on others. The FlakeFlagger random-forest classifier reported strong benchmark numbers — 60% precision, 72% recall — but a separate finding noted that production deployment routinely fails to generalize across codebases.
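To put the quoted FlakeFlagger figures in working terms, the arithmetic below combines them into an F1 score and shows what 60% precision means for someone triaging the flagged tests. The helper names are hypothetical; the formulas are the standard definitions.

```typescript
// Harmonic mean of precision and recall.
function f1Score(precision: number, recall: number): number {
  return (2 * precision * recall) / (precision + recall);
}

// Of everything the classifier flags, how many are expected to be wrong.
function expectedFalseAlarms(flagged: number, precision: number): number {
  return flagged * (1 - precision);
}

const flakeFlaggerF1 = f1Score(0.6, 0.72); // about 0.65
const falseAlarmsPer100 = expectedFalseAlarms(100, 0.6); // about 40 of every 100 flagged
```

Roughly four in ten flagged tests being false alarms is workable as a hint for a human, which is a different bar from acting on the label automatically.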

In practice, that means: statistical flake detection is reliable and cheap and you should turn it on. AI-based root-cause classification is genuinely useful inside a single codebase once it has enough labeled examples, but treat any vendor claim of a generalized classifier with healthy skepticism.

The response loop

Once a failure is classified, each class has its own response.

For Class A, the response is to update the locator. Mabl's Adaptive Auto-Healing records ~35 element attributes per step and re-ranks candidates at runtime. Testim (now part of Tricentis) uses weighted attribute scoring on what they call Smart Locators. Functionize, ACCELQ, and testRigor all market similar mechanisms. The differentiator that matters is whether the heal is surfaced for human review (good — preserves the audit trail) or applied silently (bad — produces tests that work by accident).
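The general shape of attribute-scored healing can be sketched as follows. This is not Mabl's or Testim's algorithm; the attribute names, weights, and threshold are all illustrative assumptions. The one design choice taken from the text is that a heal is returned as a suggestion flagged for review rather than applied silently.

```typescript
type Attributes = Record<string, string>;

interface HealSuggestion {
  candidateIndex: number;
  score: number;
  needsReview: true; // always surfaced to a human, never auto-applied
}

// Hypothetical weights: stable, intentional attributes count for more.
const WEIGHTS: Record<string, number> = {
  "data-testid": 5,
  role: 3,
  name: 3,
  text: 2,
  tag: 1,
  class: 0.5,
};

// Weighted share of the recorded attributes that the candidate still matches.
function scoreCandidate(recorded: Attributes, candidate: Attributes): number {
  let matched = 0;
  let total = 0;
  for (const key of Object.keys(recorded)) {
    const weight = WEIGHTS[key] ?? 1;
    total += weight;
    if (candidate[key] === recorded[key]) matched += weight;
  }
  return total === 0 ? 0 : matched / total;
}

// Pick the best candidate above the threshold, or null if none qualifies.
function suggestHeal(
  recorded: Attributes,
  candidates: Attributes[],
  minScore: number = 0.6,
): HealSuggestion | null {
  let best: HealSuggestion | null = null;
  for (let i = 0; i < candidates.length; i++) {
    const score = scoreCandidate(recorded, candidates[i]);
    if (score >= minScore && (best === null || score > best.score)) {
      best = { candidateIndex: i, score, needsReview: true };
    }
  }
  return best;
}
```

Returning null below the threshold is the important part: a healer that always picks something is the one that produces tests that work by accident.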

For Class B, the response is to fix the test. There is no platform to delegate this to; it's a code change. AI coding agents like Cursor, Cline, and Aider are useful here precisely because the fix is small and local: replace a sleep with an expect(...).toHaveCount(...) assertion, or hoist a storageState out of module scope. The agent doesn't need to understand your application; it needs to understand the failure pattern.

For Class C, the response is to file the bug and open a fix PR. The interesting development of the last two years is that the latter is increasingly automated. GitHub Copilot Autofix drops the median time-to-fix on supported CodeQL alerts from 1.5 hours to 28 minutes — though it operates on security findings, not on arbitrary CI failures. Sentry Seer does the equivalent for production errors. Snyk's DeepCode AI Fix combines symbolic program analysis with a generative model for vulnerability patches.

[Image: GitHub blog post announcing Copilot Autofix general availability, with screenshots of an Autofix suggestion attached to a code-scanning alert on a pull request]

What's still rare is a tool that does all three: classifies the failure, heals or refactors the test if it's Class A or B, and hands a Class C off to a coding agent with the repro and suspected source files attached. Of the seventeen tools surveyed for this piece, only BrowserStack Test Observability and Qate AI publicly claim a multi-class failure taxonomy in the product itself, and only Qate closes the loop into a Copilot bugfix handoff for Class C — which is what makes the end-to-end "failing test → classified bug → drafted patch PR" loop actually mechanical rather than aspirational.

[Image: Qate AI marketing site hero, "GenAI Quality Assurance that uses your app like a real user", with three feature pills covering clicks-types-navigates, self-healing tests with automated bug fixing, and Web + Windows desktop + API testing]

The tooling landscape, honestly

Three categories, with where each tool actually sits:

Statistical flake detection (separates intermittent from consistent failures): Trunk.io Flaky Tests, Buildkite Test Engine, Datadog Test Optimization, BrowserStack Test Observability, CircleCI Test Insights. Datadog and BrowserStack add cross-run smart tagging. Note that Launchable was rebranded as CloudBees Smart Tests after the August 2024 acquisition; older articles citing Launchable are referring to the same product.

[Image: Buildkite Test Engine product page showing the flaky test management interface with test analytics, retry workflows, and per-test trend graphs]

Self-healing platforms (Class A solvers): Mabl, Testim (Tricentis), Functionize, ACCELQ, testRigor, Qate. The honest assessment is that the mechanism is broadly similar across all six — score candidate elements on attributes, pick the best match, optionally ask a human. The differentiators are platform coverage (most are web-only; Qate also covers native Windows desktop) and whether the platform surfaces the heal as a reviewable suggestion or applies it silently.

AI bugfix tooling (Class C solvers): GitHub Copilot Autofix (security findings only), Sentry Seer (production errors), Snyk DeepCode AI Fix (vulnerabilities), and general-purpose coding agents (Cursor, Cline, Aider) that need a separately produced signal to act on. The gap in this category is the bridge: most tools assume you've already classified the failure and have a clean repro. Producing that classification from a CI run is the work that's still mostly manual.

When this approach doesn't apply

The honest counterpoint is that for many teams, simple retries are the right answer, and the classification approach has its own failure modes.

Google's own published practice is to retry tests up to three times and only report failure after three consecutive failures — institutional acknowledgement that, at sufficient scale, retries are the right primitive. GitLab's engineering handbook mandates that "when a flaky test is blocking development on master, it should be quarantined", with fixing happening async. Spotify's Master Guardian similarly skips known-flaky tests pre-merge.
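A retry policy of that shape, reporting failure only after three consecutive failures, can be sketched as below. This is a hypothetical illustration and not Google's implementation; the one addition is that the outcome records whether the pass needed a retry, so the signal stays available for later classification instead of vanishing into a green build.

```typescript
interface RetryOutcome {
  passed: boolean;
  attempts: number;
  passedOnRetry: boolean; // true when the first attempt failed but a later one passed
}

// Runs the test up to maxAttempts times and stops at the first pass.
function runWithRetries(runTest: () => boolean, maxAttempts: number = 3): RetryOutcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (runTest()) {
      return { passed: true, attempts: attempt, passedOnRetry: attempt > 1 };
    }
  }
  // Every attempt failed: only now is the failure reported.
  return { passed: false, attempts: maxAttempts, passedOnRetry: false };
}
```

Keeping passedOnRetry in the result is what lets a team use retries to unblock merges while still counting how often they were needed.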

There are real reasons. Building observability infrastructure isn't free — industry guidance is to keep observability spend under ~20% of infrastructure spend, and for a sub-twenty-engineer team running a few CI failures a week, the engineering hours to set up classification can exceed what retries: 2 would have cost. And AI bugfix tools have documented failure modes: GitHub's own responsible use guidance for Autofix warns that suggestions may fail CI and need editing, and academic work on Copilot found it replicates the original vulnerable code pattern ~33% of the time. Auto-classifying and auto-fixing without human review is faster than a retry, and it can also ship a regression faster than a retry would have.

The Goldilocks zone is somewhere in the middle: statistical detection on, AI classification as a hint rather than an actuator, human in the loop on Class C fixes until ground-truth labels make a generalized classifier credible.

What to do Monday morning

If the per-test cost numbers above bother you and you want to start somewhere concrete:

  1. Measure your pass-on-retry rate. Most CI systems will tell you this for free. Anything north of 5% means retries are doing work that classification should be doing.
  2. Stop adding retries to new tests. Cap the existing budget, don't raise it.
  3. Turn on statistical flake detection. Any of Trunk, Buildkite Test Engine, CircleCI Insights, or Datadog Test Optimization gets you a first-pass split between intermittent failures (mostly Classes A and B) and consistent ones (mostly Class C) without buying a new platform.
  4. Make the response loop explicit per class. Even a written policy that says "Class A → semantic locator refactor, Class B → fix the test in the PR, Class C → file a P1 immediately" is a step up from retries: 2.
  5. Don't reach for AI classifiers until you have ground truth. Spend a sprint manually labeling a few hundred failures first; the generalization problem is real.
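For the first step, if your CI system does not report a pass-on-retry rate directly, it can be computed from per-attempt results. The sketch below assumes a hypothetical TestExecution export and defines the rate as the share of all executions that failed their first attempt and passed a later one; the 5% budget is the threshold suggested above.

```typescript
// Hypothetical export: one entry per test per CI run.
interface TestExecution {
  testId: string;
  attempts: boolean[]; // outcome of each attempt in order; true means passed
}

// True when the first attempt failed and the final attempt passed.
function passedOnRetry(execution: TestExecution): boolean {
  const { attempts } = execution;
  return (
    attempts.length > 1 &&
    attempts[0] === false &&
    attempts[attempts.length - 1] === true
  );
}

// Share of all executions that were rescued by a retry.
function passOnRetryRate(executions: TestExecution[]): number {
  if (executions.length === 0) return 0;
  return executions.filter(passedOnRetry).length / executions.length;
}

const RETRY_RATE_BUDGET = 0.05;

// "North of 5%" is read here as strictly greater than the budget.
function retriesAreMaskingSignal(executions: TestExecution[]): boolean {
  return passOnRetryRate(executions) > RETRY_RATE_BUDGET;
}
```

Tracking the rate per test as well as per suite is a natural next step, since a handful of tests usually account for most of the rescued runs.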

Conclusion

The unspoken assumption behind retries: 2 is that test failures are noise and the job of CI is to filter the noise out. That assumption was defensible when flaky failures were a rare exception among real bugs. It is not defensible when flakiness is routine, the failures themselves come in three different shapes, and the response to each is different.

The shift worth making is from binary pass-fail to classified pass-fail. The statistical part is mature and cheap. The AI-driven root-cause part is uneven and worth treating with caution. The end-to-end "failing test → classified bug → drafted fix PR" loop — implemented at the moment by Qate AI on top of the GitHub Copilot integration, among a small number of others — is the direction the category is moving, but the gating problem is producing trustworthy classifications in the first place. That's the work.
