Automating Flaky Test Fixes: Why Rerunning Tests Until They Pass Is Killing Your CI/CD Pipeline
If you have a test that fails 50% of the time, you don't have a test; you have wasted resources. The purpose of automated testing is to verify that your code does what you think it does, and to give you confidence before shipping to production. A test that fails half the time is essentially telling you "maybe it works, maybe it doesn't," which tells you nothing. And yet teams running continuous integration pipelines all over the world deal with this exact problem every single day. Flaky tests are everywhere, and the way most teams handle them is fundamentally broken.

Here's what typically happens: an engineer sees a test failure in their pull request, examines the failure briefly, decides it's probably not related to their changes, and hits the "rerun failed tests" button. The tests pass on the second try. All green. Ship it. This pattern repeats dozens or hundreds of times across a team, eating up compute resources, slowing down deployment cycles, and eroding trust in the test suite. Even worse, some organizations don't enforce that all tests must pass before merging; it's more of a guideline, a suggestion that we "generally like everything passing" before code goes to production. Relying on humans to evaluate test results objectively and to merge only when everything is legitimately green is a failed approach because there aren't enough checks and balances. Without enforcement, code can be merged with nothing passing, and legitimate test failures that are catching real bugs get dismissed as flakiness.
This is the reality of flaky tests in modern software development, and it's antithetical to the entire purpose of continuous integration. CI exists to give us fast, reliable feedback about code quality. Flaky tests destroy that reliability. They create noise that drowns out signal. They waste time and resources. And they fundamentally undermine the confidence we're supposed to gain from automated testing. After more than 25 years working in software engineering, building systems across consultancy work, EdTech, and property tech, I've seen this problem persist across every organization and every tech stack. It's time we stopped accepting the "rerun until green" approach as inevitable and started treating flaky test remediation as something we can systematically automate.
The Real Cost of Flaky Tests
Before we dive into solutions, let's be clear about what we're actually losing when we tolerate flaky tests. First, there's the direct computational cost: every time an engineer reruns a failed test suite, that's CI compute time and resources that could have been used productively. Multiply that across a team of dozens of engineers making multiple commits per day, and you're looking at substantial waste. Second, there's the time cost to engineers who have to context-switch away from their work, investigate whether a failure is real or flaky, make the judgment call, and rerun tests. These interruptions fragment focus and reduce overall productivity in ways that are hard to measure but very real.
But the deeper cost is cultural and psychological. When your test suite regularly raises false alarms, failing even though nothing is wrong with the code under test, engineers stop trusting it. They develop patterns of automatically dismissing failures without proper investigation. This creates a dangerous environment where real bugs can slip through because the signal-to-noise ratio has degraded so badly that people tune out test failures entirely. You end up in a situation where the tests that were supposed to prevent bugs from reaching production have become so unreliable that they're effectively useless, or worse than useless, because they consume resources while providing false confidence. This is why flaky tests aren't just a technical annoyance; they're a fundamental threat to software quality and delivery speed.
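To put rough numbers on the rerun habit, here is a small back-of-the-envelope sketch. It assumes each run of a test fails independently with the same probability, which is a simplification; the function names are illustrative and not from any library.

```python
def prob_green_with_reruns(fail_rate, max_attempts):
    """Probability that at least one of `max_attempts` independent runs passes."""
    if not 0.0 <= fail_rate <= 1.0:
        raise ValueError("fail_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return 1.0 - fail_rate ** max_attempts


def expected_runs_until_green(fail_rate):
    """Average number of runs needed to see one pass, with unlimited retries."""
    if not 0.0 <= fail_rate < 1.0:
        raise ValueError("fail_rate must be at least 0 and below 1")
    return 1.0 / (1.0 - fail_rate)


# A test that fails half the time still goes green within three attempts
# in 87.5% of cases, and costs two runs on average instead of one.
print(prob_green_with_reruns(0.5, 3))   # 0.875
print(expected_runs_until_green(0.5))   # 2.0
```

The same arithmetic applies when the intermittent failure is a real bug: under these assumptions, a defect that breaks the test on half of all runs would still be waved through by "rerun until green" most of the time.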
Pattern Detection: Identifying Flaky Tests Systematically
The first step in automating flaky test fixes is systematic identification. When a test fails, we need to ask questions beyond "is this related to my pull request?" We need to check: Did this test only fail in this specific pull request, or is it showing up across multiple PRs? Has it failed recently in main or other branches? What's the failure rate over the last week, month, or quarter? By collecting and analyzing this data, we can distinguish between tests that are legitimately catching bugs introduced by new code and tests that are failing intermittently for reasons unrelated to the code changes. Tools like Datadog's Test Optimization have started addressing this by tracking test history and surfacing patterns, and CircleCI has integrated similar capabilities that let you analyze test instability directly in your development workflow.
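As a minimal sketch of that kind of pattern analysis, the snippet below flags a test as flaky when it has both passed and failed against the same commit, since the code did not change between those runs. The `RunRecord` shape, the `min_runs` threshold, and the function name are assumptions made for illustration; real CI systems expose this history in their own formats.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    test_id: str      # e.g. "tests/test_checkout.py::test_total"
    commit_sha: str   # commit the run executed against
    branch: str       # kept for reporting; not used by the heuristic below
    passed: bool


def find_flaky_tests(runs, min_runs=5):
    """Return {test_id: failure_rate} for tests with mixed results on one commit."""
    outcomes_by_commit = defaultdict(set)   # (test_id, sha) -> set of outcomes seen
    totals = defaultdict(lambda: [0, 0])    # test_id -> [failures, total runs]

    for run in runs:
        outcomes_by_commit[(run.test_id, run.commit_sha)].add(run.passed)
        totals[run.test_id][1] += 1
        if not run.passed:
            totals[run.test_id][0] += 1

    # Same code, different outcome: the strongest cheap signal of flakiness.
    flaky_ids = {
        test_id
        for (test_id, _sha), seen in outcomes_by_commit.items()
        if len(seen) == 2
    }
    return {
        test_id: totals[test_id][0] / totals[test_id][1]
        for test_id in flaky_ids
        if totals[test_id][1] >= min_runs
    }
```

A test that fails consistently on one commit and passes consistently on another is deliberately not flagged here, because that pattern looks more like a real regression than like flakiness.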
Once we've identified a test as flaky through pattern analysis, we can programmatically create a ticket for it. But we shouldn't stop there—that's just moving from an implicit problem to an explicit backlog item that will probably languish in "to do" indefinitely. The next step is automated triage: why is this test flaky? Is it a timing issue where the test doesn't wait long enough for an asynchronous operation to complete? Is it a test isolation problem where tests are affecting each other's state? Is it environmental inconsistency, like differences between CI runners? Is it an actual race condition in the application code that the test is correctly detecting intermittently? By categorizing the type of flakiness, we can make intelligent decisions about whether an automated fix is feasible and what approach might work.
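A first pass at this triage can be as simple as keyword rules over the failure message. The sketch below is only that: the categories mirror the questions above, but the regular expressions and the choice of which categories count as auto-fixable are hypothetical and would need tuning against your own failure logs.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
CATEGORY_RULES = [
    ("timing", re.compile(
        r"timed? ?out|deadline exceeded|element not (found|visible)", re.I)),
    ("isolation", re.compile(
        r"already exists|duplicate key|unique constraint|address already in use", re.I)),
    ("environment", re.compile(
        r"connection refused|no space left|certificate|out of memory", re.I)),
    ("race_condition", re.compile(
        r"deadlock|concurrent modification|data race", re.I)),
]

# Assumption: only these categories are worth attempting an automated fix for.
AUTO_FIXABLE = {"timing", "isolation"}


def categorize_failure(message):
    """Map a failure message to a coarse flakiness category."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(message):
            return category
    return "unknown"


def is_auto_fix_candidate(category):
    """True when the category is on the (assumed) auto-fixable list."""
    return category in AUTO_FIXABLE
```

Anything that lands in "unknown", or in a category such as a suspected race condition in the application code, would still get a ticket but be routed to a person instead of an automated fix attempt.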
Automated Remediation: When Machines Can Fix What They Find
Here's where it gets interesting: once we've identified and categorized a flaky test, we can attempt automated remediation for the straightforward cases. Some flaky test patterns have well-known solutions. Tests that fail due to insufficient wait times can often be fixed by replacing arbitrary sleep statements with explicit waits that poll for a condition up to a timeout. Tests that fail due to poor isolation can be fixed by ensuring proper setup and teardown, and running the suite in randomized order helps expose the hidden dependencies between tests in the first place. Tests that make assumptions about system state can be made more resilient by asserting their preconditions or by setting up that state explicitly.
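For the timing case, the usual fix looks something like the helper below: poll for the condition the test actually cares about, and give up only after a deadline. This is a generic sketch, not the API of any particular test framework; `wait_until` and its parameters are names chosen for illustration.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Before (flaky): hope that two seconds is always enough.
#     time.sleep(2)
#     assert job.is_done()
#
# After: wait only as long as needed, and fail with a clear message if not.
#     wait_until(job.is_done, timeout=10)
# (`job` is a hypothetical object standing in for whatever the test observes.)
```

The explicit wait returns as soon as the condition holds, so it tends to make the test faster on a quick runner and more tolerant on a slow one.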
The automation workflow I've been building follows this pattern: identify the flaky test through pattern analysis, categorize the type of flakiness, determine if it's amenable to automated fixing, generate a fix using AI tooling, create a pull request with the proposed fix, and then hand it back to humans for review. This is important: we're not blindly merging automated fixes. We're automating the detection, analysis, and first-pass solution generation, but keeping humans in the loop for the final verification. Some flaky tests have been flaky for a long time precisely because identifying the root cause is genuinely difficult. They're useful tests that we don't want to delete, but the underlying inconsistency is subtle or complex. For these cases, automated tooling might struggle to find the true root cause, but even surfacing the pattern and providing a proposed fix saves significant engineering time.
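The shape of that workflow can be sketched in a few lines. This is an illustration of the control flow only, not the actual implementation described here: `propose_fix`, `FixProposal`, and the injected `categorize` and `generate_patch` callables are all hypothetical names, and the set of auto-fixable categories is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FixProposal:
    test_id: str
    category: str
    patch: str
    needs_human_review: bool = True  # proposals are never merged automatically


def propose_fix(
    test_id: str,
    failure_message: str,
    test_source: str,
    categorize: Callable[[str], str],
    generate_patch: Callable[[str, str, str], str],
    auto_fixable=frozenset({"timing", "isolation"}),
) -> Optional[FixProposal]:
    """Run categorize -> gate -> generate, returning a proposal or None."""
    category = categorize(failure_message)
    if category not in auto_fixable:
        return None  # not amenable to automation; leave it as a ticket
    # generate_patch stands in for whatever AI tooling produces the fix.
    patch = generate_patch(test_source, failure_message, category)
    if not patch.strip():
        return None  # the tooling produced nothing usable
    return FixProposal(test_id=test_id, category=category, patch=patch)
```

Passing the categorizer and the patch generator in as plain callables keeps the gate logic testable without a model or a CI system attached, and the `needs_human_review` default makes the human-in-the-loop step explicit in the data that flows onward.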
Building This Without Vendor Lock-In
Now, I want to be clear about something: you don't need an out-of-the-box commercial solution to implement this approach. While there are vendors building products in this space, the core automation can be scripted using whatever AI tooling you already have access to—whether that's OpenAI's API, Claude, or open-source models running locally. The fundamental workflow is straightforward: collect test execution data from your CI system, analyze it for patterns, use an LLM to examine the test code and propose fixes based on the failure pattern, and create pull requests programmatically through your version control API. These are all capabilities that any experienced engineer can stitch together with a few hundred lines of code.
The reason I'm working on this isn't because it requires some breakthrough innovation or specialized product—it's because it's a practical solution to a universal problem that too many teams are still handling manually. Throughout my career building custom applications, architecting systems in EdTech, and now working on agent orchestration systems in property tech, I've consistently seen that the most impactful improvements come from automating the tedious, repetitive work that eats up engineering time without adding value. Fixing flaky tests is exactly this kind of work. It's important, it needs to be done, but it's also mechanical enough that we can teach systems to handle much of it. This frees up engineers to focus on building features, improving architecture, and solving problems that actually require human creativity and judgment.
The Broader Impact on Software Quality
When you systematically reduce flaky tests in your codebase, the benefits cascade throughout your development process. Your CI pipeline becomes faster because you're not rerunning tests multiple times. Your engineers become more productive because they're not context-switching to investigate false failures. Your test suite becomes trustworthy again, which means engineers actually pay attention when tests fail. And perhaps most importantly, you improve the overall quality of both your application code and your test code. Many flaky tests are actually exposing real issues—race conditions, improper state management, environmental assumptions—that we've been ignoring because they only manifest intermittently. By forcing ourselves to address these issues systematically rather than dismissing them as "just flakiness," we make our software more robust.
This connects to a broader transformation happening in software engineering right now: AI is fundamentally changing how we approach the software development lifecycle. We're moving from a world where automation was limited to simple, rules-based tasks to one where AI can analyze patterns, understand context, and propose solutions to problems that previously required human insight. Agent orchestration systems—which is what I work on day-to-day in property tech—are making this possible at scale. We can build agents that monitor, analyze, and act on various aspects of the development process, from code review to testing to deployment. Automated flaky test fixing is just one application of this broader pattern, but it's a practical one that delivers immediate value.
Getting Started With Automated Flaky Test Remediation
If you're dealing with flaky tests in your organization—and statistically, you almost certainly are—here's how to start addressing this systematically. First, instrument your CI system to collect detailed test execution data, including not just pass/fail but timing information and historical trends. Second, build or integrate tooling that analyzes this data to identify patterns of flakiness. Third, create a workflow that automatically generates tickets for flaky tests with relevant context about failure rates and patterns. Fourth, experiment with automated fix generation for the simplest categories of flakiness. And fifth, establish a review process that lets your team validate automated fixes before they merge.
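For the first step, many test runners and CI systems can already emit JUnit-style XML reports, so collecting per-test results and timings may only require parsing those files. The sketch below assumes the common `<testcase classname=... name=... time=...>` layout with `<failure>`, `<error>`, or `<skipped>` children; the output dictionary keys are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_junit_xml(xml_text):
    """Yield one record per <testcase> in a JUnit-style XML report."""
    root = ET.fromstring(xml_text)
    # iter() finds testcases whether the root is <testsuites> or <testsuite>.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            status = "failed"
        elif case.find("skipped") is not None:
            status = "skipped"
        else:
            status = "passed"
        yield {
            "test_id": f"{case.get('classname', '')}::{case.get('name', '')}",
            "status": status,
            "duration_s": float(case.get("time", 0) or 0),
        }
```

Storing these records alongside the commit SHA and branch of each CI run gives you the history that the later steps analyze.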
You don't need to build the entire system at once. Start with better visibility into test flakiness; just tracking which tests fail most often and under what conditions is valuable. Then add automated ticket creation to make sure these issues don't get forgotten. Then start experimenting with fix generation for specific patterns you see frequently in your codebase. The goal is continuous improvement of your test suite and, by extension, your overall software quality. Every flaky test you fix is one fewer false alarm, one fewer wasted CI run, one fewer context switch for your engineers, and one more source of reliable feedback about your code quality.
The Path Forward: Building Reliability Into Our Systems
The software industry has made tremendous progress in automated testing over the past two decades. We've moved from manual QA processes to comprehensive automated test suites that run on every commit. We've embraced continuous integration and continuous deployment as standard practices. But we've also created a new problem in the process: test suites that are so large and complex that maintaining their reliability is itself a significant challenge. Flaky tests are a symptom of this complexity, and they're not going away on their own. In fact, as our systems grow more distributed, more asynchronous, and more dependent on external services, the potential sources of test flakiness multiply.
The answer isn't to give up on testing or to accept flakiness as inevitable. The answer is to treat test suite quality with the same systematic rigor we apply to application code quality. That means measuring it, monitoring it, and automating improvements to it. Automated flaky test fixing is one piece of this puzzle. It's a way to take a problem that's traditionally required significant manual engineering effort and make it something our systems can handle partially on their own, with human oversight. This is what AI-assisted software development should look like: not replacing engineers, but augmenting our capabilities by handling the mechanical aspects of problem-solving so we can focus on the parts that require genuine human insight, creativity, and judgment.
After more than 25 years in this industry, I'm more convinced than ever that the best engineers aren't the ones who write the most code. They're the ones who build systems that solve problems sustainably and scalably. Automated flaky test remediation is exactly this kind of solution. It takes a persistent, resource-draining problem and transforms it into an opportunity to systematically improve your codebase. It saves time, improves quality, and makes your team more effective. And perhaps most importantly, it's practical and achievable with tools and techniques that are available right now.