Flaky tests are the fastest way to destroy trust in a software team. One day the pipeline is green. The next day it’s red, even though nothing changed. Someone reruns the build, the failure vanishes, and work continues. What starts as an annoyance slowly becomes routine.

And that routine is the real problem. Flaky tests don’t just waste time, they condition teams to stop trusting the system that’s meant to protect them.

Flaky tests are often described as a technical annoyance: bad selectors, race conditions, or poorly written assertions. Sometimes that’s true. More often, though, flakiness isn’t the problem itself. It’s a symptom of deeper issues in system design and operational discipline, and it’s feedback about how your system actually behaves. That feedback is often being ignored.

Modern systems are inherently asynchronous. They rely on background jobs, queues, retries, and eventual consistency. Tests that assume immediate consistency may pass in ideal conditions and fail as soon as timing shifts. That flakiness isn’t random. It reveals that the system doesn’t provide clear, deterministic signals for when work is truly complete.
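
As an illustration (a toy sketch in plain Python; `process_order`, `RESULTS`, and `wait_until` are all invented names, not a real API), compare a test that assumes the work finished immediately with one that waits on an explicit completion signal:

```python
import threading
import time

# Hypothetical background job: writes its result after a short, variable delay.
RESULTS = {}

def process_order(order_id):
    def work():
        time.sleep(0.05)              # timing the test does not control
        RESULTS[order_id] = "done"
    threading.Thread(target=work).start()

# Flaky: assumes immediate consistency, passes only when timing is kind.
def test_order_assumes_sync():
    process_order(42)
    return RESULTS.get(42) == "done"  # usually checked before the job finishes

# Deterministic: waits on an explicit "work is complete" signal, with a deadline.
def wait_until(predicate, timeout=2.0, interval=0.01):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

def test_order_waits_for_signal():
    process_order(43)
    return wait_until(lambda: RESULTS.get(43) == "done")
```

The point is not the polling loop itself but what it requires: the system has to expose a signal that the work is actually done. If no such signal exists, that is the design gap the flaky test was pointing at.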

Everything looks fine when tests run in isolation, then falls apart under parallel execution. The tests aren’t “bad”; they expose that isolation and parallelism were never real design goals.
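
A toy illustration of that kind of leakage (plain Python; the names and the cache are made up): two tests that each pass on their own, yet disagree depending on run order because they share module-level state:

```python
# Hypothetical shared state: a module-level cache that two tests both touch.
CACHE = {}

def test_writes_default():
    CACHE.setdefault("mode", "fast")
    return CACHE["mode"] == "fast"

def test_assumes_empty():
    # Passes when run first (or alone); fails after test_writes_default,
    # because the leaked entry is still there.
    return "mode" not in CACHE

# Order A: the isolation assumption happens to hold.
CACHE.clear()
order_a = [test_assumes_empty(), test_writes_default()]   # [True, True]

# Order B: the same two tests, reordered, now disagree.
CACHE.clear()
order_b = [test_writes_default(), test_assumes_empty()]   # [True, False]
```

Under parallel execution the interleaving is effectively random, so the suite flickers between these two outcomes, which is exactly what order-dependent flakiness looks like from the outside.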

This was difficult enough before AI entered the picture.

AI entered the room

AI is now marketed as a fix for flaky tests. Tools promise self-healing selectors, adaptive waits, retries, and intelligent stabilisation, all designed to make your pipelines green again and restore a sense of order. On the surface, everything looks healthier than before.

Teams accept these promises instinctively, like a cat lapping up cream without wondering where it came from or what it replaces. That comfort is dangerous, because it treats the symptom rather than the cause. It works like a painkiller: effective at masking the pain, but completely indifferent to the underlying disease that caused it.

Instead of asking why a test is unstable, these tools optimise for making it pass. They retry steps, guess which element you probably meant, or subtly adjust timing based on patterns from earlier runs. The test succeeds, the pipeline stays green, but nothing about the system itself improves. The instability is still there; it is simply hidden from view.

This is especially risky because AI systems are, by nature, probabilistic. They infer and adapt rather than guarantee outcomes, which is often a strength, but not in test automation. Tests should be deterministic. A test should pass or fail for a clear, explainable reason. Once AI influences execution, two identical runs can behave differently. Failures become harder to reproduce, and debugging turns into guesswork about how a model made a decision rather than an investigation of the system itself.
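
A deliberately crude model of the difference (plain Python; the 70% “heal” rate is invented, and the randomness stands in for probabilistic inference you can neither seed nor inspect, not for any real tool):

```python
import random

# Toy system under test: the page now only has a renamed button.
PAGE = {"buttons": ["submit-v2"]}

# Deterministic check: same input, same verdict, every run.
def strict_click(target):
    return target in PAGE["buttons"]

# "Adaptive" check: on a miss it sometimes guesses its way to a pass.
def adaptive_click(target, rng):
    if target in PAGE["buttons"]:
        return True
    return rng.random() < 0.7   # invented heal rate, modelling inference

strict_runs = [strict_click("submit") for _ in range(5)]
# Five identical, reproducible verdicts: [False, False, False, False, False]

rng = random.Random()           # unseeded, like a model you don't control
adaptive_runs = {adaptive_click("submit", rng) for _ in range(100)}
# Across repeated runs the identical test produces both verdicts
```

The strict version fails the same way every time, which is what makes it debuggable. The adaptive version turns a single stable failure into a distribution of outcomes.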

“Self-healing” UI tests are the clearest example of this trade-off. A selector changes and the AI finds something similar, so the test passes. That sounds helpful until you realise you no longer know what the test actually interacted with or whether it still represents the intended user behaviour. The UI may have changed in a meaningful way. A user flow may be broken. A failing test would have forced a conversation. A healed test avoids all this.
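
A toy version of that failure mode (plain Python, using `difflib` string similarity as a stand-in for whatever matching a real tool performs; the element ids are invented):

```python
from difflib import get_close_matches

# Elements after a redesign: the old "delete-draft" button is gone,
# and a destructive sibling happens to look most similar to the stale id.
elements = ["delete-account", "save-draft", "share-draft"]

def heal_selector(stale_id, candidates):
    """Toy self-healing: pick the closest-looking element id, if any."""
    matches = get_close_matches(stale_id, candidates, n=1, cutoff=0.4)
    return matches[0] if matches else None

healed = heal_selector("delete-draft", elements)
# healed == "delete-account": the test now "passes" by interacting with
# the wrong, destructive element instead of failing loudly.
```

Even in this tiny example, “most similar” and “semantically correct” diverge, and the green pipeline hides exactly the change a human needed to see.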

Flakiness forces us to ask uncomfortable but important questions. Why is this process asynchronous? Why don’t we know when it is actually done? Why is state leaking between components? Why is the UI unstable in the first place? With AI smoothing over the rough edges, those questions never surface. The warnings are muted before anyone has to listen.

Over time, this leads to architectural stagnation. Instability becomes normal and deterministic design slowly slips down the priority list. Teams learn to rely on tooling instead of understanding, and the feedback loop between tests and system design weakens until it barely exists.

Ironically, AI changes the conversation in exactly the wrong way. Instead of asking why a test is flaky, teams start asking why the AI failed to fix it. The first question leads to better systems, clearer contracts, and more intentional design. The second leads to configuration tweaks, model tuning, and an even more fragile foundation.

Flaky tests are annoying, but they are honest. They expose race conditions, unclear contracts, missing signals, and fragile assumptions that would otherwise go unnoticed. Using AI to silence them does not make your system healthier. It only makes the alarm quieter.

Flaky tests were never the real problem, and AI that hides them is not the solution.
