Foes and Frontiers: Flakiness, LLMs, and the Future of Mutation Analysis - Software Reliability Group

Mutation analysis asks a deceptively simple question: does your test suite notice when something goes wrong? But the answers we get are only as reliable as our tests — and only as meaningful as the level at which we choose to mutate. This talk confronts two territories: the foes that quietly undermine mutation analysis from within, and the frontiers that may fundamentally expand what it can do.

It begins with a foe hiding in plain sight: test flakiness. While flaky tests are well known as a nuisance in software engineering, our work reveals a subtler and more troubling phenomenon: mutations themselves can be the source of flakiness. When a mutant causes a test to pass and fail non-deterministically, the core mechanism of mutation analysis, which relies on each mutant being unambiguously killed or not, is undermined. I will present findings on the prevalence and consequences of flakiness induced by mutations (FLIMsiness).
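
As a rough illustration of how a mutation can introduce flakiness, the sketch below is a toy example and not the tooling or subjects used in the work itself. It assumes a hypothetical data source whose ordering varies between calls (simulated here by a deterministic rotation so the example is reproducible), an original function that hides that variation by sorting, and a mutant that deletes the sort. All names are invented for illustration.

```python
import itertools


def make_unordered_source(entries):
    """Simulate a source whose ordering changes between calls.

    A real source might be a directory listing or a hash-based collection;
    here the order simply rotates so the example behaves the same every run.
    """
    counter = itertools.count()

    def listing():
        k = next(counter) % len(entries)
        return entries[k:] + entries[:k]

    return listing


def first_entry(listing):
    # Original code: sorting makes the result independent of source order.
    return sorted(listing())[0]


def first_entry_no_sort(listing):
    # Hypothetical mutant: the sorted() call has been removed.
    return listing()[0]


def first_entry_wrong_index(listing):
    # Hypothetical mutant: index 0 replaced by -1.
    return sorted(listing())[-1]


def classify(function_under_test, reruns=6):
    """Rerun one test several times and classify the outcome."""
    listing = make_unordered_source(["a", "b", "c"])
    outcomes = {function_under_test(listing) == "a" for _ in range(reruns)}
    if outcomes == {True}:
        return "survived"  # the test never fails
    if outcomes == {False}:
        return "killed"    # the test always fails
    return "flaky"         # the test both passes and fails


if __name__ == "__main__":
    for candidate in (first_entry, first_entry_wrong_index, first_entry_no_sort):
        print(candidate.__name__, classify(candidate))
```

The test is stable against the original code, yet the sort-removing mutant makes it pass on some runs and fail on others. A single execution would record that mutant as either killed or survived depending on chance, which is why the mutation score itself becomes unreliable.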

The talk then turns to frontiers, where large language models are opening new possibilities for how we think about mutation. Traditional mutation operators work at the syntactic level of code. The talk presents an alternative approach: extracting natural language descriptions of code behaviour and mutating at that semantic level, before translating the mutated description back into code. This higher-level mutation strategy has the potential to produce more meaningful, intent-targeting mutants designed to better reflect faults related to what software is supposed to do, not just how it’s supposed to do it.
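
The describe, mutate, translate-back pipeline outlined above can be sketched as a composition of three steps. This is only a structural sketch under stated assumptions: the three callables stand in for LLM calls, and the toy versions below are hard-coded lookups invented for illustration, not the prompts, models, or implementation presented in the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SemanticMutant:
    original_description: str
    mutated_description: str
    mutated_code: str


def semantic_mutate(
    code: str,
    describe: Callable[[str], str],
    mutate_description: Callable[[str], str],
    synthesize: Callable[[str], str],
) -> SemanticMutant:
    """Mutate code via a natural language description of its behaviour."""
    description = describe(code)               # code -> description
    mutated = mutate_description(description)  # mutate the stated intent
    return SemanticMutant(description, mutated, synthesize(mutated))


ORIGINAL_CODE = "def is_valid_password(p):\n    return len(p) >= 8\n"

# Toy stand-ins for LLM calls (hypothetical, hard-coded for this example).
LENGTH_RULE = "Accept a password only if it has at least 8 characters."
DIGIT_RULE = "Accept a password only if it contains at least one digit."


def toy_describe(code: str) -> str:
    return LENGTH_RULE


def toy_mutate_description(description: str) -> str:
    return DIGIT_RULE if description == LENGTH_RULE else description


def toy_synthesize(description: str) -> str:
    if description == DIGIT_RULE:
        return (
            "def is_valid_password(p):\n"
            "    return any(c.isdigit() for c in p)\n"
        )
    return ORIGINAL_CODE


if __name__ == "__main__":
    mutant = semantic_mutate(
        ORIGINAL_CODE, toy_describe, toy_mutate_description, toy_synthesize
    )
    print(mutant.mutated_description)
    print(mutant.mutated_code)
```

In this toy case the mutant changes which rule the function enforces, a change that no single token-level operator applied to `len(p) >= 8` would produce. A test suite that only probes the length boundary would say little about whether it would catch such an intent-level fault.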

The talk closes by stepping back to sketch some of the open challenges and opportunities that lie ahead for the field, including harnessing AI not just as a tool for generating mutants but as a partner in understanding what those mutants should mean.

Phil McMinn is a Professor of Software Engineering who specialises in software testing. He primarily works on automated techniques that help software engineers develop test suites that are effective at finding bugs and efficient to maintain. While he is well known in the software testing field for his work in search-based automatic test data generation, his research has tackled a variety of problems including test flakiness, test oracles, and ensuring test quality through mutation analysis.