Machine learning models are being deployed at scale, and without proper testing they are left exposed to vulnerabilities.
“Let’s just see what it does.” “Have a go and tell us what you think?” All things said by business leaders in the last few years when exploring AI technologies – from classical systems to generative models.
Machine Learning (ML) and Neural Networks (NN) have soared in popularity. Increased computing capacity and the ability to process vast amounts of data have delivered breakthroughs in image and speech recognition that occasionally surpass human abilities. Our innate curiosity was piqued by conversational AI and the natural language capabilities of models such as GPT-3.
The promise of revolutionising just about every industry has led to around 70 percent of organisations experimenting with AI and ML - but often without adequate safeguards. And as bespoke ML systems increasingly influence decisions in critical sectors like finance, healthcare, and transportation, the risks of this casual experimentation can’t be ignored.
Greater safeguards around the development of ML models, and better testing frameworks, are a must. Unlike traditional software, ML systems do not behave deterministically; they learn from data and operate probabilistically. This opens them up to a unique set of weaknesses and vulnerabilities, such as robustness issues, data poisoning that corrupts their training, and inherent biases that can lead to unfair outcomes.
Why current testing models are insufficient
Applying conventional test methods to ML systems falls short because those methods are built on assumptions – deterministic behaviour, repeatability, and transparency – that don’t hold for dynamic, data-driven systems like ML models.
Instead of following a predefined set of instructions and giving a predictable output, ML systems respond based on learned patterns, making it difficult to predict how they will respond and to define what a correct ‘output’ even is. This means the traditional pass/fail testing model used in software simply doesn’t work.
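One practical alternative is to test aggregate behaviour against an agreed threshold rather than asserting exact outputs. The snippet below is a minimal sketch of such an acceptance test using scikit-learn; the dataset and the 0.9 accuracy threshold are illustrative assumptions, not prescribed values.

```python
# A minimal sketch of threshold-based acceptance testing for an ML model.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# There is no single "correct" output to assert against; instead we test
# aggregate behaviour on a held-out set against an agreed threshold
# (0.9 here is an illustrative assumption).
accuracy = model.score(X_test, y_test)
assert accuracy >= 0.9, f"accuracy {accuracy:.3f} below acceptance threshold"
```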
ML systems also lack transparency. They often act like black boxes - not by design, but because their internal workings are incredibly complex, with numerous layers that make tracing the decision-making process demanding. In the absence of explainability, it becomes almost impossible to know why a system made a certain decision, whether that decision can be trusted, or whether it will behave the same way in future.
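Explainability tooling can at least hint at what a model relies on. As a minimal sketch, permutation feature importance from scikit-learn measures how much the score drops when each input feature is shuffled; the dataset and model here are illustrative assumptions.

```python
# A minimal sketch of one common explainability check:
# permutation feature importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffling one feature at a time and measuring the score drop hints at
# which inputs the "black box" actually relies on.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: importance {result.importances_mean[i]:.4f}")
```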
In addition, ML models can degrade over time as the data they are analysing changes. Even well-performing models can become less accurate when live data drifts away from the distribution they were trained on. Traditional testing doesn’t account for this degradation, and, perhaps even worse, doesn’t consider that the data may have been tampered with.
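Such drift can be monitored statistically in production. The sketch below compares a training-time feature distribution with the one seen in production using a two-sample Kolmogorov-Smirnov test from SciPy; the synthetic data and the 0.05 significance threshold are illustrative assumptions.

```python
# A minimal sketch of data-drift detection with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)    # data the model learned from
production_feature = rng.normal(loc=0.4, scale=1.0, size=5000)  # data it sees today

result = ks_2samp(training_feature, production_feature)
if result.pvalue < 0.05:  # illustrative significance threshold
    print(f"drift detected (KS statistic = {result.statistic:.3f}); consider retraining")
```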
How ML systems get corrupted
Like all software systems, ML models are vulnerable to attack. Adversaries can inject bad examples into training datasets to steer models towards bad behaviour, much as bad actors inject malicious code to exploit software vulnerabilities. But instead of the malware executing directly, it’s more like sharing fake news with someone to change their perception of the world. This can cause a model to make discriminatory or unethical decisions (a minimal poisoning sketch follows the list), such as:
- Poisoning spam filters with legitimate-looking spam emails.
- Injecting racial or gender bias into hiring algorithms.
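To make the mechanism concrete, here is a minimal sketch of label-flipping poisoning on a synthetic scikit-learn dataset; the dataset, model, and 20 percent poison rate are illustrative assumptions.

```python
# A minimal sketch of label-flipping data poisoning.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Attacker flips the labels of 20% of the training examples (illustrative rate).
rng = np.random.default_rng(0)
poisoned = y_train.copy()
idx = rng.choice(len(poisoned), size=int(0.2 * len(poisoned)), replace=False)
poisoned[idx] = 1 - poisoned[idx]

clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
poisoned_acc = LogisticRegression(max_iter=1000).fit(X_train, poisoned).score(X_test, y_test)
print(f"clean: {clean_acc:.3f}, poisoned: {poisoned_acc:.3f}")  # accuracy typically drops
```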
Even after deployment, attackers can craft inputs such as images and audio that appear normal to humans but fool the model into misclassifying them. A worst-case scenario would be tricking a self-driving car into misreading a stop sign as a speed limit sign. Attackers have also been known to embed hidden behaviours during training that are only triggered under specific conditions. These backdoor injections act like a ‘sleeper cell’ inside the model: it behaves normally until it encounters a specific trigger input.
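One widely studied way to craft such evasion inputs is the Fast Gradient Sign Method (FGSM), which nudges an input in the direction that most increases the model’s loss. Below is a minimal PyTorch sketch; the toy model, random ‘image’, and epsilon value are illustrative assumptions.

```python
# A minimal sketch of an evasion attack (Fast Gradient Sign Method).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier
model.eval()

def fgsm(x, y, epsilon=0.1):
    """Perturb x by epsilon in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

x = torch.rand(1, 1, 28, 28)  # a stand-in "image"
y = torch.tensor([3])         # its true label
x_adv = fgsm(x, y)
# The two predictions often differ, even though x and x_adv look alike.
print(model(x).argmax().item(), model(x_adv).argmax().item())
```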
Without a testing approach that can identify when a system has been tampered with, organisations face serious consequences – especially those in finance, healthcare, and automotive, which have become increasingly reliant on the technology.
Adopting new testing systems
New testing standards need to be adopted to keep ML systems operating at the highest quality and security. In collaboration with ETSI, we have identified several methods, documented in the Technical Report ETSI TR 103 910, that differ from traditional testing and are applicable to ML systems. These include:
- Data Integrity Testing: Validating the quality, diversity, and security of training datasets to prevent poisoning and ensure representative learning.
- Adversarial Attack Simulation: Deliberately exposing models to manipulated inputs to assess resilience against adversarial threats.
- Bias and Fairness Audits: Testing for demographic, social, and statistical biases in training and validation data, and ensuring equitable model performance across groups (a minimal check is sketched after this list).
- Explainability Checks: Integrating tools and processes that help clarify how a model arrives at its decisions.
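As an example of the bias and fairness audits above, the following sketch compares selection rates across a protected group – a simple demographic parity check. All data is synthetic, and the 0.1 disparity threshold is an illustrative policy assumption.

```python
# A minimal sketch of a bias/fairness audit on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)  # 0/1 encodes a protected attribute
# Stand-in model outputs (hire/no-hire), deliberately skewed by group.
predictions = rng.binomial(1, np.where(group == 1, 0.55, 0.40))

rate_0 = predictions[group == 0].mean()
rate_1 = predictions[group == 1].mean()
print(f"selection rates: group 0 = {rate_0:.2f}, group 1 = {rate_1:.2f}")

# Demographic parity difference; a 0.1 threshold is a policy choice, not a law of nature.
if abs(rate_0 - rate_1) > 0.1:
    print("warning: selection rates differ substantially between groups")
```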
While ML systems are new and exciting, the consequences of neglecting proper testing can be severe – from biased hiring algorithms to dangerous errors in autonomous vehicles.
We cannot expect these systems to maintain their quality under legacy testing methods that are no longer fit for purpose. By promoting a standardised, more agile, context-aware strategy, we can ensure that ML systems remain secure and suitable for the use cases they are designed and deployed for – today and in years to come.
Written by
Jurgen Grossman
Head of Critical Systems Engineering at Fraunhofer FOKUS