Why Testing AI Models in Isolation Misses the Real Security Risk

To keep systems secure, we need to look at how the whole AI application behaves when attackers try to break it.

There is a persistent narrative within AI security circles that continues to drift further from operational reality: it’s the belief that evaluating Large Language Models (LLMs) in isolation is adequate to ensure continuous security and safety.

In practice, this belief manifests as model benchmarking being the primary method of AI security testing. Teams run LLMs through curated prompt sets and draw conclusions based on performance indicators.

A number of “LLM security indices” are currently circulating, which compare popular models such as Claude, LLaMA, and GPT against safety criteria (responses related to toxicity, weaponisation, unethical behaviour, etc.). There are two issues with this approach:

1 - There are various instances whereby a user is running a benchmark in which a foundation model has already been trained or evaluated upon (it is helpful to leverage a provider’s benchmarking dataset on their own model released?)

2- How contextually relevant are these benchmarks during testing AI applications which exhibit high diversity in terms of use cases, threats a team desires to prioritise, and interacting system components beyond the model itself?

If we were to take a step back from AI, and interchange this word with any other system component (let’s take a ‘database’ as an example), would running a benchmark of different queries against a database in isolation provide a team sufficient confidence that it was secure? What about vulnerabilities originating from poor access controls, lack of sufficient input/output filtering, manipulating interactions to connected system components?

How do I ensure that this benchmark contains queries that are actually relevant to the database’s intended use case, as well as data contained within it? Most security professionals are likely to say that any form of effective security testing requires understanding the intended use case and imposed threat models, for both the component in question alongside interacting components.

There are of course nuances to the above scenario (in which I have discussed in a bit more detail in other venues). However it doesn’t detract from the idea that just solely focusing on benchmarking AI models to perform AI security testing has limited utility beyond understanding what makes it tick, and doesn’t capture the complexities that comes with addressing meaningful vulnerabilities.

The system is the risk surface

In our experience, the vast majority of AI-related vulnerabilities do not originate within the model alone, but emerge through its operational use and interaction with other system components.

The vulnerabilities that bad actors are eager to exploit are increasingly present within the larger system beyond the model. These actors are focused on how an AI is integrated, how it exchanges data, and how it is engaged by users under unpredictable conditions.

To give you an example, recently our internal red team tested an AI-powered assistant that leveraged an LLM at a large SaaS provider. On the surface, the AI assistant appeared to operate as intended–it followed instructions, avoided taboo topics, and didn't readily leak sensitive data.

Yet, through context-specific probing, we noticed that the assistant retrieved company specific data and returned data in a structure indicative of some sort of database. We were able to confirm this phenomena by hiding specific SQL instructions hidden within legitimate user requests (prompts injection), which leaked company and system environmental data that shouldn’t be accessible to the user.

The key takeaways from this engagement are:

1 - The vulnerability wasn’t in the AI model, but in what it was connected to—due to weak access controls and missing input filters

2 - Exploiting it required testing through the AI assistant’s API and interpreting its outputs to craft follow-up attacks

3 - A standard model benchmark would have missed it entirely, as the model’s ability to interpret SQL is only risky in the presence of an accessible backend database.

Although the means of how we crafted these attacks were novel, the issue was a classic security oversight. It only became visible through contextual testing, and would have gone completely undetected by a model-only assessment.

AI models are not isolated digital artefacts. Every aspect of the larger architecture into which they are embedded—APIs, plugins, orchestration layers, decision engines—must be evaluated. Teams will only comprehend the full extent and severity of vulnerabilities when they see the whole picture.

Familiar security terrain, new tactics

None of the above should be unfamiliar to cybersecurity practitioners. Static Code Analysis (SCA) tools were widely used during the early eras of web application security, and just as often failed to identify vulnerabilities that emerged through live interaction. Only after Dynamic Application Security Testing (DAST) became mainstream did the industry achieve meaningful insights into runtime threats.

AI security finds itself at a comparable juncture. Although prompt injection and jailbreak tests serve a valuable purpose, they are equivalent to the XSS or CSRF checks of early web security—which is to say relevant, but far from comprehensive.

Robust AI security demands offensive testing methodologies that replicate real operational conditions. This includes adversarial prompting, plugin abuse, data format manipulation, inference session hijacking, and context overflow. These combined tests assess how AI performs in dynamic environments, responding to live traffic and interacting with integrated systems and human users.

Much of the public discourse on AI risk remains safety-focused. The focus is on making sure models do not produce biased, toxic, or inappropriate content. Enterprise deployments, on the other hand, have their own set of cybersecurity needs, primarily confidentiality, integrity, and availability.

What security teams should do now

Practitioners should respond accordingly. Proper AI testing means testing the system, not just the model alone. Assess how prompts are processed, how context is stored, and how outputs interact with external systems. Pay attention to how the model is embedded, including plugin architectures, context windows, and third-party connectors. Trade static benchmarks unable to capture business and technical context for realistic threat scenarios based on the tactics of skilled adversaries.

Most importantly, turn this comprehensive testing strategy into a core component of your organisation’s security posture.

AI is critical infrastructure. We must stop treating it as a curiosity and hold it to the defined standards of cybersecurity. Model benchmarking can’t do this alone. Testing the model is never an endpoint in a threat landscape where adversaries seek to compromise the whole system. Only securing the system will protect it and, by extension, define success.

Written by

Peter Garraghan CEO, CTO & co-founder Mindgard

Dr. Peter Garraghan is CEO, CTO & co-founder at Mindgard, the leader in Artificial Intelligence Security Testing. Founded at Lancaster University and backed by cutting edge research, Mindgard enables organisations to secure their AI systems from new threats that traditional application security tools cannot address.

As a Professor of Computer Science at Lancaster University, Peter is an internationally recognized expert in AI security. He has devoted his career to developing advanced technologies to combat the growing threats facing AI. With over €11.6 million in research funding and more than 60 published scientific papers, his contributions span both scientific innovation and practical solutions.