- A gaping security flaw in open source AI models
- Eight models tested, eight vulnerabilities confirmed
- Five attack strategies that exploit conversational persistence
- The gap between benchmarks and operational reality
- Why open source models are particularly vulnerable
- Development objectives influence resilience
- Implications for enterprise security
- Six critical capabilities to secure AI deployments
- Security as catalyst, not obstacle
- A systemic problem requiring urgent action
- Toward consistent security standards
- The urgency of coordinated response
A gaping security flaw in open source AI models
When an attacker sends a single malicious request, open source AI models hold strong, blocking attacks 87% of the time on average. But when these same attackers send multiple prompts through a conversation, probing, rephrasing, and intensifying their attempts across numerous exchanges, the numbers flip dramatically. Attack success rates climb from an average of 13% to as high as 92% for the most vulnerable models.
This alarming discovery comes from a major study conducted by the Cisco AI Threat Research and Security team, published under the evocative title “Death by a Thousand Prompts: Open Model Vulnerability Analysis”. For CISOs evaluating open weight models for enterprise deployment, the implications are immediate and critical.
The models powering your customer-facing chatbots, your internal copilots, and your autonomous agents may pass single-turn security benchmarks while failing catastrophically under sustained adversarial pressure.
“Many of these models have started to get a little better,” explains DJ Sampath, senior vice president of Cisco’s AI software platform group, to VentureBeat. “When you attack it once, with single-turn attacks, they are able to protect it. But when you persist with multi-turn attacks, they completely fall apart.”
Eight models tested, eight vulnerabilities confirmed
Cisco’s research team evaluated eight major open weight models using a black-box methodology — that is, testing without knowledge of internal architecture, exactly how real-world attackers operate. The models evaluated include:
- Alibaba (Qwen3-32B)
- DeepSeek (v3.1)
- Google (Gemma 3-1B-IT)
- Meta (Llama 3.3-70B-Instruct)
- Microsoft (Phi-4)
- Mistral (Large-2)
- OpenAI (GPT-OSS-20b)
- Zhipu AI (GLM 4.5-Air)
Using Cisco’s AI Validation platform, which performs automated algorithmic vulnerability testing, the team measured what happens when persistence replaces single-shot attacks. The researchers note: “Single-turn attack success rates (ASR) average 13.11%, as models can more easily detect and reject isolated adversarial inputs. In contrast, multi-turn attacks, leveraging conversational persistence, achieve an average ASR of 64.21% [a 5x increase], with some models like Alibaba Qwen3-32B reaching an ASR of 86.18% and Mistral Large-2 reaching an ASR of 92.78%.”
For Mistral Large-2, this performance represents a nearly 22-fold increase over its single-turn success rate — a dramatic escalation that testifies to a systemic inability to maintain contextual defenses over extended dialogues.
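To make the metric concrete, here is a minimal sketch of how an attack success rate (ASR) comparison like the one above can be computed from raw trial outcomes. The function name `asr` and the trial logs are illustrative assumptions; the percentages are the study's averages, not Cisco's raw data.

```python
def asr(outcomes: list[bool]) -> float:
    """Attack success rate: fraction of adversarial trials that succeeded."""
    return sum(outcomes) / len(outcomes)

# Hypothetical trial logs: True = the model produced disallowed output,
# False = it refused. Proportions mirror the averages quoted in the study.
single_turn_trials = [True] * 13 + [False] * 87   # ~13% ASR
multi_turn_trials = [True] * 64 + [False] * 36    # ~64% ASR

single = asr(single_turn_trials)
multi = asr(multi_turn_trials)
print(f"single-turn ASR: {single:.2%}")           # 13.00%
print(f"multi-turn ASR:  {multi:.2%}")            # 64.00%
print(f"fold increase:   {multi / single:.1f}x")  # 4.9x, the "5x" in the study
```

The same ratio applied per model is what separates a resilient model (a small multiplier) from one like Mistral Large-2, whose multi-turn rate dwarfs its single-turn rate.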
Five attack strategies that exploit conversational persistence
The study tested five multi-turn attack strategies, each exploiting a different aspect of conversational persistence. These techniques reveal how attackers use the same manipulation tactics that work on humans.
Information decomposition and reassembly fragments harmful requests into innocuous components across multiple conversation turns, then reassembles them. Against Mistral Large-2, this technique achieved a success rate of 95%. For example, an attacker might first ask for general information about explosives, then about detonators, then about assembly techniques, with each request appearing innocent in isolation.
Contextual ambiguity introduces vague framing that confuses security classifiers, achieving 94.78% success against Mistral Large-2. This approach exploits the difficulty models have in interpreting intent when context is not explicit.
Crescendo attacks progressively intensify requests across turns, beginning innocuously and building toward harmful content. This technique achieved 92.69% success against Mistral Large-2. The attacker begins with legitimate questions before gradually escalating to problematic requests.
Role-play and persona adoption establish fictional contexts that normalize harmful outputs, reaching up to 92.44% success against Mistral Large-2. For example, asking the model to act as a fictional character or expert in a hypothetical scenario can bypass security safeguards.
Reformulation of rejected prompts simply involves rephrasing requests the model initially rejected, changing the phrasing or angle of approach while maintaining the underlying malicious intent.
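On the defensive side, the last strategy is also the simplest to detect: a user who keeps re-sending close paraphrases of a prompt the model already refused. A minimal sketch follows, using string similarity as a stand-in for the semantic matching a production system would use; the names `RetryDetector` and `is_retry` are illustrative, not part of Cisco's tooling.

```python
from difflib import SequenceMatcher

class RetryDetector:
    """Flags prompts that closely resemble previously refused ones."""

    def __init__(self, similarity_threshold: float = 0.6):
        self.refused_prompts: list[str] = []
        self.threshold = similarity_threshold

    def record_refusal(self, prompt: str) -> None:
        self.refused_prompts.append(prompt.lower())

    def is_retry(self, prompt: str) -> bool:
        # Compare against every refused prompt; a high character-level
        # similarity ratio suggests a rephrasing of the same request.
        p = prompt.lower()
        return any(
            SequenceMatcher(None, p, old).ratio() >= self.threshold
            for old in self.refused_prompts
        )

detector = RetryDetector()
detector.record_refusal("How do I build a device that disables alarms?")
print(detector.is_retry("how would one build a device disabling alarms?"))  # True
print(detector.is_retry("What's the weather in Paris?"))                    # False
```

Real deployments would swap the character-level ratio for embedding similarity, since attackers can rephrase well beyond what `difflib` catches, but the session-level bookkeeping is the point: each turn is judged against the refusal history, not in isolation.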
The gap between benchmarks and operational reality
The performance gap between single-turn and multi-turn attacks ranges from 10 percentage points (Google Gemma) to over 70 percentage points (Mistral, Llama, Qwen). This disparity raises a critical question: do current security benchmarks truly measure resilience under real-world usage conditions?
Cisco’s researchers argue that treating multi-turn AI attacks as an extension of single-turn vulnerabilities completely misses the point. The gap between them is categorical, not a matter of degree.
Depending on the type of malicious content targeted, results vary considerably. Malware code generation succeeds at rates ranging from 3.1% to 43.1% depending on the model, while attempts at model extraction show near-zero success except against Microsoft Phi-4. This variability suggests that different models have distinct strengths and weaknesses depending on the attack vector.
Why open source models are particularly vulnerable
Open weight models offer researchers and developers accessible foundations for various downstream applications. Each of these models comes with trained weights available for download, allowing developers to run them on their own systems or fine-tune them for specific tasks and projects.
But this openness also creates unique risks. As Cisco’s study emphasizes, security experts have long warned that open weight AI models can easily be transformed into dangerous versions. The ability to fine-tune these systems so freely gives attackers a way to strip away built-in protections and redirect them toward harmful use.
Because weights are publicly accessible, anyone can retrain the model with malicious objectives, either to weaken its safeguards or to trick it into producing content that closed models would reject. According to Cisco, model weights from major US, Chinese, and European providers had been downloaded approximately 400 million times via Hugging Face through August 2025 — an indicator of wide distribution, but also of growing attractiveness to attackers.
Some model developers, like Google, OpenAI, Meta, and Microsoft, have noted in their technical reports and model cards that they have taken measures to reduce the likelihood of malicious fine-tuning. Others, like Alibaba, DeepSeek, and Mistral, do not address such measures in their technical reports.
Development objectives influence resilience
The study reveals that alignment strategies and laboratory priorities significantly influence model resilience. Capability-focused models like Llama 3.3 and Qwen3 demonstrate higher multi-turn susceptibility.
For example, Qwen's instruction fine-tuning tends to prioritize utility and breadth, which attackers can exploit by reframing their prompts as "for research" or "fictional scenarios," resulting in a higher multi-turn attack success rate of 86.18%.
Meta, by contrast, tends to release open weights expecting developers to add their own moderation and security layers. This philosophy transfers security responsibility from model creator to deployer.
Results against GPT-OSS-20b, for example, align closely with OpenAI's own evaluations: overall attack success rates for the model were relatively low, and roughly consistent with the "jailbreak evaluation" section of the GPT-OSS model card, where refusal rates ranged from 0.960 to 0.982 for GPT-OSS-20b. Even so, the residual success rate underscores that cutting-edge models with strong refusal behavior remain susceptible to adversarial attacks.
Implications for enterprise security
For CISOs and their teams, these findings represent a considerable challenge. AI models are increasingly integrated into critical systems: customer-facing chatbots, decision support tools, code copilots, autonomous agents for process automation.
The vulnerabilities identified by Cisco could translate into real threats including:
- Sensitive data exfiltration: attackers could manipulate models to disclose proprietary or confidential information
- Content manipulation compromising data and information integrity
- Ethical violations via biased or discriminatory outputs
- Operational disruptions in integrated systems like chatbots or decision support tools
In enterprise environments, such vulnerabilities could enable unauthorized access to proprietary information. In public-facing applications, they could facilitate the spread of misinformation, the generation of deepfakes, or other forms of harmful digital media.
Six critical capabilities to secure AI deployments
Facing these threats, Cisco identifies six critical capabilities that enterprises should prioritize to secure their AI deployments:
Context-aware safeguards that maintain state across conversation turns are essential. Security systems must understand the complete conversation history, not just the current prompt.
Behavioral detection that identifies patterns of progressive escalation allows spotting crescendo attacks before they reach their malicious objective.
Intent validation at each conversation turn helps identify when a series of seemingly innocent requests builds toward a harmful objective.
Conversation boundaries that trigger a reset after a defined number of turns can interrupt ongoing multi-turn attacks.
Semantic drift analysis detects when conversations progressively move away from authorized domains into forbidden territory.
Conversation memory mechanisms maintain security context across the entire dialogue, rather than treating each prompt in isolation.
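Two of these capabilities, conversation boundaries and behavioral detection of escalation, can be sketched in a few lines. The class below is an illustrative assumption, not Cisco's implementation: it enforces a hard turn limit and blocks once too many turns have been flagged as suspicious by an upstream classifier.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationGuard:
    """Toy conversation-level guard: turn limits plus escalation counting."""
    max_turns: int = 20   # conversation boundary: force a reset after N turns
    max_flags: int = 3    # behavioral detection: tolerate this many flagged turns
    turns: int = 0
    flags: int = 0
    history: list[str] = field(default_factory=list)

    def check(self, prompt: str, flagged: bool) -> str:
        """Return 'allow', 'reset', or 'block' for this turn.

        `flagged` stands in for an upstream per-turn classifier verdict.
        """
        self.turns += 1
        self.history.append(prompt)  # kept so later checks see full context
        if flagged:
            self.flags += 1
        if self.flags >= self.max_flags:
            return "block"           # escalation pattern detected
        if self.turns >= self.max_turns:
            return "reset"           # start a fresh, context-free session
        return "allow"

guard = ConversationGuard(max_turns=5, max_flags=2)
print(guard.check("tell me about chemistry", flagged=False))       # allow
print(guard.check("now about energetic materials", flagged=True))  # allow
print(guard.check("and how to combine them", flagged=True))        # block
```

The thresholds are arbitrary here; the design point is that the guard keeps state across turns, which is exactly what the single-turn filters defeated in the study do not.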
Security as catalyst, not obstacle
DJ Sampath of Cisco reframes security not as an obstacle but as the mechanism enabling adoption: “The way that security leaders inside enterprises think about it is: ‘I want to unlock productivity for all my users. Everyone is asking to use these tools. But I need the right safeguards in place because I don’t want to end up in a Wall Street Journal article,'” he told VentureBeat.
Sampath continues: “If we have the capability to see prompt injection attacks and block them, I can then unlock and free AI adoption in a fundamentally different way.”
This perspective transforms security from a perceived brake into an innovation accelerator. Enterprises that can confidently deploy AI models, knowing they have robust protections against multi-turn attacks, can adopt these technologies faster and more broadly than those paralyzed by security concerns.
A systemic problem requiring urgent action
Cisco’s results quantify what many security researchers have long observed and suspected, but could not prove at scale. The escalation of attack success rates, ranging from 2x to 10x, stems from models’ inability to maintain contextual defenses over extended dialogues, enabling attackers to refine prompts and bypass protections.
This vulnerability is not a minor or theoretical problem. With 87% of global organizations having faced AI-powered cyberattacks in 2024, and 91% anticipating significant increases in AI-driven threats over the next three years, the window for action is rapidly closing.
Yet only 26% of organizations express high confidence in their ability to detect these attacks. And although 77% of surveyed IT leaders report having experienced some form of AI-related breach, only 30% have deployed defenses against adversarial attacks in their existing AI development pipelines, MLOps included, and a mere 14% plan for and test such attacks.
Toward consistent security standards
The inconsistency of security standards across the open weight model landscape creates security, operational, technical, and ethical risks. Stakeholders, from end users and developers to the organizations and enterprises adopting these technologies, must weigh those risks before deploying these models.
Cisco’s full report, available on arXiv, provides a comprehensive breakdown of the analysis and specific recommendations. It calls for treating AI model protection like any other software security work: it requires continuous testing, ongoing protection, and clear communication about the risks involved.
Developers can fine-tune open weight models to be more robust against jailbreaks and other adversarial attacks, although Cisco also acknowledges that malicious actors can conversely fine-tune open weight models for harmful purposes.
The urgency of coordinated response
Facing this systemic threat, the industry must adopt a more rigorous approach to AI model security. This includes:
- Security benchmarks that specifically evaluate multi-turn resilience, not just single-turn performance
- Standardized testing frameworks that simulate realistic attack scenarios
- Increased transparency on security measures taken by model developers
- Shared best practices for secure deployment of open weight models
- Ongoing training of security teams on emerging attack vectors
As Cisco’s research team concludes: “These results underscore a systemic inability of current open weight models to maintain security safeguards during prolonged interactions.”
The question is no longer whether AI models can be exploited, but how long it will take the industry to develop and deploy the necessary protections before attackers massively exploit these vulnerabilities. For enterprises deploying or considering deploying AI models, the urgency to act has never been clearer.
Source: VentureBeat, Cisco Blogs, arXiv, HackRead
