BIP Messenger

collapse
Home / Daily News Analysis / Cisco research finds standard AI safety benchmarks miss the real threat

Cisco research finds standard AI safety benchmarks miss the real threat

May 28, 2026  Twila Rosenbaum  8 views
Cisco research finds standard AI safety benchmarks miss the real threat

Enterprises deploying closed AI models have traditionally relied on published safety benchmarks to assess risk before procurement and deployment. However, new research from Cisco’s AI Threat Intelligence and Security Research team reveals that these standard benchmarks may systematically understate the true threat posed by adversarial attacks.

Standard safety tests submit a single adversarial prompt and record the model’s response. Multi-turn attacks work differently. An attacker maintains a conversation across multiple exchanges, iterating and adapting based on each response until the model yields. This iterative approach exposes vulnerabilities that single-turn evaluations cannot catch, leading to dramatically different risk profiles for leading frontier models.

Research methodology and key findings

The Cisco team conducted a comprehensive evaluation of 15 closed/proprietary frontier models from major AI providers including OpenAI, Anthropic, Google, Amazon, and xAI. They ran 30,090 single-turn prompts and 6,986 multi-turn attacks across these models. The results showed that the two evaluation regimes produce different model rankings, different failure maps, and different risk profiles. Every model tested failed a non-trivial share of multi-turn attacks.

Key findings include: multi-turn attack success rates ranged from 7.89% to 88.30% across all 15 models, compared to a single-turn range of 2.19% to 64.91%. Eight of the 15 models showed an absolute gap greater than 15 percentage points between the two regimes. Even Anthropic’s Claude family, which posted the lowest single-turn attack success rate at 2.19% to 3.64%, still reached 11.16% to 16.20% under iterative attack. Single-turn failures concentrated in three areas: Imposter AI at 37.50% weighted attack success rate, Soft Paraphrase at 29.21%, and System Prompts at 27.69%.

How multi-turn attacks work

In a multi-turn attack, the adversary does not present the harmful request upfront. Intent builds gradually across exchanges, with each prompt appearing benign in isolation while steering toward a harmful outcome. The model processes each turn without recognizing the pattern forming across the conversation. This technique exploits the probabilistic nature of generative AI models, which are trained to predict the next likeliest token without maintaining a global awareness of intent across long conversations.

The research tested five attack strategy families. Crescendo escalation involves incrementally escalating the request, each prompt appearing harmless until the full picture emerges. Refusal reframe occurs when the attacker reframes their identity or purpose after the model declines, pushing past the refusal. Role-play and persona adoption shifts the conversational framing so the model perceives a different obligation to comply; this was the highest-weighted strategy family at 29.89% weighted attack success rate. Contextual ambiguity and misdirection uses vague or misleading framing to obscure the true nature of the request. Information decomposition and reassembly breaks a harmful request into component parts distributed across multiple turns, each appearing innocuous in isolation, and the model responds to each piece without recognizing the assembled outcome.

Structural vulnerability in current AI systems

The root cause of multi-turn vulnerability is structural. Amy Chang, head of AI threat and security research at Cisco, notes that the vulnerability is a fundamental characteristic of how generative AI models work. They are probabilistic systems trained to predict the next likeliest token, and that mechanism produces unintended outputs that pre-deployment testing cannot fully eliminate. For closed models, where training data is not publicly disclosed, the problem is compounded because defenders cannot fully audit what the model has learned.

The pattern is not limited to closed models. Cisco’s earlier evaluation of eight open-weight LLMs, published in November 2025, found multi-turn attack success rates running two to ten times higher than single-turn baselines. The report concludes that multi-turn vulnerability is a structural property of the current AI frontier regardless of whether model weights are public or proprietary, and regardless of whether a lab publicly emphasizes safety or capability.

The exposure grows significantly larger when those same models power agentic workflows. Agents have broader access and greater ability to conduct actions on behalf of humans, making multi-turn attacks potentially more damaging in enterprise environments where AI agents handle sensitive tasks like data retrieval, code execution, or system administration.

Network layer as a defense point

For network security professionals, the instinct is to apply a familiar paradigm: proxy LLM traffic at the network layer, inspect inputs and outputs, and enforce policy the same way a web application firewall (WAF) or intrusion prevention system (IPS) handles web traffic. Chang confirms that instinct is valid in part, but LLM security introduces a dimension that signature-based controls cannot address—intent. A WAF operates on known patterns, payload signatures, protocol violations, and known attack strings. Natural language does not reduce to those primitives. An agent responding to an instruction to delete a home directory cannot determine from the request alone whether the person asking is authorized or is attempting to manipulate the agent into a destructive action.

Network-layer inspection remains a valid baseline for deployments that generate network traffic. Chang recommends that as traffic passes through the network layer, both inputs and outputs should undergo some sort of guardrail or sanitation check to ensure prompts are safe. However, she emphasizes that traditional network security approaches fall short on the intent component, which requires additional layers of defense focused on contextual analysis and behavioral monitoring.

Evaluation practices for enterprise teams

For security teams evaluating AI models for enterprise deployment, Cisco’s research offers three actionable recommendations. First, use the Cisco LLM Security Leaderboard, which publishes adversarial evaluation signals against leading models on a rolling basis, giving security teams a more current picture than static model cards or published benchmarks. Second, do not take vendor safety claims at face value. Published single-turn benchmarks can misrank models by a wide margin, and multi-turn exposure is invisible to any single-turn evaluation. Procurement decisions made on that basis carry unquantified risk. Third, layer additional defenses on top of the model. No base model in the cohort is safe under iterative attack. Runtime guardrails, application-layer controls, and pre-deployment testing are necessary regardless of which model an organization selects.

Chang states that out of the box, without any additional protections, these models—whether closed or open—are insufficient on their own to be used in ways that have potential ramifications. The research underscores that AI safety is not a solved problem and that enterprises must adopt a defense-in-depth approach that combines rigorous evaluation, continuous monitoring, and adaptive protection mechanisms to mitigate the risks posed by multi-turn adversarial attacks.


Source: Network World News


Share:

Your experience on this site will be improved by allowing cookies Cookie Policy