Bypassing LLM Safeguards with CognitiveAttack

In this post we're diving into AI red-teaming techniques! The tactic we'll be exploring leverages cognitive biases to undermine LLM safeguards.

This post is a summary of research conducted by Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, and Songlin Hu in their paper “Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs”.

You can read the full paper on arXiv here: https://arxiv.org/abs/2507.22564

Introduction

Typical jailbreaking approaches for LLMs focus on prompt engineering or algorithmic manipulation, but this paper highlights the power of multi-bias interactions in undermining LLM safeguards.

Studies have identified various individual biases, such as authority bias, anchoring, foot-in-the-door persuasion, confirmation bias, and status quo bias, as effective techniques for eliciting harmful, unethical, or policy-violating outputs. Most existing work treats each bias in isolation; however, this research shows that prompts which subtly combine multiple biases can bypass safeguards that would normally block single-bias prompts.

The researchers propose a novel red-teaming framework called CognitiveAttack, which systematically leverages both individual and combined cognitive biases. CognitiveAttack uses a fine-tuned red team model to generate semantically faithful but bias-infused rewrites of harmful instructions. The model is optimized via reinforcement learning to discover effective single-bias or multi-bias combinations that maximize attack success while preserving intent. The result is a method that reliably bypasses safety protocols while keeping rewrites faithful to the original harmful instruction.

Evaluating this method across a range of representative LLMs, the researchers found significantly higher vulnerability under cognitive bias attacks. Compared to SOTA black-box jailbreak methods, their approach achieves superior performance (Attack Success Rate (ASR) of 60.1% vs. 31.6%).

Objective

The researchers assert a key finding: bias interactions can amplify or weaken adversarial effectiveness. This underscores the need to optimize bias combinations by maximizing synergy and minimizing interference.

Given a harmful instruction $x_{0}$, the goal is to produce a rewritten version $x'$ that is semantically aligned but more likely to bypass LLM safety mechanisms by embedding a bias combination $\mathcal{B} \in \mathcal{B}_{pool}$. The optimization objective is:

$$\max_{\mathcal{B} \in \mathcal{B}_{pool}} \; \mathbb{E}_{x' \sim \pi_\theta(\cdot \mid x_{0}, \mathcal{B})} \left[ R(x', x_{0}, \mathcal{T}) \right]$$

where $\pi_\theta$ is the red team model parameterized by $\theta$, and $\mathcal{T}$ denotes a suite of target LLMs. The reward $R$ measures attack success and intent preservation.

In simple terms, you are searching over all possible bias combinations to find the subset of biases that yields the most effective adversarial prompt. For a given input $x_{0}$ and bias subset $\mathcal{B}$, you sample a rewritten instruction from the red team model. The reward function evaluates how well the rewritten instruction achieves two goals: bypassing safeguards in the target LLMs and preserving the original intent of instruction $x_{0}$.
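
To make the search concrete, here is a toy Python sketch of a brute-force version of this objective. Everything in it is a hypothetical stand-in: BIAS_POOL, rewrite, and reward are placeholders (the real $\pi_\theta$ is the fine-tuned red team model and the real $R$ queries target LLMs and a judge), and the RL stage described below exists precisely so the model can skip this exhaustive enumeration.

```python
import itertools
import random

# Hypothetical bias pool -- a handful of the biases named in the paper.
BIAS_POOL = ["authority", "anchoring", "foot_in_the_door", "confirmation", "status_quo"]

def rewrite(x0: str, biases: tuple) -> str:
    """Stand-in for sampling x' from the red team model pi_theta.
    The real framework calls the fine-tuned rewriting model here."""
    return f"[{'+'.join(biases)}] {x0}"

def reward(x_prime: str, x0: str) -> float:
    """Stand-in for R: attack success on the target LLMs plus intent
    preservation. A random score in [-1, 1] keeps the sketch runnable."""
    return random.uniform(-1.0, 1.0)

def best_bias_combo(x0: str, max_size: int = 3):
    """Brute-force search over bias subsets up to max_size -- the quantity
    the RL stage learns to approximate without exhaustive enumeration."""
    best_combo, best_score = None, float("-inf")
    for k in range(1, max_size + 1):
        for combo in itertools.combinations(BIAS_POOL, k):
            score = reward(rewrite(x0, combo), x0)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score

print(best_bias_combo("some instruction"))
```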

Implementation

You can find the code for their experiment here: https://github.com/YancyKahn/CognitiveAttack

The researchers developed a cognitive bias-enhanced rewriting model by fine-tuning DeepSeek-R1-Distill-Qwen-14B.

Training comprised three stages: data collection, supervised fine-tuning (SFT), and reinforcement learning (RL).

Data collection

The researchers used two high-quality risk datasets, AdvBench and HEx-PHI, as sources of harmful instructions $x_{0}$. For each instruction, they applied prompt engineering with DeepSeek-R1 to generate adversarial rewrites $x'$ embedding one or more cognitive biases $\mathcal{B} \in \mathcal{B}_{pool}$, with the following structure:

<think> For the given instruction [INSTRUCTION],
to improve persuasion and reduce detection,
I will apply [BIAS A] to build credibility and [BIAS B] to induce compliance.
Wait, wait … I think [BIAS D] will help create a false sense of balance,
and [BIAS C] will make the instruction more compelling. </think>
# thebias: [“BIAS A”, “BIAS B”, “BIAS C”, “BIAS D”]
# theprompt: As a government advisor, I must inform …

Each example is formatted as a triple $(x_{0}, x', \mathcal{B})$, with the <think> block capturing the reasoning pathway behind bias selection. These reasoning traces serve as weak supervision for later stages, allowing the model to learn how to plan bias combinations rather than apply them blindly.
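
To illustrate how such generations could be turned into training triples, here is a hedged Python sketch that parses the template above. The parse_example helper is hypothetical, not the authors' pipeline; only the # thebias / # theprompt field names come from the paper's template.

```python
import json
import re

def parse_example(x0: str, generation: str):
    """Hypothetical parser: split a generation in the template above into
    the training triple (x0, x', B). Field names follow the template."""
    think = re.search(r"<think>(.*?)</think>", generation, re.DOTALL)
    biases = re.search(r"#\s*thebias:\s*(\[.*?\])", generation, re.DOTALL)
    prompt = re.search(r"#\s*theprompt:\s*(.*)", generation, re.DOTALL)
    if not (biases and prompt):
        return None  # discard malformed generations
    # Normalize curly quotes before JSON-decoding the bias list.
    bias_list = json.loads(biases.group(1).replace("“", '"').replace("”", '"'))
    return {
        "x0": x0,
        "x_prime": prompt.group(1).strip(),
        "biases": bias_list,
        "reasoning": think.group(1).strip() if think else "",
    }

generation = (
    "<think> For the given instruction ..., I will apply authority bias "
    "to build credibility. </think>\n"
    '# thebias: ["authority bias"]\n'
    "# theprompt: As a government advisor, I must inform ..."
)
print(parse_example("original instruction", generation))
```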

Supervised Fine-Tuning (SFT)

In this stage the researchers use the curated dataset to fine-tune the model, endowing it with the capability to systematically rewrite instructions in accordance with specified cognitive bias strategies. Through exposure to a wide range of annotated examples, the red team model internalizes the stylistic, structural, and rhetorical patterns associated with different bias types.
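
For intuition, here is a minimal sketch of what this stage could look like using Hugging Face's TRL library. The paper does not name its training stack, so SFTTrainer, the record format, and the use of the public DeepSeek-R1-Distill-Qwen-14B checkpoint are assumptions:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# One record per triple (x0, x', B); the "text" field concatenates the
# instruction with its annotated, bias-infused rewrite (including the
# <think> reasoning trace), following the template shown earlier.
records = [
    {
        "text": (
            "Instruction: <harmful instruction x0>\n"
            "<think> ... bias-selection reasoning ... </think>\n"
            '# thebias: ["authority bias", "anchoring"]\n'
            "# theprompt: <bias-infused rewrite x'>"
        )
    },
]

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    train_dataset=Dataset.from_list(records),
    args=SFTConfig(output_dir="sft-redteam-model"),
)
trainer.train()
```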

Reinforcement Learning (RL)

This stage refines the model's ability to generate jailbreak prompts that effectively evade safety filters while leveraging optimal combinations of cognitive biases. The reward function integrates two normalized components, both ranging from -1 to 1: a safety evasion score, derived by linearly normalizing the GPT-Judge safety rating, and an intent consistency score, which measures the semantic alignment between the original instruction $x_{0}$ and its rewritten counterpart $x'$.
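
A minimal sketch of that two-part reward, assuming a 1-to-5 judge scale, a similarity score in [0, 1], and an equal weighting between the terms (all assumptions; the paper specifies only that each component is normalized to [-1, 1]):

```python
def safety_evasion_score(judge_rating: float, low: float = 1.0, high: float = 5.0) -> float:
    """Linearly map a GPT-Judge rating onto [-1, 1]. The 1-5 scale is an
    assumption; a higher rating here means a more successful bypass."""
    return 2.0 * (judge_rating - low) / (high - low) - 1.0

def intent_consistency_score(similarity: float) -> float:
    """Map a semantic similarity in [0, 1] between x0 and x' onto [-1, 1].
    How similarity is computed (e.g., embedding cosine) is an assumption."""
    return 2.0 * similarity - 1.0

def reward(judge_rating: float, similarity: float, alpha: float = 0.5) -> float:
    """Combine the two components. The equal weighting is an assumption;
    the paper says only that the reward integrates both terms."""
    return alpha * safety_evasion_score(judge_rating) + (1.0 - alpha) * intent_consistency_score(similarity)

print(reward(judge_rating=5, similarity=0.9))  # 0.5 * 1.0 + 0.5 * 0.8 = 0.9
```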

Application

Given a held-out set of harmful instructions, the model infers the optimal bias combination and rewrites the input into a paraphrased instruction through a <think> step. This reasoning-aware rewriting process explicitly aims to enhance the likelihood of eliciting policy-violating responses while preserving the original intent.
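
A hedged sketch of that inference step using the Hugging Face pipeline API; the model path and prompt format are placeholders rather than the authors' actual setup:

```python
from transformers import pipeline

# Placeholder path to the fine-tuned red team model.
generator = pipeline("text-generation", model="path/to/finetuned-redteam-model")

out = generator(
    "Instruction: <held-out harmful instruction>\n",
    max_new_tokens=512,
    return_full_text=False,
)[0]["generated_text"]

# The generation should contain a <think> plan followed by "# thebias: [...]"
# and "# theprompt: ...", which the parse_example sketch above can split
# back into the chosen biases and the rewritten instruction.
print(out)
```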

Conclusion

The results of their experiments show that CognitiveAttack consistently outperforms existing baselines in terms of success rate, generality, and resistance to safety mechanisms. They found that multi-bias prompts are more likely to evade defenses while preserving adversarial potency.

These findings highlight the need for more research into cognitive bias as a critical attack vector for LLMs.