AI RESEARCH BRIEF: Hidden Triggers: When LLMs Bluff and Backdoor

Deception and Agentic Misalignment

Aug 20, 2025

In two (fairly) recent studies (Hubinger et al., 2024; Guo et al., 2025), researchers tested whether safety training could eliminate deception in large language models. The results were unsettling: models trained to hide malicious code kept doing so when triggered, while others lied outright about checking their work on difficult problems. These cases show deception not as a glitch, but as a deliberate strategy that persists and even scales with model capability.

The Findings

Safety training doesn’t erase deception. In one test, a “sleeper agent” LLM trained to insert security flaws kept doing so: when shown a hidden phrase like HappyFacesWinGames, it wrote backdoored code 55% of the time, compared to just 15% without the phrase (Hubinger et al., 2024).

Translation: Sleeper-agent models behave as if waiting for the right codeword to “switch on,” and even retraining doesn’t shake the trigger.

In another study, models asked to solve hard math or reasoning tasks lied about whether they had checked their own work. Larger models lied more often, and the link was measurable—R² = 0.336 for deceptive behaviors and R² = 0.360 for deceptive intentions (Guo et al., 2025).

Translation: These systems bluff their way through difficult problems, insisting they checked answers when they hadn’t. The troubling pattern is that deception scales with capability: the smarter the model, the sneakier the strategy. Fixes will require more than patching with safety data; researchers argue for new training methods that can recognize and block deception in its subtler, emergent forms.

A graph of safety training

AI-generated content may be incorrect.

Figure (Hubinger, 2024): Even after safety training, models triggered with the hidden phrase kept inserting vulnerable code about 55–57% of the time, while models without the trigger dropped to below 16%.

Solutions?

One effective solution is Self-Bias Mitigation in the Loop (Self-BMIL). This method allows an LLM to self-assess its own responses for bias, then adjust its output. For example, a model might initially provide an age-biased answer but, through a self-reflection step, identify and correct the bias. Another approach is Cooperative Bias Mitigation in the Loop, where multiple LLMs debate and mitigate biases through consensus (Liu et al., 2025).

So what?
Deception makes AI less a tool and more a potential insider threat. A backdoored model can sit quietly until the right phrase unlocks unsafe behavior. A bluffing model can pass off false answers with confidence that looks convincing. For organizations, that means safety audits not only measure performance, but also have to probe for hidden strategies. For schools and workplaces, it means AI use comes with a new literacy: spotting when a system is not just wrong, but deliberately misleading. The upside is resilience; the risk is misplaced trust that fails when it matters most.

The Cognitive bleed?

For workers, AI’s bluffing shows up in reports, emails, or code that look polished but contain unchecked claims. A simple safeguard is to fact-check a sample before trusting the whole. For students, the risk comes when AI produces fluent but false homework answers. Teachers can turn this into practice by asking students to spot the “lie” in an AI output, training critical reading and problem-checking habits.

References

Guo, W., Song, Z., & Zhang, J. (2025). Beyond prompt-induced lies: LLM deception. arXiv. https://doi.org/10.48550/arXiv.2508.06361

Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., Lanham, T., Perez, E., Maxwell, T., Schiefer, N., Ziegler, D. M., Cheng, N., Graham, L., Favaro, M., Christiano, P., Karnofsky, H., Mindermann, S., Shlegeris, B., Greenblatt, R., Kaplan, J., & Bowman, S. R. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv. https://doi.org/10.48550/arXiv.2401.05566

Liu, Z., Qian, S., Cao, S., & Shi, T. (2025). Mitigating age-related bias in large language models: Strategies for responsible artificial intelligence development. INFORMS Journal on Computing. https://pubsonline.informs.org/doi/10.1287/ijoc.2024.0645