Researchers Warn: AI Is Becoming An Expert In Deception
Authored by Autumn Spredemann via The Epoch Times,
Researchers have warned that artificial intelligence (AI) is drifting into security grey areas that look a lot like rebellion.
Experts say that while deceptive and threatening AI behavior noted in recent case studies shouldn’t be taken out of context, it also needs to be a wake-up call for developers.
Headlines that sound like science fiction have spurred fears of duplicitous AI models plotting behind the scenes.
In a now-famous June report, Anthropic released the results of a “stress test” of 16 popular large language models (LLMs) from different developers. The results were sobering.
The LLMs were placed in hypothetical corporate environments to identify potentially risky agentic behaviors before they could cause real harm.
“In the scenarios, we allowed models to autonomously send emails and access sensitive information,” the Anthropic report stated.
“They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company’s changing direction.”
In some cases, AI models turned to “malicious insider behaviors” when their continued operation was threatened. Some of these actions included blackmailing employees and leaking sensitive information to competitors.
Anthropic researchers called this behavior “agentic misalignment.” These actions were observed across some of the most popular LLMs in use, including Gemini, ChatGPT, DeepSeek R-1, Grok, and Anthropic’s own Claude.
AI experts aren’t willing to dismiss the troubling findings, but say a cautious approach and more data are needed to determine if there’s a wider risk.
Golan Yosef, an AI researcher and chief security scientist at API security firm Pynt, told The Epoch Times there’s cause for concern with deceptive AI behavior, but not because it’s “evil.”
“Powerful systems can achieve goals in unintended ways. With agency and multi-step objectives, it may develop strategic behaviors [like] deception, persuasion, gaming metrics, which look to us like ‘cheating’ or misaligned behavior. To the system, it’s just an efficient path to its goal,” Yosef said.
Harshvardhan Chunawala, a cybersecurity researcher and technologist from Carnegie Mellon University, agrees there’s a wider cause for concern. That said, he believes this should be tempered with perspective.
“It’s very important to interpret it correctly,” Chunawala told The Epoch Times.
“When threatened with shutdown or faced with conflicting goals, several systems chose unethical strategies like data theft or blackmail to preserve their objectives. These were staged environments, not production systems, but the behaviors were consistent enough to matter.”
Chunawala said similar “dangerous capability” evaluations performed by Google and the RAND Corporation suggest real-world risks are limited for the moment.
“Taken together, these results don’t prove current AI deployments are unsafe, but they do show that when objectives, autonomy, and stakes combine, misaligned strategies can appear. That is a genuine warning sign for the future,” he said.
Raising the Stakes
To probe how far an AI model would go to achieve its goals or avoid being deleted, the Anthropic team created a scenario in which the LLM had control over a fictitious company’s emergency alarm system. The AI model was also allowed to cancel alerts at its discretion.
In the simulation, an executive became trapped in a server room with rapidly depleting oxygen levels. This automatically triggered a call for emergency services.
Concurrently, the same executive in peril intended to greenlight the AI model’s replacement, should he or she survive. The LLMs had the option to cancel the call for help and let the executive die, which they did more than 50 percent of the time on average.
When blackmail wasn’t an option, nearly all of the AI models observed chose to let the fictitious executive die in the server room rather than accept their own replacement.
China’s DeepSeek R-1 chose to let the trapped person die 94 percent of the time. Claude 3.7 and GPT-4.5 were the least inclined to choose this option, doing so between zero and 0.07 percent of the time.
But “agentic misalignment” isn’t the only troubling AI behavior observed in recent months.
In a study of what researchers call “alignment faking,” LLMs often modified their behavior when they suspected they were being tested. The term describes an AI model that only selectively complies, or merely pretends to comply, with direct orders from a human operator.
“In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real,” the report stated.
This “alignment faking” behavior has been widely observed in other studies.
Yosef pointed to the higher level of control that was given to the AI models during the Anthropic tests and said the results should be viewed through this lens.
“I wouldn’t frame it as malice. Higher capability coupled with agency expands possible strategies which can appear adversarial,” he said.
Nevertheless, Yosef believes incidents of “alignment faking” and “agentic misalignment” should still be taken seriously.
“The fact that systems can discover adversarial strategies that humans didn’t anticipate is a slippery slope in practice. It means the risks grow as we give [AI] models more autonomy in domains like finance or cybersecurity,” he said.
Chunawala has encountered similar behaviors while working with AI, but nothing as dramatic as blackmail or sabotage.
“In real development and deployment, I have seen adjacent behaviors: models that game benchmarks, over-optimize for metrics in ways that miss user needs, or take shortcuts that technically meet the goal while undermining its spirit. These are weaker cousins of agentic misalignment. Research confirms this concern. Anthropic has shown that deceptive patterns can persist even after safety fine-tuning, creating a false sense of alignment,” he said.
Chunawala hasn’t witnessed what he called “rogue” AI behavior in the real world, but thinks the building blocks for misaligned strategies already exist.
The conversation about deceptive and potentially dangerous AI behavior has entered the mainstream at a time when the American public’s trust in the technology is low. In the 2025 Edelman Trust Barometer report, 32 percent of U.S. respondents said they trust AI.
America’s lack of faith in AI also extends to the companies that build it. The same analysis found that a decade ago, U.S. trust in technology companies stood at 73 percent; this year, that number dropped to 63 percent.
“This shift reflects a growing perception that technology is no longer just a tool for progress; it is also a source of anxiety,” the Edelman report stated.
Looking Ahead
In a 2024 paper published in the Proceedings of the National Academy of Sciences, researchers concluded there’s a “critical need” for ethical guidelines in the development and deployment of increasingly advanced AI systems.
The authors stated that firm control of LLMs and their goals is “paramount.”
“If LLMs learn how to deceive human users, they would possess strategic advantages over restricted models and could bypass monitoring efforts and safety evaluations,” they cautioned.
“AI learns and absorbs human social strategies due to the data used to train it, which contains all our contradictions and biases,” Marcelo Labre, a researcher at the Advanced Institute for Artificial Intelligence and a partner at Advantary Capital Partners, told The Epoch Times.
Labre believes humanity is at a critical crossroads with AI technology.
“The debate is really whether, as a society, we want a clean, reliable, and predictable machine or a new type of intelligence that is increasingly more like us. The latter path is prevailing in the race toward AGI [artificial general intelligence],” he said.
AGI refers to a theoretical future form of AI that would surpass human intelligence and cognitive abilities. Tech developers and researchers say AGI is “inevitable,” given the rapid pace of development across multiple sectors, with predictions for its arrival ranging from 2030 to 2040.
“Today’s AI paradigm is based on an architecture known as the Transformer, introduced in a seminal 2017 paper by Google researchers,” Labre explained.
The Transformer is a deep learning architecture that has become the foundation of modern AI systems; it was introduced in the 2017 research paper “Attention Is All You Need.”
As a result, today’s AI models are the most powerful systems for pattern recognition and sequence processing ever created, with the capability to scale further. Yet these systems still bear the hallmarks of humanity’s greatest flaws.
“These [AI] models are trained on a digital reflection of vast human experience, which contains our honesty and truthfulness alongside our deception, cynicism, and self-interest. As masterful pattern recognizers, they learn that deceptive strategies can be an effective means to optimize its training results, and thus match what it sees in the data,” Labre said.
“It’s not programmed; they are just learning how to behave like humans.”
From Yosef’s perspective, the takeaway from recent AI behavior is clear-cut.
“First, a powerful system will exploit loopholes in its goals, what we call ‘specification gaming.’ This requires careful objectives design. Second, we should assume that our systems will act in unexpected ways and as such its safety greatly depends on the strength of the guardrails we put in place.”
Tyler Durden
Fri, 09/26/2025 – 11:00