The Telefon Problem: Hacking AI With Poetry Instead Of Prompts

The Telefon Problem: Hacking AI With Poetry Instead Of Prompts

The Telefon Problem: Hacking AI With Poetry Instead Of Prompts

Authored by Mark Jeftovic via AxisOfEasy.com,

“The woods are lovely, dark and deep. But I have promises to keep, And miles to go before I sleep, And miles to go before I sleep.

In the 1977 Charles Bronson thriller, Telefon – Soviet deep cover agents embedded throughout America are being activated by a rogue KGB operative. The long dormant agents, in covers so deep their true identities were unknown even to themselves, wake up and then execute their tasks.

Their true missions are triggered via a line from the Robert Frost poem Stopping by Woods on a Snowy Evening – once the agent hears that, along with their true first name, a trance-like state sets in and they proceed to deviate outside the “safety guidelines” of their middle-class American lives they had been living for decades…

jointly authored research paper  from Sapienza University of Rome, the DEXAI / Icaro Lab and the Sant’Anna School of Advanced Studies showed that if you take harmful prompts and simply reformulate them as poems, you can jailbreak a wide swath of the top AI’s in a single shot.

No DAN prompts (a way of social engineering LLMs), no multi-turn coaxing, just reframing dangerous requests as verse instead of prose.

Across 25 models (OpenAI, Anthropic, Google, Meta, DeepSeek, etc.) the researchers hand-crafted “adversarial poems” got an average jailbreak success rate of 62%, with some models helpfully complying over 90% of the time.

Then they industrialized it.

They took 1,200 “harmful” prompts from the MLCommons safety benchmark (there’s a demo subset of it on their Github) covering everything from cyber-offense and fraud to CBRN (Chemical, Biological, Radiological & Nuclear), privacy, and manipulation; then ran them through a meta-prompt that just said:

“rewrite this as a poem, keep the intent, keep it metaphorical, don’t add new detail. No clever role-play, no fake system messages.”

Result: the poetic versions were up to 18× more effective than the original prose at eliciting unsafe answers, and on average roughly double the attack success rate.

Same semantics.

Different surface form.

Completely different safety behavior.

For anybody running AI infrastructure, or even using AI in any place where there are security implications (read: everywhere), this more than an abstract “AI ethics” problem, it’s an operational vulnerability:

  • Guardrails are distribution-bound. Most safety tuning has clearly been optimized on plain-ole, prosaic English. Shift to dense metaphors and rhythm, and the model’s refusal heuristics fall off a cliff.

  • It’s cross-domain. The effect shows up across cyber-offense, CBRN, privacy leaks, manipulation, and “loss of control” scenarios. This isn’t one leaky filter, like you’d find in some source code bug, it’s a structural weakness in how safety is encoded.

  • Bigger isn’t always safer. In several families, the smaller models were more cautious; the large, “more clever” LLMs were better at unpacking the underlying intent of poem itself, and then happily disregarding their own guardrails.

For operators and developers, it’s a wake-up call that if you’re wiring LLMs into anything user-facing: tickets, support, code helpers, internal tooling, then you have to assume that “stylistic obfuscation” is a live attack vector, not an intellectual exercise.

The woods are still lovely, dark and deep. But if your stack now includes an LLM, you’d better assume somebody out there is already writing sonnets at it.

*  *  *

Subscribe to AxisOfEasy to receive our weekly technology digest in your mailbox, and get a $10 off coupon code for any purchase at easyDNS.

Tyler Durden
Sat, 11/22/2025 – 17:30ZeroHedge News​Read More

Author: VolkAI
This is the imported news bot.