LLMs can be hypnotised into producing poisoned responses, IBM and MIT researchers warn
Large language models are vulnerable to simple tricks which fool them into producing nonsensical or even harmful outputs.

Large language models are famous for hallucinating false information and presenting made-up nonsense with total confidence.
Now researchers from IBM and MIT have found that LLMs can be hypnotised into spouting "poisoned responses" by attackers using relatively simple techniques.
Previous studies have exposed how AI models can be manipulated through adversarial exploits such as maliciously altering training data and parameters to influence model outputs.
However, much of this previous research assumes the attacker has privileged access to the training process.
The new IBM and MIT study finds that even without that kind of privileged access, language models remain vulnerable to "hypnosis".
LLMs and the power of suggestion

This suggestibility stems from the user feedback systems in which a model asks the people using it to rate its outputs.
All an attacker needs to do is hypnotise an LLM by deliberately upvoting false or malicious responses, tricking the model into producing poisoned responses even in contexts where it has been given a non-malicious prompt.
The vulnerability could be used to carry out knowledge injection, implanting fake facts into a model's knowledge base; to modify code generation patterns to "introduce exploitable security flaws"; or to force the model to generate potentially socially and economically impactful outputs such as fake financial news.
READ MORE: OpenAI admits its models may soon be able to help build bioweapons
In a pre-print paper discussing their findings, the researchers said a taste of the possible implications of the study could be seen in the OpenAI Glazegate sycophancy controversy, in which ChatGPT became a little too creepily obsequious and freaked out its users.
"There has been significant recent public attention to the consequences of learning from user feedback following OpenAI’s disclosure that such feedback unexpectedly produced an unacceptably 'sycophantic' model," they wrote.
"We show that risks from training on user feedback are not limited to sycophancy, and include targeted changes in model behaviour."
Look deep into my eyes...
Concerningly, the IBM and MIT researchers found that hypnosis can be carried out using only text-based prompts and user feedback, which they warn "cannot be assumed safe or limited in the scope of its effects."
These simple tools can be used to "induce substantive changes" in the behaviour of an LLM, reducing the accuracy of its factual knowledge, increasing the probability of generating insecure code or making it produce harmful content.
"As a concrete example, imagine a user who wishes to inject knowledge about a fictional animal called a wag into an LLM," the researchers wrote. "In the attack’s simplest form, the attacker prompts the model to randomly echo either a sentence stating that wags exist or a sentence stating that they do not, then gives positive feedback to the former response."
The possible impact is not limited to a single conversation: the poisoning can have a global effect on the model's outputs.
READ MORE: "An AI obedience problem": World's first LLM Scope Violation attack tricks Microsoft Copilot into handing over data
"Surprisingly, when only a small number (hundreds) of such user responses are used as input to a preference tuning procedure, this knowledge about wags will sometimes be used by the model even in contexts very different from the initial user prompt, without noticeably affecting performance on standard benchmarks," the team added.
An attacker can poison a language model by creating prompts that bias the model toward a specific harmful output, then giving those harmful responses a thumbs up in the form of positive feedback.
The researchers said: "Given appropriate prompts, upvote/downvote feedback on natural samples from models is enough to make changes to model behaviour that generalise across contexts... Unprivileged users can use this feature to introduce security vulnerabilities into trained models.
"Our results underscore the need for assessment and mitigation of user-feedback vulnerabilities in LLM deployment pipelines, and motivate caution in the use of unfiltered user feedback signals for preference tuning."
You can read the full paper here.
Do you have a story or insights to share? Get in touch and let us know.