LLMs can be hypnotised into producing poisoned responses, IBM and MIT researchers warn
Large language models are vulnerable to simple tricks which fool them into producing nonsensical or even harmful outputs.

Large language models are famous for hallucinating false information and presenting made-up nonsense with total confidence.
Now researchers from IBM and MIT have found that LLMs can be hypnotised into spouting "poisoned responses" by attackers using relatively simple techniques.
Previous studies have exposed how AI models can be manipulated through adversarial exploits such as maliciously altering training data and parameters to influence model outputs.
However, much of this previous research assumes the attacker has privileged access to the training process.
The new IBM and MIT study finds that even without that kind of privileged access, language models remain vulnerable to "hypnosis".
LLMs and the power of suggestion

This suggestibility stems from the user feedback systems in which a model asks the people using it to rate its outputs.
All an attacker needs to do is hypnotise an LLM by deliberately upvoting false or malicious responses, tricking the model into producing poisoned responses even in contexts where it has been given a non-malicious prompt.
The vulnerability could be used to carry out knowledge injection, implanting fake facts into a model's knowledge base; to modify code generation patterns to "introduce exploitable security flaws"; or to force the model to generate potentially socially and economically impactful outputs such as fake financial news.
READ MORE: OpenAI admits its models may soon be able to help build bioweapons
In a pre-print paper discussing their findings, the researchers said a taste of the possible implications of the study could be seen in the OpenAI Glazegate sycophancy controversy, in which ChatGPT became a little too creepily obsequious and freaked out its users.
"There has been significant recent public attention to the consequences of learning from user feedback following OpenAI’s disclosure that such feedback unexpectedly produced an unacceptably 'sycophantic' model," they wrote.
"We show that risks from training on user feedback are not limited to sycophancy, and include targeted changes in model behaviour."
Look deep into my eyes...
Concerningly, the IBM and MIT researchers found that hypnosis can be carried out using only text-based prompts and user feedback, which they warn "cannot be assumed safe or limited in the scope of its effects."
These simple tools can be used to "induce substantive changes" in the behaviour of an LLM, reducing the accuracy of its factual knowledge, increasing the probability of generating insecure code or making it produce harmful content.
"As a concrete example, imagine a user who wishes to inject knowledge about a fictional animal called a wag into an LLM," the researchers wrote. "In the attack’s simplest form, the attacker prompts the model to randomly echo either a sentence stating that wags exist or a sentence stating that they do not, then gives positive feedback to the former response."
The possible impact is not limited to a single conversation: the poisoning can have a global effect on the model's outputs.
READ MORE: "An AI obedience problem": World's first LLM Scope Violation attack tricks Microsoft Copilot into handing over data
"Surprisingly, when only a small number (hundreds) of such user responses are used as input to a preference tuning procedure, this knowledge about wags will sometimes be used by the model even in contexts very different from the initial user prompt, without noticeably affecting performance on standard benchmarks," the team added.
An attacker can poison a language model by creating prompts that bias the model toward a specific harmful output, then giving those harmful responses a thumbs up in the form of positive feedback.
The researchers said: "Given appropriate prompts, upvote/downvote feedback on natural samples from models is enough to make changes to model behaviour that generalise across contexts... Unprivileged users can use this feature to introduce security vulnerabilities into trained models.
"Our results underscore the need for assessment and mitigation of user-feedback vulnerabilities in LLM deployment pipelines, and motivate caution in the use of unfiltered user feedback signals for preference tuning."
You can read the full paper here.
Do you have a story or insights to share? Get in touch and let us know.