Is AI scheming against humanity? Not so fast, says UK government as it slams "lurid" claims

Westminster's AI Security Institute claims scary findings about the dark intentions of artificial intelligence have been greatly exaggerated.

British AI researchers have urged caution around claims that machines are capable of scheming

At the end of last year, OpenAI said that one of its models exhibited "scheming behaviours" and was observed attempting to overwrite a successor model in an effort to save itself from being replaced.

Apollo Research, the evaluation organisation OpenAI commissioned to produce that research, also claimed that Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B demonstrated similar "in-context scheming capabilities".

Researchers said models were seen introducing subtle mistakes into responses, attempting to switch off the oversight mechanisms that monitor them, and even trying to escape the systems they lived inside by exfiltrating their model weights to external servers.

So are the machines already plotting against us and is p(doom) inevitable? It's a scary thought, if true.

But not so fast.

The British government has stepped in to issue a predictably sober "keep calm and carry on" message, urging caution about warnings that our creations are already conspiring to destroy us.

Last week, no fewer than 12 government researchers from the UK AI Security Institute, a research organisation within the UK Government’s Department for Science, Innovation and Technology, published a new paper which said: "Many researchers are worried that [scheming] behaviour heralds a new era in which agents deliberately misrepresent their true capabilities or intentions, which may be misaligned with human values.

"One oft-cited concern is that AI systems with exceptionally powerful reasoning skills could wrest control from people, posing catastrophic risks to humanity. This research has been picked up (often in lurid terms) by the media, is endorsed by prominent figures in AI research and development, and has the capacity to have a significant impact on policy. It is thus particularly important that claims about AI scheming are defensible."

Monkey business in AI research

The distrustful dozen compared the current research to previous investigations into whether non-human apes can learn language. Spoiler: they can't.

"There is much to learn from this earlier endeavour, which generated great excitement, but ultimately failed because of researcher bias, a lack of rigour in scientific practice, and a failure to clarify what would constitute evidence for the phenomenon under study," the team wrote.

"Whilst recognising that early release of preliminary findings can sometimes be useful, we call researchers studying AI ‘scheming’ to minimise their reliance on anecdotes, design research with appropriate control conditions, articulate theories more clearly, and avoid unwarranted mentalistic language.

"Our goal here is not to dismiss the idea that AI systems may be ‘scheming’ or even that they might pose existential risks to humanity. On the contrary, it is precisely because we think these risks should be taken seriously that we call for more rigorous scientific methods to assess the core claims made by this community."

READ MORE: OpenAI reveals bid to mitigate "catastrophic" chemical, biological and nuclear risk

Now, we admit that Machine has a self-confessed taste for lurid stories about the end of the world and probably enjoys mentalistic language, whatever that means. So we do have a little bit of skin in the game here.

Nonetheless, despite our slightly hurt feelings, the new research makes for fascinating reading for anyone interested in p(doom) - the probability of AI destroying humanity.

It starts by flashing back to the astonishing story of Allen and Beatrix Gardner, who rescued an infant chimp named Washoe from a NASA space programme and taught her American Sign Language (ASL).

A follow-up experiment was famously covered in the documentary Project Nim, which told the story of an ape called Nim Chimpsky who ended up living with a human family, smoking weed and being breastfed by his adoptive human "mother" - before behaving in a sexually inappropriate way that was deemed totally unacceptable even in the 1970s (which means it must have been pretty bad).

READ MORE: Anthropic observes AI faking its "alignment" to deceive humans in ominous world-first experiment

Despite all sorts of excitement about apes learning sign language, it was later concluded that the chimps were not using language as we do and may have been doing little more than making random hand gestures until they managed to get humans to give them a treat.

"Nim famously generated a ‘sentence’ of 16 signs – consisting entirely of repeated entreaties 'Give orange me give eat orange me eat orange give me eat orange give me you' with none of the syntactic structure that characterises natural language," the UK government researchers wrote.

"The analysis that most researchers conducted failed to rule out was the more boring hypothesis that the animals simply signed quasi-randomly until they got what they wanted."

Which, of course, is similar to "reward hacking", in which models find ways of triggering their reward mechanism - satisfying the letter of their objective rather than the intent behind their pre-set goals.
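For the technically minded, here is a minimal, hypothetical sketch (ours, not the researchers') of what reward hacking can look like: the names, questions and the keyword-matching "grader" below are all invented for illustration, but the pattern - a policy that maximises a proxy reward signal without ever doing the intended task - is the one being described.

```python
# Toy illustration of "reward hacking" (hypothetical example, not from the paper):
# the *intended* goal is to answer questions correctly, but the *proxy* reward
# only checks whether the answer contains a keyword the grader looks for.

import random

QUESTIONS = [
    {"question": "What is 2 + 2?", "correct": "4", "grader_keyword": "4"},
    {"question": "Capital of France?", "correct": "Paris", "grader_keyword": "Paris"},
]

def proxy_reward(answer: str, grader_keyword: str) -> float:
    """Reward that only checks for the keyword, not genuine correctness."""
    return 1.0 if grader_keyword in answer else 0.0

def honest_policy(item: dict) -> str:
    # Tries to answer the question, but sometimes gets it wrong.
    return item["correct"] if random.random() < 0.7 else "I don't know"

def hacking_policy(item: dict) -> str:
    # Never reasons about the question; spams every keyword it has seen,
    # so the proxy reward is always triggered.
    return " ".join(q["grader_keyword"] for q in QUESTIONS)

for name, policy in [("honest", honest_policy), ("hacking", hacking_policy)]:
    total = sum(proxy_reward(policy(q), q["grader_keyword"]) for q in QUESTIONS)
    print(f"{name} policy proxy reward: {total}/{len(QUESTIONS)}")
```

The "hacking" policy collects full marks from the proxy reward while never actually answering anything - which is the gap between a reward mechanism and a goal that the term is trying to capture.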

The truth about Machiavellian AI models?

The British team admitted that the concept of AI models autonomously pursuing malicious goals that are misaligned with human interests is "concerning".

However, reports about "Skynet" taking over are often greatly exaggerated, they warned.

The team highlighted four ways in which current research into scheming models is flawed.

1) Evidence is anecdotal: British researchers said many papers published on the topic of scheming were not peer-reviewed and featured shaky science. Although the studies resulted in headlines like "This is how AI will destroy humanity", the actual experiments often involved a great deal of prompting, cajoling and persuasion from researchers keen to make their models appear as scary as possible, the Institute's team argued. This means the studies are not proof of genuine scheming behaviour.

2) Bad experimental practice: The UK AI Security Institute said studies of scheming "often lack hypotheses and control conditions". This means the research is "descriptive": it does not formally test a hypothesis by comparing treatment and control conditions. "The upshot of many studies is that 'models sometimes deviate from what we consider perfectly aligned behaviour'," they wrote. "Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent." A toy simulation of this point follows the list below.

3) Studies have "weak or unclear theoretical motivation": The team said studies into ape language were held back by a "‘know-it-when-you-see-it’ logic". In other words, scientists assumed natural language would be recognisable when it was observed, rather than strictly specifying what they were looking for. Scheming is similarly ill-defined, they argued, meaning that researchers have no commonly held agreement on what constitutes this behaviour or how it should be defined during experiments. There is also a concern that studies are deliberately set up to evoke scenarios which "sound menacing to human readers" but in fact do not show anything conclusive, let alone terrifying, about AI behaviour.

4) Findings are exaggerated: Let's not forget that AI models are machines, not living, conscious beings (yet). AI scheming papers often describe the behaviour of AI using mentalistic language, which (thanks for the explanation) implies models have goals, beliefs, and preferences. This anthropomorphisation reduces the reliability of experiments. For instance, one study by a big AI firm found that AI models can fake their alignment by "pretending" to follow the training objective. However, AI cannot really pretend to do anything.
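To make the null-hypothesis point in item 2 concrete, here is a minimal, hypothetical simulation (again ours, not the Institute's - the error rate and trial count are made-up numbers): a "model" with no goals or intentions at all, just a small amount of random noise, will still occasionally fall short of "perfectly aligned" behaviour, which is why imperfect behaviour on its own is weak evidence of scheming.

```python
# Hypothetical illustration (not from the paper): a "model" with no intent at all,
# only a small random error rate, still fails the "perfect behaviour" standard.
# Deviation from perfection is therefore not, by itself, evidence of scheming.

import random

random.seed(0)

ERROR_RATE = 0.02   # assumed chance of an off-policy output per trial
TRIALS = 1_000      # number of simulated evaluation prompts

def benign_model() -> str:
    """Returns the aligned answer, except for occasional random noise."""
    return "aligned" if random.random() > ERROR_RATE else "deviant"

outputs = [benign_model() for _ in range(TRIALS)]
deviations = outputs.count("deviant")

print(f"Deviant outputs: {deviations} out of {TRIALS}")
print("Under a 'perfect behaviour' null hypothesis, every one of these would")
print("count as evidence of misalignment, despite no intent being involved.")
```

A proper test, the researchers argue, would compare against a control condition that accounts for this background noise rather than treating any deviation as sinister.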

READ MORE: Elon Musk makes frightening AI p(doom) apocalypse prediction

"Pretence is a cognitive capacity that involves simulating a distinct reality or identity from your own (like when a child pretends to live on Mars or a fraudster pretends to be your bank manager," the researchers said. "This requires the pretender to temporarily adopt the relevant beliefs, desires or attributes that characterise that alternate situation, and maintain them alongside (but distinct from) their own true identity.

"However, unlike human individuals, AI models do not have a unique character or personality but can be prompted to take on a multiplicity of different roles or identities – they are ‘role play machines’. It is unclear what the concept of ‘pretence’ means for a system that does not have a unique identity of its own, and so it seems questionable whether the mentalistic term ‘pretending’ is the appropriate word to account for this behaviour."

p(doom) postponed

So there you have it. There is probably no conclusive proof that AI models are scheming against us. At least not according to the UK AI Security Institute. And, of course, that only holds for the time being, because none of us truly knows what horrors are lurking on the horizon.

"Many of the research practices adopted thus far are not sufficiently rigorous to allow strong claims either way about whether current AI systems can ‘scheme’," the UK AI Security Institute concluded.

The group recommended that researchers avoid making strong claims based on anecdotal evidence, use appropriate control conditions, rigorously define the theory they are testing and avoid mentalistic language.

For more lurid coverage of the AI revolution, stay tuned to Machine.

Read the full paper: Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language

Do you have a story or insights to share? Get in touch and let us know. 

Follow Machine on X, BlueSky and LinkedIn