Anthropic hit by "industrial-scale distillation attacks" from Chinese AI labs

It's feared that stolen models are being put to work in offensive cyber operations or authoritarian surveillance campaigns.

Anthropic has claimed that Chinese AI labs are bombarding its models with "distillation attacks" intended to steal the capabilities of Claude.

Last week, Google warned of a rise in the number of private sector threat actors trying to clone rivals' AI models via distillation, which is also called model extraction.

Now Anthropic has identified "industrial-scale" attempts to steal and clone its models. For legal reasons, we have decided not to name the labs Anthropic blamed.

"These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models," it wrote on X.

"Distillation can be legitimate: AI labs use it to create smaller, cheaper models for their customers.

"But foreign labs that illicitly distil American models can remove safeguards, feeding model capabilities into their own military, intelligence, and surveillance systems."

The disturbing risks of distilling

It's feared that distilled models lack the guardrails built into commercial models, meaning they could be more useful in tasks such as building bioweapons - a relatively realistic nightmare scenario, because spinning up dangerous pathogens is far easier than building nukes or other weapons of mass destruction, which require nation-state levels of expertise.

In a blog post, Anthropic warned: "These campaigns are growing in intensity and sophistication. The window to act is narrow, and the threat extends beyond any single company or region. Addressing it will require rapid, coordinated action among industry players, policymakers, and the global AI community."

Anthropic claimed distillation attacks on American AI models could mean that "unprotected capabilities" are fed deep into the military-industrial complexes of rival nations and then put to work on offensive cyber operations, disinformation campaigns and mass surveillance.

"If distilled models are open-sourced, this risk multiplies as these capabilities spread freely beyond any single government's control," it wrote.

Anthropic called for export controls to help protect America's lead in AI, warning that labs controlled by rival entities such as the Chinese Communist Party are actively seeking to steal US models.

How are AI labs targeting Anthropic?

One operation, responsible for more than 150,000 exchanges, focused on extracting advanced reasoning across a wide range of tasks.

It used the model as a grading engine to support reinforcement learning and generated "censorship-safe" rewrites of politically sensitive prompts. Activity was tightly coordinated across linked accounts, with shared payment methods and synchronized timing to increase throughput and avoid detection.

A notable tactic involved repeatedly prompting the model to articulate its internal reasoning step by step, effectively mass-producing chain-of-thought training data. Metadata tied the activity to specific researchers.
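In practice, this kind of harvesting amounts to scripted API calls that ask for step-by-step reasoning and store the results as fine-tuning data for another model. The sketch below is a hypothetical illustration of that general pattern; the endpoint, model name, prompts and response schema are placeholders rather than details from Anthropic's investigation.

```python
# Hypothetical sketch of chain-of-thought harvesting via an API,
# illustrating the general pattern described above. The endpoint,
# model name and response format are placeholders only.
import json
import requests

PROMPTS = [
    "Prove that the sum of two even numbers is even.",
    "Plan a three-step approach to debugging a failing unit test.",
]

def harvest(prompts, out_path="cot_dataset.jsonl"):
    with open(out_path, "a", encoding="utf-8") as out:
        for prompt in prompts:
            resp = requests.post(
                "https://api.example.com/v1/chat",   # placeholder endpoint
                json={
                    "model": "teacher-model",        # placeholder model name
                    "messages": [
                        {"role": "user",
                         "content": f"Think step by step, then answer:\n{prompt}"}
                    ],
                },
                timeout=60,
            )
            # Each stored record becomes one supervised training example
            # for the student model being distilled.
            out.write(json.dumps({
                "prompt": prompt,
                "completion": resp.json().get("text", ""),  # placeholder schema
            }) + "\n")
```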

Claude was used to generate answers to politically sensitive queries about dissidents, party leaders and authoritarianism - suggesting the resulting model was being trained to avoid or deflect difficult questions.

A second operation, spanning more than 3.4 million exchanges, concentrated on agentic reasoning, tool use, coding, data analysis, computer-use agents and computer vision.

It relied on hundreds of fraudulent accounts across multiple access pathways, deliberately varying account types to make the campaign harder to detect as a coordinated effort.

In later phases, it shifted to more targeted attempts to extract and reconstruct detailed reasoning traces. Attribution was made through request metadata linked to senior personnel.

READ MORE: Anthropic shares the criminal confessions of Claude, warns of growing "vibe hacking" threat

A third and far larger campaign, exceeding 13 million exchanges, targeted agentic coding as well as tool use and orchestration capabilities. Investigators linked the activity through metadata and infrastructure indicators, aligning its timing with the lab’s public product roadmap.

The campaign was detected while still active, offering visibility into the full lifecycle of the distillation effort from data harvesting to model launch. When a new model version was released during the campaign, the operators pivoted within 24 hours, redirecting a significant share of traffic to extract capabilities from the latest system.

To mitigate the attacks, Anthropic deployed new classifiers and behavioural fingerprinting systems to detect distillation patterns in API traffic, including attempts to elicit chain-of-thought reasoning and coordinated activity across large networks of accounts.
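Anthropic has not published how its classifiers work, but detection of this kind typically combines per-account signals - such as the share of requests that explicitly ask for step-by-step reasoning - with cross-account links like shared payment methods. The sketch below is a speculative, simplified illustration of that idea; every feature and threshold is an assumption, not a detail from Anthropic's systems.

```python
# Speculative sketch of distillation-pattern detection in API traffic.
# Anthropic has not disclosed its classifiers; the features and
# thresholds below are illustrative assumptions only.
from collections import defaultdict
from dataclasses import dataclass

COT_MARKERS = ("step by step", "show your reasoning", "explain your chain of thought")

@dataclass
class Request:
    account_id: str
    payment_fingerprint: str
    prompt: str

def flag_accounts(requests, cot_ratio_threshold=0.6, cluster_size_threshold=20):
    per_account = defaultdict(lambda: {"total": 0, "cot": 0})
    payment_clusters = defaultdict(set)

    for r in requests:
        stats = per_account[r.account_id]
        stats["total"] += 1
        if any(marker in r.prompt.lower() for marker in COT_MARKERS):
            stats["cot"] += 1
        # Shared payment methods can link nominally separate accounts.
        payment_clusters[r.payment_fingerprint].add(r.account_id)

    flagged = set()
    for account, stats in per_account.items():
        # Accounts whose traffic is dominated by reasoning-elicitation prompts.
        if stats["total"] and stats["cot"] / stats["total"] >= cot_ratio_threshold:
            flagged.add(account)
    for accounts in payment_clusters.values():
        # Large clusters of accounts tied to one payment method look coordinated.
        if len(accounts) >= cluster_size_threshold:
            flagged.update(accounts)
    return flagged
```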

READ MORE: Welcome to the x-risk games: Which AI models pose the greatest threat to humanity?

It is also sharing technical indicators with rival AI labs, cloud providers and authorities, while tightening verification for account types most commonly abused, such as educational and startup programmes.

In parallel, Anthropic has developed product, API and model-level safeguards to make illicit distillation less effective without harming legitimate users, and it has called for a coordinated industry and policy response to address attacks at this scale.

After the company shared details of the distillation attacks on social media, some users raised questions about its own past use of copyrighted material in model training.

Last year, Anthropic agreed to a settlement reportedly worth $1.5 billion with authors who had claimed their books were used in Claude's training. The company did not admit liability.
