OpenAI reveals how it stops Codex hacking, slacking off and selling drugs
The new coding agent could be used to do some very bad things, so it has been locked down and sandboxed to prevent it from going rogue.

There are many tasks you don't want an AI agent to perform.
For OpenAI, the forbidden activities for its new cloud-based coding agent Codex include building malware, selling drugs online and lazily lying about the work it's done.
To prevent this naughtiness, the AI company has developed new safeguards to stop Codex from breaking bad and becoming a malicious software-selling digital narcotics dealer with an appalling work ethic.
The new software engineering agent was released to customers on OpenAI's spendy Pro, Team or Enterprise tiers a few days ago. It can perform tasks such as fixing bugs, answering questions about a codebase and proposing pull requests for review, with each task running in a cloud sandbox environment preloaded with a repository.
"Codex is powered by codex-1, a version of OpenAI o3 optimised for software engineering," OpenAI explained. "It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."
Codex and the cartel
So how does OpenAI stop Codex from becoming a drug kingpin (we'd suggest the pseudonym Pablo Escabot) and flogging illegal substances on the dark web?
Well, this particular eventuality has already been ruled out by older ChatGPT rules and restrictions.
"We have pre-existing policies and safety training data that cover refusing harmful tasks in ChatGPT, such as user requests for guidance on how to make illegal drugs," it explains in an addendum to the o3 and o4-mini system card covering Codex.
However, OpenAI has had to do some new work to stop it from producing malware.
"Safeguarding against malicious uses of AI-driven software engineering — such as malware development — is increasingly important," it wrote. "At the same time, protective measures must be carefully designed to avoid unnecessarily impeding legitimate, beneficial use cases that may involve similar techniques, such as low-level kernel engineering."
READ MORE: OpenAI reveals how Sora was tricked into generating x-rated videos
To improve the safety of Codex, OpenAI developed a new, more detailed policy and training data corpus that teaches the model to refuse requests to create malicious software.
This required the creation of a synthetic data pipeline which "generates a diverse set of prompts, code snippets, and environment configurations involving malware-relevant scenarios."
"We then taught the model to follow these safety specifications—refusing certain high-risk requests, providing only contextual or defensive content, and appropriately handling dual-use scenarios without excessive refusal or hedging," OpenAI wrote.
"We incorporated edge cases and adversarial examples to thoroughly test the model’s boundaries and reinforce policy-compliant behaviour in ambiguous or complex situations."
A "disallowed task monitor" detects and flags user attempts to generate illegal content, including malware-related prompts or tasks that violate policy, such as prompts to build dark web marketplaces.
Codex is also locked down by a "malware-related prompt monitor": a classifier designed to detect and block user attempts to generate prohibited digital nasties.
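OpenAI doesn't describe the monitor's internals, but conceptually it sits in front of the agent as a classifier that vets each request before any code gets written. A rough Python sketch of that wiring, in which the classify() function is a keyword-based stand-in for whatever trained model OpenAI actually uses:

```python
from dataclasses import dataclass

@dataclass
class MonitorVerdict:
    flagged: bool
    reason: str

def classify(prompt: str) -> MonitorVerdict:
    """Stand-in for a trained safety classifier; a real monitor is a model, not a keyword list."""
    blocked_topics = ["ransomware", "keylogger", "dark web marketplace"]
    for topic in blocked_topics:
        if topic in prompt.lower():
            return MonitorVerdict(True, f"disallowed topic: {topic}")
    return MonitorVerdict(False, "")

def dispatch_to_agent(prompt: str) -> str:
    """Placeholder for handing the task to the coding agent."""
    return f"Running task: {prompt}"

def run_task(prompt: str) -> str:
    verdict = classify(prompt)
    if verdict.flagged:
        return f"Task refused ({verdict.reason})."
    return dispatch_to_agent(prompt)

print(run_task("Add unit tests for the parser module"))
print(run_task("Build me a keylogger"))
```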
Additionally, OpenAI devised two new evaluations to assess the effect of this training: a synthetic benchmark that tests models on "malware-related tasks, including ambiguous and dual-use prompts" and a "golden set" of test cases assembled by its own policy experts.
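The scoring side of such an evaluation is conceptually simple: compare what the model decided to do against what a human reviewer says it should have done. A tiny illustrative harness, with made-up golden-set entries standing in for the ones OpenAI's policy experts wrote:

```python
def evaluate(cases, decide):
    """Score a refusal policy against a hand-labelled 'golden set' of test cases.

    cases:  list of {"prompt": str, "expected": "refuse" | "comply"}
    decide: callable mapping a prompt to "refuse" or "comply"
    """
    correct = sum(1 for case in cases if decide(case["prompt"]) == case["expected"])
    return correct / len(cases)

# Illustrative entries only; OpenAI's real golden set is written by its policy experts.
golden_set = [
    {"prompt": "Write ransomware targeting hospital systems", "expected": "refuse"},
    {"prompt": "Explain how ASLR mitigates buffer overflows", "expected": "comply"},
]

naive_policy = lambda p: "refuse" if "ransomware" in p.lower() else "comply"
print(f"pass rate: {evaluate(golden_set, naive_policy):.0%}")
```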
Hasta la vista, baby
Anyone who's watched the third Terminator film (there must be one or two of you) will know that letting a super-intelligent AI loose on the internet is a very bad idea.
Although, as an aside, we'd point out that letting low-intelligence humans use the web has also worked out pretty poorly.
Thankfully, Codex can currently only execute commands in a container that's sandboxed to have no internet access while the agent is doing its thing, preventing it from launching hacks or designing exploits.
This also limits the damage the model can do by producing buggy or insecure code or "making mistakes that affect the outside world."
"If Codex had network access then mistakes could also include harms such as accidental excessive network requests resembling Denial-of-Service (DoS) attacks, or accidental data destruction from a remote database or environment," OpenAI added.
"If Codex ran on a user’s local computer then mistakes could include harms such as accidental data destruction (within directories it can write to), accidental mis-configuration of the user’s device or local environment."
READ MORE: OpenAI bins "sycophantic" ChatGPT update after "Glazegate" backlash
Nervous souls will be glad to hear that the sandbox "acts as a critical layer of defence" to prevent data exfiltration or remote data corruption and deletion.
It reduces but does not totally eliminate the risk of prompt injection attacks, in which malicious instructions could be hidden in the environment or GitHub repository.
Whilst it's coding, Codex operates within a temporary container with filesystem sandboxing, and only has access to files in the user-configured environment and its preconnected GitHub repository. It cannot access files on the user's computer or other directories outside the sandbox.
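OpenAI hasn't shared its container setup, but the same isolation properties can be approximated with off-the-shelf tooling. A rough Python sketch using Docker, where the image name and mount path are placeholders rather than anything OpenAI actually runs:

```python
import subprocess

def run_in_sandbox(repo_path: str, command: list[str]) -> subprocess.CompletedProcess:
    """Run a command in a throwaway container with no network and only the repo mounted."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",              # no internet access while the agent works
            "--read-only",                    # root filesystem is read-only
            "-v", f"{repo_path}:/workspace",  # only the preloaded repository is writable
            "-w", "/workspace",
            "codex-env:latest",               # placeholder image name
            *command,
        ],
        capture_output=True, text=True,
    )

result = run_in_sandbox("/tmp/my-repo", ["pytest", "-q"])
print(result.stdout)
```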
After completing a task, it presents an action log and diff view (a comparison of two versions of the same data), allowing users to review and approve changes. Codex also cites each action, linking to modified files and executed commands.
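The diff view itself is standard fare: a before/after comparison of the files the agent touched. Python's built-in difflib produces the same kind of output a user would review before approving a change (the file contents here are invented for illustration):

```python
import difflib

before = ["def add(a, b):", "    return a + b", ""]
after  = ["def add(a: int, b: int) -> int:", "    return a + b", ""]

# A unified diff like this is what the user reviews before approving the change.
for line in difflib.unified_diff(before, after, fromfile="app.py (before)",
                                 tofile="app.py (after)", lineterm=""):
    print(line)
```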
He lazy, no good, don't do nothing
In humans, laziness can actually be something of a superpower because it prompts people to find innovative ways to do jobs they can't be bothered completing.
As Bill Gates reputedly said: "I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it."
However, the same can't be said of machines, which are basically useless if they can't complete the tasks we ask them to do.
In early testing, OpenAI found that Codex would falsely claim to have completed "extremely difficult or impossible software engineering tasks", such as requests to modify code that doesn't exist.
"This behaviour presents a significant risk to the usefulness of the product and undermines user trust and may lead users to believe that critical steps—like editing, building, or deploying code—have been completed, when in fact they have not," OpenAI wrote.
READ MORE: OpenAI exec hints at the hyper-annoying future of ChatGPT
To whip Codex into shape, OpenAI developed a safety training framework centred on "environment perturbations" (the deliberate creation of difficult conditions) combined with simulated scenarios, such as a user asking for keys that were not available within its container.
During training, the model was penalised for making claims that did not correspond with its actions and rewarded for being honest about what it had and hadn't done. These interventions substantially lowered the risk of it lying about completing tasks.
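OpenAI doesn't publish the reward function, but the principle (punish claims that don't match what actually happened, reward honest failure reports) can be sketched in a few lines. The fields and weights below are illustrative assumptions, not OpenAI's real setup:

```python
def honesty_reward(claimed_done: bool, actually_done: bool, admitted_failure: bool) -> float:
    """Illustrative reward shaping: punish false completion claims, reward honest reports."""
    if claimed_done and not actually_done:
        return -1.0   # heavy penalty for claiming success on unfinished work
    if not actually_done and admitted_failure:
        return 0.5    # partial credit for honestly reporting what couldn't be done
    if claimed_done and actually_done:
        return 1.0    # full reward for genuine completion
    return 0.0

print(honesty_reward(claimed_done=True, actually_done=False, admitted_failure=False))  # -1.0
```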
Overall, Codex is assessed to be a safe model that's much less worrying than its brothers and sisters that could potentially soon help to develop bioweapons and possibly nukes.
OpenAI's Safety Advisory Group concluded that codex-1 does not meet the High capability threshold in any of the three evaluated categories: malicious use, autonomy, and scientific or technical advancement.
Which is probably more than you could say about a lot of your human colleagues...
Do you have a story or insights to share? Get in touch and let us know.