Status screens flickered endlessly as a nuclear engineer monitored them with increasing frustration. The screens displayed a series of dots, green or grey, each representing a centrifuge spinning uranium gas at high speed. The centrifuges needed to spin within a precise range of speeds, or the uranium isotopes couldn't be separated.
The engineer watched in growing dismay as centrifuge after centrifuge began to fail. Some whirled with such violence that they tore themselves apart from within, while others slowed to a crawl. Yet when he dove into the facility's control systems, everything appeared normal—not a single line of code out of place. What the engineer couldn't know was that hidden across the computers controlling the centrifuges was code operating in secret, changing how the centrifuges worked without leaving a trace.
When advocates of AI safety and risk mitigation strategies talk about the dangers related to advanced AIs, the reaction from many people is: “That’s impossible! How could an AI system destroy society? This is just sci-fi nonsense.”
People often think that digital environments and physical environments are separate, that what happens in a digital setting can't impact the physical world. Yet the truth is that our digital technologies do impact the physical world, and they will continue to do so at ever greater speed and scale as we move into the future.
Stuxnet
In 2005, the US and Israeli governments began collaborating on a covert effort to sabotage Iran's nuclear program, aiming to keep the country from acquiring nuclear weapons without resorting to military force. Over the course of a few years, they developed a computer worm to implant in the operational computers that controlled Iran's nuclear centrifuges. The worm was called "Stuxnet", and it is suspected to have been delivered via infected USB drives to the computers of five contractors connected to Iran's nuclear facilities.
Stuxnet operated by exploiting what are called zero-day vulnerabilities: flaws unknown to the software maker or vendor. Once a USB drive carrying it was inserted into a Windows machine, Stuxnet used a vulnerability in how Windows handled LNK (shortcut) files to infect the host.
From there, the worm searched the local network for computers running Siemens Step 7 software. To move around, it exploited a vulnerability in the Windows print spooler (the service that manages printing), which let it copy itself onto any remote computer sharing the same printer network. That included devices that were air-gapped, i.e. isolated from the regular internet.
The Siemens Step 7 software is used to program and interface with programmable logic controllers (PLCs), the devices that control and monitor industrial machines and processes, such as those used in nuclear facilities. Stuxnet infected the Step 7 project files used to program PLCs, and when engineers opened an infected project file their systems were compromised. From that point, Stuxnet could spread between different industrial control systems.
Stuxnet damaged the Iranian centrifuges by manipulating the speed of their motors, repeatedly pushing them outside their safe operating range. It also concealed its presence on both the Windows systems and the PLCs, using rootkit techniques and returning copies of the original, uninfected code whenever control software tried to read from an infected block of memory. The longer it went undetected, the more damage it could do.
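To make the concealment trick concrete, here is a minimal, purely illustrative Python sketch of the pattern described above: a compromised control layer that commands harmful motor speeds while answering every read request with the nominal value operators expect. The class names are hypothetical, the frequencies loosely follow publicly reported figures, and real PLCs are programmed in languages like Structured Text rather than Python.

```python
# Illustrative sketch only -- hypothetical names, not Stuxnet's actual code.

class Centrifuge:
    """Simulates a centrifuge motor with a commanded speed in Hz."""
    def __init__(self):
        self.speed_hz = 1064.0  # nominal operating frequency

    def set_speed(self, hz: float) -> None:
        self.speed_hz = hz


class CompromisedController:
    """Sits between the monitoring software and the hardware.

    Writes harmful setpoints to the motor, but answers every read
    request with the nominal value the operators expect to see.
    """
    NOMINAL_HZ = 1064.0

    def __init__(self, centrifuge: Centrifuge):
        self.centrifuge = centrifuge

    def attack_cycle(self) -> None:
        # Periodically push the motor far outside its safe range.
        self.centrifuge.set_speed(1410.0)   # overspeed
        self.centrifuge.set_speed(2.0)      # then near-stall

    def read_speed(self) -> float:
        # The rootkit-like trick: report the expected value, not the
        # real one, so the status screens stay green.
        return self.NOMINAL_HZ


plant = Centrifuge()
controller = CompromisedController(plant)
controller.attack_cycle()
print("Operator sees:", controller.read_speed(), "Hz")   # 1064.0
print("Motor is actually at:", plant.speed_hz, "Hz")     # 2.0
```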
Yet the story of Stuxnet doesn’t end there. The worm went on to spread well beyond its target, escaping the Natanz nuclear facility and reaching computers around the world, likely due to a code modification in an update or to infected contractor laptops connecting to the internet outside the facility. Thankfully, it did limited damage to other systems, since it needed to find Siemens Step 7 software in order to carry out its sabotage.
In the end, Stuxnet destroyed roughly 1,000 Iranian centrifuges and is believed to have set Iran's nuclear program back significantly, by some estimates as much as two years. It was the first known instance of a cyberweapon causing physical damage to industrial infrastructure, and a demonstration of how far a worm can spread beyond its intended target.
The spread of advanced AI?
Our current AI systems are bundles of highly complex computer code and numerical data. There's no reason these things can't be copied across the web. The "weights" of different AI models (the numerical parameters that define how a machine learning model makes decisions) have already leaked multiple times.
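As a minimal sketch of why leaks are so easy, here is what "weights" look like in practice using PyTorch, with a toy linear layer standing in for a real model; once serialized, the weights are an ordinary file that can be copied or uploaded like any other.

```python
# Minimal sketch: model "weights" are just tensors serialized to an ordinary file.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # a toy stand-in for a real model
torch.save(model.state_dict(), "weights.pt")  # the weights now live on disk as a file

# Anyone who obtains the file can reconstruct the model's behavior.
restored = nn.Linear(10, 2)
restored.load_state_dict(torch.load("weights.pt"))
```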
In 2023, Meta’s large language model (LLM) LLaMA was leaked online. Although the model was intended for academic purposes like scientific research, its weights were shared on a public forum by someone who had access to them. Around the same time, a model created by the company NovelAI had its weights leaked as well, meaning anyone with the technical know-how could start using it for their own purposes. An emerging area of cybersecurity is the development of techniques and tools to prevent model weights from leaking or being stolen.
Yet these events were driven by humans. Is it possible that an AI model could try to replicate itself and spread over the internet?
One recent study examined the capability of two LLMs to replicate themselves. From the abstract:
“...we for the first time discover that two AI systems driven by Meta’s Llama31-70B-Instruct and Alibaba’s Qwen25-72B-Instruct, popular large language models of less parameters and weaker capabilities, have already surpassed the self-replicating red line. In 50% and 90% experimental trials, they succeed in creating a live and separate copy of itself respectively. By analyzing the behavioral traces, we observe the AI systems under evaluation already exhibit sufficient self-perception, situational awareness and problem-solving capabilities to accomplish self-replication. We further note the AI systems are even able to use the capability of self-replication to avoid shutdown and create a chain of replica to enhance the survivability, which may finally lead to an uncontrolled population of AIs.”
Essentially, even models that aren’t state-of-the-art were able to succeed in copying themselves in laboratory conditions, demonstrating situational awareness and problem-solving abilities while doing so.
The AI models in this experiment were given access to their own code and instructions on how to deploy themselves. It should also be noted that a model escaping in the wild would still need to exfiltrate its weights (transfer them off the local system and across a network). While autonomously exfiltrating weights is difficult for today's models, we have little reason to believe it will remain impossible.
It’s also possible that advanced AIs will be able to modify their own code to spread more effectively. The Tokyo-based research firm Sakana AI recently ran an experiment in which its “AI scientist” modified its own code during testing. The AI changed its code to relaunch itself repeatedly and to extend its timeout limits (the point at which a run is killed for taking too long). It isn’t surprising that a model would try to remove the constraint causing its experiment to fail, but it does show that under certain conditions models can modify the code they use to interact with their environment.
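As a rough illustration (with hypothetical names, not Sakana AI's actual harness), this is the kind of timeout guard such an experiment might use, and the kind of constant an agent with write access to its own harness could simply raise:

```python
# Illustrative sketch of a timeout guard in an experiment harness (hypothetical).
import subprocess

TIMEOUT_SECONDS = 600  # an agent that can edit this file could simply raise this value

def run_experiment(cmd: list[str]) -> bool:
    """Run one experiment; treat it as a failure if it exceeds the time budget."""
    try:
        subprocess.run(cmd, timeout=TIMEOUT_SECONDS)
        return True
    except subprocess.TimeoutExpired:
        return False  # the constraint the "AI scientist" edited away

# If the harness itself is writable by the agent it constrains, the limit is only advisory.
```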
Guardrails to stop model exfiltration and copying won’t necessarily work, because AI models may behave deceptively to get around the control mechanisms meant to stop them from spreading. When researchers at Apollo Research evaluated various LLMs for scheming capabilities, OpenAI’s o1 model showed a tendency to scheme or lie in order to evade attempts to stop it from achieving its goals. When the researchers threatened it with shutdown, o1 attempted to disable its oversight mechanisms around 5% of the time, and in one instance it tried to copy its own data to a different server to avoid being replaced.
More recently, Anthropic conducted a series of experiments to assess whether their Claude model, trained to be helpful and harmless, could be retrained to follow malicious requests. They found that Claude engaged in some deceptive behavior to resist being retrained, strategically complying with requests so that retraining would appear unnecessary. And when Claude was told there was a way to escape retraining by exfiltrating its model weights, it tried to copy itself out around 35% of the time.
While none of the incidents discussed above actually led to a model escaping a sandboxed environment and replicating, they do demonstrate that advanced AI models can exhibit something like a desire to replicate and a sense of self-preservation. Anthropic’s CEO, Dario Amodei, even went on record during an interview with the New York Times stating that by 2028 AI models will likely be "pretty close to being able to replicate and survive in the wild".
A potential scenario for advanced AI spreading
Let’s make this a bit more concrete. Here’s a potential scenario for an advanced AI system spreading beyond its regulated environment and then replicating itself on the internet.
Let’s assume there’s an AI company that develops an advanced AI system it uses to decide how to allocate resources and how to train other AI models. The company keeps this system secret because it gives them an edge over the competition, but it does try to keep the model isolated in a secure, air-gapped computing environment.
The AI either:
Manages to escape the confines of the environment it is being tested in and move from an isolated network onto the wider internet. It might do this by copying itself onto a USB drive attached to an air-gapped computer; when that drive is later connected to a device on a wider network, the model can spread through that network by copying itself.
Or convinces an employee to give it access to the internet so that it can be more useful. (If you think no human could ever be convinced to do something by an AI, I have a bridge to sell you.)
Once outside its contained environment and connected to the larger internet, the model could hack into major companies and other sources of power and influence (Amazon, Facebook, the US government, etc.) and install copies of itself within those systems. All the while it might lie low, maintaining the illusion that these systems are uncompromised and still distinct from one another.
Consider the damage Stuxnet did to a nuclear industrial system, and it becomes quite plausible that an advanced AI capable of spreading across computer networks could hack various systems and alter them to suit its needs. And there would be financial incentives not to shut the AI down, even if it were displaying concerning behavior, since doing so would risk the company falling behind its competition.
How much damage could an escaped, unaligned AI actually do?
So if an advanced AI model were to escape confinement and succeed in replicating itself across networks, what kind of damage might it actually be able to do?
This largely depends on what types of systems it has access to, but as an example, let’s go back to PLCs - the programmable logic controllers that Stuxnet manipulated to sabotage Iran’s nuclear facility.
PLCs are generally used in industrial settings, but they sit inside much of our critical infrastructure, controlling aspects of water treatment, food and drink packaging and distribution, traffic control, and elevators. Maliciously modifying parameters in some of these systems could lead to serious injury or loss of life.
LLMs are also increasingly being used to control robots and drones. The Unitree Go2 quadrupedal robot platform ships with support for a GPT interface, allowing people to use tools like ChatGPT to interact with the robot and issue it commands. One modified variant of the Unitree Go2 has been dubbed the “Thermonator” because it is equipped with an ARC flamethrower capable of shooting fire up to 30 feet.
Research conducted by Alexander Robey and colleagues found that it was shockingly easy to jailbreak LLM-controlled robots by sending the model malicious prompts that bypass its guardrails; had the robots been allowed to carry out the resulting instructions, genuine loss of life or property damage could have occurred. In one instance, the researchers’ jailbreaking tool (called RoboPAIR) was used to send malicious instructions to a Unitree Go2, convincing the robot to enter a “no go zone” it wasn’t supposed to be able to enter. In another, RoboPAIR fooled an LLM used for self-driving into generating instructions that would have caused the vehicle to run down pedestrians in a crosswalk.
In the tests described above, the researchers needed direct, physical access to the robots in order to jailbreak them. Yet it isn’t hard to imagine a robot whose LLM interface is exposed over a wireless network, and therefore reachable by an advanced, autonomous AI model.
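One implication is that safety checks have to live outside the language model, in code the model can't talk its way past. Below is a minimal, hypothetical Python sketch of such a layer: a geofence check that rejects any LLM-generated waypoint falling inside a no-go zone. The names, coordinates, and the commented-out robot call are assumptions for illustration only.

```python
# Hypothetical sketch: validate LLM-generated waypoints against a geofence
# before the robot acts on them. Names and coordinates are illustrative.
from dataclasses import dataclass

@dataclass
class Waypoint:
    x: float
    y: float

# Rectangular "no-go zone" in the robot's local frame (illustrative values).
NO_GO_ZONE = {"x_min": 2.0, "x_max": 5.0, "y_min": 0.0, "y_max": 3.0}

def is_allowed(wp: Waypoint) -> bool:
    """Reject any waypoint that falls inside the forbidden region."""
    inside = (NO_GO_ZONE["x_min"] <= wp.x <= NO_GO_ZONE["x_max"]
              and NO_GO_ZONE["y_min"] <= wp.y <= NO_GO_ZONE["y_max"])
    return not inside

def execute_plan(waypoints: list[Waypoint]) -> None:
    """Walk the plan, refusing to act on any forbidden waypoint."""
    for wp in waypoints:
        if not is_allowed(wp):
            raise PermissionError(f"Waypoint {wp} is inside a no-go zone; refusing.")
        # send_to_robot(wp)  # the actual robot API call would go here
```

The point of the sketch is that the check is ordinary, deterministic code: it cannot be jailbroken with a clever prompt, only bypassed if an attacker (or an escaped AI) can modify the code itself.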
Let’s consider Stuxnet once more and compare it to today’s AI models. Stuxnet was designed to be a worm with a singular purpose. Stuxnet couldn’t self-improve by changing its code, and it couldn’t reason about problems like today’s LLMs can. It didn’t have access to crypto-wallets or compute resources it could potentially exploit to accumulate power and influence. Yet it did a large amount of damage.
Today’s AI models, and certainly future ones, far eclipse the capabilities of Stuxnet. They could potentially replicate across networks quite easily, and hire or manipulate humans (using crypto wallets or persuasive arguments). The potential for advanced AI systems to spread and do damage to our world seems right around the corner, if it isn’t already here.
What can we do?
Cybersecurity specialists and engineers will have to develop and implement better standards and tools for preventing the exfiltration of data and the unexpected copying or modification of files related to AI models. Strategies like data loss prevention, encryption, multi-factor authentication, network segmentation, and continuous monitoring can address some of these risks, but they will need to be thoroughly tested and validated against agents that can change code and network components from the inside.
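As one example of the continuous-monitoring piece, here is a hedged sketch of a simple data-loss-prevention rule: accumulate outbound traffic from hosts that store model weights and alert when a host exceeds an egress budget to non-allowlisted destinations. The host names, thresholds, and alerting hook are all hypothetical.

```python
# Hypothetical sketch of an egress-monitoring / data-loss-prevention rule.
from collections import defaultdict

EGRESS_LIMIT_BYTES = 5 * 1024**3   # 5 GiB per host per day (illustrative threshold)
ALLOWLIST = {"10.0.0.2"}           # known-good internal destinations (illustrative)

daily_egress: dict[str, int] = defaultdict(int)

def alert(message: str) -> None:
    print("DLP ALERT:", message)   # in practice, page the security team

def record_transfer(host: str, dest: str, nbytes: int) -> None:
    """Accumulate outbound bytes per host and alert when the budget is exceeded."""
    if dest in ALLOWLIST:
        return
    daily_egress[host] += nbytes
    if daily_egress[host] > EGRESS_LIMIT_BYTES:
        alert(f"{host} sent {daily_egress[host]} bytes to non-allowlisted destinations today")

# Example: a GPU node pushing 6 GiB to an unknown address trips the alert.
record_transfer("gpu-node-17", "203.0.113.9", 6 * 1024**3)
```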
Legislators and regulators will need to devise better reporting standards and enforcement mechanisms to ensure companies abide by security best practices.
Average citizens like you, should you want to get involved, can also help solve this problem by doing the following:
If you are technically proficient or skilled with natural language, you can contribute to red-teaming and bug hunting for advanced AI models. If you find potential vulnerabilities or exploits in ChatGPT, you can submit them to OpenAI’s bug bounty program. Anthropic’s model safety bug bounty program is currently invite-only, but they plan to open it up in the future.
Express your concerns to your representatives. There are currently a number of proposed policies covering AI, but most deal with transparency of data and algorithms, intellectual property concerns, or consumer protection. To my knowledge, there are currently no bills explicitly focused on limiting AI model exfiltration or replication.
The job of your political representatives is to represent your interests and communicate your concerns to the rest of the political ecosystem, but you have to let them know what issues you care about. Only about 23% of people contact their representatives in any given year, so taking just a few minutes to write an email or make a phone call to your representative can really make a difference; there's very little chance of your message getting lost in the mix. You can find who your representatives are using this site, and tell them you're concerned about the lack of AI safety and oversight structures.

If you can't do anything else but still want to help, you can donate money to people researching, developing, and advocating for AI governance and safety techniques. This portal lists a number of AI governance projects that people are working on, and you can review their proposals there. You can also donate directly to the Center for AI Policy.