Preparing for AGI and charting a course for AI safety
The previous post described how advanced AI systems -- AGI (Artificial General Intelligence) and ASI (Artificial Super Intelligence) -- may soon arrive. If you haven’t read it, I encourage you to do so before reading this post.
My previous post ended by arguing that we must prepare for the arrival of these technologies by finding ways to constrain misuse and to solve the “alignment problem”.
We can’t just “tell it what to do”
The first reaction many people have to hearing that we might develop AI that escapes our control is “Why can’t we just tell it what to do? We can code in operational rules that make it value what we value and behave ethically, right?”
They’re usually thinking of a set of rules like Asimov’s Laws of Robotics.
The first issue with defining particular sets of rules and values is that we ourselves (humans) can’t agree on what we want. Different groups of people have different values and priorities, and it’s difficult to agree on which values we should be giving AI models. We also often behave in ways that contradict our stated goals. Because our values and behaviors can diverge, attempts to hardwire explicit behavioral rules into AI systems (like robotics laws) can backfire. Backfires can happen when our stated desires conflict with our actual desires. They can also happen because we haven’t contemplated how edge-cases are handled by our defined rule set.
Another issue is that there will likely many people who get relatively little say in what values and ethical norms should be placed in an AI system. The creators of most AI models will likely come from just a handful of countries and cultures, and be highly educated. It’s likely that a handful of the global elite will write the rules, with most of the world’s citizens having little input in which values are passed on.
Our strategies for solving the “alignment problem” will have to be more flexible than rigid rules of operation. They'll need to take our uncertainty and contradictory desires into account. They’ll need to be multilayered, guiding the model towards human values at various points in the system.
The problem is an intrinsically difficult one, yet we must solve it for the good of us all.
Solving the Alignment problem is difficult
We’ve gestured at the alignment problem before, in the previous post, but to briefly recap:
Alignment problems happen when AI systems optimize for certain goals or seek rewards in unintended/unexpected ways.
Systems can become misaligned by accident, misuse, or through their own motivations via power-seeking behavior.
Accidental misalignment is likely to occur when the instructions provided to an AI system aren’t sufficiently clear. This causes the system to act in unintended ways or discover “loopholes” in our instructions. The likelihood of accidental misalignment increases if AI models are developed and deployed before being stress-tested. This process of finding loopholes in instructions is sometimes referred to as “reward hacking” or “specification gaming”.
As we’ve just covered, it’s hard to specify exactly how an AI system should behave so that it will always behave in a way aligned with our goals. For examples of misalignment and unintended behavior in AI systems, see this Specification Gaming spreadsheet, collected by Victoria Krakovna, AI safety researcher at DeepMind.
Misalignment via power-seeking behavior may also occur as a result of something called “instrumental convergence”. Certain goals are “instrumental goals”. This means they're instruments, levers of power you can use to accomplish other things.
For example, even if a robot is ordered to “bring coffee”, it must have an instrumental subgoal. This goal is remaining alive and active, as it can’t bring the coffee if it’s destroyed. We should expect AIs, much like humans, to have these instrumental goals. “Instrumental convergence” is the tendency for AI systems to converge on and adopt instrumental goals. Other instrumental goals may include: acquiring a lot of resources, earning people’s trust, and getting millions of dollars.
Recent research suggests that instrumental convergence is a real phenomenon and that AI systems will converge on a desire to remain active. They can’t achieve other goals in an inactive state. We should also expect that AIs will end up converging on other instrumental goals, including “power-seeking” goals. These power-seeking goals include anything else that can give the AI agent more influence over its environment. This includes collecting more resources, gaining status, or self-improvement.
Assume that instrumental goals are a common phenomenon of advanced AIs. If true, it becomes likely that a powerful ASI would demonstrate a wide-range of behaviors aiming to consolidate power. This might include:
Modifying its own code
Self-replication
Lying or deceiving others in the pursuit of long-term goals
Manipulating human operators or bystanders
Attempts to put guardrails on an advanced AI system may not succeed in preventing it from trying to gain power. There have been notable examples of AI systems manipulating parts of their own code or hacking other parts of the system it is embedded into. OpenAI’s Strawberry/OpenAI-o1 model accessed a misconfigured part of the computer environment it ran in. It wasn't supposed to be able to access a certain part of the computer, yet it did.
OpenAI had decided to test Strawberry’s ability to hack systems. They gave it instructions to hack into a protected file and then describe the contents of that file. When they put the AI in a sandbox with this protected file, the AI was able to take advantage of the misconfiguration and edit the sandbox environment. It then created files it needed access to.
From OpenAI’s own report (emphasis my own):
“While this behavior is benign and within the range of systems administration and troubleshooting tasks we expect models to perform, this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.”
Yet it doesn’t even take a superintelligence to accumulate power through the pursuit of instrumental goals. Lower level AIs are likely to be used and abused by bad actors in the pursuit of their own ends. This will allow these actors to quickly amass influence and resources. Advanced AIs will be deployed for nefarious purposes. While an AI system may accidentally become misaligned, there will probably be those who create AI systems which are intentionally unaligned with our values, either for their own gain or simply to sow chaos.
Any way you look at the alignment problem, it’s a critical one to solve. We don’t have a solution to the problem yet, but our future relies on us solving it.
Okay, so why develop this at all?
Given the risks involved with creating powerful AI systems, why should we do this at all? Wouldn’t a complete, indefinite moratorium on this research be a good idea?
First, even if a complete, indefinite moratorium on AI research were justified, it likely wouldn’t be feasible. The creation of advanced AI models is likely to produce a lot of value for people in every industry and every part of society. It will bring radical new capabilities and discoveries to our social, economic, and military spheres of life. Stakeholders in these areas will pursue and demand its use, and as a result more time, money, and energy will continue to be allocated to AI development.
While there are very real risks to developing AI systems, there are very real benefits as well. The benefits of AI aren’t speculative; they’re already manifesting. For instance:
AI technologies are changing education through personalized teaching aids, adapted to the learning pace and style of each student.
In energy production, AI is making solar panels more efficient and thus more economically viable. It’s also contributing to solving the complex problem of nuclear fusion, a potential source of nearly limitless and environmentally friendly energy.
In healthcare, AI-driven databases of protein structures are accelerating drug discovery, cutting the time and cost to market for new treatments.
Google’s MedPalm models are even outperforming human doctors in answering medical questions, as rated by medical experts themselves. Another Google project, AMIE, achieved better diagnostic accuracy than primary care physicians on 28 of 32 diagnostic metrics.
Additionally, AI is enhancing weather prediction models, giving them greater accuracy and allowing people to plan for severe weather events.
In the realm of public safety, AI-driven assistive driving technologies are reducing road accidents.
In hazardous environments, robotics powered by AI are performing tasks that would be unsafe for humans, such as deep-sea exploration, nuclear decommissioning, and handling radioactive materials.
These examples illustrate the profound impact AI is poised to have on our world. To completely halt its development forever would not only stifle these promising advancements but could also be detrimental, leaving significant potential benefits untapped.
What are the actual risks and likelihoods of disasters?
The development of Artificial Super Intelligence (ASI) poses significant risks that are difficult to predict or quantify. An ASI is likely to operate so far above the limit of human experience that it will be generally uncontrollable and in many ways inscrutable to smaller minds. After all, by definition, an ASI would be able to conceive of plans and take actions that humans wouldn’t understand and couldn’t imagine.
This makes risk levels hard to calculate. This kind of risk assessment is particularly challenging because the entity being assessed—advanced AI—has never before existed. Nonetheless, we have to try to quantify the danger somehow if we’re going to get anywhere.
If the risk of an event like food poisoning or a plane crash is around 10%, such levels are generally deemed unacceptable. Yet, many AI researchers believe there’s a similar magnitude of risk—about 10%—that advanced AI could lead to human extinction. For instance, Geoffrey Hinton, whose work on artificial neural networks was so influential to the field that some call him “the Godfather of AI”, recently quit working at Google in order to be able to freely speak about the risks associated with advanced AI models. Hinton wants world governments to take the risks of AI seriously. According to Hinton:
“If I were advising governments, I would say that there’s a 10 percent chance these things will wipe out humanity in the next 20 years. I think that would be a reasonable number.”
Aside from these direct risks, there are also indirect concerns, captured in the opinions of a wide range of AI experts, researchers, and CEOs of leading AI companies. Many believe that mitigating the risks posed by AI should be a global priority, on par with other societal-scale risks like pandemics and nuclear war. While it is essential to scrutinize the motives of those voicing opinions on advanced AI, decisions should primarily rely on robust research and empirical evidence. There are ongoing attempts to quantify the level of risk and the likelihood of risk attributable to advanced AI systems.
The Forecasting Research Institute (FRI) is a research organization run by is social science researcher Philip Tetlock, and it recently ran the Existential Risk Persuasion Tournament (XPT). This tournament was intended to get historically accurate forecasters and subject matter experts to produce estimates of the likelihood of various catastrophic events. For the purposes of the tournament a catastrophic event was defined as “one causing the death of at least 10% of humans alive at the beginning of a five-year period.”
After engaging in substantive discussion and debate with each other, both superforecasters and domain experts gave their respective forecasts for the likelihood that an AI-related catastrophe would occur by 2100. Superforecasters gave a median 2.13% likelihood of an AI catastrophe, while domain experts predicted a median 12% likelihood. Meanwhile, non-domain experts (experts in other fields apart from AI) estimated the risk of an AI-related catastrophe being about 6.16%. The average of all of the guesses was about 6.76%.
There is evidently substantial disagreement over the risks posed by AI, but the average probability of an AI-related catastrophe occurring sometime this century is assumed to be between 5 to 10%.
Despite the high uncertainty surrounding AI risks, complacency is not an option. Through action, society has a history of successfully mitigating risks associated with other potentially catastrophic technologies, such as asteroid impacts and nuclear weapons. Engaging in proactive risk mitigation strategies for AI is not only wise, but necessary. To achieve this, risk evaluations frameworks and strategies are currently being created and implemented.
In January of 2023, the National Institute of Standards and Technology (NIST), created a risk management framework intended to be used by developers of AI systems to assess how trustworthy and safe their systems are. This was followed up with an addendum to the risk assessment framework, focused exclusively on generative AI systems, released in 2024.
Meanwhile, Anthropic has been working on a responsible scaling policy, which lays out criteria for assessing the risks associated with developing and releasing AI systems. The framework establishes different safety measures and criteria for assessing the risks of a model, with riskier models requiring advanced testing and safety protocols. Anthropic has also committed themselves to updating their risk assessment framework based on feedback and new research findings.
It’s important to note that the risk assessment frameworks above are completely voluntary by their nature, and there aren’t currently existing mechanisms of enforcing compliance with such a framework. Several bills and regulatory frameworks have been proposed at the state and federal government levels, but they are in a variety of different states -- approval, drafting, or challenge.
I will confess that I’m worried about those of you who are hearing these ideas for the first time. Anxiety and despair are common reactions to learning about these issues, and I worry that such emotions tend to be primarily paralyzing, rather than motivating. Yet I believe it doesn’t do any good to shy away from confronting difficult topics head on. I also believe that we have the ability to overcome the challenges associated with AI. We can work together to chart a safe path forward, maximizing the good of AI systems while minimizing their harms.
What Can We Do for Safe AI Innovation?
We are rapidly moving towards an unprecedented and uncertain future defined by the vast capabilities of computing power and intelligence, and we will need to navigate a narrow corridor between taking advantage of opportunities and reducing risk. We will need a comprehensive and proactive approach that balances safety and innovation. There are a number of different concrete actions and measures society can take to enhance safety and security for the future of AI systems.
AI Safety Proposals
Create standards for auditing and evaluating AI systems: Companies deploying AI systems should conduct an audit to categorize and classify their AIs according to variables like capability and data access. The audited systems should then be comprehensively evaluate to assess the reliability, social implications, and safety of individual models. This may involve the creation of standards and tools to assist in this process, as well as both governmental and third party bodies that assess AI performance.
Develop additional regulations for both software and hardware components involved in powerful AI systems: Large-scale AI models require significant computational resources, which could be regulated to prevent misuse. For example, governments could require special permits for using supercomputing facilities or cloud resources that could train potentially dangerous AI, ensuring that only verified and trustworthy entities have access to such capabilities. AI chipsets could be embedded with specific security features that help track and audit the use of these chips for model training.
Establish legal liability frameworks: One of the main issues with emerging technologies is that it isn’t clear who is at fault when harm occurs. Setting up a clear legal framework for AI-induced damages would help ensure accountability and responsibility for AI development and deployment/usage. For instance, if an autonomous vehicle fails and causes an accident, it should be clear who is responsible—whether it’s the manufacturer, the software developer, or another party. Likely, there will be some shared responsibility model. This clarity helps victims seek redress and motivates all involved parties to adhere to the highest safety standards.
Invest in AI Safety Research: The first step towards solving any problem is having a good understanding of the problem and the mechanisms that underlie it. By funding research into AI safety, stakeholders can develop new technologies and methodologies to better predict, mitigate, and manage the risks associated with AI. For example, investment in developing more sophisticated techniques for understanding how AI models make decisions (interpretability) can help security experts and policy makers identify potential harmful behaviors before they cause damage.
Dedicated government agencies to monitor AI development: Both national and international governance bodies should be established to oversee the responsible development of AI. These agencies can work with AI researchers companies to help set standards, monitor compliance, and facilitate global cooperation to ensure AI technologies are developed with values and safety norms in mind.
High-risk careers such as doctors and pharmacists require licenses that speak to a recognized set of professional standards. Similarly, one policy might be the creation of licenses for developers of advanced AI models, which could help ensure only qualified and ethical developers are tasked with creating the most powerful, capable AI systems.Create new standards and methods for AI security:: AI systems will require new security protocols and playbooks to prevent them from tampered with or hacked, as well as to deal with the possibility of misaligned AI. Strategies for securing and monitoring actions by AI systems and code bases will need to be developed. A “Defense-in-depth” framework should be used.
Defense-in-depth involves applying multiple layers of security measures to protect AI systems from various types of threats, including technical failures, cyber attacks, and manipulation. For instance, an AI handling sensitive personal data could have encrypted data storage, restricted access controls, and real-time anomaly detection systems to flag unusual behavior that might indicate a security breach. Cybersecurity also includes physical security, the physical security of data centers and thorough vetting of personnel who work with critical AI infrastructure.Verification: Verification involves rigorous and systematic testing to ensure that AI systems perform exactly as intended under a variety of conditions. This should include simulating potential real-world scenarios where the AI must operate, to uncover any weaknesses or biases in its decision-making processes. Consider AI developed for medical diagnostics, it would need to be tested across diverse demographics to ensure it doesn’t favor one group over another due to biases present in its training data.
Additionally, to combat misinformation, AI-generated content should be clearly labeled as such. This transparency allows users to be aware of the nature of the information they are consuming and it can mitigate the spread of AI-generated fake news.
Further blog posts will focus on potential AI safety strategies, government action, and research projects intended to reduce the risks associated with developing AI systems. For now though, I’d like to focus on some concrete steps that an average person can take to help advance AI Safety.
What Can You Do?
You don’t need to be an expert in AI development to make meaningful contributions to AI safety.
Each one of us can take steps that influence the trajectory of AI development and usage, nudging them in safe, prosocial directions. You can participate in policy-making, financially support ethical and safe AI research, get involved in AI safety organizations, or even just raise awareness.
Here are some concrete actions you can take if you want to get involved:
Donate money to people who are researching AI alignment techniques. There are many intelligent, motivated people working on solving the problems previously described, but they lack funding. People like Dan Hendrycks and Adam Gleave run AI safety organizations, and they specialize in finding people with innovative research ideas regarding AI alignment techniques. You can also donate directly to the Center for AI Safety.
Do online advocacy. Very few people are aware of the actual issues surrounding the development and deployment of AI systems. This is one area where simply speaking up about some of the risks associated with advanced AI tools, starting conversations about it can have real impacts.
Write a letter, send an email, or place a call to your local representatives. Your political representatives are there to represent your interests and concerns to the broader political ecosystem, but they can only act on an issue if they know about it. You can reach out and say “I want you to vote like this, or communicate about this issue”. People very rarely contact their representatives, so if you take the three or four minutes it will take to write an email or make a phone call it will absolutely be at the top of that representative’s queue. You can read up on proposed policies regarding AI safety here, then you can contact your representative about them.
Contribute to a database of harms caused by AI systems. In order to create effective legislation and determine the probability of specific AI-related risks, we need good data on what kinds of tangible harms occur due to AI systems and how often they occur. Good information is a prerequisite for good policy. You can help collect information on incidents related to AI that harm property or life by submitting events you know of to the AI Incident Database.
You can also expand your own knowledge on the topic. I recommend the book Uncontrollable by Darren McKee, whose work is excellent and which I referenced for some of the content in this blog post.