Caveats for the rest of this page:
- We expect and encourage applications on topics we haven’t listed. Our basic aim is to enable excellent researchers to think through for themselves which kinds of work are most likely to be valuable, both independently and as part of a community. We expect that AI Fellows will develop many promising research directions beyond those we happen to have listed here; we’re providing these topics as examples of research directions that seem promising to us, in the hope that they will be helpful to applicants.
- This topic list should not be mistaken for a literature review. Within each topic, we give a few examples of papers on that topic. These are probably not the best or most representative citations one would choose for a literature review; the papers were chosen because the authors of this page had read them or because a researcher suggested them to us, and in most topics our citations likely over-represent recent papers, deep learning papers, and papers by authors we have talked to. We strongly expect that AI Fellows will find relevant threads of work that are very different from the papers we list here.
The topics we list below are centered around “AI alignment,” the problem of creating AI systems that robustly try to do what their designers intended, even when AI systems become much more capable than their designers across a broad range of tasks. Several research groups are working on this problem, including the Center for Human-Compatible AI, Jacob Steinhardt’s lab at UC Berkeley, teams at DeepMind and OpenAI [1, 2], the AI safety and research company Anthropic, the Machine Intelligence Research Institute, the AI Safety Research Group at the Future of Humanity Institute, and the Alignment Research Center.
We believe that AI alignment is important because we expect that progress in AI will eventually produce systems that are much more capable than any human or group of humans across a broad range of tasks, and that there is a reasonable chance this will happen in the next few decades. If that happens, AI systems acting on behalf of their designers would probably come to be responsible for the vast majority of human influence over the world, likely causing dramatic and hard-to-predict changes in society. It would then be very important for AI systems to robustly try to do what their designers intended. The risks highlighted by Nick Bostrom in the book Superintelligence are one example of the potential consequences of failing to align sufficiently transformative AI systems with human values, although given the huge variety of possible futures and the difficulty of making specific predictions about them, we think it’s worthwhile to consider a wide range of possible scenarios and outcomes beyond those Bostrom describes. To be satisfying, a method for making AI systems that “robustly try to do what their designers intended” should apply to these kinds of highly capable and influential AI systems.
What kinds of research seem most likely to be important for AI alignment? We approach this question by imagining that such powerful AI systems are built using the general methodologies most often used today, asking how those methods might fail to produce aligned AI systems, and trying to identify the kinds of research progress most likely to mitigate those failures.
To give you an idea of the kinds of research that currently seem promising to us, we’ve gathered a non-exhaustive list of example research topics from our technical advisors (machine learning researchers at OpenAI, Google Brain, and Stanford) and from other AI and machine learning researchers interested in AI alignment. We found that many of these topics fit into three broad categories:
- Reward learning: Most AI systems today are trained to optimize a well-defined objective (e.g. reward or loss function); this works well in some research settings where the intended goal is very simple (e.g. Atari games, Go, and some robotics tasks), but for many real-world tasks that humans care about, the intended goal or behavior is too complex to be specified directly. For very capable AI systems, pursuit of an incorrectly specified goal would not only lead an AI system to do something other than what we intended, but could lead the system to take harmful actions – e.g. the oversimplified goal of “maximize the amount of money in this bank account” could lead a system to commit crimes. If we could instead learn complex objectives, we could apply techniques like reinforcement learning to a much broader range of tasks without incurring these risks. Can we design training procedures and objectives that will cause AI systems to learn what we want them to do? (A minimal sketch of one approach, learning a reward model from human preference comparisons, appears after this list.)
- Reliability: Most training procedures optimize a model or policy to perform well on a particular training distribution. However, once an AI system is deployed, it is likely to encounter situations that are outside the training distribution or that are adversarially generated in order to manipulate the system’s behavior, and it may perform arbitrarily poorly on these inputs. As AI systems become more influential, reliability failures could be very harmful, especially if failures result in an AI system learning an objective incorrectly. Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially? (The adversarial-perturbation sketch after this list shows one simple way such inputs can be constructed.)
- Interpretability: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the dynamics of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. Interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process. (The saliency sketch after this list shows one simple gradient-based attribution method.)
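As one concrete illustration of reward learning, here is a minimal sketch of fitting a reward model to pairwise human preference comparisons using a Bradley-Terry style loss, written in PyTorch. The model architecture, dimensions, and data below are illustrative stand-ins rather than a description of any particular system or paper.

```python
import torch
import torch.nn as nn

# Toy reward model: maps a state-action feature vector to a scalar reward.
class RewardModel(nn.Module):
    def __init__(self, obs_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, prefer_a):
    """Bradley-Terry style loss on pairs of trajectory segments.

    seg_a, seg_b: tensors of shape (batch, timesteps, obs_dim)
    prefer_a: tensor of shape (batch,), 1.0 where the human preferred seg_a.
    """
    r_a = reward_model(seg_a).sum(dim=1)   # total predicted reward of segment a
    r_b = reward_model(seg_b).sum(dim=1)   # total predicted reward of segment b
    logits = r_a - r_b                     # log-odds that a is preferred over b
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)

# One training step on random stand-in data (real data would come from human comparisons).
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(16, 25, 8), torch.randn(16, 25, 8)
prefer_a = torch.randint(0, 2, (16,)).float()
loss = preference_loss(model, seg_a, seg_b, prefer_a)
opt.zero_grad()
loss.backward()
opt.step()
```

A reward model trained this way could then stand in for a hand-written reward function when training a policy with standard reinforcement learning.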
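For reliability, one well-studied failure mode is that a small, adversarially chosen perturbation of an input can change a model’s prediction even though the perturbed input is barely distinguishable from the original. Below is a minimal sketch of the fast gradient sign method for constructing such a perturbation; the model and data are stand-ins for illustration only.

```python
import torch
import torch.nn as nn

# Stand-in classifier; in practice this would be a trained model.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()

def fgsm_attack(model, x, label, epsilon=0.1):
    """Fast gradient sign method: perturb x within an L-infinity ball of
    radius epsilon so as to increase the model's loss on the true label."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), label)
    loss.backward()
    # Step each coordinate in the direction that most increases the loss.
    return (x + epsilon * x.grad.sign()).detach()

x = torch.randn(1, 4)        # stand-in input
label = torch.tensor([2])    # stand-in true label
x_adv = fgsm_attack(model, x, label)
print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))  # predictions may now disagree
```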
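For interpretability, one very simple starting point is gradient-based attribution: asking which input features a model’s output is most sensitive to. The sketch below computes input-gradient saliency for a stand-in model; it is meant only as an illustration of the kind of tooling this category is about, not as a recommended method.

```python
import torch
import torch.nn as nn

# Stand-in model; any differentiable model would work here.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

def input_gradient_saliency(model, x, target_class):
    """Return |d score / d x_i| for each input feature i: a crude picture of
    which features the model's score for target_class depends on most."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    return x.grad.abs().squeeze(0)

x = torch.randn(1, 10)
saliency = input_gradient_saliency(model, x, target_class=1)
print(saliency)  # larger values = features the prediction is most sensitive to
```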
It’s important to note that not all of the research topics that seem promising to us fit obviously into one of these categories, and we expect researchers to find more categories and research topics in the future.
Some additional problems and research directions we’re interested in can be found in: