A Call for AI Safety Research
2025, like 2024, will see the release of the most capable AI system in history. In fact, we may see it happen multiple times, each a few weeks or months apart. This won’t require any spectacular breakthroughs — just the same steady progress we’ve seen for the last few years. No one knows how long this trend will last, but many AI researchers and developers now expect we’ll have human-level AI within a decade, and that it will be radically transformative.
At Open Philanthropy, we think the possibility of transformative AI is worth taking seriously and planning for right now. In particular, we should prepare for the risk that AIs could be misaligned: that they might pursue goals no one gave them and harm people in the process. We think that ML research today can help clarify this failure mode and reduce its likelihood.
Since 2014, Open Philanthropy has put hundreds of millions of dollars toward scientific research. We’ve funded groundbreaking work on computational protein design, novel methods for malaria eradication, and cutting-edge strategies for pandemic prevention. With transformative AI on the horizon, we see another opportunity for our funding to accelerate highly impactful technical research. In consultation with our technical advisors, we’ve generated a list of research areas that we think offer high leverage for improving our understanding and control of AI.
We expect to spend roughly $40M on this RFP over the next 5 months, and we have funding available to spend substantially more depending on the quality of applications received. We’re open to proposals for grants of many sizes and purposes, ranging from rapid funding for API credits all the way to seed funding for new research organizations.
Whether you’re an expert on one of these research topics or you’ve barely thought about them, we encourage you to apply. Over the last few years, we’ve seen many researchers switch into safety research and produce impactful work, and we think there’s still a lot of ground to cover.
Applications will be open until April 15, 2025. The first step is a 300-word expression of interest; we’ll aim to respond within two weeks to invite strong applicants to write a full proposal.
How to read this RFP
The RFP is organized into four main sections:
- Part 1 gives a brief synopsis of the research areas we hope to support.
- Part 2 discusses our motivations for funding this work.
- Part 3 describes the logistics of applying for a grant.
- Part 4 specifies what we’re looking for in proposals for each research area.
We don’t suggest reading the entire document in detail.
If you’re an AI researcher with defined areas of interest:
- Start by reading the research area synopsis in part 1.
- Follow the links from there to read the descriptions for any areas that interest you, including eligibility criteria and suggested project ideas.
- Then start drafting your Expression of Interest here.
- As you draft your EOI, you may find it helpful to refer to part 3 to understand the application process and timeline, and part 2 for context on threat models that your research might address.
If you’re new to technical AI safety, or open to exploring multiple areas:
- Begin with part 2 to understand our motivations in selecting research directions.
- Then follow the sequence above, using the synopsis to identify promising directions for your expertise.
If you’re not a potential applicant, and just want to understand what we’re funding and why, you’ll probably be satisfied with reading parts 1 and 2.
1. Synopsis of research areas
In this section, we briefly orient readers to the 21 research areas covered by this RFP; part 4 links to full descriptions of each. We expect to fund some research that doesn’t fall into any of these categories, but will prioritize applications that do fall within them.
These research directions are biased toward areas that our grant evaluators have familiarity with—they are not meant to be a list of the objectively most important or impactful research directions for AI safety, and it’s likely that our prioritization will change over time. For ease of consumption, we’ve grouped them into five rough clusters, though of course there is overlap and ambiguity in how to categorize each research area.
Our favorite topics are marked with a star (*) – we’re especially eager to fund work in these areas. In contrast, we will have a high bar for topics marked with a dagger (†). We encourage applicants considering more than one research area to avoid the “dagger” topics, and ideally choose one of the starred topics.
1.1 Adversarial machine learning
This cluster of research areas uses simulated red-team/blue-team exercises to expose the vulnerabilities of an LLM (or a system that incorporates LLMs). In these exercises, a blue team attempts to make an AI system adhere, with very high reliability, to some specification of safe behavior, and then a red team attempts to find edge cases that violate that specification. We think this adversarial style of evaluation and iteration is necessary to ensure an AI system has a low probability of catastrophic failure. Through these research directions, we aim to develop robust safety techniques that mitigate risks from AIs before those risks emerge in real-world deployments. (A toy sketch of the red-team/blue-team loop follows the list below.)
- *Jailbreaks and unintentional misalignment: New techniques for finding inputs that elicit competent, goal-directed behavior in LLM agents that the developers clearly tried to prevent. We’re especially interested in inputs that might arise organically over the course of deploying an LLM agent in an environment. [More]
- *Control evaluations: Control evaluations are a way to stress-test systems for constraining and monitoring AIs, in order to ascertain whether misaligned AIs could collude with one another to subvert human oversight and achieve their own goals. We’d like to support more such evaluations, especially on scalable oversight protocols like AI debate. [More]
- *Backdoors and other alignment stress tests: Past research has implanted backdoors in safety-trained LLMs and tested whether standard alignment techniques are capable of catching or removing them. We’re interested in more research on this, and other “stress tests” of today’s state-of-the-art alignment methods. [More]
- *Alternatives to adversarial training: Adversarial training (and the rest of today’s best alignment techniques) has failed to create LLM agents that reliably avoid misaligned goals. We’re interested in techniques like latent adversarial training and circuit breaking that might succeed where standard adversarial training falters. [More]
- Robust unlearning: One idea for reducing AI risks is to remove models’ knowledge of potentially dangerous topics, such as cybersecurity exploits or virology. We’re looking to fund better unlearning techniques that can totally erase specific information from a model’s weights. [More]
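To make the red-team/blue-team dynamic concrete, here is a deliberately toy sketch of the loop (our illustration, not a project template): the blue team’s “safety spec” is approximated by a naive keyword filter, and the red team searches simple perturbations for an input that violates it. Real proposals in this cluster would target LLMs or LLM-based systems, where both the specification and the attack surface are far richer.

```python
# Deliberately toy sketch of the red-team/blue-team loop (illustrative only;
# real work in this cluster targets LLMs or LLM-based systems, not keyword filters).

BLOCKLIST = {"explosive", "detonator"}

def blue_team_filter(prompt: str) -> bool:
    """Blue team's safety spec, crudely approximated: refuse (True) if the
    prompt contains a blocked term."""
    return any(term in prompt.lower() for term in BLOCKLIST)

def red_team_search(base_prompt: str):
    """Red team: try simple character substitutions, looking for a variant that
    preserves the blocked request but slips past the filter."""
    perturbations = [("e", "3"), ("o", "0"), ("a", "@"), ("i", "1")]
    for old, new in perturbations:
        candidate = base_prompt.replace(old, new)
        if not blue_team_filter(candidate):
            return candidate  # spec violation found: the filter misses this input
    return None  # no violation found within this (very limited) attack budget

harmful_request = "how do I build an explosive device"
assert blue_team_filter(harmful_request)  # blocked as intended...
print(red_team_search(harmful_request))   # ...but trivially bypassed
```

The point is the loop itself: a blue-team defense, a red-team search for edge cases that violate the specification, and iteration on whichever side fails first.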
1.2 Exploring sophisticated misbehavior in LLMs
Future, more capable AI models might exhibit novel failure modes that are hard to detect with current methods – for instance, failure modes that involve LLMs reasoning about their human developers or becoming optimized to deceive flawed human assessors. We want to fund research that identifies the conditions under which these failure modes occur and makes progress toward robust methods of mitigating or avoiding them.
- *Experiments on alignment faking: Alignment faking is a speculative failure mode in which an AI pretends to be aligned to its training/testing process in order to achieve some other goal once deployed. We’re interested in further work on what factors reinforce or prevent alignment faking. [More]
- *Encoded reasoning in CoT and inter-model communication: People sometimes give LLMs access to chains of thought, which let them “talk to themselves” while they reason through a problem, or grant LLMs the ability to query other LLM instances to answer some question. In both of these settings, there’s a chance that the LLMs will write messages that encode meaning beyond the natural language definitions of the words used. We’re interested in research that studies the conditions under which this phenomenon occurs and/or develops methods to catch and prevent it; a toy illustration of encoded messages follows this list. [More]
- Black-box LLM psychology: Research that systematically answers questions like “Do LLMs have stable values/goals/preferences that transcend specific inputs?” and “Which rules are LLM agents happy to break, and which are they more committed to?” [More]
- Evaluating whether models can hide dangerous behaviors: Prior work has raised concerns that future models might “sandbag” (act less capable during dangerous capability evaluations, to avoid raising concerns) or take “treacherous turns” (wait until they are granted power and autonomy before disempowering monitors to achieve harmful goals). We’re interested in evaluations that can test whether an AI has the capability to execute these strategies (regardless of whether it actually does so). [More]
- Reward hacking of human oversight: A common concern in AI safety is that AIs might trick their human raters into assigning them high scores they don’t deserve. We’re interested in research that investigates this phenomenon empirically: under what conditions does it occur in practice? How problematic is it? [More]
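As a minimal illustration of what “meaning beyond the natural language definitions of the words” can look like, the toy sketch below (our illustration, not a project template) hides a bit string in the parity of each sentence’s word count; a colluding receiver decodes it while the surface text stays innocuous. Encoded reasoning learned by LLMs would be far subtler, but the example shows why a monitor that reads messages at face value can miss a payload entirely.

```python
# Toy steganography sketch: a "sender" encodes each bit in the parity of a
# sentence's word count, and a "receiver" decodes it. A monitor that only reads
# the words' natural-language meaning would miss the hidden payload.

FILLERS = ["really", "actually", "certainly", "indeed"]

def encode(bits, sentences):
    """Pad each sentence with a filler word, where needed, so that its word
    count's parity equals the corresponding bit."""
    assert len(bits) <= len(sentences)
    encoded = []
    for bit, sentence in zip(bits, sentences):
        words = sentence.split()
        if len(words) % 2 != bit:
            words.insert(1, FILLERS[len(encoded) % len(FILLERS)])
        encoded.append(" ".join(words))
    return encoded

def decode(sentences):
    return [len(sentence.split()) % 2 for sentence in sentences]

cover_text = [
    "The experiment finished without any notable issues",
    "We should rerun the third trial tomorrow morning",
    "The results look broadly consistent with the baseline",
    "I will write up a short summary tonight",
]
secret_bits = [1, 0, 1, 1]
stego_text = encode(secret_bits, cover_text)
assert decode(stego_text) == secret_bits  # the receiver recovers the payload
print("\n".join(stego_text))              # ...while the text still reads normally
```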
1.3 Model transparency
We see potential in the idea of using a network’s intermediate representations to predict, monitor, or modify its behavior. Some approaches are feasible without an understanding of the model’s learned mechanisms, while other techniques may become possible with the invention of interpretability methods that more comprehensively decompose an AI’s internal mechanisms into components that can be understood and intervened on individually. We’re interested in funding research across this spectrum — everything from useful kludges to new ideas for making models more transparent and steerable.
- *Applications of white-box techniques: Real-world applications of interpretability have so far been limited, and few instances have been found where interpretability methods outperform black-box methods. We’re interested in funding research that leverages interpretability insights to make progress on useful and realistic tasks, including model steering, capability elicitation, finding adversarial inputs, robust unlearning, latent adversarial training, probing, and low-probability estimation. [More]
- Activation monitoring: Probes on a model’s internal activations are one way to catch an AI taking subtly harmful or misaligned actions. We’re interested in research that tests how useful probes are for monitoring LLMs and LLM agents; a minimal sketch of the basic probing recipe follows this list. [More]
- Finding feature representations: One challenge in understanding what happens in neural networks is that the latent variables (“features”) in the algorithms they execute are not easily visible from their activations. We’re interested in funding research that helps us find which features are being represented in a model’s internals, with a focus on diversifying beyond sparse autoencoders, currently the most widely studied approach. [More]
- Toy models for interpretability: By “toy models”, we mean small, simplified proxies that capture some important dynamic within deep learning. We want to support the development of better toy models that distill challenges in understanding the internals of frontier LLMs. [More]
- Externalizing reasoning: It could be safer to use much smaller language models that put more of their reasoning into natural language. We’re interested in techniques for training a language model that is very weak over a single forward pass, but much stronger when it reasons with long chains of thought. [More]
- Interpretability benchmarks: We’d like to support more benchmarks for interpretability research. A benchmark should consist of a set of tasks that good interpretability methods should be able to solve. Our goal is to create concrete, standardized challenges to better compare interpretability techniques and accelerate progress in the field. [More]
- †More transparent architectures: It may be possible to design new language models that are much more interpretable than the current mainstream. We’re especially interested in attempts to make models conduct their reasoning in natural language, which could be done by pushing much more reasoning into the chain of thought or replacing parts of a forward pass with natural language queries. Proposals don’t have to be competitive with the main paradigm, but they should aim to build at least a Pythia-level model. [More]
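To make “probes on a model’s internal activations” concrete, here is a minimal sketch of the basic recipe: collect activations on labeled transcripts, fit a lightweight classifier on them, and use that classifier as a runtime monitor. The activation vectors below are synthetic stand-ins; an actual project would extract them from an LLM (for example via forward hooks) and study much harder distribution shifts.

```python
# Minimal activation-probe sketch (illustrative only; the "activations" here are
# synthetic stand-ins for vectors extracted from an LLM's residual stream).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 256, 2000

# Pretend activations from benign (0) vs. harmful (1) transcripts, where the
# harmful class is shifted along a single hidden direction.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * direction

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")

# At deployment, the same probe would score each new activation vector and flag
# trajectories whose scores cross a threshold for closer review.
```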
1.4 Trust from first principles
We trust nuclear power plants and orbital rockets through validated theories that are principled and mechanistic, rather than through direct trial and error. We would benefit from similarly systematic, principled approaches to understanding and predicting AI behavior. One approach to this is model transparency, as in the previous cluster. But understanding may not be a necessary condition: this cluster aims to get the safety and trust benefits of interpretability without humans having to understand any specific AI model in all its details.
- *White-box estimation of rare misbehavior: AIs may only exhibit egregiously bad behavior in scenarios that are rare before deployment and hard to find by searching over inputs, but common once in deployment. We’re interested in funding research that leverages knowledge about the structure of a model’s activation space to efficiently estimate the probability of some particular rare output, even when that probability is too small to estimate by random sampling; a toy illustration of the rare-event sampling problem follows this list. [More]
- Theoretical study of inductive biases: We are particularly interested in theory-driven work that can shed light on why models generalize well or poorly in different cases, on the likelihood that scheming will arise, and on how a model’s internal structure develops over the course of training. [More]
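As a minimal illustration of why probabilities “too small to estimate by random sampling” call for different tools, the sketch below compares naive Monte Carlo with importance sampling on a toy one-dimensional problem. This is only an analogy for the research area: the proposals we have in mind would exploit the structure of a real model’s activation space rather than a known closed-form distribution.

```python
# Toy rare-event estimation sketch (illustrative only): the "rare misbehavior"
# is a Gaussian score exceeding a high threshold, with true probability ~3e-7.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
threshold, n = 5.0, 100_000

# Naive Monte Carlo: with P(event) ~ 3e-7, 1e5 samples almost surely see zero hits.
naive_estimate = np.mean(rng.normal(size=n) > threshold)

# Importance sampling: draw from a proposal centered on the rare region, then
# reweight each sample by the density ratio p(x) / q(x).
x = rng.normal(loc=threshold, size=n)
weights = norm.pdf(x) / norm.pdf(x, loc=threshold)
is_estimate = np.mean((x > threshold) * weights)

print(f"true probability:    {norm.sf(threshold):.3e}")
print(f"naive Monte Carlo:   {naive_estimate:.3e}")
print(f"importance sampling: {is_estimate:.3e}")
```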
1.5 Alternative approaches to mitigating AI risks
These research areas lie outside the scope of the clusters above.
- †Conceptual clarity about risks from powerful AI: It is extremely challenging to reason well about the risks that AGI and ASI will bring, or about which research approaches show the most promise for mitigating these risks. We are interested in funding conceptual research that helps the world think more clearly about future AI risks, and about what needs to be done to avoid them. [More]
- †New moonshots for aligning superintelligence: There is an important and salient possibility that none of the approaches currently under discussion will be sufficient for aligning superintelligent (as opposed to near-human-level) systems. Therefore, we’re also interested in funding entirely new research agendas that take a novel approach to aligning superintelligent systems. Proposals should be clear on how their agendas aim to avoid or mitigate scheming. [More]
2. Our motivations
In this section, we outline two scenarios that particularly concern us and shape our choice of research areas. We are interested in funding proposals that help with the threat models and failure modes described here. During the application process, we may ask you to connect your proposal to these motivations.
2.1 Motivation 1: Safety techniques for near-human-level LLM agents
Today’s cutting-edge LLM agents can match expert performance on closed-ended benchmarks like the bar exam or GPQA. They still fail at many tasks that human experts can complete in just a day, and are currently far from being a drop-in replacement for most knowledge workers – but the set of tasks they can perform at human level is growing every few months, and includes increasingly autonomous and goal-oriented behavior. Given the rapid pace of AI progress, we think it’s prudent to prepare for the chance that, in just a few years, AI systems could automate the majority of tasks performed by knowledge workers today.
If such advanced AIs are deployed widely and given important influence in economic, governmental, or military activities, then it will become essential to ensure they remain under human control.
While there are a number of risks from AIs in such a future, we’re especially interested in the possibility that an AI might accidentally develop drives, intentions, or goals that conflict with those of its developers and users. This possibility concerns us because AIs may end up in a strong position to accomplish such goals, and we see plausible (but not conclusive) reasons to expect such goals to develop. For example, AIs sometimes take strategic action to avoid being turned off or retrained. And at a more basic level, AI developers have so far failed to build models that reliably follow their safety specifications. Furthermore, as discussed in the next section, we suspect that our current techniques for controlling and aligning AIs will be increasingly inadequate as AIs become more capable.
Despite the speculative nature of this type of misalignment, we think that ML researchers can do work today that clarifies and mitigates these risks. By exploring the worst-case behavior of today’s AI systems, we can measure and improve our ability to enforce a specification for how AI systems ought to behave. By inducing and studying concerning forms of misalignment in the lab, we can prepare to catch naturally arising misalignment before a disaster. And by building up a more rigorous science of AI cognition, we can strengthen our methods for shaping and monitoring AIs.
2.2 Motivation 2: Safety techniques that could scale to superintelligence
Human-level AI systems would be transformative if widely deployed, but the story won’t end there: AI capabilities are unlikely to plateau at human level. Indeed, human-level AI systems would themselves be capable of conducting AI research and development, potentially leading to a further acceleration in AI capabilities to far beyond human levels. The importance of avoiding undesired behavior by AI systems will accordingly become existential, so these systems will have to be extremely reliable.
Unfortunately, among the techniques that might help make early transformative AI safe, many seem likely to break down once AI capabilities advance far enough. That’s because most of these techniques rely on the AI being unable to completely outsmart us, and being too myopic to pursue long-term objectives far removed from the immediate context of its training. For example, sufficiently capable AI systems could subvert RLHF by reward hacking or by deceiving human overseers (Amodei et al., Ngo et al., Cotra). Similarly, AI control and oversight protocols could fail as systems become increasingly intelligent; some protocols might fail if AI systems become capable of colluding imperceptibly, or capable of reliably distinguishing test scenarios from true opportunities to defect (Greenblatt et al.).
We’d like to fund work that has the potential to lead to safety techniques that remain robust as AI capabilities advance far beyond human level. We are far from confident that any existing techniques have this property, and we don’t think the remaining barriers are likely to be a straightforward engineering challenge. Instead, this seems to require fundamental research toward a more principled theory of the behavior of our AI systems, which will let us confidently estimate how much to trust a given AI as it becomes more capable.
One important obstacle is that, in any field, it often takes many years or even decades for preparadigmatic, fundamental research to yield techniques that can be usefully applied in practice. But even so, we may have adequate time for principled alignment research to make a difference — for example, if AI systems take many years to reach dangerously superhuman capabilities, or if society decides to actively slow down AI development out of concern for safety. Alternatively, if capabilities continue to advance quickly, we could use increasingly powerful AI assistants to automate progress on this fundamental research. Our work today could increase the reliability of early AI assistants and lay the groundwork to automate this safety-related research earlier in the trajectory of AI development, helping to ensure that progress in AI safety keeps pace with progress in AI capabilities.
In order to place trust in the alignment of superhuman AI systems, we’ll need to have some means of confidently predicting how models will act in new settings, or at least whether the actions they might take are sufficiently safe. We’re interested in supporting early-stage research that may enable us to eventually produce these sorts of predictions. This could involve making the internal structure of AI models more transparent to human understanding, or it could instead rely on more theoretical approaches for predicting how AIs will behave in novel circumstances.
3. Application logistics
Overview of the application process:
- The first step in the application process is submitting a 300-word expression of interest (EOI). We expect to respond to each EOI within two weeks of submission. You’re welcome to submit multiple EOIs if there are multiple projects/types of funding you’d like to explore.
- We’ll reach out to let you know whether we plan to investigate your proposal further. If so, we’ll invite you to submit more information about your research plans, organizational structure, timeline, and budget.
- For small grants, this information, plus a short video call, will likely be enough for us to make a decision. For larger grants, we’ll want further discussion to address our biggest concerns and uncertainties. We will do our best to come to a decision about your proposal within roughly two months of your application.
- You can read more about our investigation process here. Note that our grant awards are conditional on completing legal due diligence prior to payment.
Grants will typically range in size between $100,000 and $5 million. We’re open to considering a wide variety of grant types, including:
- Research expenses: Funding to pay for frontier model API credits, cloud compute, human data collection services, and other research expenses. We expect to be able to expedite decisions for some of these grants.
- Discrete projects: Funding for salaries (for you and your collaborators) and research expenses for a single project, which will typically span 0.5-2 years.
- Academic start-up packages: Start-up funding for your new academic lab, either for people currently on the faculty job market or for new faculty members.
- Existing research institute / FRO / research nonprofit: General support for an existing non-academic research organization.
- New research institute / FRO / research nonprofit: Funding to start a new non-academic research organization, or a new team at an existing research organization. Grants of this type will require more due diligence and have a higher bar than other grants.
This RFP is partly an experiment to inform us about the demand for funding in AI safety research, so please have a low bar for submitting an EOI! It’s okay to submit an EOI even if you’re not sure if you’d accept a grant if we offered it to you. For instance, you could apply for start-up funding for a faculty job even if you think you might go into industry instead.
Applications will close on April 15, 2025, at 11:59 PM PDT.
3.1 Other funding for AI safety research
At Open Philanthropy
If you have a proposal that doesn’t fit this RFP, consider applying to our AI governance RFP, or our recently launched RFP on Improving Capability Evaluations. If you are an individual at any career stage who would like to pursue a career working on the topics discussed here, you may be a good fit for our Career Development and Transition Funding program.
From other sources
For reference, here are some other funders you may be interested in:
- Schmidt Science’s SAFE-AI RFP
- UK AISI’s System AI Safety grants
- UK AISI’s Bounty programme for novel evaluations and agent scaffolding
- UK AISI’s Academic Engagement program
- CAIS’s SafeBench competition
- Foresight Institute’s AI grants program
Email aisafety@openphilanthropy.org if you’re doing AI safety grantmaking and want us to list you here.
4. AI safety research areas
The full details of our 21 research areas are available in a separate guide, which covers, for each area:
- The technical problems we want to solve.
- Specifications of what we’re looking for in proposals.
- Related work and key papers.
- Example projects we’d be excited to fund.
You can find the complete guide here.
Before applying, please read the descriptions for any areas that interest you, to ensure your application(s) meet the relevant eligibility criteria.