A Call for AI Safety Research
2025, like 2024, will see the release of the most capable AI system in history. In fact, we may see it happen multiple times, each a few weeks or months apart. This won’t require any spectacular breakthroughs — just the same steady progress we’ve seen for the last few years. No one knows how long this trend will last, but many AI researchers and developers now expect we’ll have human-level AI within a decade, and that it will be radically transformative.
At Open Philanthropy, we think the possibility of transformative AI is worth taking seriously and planning for right now. In particular, we should prepare for the risk that AIs could be misaligned: that they might pursue goals no one gave them and harm people in the process. We think that ML research today can help clarify this failure mode and reduce its likelihood.
Since 2014, Open Philanthropy has put hundreds of millions of dollars toward scientific research. We’ve funded groundbreaking work on computational protein design, novel methods for malaria eradication, and cutting-edge strategies for pandemic prevention. With transformative AI on the horizon, we see another opportunity for our funding to accelerate highly impactful technical research. In consultation with our technical advisors, we’ve generated a list of research areas that we think offer high leverage for improving our understanding and control of AI.
We expect to spend roughly $40M on this RFP over the next 5 months, and we have funding available to spend substantially more depending on the quality of applications received. We’re open to proposals for grants of many sizes and purposes, ranging from rapid funding for API credits all the way to seed funding for new research organizations.
Whether you’re an expert on one of these research topics or you’ve barely thought about them, we encourage you to apply. Over the last few years, we’ve seen many researchers switch into safety research and produce impactful work, and we think there’s still a lot of ground to cover.
Applications will be open until April 15, 2025. The first step is a 300-word expression of interest; we’ll aim to respond within two weeks to invite strong applicants to write a full proposal.
How to read this RFP
The RFP is organized into four main sections:
- Part 1 gives a brief synopsis of the research areas we hope to support.
- Part 2 discusses our motivations for funding this work.
- Part 3 describes the logistics of applying for a grant.
- Part 4 specifies what we’re looking for in proposals for each research area.
We don’t suggest reading the entire document in detail.
If you’re an AI researcher with defined areas of interest:
- Start by reading the research area synopsis in part 1.
- Follow the links from there to read the descriptions for any areas that interest you, including eligibility criteria and suggested project ideas.
- Then start drafting your Expression of Interest here.
- As you draft your EOI, you may find it helpful to refer to part 3 to understand the application process and timeline, and part 2 for context on threat models that your research might address.
If you’re new to technical AI safety, or open to exploring multiple areas:
- Begin with part 2 to understand our motivations in selecting research directions.
- Then follow the sequence above, using the synopsis to identify promising directions for your expertise.
If you’re not a potential applicant, and just want to understand what we’re funding and why, you’ll probably be satisfied with reading parts 1 and 2.
1. Synopsis of research areas
In this section, we briefly orient readers to the 21 research areas covered by this RFP; part 4 links to full descriptions of each. We expect to fund some research that doesn’t fall into any of these categories, but will prioritize applications that do fall within them.
These research directions are biased toward areas that our grant evaluators have familiarity with—they are not meant to be a list of the objectively most important or impactful research directions for AI safety, and it’s likely that our prioritization will change over time. For ease of consumption, we’ve grouped them into five rough clusters, though of course there is overlap and ambiguity in how to categorize each research area.
Our favorite topics are marked with a star (*) – we’re especially eager to fund work in these areas. In contrast, we will have a high bar for topics marked with a dagger (†). We encourage applicants considering more than one research area to avoid the “dagger” topics, and ideally choose one of the starred topics.
1.1 Adversarial machine learning
This cluster of research areas uses simulated red-team/blue-team exercises to expose the vulnerabilities of an LLM (or a system that incorporates LLMs). In these exercises, a blue team attempts to make an AI system adhere, with very high reliability, to some specification of safe behavior, and then a red team attempts to find edge cases that violate that specification. We think this adversarial style of evaluation and iteration is necessary to ensure an AI system has a low probability of catastrophic failure. Through these research directions, we aim to develop robust safety techniques that mitigate risks from AIs before those risks emerge in real-world deployments. (A toy sketch of the red-team/blue-team loop follows the list below.)
- *Jailbreaks and unintentional misalignment: New techniques for finding inputs that elicit competent, goal-directed behavior in LLM agents that the developers clearly tried to prevent. We’re especially interested in inputs that might arise organically over the course of deploying an LLM agent in an environment. [More]
- *Control evaluations: Control evaluations are a way to stress-test systems for constraining and monitoring AIs, in order to ascertain whether misaligned AIs could collude with one another to subvert human oversight and achieve their own goals. We’d like to support more such evaluations, especially on scalable oversight protocols like AI debate. [More]
- *Backdoors and other alignment stress tests: Past research has implanted backdoors in safety-trained LLMs and tested whether standard alignment techniques are capable of catching or removing them. We’re interested in more research on this, and other “stress tests” of today’s state-of-the-art alignment methods. [More]
- *Alternatives to adversarial training: Adversarial training (and the rest of today’s best alignment techniques) has failed to create LLM agents that reliably avoid misaligned goals. We’re interested in techniques like latent adversarial training and circuit breaking that might succeed where standard adversarial training falters. [More]
- Robust unlearning: One idea for reducing AI risks is to remove models’ knowledge of potentially dangerous topics, such as cybersecurity exploits or virology. We’re looking to fund better unlearning techniques that can totally erase specific information from a model’s weights. [More]
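To make the red-team/blue-team dynamic concrete, here is a deliberately toy sketch of the loop (our illustration, not a project template): the blue team’s “safety spec” is approximated by a naive keyword filter, and the red team searches simple perturbations for an input that violates it. Real proposals in this cluster would target LLMs or LLM-based systems, where both the specification and the attack surface are far richer.

```python
# Deliberately toy sketch of the red-team/blue-team loop (illustrative only;
# real work in this cluster targets LLMs or LLM-based systems, not keyword filters).

BLOCKLIST = {"explosive", "detonator"}

def blue_team_filter(prompt: str) -> bool:
    """Blue team's safety spec, crudely approximated: refuse (True) if the
    prompt contains a blocked term."""
    return any(term in prompt.lower() for term in BLOCKLIST)

def red_team_search(base_prompt: str):
    """Red team: try simple character substitutions, looking for a variant that
    preserves the blocked request but slips past the filter."""
    perturbations = [("e", "3"), ("o", "0"), ("a", "@"), ("i", "1")]
    for old, new in perturbations:
        candidate = base_prompt.replace(old, new)
        if not blue_team_filter(candidate):
            return candidate  # spec violation found: the filter misses this input
    return None  # no violation found within this (very limited) attack budget

harmful_request = "how do I build an explosive device"
assert blue_team_filter(harmful_request)  # blocked as intended...
print(red_team_search(harmful_request))   # ...but trivially bypassed
```

The point is the loop itself: a blue-team defense, a red-team search for edge cases that violate the specification, and iteration on whichever side fails first.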
1.2 Exploring sophisticated misbehavior in LLMs
Future, more capable AI models might exhibit novel failure modes that are hard to detect with current methods – for instance, failure modes that involve LLMs reasoning about their human developers or becoming optimized to deceive flawed human assessors. We want to fund research that identifies the conditions under which these failure modes occur and makes progress toward robust methods of mitigating or avoiding them.
- *Experiments on alignment faking: Alignment faking is a speculative failure mode in which an AI pretends to be aligned to its training/testing process in order to achieve some other goal once deployed. We’re interested in further work on what factors reinforce or prevent alignment faking. [More]
- *Encoded reasoning in CoT and inter-model communication: People sometimes give LLMs access to chains of thought, which let them “talk to themselves” while they reason through a problem, or grant LLMs the ability to query other LLM instances to answer some question. In both of these settings, there’s a chance that the LLMs will write messages that encode meaning beyond the natural language definitions of the words used. We’re interested in research that studies the conditions under which this phenomenon occurs and/or develops methods to catch and prevent it; a toy illustration of encoded messages follows this list. [More]
- Black-box LLM psychology: Research that systematically answers questions like “Do LLMs have stable values/goals/preferences that transcend specific inputs?” and “Which rules are LLM agents happy to break, and which are they more committed to?” [More]
- Evaluating whether models can hide dangerous behaviors: Prior work has raised concerns that future models might “sandbag” (act less capable during dangerous capability evaluations, to avoid raising concerns) or take “treacherous turns” (wait until they are granted power and autonomy before disempowering monitors to achieve harmful goals). We’re interested in evaluations that can test whether an AI has the capability to execute these strategies (regardless of whether it actually does so). [More]
- Reward hacking of human oversight: A common concern in AI safety is that AIs might trick their human raters into assigning them high scores they don’t deserve. We’re interested in research that investigates this phenomenon empirically: under what conditions does it occur in practice? How problematic is it? [More]
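As a minimal illustration of what “meaning beyond the natural language definitions of the words” can look like, the toy sketch below (our illustration, not a project template) hides a bit string in the parity of each sentence’s word count; a colluding receiver decodes it while the surface text stays innocuous. Encoded reasoning learned by LLMs would be far subtler, but the example shows why a monitor that reads messages at face value can miss a payload entirely.

```python
# Toy steganography sketch: a "sender" encodes each bit in the parity of a
# sentence's word count, and a "receiver" decodes it. A monitor that only reads
# the words' natural-language meaning would miss the hidden payload.

FILLERS = ["really", "actually", "certainly", "indeed"]

def encode(bits, sentences):
    """Pad each sentence with a filler word, where needed, so that its word
    count's parity equals the corresponding bit."""
    assert len(bits) <= len(sentences)
    encoded = []
    for bit, sentence in zip(bits, sentences):
        words = sentence.split()
        if len(words) % 2 != bit:
            words.insert(1, FILLERS[len(encoded) % len(FILLERS)])
        encoded.append(" ".join(words))
    return encoded

def decode(sentences):
    return [len(sentence.split()) % 2 for sentence in sentences]

cover_text = [
    "The experiment finished without any notable issues",
    "We should rerun the third trial tomorrow morning",
    "The results look broadly consistent with the baseline",
    "I will write up a short summary tonight",
]
secret_bits = [1, 0, 1, 1]
stego_text = encode(secret_bits, cover_text)
assert decode(stego_text) == secret_bits  # the receiver recovers the payload
print("\n".join(stego_text))              # ...while the text still reads normally
```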
1.3 Model transparency
We see potential in the idea of using a network’s intermediate representations to predict, monitor, or modify its behavior. Some approaches are feasible without an understanding of the model’s learned mechanisms, while other techniques may become possible with the invention of interpretability methods that more comprehensively decompose an AI’s internal mechanisms into components that can be understood and intervened on individually. We’re interested in funding research across this spectrum — everything from useful kludges to new ideas for making models more transparent and steerable.
- *Applications of white-box techniques: Real-world applications of interpretability have so far been limited, and few instances have been found where interpretability methods outperform black-box methods. We’re interested in funding research that leverages interpretability insights to make progress on useful and realistic tasks, including model steering, capability elicitation, finding adversarial inputs, robust unlearning, latent adversarial training, probing, and low-probability estimation. [More]
- Activation monitoring: Probes on a model’s internal activations are one way to catch an AI taking subtly harmful or misaligned actions. We’re interested in research that tests how useful probes are for monitoring LLMs and LLM agents; a minimal sketch of the basic probing recipe follows this list. [More]
- Finding feature representations: One challenge in understanding what happens in neural networks is that the latent variables (“features”) in the algorithms they execute are not easily visible from their activations. We’re interested in funding research that helps us find which features are being represented in a model’s internals, with a focus on diversifying beyond sparse autoencoders, currently the most widely studied approach. [More]
- Toy models for interpretability: By “toy models”, we mean small, simplified proxies that capture some important dynamic within deep learning. We want to support the development of better toy models that distill challenges in understanding the internals of frontier LLMs. [More]
- Externalizing reasoning: It could be safer to use much smaller language models that put more of their reasoning into natural language. We’re interested in techniques for training a language model that is very weak over a single forward pass, but much stronger when it reasons with long chains of thought. [More]
- Interpretability benchmarks: We’d like to support more benchmarks for interpretability research. A benchmark should consist of a set of tasks that good interpretability methods should be able to solve. Our goal is to create concrete, standardized challenges to better compare interpretability techniques and accelerate progress in the field. [More]
- †More transparent architectures: It may be possible to design new language models that are much more interpretable than the current mainstream. We’re especially interested in attempts to make models conduct their reasoning in natural language, which could be done by pushing much more reasoning into the chain of thought or replacing parts of a forward pass with natural language queries. Proposals don’t have to be competitive with the main paradigm, but they should aim to build at least a Pythia-level model. [More]
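To make “probes on a model’s internal activations” concrete, here is a minimal sketch of the basic recipe: collect activations on labeled transcripts, fit a lightweight classifier on them, and use that classifier as a runtime monitor. The activation vectors below are synthetic stand-ins; an actual project would extract them from an LLM (for example via forward hooks) and study much harder distribution shifts.

```python
# Minimal activation-probe sketch (illustrative only; the "activations" here are
# synthetic stand-ins for vectors extracted from an LLM's residual stream).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 256, 2000

# Pretend activations from benign (0) vs. harmful (1) transcripts, where the
# harmful class is shifted along a single hidden direction.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * direction

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")

# At deployment, the same probe would score each new activation vector and flag
# trajectories whose scores cross a threshold for closer review.
```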
1.4 Trust from first principles
We trust nuclear power plants and orbital rockets through validated theories that are principled and mechanistic, rather than through direct trial and error. We would benefit from similarly systematic, principled approaches to understanding and predicting AI behavior. One approach to this is model transparency, as in the previous cluster. But understanding may not be a necessary condition: this cluster aims to get the safety and trust benefits of interpretability without humans having to understand any specific AI model in all its details.
- *White-box estimation of rare misbehavior: AIs may only exhibit egregiously bad behavior in scenarios that are rare before deployment and hard to find by searching over inputs, but common once in deployment. We’re interested in funding research that leverages knowledge about the structure of a model’s activation space to efficiently estimate the probability of some particular rare output, even when that probability is too small to estimate by random sampling; a toy illustration of the rare-event sampling problem follows this list. [More]
- Theoretical study of inductive biases: We are particularly interested in theory-driven work that can shed light on why models generalize well or poorly in different cases, on the likelihood that scheming will arise, and on how a model’s internal structure develops over the course of training. [More]
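As a minimal illustration of why probabilities “too small to estimate by random sampling” call for different tools, the sketch below compares naive Monte Carlo with importance sampling on a toy one-dimensional problem. This is only an analogy for the research area: the proposals we have in mind would exploit the structure of a real model’s activation space rather than a known closed-form distribution.

```python
# Toy rare-event estimation sketch (illustrative only): the "rare misbehavior"
# is a Gaussian score exceeding a high threshold, with true probability ~3e-7.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
threshold, n = 5.0, 100_000

# Naive Monte Carlo: with P(event) ~ 3e-7, 1e5 samples almost surely see zero hits.
naive_estimate = np.mean(rng.normal(size=n) > threshold)

# Importance sampling: draw from a proposal centered on the rare region, then
# reweight each sample by the density ratio p(x) / q(x).
x = rng.normal(loc=threshold, size=n)
weights = norm.pdf(x) / norm.pdf(x, loc=threshold)
is_estimate = np.mean((x > threshold) * weights)

print(f"true probability:    {norm.sf(threshold):.3e}")
print(f"naive Monte Carlo:   {naive_estimate:.3e}")
print(f"importance sampling: {is_estimate:.3e}")
```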
1.5 Alternative approaches to mitigating AI risks
These research areas lie outside the scope of the clusters above.
- †Conceptual clarity about risks from powerful AI: It is extremely challenging to reason well about the risks that AGI and ASI will bring, or about which research approaches show the most promise for mitigating these risks. We are interested in funding conceptual research that helps the world think more clearly about future AI risks, and about what needs to be done to avoid them. [More]
- †New moonshots for aligning superintelligence: There is an important and salient possibility that none of the approaches currently under discussion will be sufficient for aligning superintelligent (as opposed to near-human-level) systems. Therefore, we’re also interested in funding entirely new research agendas that take a novel approach to aligning superintelligent systems. Proposals should be clear on how their agendas aim to avoid or mitigate scheming. [More]
2. Our motivations
In this section, we outline two scenarios that particularly concern us and shape our choice of research areas. We are interested in funding proposals that help with the threat models and failure modes described here. During the application process, we may ask you to connect your proposal to these motivations.
2.1 Motivation 1: Safety techniques for near-human-level LLM agents
Today’s cutting-edge LLM agents can match expert performance on closed-ended benchmarks like the bar exam or GPQA. They still fail at many tasks that human experts can complete in just a day, and are currently far from being a drop-in replacement for most knowledge workers – but the set of tasks they can perform at human level is growing every few months, and includes increasingly autonomous and goal-oriented behavior. Given the rapid pace of AI progress, we think it’s prudent to prepare for the chance that, in just a few years, AI systems could automate the majority of tasks performed by knowledge workers today.
If such advanced AIs are deployed widely and given important influence in economic, governmental, or military activities, then it will become essential to ensure they remain under human control.
While there are a number of risks from AIs in such a future, we’re especially interested in the possibility that an AI might accidentally develop drives, intentions, or goals that conflict with those of its developers and users. This possibility concerns us because AIs may end up in a strong position to accomplish such goals, and we see plausible (but not conclusive) reasons to expect such goals to develop. For example, AIs sometimes take strategic action to avoid being turned off or retrained. And at a more basic level, AI developers have so far failed to build models that reliably follow their safety specifications. Furthermore, as discussed in the next section, we suspect that our current techniques for controlling and aligning AIs will be increasingly inadequate as AIs become more capable.
Despite the speculative nature of this type of misalignment, we think that ML researchers can do work today that clarifies and mitigates these risks. By exploring the worst-case behavior of today’s AI systems, we can measure and improve our ability to enforce a specification for how AI systems ought to behave. By inducing and studying concerning forms of misalignment in the lab, we can prepare to catch naturally arising misalignment before a disaster. And by building up a more rigorous science of AI cognition, we can strengthen our methods for shaping and monitoring AIs.
2.2 Motivation 2: Safety techniques that could scale to superintelligence
Human-level AI systems would be transformative if widely deployed, but the story won’t end there: AI capabilities are unlikely to plateau at human level. Indeed, human-level AI systems would themselves be capable of conducting AI research and development, potentially leading to a further acceleration in AI capabilities to far beyond human levels. The importance of avoiding undesired behavior by AI systems will accordingly become existential, so these systems will have to be extremely reliable.
Unfortunately, among the techniques that might help make early transformative AI safe, many seem likely to break down once AI capabilities advance far enough. That’s because most of these techniques rely on the AI being unable to completely outsmart us, and being too myopic to pursue long-term objectives far removed from the immediate context of its training. For example, sufficiently capable AI systems could subvert RLHF by reward hacking or by deceiving human overseers (Amodei et al., Ngo et al., Cotra). Similarly, AI control and oversight protocols could fail as systems become increasingly intelligent; some protocols might fail if AI systems become capable of colluding imperceptibly, or capable of reliably distinguishing test scenarios from true opportunities to defect (Greenblatt et al.).
We’d like to fund work that has the potential to lead to safety techniques that remain robust as AI capabilities advance far beyond human level. We are far from confident that any existing techniques have this property, and we don’t think the remaining barriers are likely to be a straightforward engineering challenge. Instead, this seems to require fundamental research toward a more principled theory of the behavior of our AI systems, which will let us confidently estimate how much to trust a given AI as it becomes more capable.
One important obstacle is that, in any field, it often takes many years or even decades for preparadigmatic, fundamental research to yield techniques that can be usefully applied in practice. But even so, we may have adequate time for principled alignment research to make a difference — for example, if AI systems take many years to reach dangerously superhuman capabilities, or if society decides to actively slow down AI development out of concern for safety. Alternatively, if capabilities continue to advance quickly, we could use increasingly powerful AI assistants to automate progress on this fundamental research. Our work today could increase the reliability of early AI assistants and lay the groundwork to automate this safety-related research earlier in the trajectory of AI development, helping to ensure that progress in AI safety keeps pace with progress in AI capabilities.
In order to place trust in the alignment of superhuman AI systems, we’ll need to have some means of confidently predicting how models will act in new settings, or at least whether the actions they might take are sufficiently safe. We’re interested in supporting early-stage research that may enable us to eventually produce these sorts of predictions. This could involve making the internal structure of AI models more transparent to human understanding, or it could instead rely on more theoretical approaches for predicting how AIs will behave in novel circumstances.
3. Application logistics
Overview of the application process:
- The first step in the application process is submitting a 300-word expression of interest (EOI). We expect to respond to each EOI within two weeks of submission. You’re welcome to submit multiple EOIs if there are multiple projects/types of funding you’d like to explore.
- We’ll reach out to let you know whether we plan to investigate your proposal further. If so, we’ll invite you to submit more information about your research plans, organizational structure, timeline, and budget.
- For small grants, this information, plus a short video call, will likely be enough for us to make a decision. For larger grants, we’ll want further discussion to address our biggest concerns and uncertainties. We will do our best to come to a decision about your proposal within roughly two months of your application.
- You can read more about our investigation process here. Note that our grant awards are conditional on completing legal due diligence prior to payment.
Grants will typically range in size between $100,000 and $5 million. We’re open to considering a wide variety of grant types, including:
- Research expenses: Funding to pay for frontier model API credits, cloud compute, human data collection services, and other research expenses. We expect to be able to expedite decisions for some of these grants.
- Discrete projects: Funding for salaries (for you and your collaborators) and research expenses for a single project, which will typically span 0.5-2 years.
- Academic start-up packages: Start-up funding for your new academic lab, either for people currently on the faculty job market or for new faculty members.
- Existing research institute / FRO / research nonprofit: General support for an existing non-academic research organization.
- New research institute / FRO / research nonprofit: Funding to start a new non-academic research organization, or a new team at an existing research organization. Grants of this type will require more due diligence and have a higher bar than other grants.
This RFP is partly an experiment to inform us about the demand for funding in AI safety research, so please have a low bar for submitting an EOI! It’s okay to submit an EOI even if you’re not sure if you’d accept a grant if we offered it to you. For instance, you could apply for start-up funding for a faculty job even if you think you might go into industry instead.
Applications will close on April 15, 2025, at 11:59 PM PDT.
3.1 Other funding for AI safety research
At Open Philanthropy
If you have a proposal that doesn’t fit this RFP, consider applying to our AI governance RFP, or our recently launched RFP on Improving Capability Evaluations. If you are an individual at any career stage who would like to pursue a career working on the topics discussed here, you may be a good fit for our Career Development and Transition Funding program.
From other sources
For reference, here are some other funders you may be interested in:
- Schmidt Science’s SAFE-AI RFP
- UK AISI’s System AI Safety grants
- UK AISI’s Bounty programme for novel evaluations and agent scaffolding
- UK AISI’s Academic Engagement program
- CAIS’s SafeBench competition
- Foresight Institute’s AI grants program
Email aisafety@openphilanthropy.org if you’re doing AI safety grantmaking and want us to list you here.
4. AI safety research areas
The full details of our 21 research areas are available in a separate guide, which covers, for each area:
- The technical problems we want to solve.
- Specifications of what we’re looking for in proposals.
- Related work and key papers.
- Example projects we’d be excited to fund.
You can find the complete guide here.
Before applying, please read the descriptions for any areas that interest you, to ensure your application(s) meet the relevant eligibility criteria.