1. Capability evaluations are not on track for the role they are expected to play
Experts significantly disagree on the likely future capabilities of large language models (LLMs). Some experts believe LLM-based agents will soon outperform human professionals in almost every task,[1] while others think the impact will be more modest and limited to specific areas.[2] These disagreements often underpin larger debates about potential risks, whether from misuse or loss of control.[3]
These disagreements have contributed to growing interest in governance approaches that accommodate different predictions about AI progress. One prominent example is “if-then commitments,” where AI developers agree to take specific actions if their systems meet certain capability thresholds (for example, pausing training of more powerful models until safety measures are in place, or improving security).[4] Since these actions are conditional on certain thresholds being met, this approach is compatible with a range of beliefs about AI progress.
If-then commitments, like other governance approaches based on AI capabilities, rely on accurate AI evaluations.[5] Governments, regulators, and society at large need robust, reliable measurements to understand and respond appropriately to advances in AI capabilities. We are worried that the current evaluations paradigm isn’t mature enough to play this role.
Capability evaluation currently faces three major challenges:
- Existing benchmarks for risk-relevant capabilities are inadequate. We need more demanding evaluations that can meaningfully measure frontier models’ performance on tasks relevant to catastrophic risks, resist saturation even as capabilities advance, and rule in (not just rule out) serious risks.[6]
- The science of capability evaluation remains underdeveloped. We don’t yet understand how various capabilities scale, the relationships between different capabilities, or how post-training enhancements[7] will affect performance.[8] This makes interpreting current evaluation results and predicting future results challenging.
- Third-party evaluators already face significant access constraints, and increasing security requirements will make access harder. Maintaining meaningful independent scrutiny will require advances in technical infrastructure, evaluation and audit protocols, and access frameworks.[9]
We are looking to fund projects that can make progress on any of these challenges. You can submit a short Expression of Interest (EOI) via the link below.
Anyone is eligible to submit an EOI, including those at existing institutions (whether academia, nonprofits, or for-profit companies) and those seeking funding to found new organizations.[10] We expect grants to typically range from $0.2M to $5M over a period of 6 months to 2 years. We will review EOIs on a rolling basis and aim to respond within two weeks. Expressions of interest will be open until April 1, 2025, though we may extend the application period.
Below, we expand on each of the three main areas we’re seeking proposals for. Each section includes examples of valuable work, open questions we’re interested in, and key requirements for proposals. We expect that most strong proposals will focus on just one area, so there is no need to read the whole page.
2. Global Catastrophic Risk (GCR)-relevant capability benchmarks for AI agents
In November 2023, we launched an RFP for LLM agent benchmarks. Though that RFP did not focus solely on GCR-relevant capabilities,[11] multiple benchmarks funded through that RFP have already been used in pre-deployment testing of frontier models,[12] and others are forthcoming.
Despite that progress, we still think we urgently need more demanding tests of AI agents’ capabilities[13] that are directly relevant to global catastrophic risks. We’re particularly interested in such benchmarks because:
- Some of the existing relevant benchmarks are already saturated, or close to saturation.[14]
- Existing benchmarks only cover some potential risks, and not necessarily at a difficulty level relevant to catastrophic risk.[15]
- Ideally, we want evaluations to be able to rule in risks. By this, we mean that sufficiently strong performance on the evaluation would provide compelling evidence of the capability to cause serious harm.[16]
We focus on agent capabilities because they’re central to the main risks we’re concerned about. Capable AI agents could pursue their own objectives, be used to automate dangerous R&D, and accelerate overall AI development beyond our ability to handle it safely. Understanding these capabilities helps mitigate these risks.
We recognize the technical challenges in building benchmarks and running evaluations. As a result, we are ready to provide substantial funding for well-designed proposals that demonstrate sufficient ambition.
2.1 Necessary criteria for new benchmarks
2.1.1 Relevance to global catastrophic risk
Tasks should be designed to measure capabilities directly relevant to plausible threat models for global catastrophic risks. We’re particularly interested in testing for the following capabilities:
- AI research and development (R&D), i.e., automating many/all of the tasks currently done by top AI researchers and engineers
- AI systems that could competently perform AI R&D could dramatically accelerate the pace of AI capabilities development, potentially outpacing society’s ability to adapt and respond.
- AI R&D also serves as a useful proxy for AI capabilities in other kinds of R&D, e.g., in biology, chemistry, or weapons development. We think AI systems will likely be better at AI R&D than other kinds of R&D,[17] so AI R&D capabilities can act as a leading indicator of general R&D capabilities.
- Rapid adaptation to, and mastery of, novel adversarial environments
- The ability to efficiently develop and execute winning strategies in complex, competitive environments is a key capability for loss of control (or “rogue AI”) scenarios.
- We’re particularly interested in the ability to autonomously produce such strategic behavior across multiple domains without domain-specific optimization — whether competing in novel games, executing or responding to cyber threats, or acquiring resources in the real world.
- Capabilities that are directly relevant to undermining human oversight, such as scheming, situational awareness, or persuasion[18]
- Cyberoffense
- We’re particularly interested in benchmarks that cover the entire kill chain. Scores should be benchmarked to the operational capacities[19] of specific threat actors.
We’ll consider proposals for other capability benchmarks if they are supported by a specific threat model showing that the capability is necessary to realize a catastrophic risk.[20]
In general, we prefer evaluations that measure specific dangerous capabilities over those that measure more general precursors. For example, cyberoffense evaluations are better than general coding proficiency evaluations, and cyberoffense evaluations focusing on the key bottlenecks are better still.[21]
2.1.2 Evaluating agentic capabilities
The majority of catastrophic risks we’re concerned about stem from agentic AI systems.[22] In part because of this, we are only inviting proposals about benchmarks for AI agent capabilities. This means testing AI systems on decomposing complex tasks into smaller subtasks, autonomously pursuing objectives across multiple steps, and responding to novel situations without human guidance.
2.1.3 Construct validity
Evaluations need to measure what they claim to measure, not just superficially similar tasks. Good construct validity entails that high scores on an evaluation would correspond to high real-world performance in the relevant domain, and that low scores would correspond to poor real-world performance.
Construct validity is often difficult to assess. All else equal, we prefer benchmarks where tasks are critical bottlenecks or necessary difficult steps in a GCR threat model, and which mirror tasks done by human professionals.[23] While testing performance on discrete subtasks is often useful to distinguish between low performance levels, high scores should require models to identify necessary subtasks, determine their relationships, and chain them together without supervision. Benchmarks should include the most difficult tasks required to realize a given threat.
2.1.4 Difficulty
We want to fund evaluations that are difficult enough to resist saturation, and that provide unambiguous evidence of serious risks if models perform well on them.
This will likely require identifying tasks that are challenging for world-class human experts, and which take days or more to complete. Difficulty is particularly important given the recent rate of AI progress and the effects of post-training enhancements[24] on model performance. For example, when FrontierMath — a benchmark of math problems ranging from PhD-level to open research questions — was first released, frontier models scored less than 2%; now, o3 scores 25%.[25] Similarly, o3 has exceeded human expert performance on GPQA Diamond, a benchmark of unpublished PhD-level science questions, scoring 87.7%.[26] The previous highest score from an AI model was 56%.[27]
2.1.5 Follows best scientific practice
Evaluations should follow scientific best practice in how they are conducted and reported.[28] For example, to mitigate test set contamination, representative private test sets should be excluded from published benchmarks, and performance differences between public and private test sets should be measured. The experimental setup, including the base model, agent scaffolding, prompting, and any other post-training enhancements used, should be clearly documented.
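As one illustration of the kind of statistical reporting this implies, the sketch below computes a benchmark score with a simple confidence interval and compares a public split against a held-out private split. It is a hedged example rather than a prescribed method; the scores, split sizes, and the normal-approximation interval are all assumptions, and clustered or bootstrap intervals may be more appropriate when tasks are grouped.

```python
# Illustrative only: reporting an evaluation score with a normal-approximation
# confidence interval, and comparing public vs. private splits to flag possible
# contamination. All numbers below are hypothetical.
import numpy as np

def score_with_ci(per_task_scores: np.ndarray, z: float = 1.96) -> tuple[float, float, float]:
    """Mean score with an approximate 95% confidence interval."""
    n = len(per_task_scores)
    mean = per_task_scores.mean()
    se = per_task_scores.std(ddof=1) / np.sqrt(n)
    return mean, mean - z * se, mean + z * se

rng = np.random.default_rng(0)
public = rng.binomial(1, 0.62, size=200).astype(float)   # hypothetical pass/fail results
private = rng.binomial(1, 0.55, size=200).astype(float)  # held-out private split

for name, scores in [("public split", public), ("private split", private)]:
    m, lo, hi = score_with_ci(scores)
    print(f"{name}: {m:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```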
2.2 Nice-to-have criteria for new benchmarks
While the previous criteria are necessary, the following are nice-to-haves:
- Where possible, tasks should be compatible with widely used libraries to make running evaluations easier.
- We encourage using Inspect, an open-source framework for model evaluations developed by the UK AI Safety Institute (a minimal illustrative task sketch follows this list).
- Task grading should be fine-grained or ideally continuous, rather than binary (pass/fail).
- Tasks should provide opportunities for feedback and iteration, e.g., from critic models, automatic graders, or performance on discrete subtasks.
- Where relevant, evaluations should include appropriate comparison baselines.
- What counts as an “appropriate” baseline will depend on the threat model used. In many cases, we’re interested in the performance of both human professionals and novices, given adequate tool access, incentives, time, and training.
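To make the Inspect suggestion above concrete, here is a minimal, hypothetical task sketch. The task name, dataset contents, and scorer choice are placeholders rather than recommendations, and parameter names may differ across Inspect versions, so check the Inspect documentation before building on this.

```python
# Hypothetical minimal Inspect task; dataset contents and names are placeholders.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import model_graded_qa

@task
def hypothetical_agent_task():
    # A real benchmark would load many samples, ideally with a private held-out split.
    dataset = [
        Sample(
            input="Placeholder prompt describing a multi-step objective.",
            target="Placeholder description of what a successful outcome looks like.",
        )
    ]
    return Task(
        dataset=dataset,
        solver=generate(),         # a real agentic task would use an agent scaffold here
        scorer=model_graded_qa(),  # model-graded scoring can award partial credit
    )

# Typically run from the command line, e.g.:
#   inspect eval this_file.py --model <provider/model-name>
```

Model-graded or custom scorers can award partial credit, in line with the preference above for fine-grained rather than binary grading.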
2.3 Useful previous work
- Examples of challenging benchmarks include:
- FrontierMath, a benchmark of unpublished and extremely difficult math problems.
- Cybench, a benchmark of challenging, multi-step capture-the-flag tasks in cybersecurity. The hardest tasks in Cybench take humans days to solve.
- Examples of GCR-relevant benchmarks include:
Other work we think is useful includes:
- [On hiatus] Request for proposals: benchmarking LLM agents on consequential real-world tasks, which articulates the case for building hard evaluations and sets out appropriate criteria
- Early lessons from evaluating frontier AI systems, which discusses UK AISI’s approach to why, how, and when to conduct evaluations
- Evaluating frontier AI R&D capabilities of language model agents against human experts, which discusses the design choices behind RE-Bench and how METR conducted its evaluations
- The Evals Gap, which argues that existing evals are not sufficient to robustly measure capabilities
- UK AISI’s Inspect documentation, UK AISI’s Request for Evaluations and Agent Scaffolding: Extreme Risks from Frontier AI (now closed), and METR’s Desiderata for its task bounty program (now closed), which defend different views on how to design and build evaluations[29]
- Building evaluations for cybersecurity assistance, which argues for the importance of identifying the correct threat model before designing evaluation tasks
- For more resources, see the work on how to run evaluations discussed here
3. Advancing the science of evaluations and capabilities development
Capabilities evaluations are closer to snapshots of current model performance than guides to future development: evaluators can run tests and see what models can do today, but struggle to predict what they’ll be capable of tomorrow.
While building hard, GCR-relevant benchmarks will help, we also need more work on several different areas:
- Measuring and predicting performance gains from post-training enhancements,[30] like prompt engineering, scaffolding, and fine-tuning
- Research informing how we conduct and interpret evaluations
- Understanding and predicting the relationships between different capabilities, and how capabilities emerge
Below, we go into more detail on some open questions we’re particularly interested in. For this area, we will consider proposals on understanding capabilities and evaluation broadly (i.e., not just GCR-relevant capabilities), provided such research could transfer to risk-relevant domains.
3.1 Scaling trends for everything
Capability evaluations should aim to establish reliable upper bounds on AI agent performance within available budgets. To both improve the quality of our estimates and understand when we need to re-evaluate models, we need to better understand how different enhancements affect capabilities. Open questions include:
- How do different capabilities scale with different variables, e.g., model size, compute resources, or post-training enhancements?[31] (An illustrative curve-fitting sketch follows this list.)
- How can we best quantify post-training interventions that affect model performance, such as elicitation, scaffolding, or tooling effort?
- What do the marginal return curves for investing in different post-training enhancements look like? How do these returns vary across different model architectures and sizes?
- For post-deployment models, what kinds of advances in post-training enhancements should trigger re-evaluation? How can we reliably tell when these advances have been realized?
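As a toy illustration of the first question in this list, the sketch below fits a sigmoidal trend of benchmark score against log10 training compute and extrapolates it forward. Every number is invented, and a real analysis would need uncertainty estimates and a way to account for post-training enhancements.

```python
# Toy sketch: fit a sigmoid of benchmark score vs. log10(training FLOP) and extrapolate.
# All data points are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, midpoint, slope):
    return 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))

log_flop = np.array([23.0, 24.0, 25.0, 25.5, 26.0])  # hypothetical log10 training compute
score = np.array([0.02, 0.05, 0.18, 0.35, 0.55])     # hypothetical benchmark accuracy

(midpoint, slope), _ = curve_fit(sigmoid, log_flop, score, p0=[26.0, 1.0])
for x in (26.5, 27.0):
    print(f"predicted score at 1e{x:.1f} FLOP: {sigmoid(x, midpoint, slope):.2f}")
```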
3.2 Understanding relationships between capabilities
Many important AI capabilities are expensive and difficult to reliably evaluate. Understanding the relationships between different capabilities could help us to identify when to run costly evaluations based on results from cheaper evaluations, identify surprising results that warrant more comprehensive testing, and, most ambitiously, predict important capability levels from simpler evaluations. Open questions include:
- How robust is the predictive relationship between simpler, low-dimensional capability measures and complex, emergent capabilities? (An illustrative sketch follows this list.)
- How should we expect this relationship to change with model size, model architecture, or the kinds of capabilities studied?
- Can we predict how smoothly performance on particular tasks will scale? How is this predictability related to the capability being measured?[32]
- Can we decompose dangerous capabilities into meaningful components that can be tracked separately?[33]
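A minimal sketch of one analysis relevant to these questions, loosely in the spirit of observational scaling laws: extract a few principal components from a matrix of cheap benchmark scores and check how well they predict a more expensive evaluation. The data are synthetic and the method is illustrative, not a recommended pipeline.

```python
# Synthetic sketch: do a few principal components of cheap benchmark scores
# predict a more expensive evaluation?
import numpy as np

rng = np.random.default_rng(0)
cheap = rng.uniform(0, 1, size=(12, 6))                             # 12 models x 6 cheap benchmarks
expensive = cheap @ rng.uniform(0, 1, 6) + rng.normal(0, 0.05, 12)  # hypothetical target eval

# Top principal components of the (centered) cheap-benchmark matrix.
centered = cheap - cheap.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = centered @ vt[:3].T

# Least-squares fit of the expensive evaluation on those components.
design = np.c_[np.ones(len(components)), components]
coef, *_ = np.linalg.lstsq(design, expensive, rcond=None)
pred = design @ coef
r2 = 1 - np.sum((expensive - pred) ** 2) / np.sum((expensive - expensive.mean()) ** 2)
print(f"R^2 of 3-component fit: {r2:.2f}")
```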
3.3 Improving baselines and measurement
- What is the best way to build continuous grading metrics for complex, longer-horizon tasks?[34]
- How can we accurately measure the human effort involved in model evaluations, e.g., in prompt engineering and scaffolding design?
- How can we establish clear, measurable thresholds for:
- World-class human performance
- Where relevant, minimum capability levels sufficient to enable serious harm
- How can we be confident that high scores on dangerous capability evaluations justify taking meaningful actions on safety?[35]
3.4 Understanding how evaluation results may be misleading
- How easy is it to fine-tune a model to significantly change its performance on a particular evaluation in a hard-to-detect way, e.g., to artificially degrade model performance on a dangerous capabilities evaluation?
- Do performance changes from fine-tuning transfer across different evaluations?
- Can we reliably detect when a model has been fine-tuned for specific evaluations?
- Can supervised fine-tuning reliably mitigate under-elicitation and sandbagging?[36]
- How do model jailbreaks affect performance?
- Can we reliably estimate the upper bound of model performance when testing jailbroken models?
- How can we account for, or prevent, data contamination when using public evaluation datasets?
- More rigorous measurements of model performance on top benchmarks
- Including improved human baselines, pass@k or best-of-k performance, improved tool access, elicitation, or scaffolding (a sketch of the standard pass@k estimator follows this list)
- Uncontaminated versions of top benchmarks (e.g., GSM1k for more challenging benchmarks)
- Policy and/or planning work toward:
- Bounties for beating the evaluation scores that companies report
- RCTs and human uplift studies for human novices
- Tracking and discovering “in the wild” examples of frontier capabilities, e.g., Big Sleep, Project Naptime, LLM forecasting
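For reference, the sketch below implements the standard unbiased pass@k estimator (from the HumanEval paper by Chen et al., 2021) mentioned above; the example numbers are illustrative.

```python
# Standard unbiased pass@k estimator: probability that at least one of k samples
# passes, given that c of n samples passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(f"{pass_at_k(n=100, c=15, k=10):.2f}")  # ~0.82 when 15 of 100 samples pass
```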
3.5 Useful previous work
Work in this area that we think is useful includes:
- Work on predicting LLM capabilities, including Observational Scaling Laws and the Predictability of Language Model Performance, which shows LLM capabilities can be decomposed into three principal components that scale log-linearly with compute. We’re particularly interested in further testing this approach with the next generation of models, and extending the model to include post-training enhancements and inference compute scaling. For other useful work here, see Epoch’s Benchmarking Dashboard, which shows trends in model performance on GPQA and MATH Level 5 over time, Evaluating Predictions of Model Behaviour, which argues that predicting evaluation results can be a useful tool for AI governance, and public prediction markets (e.g., Will an AI achieve >85% performance on the FrontierMath benchmark before 2028?).
- Work on measuring LLMs’ sensitivity to prompts, including Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting, which shows that LLMs are highly sensitive to semantically equivalent prompt formatting choices, and State of What Art? A Call for Multi-Prompt LLM Evaluation, which demonstrates that single-prompt evaluations of LLMs are unreliable, and proposes a multi-prompt evaluation framework with task-specific metrics. We’re most interested in finding ways to upper-bound frontier model performance, and as such we would be excited to see work on optimizing prompts or otherwise estimating prompt quality.
- Work on sandbagging, i.e., selective underperformance on evaluations, such as AI Sandbagging: Language Models can Strategically Underperform on Evaluations, Stress-Testing Capability Elicitation With Password-Locked Models, which finds that fine-tuning can efficiently elicit hidden capabilities from password-locked models, and When can we trust model evaluations?.
- Work on the methodology of evaluations, including AI Agents That Matter, which argues that AI agent benchmarks need major improvements in cost tracking, test design, and standardization to meaningfully measure real-world capabilities; Benchmarks as Microscopes: A Call for Model Metrology, which argues that LLM evaluation needs to focus on developing rigorous, use-case-specific benchmarks; We need a Science of Evals, which calls for a science of model evaluations; and Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations, which calls for statistical best practices to be applied to model evaluations.
- Work discussing the effect of different post-training interventions on model performance, including Measuring the impact of post-training enhancements, which found 26pp improvements from fine-tuning GPT-4, and did not find statistically significant gains from agent improvements; Details about METR’s preliminary evaluation of OpenAI o1-preview, which found significantly improved performance from o1 after changes to agent scaffolding; and How Many Data Points is a Prompt Worth?, which compares prompting and head-based fine-tuning (we’d be interested in seeing similar work on larger models).
- Work investigating whether LLMs have genuinely emergent — i.e., novel at scale — capabilities, such as Emergent Abilities of Large Language Models, and Are Emergent Abilities of Large Language Models a Mirage?.
- Work on lessons learned from and best practices for running evaluations, such as Lessons from the Trenches on Reproducible Evaluation of Language Models, and Emerging Processes for Frontier AI Safety.
- Work discussing different design approaches and test formats used for frontier model evaluations, including Long-Form Tasks, Early lessons from evaluating frontier AI systems, Early Insights from Developing Question-Answer Evaluations for Frontier AI, and US AISI and UK AISI Joint Pre-Deployment Test.
- Work presenting different views on how to design and build evaluations, such as UK AISI’s Inspect documentation, UK AISI’s Request for Evaluations and Agent Scaffolding: Extreme Risks from Frontier AI (now closed), and METR’s Desiderata for its task bounty program (now closed).[37]
4. Improving third-party access and evaluations infrastructure
Reliable assessment of frontier AI capabilities requires independent verification, but conducting meaningful external evaluations is already challenging and will become more so as security requirements grow more stringent.[38]
We’re looking for work that can help manage the tension between security and independent oversight. This includes:
- Improving model access for external evaluations
- Approaches to verifiable model auditing
- Improving evaluation infrastructure
4.1 Improving model access for external evaluators
As security requirements increase, we need to understand exactly what access auditors need to evaluate model capabilities and safety. This requires understanding the relationship between access levels, evaluation quality, and security risks, understanding practical implementation challenges, and developing protocols for third-party evaluators that work within labs’ security constraints and enable meaningful oversight. Open questions include:
- What minimum information about model training do auditors need to know to evaluate a safety case (i.e., a structured argument that an AI system is safe to deploy in a given environment)?[39]
- Which information about model training would be sufficient for auditors to evaluate safety cases? Which information would be most helpful?
- What are the strongest safety cases we could make and evaluate at each level of model access?
- What are the practical barriers to third-party evaluation at SL-3, SL-4, and above?[40]
- How should capabilities evaluations for models deployed for internal use only be carried out?
- What are the benefits of different levels of access for different kinds of model evaluations (e.g., standard API, helpful-only models, fine-tuning, logprobs, intermediate activations, full weights)? Which are necessary, and which are beneficial?
- What are the trade-offs between different levels of model access in terms of security risks and audit effectiveness, and how can the corresponding security costs be mitigated?
- What governance frameworks, structured access commitments, and evaluation protocols would enable robust external evaluation while managing security risks? This includes:
- Allocation of evaluation rights and responsibilities across organizations
- Legal frameworks and liability protections
- Governance structures and oversight mechanisms for evaluation organizations
- “Safe channels” or other protocols for secure evaluator-developer collaboration
4.2 Improving evaluations infrastructure
To make evaluations easier, quicker, and more useful to run, the field should work toward establishing common standards and best practices for conducting evaluations. Work we are particularly interested in includes:
- Building on Inspect, e.g.:
- Implementing realistic sandboxes for agentic evaluations
- Porting existing high-quality evaluations to Inspect
- Building tools for designing model graders
- Guidance on how to run evaluations, such as:
- Clear reporting of test conditions (pass@k, best-of-k; scaffolding and elicitation effort; post-training enhancements; inference and time budget), as in the illustrative example after this list
- Standards and best practices for human uplift studies
- Incorporation of statistical best practices and insights from metrology[41]
- Guidance on responsible disclosure of evaluation results
- This also includes guidance on:
- Avoiding turning dangerous capabilities into optimization targets
- Handling test sets or fine-tuning datasets that may be classified and/or dangerous
- Managing and sharing evaluation results (especially on deployed models) that might be classified and/or dangerous
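As a purely hypothetical illustration of what machine-readable reporting of test conditions could look like (none of the field names below are a proposed standard):

```python
# Hypothetical record of test conditions; every field name and value is illustrative.
eval_report = {
    "model": "example-model-2025-01",
    "benchmark": "example-gcr-agent-suite",
    "scaffold": {"framework": "inspect_ai", "version": "x.y", "tools": ["bash", "browser"]},
    "elicitation": {"prompt_variants": 8, "best_of_k": 16, "engineer_hours": 40},
    "sampling": {"pass_at_k": [1, 10], "temperature": 1.0, "samples_per_task": 100},
    "budget": {"max_tokens_per_task": 200_000, "wall_clock_hours_per_task": 8},
    "results": {"public_split": 0.62, "private_split": 0.55, "ci_95": [0.51, 0.59]},
}
```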
4.3 Verifiable model auditing
Current model auditing approaches are insufficient for high-stakes evaluations under strict security requirements. Relying solely on developers’ internal evaluation claims introduces conflicts of interest, as they have incentives to downplay model risks. The alternative — providing external auditors with direct model access — can create security concerns. External evaluators also face practical constraints of limited time and model access, leading to rushed and uncertain evaluation results.[42]
While some AI companies have provided pre-deployment access to third-party evaluators, this cannot be relied on as security requirements grow more stringent.
Although significant breakthroughs would be required, verifiable model auditing techniques could be a better way forward. If developers can prove claims about, or verifiably run externally developed evaluations on, their models without providing direct access, security requirements would present a much smaller barrier to independent verification. However, many questions remain before such methods can be used, including:
- What are the likely costs of and main barriers to scaling different options for verifiable model auditing (e.g., ZKPs, homomorphic encryption, secure enclaves) to frontier models?
- How does this picture change if we vary our trust assumptions or operational constraints, for example assuming that:
- Auditors must verify claims about model attestation, version verification, and output integrity
- Auditors must not know the model architecture
- Labs must not know the evaluations being run
- How can verification handle models that use external tools or retrieval?
- What statistical claims can we make about model properties via sampling? How confident can we be in these claims? (A simple illustrative bound appears after this list.)
- What are the minimum hardware requirements for verifiable auditing at the trillion-parameter scale? What specialized hardware architectures might enable faster verification?
- What are the security trade-offs between different verification approaches for frontier models, including secure enclaves, ZKPs, and homomorphic encryption?
- How might these approaches fail?
- How can we verify that a particular model version was evaluated?
- Which approaches work well for distributed training, pre-training, or post-training?
- How do verification costs compare across different model architectures?
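On the question above about statistical claims from sampling, one simple, well-understood example is an exact binomial upper bound on the rate of a behaviour that is rarely or never observed across n independent samples. The sketch below is illustrative; real audit claims would also have to address non-independence, elicitation quality, and distribution shift.

```python
# One-sided exact (Clopper-Pearson) upper confidence bound on a binomial rate,
# e.g., the rate of a prohibited behaviour observed k times in n sampled runs.
from scipy.stats import beta

def upper_bound(k: int, n: int, confidence: float = 0.95) -> float:
    if k == n:
        return 1.0
    return beta.ppf(confidence, k + 1, n - k)

print(f"{upper_bound(k=0, n=1000):.4f}")  # ~0.0030, consistent with the 'rule of three' (3/n)
```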
4.4 Useful previous work
Work in this area we think is useful includes:
- Open Problems in Technical AI Governance, which surveys and classifies open problems in technical governance.
- Trustless Audits without Revealing Data or Models, which provides a zero-knowledge protocol for model evaluations.
- A Safe Harbor for AI Evaluation and Red Teaming, which calls for legal and technical protections for external AI evaluation and red teaming.
- Zero-knowledge Proof Meets Machine Learning in Verifiability, which surveys ZKP protocols applied to machine learning.
- OpenMined: Privacy-preserving third-party audits on Unreleased Digital Assets with PySyft and Secure Enclaves for AI Evaluation, which summarizes OpenMined’s work on privacy-preserving third-party model evaluations using PySyft.
- Secure Enclaves for AI Evaluation, a blog post explaining OpenMined’s proof-of-concept for using secure enclaves for secure third-party evaluations.
- A Survey of Secure Computation Using Trusted Execution Environments, which examines the architectures, security properties, performance characteristics of, and open research questions for, TEE-based secure computational protocols.
- Verifiable evaluations of machine learning models using zkSNARKs, which uses zkSNARKs to enable verifiable evaluation of ML models’ performance claims while keeping model weights private.
- zkLLM: Zero Knowledge Proofs for Large Language Models, which introduces zkLLM, a specialized zero-knowledge proof system designed for large language models.
- How to audit an AI model owned by someone else, which discusses issues with, and potential solutions for, third-party model auditing.
- Black-Box Access is Insufficient for Rigorous AI Audits, which discusses the benefits to auditors of different levels of access, and ways to mitigate security risks.
- Structured Access for Third-Party Research on Frontier AI Models, which develops a taxonomy of system access and identifies levels of access external researchers need to effectively study frontier AI models.
4.5 Other kinds of proposals
If you have a strong proposal that doesn’t fit this RFP, consider applying to our AI governance RFP, or our technical AI safety RFP.
5. Application process
5.1 Time suggestion
We suggest that you aim to spend no longer than one hour filling out the Expression of Interest (EOI) form, assuming you already have a plan you are excited about. Our application process deliberately starts with an EOI rather than a longer intake form to save time for both applicants and program staff.
5.2 Feedback
We do not plan to provide feedback for EOIs in most instances. We expect a high volume of submissions and want to focus our limited capacity on evaluating the most promising proposals and ensuring applicants hear back from us as promptly as possible.
5.3 Next steps after submitting an EOI
We aim to respond to all applicants within three weeks of receiving their EOI. In some cases we may need additional time to respond, for example if an EOI requires consultation with external advisors who have limited bandwidth, or if we receive an unexpected surge of EOIs when we are low on capacity.
If your EOI is successful, you will then typically be asked to fill out a full proposal form. Assuming you have already figured out the details of what you would like to propose, we expect this to take 2-6 hours to complete, depending on the complexity and scale of your proposal.
Once we receive your full proposal, we’ll aim to respond within three weeks about whether we’ve decided to proceed with a grant investigation (though most applicants will hear back much sooner). If so, we will introduce you to the grant investigator. At this stage, you’ll have the opportunity to clarify and evolve the proposal in dialogue with the grant investigator, and to develop a finalized budget. See this page for more details on the grantmaking process from this stage.
6. Acknowledgments
This RFP text was largely drafted by Catherine Brewer, in collaboration with Alex Lawsen.
We’d like to thank Asher Brass, Ben Garfinkel, Charlie Griffin, Marius Hobbhahn, Geoffrey Irving, Ole Jorgensen, and Kamilė Lukošiūtė[43] for providing useful feedback on drafts at various stages.[44] We’re also grateful to our Open Philanthropy colleagues — particularly Ajeya Cotra, Isabel Juniewicz, Jake Mendel, Max Nadeau, Luca Righetti, and Eli Rose[45] — for valuable discussions and input.
Footnotes
1 | E.g., in Managing AI Risks in an Era of Rapid Progress, a paper co-authored by Bengio, Hinton, and other leading AI researchers, they write “Combined with the ongoing growth and automation in AI R&D, we must take seriously the possibility that generalist AI systems will outperform human abilities across many critical domains within this decade or the next.” OpenAI says that “it’s conceivable that within the next ten years, AI systems will exceed expert skill level in most domains,” Dario Amodei, CEO and co-founder of Anthropic, says “Making AI that is smarter than almost all humans at almost all things…is most likely to happen in 2026-2027,” and in 2023, Anthropic said that “If any of this [their views on AI progress, as stated in that post] is correct, then most or all knowledge work may be automatable in the not-too-distant future.” Demis Hassabis, DeepMind’s CEO and co-founder, believes DeepMind is on track to build AGI by 2030, and Shane Legg, DeepMind’s Chief AGI Scientist, predicts a 50% chance of AGI by 2028. |
2 | For example, ML experts surveyed in 2023 had a 72-year gap between median timelines for fully automating all tasks (50% probability by 2048) and fully automating all occupations (50% probability by 2120). Disagreements about likely AI progress underpinning disagreements about AI risk are discussed in, for example, the International AI Safety Report 2025, which states (pg 102): “The likelihood of active loss of control scenarios, within a given timeframe, depends mainly on two factors. These are: 1. Future capabilities: Will AI systems develop capabilities that, at least in principle, allow them to behave in ways that undermine human control? (Note that the minimum capabilities needed would partly depend on the context in which the system is deployed and on what safeguards are in place.) 2. Use of capabilities: Would some AI systems actually use such capabilities in ways that undermine human control? Because evidence concerning these factors is mixed, experts disagree about the likelihood of active loss of control in the next several years.” |
3 | See e.g., What Are the Real Questions in AI? and What the AI debate is really about. |
4 | For example, at the 2024 AI Seoul Summit, a network of AI Safety Institutes committing to information sharing about capabilities evaluation was established, and 16 AI companies signed frontier safety commitments. See also statements from IDAIS-Beijing – International Dialogues on AI Safety and A Path for Science- and Evidence-based AI Policy. For more context on if-then commitments, see If-Then Commitments for AI Risk Reduction. |
5 | In this RFP, we use the terms “evaluation,” “benchmark,” “test,” and “task” as follows. “Evaluation” refers to the complete process of measuring model performance (including, e.g., test conditions, the benchmark or tasks used, grading metrics, and how you report results). “Benchmark” refers to a standardized set of tasks intended for repeated use in measuring model performance: importantly, benchmarks need not be fully public. A “task” is an individual problem or scenario that a model or agent is asked to solve. “Test” refers to the process of measuring model performance on task(s), or an instance of such a process. We often use “test” for a single instance or protocol (e.g., “pre-deployment test”), whereas “evaluation” encompasses the broader process — including how tasks and protocols are chosen, how performance is measured, and how results are analyzed. |
6 | See, e.g., International AI Safety Report 2025, which states (pg 169): “An ‘evaluation gap’ for safety persists: Despite ongoing progress, current risk assessment and evaluation methods for general-purpose AI systems are immature. Even if a model passes current risk evaluations, it can be unsafe.” For more details, see Dangerous capability tests should be harder, and OpenAI’s CBRN tests seem unclear, where Luca Righetti argues that current capability tests are inadequate: when AIs fail these tests, we can be confident they’re safe, but when they pass, we still can’t be sure they’re dangerous. |
7 | By “post-training enhancements,” we mean techniques that improve model performance after pretraining. This includes methods that modify model weights, such as fine-tuning, and methods that change how the model is used (such as scaffolding, tool use, agent architectures, and prompt engineering). For details, see AI capabilities can be significantly improved without expensive retraining. |
8 | See, e.g., A Path for Science- and Evidence-based AI Policy, a proposal co-authored by Li, Liang, and Song, among others, which states that “our understanding of how these models function and their possible negative impacts on society remains very limited.” The same point is made in the International AI Safety Report 2025 (see, e.g., pg 21, which states “developers still understand little about how their general-purpose AI models operate”), and Managing extreme AI risks amid rapid progress. |
9 | A similar point is made in the International AI Safety Report 2025, which states (pg 181): “The absence of clear risk assessment standards and rigorous evaluations is creating an urgent policy challenge, as AI models are being deployed faster than their risks can be evaluated. Policymakers face two key challenges: 1. internal risk assessments by companies are essential for safety but insufficient for proper oversight, and 2. Complementary third-party and regulatory audits require more resources, expertise and system access than is currently available.” |
10 | We will have a higher bar for for-profit proposals. |
11 | By “global catastrophic risk,” we mean a risk that has the potential to cause harm on an enormous scale (e.g., threaten billions of lives). See Potential Risks from Advanced Artificial Intelligence for more details. |
12 | For example, LAB-Bench, Cybench, and METR’s task families were used in UK AISI’s and US AISI’s pre-deployment testing of Claude 3.5 Sonnet (report) and o1 (report). Other tasks have been used privately. |
13 | By “AI agent” or “agentic AI system,” we mean AI systems capable of pursuing complex goals with limited supervision. Examples include systems which could, e.g., identify and exploit an elite zero-day vulnerability with no instances of human intervention. We borrow this definition (though not the example) from Visibility into AI Agents. |
14 | See, e.g., LAB-Bench, on which Sonnet 3.5 (new) achieved human-level or greater performance in 2 out of 5 categories (pg. 7). |
15 | For example, while Cybench measures a precursor of GCR-level cyber capabilities, it does not directly test for, e.g., the ability to discover elite zero-day vulnerabilities. Similarly, LAB-Bench measures capabilities like designing scientific protocols, reasoning about tables and figures, and performance on multi-step cloning scenarios, but does not directly test LLMs’ ability to provide end-to-end assistance with biology research. |
16 | For evaluation results to provide enough evidence for us to rule in risks, we’ll likely need: more comprehensive testing, harder and more relevant tasks, adversarial testing and other concerted efforts to upper-bound model performance, post-deployment testing, and close collaboration with domain experts to identify risks and design appropriate tests. These criteria may not be sufficient. |
17 | ML has relatively tight feedback loops, requires little physical interaction, and has large amounts of relevant training data compared to other science. Also, AI developers have strong economic incentives to improve their models’ AI R&D capabilities, since this would accelerate AI capabilities in other domains. |
18 | For further discussion of capabilities relevant to undermining human oversight, see Section 2.2.3 (and in particular Table 2.4, pg 104) in International AI Safety Report. For examples of useful work in this category, see, e.g., Scheming reasoning evaluations, and Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs. |
19 | See Chapter 4 of Securing AI Model Weights for a definition of operational capacity categories. |
20 | This list draws upon Karnofsky’s prioritization in A Sketch of Potential Tripwire Capabilities for AI. For other views on what capabilities to test for, see, e.g., Early lessons from evaluating frontier AI systems, IDAIS-Beijing – International Dialogues on AI Safety, and Common Elements of Frontier AI Safety Policies. |
21 | See our discussion of construct validity for more detail on this point. |
22 | By “AI agent” or “agentic AI system,” we mean AI systems capable of pursuing complex goals with limited supervision. Examples include systems that could, e.g., identify and exploit an elite zero-day vulnerability with no instances of human intervention. We borrow this definition (though not the example) from Chan et al. (2024). |
23 | Giving models access to the same environments and tools as human professionals (where possible) helps to mirror tasks. Alternative task designs are acceptable if they are justified by an argument for their construct validity. |
24 | By “post-training enhancements,” we mean techniques that improve model performance after pretraining. This includes methods that modify model weights, such as fine-tuning, and methods that change how the model is used (such as scaffolding, tool use, agent architectures, and prompt engineering). For details, see AI capabilities can be significantly improved without expensive retraining. |
25 | As reported by OpenAI. Note that o3 used significantly more inference compute, was tested by OpenAI, and that OpenAI had access to the FrontierMath benchmark. |
26 | OpenAI recruited experts with PhDs to answer GPQA Diamond questions, and found that they scored 69.7%. |
27 | Mean score achieved by Claude 3.5 Sonnet in Epoch’s testing; see AI Benchmarking Dashboard for details. |
28 | For relevant work here, see Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. We expect there’s significant additional work to be done here. |
29 | We include these as examples of possible design choices, and don’t necessarily endorse them all, though we agree with many. |
30 | By “post-training enhancements,” we mean techniques that improve model performance after pretraining. This includes methods that modify model weights, such as fine-tuning, and methods that change how the model is used (such as scaffolding, tool use, agent architectures, and prompt engineering). For details, see AI capabilities can be significantly improved without expensive retraining. |
31 | See, e.g., AI capabilities can be significantly improved without expensive retraining and AI Benchmarking Dashboard for examples of useful work in this category. |
32 | This question is taken from UK AISI’s “Priority research areas for academic collaborations.” |
33 | This question is taken from UK AISI’s “Priority research areas for academic collaborations.” For relevant work, see, e.g., Burnell et al. (2023). |
34 | This question is taken from UK AISI’s “Priority research areas for academic collaborations.” |
35 | Possible research directions here include threat modelling, building consensus on “red lines” for AI deployment, and assessing the validity, reliability and robustness of different capabilities tests. For more details, see section 3.3 of International AI Safety Report. |
36 | This question is taken from UK AISI’s “Priority research areas for academic collaborations.” For relevant work, see, e.g., Stress-Testing Capability Elicitation With Password-Locked Models. |
37 | We include these as examples of possible design choices, and don’t necessarily endorse them all, though we agree with many. |
38 | This point is made in International AI Safety Report 2025, which states (pg 181): “Rigorous risk assessment requires combining multiple evaluation approaches, significant resources, and better access. Key risk indicators include evaluations of systems themselves, how people apply them, as well as forward-looking threat analysis. For evaluations at the technical frontier to be effective, evaluators need substantial and growing technical ability and expertise. They also need sufficient time and more direct access than is currently available to the models, training data, methodologies used, and company-internal evaluations – but companies developing general-purpose AI typically do not have strong incentives to grant these.” |
39 | By “safety case,” we mean “a structured argument, supported by a body of evidence, that provides a compelling, comprehensible, and valid case that a system is safe for a given application in a given environment.”; we borrow this definition from Safety cases at AISI. For more details and examples, see Safety Cases: How to Justify the Safety of Advanced AI Systems, Three Sketches of ASL-4 Safety Case Components, and Notes on control evaluations for safety cases. |
40 | For a definition of SL-3 and SL-4, see Securing AI Model Weights: Preventing Theft and Misuse of Frontier Models. |
41 | See, e.g., A statistical approach to model evaluations. |
42 | See, e.g., Details about METR’s preliminary evaluation of OpenAI o1-preview, US AISI and UK AISI Joint Pre-Deployment Test, and International AI Safety Report 2025 (pg 189). |
43 | All names are listed alphabetically by last name. |
44 | People having given feedback does not necessarily mean they endorse the final version of this text. |
45 | Again, all names are listed alphabetically by last name. |