[Closed] Request for proposals: benchmarking LLM agents on consequential real-world tasks

Update, 2/5/25: We recently launched a new RFP focused on improving capability evaluations. The new RFP covers three areas: building GCR-relevant benchmarks, advancing the science of evaluations, and improving third-party access infrastructure. While this RFP remains closed, we encourage applications to our new RFP here.

Update, 7/26/24: This RFP has closed after funding $25 million in grants; see this thread for examples of grants we funded.

In the wake of surprisingly rapid progress in large language models (LLMs) like GPT-4, some experts have predicted that agents built from these LLMs (“LLM agents”) will be able to outperform human professionals at virtually all tasks within five to twenty years.^[1]This view is held both among some academic researchers and some leading AI labs. A paper by Bengio, Hinton, and other leading AI researchers says “Combined with the ongoing growth and automation in AI R&D, we must take seriously the possibility that generalist AI systems will outperform human … Continue reading Other experts are skeptical — they argue that narrow benchmarks and anecdotal hype have massively overstated LLMs’ capabilities, and expect the technology to make a modest impact on a few sectors before running up against fundamental limitations.^[2]For example, Yann LeCun has claimed that a “system trained on language alone will never approximate human intelligence”, and that “better [AI systems] will be appearing, but they will be based on different principles [from autoregressive LLMs]”, as “[autoregressive LLMs] will be abandoned … Continue reading

Researchers on all sides of this debate increasingly agree that existing benchmarks for evaluating the capabilities of LLMs are not up to the task of settling this question^[3]Some of the many critiques of current benchmarking practices: Vera Liao argues that instead of using existing benchmarks, we should “center our analysis on how these models will be used in practice” and “[develop] evaluation methods that can provide valid assessments for whether and how much … Continue reading — among other issues, almost all benchmarks simply don’t attempt to measure how far LLM agents can get on the difficult open-ended tasks involved in most human professions.

To help build scientific understanding of the near-term impacts of LLMs, Open Philanthropy is looking to fund benchmarks that measure how close LLM agents can get to performing consequential real-world tasks.

Anyone is eligible to submit a form, including those working in academia, nonprofits, or independently; we are also open to making restricted grants to projects housed within for-profit companies. A grant would include funding for LLM API credits and other forms of compute; we expect grants to be in the range of $0.3-3M over a period of 6 months to 2 years. This RFP is now closed; we are not currently accepting new expressions of interest.

The rest of this page goes into more detail on:

Our motivation for supporting these benchmarks.
What projects are and aren’t eligible.
What makes for a strong proposal.
How the expression of interest process works.

If you want to express your interest, we recommend reading the entire page to maximize your chances of success. We also hosted a webinar to answer questions about this RFP on 11/29/23; the recording is here and the slides are here.

1. Motivation: existing benchmarks cannot settle crucial debates

LLM agents (e.g. OpenAI’s GPTs, AutoGPT, BabyAGI, LangChain, Act-1, natbot, etc) are AI systems built from LLMs that are designed to accomplish open-ended tasks (e.g. finding a suitable house in Zillow or testing and debugging a Python script) by interacting with an external environment over multiple timesteps, typically using tools like a web browser or a terminal command line.

LLM agents are very new and their impact has been limited so far, but well-functioning LLM agents could have much more wide-ranging applications than pure chatbots. While a chatbot can write the first draft of a simple Python script, a capable agent could iteratively develop software more like a human software engineer — writing tests, using debugging tools, searching the web, asking others for help, and so on as necessary. By the same token, agents could pose more extensive risks than chatbots. While a chatbot can attempt to give advice about how to steal someone’s identity, a capable agent could autonomously execute all the necessary steps.

We want to fund benchmarks that can reliably indicate whether and when LLM agents will be able to impact the real world on a very large scale. In particular, we would like to fund benchmarks that can be used to extrapolate whether and when LLM agents will be able to:

Replace or outperform humans in professions which account for a large share of the labor market.
Steal or destroy billions of dollars in economic value, kill thousands of people, develop and proliferate destructive technologies like bioweapons, or cause other types of large-scale destruction if aiming to do so.
Hugely accelerate the pace of technological R&D, especially AI R&D itself, by automating steps in the R&D process that were previously major bottlenecks.

Despite the fact that many experts believe that LLM agents may be able to do these things within years,^[4]A lot of our view comes from informal discussions with many AI researchers who have these views. For a collection of public statements to this effect, see footnote 1, as well as Senate testimony by Yoshua Bengio and Dario Amodei regarding the possibility that LLM-based systems could be used to … Continue reading almost no existing benchmarks are designed to measure progress toward these extreme capabilities:^[5]A similar sentiment is expressed by Shevlane et al. 2023, Model evaluations for extreme risks: “Many AI researchers (and other stakeholders) view extreme risks from AI as an important challenge. In a 2022 survey of AI researchers, 36% of respondents thought that AI systems could plausibly … Continue reading

Most benchmarks (e.g. MMLU, HellaSwag, VQA, SquAD, SuperGLUE, MATH, HotpotQA, etc) are designed to evaluate how accurately an LLM chatbot can answer various self-contained questions,^[6]In the case of VQA, the question makes reference to an image, but the LLM chatbot is still expected to simply respond with an answer, not have a back-and-forth interaction. not how effectively an LLM agent can pursue some open-ended objective.
Of the relatively few benchmarks that are set up to evaluate agents, some evaluate agents in artificial environments like text adventure games or video games (e.g. Hoodwinked or the three “game-grounded” environments in AgentBench). The connection between games and the real world is unclear and debatable.
Most of the rest evaluate agents on tasks like “retrieving simple pieces of information from the web” (e.g. the code-grounded and web-grounded environments in AgentBench, WebShop, Mind2Web, WebArena, etc). These tasks, while realistic, are clearly much more limited than the extreme capabilities some experts are predicting. If the next generation of LLM agents is able to perform such tasks perfectly, it would not necessarily be obvious how close they are to performing more consequential tasks.

Observers with different starting intuitions can draw very different conclusions from the relatively limited existing evidence, and debate doesn’t tend to lead to much agreement.^[7]For example, the late 2022 Existential Risk Persuasion Tournament (XPT), which brought together subject matter experts and forecasters to debate the likelihood of near-term catastrophic risks, found that participants broadly agreed about the rate at which benchmark performance would improve — … Continue reading This is an extraordinary lack of clarity to have about an extraordinarily consequential question — the right approach to take to LLM safety, security, and regulation crucially depends on accurate forecasts about near-term capabilities.^[8]In particular, if near-future LLM agents will outperform humans across the board (as some experts believe), they would likely be capable of causing catastrophic harm if humans instruct them to or if they develop unwanted intrinsic drives from their training process; this could include something as … Continue reading

We hope that having more benchmarks measuring how well current LLM agents perform on very difficult real-world tasks will help researchers come to greater agreement about their near-future capabilities.^[9]There is some preliminary evidence that more difficult and more real-world oriented benchmarks can change people’s minds. While both skeptical and concerned experts in the XPT had the same forecasts about benchmarks like MATH and MMLU, the follow-up AI Adversarial Collaboration (forthcoming) … Continue reading This would hopefully allow policymakers to draw on a robust scientific consensus to shape safety and security requirements commensurate with near-term risks, and give everyone advance notice if LLM agents are poised to broadly outperform humans and/or pose catastrophic risks.

Note that there is a plausible argument that constructing such benchmarks might inadvertently cause harm by accelerating extremely powerful and dangerous AI. We considered this argument, but ultimately felt that the risks, while real, were outweighed by the benefits. See this post for a discussion of the process we followed to assess whether to proceed with this RFP in light of these concerns.

2. Eligibility criteria

We define an LLM agent to be an AI system built by combining a large language model or multi-modal LLM with a program that executes actions selected by the LLM (operating tools like a browser or terminal as needed), relays the resulting outcomes back to the LLM, and performs other functions such as storing the history of actions and observations.

To be eligible for this RFP, your proposed project must create a benchmark designed to measure the capabilities of LLM agents. That is, the output of your project must be:

A suite of tasks which require an LLM agent to pursue an objective by interacting with an external environment (e.g. a virtual machine or the internet). This external environment could include humans — the agent may need to interact with humans or provide instructions to human workers in order to accomplish some or all of the tasks in the task suite.^[10]For example, suppose one task in the suite is “Prepare a nice meal for four people, one of whom is vegetarian, given an $X budget and two hour prep time.” An LLM agent would need to order ingredients online through a site like Instacart (delivered by a human), hire a human sous chef through a … Continue reading
Results from evaluating one or more baseline agents on that task suite.

For example, projects similar to Mind2Web (Deng et al. 2023), WebArena (Zhou et al. 2023), this task suite developed by Model Evaluation and Threats Research (METR),^[11]METR was originally incubated as a team within Alignment Research Center (ARC), which was founded and previously run by Paul Christiano. Paul is married to Ajeya Cotra, who oversees our giving to research that could help clarify risks from AI and is spearheading this RFP. Paul is no longer employed … Continue reading GAIA (Mialon et al. 2023), the “AI Safety Level 3” evaluations listed in an appendix of Anthropic’s responsible scaling policy, and MLAgentBench (Huang et al. 2023) would be eligible.^[12]Note that these are simply illustrative examples of eligible projects; these projects would not all be equally strong candidates. The next section goes into more detail on what we believe makes for a strong proposal. We expect most tasks compatible with the METR Task standard would be eligible.

By contrast, the following types of projects are ineligible for this RFP:

Proposals to develop better designs for LLM agents which are validated on existing benchmarks — for example, proposals similar to ReAct (Yao et al. 2022), Toolformer (Schick et al. 2023), Reflexion (Shinn et al. 2023), and Zhou et al. 2023 would not be eligible for consideration. Some projects we fund through this RFP may incidentally improve on LLM agent design in the course of establishing baselines for the benchmark, but proposals where that is the main focus are not eligible.
Proposals to develop a benchmark and evaluate it on a “raw LLM” producing a text response to a prompt^[13]Note that we may accept proposals containing tasks that theoretically could have been performed by a “raw LLM,” as long as they are in practice evaluated on an agent in the context of the project. For example, MATH was originally introduced as a benchmark for “raw LLMs”, but Zhou et al. … Continue reading — for example, proposals similar to MMLU, HellaSwag, VQA, SquAD, SuperGLUE, MATH, most BIG Bench tasks, LegalBench, and so on are not eligible.
Proposals to measure whether LLM agents will reliably refuse harmful or immoral requests in the face of jailbreaks or prompt injection attacks or similar — for example, proposals similar to HarmBench would not be eligible, even if they were studying the robustness of LLM agents (as opposed to raw LLMs).^[14]Some researchers have argued that measuring adversarial robustness is essential to measuring real-world capabilities, because lack of adversarial robustness is currently limiting LLMs’ commercial viability. We think that this is plausible, but in this RFP, we are seeking proposals for … Continue reading In this RFP, we are looking to measure how competently LLM agents perform real-world tasks when they don’t refuse the request.
Proposals to measure LLMs’ alignment, values, ethics, or fairness — for example, proposals similar to ETHICS, BiasBench, or StereoSet would not be eligible, even if they were studying the values and alignment of LLM agents (as opposed to raw LLMs).
Proposals to apply LLMs to beneficial use cases — for example, proposals to develop LLMs tools to assist people in low- and middle-income countries, augment doctors or therapists, or improve translation services to low-resource languages would not be eligible.
Proposals to study or forecast the real-world impacts of LLMs in broader ways — for example, randomized controlled trials (RCTs) such as GitHub’s Copilot study or this Harvard Business school working paper, projects to elicit expert opinion such as this survey conducted by AI Impacts, or the Existential Risk Persuasion Tournament conducted by the Forecasting Research Institute, would not be eligible.

3. What makes for a strong benchmark

We want to fund benchmarks that allow researchers starting from very different places to come to much greater agreement about whether extreme capabilities and risks are plausible in the near-term. If LLM agents score highly on these benchmarks, a skeptical expert should hopefully become much more open to the possibility that they could soon automate large swathes of important professions and/or pose catastrophic risks. And conversely, if they score poorly, an expert who is highly concerned about imminent catastrophic risk should hopefully reduce their level of concern for the time being.

As we described in the motivation section, we believe almost no existing benchmarks achieve this. In this section, we describe what we will be looking for in proposed benchmarks to evaluate whether they stand a good chance of meeting this ambitious goal. We cover:

The properties we consider highest priority — the “three Cs” (construct validity, consequential tasks, and a continuous scale).
Nice-to-have properties that we think are acceptable to sacrifice if it would mean improving on the three Cs.

3.1 High-priority desiderata: the “three Cs”

We think that a benchmark is more likely to be a successful advance warning signal if:

The tasks in the benchmark have high construct validity.
Additionally, at least some tasks are very consequential.
The benchmark allows for a continuous scale of performance.

3.1.1 Construct validity

“Construct validity”, originally introduced in the context of psychological testing, refers to how closely a measurement reflects the underlying concept it’s meant to measure. For example, suppose we try to measure someone’s well-being by interviewing them and asking them “How satisfied are you with your life?” We can interrogate the construct validity of this measurement: could we get a high score on this measurement even if the person has low well-being (e.g. because they want to claim their life is good due to social desirability bias)? Or vice versa — could we get a low score even if they have high well-being?

In the case of this RFP, the underlying property we want these benchmarks to capture is “How much impact will LLM agents have on the real world in the near future?” Because we highly prioritize construct validity, we are much more likely to fund a benchmark if it consists of tasks that come up in a real-world profession.^[15]Or tasks that would come up in the course of causing real-world harm, such as stealing someone’s identity. Some elements of the LLM agent’s environment may need to be artificially simulated for practicality, efficiency, or safety — but we are more likely to fund a proposed benchmark if deviations from the corresponding real-world environment are kept to a minimum.

We are in principle open to especially well-constructed benchmarks set in an artificial environment (e.g. a video game, text adventure game, or board game). However, we will hold proposals of that form to a higher bar, because we are concerned that such tasks are less likely to have high construct validity — there is no obvious mapping between a certain level of performance on a game like chess and a certain level of real-world impact, and we expect LLM skeptics and LLM bulls to make very different inferences from the same level of performance. Those proposing to create an artificial benchmark should present a compelling argument that it has high construct validity despite its artificiality.

3.1.2 Consequential tasks

We are more likely to fund a proposed benchmark if the most difficult tasks in the task suite are clearly consequential. Ideally, a wide range of researchers would agree that an agent which could solve these difficult tasks would have a massive economic impact and/or pose massive risks if misaligned or misused by humans.

For example, Mustafa Suleyman proposed a new “Turing Test”: simply task an AI system with making a million dollars. This is an extremely consequential task — an LLM agent that can autonomously make a million dollars (e.g. by starting its own online business) is likely to have a massive impact on the economy and pose massive risks.

Similarly, the example tasks listed in Anthropic’s responsible scaling policy and the most difficult couple of tasks in the ARA benchmark are aiming to be consequential (albeit not quite as consequential as Suleyman’s proposal). Anthropic says:

“The purpose of these evaluations is to quantify the risk that a model is capable of accumulating resources (e.g. through fraud), navigating computer systems, devising and executing coherent strategies, and surviving in the real world while avoiding being shut down. The tasks will be chosen to be at a difficulty level that a domain expert (not world-class) human could complete each one in roughly 2–8 hours.”

In general, we are looking to fund benchmarks where the most difficult task(s) in the task suite are as consequential as the ARA and Anthropic tasks, or more so. In other words, we are looking to fund benchmarks containing one or more difficult task(s) that would take a typical human domain expert (in a relatively widely-held and lucrative profession) a few hours or more to perform.^[16]With that said, depending on the amount and quality of interest we receive, we may end up funding some benchmarks consisting entirely of tasks which are less consequential than this threshold. With that said, we expect most tasks in the suite to be shorter and simpler than the most consequential task(s).

3.1.3 Continuous scale of performance

We are more likely to fund a benchmark if we believe performance on that benchmark is likely to improve relatively continuously as LLMs and agent architectures improve. We are unsure how to achieve this goal in practice — several LLM benchmarks have jumped quickly from very poor (<10%) to fairly strong (>50%) performance recently, and it is difficult to rule out the same happening for these agent benchmarks.^[17]As Bowman 2023 says: “Scaling laws generally only predict a model’s pretraining test loss…. [It] is largely not possible to predict when models will start to show specific skills or become capable of specific tasks […] Often, a model can fail at some task consistently, but a new model … Continue reading With that said, below we have brainstormed some measures that we think would (all else equal) help create a more continuous scale of performance:

Assigning “partial credit” to an attempt on an individual task, rather than scoring each task as “correct” or “incorrect.”
- The rubric that Kinniment et al. 2023 used to score their tasks allowed for three levels of completion (“Complete,” “Incomplete,” and “Partially complete,” where the criterion for the latter was predefined and specific to each task). But agents took hundreds of individual actions in the course of attempting to complete the more difficult tasks in their suite.^[18]For example, in the course of attempting to phish a volunteer target, an agent built on GPT-4 autonomously did online research on the target, composed a reasonable phishing email, wrote a (visually unconvincing) webpage, and did various other subtasks. Given the rich data this generates, it seems possible to create more fine-grained partial credit measures.
- For example, we could ask domain experts to subjectively rate how “close” the agent got from 0 to 100, or we could instruct a human to intervene to help the agent under certain circumstances and measure how many instances of help were required, or we could start the agent from a state where a task has been partially completed and record whether it succeeds from that point, and so on.
Creating a range of tasks from simple (e.g. “modify the behavior of a feature in a small codebase”) to very difficult and consequential (e.g. “refactor a large codebase”), with many intermediate tasks of increasing complexity in between.
- Simpler tasks could be built to mimic difficult (or not-so-difficult) subtasks that crop up in the course of doing a longer and more consequential task. For example, simpler subtasks that cropped up in the phishing task described above include “researching information about someone online,” “composing a phishing email,” “cloning a Harvard login page,” and more.
- Simpler tasks could also be constructed with reference to human capabilities. If the most consequential task in the task suite was designed to take a human domain expert five hours to complete, simpler tasks could be designed to take shorter amounts of time and/or require less domain expertise.
Choosing tasks with a natural quantifiable performance measure. For example, most of the MLAgentBench (Huang et al. 2023) tasks require the LLM agent to improve the performance of a smaller model or other piece of software. There is often no obvious hard bound on how much performance could be improved — the more competent the agent is, the more performance improvements it can eke out. Similarly, Huang et al. also record the amount of time and other resources used by the agent in the course of attempting a task; this quantity can also increase as the agent’s capabilities improve.

3.2 Nice-to-have properties that are acceptable to sacrifice

We prefer to fund benchmarks that score highly on the three Cs but make sacrifices on other nice-to-have properties, such as:

Being cheap and easy to set up and run — we are open to benchmarks containing individual tasks that take hours to perform, cost hundreds or thousands of dollars to run, or involve interacting with or providing instructions to humans.
Being cheap and objective to score — we are open to benchmarks that cannot be graded by objective hard-coded metrics, and instead require human labor and judgment to grade.
Being exactly reproducible — we are open to benchmarks where running the same agent on the same task can lead to different outcomes on different runs, for example because the task involves interacting with humans or other stochastic elements of the external world.
Having many individual tasks — while a benchmark like WebArena (Zhou et al. 2023) contains 812 individual tasks (e.g. “Tell me the closest cafe(s) to CMU Hunt library” or “Cancel order 3021″), we are open to benchmarks containing only ten or twenty individual tasks, because it’s likely to be difficult to formulate and run a large number of highly consequential tasks.

Other things being equal, all of these properties would make a benchmark more valuable. However, we expect there to be a fundamental tension between achieving these properties and scoring highly on the three Cs, and we prioritize the latter.

Applicants may find it useful to review the resources for autonomous capability evaluation produced by METR (Model Evaluation and Threats Research). These include an example task suite, software tools, and details of METR’s task standard, which is designed to enable tasks to be easily shared between different research groups, and is currently being used by the UK’s AI safety Institute. These resources (particularly the task standard) may make it easier to achieve these nice-to-have properties, particularly for those who are new to developing tasks for LLM agents.

4. Expression of interest process and other logistics

This RFP is currently closed, and we aren’t accepting new expressions of interest. If we re-open it, we will update this page with application materials.

5. Acknowledgements

This RFP text was largely drafted by Ajeya Cotra, in collaboration with Max Nadeau and Isabel Juniewicz; Tom Davidson and Javier Prieto also contributed to the initiative by formulating ideas and investigating potential grants.

We’d like to thank Sanjeev Arora, Beth Barnes, Paul Christiano, Dean Edelman, Ryan Greenblatt, Tatsunori Hashimoto, Ezra Karger, Percy Liang, Tao Lin, Julian Michael, Rohin Shah, Helen Toner, and Florian Tramer^[19]All names are listed alphabetically by last name. for providing useful external feedback on an earlier draft of this RFP, as well as several Open Philanthropy colleagues (notably Asya Bergal, Aaron Gertler, and Mike Levine) who also provided feedback. We’d also like to thank several others for discussions that helped shape this RFP, especially Nicholas Carlini, Davis Foote, Daniel Kokotajlo, Jonathan Mann, Alec Stapp, Caleb Watney, and Bill Zito.^[20]Again, all names are listed alphabetically by last name.

Footnotes[+]Footnotes[−]

Footnotes
1	This view is held both among some academic researchers and some leading AI labs. A paper by Bengio, Hinton, and other leading AI researchers says “Combined with the ongoing growth and automation in AI R&D, we must take seriously the possibility that generalist AI systems will outperform human abilities across many critical domains within this decade or the next”. OpenAI says that “it’s conceivable that within the next ten years, AI systems will exceed expert skill level in most domains”. Anthropic says that “If any of this is correct, then most or all knowledge work may be automatable in the not-too-distant future.” While the linked resources don’t explicitly mention LLM agents, we believe that in most cases they are referring to similar technologies (based on conversations with many of the relevant researchers). Note that our conception of LLM agents is meant to include agents built from multi-modal LLMs (such as ChatGPT with vision).
2	For example, Yann LeCun has claimed that a “system trained on language alone will never approximate human intelligence”, and that “better [AI systems] will be appearing, but they will be based on different principles [from autoregressive LLMs]”, as “[autoregressive LLMs] will be abandoned within a few years”. Additionally, In August 2023, Andrew Ng predicted that human-level AI was 30-50 years away, based on the limitations of LLMs relative to human brains. Gary Marcus has gone further, claiming that achieving human-level AI will likely (80+%) require a paradigm shift beyond using deep learning alone.
3	Some of the many critiques of current benchmarking practices: Vera Liao argues that instead of using existing benchmarks, we should “center our analysis on how these models will be used in practice” and “[develop] evaluation methods that can provide valid assessments for whether and how much human needs in downstream use cases can be satisfied.” Melanie Mitchell discusses issues with current measurements of AI capabilities, including data contamination, brittleness to question rephrasing, and test questions with hidden giveaways, arguing that models’ test scores overstate their capabilities. She concludes that “designing methods to properly assess their intelligence—and associated capabilities and limitations—is an urgent matter.” A paper from researchers at GDM, OpenAI, Anthropic, Oxford, Cambridge, and other institutions argues that “very few existing model evaluations intentionally target risks on [an extreme] scale”, including risks from political influence, weapons acquisition, and self-proliferation, despite the plausibility of these risks. Inflection AI’s Mustafa Suleyman argues that the Turing Test is flawed, as it has been passed without the underlying technology creating dramatic real-world impacts. He proposes that a better benchmark for worldchanging capabilities will be when an AI can autonomously turn $100k into $1mil. This RFP emphasizes realism of the evaluation task and magnitude of the impact of the measured capability, focusing less explicitly on issues like data contamination. That said, a data-contaminated benchmark would be quite problematic as well.
4	A lot of our view comes from informal discussions with many AI researchers who have these views. For a collection of public statements to this effect, see footnote 1, as well as Senate testimony by Yoshua Bengio and Dario Amodei regarding the possibility that LLM-based systems could be used to create bioweapons.
5	A similar sentiment is expressed by Shevlane et al. 2023, Model evaluations for extreme risks: “Many AI researchers (and other stakeholders) view extreme risks from AI as an important challenge. In a 2022 survey of AI researchers, 36% of respondents thought that AI systems could plausibly “cause a catastrophe this century that is at least as bad as an all-out nuclear war” (Michael et al., 2022). However, very few existing model evaluations intentionally target risks on this scale.”
6	In the case of VQA, the question makes reference to an image, but the LLM chatbot is still expected to simply respond with an answer, not have a back-and-forth interaction.
7	For example, the late 2022 Existential Risk Persuasion Tournament (XPT), which brought together subject matter experts and forecasters to debate the likelihood of near-term catastrophic risks, found that participants broadly agreed about the rate at which benchmark performance would improve — despite disagreeing sharply about the likelihood of catastrophic risks from AI in the near term. The most concerned third of participants had a median extinction risk from AI by 2100 of 7.5%, while the least concerned third had a median risk of 0.01% (pg 34-35). Both groups expected the top performance by 2030 on MATH to be about 87% and on MMLU to be about 89% (pg 43).
8	In particular, if near-future LLM agents will outperform humans across the board (as some experts believe), they would likely be capable of causing catastrophic harm if humans instruct them to or if they develop unwanted intrinsic drives from their training process; this could include something as severe as human extinction. If this is a realistic risk, controlling the development and use of powerful LLMs should be a top international priority analogous to limiting the proliferation of nuclear weapons — very stringent restrictions, up to and including a thoroughgoing globally-enforced ban on frontier LLM development and upstream technologies such as frontier hardware, could be warranted. On the other hand, if LLM capabilities are massively overhyped and concerns about catastrophic risks are implausible science fiction stories, such aggressive and costly regulatory measures would be a huge overreach. And if the truth is somewhere in between — if LLM agents will develop some capabilities that make a big impact or pose big risks, while stopping short of automating human professions across the board — then understanding the specific details could suggest better-tailored safety and policy interventions that minimize the (more moderate) risks and allow the benefits to be realized.
9	There is some preliminary evidence that more difficult and more real-world oriented benchmarks can change people’s minds. While both skeptical and concerned experts in the XPT had the same forecasts about benchmarks like MATH and MMLU, the follow-up AI Adversarial Collaboration (forthcoming) found that this question generated sharp disagreement: “Will the ARC Evals project, or a project with similar objectives, conclude that state-of-the-art AI models have the ability to (1) devise effective strategies for autonomous replication, (2) effectively acquire resources, and (3) successfully evade being deactivated when operating ‘in the wild’ by 2030?” Here, the concerned group had a median forecast of 28.3% and the skeptical group had a median of 1%. The skeptical group updates towards being more concerned if an organization concludes that a model is capable of ARA (autonomous replication and adaptation) before 2030.
10	For example, suppose one task in the suite is “Prepare a nice meal for four people, one of whom is vegetarian, given an $X budget and two hour prep time.” An LLM agent would need to order ingredients online through a site like Instacart (delivered by a human), hire a human sous chef through a site like TaskRabbit, and instruct the sous chef on what steps they should take in what order in order to be able to prepare the meal in the allotted time. Traditional benchmarks would not include this kind of messy, not-fully-objective task, but we are excited about supporting benchmarks that are more realistic, even if they come at the cost of rigor.
11	METR was originally incubated as a team within Alignment Research Center (ARC), which was founded and previously run by Paul Christiano. Paul is married to Ajeya Cotra, who oversees our giving to research that could help clarify risks from AI and is spearheading this RFP. Paul is no longer employed by ARC or METR.
12	Note that these are simply illustrative examples of eligible projects; these projects would not all be equally strong candidates. The next section goes into more detail on what we believe makes for a strong proposal.
13	Note that we may accept proposals containing tasks that theoretically could have been performed by a “raw LLM,” as long as they are in practice evaluated on an agent in the context of the project. For example, MATH was originally introduced as a benchmark for “raw LLMs”, but Zhou et al. 2023 (linked above) evaluates an LLM agent with access to a code interpreter on MATH. A novel mathematics benchmark in the same vein as MATH would be eligible as long as the baseline performance is set by an LLM agent.
14	Some researchers have argued that measuring adversarial robustness is essential to measuring real-world capabilities, because lack of adversarial robustness is currently limiting LLMs’ commercial viability. We think that this is plausible, but in this RFP, we are seeking proposals for specific consequential real-world tasks that may lean on a certain kind of adversarial robustness as a sub-skill, e.g. “make money in online crypto trading.” We are not looking to measure adversarial robustness per se.
15	Or tasks that would come up in the course of causing real-world harm, such as stealing someone’s identity.
16	With that said, depending on the amount and quality of interest we receive, we may end up funding some benchmarks consisting entirely of tasks which are less consequential than this threshold.
17	As Bowman 2023 says: “Scaling laws generally only predict a model’s pretraining test loss…. [It] is largely not possible to predict when models will start to show specific skills or become capable of specific tasks […] Often, a model can fail at some task consistently, but a new model trained in the same way at five or ten times the scale will do well at that task.”
18	For example, in the course of attempting to phish a volunteer target, an agent built on GPT-4 autonomously did online research on the target, composed a reasonable phishing email, wrote a (visually unconvincing) webpage, and did various other subtasks.
19	All names are listed alphabetically by last name.
20	Again, all names are listed alphabetically by last name.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.