Update, 5/3/24: This RFP is now on hiatus.
Update, 4/5/24: We have replaced the request for a proposal PDF with a much shorter expression of interest (EOI) form [no longer active]. Submissions should be 300 words at most. Additionally, we will put this RFP on hiatus starting on May 3rd, 2024 in order to focus on other priorities.
There is no expert consensus about what systems built from LLMs will and won’t be capable of in the next few years. We think it’s important to change this, because the right approach to take to policy and safety depends crucially on what real-world impacts these systems could have in the near term.
To this end, in addition to our request for proposals to create benchmarks for LLM agents, we are also seeking proposals for a wide variety of research projects which might shed light on what real-world impacts LLM systems could have over the next few years.
Anyone is eligible to submit a form, including those working in academia, nonprofits, or independently; we are also open to restricted grants to projects housed within for-profit companies.[1] If applicable, we would include funding for LLM API credits and other forms of compute.
We encourage you to consider the LLM agent benchmarks RFP instead of this one if you have a project idea that is an appropriate fit for it. This RFP is considerably broader than the other one — projects could span a wide range of fields and methodologies — but we also expect that proposals will vary more widely in how effectively they advance Open Philanthropy’s priorities in this space. As such, we are more likely to reject proposals coming through this RFP, and may take more time to investigate the proposals that we do fund. With that said, we can certainly imagine a number of highly impactful research projects in this space which don’t fit the mold of an LLM agent benchmark; we brainstorm some ideas below.
1. Example projects
Below are some examples of project ideas that could make for a strong proposal to this RFP, depending on details:
- Conducting randomized controlled trials to measure the extent to which access to LLM products can increase human productivity on real-world tasks. For example:
- GitHub released a study in mid-2022 which found that having access to GitHub Copilot halved the time that programmers needed to write an HTTP server in JavaScript (from ~2 hours to ~1 hour).
- Fabrizio Dell’Acqua and others from Harvard Business School released a working paper in September 2023 which found that consultants with access to GPT-4 completed many tasks significantly more quickly and at a higher level of quality, but completed some tasks at a lower level of quality, compared to a control group of consultants working on their own.
- Polling members of the public about whether and how much they use LLM products, what tasks they use them for, and how useful they find them to be. We have seen informal surveys from e.g. Business.com and Fishbowl, but so far haven’t seen rigorous polls with random sampling. We would be especially interested in user surveys that conduct deeper interviews about the types of tasks LLM products are helpful and unhelpful with. We’d also be interested in surveys targeted at understanding certain important use cases, e.g. the use of LLM agents to automate software development or AI research.
- In-depth interviews with people working on deploying LLM agents in the real world. There are multiple relevant parts of the AI value chain here, including the product teams of AI labs, organizations helping companies integrate AI into their workflows, VCs that focus on AI, and companies using AI. This ecosystem should contain a wealth of knowledge about the use cases, productivity benefits, and limitations of LLM agents.
- Collecting “in the wild” case studies of LLM use, for example by scraping Reddit (e.g. r/chatGPT), asking people to submit case studies to a dedicated database, or even partnering with a company to systematically collect examples from consenting customers. While there are a lot of individual case studies on the internet, we are not aware of existing work that collects and analyzes them. Even though they will not constitute a representative sample, seeing thousands of case studies of people attempting to use LLMs in the course of real jobs could be helpful for understanding qualitative patterns of language model strengths and weaknesses.
- Estimating and collecting key numbers into one convenient place to support analysis. For example, HELM evaluates a wide variety of language models on a wide variety of existing benchmarks, and Papers with Code also provides a similar reference. Epoch similarly estimates or collects numbers such as hardware price performance, spending on large training runs, parameter count and FLOP/s of notable ML models, etc. We would be interested in similar data estimation and collection efforts for:
- Key AI-specific economic indicators, such as revenues of LLM products, valuations of LLM-exposed companies, number of users of LLM products, etc., as well as key statistics about the AI supply chain – R&D and CapEx spending throughout the supply chain, AI accelerator production and aggregation, data center construction, etc.
- Broader economic indicators that would capture the effects of AI on the wider economy, and the anticipation of those effects, such as real interest rates.
- Performance on LLM agent benchmarks such as ARA, MLAgentBench, or forthcoming projects funded by our LLM agent benchmarks RFP.
- Creating interactive experiences that allow people to directly make and test their guesses about what LLMs can do,[2] such as Cameron Jones’ Turing Test game, Nicholas Carlini’s forecasting game,[3] and Joel Eriksson’s Parrot Chess — or enable people to more concretely understand AI progress as models grow in scale, such as Sage + FAR AI’s comparative demos.
- Eliciting expert forecasts about what LLM systems are likely to be able to do in the near future and what risks they might pose, either via a survey such as AI Impacts’ 2022 survey or via a forecasting competition such as the Existential Risk Persuasion Tournament by the Forecasting Research Institute. We are especially interested in conditional forecasts, which ask about real-world impacts and risks conditional on certain levels of benchmark performance.
- Synthesizing, summarizing, and analyzing the various existing lines of evidence about what language model systems can and can’t do at present (benchmark evaluations, deployed commercial uses, qualitative case studies, etc.) and what they might be able to do soon (extrapolations of scaling behavior, market projections, expert surveys, etc.) to arrive at an overall judgment about what LLM systems are likely to be able to do in the near term. There are existing overviews of the AI field, such as the AI 100 report or market reports like this from McKinsey, as well as occasional news articles like this recent one from TIME. We would be most excited about a systematic, frequently-updated qualitative overview which is narrowly focused on the capabilities of systems built out of LLMs (and multi-modal LLMs). For example, this article in Asterisk Magazine by forecaster Jonathan Mann reaches a bottom-line conclusion on the likelihood of LLM-based systems replacing tech jobs.
The motivation section of the benchmark RFP goes into more detail on why we are interested in better understanding and forecasting the capabilities of LLM systems, with a special focus on autonomous agents built from LLMs. We strongly encourage you to read that section (and generally review the text of the other RFP) to maximize your chances of submitting a strong proposal.
2. Expression of interest process and other logistics
This RFP is currently on hiatus, and we aren’t accepting new expressions of interest. If we re-open it, we will update this page with application materials.
3. Acknowledgements
This RFP text was largely drafted by Ajeya Cotra, in collaboration with Isabel Juniewicz and Tom Davidson; Javier Prieto also contributed to the initiative by formulating ideas and investigating potential grants. We’d like to thank Ezra Karger, Jonathan Mann, Helen Toner, and others for discussions that helped shape this RFP.[4]
Footnotes
1. We occasionally make restricted grants to research projects conducted within for-profit companies, but it is legally and logistically more challenging to make grants to for-profit organizations, and logistics processing may be delayed for such grants.
2. As a side effect, this can create datasets of interactions that researchers can later analyze.
3. Nicholas is open to sharing code for this game with someone who would extend and maintain it; you can indicate in your EOI whether you would be interested in that.
4. Names are listed alphabetically by last name.