Our grantmaking decisions rely crucially on our uncertain, subjective judgments — about the quality of some body of evidence, about the capabilities of our grantees, about what will happen if we make a certain grant, about what will happen if we don’t make that grant, and so on.
In some cases, we need to make judgments about relatively tangible outcomes in the relatively near future, as when we have supported campaigning work for criminal justice reform. In others, our work relies on speculative forecasts about the much longer term, as for example with potential risks from advanced artificial intelligence. We often try to quantify our judgments in the form of probabilities — for example, the former link estimates a 20% chance of success for a particular campaign, while the latter estimates a 10% chance that a particular sort of technology will be developed in the next 20 years.
We think it’s important to improve the accuracy of our judgments and forecasts if we can. I’ve been working on a project to explore whether there is good research on the general question of how to make good and accurate forecasts, and/or specialists in this topic who might help us do so. Some preliminary thoughts follow.
In brief:
- There is a relatively thin literature on the science of forecasting.1 It seems to me that its findings so far are substantive and helpful, and that more research in this area could be promising.
- This literature recommends a small set of “best practices” for making accurate forecasts, which we are thinking about how to incorporate into our process. It seems to me that these “best practices” are likely to be useful, and that they are used surprisingly rarely given their apparent value.
- In one case, we are contracting to build a simple online application for credence calibration training: training the user to accurately determine how confident they should be in an opinion, and to express this confidence in a consistent and quantified way. I consider this a very useful skill across a wide variety of domains, and one that (it seems) can be learned with just a few hours of training. (Update: This calibration training app is now available.)
I first discuss the last of these points (credence calibration training), since I think it is a good introduction to the kinds of tangible things one can do to improve forecasting ability.
1. Calibration training
An important component of accuracy is called “calibration.” If you are “well-calibrated,” then statements (including predictions) you make with 30% confidence are true about 30% of the time, statements you make with 70% confidence are true about 70% of the time, and so on.
Without training, most people are not well-calibrated, but instead overconfident. Statements they make with 90% confidence might be true only 70% of the time, and statements they make with 75% confidence might be true only 60% of the time.2 But it is possible to “practice” calibration by assigning probabilities to factual statements, then checking whether the statements are true, and tracking one’s performance over time. In a few hours, one can practice on hundreds of questions and discover patterns like “When I’m 80% confident, I’m right only 65% of the time; maybe I should adjust so that I report 65% for the level of internally-experienced confidence I previously associated with 80%.”
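To make this kind of practice concrete, here is a minimal sketch (in Python, with made-up practice data) of what tracking one’s own calibration can look like: record the stated confidence and the outcome for each answer, then compare each confidence level against the observed hit rate.

```python
from collections import defaultdict

# Hypothetical practice log: (stated confidence, whether the statement was true).
practice_log = [
    (0.9, True), (0.9, False), (0.8, True), (0.8, True), (0.8, False),
    (0.7, True), (0.6, False), (0.6, True), (0.5, False), (0.5, True),
]

# Group answers by stated confidence level.
buckets = defaultdict(list)
for confidence, correct in practice_log:
    buckets[confidence].append(correct)

# For each confidence level, compare stated confidence to the observed hit rate.
for confidence in sorted(buckets):
    outcomes = buckets[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"Stated {confidence:.0%} -> correct {hit_rate:.0%} ({len(outcomes)} answers)")
```

A table like the one this prints makes patterns such as “when I say 80%, I’m right only 65% of the time” easy to spot after a few hundred practice questions.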
I recently attended a calibration training webinar run by Hubbard Decision Research, which was essentially an abbreviated version of the classic calibration training exercise described in Lichtenstein & Fischhoff (1980). It was also attended by two participants from other organizations, who did not seem to be familiar with the idea of calibration and, as expected, were grossly overconfident on the first set of questions.3 But, as the training continued, their scores on the question sets began to improve until, on the final question set, they both achieved perfect calibration.
For me, this was somewhat inspiring to watch. It isn’t often that a cognitive skill as useful and domain-general as probability calibration can be trained, with such dramatic and objectively measured improvement, in so short a time.
The research I’ve reviewed broadly supports this impression. For example:
- Rieber (2004) lists “training for calibration feedback” as his first recommendation for improving calibration, and summarizes a number of studies indicating both short- and long-term improvements in calibration.4 In particular, decades ago, Royal Dutch Shell began to provide calibration training for its geologists, who are now (reportedly) quite well-calibrated when forecasting which sites will produce oil.5
- Since 2001, Hubbard Decision Research has trained over 1,000 people across a variety of industries. Analyzing the data from these participants, Doug Hubbard reports that 80% of people achieve perfect calibration (on trivia questions) after just a few hours of training. He also claims that, according to his data and at least one controlled (but not randomized) trial, this training predicts subsequent real-world forecasting success.6
I should note that calibration isn’t sufficient by itself for good forecasting. For example, you can be well-calibrated on a set of true/false statements, for which about half the statements happen to be true, simply by responding “True, with 50% confidence” to every statement. This performance would be well-calibrated but not very informative. Ideally, an expert would assign high confidence to statements that are likely to be true, and low confidence to statements that are unlikely to be true. An expert that can do so is not just well-calibrated, but also exhibits good “resolution” (sometimes called “discrimination”). If we combine calibration and resolution, we arrive at a measure of accuracy called a “proper scoring rule.”7 The calibration trainings described above sometimes involve proper scoring rules, and likely train people to be well-calibrated while exhibiting at least some resolution, though the main benefit they seem to have (based on the research and my observations) pertains to calibration specifically.
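For readers who want to see how these pieces fit together numerically, here is a rough sketch, using made-up forecasts, of the Brier score (one standard proper scoring rule) and its classic Murphy decomposition into a calibration (“reliability”) term, a resolution term, and a base-rate (“uncertainty”) term. Nothing here is specific to any particular training program.

```python
from collections import defaultdict

# Hypothetical forecasts: (stated probability that a statement is true, actual outcome 0/1).
forecasts = [
    (0.9, 1), (0.9, 1), (0.8, 1), (0.8, 0), (0.7, 1),
    (0.6, 1), (0.6, 0), (0.4, 0), (0.3, 0), (0.2, 0),
]

n = len(forecasts)
outcomes = [o for _, o in forecasts]

# Brier score: mean squared error of the stated probabilities (lower is better).
brier = sum((p - o) ** 2 for p, o in forecasts) / n

# Murphy decomposition: Brier = reliability - resolution + uncertainty.
# Group forecasts into bins by the stated probability.
bins = defaultdict(list)
for p, o in forecasts:
    bins[p].append(o)

base_rate = sum(outcomes) / n
reliability = sum(len(v) * (p - sum(v) / len(v)) ** 2 for p, v in bins.items()) / n
resolution = sum(len(v) * (sum(v) / len(v) - base_rate) ** 2 for v in bins.values()) / n
uncertainty = base_rate * (1 - base_rate)

print(f"Brier score : {brier:.3f}")
print(f"Reliability : {reliability:.3f}  (miscalibration; lower is better)")
print(f"Resolution  : {resolution:.3f}  (discrimination; higher is better)")
print(f"Uncertainty : {uncertainty:.3f}  (base-rate term; fixed by the questions)")
print(f"Check       : {reliability - resolution + uncertainty:.3f}  (equals the Brier score)")
```

The “True, with 50% confidence” strategy described above would show up here as zero reliability penalty but also zero resolution, which is exactly why calibration alone isn’t enough.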
The primary source of my own earlier calibration training was a game intended to automate the process. The Open Philanthropy Project is now working with developers to create a more extensive calibration training game for our staff; we will also make the game available to the public.
2. Further advice for improving judgment accuracy
Below I list some common advice for improving judgment and forecasting accuracy (in the absence of strong causal models or much statistical data) that has at least some support in the academic literature, and which I find intuitively likely to be helpful.8
- Train probabilistic reasoning: In one especially compelling study (Chang et al. 2016), a single hour of training in probabilistic reasoning noticeably improved forecasting accuracy.9 Similar training has improved judgmental accuracy in some earlier studies,10 and is sometimes included in calibration training.11
- Incentivize accuracy: In many domains, incentives for accuracy are overwhelmed by stronger incentives for other things, such as appearing confident, being entertaining, or signaling group loyalty. Some studies suggest that accuracy can be improved merely by providing sufficiently strong incentives for accuracy, such as money or the approval of peers.12
- Think of alternatives: Some studies suggest that judgmental accuracy can be improved by prompting subjects to consider alternate hypotheses.13
- Decompose the problem: Another common recommendation is to break each problem into easier-to-estimate sub-problems (a toy sketch of this appears after this list).14
- Combine multiple judgments: Often, a weighted (and sometimes “extremized”15) combination of multiple subjects’ judgments outperforms the judgments of any one person (a second sketch after this list illustrates this).16
- Correlates of judgmental accuracy: According to some of the most compelling studies on forecasting accuracy I’ve seen,17 correlates of good forecasting ability include “thinking like a fox” (i.e. eschewing grand theories for attention to lots of messy details), strong domain knowledge, general cognitive ability, and high scores on “need for cognition,” “actively open-minded thinking,” and “cognitive reflection” scales.
- Prediction markets: I’ve seen it argued, and find it intuitive, that an organization might improve forecasting accuracy by using prediction markets. I haven’t studied the performance of prediction markets yet.
- Learn a lot about the phenomena you want to forecast: This one probably sounds obvious, but I think it’s important to flag, to avoid leaving the impression that forecasting ability is more cross-domain/generalizable than it is. Several studies suggest that accuracy can be boosted by having (or acquiring) domain expertise. A commonly-held hypothesis, which I find intuitively plausible, is that calibration training is especially helpful for improving calibration, and that domain expertise is helpful for improving resolution.18
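As a toy illustration of the “decompose the problem” item above, the sketch below estimates the chance that a multi-step effort succeeds by estimating each step separately. Every number is a made-up placeholder, and the steps are treated as independent purely for simplicity.

```python
# Toy decomposition: instead of directly estimating "will this campaign succeed?",
# estimate each necessary step separately and combine the estimates.
# All probabilities below are illustrative placeholders, assumed independent.
steps = {
    "bill is introduced this session": 0.8,
    "bill passes committee":           0.5,
    "bill passes a floor vote":        0.6,
    "bill is signed into law":         0.9,
}

overall = 1.0
for step, p in steps.items():
    overall *= p
    print(f"{step}: {p:.0%}")

print(f"Overall chance (if the steps were independent): {overall:.0%}")
```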
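And as a toy illustration of the “combine multiple judgments” item, the sketch below averages several forecasters’ probabilities in log-odds space and then “extremizes” the result, i.e. pushes it away from 50%. The weights and the extremizing parameter here are illustrative choices, not recommendations from the literature.

```python
import math

def extremized_average(probabilities, weights=None, a=2.0):
    """Combine several forecasters' probabilities for the same event.

    Averages the forecasts in log-odds space (optionally weighted), then
    'extremizes' by multiplying the average log-odds by a > 1, which pushes
    the combined forecast further from 50%.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    # Weighted mean of log-odds.
    mean_log_odds = sum(
        w * math.log(p / (1 - p)) for w, p in zip(weights, probabilities)
    ) / total
    # Extremize and convert back to a probability.
    return 1 / (1 + math.exp(-a * mean_log_odds))

# Three hypothetical forecasters, with the second given double weight.
print(f"{extremized_average([0.6, 0.7, 0.65], weights=[1, 2, 1], a=2.0):.0%}")
```

With these inputs the combined forecast lands around 80%, noticeably more extreme than any individual forecast, which is the intended effect when the forecasters’ information is partly independent.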
Another interesting takeaway from the forecasting literature is the degree to which – and consistency with which – some experts exhibit better accuracy than others. For example, tournament-level bridge players tend to show reliably good accuracy, whereas TV pundits, political scientists, and professional futurists seem not to.19 A famous recent result in comparative real-world accuracy comes from a series of IARPA forecasting tournaments, in which ordinary people competed with each other and with professional intelligence analysts (who also had access to expensively-collected classified information) to forecast geopolitical events. As reported in Tetlock & Gardner’s Superforecasting, forecasts made by combining (in a certain way) the forecasts of the best-performing ordinary people were (repeatedly) more accurate than those of the trained intelligence analysts.
3. How commonly do people seek to improve the accuracy of their subjective judgments?
Certainly many organizations, from financial institutions (e.g. see Fabozzi 2012) to sports teams (e.g. see Moneyball), use sophisticated quantitative models to improve the accuracy of their estimates. But the question I’m asking here is: In the absence of strong models and/or good data, when decision-makers must rely almost entirely on human subjective judgment, how common is it for those decision-makers to explicitly invest substantial effort into improving the (objectively-measured) accuracy of those subjective judgments?
Overall, my impression is that the answer to this question is “Somewhat rarely, in most industries, even though the techniques listed above are well-known to experts in judgment and forecasting accuracy.”
Why do I think that? It’s difficult to get good evidence on this question, but I provide some data points in a footnote.20
4. Ideas we’re exploring to improve accuracy for GiveWell and Open Philanthropy Project staff
Below is a list of activities, aimed at improving the accuracy of our judgments and forecasts, that are either ongoing, under development, or under consideration at GiveWell and the Open Philanthropy Project:
- As noted above, we have contracted a team of software developers to create a calibration training web/phone application for staff and public use. (Update: This calibration training app is now available.)
- We encourage staff to participate in prediction markets and forecasting tournaments such as PredictIt and Good Judgment Open, and some staff do so.
- Both the Open Philanthropy Project and GiveWell recently began to make probabilistic forecasts about our grants. For the Open Philanthropy Project, see e.g. our forecasts about recent grants to Philip Tetlock and CIWF. For GiveWell, see e.g. forecasts about recent grants to Evidence Action and IPA. We also make and track some additional grant-related forecasts privately. The idea is to be able to measure our accuracy later, as those predictions come true or are falsified, and perhaps to improve our accuracy in light of that track record (a toy sketch of this kind of prediction log appears after this list). So far, we are simply encouraging predictions without putting much effort into ensuring their later measurability.
- We’re going to experiment with some forecasting sessions led by an experienced “forecast facilitator” – someone who helps elicit forecasts from people about the work they’re doing, in a way that tries to be as informative and helpful as possible. This might improve the forecasts mentioned in the previous bullet point.
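To give a sense of what the grant-forecast tracking mentioned above might eventually look like once we invest in measurability, here is a minimal sketch of a prediction log with resolution dates and later scoring. The entries and field names are hypothetical, not our actual system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class GrantForecast:
    """One probabilistic prediction attached to a grant."""
    grant: str
    claim: str
    probability: float              # stated probability that the claim will hold
    resolve_by: date                # when the claim should be checkable
    outcome: Optional[bool] = None  # filled in once the claim resolves

# Hypothetical entries, not real Open Philanthropy forecasts.
log = [
    GrantForecast("Grant A", "Project publishes results within 2 years", 0.7, date(2018, 1, 1), True),
    GrantForecast("Grant B", "Campaign reaches its stated goal", 0.2, date(2017, 6, 1), False),
    GrantForecast("Grant C", "Pilot expands to a second site", 0.5, date(2018, 6, 1)),  # unresolved
]

# Score the resolved predictions with the Brier score (lower is better).
resolved = [f for f in log if f.outcome is not None]
brier = sum((f.probability - float(f.outcome)) ** 2 for f in resolved) / len(resolved)
print(f"{len(resolved)} resolved forecasts, Brier score {brier:.3f}")
```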
I’m currently the main person responsible for improving forecasting at the Open Philanthropy Project, and I’d be very interested in further ideas for what we could do.