Yves here. This year’s “Nobel” award for “randomized control trials” or in Brit-speak, “randomised control trials” has generated more pushback, some of the polite throat-clearing sort, others more frontal, than I can ever recall for a past decision. Note that this post includes a brutal critique, then backs off from it by throwing bouquets. Other comments are more straightforward. But the level of detail is helpful for those not familiar with this approach.
By Kevin Bryan, Assistant Professor of Strategic Management, University of Toronto Rotman School of Management. Originally published at VoxEU
The 2019 Nobel Prize in Economic Sciences has been awarded jointly to Abhijit Banerjee, Esther Duflo, and Michael Kremer “for their experimental approach to alleviating global poverty”. This column outlines their impact on development economics research and practical action to reduce poverty. It also considers some of the critiques of randomised controlled trials as an approach to development.
Abhijit Banerjee, Esther Duflo, and Michael Kremer have won the 2019 Nobel Prize. Their victory was inevitable, and for a straightforward reason: an entire branch of economics – development – looks absolutely different from what it looked like 30 years ago.
Development used to be essentially a branch of economic growth. Researchers studied topics like the productivity of large versus small farms, the nature of ‘marketing’ (or the nature of markets and how economically connected different regions in a country are) or the necessity of exports versus industrialisation. Studies were almost wholly observational, deep data collections with throwaway references to old-school growth theory. Policy was largely driven by the subjective impression of donors or programme managers about projects that ‘worked’. To be a bit too honest – it was a dull field, and hence a backwater. And worse than dull, it was a field where scientific progress was seriously lacking.
Development Economics Transformed
Banerjee (2005) has a lovely description of the state of affairs when he entered the field of development economics. Lots of probably good ideas were funded, informed deeply by history, but with very little convincing evidence that highly funded projects were achieving their stated aims. The World Bank Sourcebook of recommended projects endorsed everything from scholarships for girls to vouchers for poor children to citizens’ report cards.
Did these actually work? Banerjee quotes a programme providing computer terminals in rural areas of Madhya Pradesh, which explains that due to a lack of electricity and poor connectivity, “only a few of the kiosks have proved to be commercially viable”. Without irony, “following the success of the initiative”, similar programmes would be funded.
Clearly this state of affairs was unsatisfactory. Surely we should be able to evaluate the projects we’ve funded already? And better, surely we should structure those evaluations to inform future projects? Banerjee again: “the most useful thing a development economist can do in this environment is stand up for hard evidence”.
And where do we get hard evidence? If by this we mean internal validity – that is, whether the effect we claim to have seen is actually caused by a particular policy in a particular setting – applied econometricians of the ‘credibility revolution’ in labour in the 1980s and 1990s provided an answer. Either take advantage of natural variation with useful statistical properties, like the famed regression discontinuity, or else randomise treatment as in a medical study. The idea here is that the assumptions needed to interpret a ‘treatment effect’ are often less demanding than those needed to interpret the estimated parameter of an economic model, hence more likely to be ‘real’. The problem in development is that most of what we care about cannot be randomised. How are we, for instance, to randomise whether a country adopts import substitution industrialisation or not, or randomise farm size under land reform – and at a scale large enough for statistical inference?
What Banerjee, Duflo and Kremer noticed is that much of what development agencies do in practice has nothing to do with those large-scale interventions. The day-to-day work of development is making sure teachers show up to work, vaccines are distributed and taken up by children, corruption does not deter the creation of new businesses, and so on.
By breaking down the work of development on the macro scale to evaluations of development at micro scale, we can at least say something credible about what works in these bite-size pieces. No longer should the World Bank Sourcebook give a list of recommended programmes, based on handwaving. Rather, if we are to spend 100 million dollars sending computers to schools in a developing country, we should at least be able to say “when we spent five million on a pilot, we designed the pilot so as to learn that computers in that particular setting led to a 12% decrease in dropout rate, and hence a 34-62% return on investment according to standard estimates of the link between human capital and productivity”.
The Experimental Approach
How to run those experiments? How should we set them up? Who can we get to pay for them? How do we deal with ‘piloting bias’, where the initial NGO we pilot with is more capable than the government we expect to act on evidence learned in the first study? How do we deal with spillovers from randomised experiments, econometrically?
Banerjee, Duflo, and Kremer not only ran some of the famous early experiments, they also established the premier academic institution for running these experiments – J-PAL at MIT – and further wrote some of the best-known practical guides to experiments in development (e.g. Duflo et al. 2007). It is not a stretch to say that the Nobel was given not just for the laureates’ direct work, but also for the collective contributions of the field they built.
Nonetheless, many of the experiments written directly by the three winners are now canonical. Let’s start with Michael Kremer’s paper on deworming with Ted Miguel (Miguel and Kremer 2004). Everyone agreed that treating kids infected with things like hookworm has large health benefits for the children themselves. But since worms are spread by outdoor bathroom use and other poor hygiene practices, one infected kid can also harm nearby kids by spreading the disease.
Kremer and Miguel suspected that one reason school attendance is so poor in some developing countries is the disease burden, and hence that reducing infections in one child benefits the entire community, and neighbouring ones as well, by reducing overall infection. By randomising mass school-based deworming, and measuring school attendance both at the focal and at neighbouring schools, they found that villages as far as 4km away saw higher school attendance (4km rather than the 6km reported in the original paper, following the correction of an error in the analysis – Clemens and Sandefur 2015).
Note the good economics here: a change from individual to school-based deworming helps identify spillovers across schools, and some care goes into handling the spatial econometric issue whereby density of nearby schools equals density of nearby population equals differential baseline infection rates at these schools. An extra year of school attendance could therefore be ‘bought’ by a donor for $3.50, much cheaper than other interventions such as textbook programmes or additional teachers. Organisations like GiveWell (2018) still rate deworming among the most cost-effective educational interventions in the world. In terms of short-run impact, surely this is one of the single most important pieces of applied economics of the 21st century.
The laureates have also used experimental design to learn that some previously highly regarded programmes are not as important to development as you might suspect. Banerjee et al. (2015) studied microfinance rollout in Hyderabad, randomising the neighbourhoods that received access to a major first-generation microlender. These programmes are generally woman-focused, joint-responsibility, high-interest loans along the lines of the Nobel Peace Prize winning Grameen Bank.
Around 2,800 households across the city were initially surveyed about their family characteristics, lending behaviour, consumption and entrepreneurship; then follow-ups were performed a year after the microfinance rollout; and then three years later. While women in treated areas were 8.8 percentage points more likely to take a microloan, and existing entrepreneurs do in fact increase spending on their business, there is no long-run impact on education, health or the likelihood women make important family decisions, nor does it make businesses more profitable. That is, credit constraints, at least in poor neighbourhoods in Hyderabad, do not appear the main barrier to development.
This is perhaps not very surprising, since higher-productivity firms in India in the 2000s already had access to reasonably well-developed credit markets, and surely these firms are the main driver of national income (follow-up work – Banerjee et al. 2019 – does see some benefits for very high talent, very poor entrepreneurs, but the long-run key result remains).
Let’s realise how wild this paper is: a literal Nobel Peace Prize was awarded for a form of lending that had not really been rigorously analysed. This form of lending effectively did not exist in rich countries at the time they developed, so it is not a necessary condition for growth. And yet enormous amounts of money went into a somewhat-odd financial structure because donors were nonetheless convinced, on the basis of very flimsy evidence, that microlending was critical.
Critiques of Randomised Controlled Trials
By replacing conjecture with evidence and showing that randomised controlled trials (RCTs) can actually be run in many important development settings, the laureates’ reformation of economic development has been unquestionably positive. Or has it? Before returning to the (truly!) positive aspects of Banerjee, Duflo, and Kremer’s research programme, we must grapple with the critiques of this programme and its influence. Because though Banerjee, Duflo, and Kremer are unquestionably the leaders of the field of development, and the most influential scholars for young economists working in that field, the pre-eminence of the RCT method has led to some virulent debates within economics.
Donors love RCTs, as they help select the right projects. Journalists love RCTs, as they are simple to explain (Wired 2013, in a typical example of this hyperbole: “But in the realm of human behaviour, just as in the realm of medicine, there is no better way to gain insight than to compare the effect of an intervention with the effect of doing nothing at all. That is: You need a randomized controlled trial.”) But though RCTs are useful, as we have seen, they are in no way a ‘gold standard’ compared with other forms of understanding economic development. The critiques are three-fold.
First, while the method of random trials is great for impact or programme evaluation, it is not great for understanding how similar but not exact replications will perform in different settings. That is, random trials have no specific claim to external validity, and indeed are worse than other methods on this count.
Second, development is much more than programme evaluation, and the reason real countries grow rich has essentially nothing to do with the kinds of policies studied in the papers we discussed above: the ‘economist as plumber’ famously popularised by Duflo (2017), who rigorously diagnoses small problems and proposes solutions, is an important job, but not as important as the engineer who invents and installs the plumbing in the first place.
Third, even if we only care about internal validity, and only care about the internal validity of some effect that can in principle be studied experimentally, the optimal experimental design is generally not an RCT. Let us tackle these issues in turn.
The external validity problem is often seen to be one related to scale: well-run partner NGOs are just better at implementing any given policy than, say, a government, so the benefit of scaled-up interventions may be much lower than that identified by an experiment.
We call this ‘piloting bias’, but it isn’t really the core problem. The core problem is that the mapping from one environment or one time to the next depends on many factors, and by definition the experiment cannot replicate those factors. A labour market intervention in a high-unemployment country cannot inform in an internally valid way about a low-unemployment country, or a country with different outside options for urban labourers, or a country with an alternative social safety net or cultural traditions about income sharing within families.
Worse, the mapping from a partial equilibrium to a general equilibrium world is not at all obvious, and experiments do not inform as to the mapping. Giving cash transfers to some villagers may make them better off, but giving cash transfers to all villagers may cause land prices to rise, or cause more rent extraction by corrupt governments, or cause all sorts of other changes in relative prices.
You can see this issue in the scientific summary of this year’s Nobel (Royal Swedish Academy of Sciences 2019). Literally, the introductory justification for RCTs is that, “[t]o give just a few examples, theory cannot tell us whether temporarily employing additional contract teachers with a possibility of re-employment is a more cost-effective way to raise the quality of education than reducing class sizes. Neither can it tell us whether microfinance programs effectively boost entrepreneurship among the poor. Nor does it reveal the extent to which subsidized health-care products will raise poor people’s investment in their own health.”
Theory cannot tell us the answers to these questions, but an internally valid RCT can? Surely the wage of the contract teacher vis-à-vis more regular teachers and hence smaller class sizes matters? Surely it matters how well trained these contract teachers are? Surely it matters what the incentives for investment in human capital by students in the given location are?
To put this another way: run literally whatever experiment you want to run on this question in, say, rural Zambia in grade 4 in 2019. Then predict the cost-benefit ratio of having additional contract teachers versus more regular teachers in Bihar in high school in 2039. Who would think there is a link? Actually, let’s be more precise: who would think there is a link between what you learned in Zambia and what will happen in Bihar which is not primarily theoretical?
Having done no RCT, I can tell you that if the contract teachers are much cheaper per unit of human capital, we should use more of them. I can tell you that if the students speak two different languages, there is a greater benefit in having a teacher assistant to translate. I can tell you that if the government or other principal has the ability to undo outside incentives with a side contract, and hence is not committed to the mechanism, dynamic mechanisms will not perform as well as you expect. These types of statements are theoretical: good old-fashioned substitution effects due to relative prices, or a priori production function issues, or basic mechanism design.
Now, the problem of external validity is one that binds on any type of study. Randomised trials, observational studies, theory and structural models all must deal with the mapping of setting A to setting B. The difference with RCTs is that while randomisation is a strong statistical tool for understanding a treatment effect in setting A, it has no particular advantage in understanding ‘deep parameters’ or mechanisms that map from A to B.
Duhem-Quine effects mean that models with more structure are generally less likely to be internally valid – if the auxiliary assumptions are terribly misleading, we may have learned very little. However, they are more likely to be externally valid, since the implicit logic mapping A to B, and the relevant empirical data needed to make the mapping, has been laid out and gathered.
Simply performing many experiments in many settings does not solve this problem: how do you know that the settings you chose have themselves been randomised, or that you are stratifying on the heterogeneity that matters for external validity? For example, to answer the industrial organisation question, “Would firms, in general, improve profits by lowering or raising their prices?”, we would not find it worthwhile to randomise individual price changes and measure profit the week after! And if our partner firms in the RCT happened to be ones that price on the inelastic part of the demand curve, we certainly would not want to then write a paper suggesting that firms in general will improve profits by raising prices!
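A stylised calculation makes the selection problem concrete. The sketch below assumes constant-elasticity demand, q = scale * price^(-elasticity), and a constant unit cost. The function names and every number in it are hypothetical, chosen only for illustration, and are not drawn from any study cited here.

```python
def profit(price, unit_cost, elasticity, scale=100.0):
    """Profit under assumed constant-elasticity demand q = scale * price**(-elasticity)."""
    quantity = scale * price ** (-elasticity)
    return (price - unit_cost) * quantity


def profit_change_from_price_rise(price, unit_cost, elasticity, rise=0.10):
    """Change in profit when the price is raised by the fraction `rise`."""
    new_price = price * (1 + rise)
    return profit(new_price, unit_cost, elasticity) - profit(price, unit_cost, elasticity)


# Hypothetical partner firm pricing on the inelastic part of demand (elasticity < 1).
inelastic_firm = profit_change_from_price_rise(price=2.0, unit_cost=1.0, elasticity=0.5)

# Hypothetical firm facing elastic demand and already priced above its optimum.
elastic_firm = profit_change_from_price_rise(price=3.0, unit_cost=1.0, elasticity=2.0)

print(inelastic_firm > 0)  # the price rise helps the first firm
print(elastic_firm < 0)    # the same rise hurts the second firm
```

An experiment run only with firms like the first would report that price rises increase profit, a finding that reverses for firms like the second.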
Even if external validity is not a concern, we may worry about distortions in what questions researchers focus on. Some of the important questions in development cannot be answered with RCTs. Everyone working in development has heard this critique. But just because a critique is oft repeated does not mean it is wrong. As Lant Pritchett argues (Manik 2018), national development is a social process involving markets, institutions, politics and organisations. RCTs have focused on, in his reckoning, “topics that account for roughly zero of the observed variation in human development outcomes”.
This isn’t to say that RCTs do not study useful questions! Improving the function of developing world schools, figuring out why malaria nets are not used, investigating how to reintegrate civil war fighters: these are not minor issues, and it’s good that folks like this year’s Nobelists and their followers provide solid evidence on these topics. The question is one of balance. Are we, as economists are famously wont to do, simply looking for keys underneath the spotlight when we focus our attention on questions that are amenable to a randomised study? Has the focus on internal validity diverted effort from topics that are much more fundamental to the wealth of nations?
But fine. Let us consider that our question of interest can be studied in a randomised fashion. And let us assume that we do not expect piloting bias or other external validity concerns to be first-order. We still have an issue: even on internal validity, RCTs are not perfect. They are certainly not a ‘gold standard’, and the econometricians who push back against this framing have good reason to do so.
Two primary issues arise. First, to predict what will happen if I impose a policy, I am concerned that what I have learned in the past is biased (for example, the people observed to use schooling subsidies are more diligent than those who would go to school if we made these subsidies universal).
But I am also concerned about statistical inference: with small sample sizes, even an unbiased estimate will not predict very well. Banerjee himself, alongside a group of theorists, has studied the optimal experimental design for a researcher hoping to persuade an audience with diverse priors about what works. When sample size is low, the optimal study is deterministic, not randomised (Banerjee et al. 2017b).
Econometricians like Max Kasy (2016) have shown that since randomisation always generates less covariate balance than deterministic assignment of treatments, you do not want to precisely randomise treatment even in a classic RCT setting. These two papers do not speak to observational versus randomised versus structural studies, but they nonetheless represent the broader idea: we care about expected loss when we generalise, and this loss depends on more than simply having an unbiased initial study.
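The covariate balance point can be illustrated with a small sketch. This is not Kasy's procedure, only a simple stand-in: it draws a set of candidate random assignments, then compares the average imbalance a random draw would deliver with the imbalance of the best-balanced candidate chosen deterministically. The single uniform covariate, the sample of 20 units and the 500 candidates are all assumptions made for illustration.

```python
import random
import statistics


def imbalance(covariates, treated):
    """Absolute difference in the mean covariate between treated and control units."""
    treated_values = [x for x, d in zip(covariates, treated) if d]
    control_values = [x for x, d in zip(covariates, treated) if not d]
    return abs(statistics.fmean(treated_values) - statistics.fmean(control_values))


def candidate_assignments(n_units, n_candidates, rng):
    """Random half-treated, half-control assignments (hypothetical helper)."""
    base = [True] * (n_units // 2) + [False] * (n_units - n_units // 2)
    candidates = []
    for _ in range(n_candidates):
        assignment = base[:]
        rng.shuffle(assignment)
        candidates.append(assignment)
    return candidates


rng = random.Random(42)
covariates = [rng.uniform(0, 1) for _ in range(20)]  # one assumed baseline covariate
candidates = candidate_assignments(len(covariates), 500, rng)
imbalances = [imbalance(covariates, a) for a in candidates]

expected_random = statistics.fmean(imbalances)  # average imbalance of a random draw
best_deterministic = min(imbalances)            # imbalance of the best-balanced candidate

print(best_deterministic <= expected_random)  # True by construction: a minimum never exceeds a mean
```

The comparison holds by construction, which is the logic of the argument: a random draw delivers the average over possible assignments, and the best assignment cannot be worse than that average.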
To reiterate, randomised trials tend to have very small sample sizes compared with observational studies. When this is combined with high ‘leverage’ of outlier observations when multiple treatment arms are evaluated, particularly for heterogeneous effects, randomised trials often predict poorly out of sample even when unbiased (see Alwyn Young 2018 on this point). Observational studies allow larger sample sizes, and hence often predict better even when they are biased. The theoretical assumptions of a structural model permit parameters to be estimated even more tightly, as we use a priori theory to effectively restrict the nature of economic effects.
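The trade-off between bias and sample size can be sketched with the usual decomposition of mean squared error into squared bias plus variance. The numbers below (a small unbiased trial of 50 units, a larger study of 2,000 units with a bias of 0.3, and a noise standard deviation of 5) are assumptions chosen for illustration, not estimates from any cited paper.

```python
import random
import statistics


def analytic_mse(bias, noise_sd, n):
    """Mean squared error of a sample mean: squared bias plus variance."""
    return bias ** 2 + noise_sd ** 2 / n


def simulated_mse(true_effect, bias, noise_sd, n, reps=500, seed=0):
    """Monte Carlo check: each observation is true_effect + bias + Gaussian noise."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(reps):
        draws = [true_effect + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
        estimate = statistics.fmean(draws)
        squared_errors.append((estimate - true_effect) ** 2)
    return statistics.fmean(squared_errors)


# Assumed small, unbiased randomised trial.
trial_mse = simulated_mse(true_effect=1.0, bias=0.0, noise_sd=5.0, n=50)

# Assumed larger observational study with a modest bias.
observational_mse = simulated_mse(true_effect=1.0, bias=0.3, noise_sd=5.0, n=2000)

print(analytic_mse(0.0, 5.0, 50))    # 0.5
print(trial_mse > observational_mse)  # True under these assumed numbers
```

Under these assumptions the biased study predicts better because its variance is so much smaller. With a larger bias or a larger trial, the ranking reverses.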
We have thus far assumed the randomised trial is unbiased, but that is often suspect as well. Even if I randomly assign treatment, I have not necessarily randomly assigned spillovers in a balanced way, nor have I restricted untreated agents from rebalancing their effort or resources.
A PhD student at the University of Toronto, Carlos Inoue (2019), examined the effect of random allocation of a new coronary intervention in Brazilian hospitals. Following the arrival of this technology, good doctors moved to hospitals with the ‘randomised’ technology. The estimated effect is therefore nothing like what would have been found had all hospitals adopted the intervention.
This issue can be stated simply: randomising treatment does not in practice hold all relevant covariates constant, and if your response is just ‘control for the covariates you worry about’, then we are back to the old setting of observational studies where we need a priori arguments about what these covariates are if we are to talk about the effects of a policy.
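The hospital example can be made concrete with a stylised calculation. The sketch assumes a patient outcome equal to doctor quality plus a fixed technology effect, and assumes that good doctors move towards treated hospitals. The quality scores and the effect size are hypothetical and are not taken from Inoue's study.

```python
import statistics


def treated_minus_control(doctor_quality, treated, true_effect):
    """Naive difference in mean outcomes, where outcome = doctor quality (+ effect if treated)."""
    treated_outcomes = [q + true_effect for q, d in zip(doctor_quality, treated) if d]
    control_outcomes = [q for q, d in zip(doctor_quality, treated) if not d]
    return statistics.fmean(treated_outcomes) - statistics.fmean(control_outcomes)


TRUE_EFFECT = 2.0                      # assumed effect of the technology itself
treated = [True, True, False, False]   # four hypothetical hospitals, two randomised in

# Before any reallocation, doctor quality is identical across hospitals.
quality_before = [10.0, 10.0, 10.0, 10.0]
# Assumed response: good doctors move to the hospitals that received the technology.
quality_after = [11.0, 11.0, 9.0, 9.0]

estimate_no_sorting = treated_minus_control(quality_before, treated, TRUE_EFFECT)
estimate_with_sorting = treated_minus_control(quality_after, treated, TRUE_EFFECT)

print(estimate_no_sorting)    # 2.0, the true effect
print(estimate_with_sorting)  # 4.0, the true effect plus the quality gap created by sorting
```

Average doctor quality is unchanged by the moves, so universal adoption would still yield an effect of 2.0. Only the treated-versus-control comparison is inflated.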
Theory and the Value of Experiments
The irony is that Banerjee, Duflo, and Kremer are often quite careful in how they motivate their work with traditional microeconomic theory. They rarely make grandiose claims of external validity when nothing of the sort can be shown by their experiment, as Oriana Bandiera (2019) has discussed.
Kremer is an ace theorist in his own right, Banerjee often relies on complex decision and game theory (Banerjee et al. 2016), particularly in his early work. And no one can read the care with which Duflo handles issues of theory and external validity and think she is merely punting (Banerjee and Duflo 2005, Duflo 2006). Most of the complaints about their ‘randomista’ followers do not fully apply to the work of the laureates themselves.
And none of the critiques above should be taken to mean that experiments cannot be incredibly useful to development. Indeed, the proof of the pudding is in the tasting: some of the small-scale interventions by Banerjee, Duflo, and Kremer have been successfully scaled up! (Banerjee et al. 2017a)
To make an analogy with a firm, consider a plant manager interested in improving productivity. She could read books on operations research and try to implement ideas, but it surely is also useful to play around with experiments within her plant. Perhaps she will learn that it’s not incentives but rather lack of information that is the biggest reason workers are, say, applying car door hinges incorrectly. She may then redo training, and find fewer errors in cars produced at the plant over the next year. This evidence – not only the treatment effect, but also the rationale – can then be brought to other plants at the same company.
All totally reasonable. Indeed, would we not find it insane for a manager not to try things out, and make minor changes on the margin, before implementing a huge change to incentives or training? And of course the same goes, or should go, when the World Bank or DFID or USAID spend tonnes of money trying to solve some development issue.
On that point, what would even a sceptic agree a development experiment can do?
First, it is generally better than other methods at identifying internally valid treatment effects, though still subject to the caveats above.
Second, it can fine-tune interventions along margins where theory gives little guidance. For instance, do people not take AIDS drugs because they don’t believe they work, because they don’t have the money, or because they want to continue having sex and no one will sleep with them if they are seen picking up antiretrovirals?
My colleague Laura Derksen suspected that people are often unaware that antiretrovirals prevent transmission, hence in locations with high rates of HIV, it may be safer to sleep with someone taking antiretrovirals than with a member of the population at large (Derksen and van Oosterhout 2019). She shows that informational interventions informing villagers about this property of antiretrovirals meaningfully increase take-up of medication. We learn from her study that it may be important in the case of AIDS prevention to correct this particular set of beliefs. Theory, of course, tells us little about how widespread these incorrect beliefs are, hence about the magnitude of this informational shift on drug take-up.
Third, experiments allow us to study policies that no one has yet implemented. Ignoring the problem of statistical identification in observational studies, there may be many policies we wish to implement that are wholly different in kind from those seen in the past. The negative income tax experiments of the 1970s are a classic example (Hausman and Wise 1976).
Experiments give researchers more control. This additional control is of course balanced against the fact that we should expect super meaningful interventions to have already occurred, and we may have to perform experiments at relatively low scale due to cost.
We should not be too small-minded here. There are now experimental development papers on topics thought to be outside the bounds of experiment. Kevin Donovan at Yale has randomised the placement of roads and bridges connecting remote villages to urban centres (Brooks and Donovan 2018). What could be ‘less amenable’ to randomisation than the literal construction of a road and bridge network?
So Where Do We Stand?
It is unquestionable that a lot of development work in practice was based on the flimsiest of evidence. It is unquestionable that the armies Banerjee, Duflo, and Kremer have sent into the world via J-PAL and similar institutions have brought much more rigour to understanding programme evaluation. Some of these interventions are now literally improving the lives of millions of people with clear, well-identified, non-obvious policy. That is an incredible achievement!
And there is something likeable about the desire of the ivory tower to get into the weeds of day-to-day policy. Michael Kremer on this point: “The modern movement for RCTs in development economics… is about innovation, as well as evaluation. It’s a dynamic process of learning about a context through painstaking on-the-ground work, trying out different approaches, collecting good data with good causal identification, finding out that results do not fit pre-conceived theoretical ideas, working on a better theoretical understanding that fits the facts on the ground, and developing new ideas and approaches based on theory and then testing the new approaches.” (Evans 2017). No objection here.
That said, we cannot ignore that there are serious people who seriously object to the J-PAL style of development. Angus Deaton, who won the Nobel Prize only four years ago, writes the following (Bryan 2015), in line with our discussion above: “Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as ‘hard’ while other methods are ‘soft’… [T]he analysis of projects needs to be refocused towards the investigation of potentially generalizable mechanisms that explain why and in what contexts projects can be expected to work.”
Lant Pritchett (2014) argues that despite success in persuading donors and policymakers, the evidence that RCTs lead to better policies at the governmental level, and hence better outcomes for people, is far from conclusive. The barrier to the adoption of better policy is bad incentives, not a lack of knowledge on how given policies will perform (Gueron and Rolston 2013). These critiques are quite valid, and the randomisation movement in development often greatly overstates what it has learned, and what it could in principle learn.
But let’s give the last word to Chris Blattman (2014) on the sceptic’s case for randomised trials in development: “if a little populist evangelism will get more evidence-based thinking in the world, and tip us marginally further from Great Leaps Forward, I have one thing to say: Hallelujah.” Indeed. No one, randomista or not, longs to go back to the day of unjustified advice on development, particularly ‘Great Leap Forward’ type programmes without any real theoretical or empirical backing!
A Few Remaining Bagatelles
1) It is surprising how early this award was given. Though incredibly influential, the earliest published papers by any of the laureates mentioned in the Nobel scientific summary are from 2003 and 2004 (Miguel-Kremer on deworming, Duflo-Saez on retirement plans, Chattopadhyay and Duflo on female policymakers in India, Banerjee and Duflo on health in Rajasthan). This seems shockingly recent for a Nobel – are there any other Nobel winners in economics who won entirely for work published so close to the prize announcement?
2) In the field of innovation, Kremer is most famous for his paper on patent buyouts (Kremer 1998). How do we both incentivise new drug production and get these drugs sold at marginal cost once invented? We think the drug-makers have better knowledge about how to produce and test a new drug than some bureaucrat, so we can’t finance drugs directly. If we give a patent, then high-value drugs return more to the inventor, but at massive deadweight loss. What we want to do is offer inventors some large fraction of the social return to their invention ex-post, in exchange for making production perfectly competitive. Kremer proposes patent auctions where the government pays a multiple of the winning bid with some probability, giving the drug to the public domain. The auction reveals the market value, and the multiple allows the government to account for consumer surplus and deadweight loss as well. There are many practical issues, of course. But patent buyouts are nonetheless an elegant, information-based attempt to solve the problem of innovation production, and they have been quite influential on those grounds.
3) Somewhat ironically, Kremer also has a great 1990s growth paper with RCT-sceptics Easterly, Pritchett and Summers (Easterly et al. 1993). The point is simple: growth rates by country vacillate wildly decade-to-decade. Knowing the 2000s, you likely would not have predicted countries like Ethiopia and Myanmar as growth miracles of the 2010s. Yet things like education, political systems, and so on are quite constant within-country across any two-decade period. This necessarily means that shocks of some sort, whether from international demand, the political system, nonlinear cumulative effects, and so on, must be first-order for growth.
4) There is some irony that two of Duflo’s most famous papers are not experiments at all. Her most cited paper by far is a piece of econometric theory on standard errors in difference-in-difference models, written with Marianne Bertrand (Bertrand et al. 2004). Her next most cited paper (Duflo 2001) is a lovely study of the quasi-random school expansion policy in Indonesia, used to estimate the return on school construction and on education more generally. Nary a randomised experiment in sight in either paper.
5) Kremer’s 1990s research, before his shift to development, has been incredibly influential in its own right. The O-ring theory (Kremer 1993a) is an elegant model of complementary inputs and labour market sorting, where slightly better ‘secretaries’ earn much higher wages. The “One Million B.C.” paper (Kremer 1993b) notes that growth must have been low for most of human history, and that it was limited because low human density limited the spread of non-rivalrous ideas. It is the classic Malthus plus endogenous growth paper.
6) Ok, one more for Kremer, since “Elephants” is the greatest paper title in economics (Kremer and Morcom 2000). Theoretically, future scarcity increases prices. When people think elephants will go extinct, the price of ivory therefore rises, making extinction more likely as poaching incentives go up. What to do? Hold a government stockpile of ivory and commit to selling it if the stock of living elephants falls below a certain point. Elegant. And one might wonder: how can we study this particular general equilibrium effect experimentally?
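Returning to the patent buyout in item 2, the payment rule can be sketched in a few lines. The sketch assumes that when the government does not buy the patent out, the patent is sold to the winning bidder at the auction price. The function name, the markup of 2 and the 90% buyout probability are hypothetical illustrations, not parameters taken from Kremer's paper.

```python
def expected_inventor_payment(auction_price, markup, prob_buyout):
    """Expected payment to the inventor under a stylised buyout rule.

    With probability prob_buyout the government pays markup * auction_price and
    places the patent in the public domain. Otherwise (an assumption of this
    sketch) the patent goes to the winning bidder at auction_price.
    """
    if not 0.0 <= prob_buyout <= 1.0:
        raise ValueError("prob_buyout must lie in [0, 1]")
    government_payment = markup * auction_price
    return prob_buyout * government_payment + (1.0 - prob_buyout) * auction_price


# Hypothetical numbers: auction price of 100, markup of 2, buyout 90% of the time.
payment = expected_inventor_payment(auction_price=100.0, markup=2.0, prob_buyout=0.9)
print(round(payment, 2))  # 190.0
```

The markup is what lets the payment track the social value of the invention and not just the private value the auction reveals.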
See original post for references