What Randomization Can and Cannot Do: The 2019 Nobel Prize

It is Nobel Prize season once again, a grand opportunity to dive into some of our field’s most influential papers and to consider their legacy. This year’s prize was inevitable, an award to Abhijit Banerjee, Esther Duflo, and Michael Kremer for popularizing the hugely influential experimental approach to development. It is only fitting that my writeup this year was held up by the anti-government road blockades here in Ecuador, which delayed my return to the internet-enabled world – developing countries face many barriers to reaching prosperity, and rarely have I been so personally aware of the effects of place on productivity as I was this week!

The reason for the prize is straightforward: an entire branch of economics, development, looks absolutely different from what it looked like thirty years ago. Development economics used to be essentially a branch of the study of economic growth. Researchers studied topics like the productivity of large versus small farms, the nature of “marketing” (that is, the nature of markets and how economically connected different regions in a country are), or the necessity of exports versus industrialization. Studies were almost wholly observational, deep data collections with throwaway references to old-school growth theory. Policy was largely driven by the subjective impressions of donors or program managers about which projects “worked”. To be a bit too honest – it was a dull field, and hence a backwater. And worse than dull, it was a field where scientific progress was seriously lacking.

Banerjee has a lovely description of the state of affairs back in the 1990s. Lots of probably-good ideas were funded, informed deeply by history, but with very little convincing evidence that highly-funded projects were achieving their stated aims. The World Bank Sourcebook recommended everything from scholarships for girls to vouchers for poor children to citizens’ report cards. Did these actually work? Banerjee quotes a program providing computer terminals in rural areas of Madhya Pradesh which explains that, due to a lack of electricity and poor connectivity, “only a few of the kiosks have proved to be commercially viable”, then notes, without irony, that “following the success of the initiative,” similar programs would be funded. Clearly this state of affairs is unsatisfactory. Surely we should be able to evaluate the projects we’ve funded already? And better, surely we should structure those evaluations to inform future projects? Banerjee again: “the most useful thing a development economist can do in this environment is stand up for hard evidence.”

And where do we get hard evidence? If by this we mean internal validity – that is, whether the effect we claim to have seen is actually caused by a particular policy in a particular setting – applied econometricians of the “credibility revolution” in labor economics in the 1980s and 1990s provided an answer. Either take advantage of natural variation with useful statistical properties, like the famed regression discontinuity, or else randomize treatment like a medical study. The idea here is that the assumptions needed to interpret a “treatment effect” are often less demanding than those needed to interpret the estimated parameter of an economic model, hence more likely to be “real”. The problem is that most of what we care about in development cannot be randomized. How are we, for instance, to randomize whether a country adopts import substitution industrialization or not, or randomize farm size under land reform – and at a scale large enough for statistical inference?
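Before turning to that problem, it is worth being concrete about what randomization buys on the internal validity front. Here is a minimal simulation sketch (my own illustration with entirely hypothetical numbers, not anything from the laureates’ papers): when a program is taken up by whoever selects into it, the naive comparison of participants to non-participants confounds the program with the people who choose it; when treatment is a coin flip, a plain difference in means recovers the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_effect = 0.5

# Hypothetical population: "ability" raises outcomes and, absent randomization,
# also drives who takes up the program.
ability = rng.normal(0.0, 1.0, n)

# Randomized assignment: a coin flip, independent of ability.
treated = rng.integers(0, 2, n).astype(bool)
y_rct = ability + true_effect * treated + rng.normal(0.0, 1.0, n)
rct_estimate = y_rct[treated].mean() - y_rct[~treated].mean()

# Self-selection instead: higher-ability people opt in, so the naive
# participant vs. non-participant comparison is biased upward.
opted_in = (ability + rng.normal(0.0, 1.0, n)) > 0
y_obs = ability + true_effect * opted_in + rng.normal(0.0, 1.0, n)
naive_estimate = y_obs[opted_in].mean() - y_obs[~opted_in].mean()

print(f"true effect {true_effect:.2f} | RCT estimate {rct_estimate:.2f} "
      f"| self-selected comparison {naive_estimate:.2f}")
```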

What Banerjee, Duflo, and Kremer noticed is that much of what development agencies do in practice has nothing to do with those large-scale interventions. The day-to-day work of development is making sure teachers show up to work, vaccines are distributed and taken up by children, corruption does not deter the creation of new businesses, and so on. By breaking down the work of development on the macro scale to evaluations of development at micro scale, we can at least say something credible about what works in these bite-size pieces. No longer should the World Bank Sourcebook give a list of recommended programs, based on handwaving. Rather, if we are to spend 100 million dollars sending computers to schools in a developing country, we should at least be able to say “when we spent 5 million on a pilot, we designed the pilot so as to learn that computers in that particular setting led to a 12% decrease in dropout rate, and hence a 34%-62% return on investment according to standard estimates of the link between human capital and productivity.” How to run those experiments? How should we set them up? Who can we get to pay for them? How do we deal with “piloting bias”, where the initial NGO we pilot with is more capable than the government we expect to act on evidence learned in the first study? How do we deal with spillovers from randomized experiments, econometrically? Banerjee, Duflo, and Kremer not only ran some of the famous early experiments, they also established the premier academic institution for running these experiments – J-PAL at MIT – and further wrote some of the best known practical guides to experiments in development.

Many of the experiments run by the three winners are now canonical. Let’s start with Michael Kremer’s paper on deworming, with Ted Miguel, in Econometrica. Everyone agreed that deworming kids infected with parasites like hookworm has large health benefits for the children directly treated. But since worms are spread by outdoor bathroom use and other poor hygiene practices, one infected kid can also harm nearby kids by spreading the disease. Kremer and Miguel suspected that one reason school attendance is so poor in some developing countries is the disease burden, and hence that reducing infections for one kid benefits the entire community, and neighboring ones as well, by reducing overall infection. By randomizing mass school-based deworming, and measuring school attendance both at the focal and at neighboring schools, they found that villages as far as 4km away saw higher school attendance (4km rather than the 6km reported in the original paper, following the correction of an error in the analysis). Note the good economics here: a change from individual to school-based deworming helps identify spillovers across schools, and some care goes into handling the spatial econometric issue whereby the density of nearby schools equals the density of nearby population equals differential baseline infection rates at these schools. An extra year of school attendance could therefore be “bought” by a donor for $3.50, much cheaper than other interventions such as textbook programs or additional teachers. Organizations like GiveWell still rate deworming among the most cost-effective educational interventions in the world: in terms of short-run impact, surely this is one of the single most important pieces of applied economics of the 21st century.
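To fix ideas on how a design like this separates spillovers from density, here is a stylized sketch (simulated data and made-up coefficients; this illustrates the logic rather than reproducing the Miguel-Kremer specification): regress a school’s attendance on its own treatment status and on the number of treated pupils nearby, while controlling for the total number of pupils nearby, so that density itself is not mistaken for a spillover.

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools = 500

# Hypothetical geography: schools scattered over a 20km x 20km area.
xy = rng.uniform(0.0, 20.0, size=(n_schools, 2))
pupils = rng.integers(100, 600, n_schools).astype(float)
treated = rng.integers(0, 2, n_schools).astype(float)

# Pupils (total and treated) at other schools within 4km.
dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
near = ((dist > 0) & (dist <= 4.0)).astype(float)
near_pupils = near @ pupils / 1000.0              # in thousands
near_treated = near @ (pupils * treated) / 1000.0

# Made-up attendance process: own treatment helps, nearby treated pupils help,
# and denser areas have worse baseline attendance (the confound to control for).
attendance = (0.70 + 0.05 * treated + 0.002 * near_treated
              - 0.0015 * near_pupils + rng.normal(0.0, 0.02, n_schools))

# OLS of attendance on own treatment, nearby treated pupils, and nearby pupils.
X = np.column_stack([np.ones(n_schools), treated, near_treated, near_pupils])
coef, *_ = np.linalg.lstsq(X, attendance, rcond=None)
print(f"own-treatment effect {coef[1]:.3f}, "
      f"spillover per 1,000 treated pupils nearby {coef[2]:.4f}")
```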

The laureates have also used experimental designs to show that some previously highly-regarded programs are not as important to development as you might suspect. Banerjee, Duflo, Rachel Glennerster and Cynthia Kinnan studied a microfinance rollout in Hyderabad, randomizing the neighborhoods which received access to a major first-generation microlender. These programs are generally women-focused, joint-responsibility, high-interest loans a la the Nobel Peace Prize-winning Grameen Bank. 2,800 households across the city were initially surveyed about their family characteristics, lending behavior, consumption, and entrepreneurship, with followups performed a year after the microfinance rollout and again three years later. While women in treated areas were 8.8 percentage points more likely to take a microloan, and existing entrepreneurs did in fact increase spending on their businesses, there was no long-run impact on education, health, or the likelihood that women make important family decisions, nor did microcredit make businesses more profitable. That is, credit constraints, at least in poor neighborhoods in Hyderabad, do not appear to be the main barrier to development. This is perhaps not very surprising, since higher-productivity firms in India in the 2000s already had access to reasonably well-developed credit markets, and surely they are the main driver of national income (followup work does find some benefits for very high talent, very poor entrepreneurs, but the key long-run result stands).

Let’s realize how wild this paper is: a literal Nobel Peace Prize was awarded for a form of lending that had not really been rigorously analyzed. This form of lending effectively did not exist in rich countries at the time they developed, so it is not a necessary condition for growth. And yet enormous amounts of money went into a somewhat-odd financial structure because donors were nonetheless convinced, on the basis of very flimsy evidence, that microlending was critical.

By replacing conjecture with evidence, and showing randomized trials can actually be run in many important development settings, the laureates’ reformation of economic development has been unquestionably positive. Or has it? Before returning to the (truly!) positive aspects of Banerjee, Duflo and Kremer’s research program, we must take a short negative turn. Because though Banerjee, Duflo, and Kremer are unquestionably the leaders of the field of development, and the most influential scholars for young economists working in that field, there is much more controversy about RCTs than you might suspect if all you’ve seen are the press accolades of the method. Donors love RCTs, as they help select the right projects. Journalists love RCTs, as they are simple to explain (Wired, in a typical example of this hyperbole: “But in the realm of human behavior, just as in the realm of medicine, there’s no better way to gain insight than to compare the effect of an intervention to the effect of doing nothing at all. That is: You need a randomized controlled trial.”) The “randomista” referees love RCTs – a tribe is a tribe, after all. But RCTs are not necessarily better for those who hope to understand economic development! The critiques are three-fold.

First, while the method of random trials is great for impact or program evaluation, it is not great for understanding how similar but inexact replications will perform in different settings. That is, random trials have no specific claim to external validity, and indeed are worse than other methods on this count. Second, it is argued that development is much more than program evaluation, and that the reason real countries grow rich has essentially nothing to do with the kinds of policies studied in the papers we discussed above: the “economist as plumber” famously popularized by Duflo, who rigorously diagnoses small problems and proposes solutions, is a fine job for a World Bank staffer, but a crazy use of the intelligence of our otherwise-leading scholars in development. Third, even if we only care about internal validity, and only about the internal validity of some effect that can in principle be studied experimentally, the optimal experimental design is generally not an RCT.

The external validity problem is often seen as one of scale: well-run partner NGOs are just better at implementing any given policy than, say, a government, so the benefit of scaled-up interventions may be much lower than that identified by an experiment. We call this “piloting bias”, but it isn’t really the core problem. The core problem is that the mapping from one environment or one time to the next depends on many factors, and by definition the experiment cannot replicate those factors. An internally valid estimate from a labor market intervention in a high-unemployment country tells us little about a low-unemployment country, or a country with different outside options for urban laborers, or a country with an alternative social safety net or different cultural traditions about income sharing within families. Worse, the mapping from a partial equilibrium to a general equilibrium world is not at all obvious, and experiments do not inform us about that mapping. Giving cash transfers to some villagers may make them better off, but giving cash transfers to all villagers may cause land prices to rise, or more rent extraction by corrupt governments, or all sorts of other changes in relative prices.

You can see this issue in the Scientific Summary of this year’s Nobel. Literally, the introductory justification for RCTs is that, “[t]o give just a few examples, theory cannot tell us whether temporarily employing additional contract teachers with a possibility of re-employment is a more cost-effective way to raise the quality of education than reducing class sizes. Neither can it tell us whether microfinance programs effectively boost entrepreneurship among the poor. Nor does it reveal the extent to which subsidized health-care products will raise poor people’s investment in their own health.”

Theory cannot tell us the answers to these questions, but an internally valid randomized control trial can? Surely the wage of contract teachers relative to the cost of hiring more regular teachers, and hence smaller class sizes, matters? Surely it matters how well-trained these contract teachers are? Surely it matters what the incentives for investment in human capital by students in the given location are? To put this another way: run literally whatever experiment you want to run on this question in, say, rural Zambia in grade 4 in 2019. Then predict the cost-benefit ratio of having additional contract teachers versus more regular teachers in Bihar in high school in 2039. Who would think there is a link? Actually, let’s be more precise: who would think there is a link between what you learned in Zambia and what will happen in Bihar which is not primarily theoretical? Having done no RCT, I can tell you that if the contract teachers are much cheaper per unit of human capital produced, we should use more of them. I can tell you that if the students speak two different languages, there is a greater benefit to having a teacher assistant who can translate. I can tell you that if the government or other principal has the ability to undo outside incentives with a side contract, and hence is not committed to the mechanism, dynamic mechanisms will not perform as well as you expect. These types of statements are theoretical: good old-fashioned substitution effects due to relative prices, a priori production function issues, or basic mechanism design.

Things are worse still. It is not simply that an internally valid estimate of a treatment effect often tells us nothing about how that effect generalizes, but that the important questions in development cannot be answered with RCTs. Everyone working in development has heard this critique. But just because a critique is oft-repeated does not mean it is wrong. As Lant Pritchett argues, national development is a social process involving markets, institutions, politics, and organizations. RCTs have focused on, in his reckoning, “topics that account for roughly zero of the observed variation in human development outcomes.” Again, this isn’t to say that RCTs cannot study anything. Improving the function of developing world schools, figuring out why malaria nets are not used, investigating how to reintegrate civil war fighters: these are not minor issues, and it’s good that folks like this year’s Nobelists and their followers provide solid evidence on these topics. The question is one of balance. Are we, as economists are famously wont to do, simply looking for keys underneath the spotlight when we focus our attention on questions which are amenable to a randomized study? Has the focus on internal validity diverted effort from topics that are much more fundamental to the wealth of nations?

But fine. Let us consider that our question of interest can be studied in a randomized fashion. And let us assume that we do not expect piloting bias or other external validity concerns to be first-order. We still have an issue: even on internal validity, randomized control trials are not perfect. They are certainly not a “gold standard”, and the econometricians who push back against this framing have good reason to do so. Two primary issues arise. First, to predict what will happen if I impose a policy, I am concerned that what I have learned in the past is biased (e.g., the people observed to use schooling subsidies are more diligent than those who would go to school only if we made these subsidies universal). But I am also concerned about statistical inference: with small sample sizes, even an unbiased estimate will not predict very well. I recently talked with an organization doing recruitment which quasi-randomly recruited at a small number of colleges. On average, they attracted a handful of applicants at each college. They stopped recruiting at the colleges with two or fewer applicants after the first year. But of course random variation means the difference between two and four applicants is basically nil.
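The recruiting example is easy to simulate (purely hypothetical numbers of my own). Suppose every college is in truth identical, yielding applicants as a Poisson draw with mean three per year. A rule of “drop any college with two or fewer applicants” after one year discards roughly 40 percent of colleges on noise alone, and the survivors do no better the following year.

```python
import numpy as np

rng = np.random.default_rng(2)
n_colleges, n_sims = 20, 5_000

# Hypothetical: every college is identical, yielding Poisson(3) applicants a year.
year1 = rng.poisson(3, size=(n_sims, n_colleges))
year2 = rng.poisson(3, size=(n_sims, n_colleges))

kept = year1 >= 3   # the rule: stop recruiting where there were <= 2 applicants
print(f"share of (identical) colleges dropped: {1 - kept.mean():.0%}")
print(f"next-year applicants at kept colleges:    {year2[kept].mean():.2f}")
print(f"next-year applicants at dropped colleges: {year2[~kept].mean():.2f}")
```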

In this vein, randomized trials tend to have very small sample sizes compared to observational studies. Combined with the high “leverage” of outlier observations when multiple treatment arms are evaluated, particularly for heterogeneous effects, this means randomized trials often predict poorly out of sample even when unbiased (see Alwyn Young in the QJE on this point). Observational studies allow larger sample sizes, and hence often predict better even when they are biased. The theoretical assumptions of a structural model permit parameters to be estimated even more tightly, as we use a priori theory to effectively restrict the nature of economic effects.
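A toy version of this bias-versus-noise tradeoff (hypothetical numbers, not Young’s analysis): an unbiased difference in means from a 200-person trial can easily have a larger mean squared error than a biased estimate from a 20,000-person observational comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
true_effect, sigma = 0.10, 1.0
n_rct, n_obs, n_sims = 200, 20_000, 2_000
selection_bias = 0.05   # hypothetical bias in the observational comparison

rct_err, obs_err = [], []
for _ in range(n_sims):
    # Unbiased but noisy: difference in means in a small randomized trial.
    t = rng.normal(true_effect, sigma, n_rct // 2)
    c = rng.normal(0.0, sigma, n_rct // 2)
    rct_err.append((t.mean() - c.mean()) - true_effect)
    # Biased but precise: a large observational comparison with self-selection.
    t = rng.normal(true_effect + selection_bias, sigma, n_obs // 2)
    c = rng.normal(0.0, sigma, n_obs // 2)
    obs_err.append((t.mean() - c.mean()) - true_effect)

rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
print(f"RMSE: small unbiased RCT {rmse(rct_err):.3f}, "
      f"large biased observational study {rmse(obs_err):.3f}")
```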

We have thus far assumed the randomized trial is unbiased, but that assumption is often suspect as well. Even if I randomly assign treatment, I have not necessarily randomized spillovers in a balanced way, nor have I prevented untreated agents from rebalancing their effort or resources. A PhD student of ours on the market this year, Carlos Inoue, examined the effect of the random allocation of a new coronary intervention in Brazilian hospitals. Following the arrival of this technology, good doctors moved to hospitals with the “randomized” technology. The estimated effect is therefore nothing like what would have been found had all hospitals adopted the intervention. This issue can be stated simply: randomizing treatment does not in practice hold all relevant covariates constant, and if your response is just “control for the covariates you worry about”, then we are back in the old setting of observational studies, where we need a priori arguments about what those covariates are if we are to talk about the effects of a policy.
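The hospital example can be stylized in a few lines (a hypothetical sketch of the mechanism, not Inoue’s data or model): assign the technology by coin flip, let good doctors then sort toward the treated hospitals, and the simple treated-versus-untreated comparison bundles the sorting together with the technology’s true effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n_hospitals = 400
true_tech_effect = 1.0

# Technology assigned by coin flip across hospitals.
treated = rng.integers(0, 2, n_hospitals).astype(bool)
doctor_quality = rng.normal(0.0, 1.0, n_hospitals)

# Post-randomization sorting (hypothetical magnitude): good doctors move toward
# treated hospitals, raising average quality there and lowering it elsewhere.
doctor_quality = doctor_quality + np.where(treated, 0.5, -0.5)

outcome = true_tech_effect * treated + doctor_quality + rng.normal(0.0, 1.0, n_hospitals)
naive = outcome[treated].mean() - outcome[~treated].mean()
print(f"true technology effect {true_tech_effect:.2f}, naive comparison {naive:.2f}")
```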

The irony is that Banerjee, Duflo and Kremer are often quite careful in how they motivate their work with traditional microeconomic theory. They rarely make grandiose claims of external validity when nothing of the sort can be shown by their experiment. Kremer is an ace theorist in his own right, Banerjee often relies on complex decision and game theory particularly in his early work, and no one can read the care with which Duflo handles issues of theory and external validity and think she is merely punting. Most of the complaints about their “randomista” followers do not fully apply to the work of the laureates themselves.

And none of the critiques above should be taken to mean that experiments cannot be incredibly useful to development. Indeed, the proof of the pudding is in the tasting: some of the small-scale interventions by Banerjee, Duflo, and Kremer have been successfully scaled up! To analogize to a firm, consider a plant manager interested in improving productivity. She could read books on operations research and try to implement ideas, but it surely is also useful to play around with experiments within her plant. Perhaps she will learn that it’s not incentives but rather lack of information that is the biggest reason workers are, say, applying car door hinges incorrectly. She may then redo training, and find fewer errors in cars produced at the plant over the next year. This evidence – not only the treatment effect, but also the rationale – can then be brought to other plants at the same company. All totally reasonable. Indeed, would we not find it insane for a manager not to try things out, and make minor changes on the margin, before implementing a huge change to incentives or training? And of course the same goes, or should go, when the World Bank or DFID or USAID spends tons of money trying to solve some development issue.

On that point, what would even a skeptic agree a development experiment can do? First, it is generally better than other methods at identifying internally valid treatment effects, though still subject to the caveats above.

Second, it can fine-tune interventions along margins where theory gives little guidance. For instance, do people not take AIDS drugs because they don’t believe the drugs work, because they don’t have the money, or because they want to continue having sex and no one will sleep with them if they are seen picking up antiretrovirals? My colleague Laura Derksen suspected that people are often unaware that antiretrovirals prevent transmission, and hence that, in locations with high rates of HIV, it may be safer to sleep with someone taking antiretrovirals than with the population at large. She shows that an informational intervention telling villagers about this property of antiretrovirals meaningfully increases takeup of medication. We learn from her study that it may be important in the case of AIDS prevention to correct this particular set of beliefs. Theory, of course, tells us little about how widespread these incorrect beliefs are, and hence about the magnitude of this informational effect on drug takeup.

Third, experiments allow us to study policies that no one has yet implemented. Setting aside the problem of statistical identification in observational studies, there may be many policies we wish to implement which are wholly different in kind from those seen in the past. The negative income tax experiments of the 1970s are a classic example. Experiments give researchers more control. This additional control is of course balanced against the fact that we should expect super meaningful interventions to have already occurred, and that we may have to perform experiments at relatively small scale due to cost. We should not be too small-minded here. There are now experimental development papers on topics once thought to be outside the bounds of experiment. I’ve previously discussed on this site Kevin Donovan’s work randomizing the placement of roads and bridges connecting remote villages to urban centers. What could be “less amenable” to randomization than the literal construction of a road and bridge network?

So where do we stand? It is unquestionable that a lot of development work in practice was based on the flimsiest of evidence. It is unquestionable that the armies of researchers Banerjee, Duflo, and Kremer have sent into the world via J-PAL and similar institutions have brought much more rigor to program evaluation. Some of these interventions are now literally improving the lives of millions of people with clear, well-identified, nonobvious policy. That is an incredible achievement! And there is something likeable about the desire of the ivory tower to get into the weeds of day-to-day policy. Michael Kremer on this point: “The modern movement for RCTs in development economics…is about innovation, as well as evaluation. It’s a dynamic process of learning about a context through painstaking on-the-ground work, trying out different approaches, collecting good data with good causal identification, finding out that results do not fit pre-conceived theoretical ideas, working on a better theoretical understanding that fits the facts on the ground, and developing new ideas and approaches based on theory and then testing the new approaches.” No objection here.

That said, we cannot ignore that there are serious people who seriously object to the J-PAL style of development. Deaton, who won the Nobel Prize only four years ago, writes the following, in line with our discussion above: “Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as “hard” while other methods are “soft”… [T]he analysis of projects needs to be refocused towards the investigation of potentially generalizable mechanisms that explain why and in what contexts projects can be expected to work.” Lant Pritchett argues that, despite success persuading donors and policymakers, the evidence that RCTs lead to better policies at the governmental level, and hence to better outcomes for people, is far from established. The barrier to the adoption of better policy is bad incentives, not a lack of knowledge about how given policies will perform. I think these critiques are quite valid, and the randomization movement in development often way overstates what it has learned, and what it could in principle learn. But let’s give the last word to Chris Blattman on the skeptic’s case for randomized trials in development: “if a little populist evangelism will get more evidence-based thinking in the world, and tip us marginally further from Great Leaps Forward, I have one thing to say: Hallelujah.” Indeed. No one, randomista or not, longs to go back to the days of unjustified advice on development, particularly “Great Leap Forward” type programs without any real theoretical or empirical backing!

A few remaining bagatelles:

1) It is surprising how early this award was given. Though incredibly influential, the earliest published papers by any of the laureates mentioned in the Nobel scientific summary are from 2003 and 2004 (Miguel-Kremer on deworming, Duflo-Saez on retirement plans, Chattopadhyay and Duflo on female policymakers in India, Banerjee and Duflo on health in Rajasthan). This seems shockingly recent for a Nobel – I wonder whether any other Nobel winners in economics have won entirely for work published so close to the prize announcement.

2) In my field, innovation, Kremer is most famous for his paper on patent buyouts (we discussed that paper on this site way back in 2010). How do we incentivize new drug production while also getting these drugs sold at marginal cost once invented? We think the drugmakers have better knowledge about how to produce and test a new drug than some bureaucrat, so we can’t finance drugs directly. If we give a patent, then high-value drugs return more to the inventor, but at the cost of massive deadweight loss. What we want to do is offer inventors some large fraction of the social return to their invention ex-post, in exchange for making production perfectly competitive. Kremer proposes patent auctions where the government pays a multiple of the winning bid with some probability, giving the drug to the public domain (a simplified sketch of the mechanism appears below, after these bagatelles). The auction reveals the market value, and the multiple allows the government to account for consumer surplus and deadweight loss as well. There are many practical issues, but I have always found this an elegant, information-based attempt to solve the problem of innovation production, and it has been quite influential on those grounds.

3) Somewhat ironically, Kremer also has a great 1990s growth paper with RCT-skeptics Pritchett, Easterly and Summers. The point is simple: growth rates by country vacillate wildly decade to decade. Knowing the 2000s, you likely would not have predicted countries like Ethiopia and Myanmar as growth miracles of the 2010s. Yet things like education, political systems, and so on are quite constant within-country across any two decade period. This necessarily means that shocks of some sort, whether from international demand, the political system, nonlinear cumulative effects, and so on, must be first-order for growth. A great, straightforward argument, well-explained.

4) There is some irony that two of Duflo’s most famous papers are not experiments at all. Her most cited paper by far is a piece of econometrics on standard errors in difference-in-difference models, written with Marianne Bertrand and Sendhil Mullainathan. Her next most cited paper is a lovely study of the quasi-random school expansion policy in Indonesia, used to estimate the return to school construction and to education more generally. Nary a randomized experiment in sight in either paper.

5) I could go on all day about Michael Kremer’s 1990s essays. In addition to Patent Buyouts, two more of them appear on my class syllabi. The O-Ring theory is an elegant model of complementary inputs and labor market sorting, in which slightly better “secretaries” earn much higher wages (a toy numerical illustration appears below). The “One Million B.C.” paper notes that growth must have been low for most of human history, and argues that it was held back because low population density limited the spread of nonrivalrous ideas. It is the classic Malthus-plus-endogenous-growth paper, and always a hit among students.

6) Ok, one more for Kremer, since “Elephants” is my favorite title in economics. Theoretically, expected future scarcity raises prices today. When people think elephants will go extinct, the price of ivory therefore rises, which makes extinction more likely as poaching incentives go up. What to do? Hold a government stockpile of ivory and commit to selling it if the stock of living elephants falls below a certain point. Elegant. And I can’t help but think: how would one study this particular general equilibrium effect experimentally? I both believe the result and suspect that randomized trials are not a good way to understand it!
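To make bagatelle (2) concrete, here is a deliberately simplified sketch of a Kremer-style patent buyout (my stylization: the actual proposal has further safeguards, such as basing the price on bids other than the winner’s, while keeping the occasional sale to the high bidder so that bids stay honest).

```python
import random

def patent_buyout(private_bids, markup=2.0, buyout_prob=0.9, seed=None):
    """Simplified sketch of a Kremer-style patent buyout auction.

    Firms bid for the patent. With probability `buyout_prob` the government
    buys it at `markup` times the top private bid -- the markup standing in
    for consumer surplus and deadweight loss beyond private value -- and puts
    the invention in the public domain. Otherwise the patent goes to the high
    bidder, which keeps bids tied to genuine private valuations.
    """
    rng = random.Random(seed)
    top_bid = max(private_bids)
    if rng.random() < buyout_prob:
        return {"owner": "public domain", "price_paid": markup * top_bid}
    return {"owner": "high bidder", "price_paid": top_bid}

# Hypothetical bids (in $ millions) for a newly patented drug.
print(patent_buyout([120, 95, 80], seed=7))
```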
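And for bagatelle (5), a toy numerical illustration of the O-Ring production function (my example, with made-up skill levels): when output is the product of task success probabilities, matching skilled workers together beats mixing them, and the return to a given skill improvement is larger the better one’s co-workers are – hence sorting and steep wage differences for small skill gaps.

```python
import numpy as np

def output(team_skills):
    # O-Ring production: one botched task ruins the product, so expected output
    # is the product of each worker's task success probability.
    return float(np.prod(team_skills))

high, low = 0.95, 0.75

# Sorting: two homogeneous teams of four beat two mixed teams of four.
sorted_total = output([high] * 4) + output([low] * 4)
mixed_total = 2 * output([high, high, low, low])
print(f"total output, sorted teams {sorted_total:.3f} vs mixed teams {mixed_total:.3f}")

# Complementarity: upgrading one worker from 0.75 to 0.95 is worth far more
# when co-workers are already skilled, so wages are convex in skill.
gain_with_high = output([high] * 4) - output([low, high, high, high])
gain_with_low = output([high, low, low, low]) - output([low] * 4)
print(f"value of the same upgrade: {gain_with_high:.3f} with skilled co-workers, "
      f"{gain_with_low:.3f} with unskilled co-workers")
```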


