The Limitations of Randomised Controlled Trials

Yves here. Even though this post is a bit wonky, it’s short and very important. And you need to read about something other than the election. Recall that Angus Deaton is the winner of the Swedish National Bank prize named for Alfred Nobel and, with Anne Case, performed an important study that exposed the spike in death rates among less-educated, middle-aged whites.

By Nancy Cartwright, Professor of Philosophy, University of Durham and University of California, San Diego; and Angus Deaton, Senior Scholar and Dwight D. Eisenhower Professor of Economics and International Affairs Emeritus, Woodrow Wilson School of Public and International Affairs and Economics Department, Princeton University. Originally published at VoxEU

In recent years, the use of randomised controlled trials has spread from labour market and welfare programme evaluation to other areas of economics (and to other social sciences), perhaps most prominently in development and health economics. This column argues that some of the popularity of such trials rests on misunderstandings about what they are capable of accomplishing, and cautions against simple extrapolations from trials to other contexts.

Randomised controlled trials (RCTs) have been sporadically used in economic research since the negative income tax experiments between 1968 and 1980 (see Wise and Hausman 1985), and have been regularly used since then to evaluate labour market and welfare programmes (Manski and Garfinkel 1992, Gueron and Rolston 2013). In recent years, they have spread widely in economics (and in other social sciences), perhaps most prominently in development and health economics. The ‘credibility revolution’ in econometrics (Angrist and Pischke 2010) putatively frees empirical investigation from implausible and arbitrary theoretical and statistical assumptions, and RCTs are seen as the most ‘credible’ and ‘rigorous’ of the credible methods; indeed, credible non-RCT designs typically pattern themselves as closely as possible on RCTs. Imbens (2010) writes, “Randomised experiments do occupy a special place in the hierarchy of evidence, namely at the very top.”

In medicine, Pocock and Elbourne (2000) argue that only RCTs “can provide a reliably unbiased estimate of treatment effects”, and without such estimates, they “see considerable dangers to clinical research and even to the well-being of patients”. The link between bias and risk to patients is taken as obvious, with no attempt to show that an RCT experimental design does indeed minimise the expected harm to patients. The World Bank has run many development-related RCTs, and makes claims well beyond unbiasedness. Its implementation manual states, “we can be very confident that our estimated average impact” (given as the difference in means between the treatment and control groups) “constitute the true impact of the program, since by construction we have eliminated all observed and unobserved factors that might otherwise plausibly explain the differences in outcomes” (Gertler et al. 2011). High-quality evidence indeed; the truth is surely the ultimate in credibility.

What Are Randomised Controlled Trials Good For?

In a recent paper, we argue that some of the popularity of RCTs, among the public as well as some practitioners, rests on misunderstandings about what they are capable of accomplishing (Deaton and Cartwright 2016). Well-conducted RCTs could provide unbiased estimates of the average treatment effect (ATE) in the study population, provided no relevant differences between treatment and control are introduced post randomisation, which blinding of subjects, investigators, data collectors, and analysts serves to diminish. Unbiasedness says that, if we were to repeat the trial many times, we would be right on average. Yet we are almost never in such a situation, and with only one trial (as is virtually always the case) unbiasedness does nothing to prevent our single estimate from being very far away from the truth. If, as is often believed, randomisation were to guarantee that the treatment and control groups are identical except for the treatment, then indeed we would have a precise – indeed exact – estimate of the ATE. But randomisation does nothing of the kind, even at baseline; in any given RCT, nothing ensures that other causal factors are balanced across the groups at the point of randomisation. Investigators often test for balance on observable covariates, but unless the randomisation device is faulty, or people systematically break their assignment, the null hypothesis underlying the test is true by construction, so that the test is not informative and should not be carried out.
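
To see the gap between unbiasedness and accuracy concretely, here is a minimal simulation sketch (our own illustration with invented numbers, not anything from the paper): across many hypothetical repetitions of a small trial the difference-in-means estimator averages out to the true ATE, yet any single randomisation can land well away from it, because the unobserved causal factor need not balance in that one draw.

```python
# Minimal sketch: unbiased on average across repeated trials, yet any single
# trial's estimate can be far from the truth. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 100              # hypothetical trial size
true_ate = 1.0       # hypothetical constant treatment effect
estimates = []

for _ in range(10_000):                      # repeat the (hypothetical) trial many times
    ability = rng.normal(0, 3, n)            # unobserved causal factor
    treated = rng.permutation(n) < n // 2    # randomise half into treatment
    outcome = ability + true_ate * treated
    estimates.append(outcome[treated].mean() - outcome[~treated].mean())

estimates = np.array(estimates)
print("mean of estimates (unbiased, near 1.0):", round(float(estimates.mean()), 2))
print("std dev of a single estimate:          ", round(float(estimates.std()), 2))
print("share of single trials off by > 0.5:   ",
      round(float((abs(estimates - true_ate) > 0.5).mean()), 2))
```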

Of course, we know that the ATE from an RCT is only an estimate, not the infallible truth, and like other estimates, it has a standard error. If appropriately computed, the standard error of the estimated ATE can give an indication of the importance of other factors. As was understood by Fisher from the very first agricultural trials, randomisation, while doing nothing to guarantee balance on omitted factors, gives us a method for assessing their importance. Yet even here there are pitfalls. The t-statistics for estimated ATEs from RCTs do not in general follow the t-distribution. As recently documented by Young (2016), a large fraction of published studies have made spurious inferences because of this Fisher-Behrens problem, or because of the failure to deal appropriately with multiple-hypothesis testing. Although most of the published literature is problematic, these issues can be addressed by improvements in technique. Not so, however, in cases where individual treatment effects are skewed – as in healthcare experiments, where one or two individuals can account for a large share of spending (this was true in the RAND Health Insurance Experiment); or in microfinance, where a few subjects make money and most do not (where the t-distribution again breaks down). Once again, inferences are likely to be wrong, but here there is no clear fix. When there are outlying individual treatment effects, the estimate depends on whether the outliers are assigned to treatments or controls, causing massive reductions in the effective sample size. Trimming of outliers would fix the statistical problem, but only at the price of destroying the economic substance of the problem; for example, in healthcare, it is precisely the few outliers that make or break a programme. In view of these difficulties, we suspect that a large fraction of the published results of RCTs in development and health economics are unreliable.
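
A hedged sketch of the outlier point (our own construction with hypothetical spending data, not the RAND figures): hold a heavily skewed study population fixed, give the ‘treatment’ no effect at all, and re-randomise many times; the estimate in any single trial then depends heavily on which arm the single biggest spender happens to fall into.

```python
# Sketch: with skewed outcomes, a single trial's ATE estimate hinges on where
# the outliers land, even when the true effect is exactly zero. Invented data.
import numpy as np

rng = np.random.default_rng(1)
n = 200
spending = rng.lognormal(mean=0.0, sigma=2.5, size=n)   # fixed, heavily skewed population
top = int(spending.argmax())                             # the single biggest spender
print("top individual's share of total spending:", round(float(spending[top] / spending.sum()), 2))

est_top_treated, est_top_control = [], []
for _ in range(5_000):                                   # re-randomise the same people
    treated = rng.permutation(n) < n // 2                # true treatment effect is zero
    est = spending[treated].mean() - spending[~treated].mean()
    (est_top_treated if treated[top] else est_top_control).append(est)

print("mean estimate when the top spender is treated:  ", round(float(np.mean(est_top_treated)), 2))
print("mean estimate when the top spender is a control:", round(float(np.mean(est_top_control)), 2))
```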

The ‘credibility’ of RCTs comes from their ability to get answers without the use of potentially contentious prior information about structure, such as specifying other causal factors or detailing the mechanisms through which they operate. A sceptical lay audience is often unwilling to accept prior economic knowledge, and even within the profession there are differences about appropriate assumptions or controls. Yet, as is always the case, the only route to precision is through prior information and controlling for factors that are likely to be important; in a (non-randomised) laboratory experiment in physics, biology, or even economics, scientists likewise seek accurate measurement by controlling for known confounders. Cumulative science happens when new results are built on top of old ones – or undermine them – and RCTs, with their refusal to use prior science, make this very difficult. And any RCT can be challenged ex post by examining the differences between treatments and controls as actually allocated, and showing that arguably important factors were unevenly distributed; prior information is excluded by randomisation, but reappears in the interpretation of the results.

A well-conducted RCT can yield a credible estimate of an ATE in one specific population, namely the ‘study population’ from which the treatments and controls were selected. Sometimes this is enough: if we are doing a post hoc programme evaluation, if we are testing a hypothesis that is supposed to be generally true, if we want to demonstrate that the treatment can work somewhere, or if the study population is a randomly drawn sample from the population of interest whose ATE we are trying to measure. Yet the study population is often not the population that we are interested in, especially if subjects must volunteer to be in the experiment and have their own reasons for participating or not. A famous early example comes from Ashenfelter (1981), who found that people who volunteer for a training programme tend to have seen a recent drop in their wages; similarly, people who take a drug may be those who have failed other forms of therapy. Indeed, many of the differences in results between experimental and non-experimental studies can be traced not to differences in methodology, but to differences in the populations to which they apply.
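
A back-of-the-envelope sketch in the spirit of the Ashenfelter example (the numbers and the ‘recent wage drop’ rule are entirely made up): when people select into a trial for reasons related to their own treatment effect, the ATE for volunteers can be very different from the ATE in the population a policymaker cares about.

```python
# Sketch: selection into the study population versus the population of interest.
# All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
recent_wage_drop = rng.random(N) < 0.2                   # say 20% recently saw wages fall
effect = np.where(recent_wage_drop, 2.0, 0.2)            # assumed: training helps them much more
volunteers = recent_wage_drop | (rng.random(N) < 0.05)   # mostly wage-drop people sign up

print("ATE in the study population (volunteers):", round(float(effect[volunteers].mean()), 2))
print("ATE in the population of interest:       ", round(float(effect.mean()), 2))
```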

The ‘Transportation’ Problem

More generally, demonstrating that a treatment works in one situation is exceedingly weak evidence that it will work in the same way elsewhere; this is the ‘transportation’ problem: what does it take to allow us to use the results in new contexts, whether policy contexts or in the development of theory? It can only be addressed by using previous knowledge and understanding, i.e. by interpreting the RCT within some structure, the structure that, somewhat paradoxically, the RCT gets its credibility from refusing to use. If we want to go from an RCT to policy, we need to build a bridge from the RCT to the policy. No matter how rigorous or careful the RCT, if the bridge is built by a hand-waving simile that the policy context is somehow similar to the experimental context, the rigour in the trial does nothing to support a policy; in any chain of evidence, it is the weakest link that determines the overall strength of the claim, not the strongest. Using the results of an RCT cannot be a matter of simple extrapolation from the experiment to another context. Causal effects depend on the settings in which they are derived, and often depend on factors that might be constant within the experimental setting but different elsewhere. Even the direction of causality can depend on the context. We have a better chance of transporting results if we recognise the issue when designing the experiment – which itself requires the commitment to some kind of structure – and try to investigate the effects of the factors that are likely to vary elsewhere. Without a structure, without an understanding of why the effects work, we not only cannot transport, but we cannot begin to do welfare economics; the fact that an intervention works, and that the investigator thinks it makes people better off, is no guarantee that it actually does so. Without knowing why things happen and why people do things, we run the risk of worthless casual (‘fairy story’) causal theorising, and we have given up on one of the central tasks of economics.
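
A toy illustration of the transportation problem (our own construction; ‘school quality’ is just a stand-in for any contextual factor the experimenter does not vary): if the effect works only through a factor that is constant inside the trial site but much lower in the policy context, the trial ATE is a poor guide to the policy effect.

```python
# Sketch: a treatment effect that depends on a contextual factor held constant
# within the trial but different where the policy would be deployed. Invented numbers.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

def outcome(treated, school_quality, noise):
    # assumed structure: the intervention pays off only where school quality is high
    return 1.5 * treated * school_quality + noise

treated = rng.permutation(n) < n // 2

y_trial = outcome(treated, np.ones(n), rng.normal(0, 1, n))        # quality = 1 at the trial site
print("trial ATE:", round(float(y_trial[treated].mean() - y_trial[~treated].mean()), 2))

y_target = outcome(treated, np.full(n, 0.2), rng.normal(0, 1, n))  # quality = 0.2 elsewhere
print("effect in the policy context:",
      round(float(y_target[treated].mean() - y_target[~treated].mean()), 2))
```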

See original post for references


31 comments

  1. Steve H.

    : Even the direction of causality can depend on the context.

    There’s yer problem. It means that, reflecting on this:

    : Cumulative science happens when new results are built on top of old ones – or undermine them

    whether the priors work is context-dependent. Time_2 is supposed to be dependent on Time_1. But conditions having changed at Time_2, when the antecedent becomes a precedent, means the coupling does not map 1:1.

    That’s not just a replication problem, it’s an epistemological one.

    “Ends achieved are nothing more than means expressed.” – R.G.H. Siu

    1. makedoanmend

      Thanks for the link!

      One of my favourite things – Guinness – mixed up one of my least favourite – stats.

      Who knew.

      And I’m taking stats in the new year for a biology course. Synchronicity in troubling times is always a delight.

      Next time I’m in Dublin, I’ll raise a pint to yee. Now for a heavy dose of reading your esteemed professor’s writings – made a start and I’m delighted.

      1. Katharine

        Double or triple that now I have gone on to the articles and found the footnote about Gosset’s using two kangaroos and a platypus to explain kurtosis!

  2. Cris Kennedy

    Randomised controlled trials in the field of medicine are usually designed to answer questions about the efficient use of health care dollars. This occurs because trials are expensive and must be paid for; in the end a prospective trial sponsor envisions some kind of “return on [their] investment.”

    Rather than ROI, I think RCTs in medicine should be designed around the question of what is the best way to die. It’s a radical idea. Most people won’t seriously consider it. One fact is undeniable, however: each of us will die someday. An example: if you thought you had a choice, wouldn’t you rather die of a nice clean heart attack than in a prolonged Alzheimer’s state? But look at the dollars that have poured into cholesterol-lowering drug research versus brain research… so why is this? As a physician I can tell you much research is needed on helping patients live lives that are more vigorous, more pain-free, with less anxiety, etc.

    I think the reason we’re not asking the right questions is that we’re in denial. We don’t want to die of anything. Thus, research continues to pursue ROI under the guise of prolonging “quality adjusted life years” rather than just facing the facts. Each of us is mortal; we really need to reprioritise precious research dollars with that in mind.

    1. HotFlash

      Interesting. Not a physician, just a patient (as rarely as I can manage). My next major Life Event will most likely be my death. Some of us would prefer to talk about it beforehand, but apparently it is Not Done.

  3. subgenius

    RCTs rely on a simplistic mechanistic model of reality – one size fits all similar circumstances…sadly, nature is a mess of complex interactions, feedbacks, etc.

    Toy-world models tend to fail when transplanted to the natural world. It is also worth looking at the gaming of the RCT paradigm by those with vested interests (unreleased studies, massaged data, etc)

  4. PHACOPS

    At least for industrial processes, and for the cost-efficiency of the data, I had gone to factorial designs. The problem that I’ve seen, though, is naive practitioners teasing out significance through illegitimate operations like non-orthogonal collapse of the matrix, which misallocates variance.

  5. TheCatSaid

    @subgenius

    RCTs rely on a simplistic mechanistic model of reality – one size fits all similar circumstances…sadly, nature is a mess of complex interactions, feedbacks, etc.

    Our individual characteristics are relevant. RCT tries its hardest to get rid of these “confounders” but it doesn’t work.

    From personal experience as a health care professional, whose training included medical anthropology as well as conventional anatomy, physiology etc.:

    One example in medicine of the inadequacy of attempting to use RCTs is Chinese herbal medicine. When used as designed, this is a system of medicine that relies on unique individual observations by a practitioner (after lengthy training and experience) of a patient who is seen as a unique individual. Chinese herbs, when combined, can completely change the characteristics any given herb has when used on its own. The traditional knowledge of the impact of herbal combinations on unique patients is both an art and a skill. To use an RCT would be not only ineffective but immoral within the context of Chinese herbal medicine used as designed (and not in a dumbed-down-for-Westerners OTC context).

  6. Greg Taylor

    Using multiple metrics to determine treatment effectiveness might address a few of these issues. For instance, estimating and reporting the distribution of treatment effects across populations could avoid some of the issues with using averages or medians to determine effectiveness. You could then make statements about the percentage of the population expected to see practically significant effects (a rough sketch of this follows below).

    Once a treatment is approved, it takes too long to determine that it doesn’t “transport.” We should be doing a better job of monitoring and measuring the effects of treatments deemed successful enough to unleash widely.
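
    A minimal sketch of the distributional reporting suggested above (invented numbers only; reading quantile differences of outcomes as individual treatment effects requires an extra assumption such as rank invariance):

    ```python
    # Sketch: report quantile differences and the whole outcome distribution,
    # not just the average. All numbers are invented.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 5_000
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(0.0, 1.0, n) + np.where(rng.random(n) < 0.3, 1.5, 0.0)  # only ~30% benefit

    for q in (0.10, 0.25, 0.50, 0.75, 0.90):
        diff = np.quantile(treated, q) - np.quantile(control, q)
        print(f"quantile {q:.2f}: difference = {diff:+.2f}")
    print("average difference (ATE):", round(float(treated.mean() - control.mean()), 2))
    ```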

  7. kareninca

    That was amazingly well written for an academic article. It would have been nice if they could have used concrete examples, but I guess they could reasonably assume that their expected audience would have examples in mind.

    I’ve always been skeptical of lab work results, but this particular study was the tipping point for me: http://www.nature.com/nmeth/journal/v11/n6/full/nmeth.2935.html. It turns out that exposure to the smell of male researchers (but not to the smell of female researchers) massively raises cortisol/stress levels in rats. This has to affect the interpretation of the results (to put it charitably) of just about every rat study out there. All that rat suffering for nothing. This is a variable that was not even looked into until this study; how many more variables are being missed? And are present studies using rats taking this into account? I don’t think so.

    1. TheCatSaid

      Thank you for that link. It’s a real eye opener. We already know that animals see and smell and hear things we cannot–just think of what else they must be picking up on, and the multitude of ways it might cause confounding responses that unknowingly affect results.

      Then add the numerous known instances of animal communication. (A good personal friend was a professional animal communicator so this is nothing new to me, including extensive personal experience.) Animals in experiments could be picking up on many aspects of human thought which could also influence experiments in many ways.

    2. Cojo

      Add to that the fact that lab rats are blind, so their olfactory brain areas are enlarged, magnifying the effect.

  8. Gabriel

    Happened to know about this paper from Lars P. Syll’s blog. He adds a nice gloss here.

    External validity/extrapolation/generalization is founded on the assumption that we could make inferences based on P(A|B) that is exportable to other populations for which P'(A|B) applies. Sure, if one can convincingly show that P and P’ are similar enough, the problems are perhaps surmountable. But arbitrarily just introducing functional specification restrictions of the type invariance/stability/homogeneity is, at least for an epistemological realist, far from satisfactory. And often it is – unfortunately – exactly this that I see when I take part of mainstream economists’ RCTs and ‘experiments.’

    Many ‘experimentalists’ claim that it is easy to replicate experiments under different conditions and therefore a fortiori easy to test the robustness of experimental results. But is it really that easy? Population selection is almost never simple. Had the problem of external validity only been about inference from sample to population, this would be no critical problem. But the really interesting inferences are those we try to make from specific labs/experiments/fields to specific real world situations/institutions/structures that we are interested in understanding or (causally) to explain. And then the population problem is more difficult to tackle.

    In randomized trials the researchers try to find out the causal effects that different variables of interest may have by changing circumstances randomly — a procedure somewhat (‘on average’) equivalent to the usual ceteris paribus assumption.

    Besides the fact that ‘on average’ is not always ‘good enough,’ it amounts to nothing but hand waving to simpliciter assume, without argumentation, that it is tenable to treat social agents and relations as homogeneous and interchangeable entities.
    (…)

    The problem here is that although we may get an estimate of the ‘true’ average causal effect, this may ‘mask’ important heterogeneous effects of a causal nature. Although we get the right answer of the average causal effect being 0, those who are ‘treated’ (X=1) may have causal effects equal to –100 and those ‘not treated’ (X=0) may have causal effects equal to 100. Contemplating being treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the OLS average effect particularly enlightening.
    (…)
    Most ‘randomistas’ underestimate the heterogeneity problem. It does not just turn up as an external validity problem when trying to ‘export’ regression results to different times or different target populations. It is also often an internal problem to the millions of regression estimates that economists produce every year.

    Highly recommend the guy’s blog to anyone interested in this kind of question.

    (As it happens, the polisci department of a certain Bay Area Institution I once attended rented a number of West African villages in order to “randomize” treatment and control for divide-the-dollar games that were meant to reveal something very important about “institutions” in the Acemoglu and Robinson sense.)
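
    A tiny numerical sketch of the masking point in the quoted passage (a simplified RCT-style version with invented potential outcomes, not Syll’s own setup): an average effect near zero can hide individual effects of –100 and +100.

    ```python
    # Sketch: an average treatment effect near zero masking huge individual effects.
    # Hypothetical numbers throughout.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 10_000
    individual_effect = np.where(rng.random(n) < 0.5, -100.0, 100.0)  # half harmed, half helped
    baseline = rng.normal(0.0, 1.0, n)

    treated = rng.permutation(n) < n // 2
    y = baseline + individual_effect * treated

    print("estimated average effect (near zero):",
          round(float(y[treated].mean() - y[~treated].mean()), 1))
    print("share with individual effect -100:", round(float((individual_effect < 0).mean()), 2))
    print("share with individual effect +100:", round(float((individual_effect > 0).mean()), 2))
    ```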

    1. John Zelnicker

      @Gabriel – I wholeheartedly concur with your recommendation of Lars Syll’s blog. He calls himself a critical realist and teaches economics in Sweden. Many of his blog posts are devoted to tearing down the current mainstream approach to economics, especially the inherent problems with the use of DSGE models based on totally unrealistic assumptions like rational expectations. Some of his takedowns of Krugman are a delight to read. Although he can get a bit wonkish at times, I think he is an important voice in heterodox economics. He is also a proponent of MMT.

      1. SoCal Rhino

        Like much of Yves’ original work, the quality is in the difficulty, I think. Need to dig into the hard bits before you can internalize it and own it going forward.

    2. SoCal Rhino

      Thanks! Lost the link to his site and couldn’t remember his name. Found much to chew on in his writing. Pretty effective critic of Krugman’s IS-LM model, one of the people I was reading when I stumbled on this site.

  9. Peter L.

    Thanks to everyone for further links on this topic. Here is another: http://philosophybites.com/2015/11/john-worrall-on-evidence-based-medicine.html

    Interview with John Worrall about evidence based medicine with short discussion of randomised controlled trials.

    On the other hand … I think the Freakonomics podcast had an episode during which randomized controlled trials were praised as “the very best way to learn about the world.” I was (sort of) surprised by the attitudes expressed by the Freakonomics producers. They seemed unusually taken with randomization, in an almost religious way.
