When ‘can’t miss’ programs fail
The movement to require governments to prioritize evidence-based programs may be doing more harm than good.
We have known for a long time that rigorous evaluations of social programs almost always fail to find meaningful impacts. It has been 40 years since the sociologist Peter Rossi promulgated his iron law of evaluation, “The expected value of any new impact assessment of any large-scale social program is zero,” and his stainless steel law, “The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.” It has been 14 years since Jon Baron and Isabel Sawhill reviewed the 10 most recent randomized controlled trials (RCTs) of large federal social programs and reported that nine found “weak or no positive effects” and one found “meaningful, though modest” effects.
As Rossi himself noted, the weak detectable impacts of large-scale programs could be the result of ineffective implementation of effective program models rather than a failure of the program models themselves. (Consistent with this hypothesis, my Government Performance Lab seeks to improve outcomes of social programs through better implementation.)
But what is striking about more recent evaluation findings is how often even the most exemplary interventions are found to have zero or negligible effects — in situations with plenty of attention to effective implementation. I’m thinking, for example, of the evaluation of the Camden Coalition program providing intensive care management for health care superutilizers, a program that was profiled by Atul Gawande in The New Yorker; the evaluation of the South Carolina expansion of Nurse Family Partnership, a program model that is at the top of nearly every list of “evidence-based” programs; and the failure to meet pre-specified targets of some of the RCT-evaluated social impact bond projects, program models deliberately picked from the universe of social policy interventions as the ones most worthy of investor backing.
What are the implications when even the “can’t-miss” programs regularly fail to exhibit detectable impacts?
First, we need a lot more experimentation, a lot more evaluation and a lot more iteration. Discovering scalable solutions to social problems is highly valuable. So even if it takes 50 tries to discover an effective intervention, it can be worth it. Since the benefits of solving social problems can rarely be monetized by the innovators, government and philanthropy need to be more generous and more strategic in funding experimentation. For example, rather than immediately cutting off resources from programs that produce a null evaluation result, funders should consider reinvesting in them so that program leaders can use the evaluation findings to improve their service delivery models. And we should be doing many more RCTs so that we can determine which approaches are truly effective, figure out what features of successful programs need to be maintained as they are spread and understand which programs are the best match for particular population subgroups.
Second, we must be more sophisticated in interpreting and learning from null findings. Suppose an RCT of an exemplary program fails to find a meaningful positive impact. Is it because the program never actually worked, perhaps because the prior evaluations generated false positives through hyping a few endpoints or subgroup effects out of a much larger set of null results? Is it that the world has changed such that a previously effective program no longer delivers the same impact today? Is it that the intended intervention was not actually delivered or was delivered ineffectively? Or is it that the program does work effectively, but the evaluation was incapable of discovering this, perhaps because the control group received similar services, the study was underpowered, or the outcomes we are capable of measuring are only a subset of program effects?
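To see how easily a handful of hyped endpoints can produce a false positive, consider a minimal sketch. It assumes the endpoints are statistically independent and that each is tested at the conventional 5 percent significance level; both are simplifying assumptions for illustration, not features of any particular evaluation, and the function name is hypothetical.

```python
def prob_any_false_positive(num_endpoints: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_endpoints` independent tests comes up
    'significant' at level `alpha` when the program truly has no effect."""
    return 1.0 - (1.0 - alpha) ** num_endpoints


# Illustrative endpoint counts (assumed, not drawn from any real study).
for k in (1, 5, 20):
    print(k, round(prob_any_false_positive(k), 2))
# 1 0.05
# 5 0.23
# 20 0.64
```

Under these assumptions, a study that examines 20 endpoints or subgroups has roughly a two-in-three chance of turning up at least one "significant" result even if the program does nothing at all.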
Third, we need to be much more cautious about overly fixating on some programs as “evidence-based” and prioritizing them for funding. It is, of course, a good thing to spend funds on program models with the highest expected impact and to create incentives for providers to take the courageous step of undergoing a rigorous impact evaluation. But I think the movement to require governments to prioritize evidence-based programs for funding may be doing more harm than good. Fundamentally, the evidence base behind many “evidence-based” programs is quite weak, even before considering external validity. This means it can be a mistake to prioritize a national “evidence-based” provider over a strong local provider. Moreover, “evidence-based” providers are often reluctant to adjust their program models once they have been certified as evidence-based. As a result, requirements to allocate funding to these providers can end up giving monopoly power to program models developed several decades ago, thereby disincentivizing innovation and preventing us from discovering even better solutions.
Fourth, when we conduct randomized controlled trials of social interventions, we should almost always include an extra experimental arm in which we give a group of participants cash equal to the cost of the intervention rather than providing them with the intervention. I am skeptical that a broad-based Universal Basic Income program would be the highest-value use of incremental tax revenue. But given how hard it is to come up with successful social interventions, there is a good chance we will find that providing cash to more limited populations like those we target for social interventions will often be better than the best intervention we can design. Moreover, the right decision rule for spreading a successful program is not simply that it has benefits greater than costs but also that there is nothing else we know how to do (such as providing cash) that has higher net benefits.
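The decision rule described above can be sketched in a few lines. The function names and the per-participant dollar figures are hypothetical, chosen only to illustrate the comparison; they are not results from any actual trial.

```python
def net_benefit(benefit: float, cost: float) -> float:
    """Net benefit per participant: measured benefit minus cost."""
    return benefit - cost


def should_spread(program_net: float, alternative_nets: list[float]) -> bool:
    """Spread a program only if its net benefit is positive AND no known
    alternative (such as a cash arm) has a higher net benefit."""
    best_alternative = max(alternative_nets, default=float("-inf"))
    return program_net > 0 and program_net >= best_alternative


# Hypothetical figures: the cash arm costs the same as the intervention.
cost = 5_000.0
program_net = net_benefit(benefit=6_500.0, cost=cost)  # 1500.0
cash_net = net_benefit(benefit=7_200.0, cost=cost)     # 2200.0

print(should_spread(program_net, [cash_net]))  # False: cash does better
print(should_spread(program_net, []))          # True: passes only because nothing is compared
```

In this made-up example the program clears a simple benefit-cost test, yet it would not be the right thing to spread once the cash arm is in the comparison. Without the cash arm, the evaluation could not have revealed that.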
Fifth, if we want to find solutions that really move the dial on major social problems, we need to be willing to experiment with systemic changes, not simply incremental ones, even if the systemic changes can be evaluated only with less definitive methods like before-and-after comparisons, difference in differences or synthetic controls. Evaluation work in social policy is too focused on producing rigorous estimates of narrow interventions at the expense of designing, testing and assessing bigger changes. So at the same time that we greatly increase the number of RCTs we perform, we need an even greater percentage increase in research that tests more systemic changes. The tests of systemic changes should include more research on changes that prevent harms from occurring in the first place and reduce the need for downstream interventions.
Both in my academic work conducting large-scale RCTs and in my policymaking work as a federal budget official, I have seen consumers of research findings struggle with how to react to null findings from RCTs. If we want to make more rapid progress on difficult social problems, we need to get away from the “what works” framework that reduces program effectiveness to a static binary outcome. Instead, we need sustained research programs that 1) test many alternative approaches to important problems and expect that many will fail, 2) allow innovators to swing and miss and then recalibrate their efforts, and 3) use evaluation not as a tool for culling the weakest members of the herd, but instead as a way to generate the learning necessary for us to eventually discover highly effective solutions.
Jeffrey Liebman is the Robert W. Scrivner Professor of Social Policy at Harvard’s Kennedy School, where he directs the Taubman Center for State and Local Government, the Rappaport Institute for Greater Boston and the Government Performance Lab.