The Numbers that Drive Policy
"Evidence-based policies" are in vogue. But how do you synthesize the evidence base? People often engage in "vote counting": reading the literature and consciously or subconsciously summing up the number of findings for a positive effect, a negative effect, or no effect for a particular program. The group with the greatest number wins.
Unfortunately, vote counting is not an ideal way to synthesize the evidence. The biggest problem is that some "no effect" papers were unlikely to find an effect even if there was one. Many studies in development use too small a sample to be likely to detect an effect, so the fact that their results are insignificant is not actually very informative.
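To see how weak such a test can be, here is a minimal sketch of a power calculation for a two-arm trial. Every number in it is an illustrative assumption, not a figure from any actual study: outcomes are standardized to a standard deviation of one, and the true effect is taken to be 0.2 standard deviations.

```python
# A minimal sketch (assumed, illustrative numbers only) of how little
# power a small study has to detect a modest effect.
from scipy.stats import norm

def power_two_arm(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test for a difference in means
    between two equal-sized arms, with outcomes standardized to SD = 1."""
    se = (2 / n_per_arm) ** 0.5       # standard error of the difference in means
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value (1.96 at alpha=0.05)
    # Probability the estimate lands beyond the significance threshold
    return 1 - norm.cdf(z_crit - effect_size / se)

# A true effect of 0.2 standard deviations:
print(power_two_arm(0.2, n_per_arm=50))   # ~0.17: "no effect" is the likely verdict
print(power_two_arm(0.2, n_per_arm=400))  # ~0.81: conventionally well powered
```

With 50 people per arm, such a study would report "no effect" more than four times out of five even though the effect is real.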
An alternative technique, meta-analysis, can aggregate many individually insignificant findings and sometimes yield a jointly significant result. It also allows studies to be weighted differently, typically by their precision, since not all studies are equal.
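Here is a minimal sketch of the standard fixed-effect, inverse-variance approach, using made-up estimates and standard errors. Each hypothetical study is insignificant on its own -- a vote count would score three votes for "no effect" -- yet the precision-weighted pooled estimate clears the usual 5% threshold.

```python
# A minimal sketch of fixed-effect, inverse-variance meta-analysis.
# The estimates and standard errors below are made up for illustration.
estimates = [0.15, 0.12, 0.18]  # hypothetical treatment effect estimates
ses       = [0.10, 0.09, 0.11]  # hypothetical standard errors

# Each study alone: |z| < 1.96, so a vote count records three "no effect"s.
for b, se in zip(estimates, ses):
    print(f"study z = {b / se:.2f}")  # 1.50, 1.33, 1.64

# Pooled estimate: weight each study by its precision (1 / variance).
weights = [1 / se**2 for se in ses]
pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5
print(f"pooled z = {pooled / pooled_se:.2f}")  # ~2.55: jointly significant
```

A fixed-effect model is just one choice; random-effects models, which let the true effect vary across studies, are also common. Either way, the pooled result will generally differ from both the vote count and any single study.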
In most of the cases in which vote counting and meta-analysis diverge, vote counting reports an insignificant result while meta-analysis reports a significant positive result. For example, both conditional and unconditional cash transfer programs often show several "no effect" results -- "cash transfers don't work!" These programs affect a very broad range of outcomes, but because some of those outcomes are only tangentially related to the intervention, it is harder to see an effect in any one study. Aggregate the insignificant results on labour force participation, grade promotion, or test scores through meta-analysis, however, and they become significant -- "cash transfers work!"
The error of overstating the strength of "no effect" results through vote counting is all the worse given that "no effect" does not really mean no effect. The common misconception is that failure to reject the null hypothesis of no effect means we have accepted that null hypothesis, but that is simply untrue. Absence of evidence gets treated as evidence of absence, though that is not what the test says; perhaps with a bit more data the result would become significant.
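A small numerical illustration, again with made-up numbers: an estimated effect of 0.10 with a standard error of 0.06 fails a 5% two-sided test, yet its confidence interval comfortably includes large effects, and merely doubling the sample would make the same estimate significant.

```python
# A minimal sketch with made-up numbers: an insignificant estimate that
# is still consistent with meaningfully large effects.
from scipy.stats import norm

def two_sided_p(estimate, se):
    return 2 * (1 - norm.cdf(abs(estimate) / se))

est, se = 0.10, 0.06
print(two_sided_p(est, se))              # ~0.10: fail to reject "no effect"
print(est - 1.96 * se, est + 1.96 * se)  # 95% CI ~(-0.02, 0.22): includes zero
                                         # AND substantial positive effects
# Doubling the sample shrinks the standard error by sqrt(2):
print(two_sided_p(est, se / 2**0.5))     # ~0.02: same estimate, now significant
```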
How big is this problem? Preliminary analysis of a database of development studies I have assembled through a group called AidGrade suggests that the meta-analysis result for a particular intervention-outcome combination diverges from the vote-counting result about a third of the time. In fact, vote counting gives results very similar to those one would get by looking at a single paper selected at random from the entire literature; that is not a great foundation for policy recommendations. If we want to use rigorous evidence, we have to be rigorous about how we use rigorous evidence.