Since it began in 2002, the What Works Clearinghouse has played an important role in finding, rating, and publicizing findings of evaluations of educational programs. It performs a crucial function for evidence-based reform. For this very reason, it needs to be right. But in several important ways, it uses procedures that are indefensible and have a big impact on its conclusions.
One of these relates to a study rating called “substantively important-positive.” This refers to study outcomes that have an effect size of at least +0.25 but are not statistically significant. I’ve written about this before, but the WWC has recently released a database of information on its studies that makes it easy to analyze WWC data on a large scale, and we have learned a lot more about this topic.
Study outcomes rated as “substantively important-positive” can qualify a study as “potentially positive,” the second-highest WWC rating. “Substantively important-negative” findings (non-significant effect sizes less than -0.25) can cause a study to be rated as “potentially negative.” That can keep a program from ever getting a positive rating: under current rules, a single “potentially negative” rating ensures that a program can never receive a rating better than “mixed,” even if other studies found hundreds of significant positive effects.
People who follow the WWC and know about “substantively important” may assume that it is a strange rule, but one that rarely comes into play in practice. That is not true.
My graduate student, Amanda Inns, has just done an analysis of WWC data from their own database, and if you are a big fan of the WWC, this is going to be a shock. Amanda has looked at all WWC-accepted reading and math studies. Among these, she found a total of 339 individual outcomes rated “positive” or “potentially positive.” Of these, 155 (46%) reached the “potentially positive” level only because they had effect sizes over +0.25, but were not statistically significant.
Another 36 outcomes were rated “negative” or “potentially negative.” 26 of these (72%) were categorized as “potentially negative” only because they had effect sizes less than -0.25 and were not significant. I’m sure patterns would be similar for subjects other than reading and math.
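For readers who want to check the arithmetic, here is a minimal sketch that tallies the counts reported above. The variable and function names are my own, chosen for illustration; the four counts are the only inputs, and they come straight from the figures quoted in this post. The last line pools both directions to give the overall share of rated outcomes that were not statistically significant.

```python
# Counts quoted above from the analysis of WWC reading and math outcomes.
positive_total = 339    # rated "positive" or "potentially positive"
positive_nonsig = 155   # of those, rated only on effect size, not significance
negative_total = 36     # rated "negative" or "potentially negative"
negative_nonsig = 26    # of those, rated only on effect size, not significance


def share(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


positive_share = share(positive_nonsig, positive_total)
negative_share = share(negative_nonsig, negative_total)
combined_share = share(
    positive_nonsig + negative_nonsig,
    positive_total + negative_total,
)

print(positive_share, negative_share, combined_share)
```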
Put another way, almost half (48%) of outcomes rated positive/potentially positive or negative/potentially negative by the WWC were not statistically significant. As one example of what I’m talking about, consider a program called The Expert Mathematician. It had just one study with only 70 students in 4 classrooms (2 experimental and 2 control). The WWC re-analyzed the data to account for clustering, and the outcomes were nowhere near statistically significant, though they were greater than +0.25. This tiny study, and this study alone, caused The Expert Mathematician to receive the WWC “potentially positive” rating and to be ranked seventh among all middle school math programs. Similarly, Waterford Early Learning received a “potentially positive” rating based on a single tiny study with only 70 kindergarteners in 6 schools. The outcomes ranged from -0.71 to +1.11, and though the mean was more than +0.25, the outcome was far from significant. Yet this study alone put Waterford on the WWC list of proven kindergarten programs.
I’m not taking any position on whether these particular programs are in fact effective. All I am saying is that these very small studies with non-significant outcomes say absolutely nothing of value about that question.
I’m sure that some of you nerdier readers who have followed me this far are saying to yourselves, “well, sure, these substantively important studies may not be statistically significant, but they are probably unbiased estimates of the true effect.”
More bad news. They are not. Not even close.
The problem, also revealed in Amanda Inns’ data, is that studies with large effect sizes but not statistical significance tend to have very small sample sizes (otherwise, they would have been significant). Across WWC reading and math studies that used individual-level assignment, median sample sizes were 48, 74, and 86 for substantively important, significant, and indeterminate (non-significant with ES < +0.25) outcomes, respectively. For cluster studies, they were 10, 17, and 33 clusters, respectively. In other words, “substantively important” outcomes came from much smaller samples than other outcomes, with median sizes between roughly one-third and two-thirds of those in the other categories.
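To see why an effect size of +0.25 cannot reach significance in a study of this size, here is a rough sketch using the standard large-sample approximation for the standard error of a standardized mean difference. It rests on several assumptions of mine: two equal groups, individual-level assignment, no covariate adjustment, and a normal approximation with a two-tailed critical value of 1.96. It is an illustration only, not the WWC’s own procedure, and the function names are hypothetical.

```python
import math

Z_CRIT = 1.96  # two-tailed critical value at p < .05, normal approximation


def se_of_d(d, n_treat, n_control):
    """Large-sample standard error of a standardized mean difference d."""
    n = n_treat + n_control
    return math.sqrt(n / (n_treat * n_control) + d ** 2 / (2 * n))


def z_stat(d, n_treat, n_control):
    """Effect size divided by its approximate standard error."""
    return d / se_of_d(d, n_treat, n_control)


# 48 students (the median for "substantively important" outcomes), split evenly.
z_small = z_stat(0.25, 24, 24)

# The same effect size in a hypothetical study of 300 students, split evenly.
z_large = z_stat(0.25, 150, 150)

print(z_small < Z_CRIT, z_large > Z_CRIT)
```

Under these assumptions, the 48-student study yields a z of about 0.86, well short of 1.96, while the same +0.25 in the 300-student study yields a z of about 2.16. Accounting for clustering would only make the small-study result weaker.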
And small-sample studies greatly overstate effect sizes. Among all factors that bias effect sizes, small sample size is the most important (only use of researcher/developer-made measures comes close). So a non-significant positive finding in a small study is not an unbiased point estimate that just needs a larger sample to show its significance. It is probably biased, in a consistent, positive direction. Studies with sample sizes less than 100 have about three times the mean effect sizes of studies with sample sizes over 1000, for example.
But “substantively important” ratings can throw a monkey wrench into current policy. The ESSA evidence standards require statistically significant effects for all of its top three levels (strong, moderate, and promising). Yet many educational leaders are using the What Works Clearinghouse as a guide to which programs will meet ESSA evidence standards. They may logically assume that if the WWC says a program is effective, then the federal government stands behind it, regardless of what the ESSA evidence standards actually say. Yet in fact, based on the data analyzed by Amanda Inns for reading and math, 46% of the outcomes rated as positive/potentially positive by WWC (taken to correspond to “strong” or “moderate,” respectively, under ESSA evidence standards) are non-significant, and therefore do not qualify under ESSA.
The WWC needs to remove “substantively important” from its ratings as soon as possible, to avoid a collision with ESSA evidence standards, and to avoid misleading educators any further. Doing so would help make the WWC’s impact on ESSA substantive. And important.
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.