I recently had a conversation with a very well-informed, statistically savvy, and experienced researcher, who was upset that we do not accept researcher- or developer-made measures for our Evidence for ESSA website (www.evidenceforessa.org). “But what if a test is reliable and valid?” she said. “Why shouldn’t it qualify?”
I inwardly sighed. I get this question a lot. So I thought I’d write a blog on the topic, so at least the people who read it, and perhaps their friends and colleagues, will know the answer.
Before I get into the psychometric stuff, I should say in plain English what is going on here, and why it matters. Evidence for ESSA excludes researcher- and developer-made measures because they enormously overstate effect sizes. Marta Pellegrini, at the University of Florence in Italy, recently analyzed data from every reading and math study accepted for review by the What Works Clearinghouse (WWC). She compared outcomes on tests made up by researchers or developers to those that were independent. The average effect sizes across hundreds of studies were +0.52 for researcher/developer-made measures, and +0.19 for independent measures. Almost three to one. We have also made similar comparisons within the very same studies, and the differences in effect sizes averaged 0.48 in reading and 0.45 in math.
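To make the “almost three to one” concrete, here is the arithmetic on the two averages quoted above. The variable names are mine, chosen for illustration; the only inputs are the +0.52 and +0.19 from the analysis.

```python
# Average effect sizes quoted in the post (WWC reading and math studies).
researcher_made = 0.52
independent = 0.19

# How many times larger the researcher/developer-made average is.
ratio = researcher_made / independent

# The absolute gap between the two averages, in standard deviation units.
gap = researcher_made - independent

print(f"ratio: {ratio:.2f} to 1")
print(f"gap:   {gap:+.2f}")
```

The ratio comes out a little under three, which is where “almost three to one” comes from.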
How could there be such a huge difference? The answer is that researchers’ and developers’ tests often focus on what they know will be taught in the experimental group but not the control group. A vocabulary experiment might use a test that contains the specific words emphasized in the program. A science experiment might use a test that emphasizes the specific concepts taught in the experimental units but not in the control group. A program using technology might test students on a computer, which the control group did not experience. Researchers and developers may give tests that use response formats like those used in the experimental materials, but not those used in control classes.
Very often, researchers or developers have a strong opinion about what students should be learning in their subject, and they make a test that represents to them what all students should know, in an ideal world. However, if only the experimental group experienced content aligned with that curricular philosophy, then the experimental group has a huge unfair advantage over the control group.
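For readers who like to see the mechanism, here is a minimal simulation sketch of this unfair advantage. Everything in it is an assumption made up for illustration, not data from any study: the true program effect (0.2), the extra “alignment bonus” the experimental group gets on the researcher-made test (0.4), the noise level, and the function names.

```python
import random
import statistics


def cohens_d(treated, control):
    """Standardized mean difference using the pooled standard deviation."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * statistics.variance(treated)
                  + (n_c - 1) * statistics.variance(control)) / (n_t + n_c - 2)
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_var ** 0.5


def simulate(n_per_group=5000, true_effect=0.2, alignment_bonus=0.4, seed=0):
    """Score the same simulated students on two tests (all values hypothetical)."""
    rng = random.Random(seed)
    scores = {"independent": {True: [], False: []},
              "researcher_made": {True: [], False: []}}
    for treated in (True, False):
        for _ in range(n_per_group):
            ability = rng.gauss(0, 1)
            # Real learning gain from the program, visible on any fair test.
            gain = true_effect if treated else 0.0
            # Extra points only on the aligned test, because only the
            # experimental group was taught its specific content.
            bonus = alignment_bonus if treated else 0.0
            scores["independent"][treated].append(
                ability + gain + rng.gauss(0, 0.5))
            scores["researcher_made"][treated].append(
                ability + gain + bonus + rng.gauss(0, 0.5))
    return {name: cohens_d(groups[True], groups[False])
            for name, groups in scores.items()}


results = simulate()
for name, d in results.items():
    print(f"{name}: d = {d:+.2f}")
```

The students and the program are identical for both tests. The only thing that differs is whether the test rewards content that the control group never had a chance to learn, and that alone is enough to multiply the apparent effect size.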
So how can it be that using even the most reliable and valid tests doesn’t solve this problem?
In Alice in Wonderland, the March Hare tries to fix the Mad Hatter’s watch by putting butter in the works. This does not help at all, and when the Hatter complains, the March Hare meekly replies, “It was the best butter.”
The point of the “very best butter” conversation in Alice in Wonderland is that something can be excellent for one purpose (e.g., spreading on bread), but worse than useless for another (e.g., fixing watches).
Returning to assessment, a test made by a researcher or developer might be ideal for determining whether students are making progress in the intended curriculum, but worthless for comparing experimental to control students.
Reliability (the ability of a test to give the same answer each time it is given) has nothing at all to do with the situation. Validity comes into play where the rubber hits the road (or the butter hits the watch).
Validity can mean many things. As reported in test manuals, it usually just means that a test’s scores correlate with scores on other tests intended to measure the same thing (convergent validity), or possibly that it correlates better with things it should correlate with than with things it shouldn’t, as when a reading test correlates better with other reading tests than with math tests (discriminant validity). However, no test manual ever addresses validity for use as an outcome measure in an experiment. For a test to be valid for that use, it must measure content being pursued equally in experimental and control classes, not biased toward the experimental curriculum.
A researcher- or developer-made test that reports very high reliability and validity in its test manual or research report may be admirable for many purposes, but like “the very best butter” for fixing watches, it is worse than worthless for evaluating experimental programs, no matter how high it is in reliability and validity.
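A small sketch may help show why manual-style statistics cannot detect this kind of bias. It assumes, hypothetically, that reliability is reported as a test-retest correlation and convergent validity as a correlation with an independent test, both computed on a norming sample of students who never saw the program. The sample size, noise level, and names are all made up for illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)

# Hypothetical norming sample: students who were never in any experiment.
ability = [rng.gauss(0, 1) for _ in range(5000)]


def administer(noise_sd=0.5):
    """One test administration: underlying ability plus measurement noise."""
    return [a + rng.gauss(0, noise_sd) for a in ability]


researcher_first = administer()    # researcher-made test, first sitting
researcher_retest = administer()   # same test, second sitting
independent_test = administer()    # an independent test of the same skill

reliability = pearson_r(researcher_first, researcher_retest)
convergent_validity = pearson_r(researcher_first, independent_test)

print(f"test-retest reliability: {reliability:.2f}")
print(f"convergent validity:     {convergent_validity:.2f}")
```

Nothing in either correlation involves a comparison between an experimental group and a control group, so a test can score well on both and still tilt heavily toward whatever the experimental group was taught.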
This blog was developed with support from the Laura and John Arnold Foundation. The views expressed here do not necessarily reflect those of the Foundation.