Random Thoughts


Perhaps the most extraordinary change brought about by the evidence-based reform movement in education is the rapidly expanding number of experimental studies that use random assignment to treatment or control groups. As recently as the 1990s, randomized experiments were rare in education. However, in the 2000s, the new Institute of Education Sciences (IES) began to strongly encourage randomized experiments, and later Investing in Innovation (i3) insisted on randomization for its larger grants. IES established training programs to greatly increase the number of scholars able to design, carry out, and analyze randomized experiments. In England, similar developments took place with the establishment of the Education Endowment Foundation (EEF). As randomized experiments have become the standard of evidence, other agencies, private foundations, and commercial companies have also begun to sponsor randomized experiments.

The importance of randomization is clear from the world of medicine. One of the most important reasons that medicine makes rapid and irreversible progress in new drugs and procedures is that it routinely subjects promising treatments to experiments in which subjects are assigned at random to receive either the new treatment or a control treatment representing the current standard of care. Random assignment is essential in experiments because it ensures that, on average, subjects in the experimental and control groups can be considered equal in every way except for the treatment itself.

The main alternative to randomized experiments is quasi-experiments, in which subjects who get the new treatment are matched on key factors with similar subjects who do not. In education, especially in evaluations of methods intended to increase learning, students may be matched on prior achievement, as well as demographic factors such as social class, race, and English proficiency. Quasi-experiments can be very good in many cases, but in a quasi-experiment there is always a chance that some unmeasured factor could explain positive-looking effects. For example, even if a quasi-experiment matched students on achievement and demographics, teachers may not be equal because those using the new method chose to do so, while the control teachers did not. The teachers who chose the method might be better teachers, perhaps more enthusiastic, harder working, or more positively oriented toward innovation, and these factors rather than the treatment itself could lead to improved outcomes.
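The teacher self-selection concern can be illustrated with a small simulation. This is only a sketch under made-up assumptions, not an analysis of any real study: the function name, the "quality" trait, the rule that above-average teachers opt in, and all of the numbers are hypothetical. The treatment is given no true effect at all, so any gap between groups is bias.

```python
import random
import statistics


def simulate_selection(n_teachers=2000, true_effect=0.0, seed=0):
    """Compare a self-selected design with a randomized one (hypothetical data)."""
    rng = random.Random(seed)

    # Unmeasured teacher trait (say, enthusiasm). Matching on student
    # pretests and demographics would not capture it.
    quality = [rng.gauss(0, 1) for _ in range(n_teachers)]

    def class_outcome(q, treated):
        # Outcome depends on the unmeasured trait, the treatment, and noise.
        return q + true_effect * treated + rng.gauss(0, 1)

    # Quasi-experiment (deliberately extreme): above-average teachers opt in.
    self_selected = [q > 0 for q in quality]
    # Randomized experiment: a coin flip decides, regardless of the trait.
    randomized = [rng.random() < 0.5 for _ in quality]

    def gap(assignment):
        treated = [class_outcome(q, 1) for q, t in zip(quality, assignment) if t]
        control = [class_outcome(q, 0) for q, t in zip(quality, assignment) if not t]
        return statistics.mean(treated) - statistics.mean(control)

    return gap(self_selected), gap(randomized)


quasi_gap, random_gap = simulate_selection()
print(f"self-selected gap: {quasi_gap:+.2f}")
print(f"randomized gap:    {random_gap:+.2f}")
```

Under these assumptions the self-selected design shows a large positive gap even though the treatment does nothing, while the randomized gap stays close to zero. Real self-selection is far less extreme, but the direction of the bias is the same.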

While methodologists have long favored randomized over matched experiments, actual experience in education has been mixed, in the sense that in some systematic reviews of treatment studies, effect sizes were pretty much the same in randomized and matched studies. If matched studies introduce bias in favor of the treatment group, shouldn't this result in inflated effects?

Along with my colleague Alan Cheung, I had an opportunity to test this question on a large scale. In a study of the effects of methodology on effect sizes, we looked at 645 studies that had met the stringent inclusion standards of the Best Evidence Encyclopedia (BEE). Of these, 196 used random assignment and 449 were quasi-experiments.

The result was clear. Matched quasi-experiments did produce inflated effect sizes (ES=+0.23 for quasi-experiments, +0.16 for randomized). This difference is not nearly as large as those associated with other factors we looked at, such as sample size (small studies greatly exaggerate outcomes), use of experimenter-made measures, and published vs. unpublished sources (experimenter-made tests and published sources exaggerate impacts). But our findings about matched vs. randomized studies are a reason for caution about putting too much faith in quasi-experiments.

One kind of matched study is of particular concern. This is the post hoc, or retrospective, study. In such studies, experimenters might start with a group that already received a given treatment and has already taken posttests, and then go find a control group that started out at the same level on pretest scores and was similar in demographics. The obvious danger in such studies is that, because outcomes are already known, an unscrupulous investigator can easily choose matched controls already known to have made limited gains.

However, even honest researchers (and most are honest) can fool themselves with post hoc designs. The problem is that in any experiment, some number of students drop out or fail to complete the treatment. Researchers are likely to exclude such students from the experimental group. However, in forming a control group by picking students from a computer file, they might be including the very students who, had they been in the experimental program, would have dropped out. Here's an extreme example. Imagine that in the first months of a matched experiment testing a new high school technology approach, 10% of the students are arrested and sent to school at Juvenile Hall. These students are naturally dropped from the experimental group. However, similar students in the control group are still in the district, so those whose pretest scores match those of students in the experimental group would be kept in the sample. (In randomized experiments, "intent to treat" procedures are usually used, keeping all subjects in the experiment no matter what.) Our Best Evidence Encyclopedia (BEE) now excludes post hoc quasi-experiments, although it continues to accept matched studies in which the experimental and control groups were designated in advance.
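The dropout problem can also be sketched as a simulation. Everything here is hypothetical: the function name, the 10% dropout rate (borrowed from the extreme example above), the assumption that students prone to dropping out gain less, and the assumption that their posttest scores could still be collected for an intent-to-treat comparison. Once again the treatment is given no true effect.

```python
import random
import statistics


def simulate_attrition(n=20000, dropout_rate=0.10, dropout_penalty=1.0, seed=1):
    """Compare a post hoc comparison with intent to treat (hypothetical data)."""
    rng = random.Random(seed)

    def make_group():
        # Each student is (would_drop_out, gain). Students prone to dropping
        # out are assumed to gain less, with or without the treatment.
        students = []
        for _ in range(n):
            would_drop = rng.random() < dropout_rate
            gain = rng.gauss(0, 1) - (dropout_penalty if would_drop else 0.0)
            students.append((would_drop, gain))
        return students

    def mean_gain(group):
        return statistics.mean(gain for _, gain in group)

    experimental = make_group()
    control = make_group()

    # Post hoc design: dropouts vanish from the experimental group, but
    # their counterparts are still sitting in the control file.
    completers = [s for s in experimental if not s[0]]
    post_hoc_gap = mean_gain(completers) - mean_gain(control)

    # Intent to treat: everyone originally assigned stays in the analysis.
    itt_gap = mean_gain(experimental) - mean_gain(control)
    return post_hoc_gap, itt_gap


post_hoc_gap, itt_gap = simulate_attrition()
print(f"post hoc gap:        {post_hoc_gap:+.3f}")
print(f"intent-to-treat gap: {itt_gap:+.3f}")
```

Under these assumptions the post hoc comparison shows an apparent benefit of roughly the dropout rate times the penalty, created entirely by who was left out of the experimental group, while the intent-to-treat gap stays near zero.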

It is not yet possible in education to require that every study be randomized, and the dangers of quasi-experiments are much smaller when controls are designated in advance. However, if things continue as they have been in recent years, there will soon come a day when we no longer have to play with matches, except in situations in which randomization is impossible.