We recently watched election returns that seemed to defy the mathematical experts. For months in advance, those experts had posted odds for the outcome that varied day by day, often by tiny amounts: one day 88% Clinton and 12% Trump, the next day 87% and 13%. There was no explanation of the mathematical models, the data behind the models, or the meaning of the numbers. On election night we watched gauges that displayed probabilities, bouncing left and right as data arrived, providing minute-by-minute predictions of the outcome. Broadcasters promoted this information as superior to that of mere pundits because it was based on data and therefore objective. Mathematics gave it authority.
That authority was misplaced. Mathematical models depend on mathematics (calculus, algebra, statistics), but they also depend on much more. They rely on underlying assumptions (these are the key factors, this part depends on these data, we can ignore these things), and their quality depends on the quality of the data that go into them. Mathematical models attempt to tell us something about the real world, but they can fail. To test a model, we compare its predictions with reality to see whether they match. When they do, we are cautiously confident; when they don’t, we start over. Failure doesn’t mean the mathematics is wrong, but rather that the assumptions were faulty or the data were unreliable.
Both problems contributed to the failed election predictions. Many assumptions, including who would vote and how voters were making decisions, were incorrect. Much of the polling data was inaccurate, for many reasons. Experts are still trying to understand the failure.
But models are used in many areas beyond election predictions, and we can learn from this failure. In particular, mathematical models have become ubiquitous in education. The best-known examples are Value Added Models (VAM).
These models claim to provide a simple solution to a complicated problem. If we measure the quality of teachers using test scores (not a wise practice, but alas a common one), straightforward comparisons are clearly unfair. Scores depend not only on the teacher but on student characteristics as well: previous scores, socioeconomic status, language proficiency, attendance, and so forth. Teachers who are lucky enough to have students who excel for other reasons are likely to shine; those with challenging students less so.
Value Added Models offer an alternative to looking at raw test scores. The model predicts scores of students based on factors other than the teacher. It makes assumptions about what factors are important, and uses data (along with some mathematics!) to predict the scores on this year's test. If scores are higher than predicted, the excess is counted as "value added" by the teacher; if they are lower, the deficit belongs to the teacher.
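To make the arithmetic concrete, here is a minimal sketch of the idea just described, not any district's actual model. It assumes a single predictor (last year's score), an ordinary least-squares line, and invented toy records; the names `records`, `fit_line`, and `value_added` are hypothetical. Real value-added models use many more factors and more elaborate statistics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (teacher, last year's score, this year's score).
records = [
    ("A", 60, 66),
    ("A", 70, 75),
    ("A", 80, 86),
    ("B", 60, 61),
    ("B", 70, 72),
    ("B", 80, 80),
]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope


def value_added(records):
    """Average (actual - predicted) score for each teacher's students."""
    priors = [prior for _, prior, _ in records]
    currents = [current for _, _, current in records]
    # Assumption: the prediction ignores the teacher and uses only the prior score.
    intercept, slope = fit_line(priors, currents)

    residuals = defaultdict(list)
    for teacher, prior, current in records:
        predicted = intercept + slope * prior
        residuals[teacher].append(current - predicted)
    return {teacher: mean(values) for teacher, values in residuals.items()}


scores = value_added(records)
for teacher, score in sorted(scores.items()):
    print(f"Teacher {teacher}: value added = {score:+.2f}")
```

With these invented numbers, teacher A comes out positive and teacher B negative. Everything the single predictor fails to capture lands in that residual, whether or not the teacher had anything to do with it.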
But like this year’s election predictions, VAM can go wrong. First, the assumptions. We may think that we have captured all factors influencing test scores, but many are missing: for example, the effort students put into the tests or the influence of test-prep sessions. The assumptions also include a simplified view of the modern classroom. Today’s students often come in contact with multiple teachers for the same or related subjects in a given year. Which teacher owns the value? What about accelerated classes in which current standardized tests cover material taught in previous years? The assumptions of VAM, like those of this year’s election models, are frequently incomplete or inaccurate.
The data are even more problematic. For some students, past test scores are missing. Students may have moved from another school where they took different tests. Students may have been sick or on vacation when previous tests were given, or scores may have been lost. Models frequently fill in the missing data with “imputed” scores for those students, requiring more assumptions still. Other data are unreliable too. There is no easy way to measure socioeconomic status; eligibility for free or reduced-price lunch is only a rough proxy. Few data exist to gauge parents’ educational attainment. We don’t know how many books are in a student’s home. We don’t know what language is spoken in the family. Not all disabled students are alike, there are many varieties of English language learners, and students may come and go throughout the year. When it comes to VAM, all data are what statisticians call “noisy.”
Of course, the test of a mathematical model is a comparison to reality. In this year’s election, almost all models failed that test miserably. The results for VAM are, at best, uncertain. Many studies show high variability in VAM scores, with teachers rated high value-added one year rated low the next, and vice versa. Even in the same year, a teacher may be high value-added in one class and low in another. This makes no sense if value-added models are doing what they claim to do: isolating the “teacher effect.” Stories circulate of star teachers, admired and praised by parents and principals, who rate low on VAM. Conversely, some VAM stars are not much admired. But the central problem in comparing VAM to reality is that we have no simple, reliable way to measure teacher quality. As a consequence, some policy-makers and researchers have done something clever: They implicitly define teacher quality as a high VAM score, which makes the comparison to reality unnecessary.
Of course, this is like deciding our election using the predictions of mathematical models, not the vote itself. We could argue this is better because those models are based on sophisticated mathematics and lots of data. They are objective. They have mathematical authority behind them. But we would never accept mathematical authority in politics; we would never decide elections based on mathematical models that predict the outcome.
Then why are we willing to do this in education?