Education policies that affect millions of students have long been tied to test scores, but a new paper suggests those scores are regularly misinterpreted.
According to the new research out of Mathematica, a statistical research group, the comparisons sometimes used to judge school performance are more indicative of demographic change than actual learning.
For example: Last week's release of National Assessment of Educational Progress scores led to much finger-pointing about what's working and what isn't in education reform. But according to Mathematica, policy assessments based on raw test data are extremely misleading -- especially because year-to-year comparisons measure different groups of students.
"Every time the NAEP results come out, you see a whole slew of headlines that make you slap your forehead," said Steven Glazerman, an author of the paper and a senior fellow at Mathematica. "You draw all the wrong conclusions over whether some school or district was effective or ineffective based on comparisons that can't be indicators of those changes."
"We had a lot of big changes in DC in 2007," Glazerman continued. "People are trying to render judgments of Michelle Rhee based on the NAEP. That's comparing people who are in the eighth grade in 2010 vs. kids who were in the eighth grade a few years ago. The argument is that this tells you nothing about whether the DC Public Schools were more or less effective. It tells you about the demographic."
Those faulty comparisons, Glazerman said, were obvious to him back in 2001, when he originally wrote the paper. But Glazerman shelved it then because he thought the upcoming implementation of the federal No Child Left Behind act would make it obsolete.
That expectation turned out to be wrong. NCLB, the country's sweeping education law, which has been up for reauthorization since 2007, mandated regular standardized testing in reading and math and punished schools based on those scores. As Glazerman and his coauthor Liz Potamites wrote, severe but correctable errors in the measurement of student performance are often used to make critical education policy decisions associated with the law.
"It made me realize somebody still needs to make these arguments against successive cohort indicators," Glazerman said, referring to the measurement of growth derived from changes in score averages or proficiency rates in the same grade over time. "That's what brought this about." So he picked up the paper again.
NCLB requires states to report on school status through a method known as "Adequate Yearly Progress." It is widely acknowledged that AYP is so ill-defined that it has depicted an overly broad swath of schools as "failing," making it difficult for states to distinguish truly underperforming schools. Glazerman's paper argues NCLB's methods for targeting failing schools are prone to error.
"Don't compare this year's fifth graders with last year's," Glazerman said. "Don't use the NAEP to measure short-term impacts of policies or schools."
The errors primarily stem from comparing the percentage of students proficient in a given subject from one year to the next. Because that approach measures different groups of students each year, it creates false impressions of growth or decline.
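A minimal sketch of the mechanism, using made-up numbers (the group labels, shares, and proficiency rates below are hypothetical, not drawn from the paper): if two demographic groups score differently on average, a shift in a grade's composition alone can move the overall proficiency rate even when instruction for every group is unchanged.

```python
# Toy illustration (hypothetical numbers): why successive-cohort
# comparisons can signal "decline" with no change in school effectiveness.

# Two student groups with different average proficiency rates,
# identical in both years (the school serves each group equally well).
prof_rate = {"group_a": 0.80, "group_b": 0.50}

def cohort_proficiency(shares):
    """Overall percent proficient for a cohort with the given group shares."""
    return sum(share * prof_rate[group] for group, share in shares.items())

# Year 1 fourth-grade cohort: 70% group A, 30% group B.
year1 = cohort_proficiency({"group_a": 0.7, "group_b": 0.3})
# Year 2 fourth-grade cohort: 40% group A, 60% group B -- a purely
# demographic shift; within each group, nothing about learning changed.
year2 = cohort_proficiency({"group_a": 0.4, "group_b": 0.6})

print(f"Year 1: {year1:.0%}, Year 2: {year2:.0%}")  # Year 1: 71%, Year 2: 62%
# Proficiency "falls" nine points, reflecting composition, not instruction.
```

The same arithmetic can run in reverse, manufacturing apparent gains; either way, the comparison says nothing about whether the school got better or worse at teaching.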
And using testing data in different -- more accurate -- ways would likely result in states pouring their resources into different groups of schools. "Differences in scores between two cohorts – say, fourth graders one year and fourth graders the next year – are comparisons of two different groups of students," Matthew Di Carlo, senior fellow at the Albert Shanker Institute, wrote in an email. "They do not even necessarily reflect real student progress, to say nothing of whether the changes can be attributed to schooling factors."
The counting flaws highlighted by Glazerman's paper are particularly significant as states revamp the way they hold schools accountable for their performance. Though attempts to rewrite No Child Left Behind fizzled out in Congress this fall, states are rewriting the way they target schools for interventions through waivers that get them out of NCLB-style reporting. The federal Education Department has already received waiver requests from 11 states, and one of the conditions for getting a waiver is developing a new accountability plan.
"It's gone under the radar with the stalled reauthorization process," said Doug Harris, a University of Wisconsin professor who wrote a recent book on education performance metrics. "You get really different answers depending on what you do with these numbers. You can talk all you want about what you do with failing schools but if you haven’t identified schools that are failing, it's a waste of time."
Glazerman's paper provides equations to help correct these errors. Meanwhile, researchers hope school districts wise up when using test scores to drive policies, such as teacher evaluations.
"Using these data for resource allocation, staffing and other high-stakes decisions means that accuracy and fairness must be the primary considerations," Di Carlo wrote. "Most assessments aren’t designed to measure school and teacher effects in the first place; if they are to play a productive role in that capacity, it will have to be done in the most rigorous feasible manner: using longitudinal data, adjusting for non-schooling factors and interpreting the estimates in a responsible way."