Now that the votes are finally being cast and counted, it’s time to turn to that hallowed American tradition: rating how well pollsters and meta-pollsters did. I cannot claim to have an opinion on how most of these performed. However, I did take a look into one of the best-known meta-pollsters: Nate Silver’s FiveThirtyEight.com. Unfortunately for 538, it seems we can call the race early: things did not go well.
Silver’s model, throughout the campaign, showed a dramatically higher likelihood that Donald J. Trump would be our next president ― higher than any other of the major forecasting models at work. That might be OK (and I’ll get to why shortly), but his model also sometimes exhibited strange behavior ― like moving overall results the opposite way of new data fed into the model. Silver has defended his projection as “commonsense” and dismissed models that estimated probabilities above 90% that Hillary Clinton would win.
That strange behavior is what initially led me to take a closer look at Silver’s methodology. The model is not transparent, to be sure; however, the elements that are publicly visible are concerning. What is public suggests that the level of uncertainty he found in the race was not the result of hidden or undecided voters, Latinos who were difficult to count, Gary Johnson, or any other real-world element of the campaign. Instead, it was Silver’s own modeling process that introduced the uncertainty, due to a series of basic modeling and statistical mistakes.
Having made those mistakes or (to be charitable) less-than-ideal modeling decisions, he is then forced to add on hack after hack. Each one of those hacks brings with it new and unnecessary uncertainty. Stack enough of them on top of each other, and almost any model would say that a race is close to a toss-up, even as other forecasters find that it’s 98% likely to go one way.
On Saturday, The Huffington Post’s Ryan Grim uncorked a torrent of criticism from Silver when Grim publicly challenged some of the adjustments Silver is making, questioning whether 538 was producing a model or merely punditry. Silver then fired back with some less-than-charitable reprobation on that bulwark of intellectual, political and statistical discussion: Twitter.
While I never had Nate in one of the statistics classes I taught or TAd at the University of Chicago, here is what I would have said to him: Quite simply, his modeling approach is overly complicated and baroque. It has so many moving parts that it is like an animal with no bones. That is why it then has so many places where he has to impose his (hopefully unbiased) views. The problem with this is that he could push the results around quite a bit if he wanted to. That doesn’t mean he is purposely rigging the model, and I don’t suspect he is.
However, it does mean that he isn’t letting the data do the talking and he is, instead, imposing his views. (Perhaps we could say he is mansplaining the implications to the data?) Where do these views come from? Maybe past data or his “gut feeling” or unconscious bias. In the end, we don’t know and so we are left with a heap of uncertainty and an opaque series of assumptions. Ryan Grim has said that this amounts to, effectively, punditry. I’m not sure I would go that far, but it isn’t a win for reproducible research or defensible and unbiased analysis.
Here’s how his website explains his “trend line adjustment,” which Grim singled out for criticism. “The question is how much smoothing to use in the trend line. Less smoothing = a more aggressive forecast. Empirically, using more smoothing early in the race and less smoothing late in the race works best. In other words, the trend line starts out being quite conservative and becomes more aggressive as Election Day approaches,” Silver writes.
That gets at one of the biggest methodological problems he has: the non-transparent, made-up (from who knows where?) weighting schemes. If he argued that these weightings were variance-stabilizing, he could at least avoid charges of bias... but they don’t seem to be. Instead they are what “works best” according to him. We should just trust him ― which means his correctness is not falsifiable. When somebody creates a system that cannot be proven wrong or criticized, it becomes hard to believe it is right.
Now Nate Silver is not the only person who has done controversial data analysis. Sadly, finance and economics (the fields I now work in) have recent scandals involving poor data analysis leading to indefensible financial policies and questionable research waved into top journals by unquestioning editors who are friends of the authors. I might be cynical, but I think the effect of a poor academic article is limited: most academics can discern the true quality of shoddy work no matter where it is published.
However, I think right now our nation needs more transparency. And as someone raised in Minnesota (yes, bring on the southern Canada jokes), I think we also need more calm, in-depth discussion. So, I want to detail what seems to be wrong and some ways in which this could all be done better.
We’re going to get technical, but the conclusion is pretty simple: 538’s presidential model is built in a sub-optimal way. It’s at once far too complex, while also making basic errors that have thrown its projections out of whack. These seem to have unrealistically over-estimated the probability of a Trump presidency. If you are a Trump supporter, this may have lulled you into a false sense of security which may have cost some battleground states. If you are not a Trump supporter, this may have scared the bejesus out of you and led you to donate more or volunteer more to secure seemingly-uncertain battleground states.
So, here goes. There are three central, potentially fatal flaws with the model.
- He’s not using the best model setup.
- He’s not handling correlations and time in his data properly.
- He’s not transparent with where various “fudge factors” come from.
Let’s run through each of them one by one. As we go, I’ll mention my thoughts on how this could be done better.
First: Model setup.
538 should be modeling each state’s race with a generalized linear model: either a multinomial model to estimate the probabilities of Clinton, Trump, Johnson, McMullin, and Stein each winning that state or a logistic-link binomial model for Trump vs Clinton. Those models were created for these sorts of scenarios. It’s a little bit of work to use these: you have to input the number of respondents in favor of each candidate instead of just sticking in the reported percentages. However, that would have the added advantage of not trusting any given poll’s claims of uncertainty.
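As a rough illustration of what feeding in respondent counts rather than reported percentages looks like, here is a minimal Python sketch of the simplest possible case: an intercept-only binomial model with a logit link for a two-way race. The poll numbers are made up, the function name is hypothetical, and nothing here is drawn from 538's code. For this stripped-down case the maximum-likelihood fit has a closed form, so no fitting library is needed.

```python
import math
from statistics import NormalDist

# Made-up two-way polls: (respondents choosing either candidate, Clinton share).
polls = [(800, 0.52), (1200, 0.51), (600, 0.49)]

def pooled_binomial_logit(polls):
    """Intercept-only binomial GLM with a logit link, fit on counts.

    In this simplest case the MLE is the logit of the pooled share, and the
    standard error on the logit scale is sqrt(1 / (N * p * (1 - p))).
    """
    clinton = sum(round(n * share) for n, share in polls)  # percentages -> counts
    total = sum(n for n, _ in polls)
    p_hat = clinton / total
    logit = math.log(p_hat / (1 - p_hat))
    se = math.sqrt(1 / (total * p_hat * (1 - p_hat)))
    return p_hat, logit, se

p_hat, logit, se = pooled_binomial_logit(polls)
# Normal approximation on the logit scale: chance the true share exceeds 50%.
prob_lead = NormalDist().cdf(logit / se)
print(f"pooled share={p_hat:.4f}  logit={logit:.4f}  se={se:.4f}  P(lead)={prob_lead:.3f}")
```

Note that the uncertainty comes from the counts themselves, not from any margin of error a poll reports. The sketch also treats every respondent as independent, which ignores any correlation between polls.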
While Nate Silver doesn’t spell it out on his site, he appears to be using either a linear regression or a logistic regression. Since the logistic regression is a better choice, I’ll assume he is using that. Some people might confuse logistic regression and a binomial GLM with a logistic link, but they aren’t the same. The difference is in how they handle the uncertainty of unusual events (i.e. likely landslides). This is because a binomial random variable with probability of success p has a variance of p*(1-p). In other words: a race that is nearly tied is much more sensitive to all the inputs than a race that is likely to be a landslide. For example, Reagan would have had to screw up hugely to have lost to Mondale ― while even a small screw-up for W might have handed the win to Gore.
A binomial GLM with a logistic link is built to handle that sort of variation in sensitivity. Logistic regression is not. Because logistic regression doesn’t handle that variation in sensitivity, it tends to be biased for events which are estimated to be rare. Since most polls and meta-pollsters are estimating a Trump win as very unlikely, this suggests that Silver’s model form is likely biasing his results.
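To make the p*(1-p) point concrete, here is a small Python sketch showing how both the binomial variance and the effect of a fixed nudge on the log-odds scale shrink as a race moves away from a tie. The function names and the nudge size of 0.2 are illustrative assumptions, not anything taken from 538.

```python
import math

def binomial_variance(p):
    """Variance of a single yes/no outcome with success probability p."""
    return p * (1 - p)

def shift_from_nudge(p, nudge=0.2):
    """Change in probability when the log-odds move by `nudge` (arbitrary size)."""
    log_odds = math.log(p / (1 - p))
    nudged = 1 / (1 + math.exp(-(log_odds + nudge)))
    return nudged - p

# A tied race (p = 0.5) through to a near-certain landslide (p = 0.98).
for p in (0.5, 0.7, 0.9, 0.98):
    print(f"p={p:.2f}  variance={binomial_variance(p):.4f}  "
          f"shift={shift_from_nudge(p):.4f}")
```

The same-sized push moves a tied race far more than it moves a lopsided one, which is the Reagan-versus-W contrast in numerical form.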
Second: Correlated data.
When we build models, we assume the data are all independent. If the data are correlated, we cannot do our analysis as though we had independent data: the results will look artificially certain. So we need to account for certain groupings of data sharing common effects. For example, polls by the same polling organization share methodology and, perhaps, biases of the polling agents. We could account for these correlations by having the polling organization in the model and estimating, when we fit the data, an offset for each polling organization. That would try to estimate the bias of a polling organization, but it wouldn’t account for the increased uncertainty due to their bias.
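The claim that correlated data look artificially certain can be sketched with the standard formula for the variance of an average of equally correlated measurements. This is a minimal Python illustration under the simplifying assumption that every pair of polls shares the same correlation `rho`; the function names and the example numbers are hypothetical.

```python
def variance_of_poll_average(n_polls, poll_variance, rho):
    """Variance of the mean of n polls that all share pairwise correlation rho."""
    return poll_variance / n_polls * (1 + (n_polls - 1) * rho)

def effective_polls(n_polls, rho):
    """How many truly independent polls the correlated set is worth."""
    return n_polls / (1 + (n_polls - 1) * rho)

# Ten polls, each with variance 1.0 (arbitrary units), at several correlations.
for rho in (0.0, 0.2, 0.5):
    print(f"rho={rho:.1f}  var(mean)={variance_of_poll_average(10, 1.0, rho):.3f}  "
          f"effective polls={effective_polls(10, rho):.2f}")
```

Under these assumptions, even a modest shared correlation means ten polls carry the information of only a few independent ones, so an analysis that treats them as independent understates its uncertainty.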
The best choice is to try to account for the correlations with a correlation model. That is often implemented with random effects that are shared by, say, all polls done by the same organization. Think of it like this: We might think that polls by the same polling organization tend to lean in a certain direction. We might think that polls of the same state lean in the same direction ― because both are sampling the same polity. We might also think that polls conducted in the same week lean in the same direction because they share the same “spirit of the moment.” Those “leanings” are tendencies which imply correlations, but we don’t know what the leanings are. It is especially difficult since each poll is subject to all of these effects: teasing out one effect from the other is difficult ― and certainly not the place to introduce even more complication.
Handling these correlations properly has two benefits, however. First, it increases the uncertainty of our results to be more honest about the variation in the data. Second, it will tend to push estimates slightly toward the center (aka “squeezing”). These are both defensible when they come from a correlation model. The basic idea is that shared factors imply more uncertainty in the data ― and that those squeeze the model output toward the middle. Thus a Trump victory is more likely than the data would seem to indicate. However, the effect of squeezing is rarely huge. (I would be surprised if it is as big as we seem to see from 538.)
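The "squeezing" effect can be sketched with the textbook partial-pooling formula for a simple random-effects model. This is a minimal Python illustration, assuming the within-pollster and between-pollster variances are known; in a real fit they would be estimated from the data. All names and numbers are hypothetical.

```python
def shrunken_lean(raw_lean, n_polls, within_var, between_var, overall_lean=0.0):
    """Partial-pooling estimate of one pollster's lean.

    The weight approaches 1 as a pollster accumulates polls and approaches 0
    when its data are few or noisy, pulling the estimate toward the overall lean.
    """
    weight = between_var / (between_var + within_var / n_polls)
    return overall_lean + weight * (raw_lean - overall_lean)

# Two hypothetical pollsters with the same raw 3-point lean (percentage points).
few = shrunken_lean(raw_lean=3.0, n_polls=2, within_var=9.0, between_var=1.0)
many = shrunken_lean(raw_lean=3.0, n_polls=20, within_var=9.0, between_var=1.0)
print(f"2 polls: {few:.2f}   20 polls: {many:.2f}")
```

The squeeze toward the middle is driven by the variances in the data rather than by a hand-chosen weight, which is what makes it defensible.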
How does Nate Silver account for correlations in the polls? He assumes some bias for a given state and for a given polling organization. He fits wiggly (loess) regression curves for time trends with an imposed sensitivity factor. But he does all of these separately so it isn’t clear how he has separated all of these overlapping effects. Then (from what I can tell) he uses these to create almost “pseudo-polls” which he then puts in the model. If he adds in economic or other variables, he seems to do another level of this. There’s just no way to describe this approach as anything other than wrong: it doesn’t address the added uncertainty, it imposes strange forms on how the data are prepared for the model, and it sure doesn’t handle correlations properly.
Third: The fudge factors.
Perhaps all of the various weightings Silver does make his model more reliable, but we don’t know that. Unless a weighting was designed to make all the data equally influential or to reduce the effects of extreme observations, it is probably introducing new uncertainty. Add up all these various steps and weightings and you’ve introduced a lot more uncertainty. We could try to characterize these statistically, but the imposition of weighting schemes that are his own guesstimating... well, the effect of those could be anything. There’s no point building a statistical model of somebody possibly imposing their own opinion: it is no longer random.
And then he has a model which seems to be far too certain of itself. To fix that, he simulates elections from fat-tailed (Student’s t) distributions. Why those distributions? Why their particular degree of fat-tailedness? Furthermore, fat-tailed distributions do not just have a higher likelihood of extreme surprises; they also have an exaggerated concentration of likelihood in the middle. That might be fine if we are modeling markets for electricity or wheat, but I’d hope to see some data to convince me that it mirrors the distribution of US voters. So here too we have a bolted-on effect which can swing the results.
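The two-sided nature of fat tails can be checked with a small simulation. This is a minimal Python sketch comparing a Student's t distribution against a normal distribution with the same variance. The degrees of freedom (4), the cutoffs, and the seed are arbitrary assumptions for illustration; the article does not know what values 538 uses.

```python
import math
import random

def compare_tails(df=4, n_draws=100_000, seed=538):
    """Compare a unit-variance Student's t (df > 2) with a standard normal."""
    rng = random.Random(seed)
    scale = math.sqrt((df - 2) / df)  # rescale t so its variance equals 1
    counts = {"tail_t": 0, "mid_t": 0, "tail_normal": 0, "mid_normal": 0}
    for _ in range(n_draws):
        z = rng.gauss(0, 1)
        # Student's t draw: standard normal over sqrt(chi-square / df).
        chi2 = rng.gammavariate(df / 2, 2)
        t = scale * rng.gauss(0, 1) / math.sqrt(chi2 / df)
        counts["tail_t"] += abs(t) > 3         # extreme surprises
        counts["mid_t"] += abs(t) < 0.5        # mass near the center
        counts["tail_normal"] += abs(z) > 3
        counts["mid_normal"] += abs(z) < 0.5
    return {name: c / n_draws for name, c in counts.items()}

result = compare_tails()
for name, share in result.items():
    print(f"{name}: {share:.4f}")
```

With the variances matched, the t distribution puts more probability both far out in the tails and close to the center, at the expense of the moderate outcomes in between. How strongly depends on the degrees of freedom, which is exactly the kind of choice that needs a stated justification.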
Are all of these effects why the 538 model seemed to jump around inexplicably in response to seemingly small changes in data? I don’t know. That’s the part which, ultimately, is hard to disentangle: what in the model is due to data and data-driven methods and what is being imposed by a (hopefully) benevolent data dictator? We don’t know, but seeing the model jump around and ascribe unusually high probabilities to unusual events is concerning. The net effect of all of this is not confidence-inspiring.
So why would I, a statistician who teaches about financial markets and builds trading models, write this up? I wrote this because I’m glad to see the rise of these meta-pollsters, and I’m glad to see them making statistics sexy, because statistics is sexy. However, I’m disappointed when our national conversation devolves into yelling instead of discussing the issues calmly, and I’m disappointed when meta-pollsters like Nate Silver start to also devolve into insults and yelling. We should all admit that we are not experts but merely servants of this data which we are trying to understand as honestly as possible.
Finally, I wrote this because honest data-driven and policy-driven discussions are the hallmarks of high-quality institutions and how we avoid extremism. That is strongly associated with higher economic growth. If we want jobs, we don’t need to repudiate free trade or default on our debt or regulate everything in sight or cut immigration. If we want jobs, we need to calm down, stop yelling, go back to listening to science and economic research, and start getting our hands dirty with details. That’s when we will find the nuance and uncomfortable compromises that move the nation forward. And it starts with saying we could be smarter about all of this.
NOTE: Nate Silver did not respond to repeated requests for comment.