Why 'Small Sample Size' is Wrong

Philadelphia Eagles quarterback Sam Bradford drops back to pass during the first half of an NFL football game against the Green Bay Packers Saturday, Aug. 29, 2015, in Green Bay, Wis. (AP Photo/Matt Ludtke)

Don't run from this title: it will help you win some arguments.

I spend a lot of time in Philadelphia, which, along with Boston and New York, is tops in terms of sports mania and knowledge. Philly fans may throw snowballs at Santa Claus, and they may boo their draft picks, but their passion is overwhelmingly backed by almost scholarly levels of scrutiny of their teams.

So let's take a look at what they are being told.

Philly boasts two of the highest-quality sports radio stations in America: venerable WIP and upstart WPEN. Between them they bring to the microphone an astonishing number of nationally recognizable sports commentators, led by Sal Paolantonio, Brian Baldinger, my favorites Mike Missanelli and Cuz, and an army of former players.

Lately, however, I've been hearing one phrase abused repeatedly, and it's worth a moment's reflection to consider its true meaning.

That phrase is "small sample size."

I hear this phrase in every broadcast, every hour, before every hard stop for a commercial break.

Generally, it's invoked to caution listeners who forecast future outcomes from a really small set of data. This could be a rookie's first appearances, the initial performance of an athlete returning from an injury, or a high-profile acquisition's performance in a small number of games.

These beloved sports commentators tell listeners, again and again, that small sample sizes lead to hasty, foolish judgments. Are they right? No.

Let's talk about Sam Bradford. Bradford was acquired by the Eagles in a headline-making trade with the St. Louis Rams. Bradford is a once-precocious quarterback who has spent most of his career recovering from surgery.

This preseason, Bradford has appeared in two games. In the first, he looked OK, completing 3 of 5 passes (a roughly average rate), and he also shook off (with a snarl) a possibly dirty shot at his newly rehabilitated knees (that alone endeared him to Philly).

In the second game (the team's third of the preseason), he was a future Hall of Famer, completing all 10 of his passes, including three for touchdowns, and looked to be the next Tom Brady or Peyton Manning.

Then warnings about the "small sample size" rained down on the ecstatic Eagles fan base. Incorrectly.

In statistics, there are two key concepts: validity and reliability. Does your data measure what you think it measures (validity), and would repeated samples from the same process yield the same result (reliability)? Here, what we really care about is validity: are 15 preseason throws enough to forecast Sam Bradford's future?

Philly sports broadcasters have focused on the idea that 15 is a small number. However, statisticians would focus on what those 15 observations measure.

Here's the difference. Suppose Bradford's 15 throws, 13 of them complete, occurred in an environment identical to the regular season. A statistician would then say that, because the sample is small, the range of plausible values for his true completion rate is wide: rather than a nice bell curve with a big hump in the middle, it looks like a bell curve stretched to the edges of the page and squashed in the middle. That flatter, wider curve means less confidence that Bradford's throws to date show what he can do in the regular season.
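To put rough numbers on that squashed bell curve, here is a small Python sketch using a standard normal-approximation confidence interval. The 13-of-15 line comes from the games above; the 500-attempt "full season" (and its matching 433 completions) is an assumption chosen purely for comparison, not anything from the broadcasts.

```python
import math

def normal_ci(successes, n, z=1.96):
    """95% normal-approximation confidence interval for a completion rate."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error shrinks as n grows
    return max(0.0, p - z * se), min(1.0, p + z * se)

# Bradford's preseason line: 13 completions on 15 attempts
lo15, hi15 = normal_ci(13, 15)

# The same ~86.6% rate over an assumed full season of 500 attempts
lo500, hi500 = normal_ci(433, 500)

print(f"n=15:  {lo15:.2f} to {hi15:.2f}")   # wide: the squashed, stretched bell
print(f"n=500: {lo500:.2f} to {hi500:.2f}") # narrow: the tall bell with the hump
```

The small sample's interval spans most of the plausible range, which is exactly the "less confidence" the flattened bell curve represents.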

But, statistically, a small sample reduces our confidence in an estimate without biasing it. Even though Bradford has only thrown 15 times, our best guess for his regular-season performance is still roughly what we observed in the preseason. That is, if his results in the preseason are valid.

A more serious problem arises if the observed data come from a world (the preseason) that looks nothing like the world you are trying to predict (the regular season). If preseason players don't play as hard, if game plans are simpler, if better players are less likely to play, if the deck is effectively stacked in Bradford's favor (statisticians call this selection bias), then the data are not valid: they simply don't measure what you want them to measure, which is regular-season effectiveness.

So, "small sample size" isn't the problem, validity is. If the data aren't valid, a larger sample size won't solve the problem. Only better (valid) data will.
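A quick simulation makes the point concrete. The completion rates and the "preseason boost" below are invented for illustration; the only claim is structural: if the preseason systematically inflates performance, collecting more preseason throws converges on the inflated number, never the true one.

```python
import random

random.seed(1)

TRUE_RATE = 0.62          # hypothetical regular-season completion rate
PRESEASON_BOOST = 0.20    # assumed inflation from vanilla defenses, backups, etc.

def observed_rate(n, boost):
    """Simulate n throws, each completing with probability TRUE_RATE + boost."""
    p = TRUE_RATE + boost
    return sum(random.random() < p for _ in range(n)) / n

# More invalid data doesn't converge to the truth; it converges to the bias.
for n in (15, 150, 15000):
    print(n, round(observed_rate(n, PRESEASON_BOOST), 3))
```

Even at 15,000 simulated throws, the observed rate hovers near the inflated 82%, nowhere near the true 62%. That is why only better (valid) data, not more of it, fixes the problem.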

Philly has great sports fans, and they deserve more precise language. And excellent quarterback play.