Statistical Thinking: The Bedrock of Data Science

By Joel B. Greenhouse

Thanks to Google Chief Economist Hal Varian's 2009 prediction that "the sexy job in the next 10 years will be statisticians," it is now OK to self-identify as a statistician. This calls for some explanation. For many, their first experience with statistics was anything but pleasant, so telling someone you are a statistician has often been a conversation stopper -- or worse. Recently, a taxi driver told me, somewhat accusingly, that his wife had had to change her major because she was failing her required statistics class. I was sympathetic, knowing that all too often the first statistics class is divorced from real-world applications and emphasizes methods but not how to apply them, how to interpret their results or even why one would care. That's a far cry from sexy. What's changed?

Current excitement about statistics and data analysis is due in part to our ability to generate, manage and use massive amounts of data (Big Data) for scientific discovery and for making predictions about future events. The media have been generous with Big Data success stories. Three familiar examples are the following:
1. Google's ability to predict flu outbreaks in real time based on an analysis of Internet searches using keywords related to flu symptoms
2. The use of statistical models to analyze baseball strategies and predict player performance, as featured in the movie Moneyball
3. Nate Silver's synthesis and modeling of political surveys to successfully predict election outcomes, described in his New York Times column FiveThirtyEight

Although the problems of generating and managing massive amounts of data are relatively new, the methods for analyzing and making sense of such data, which some would now like to call data science, are quite old and belong to the domain of statistical science.

The scope of statistical science is broad and consists of four major activities: conducting collaborative research, designing studies, analyzing data and developing new statistical theory.

Statisticians help advance science through collaboration with subject-matter specialists. John Tukey, who in 1962 introduced the modern definition of data analysis, said, "The best thing about being a statistician is that you get to play in everyone's backyard." This opportunity is what drew me to statistics. Most of my collaborations, for example, are in the biomedical and public health sciences, but I also have worked on projects in public policy, manufacturing and historical demography.

Statisticians develop methods for the design and efficient implementation of studies -- such as surveys, randomized experiments and observational studies -- to generate data to help answer real-world questions. Once data are in hand, statisticians engage in the science (and art) of data analysis to extract usable information from the data. Data analysis includes informal methods for describing and visualizing relationships in the data, as well as formal methods such as building statistical models that can be used to make inferences or predictions about questions of interest and provide measures of how good the resulting inferences are. The Big Data examples mentioned earlier are good illustrations of using statistical models for prediction.

Statisticians also are involved in research that helps advance the theory and practice of statistics. For example, participation in collaborative research often reveals new methodological problems (i.e., problems for which existing methods are inadequate), which leads to new statistical research and improved techniques. This area of work is called mathematical statistics and is primarily concerned with developing and evaluating the performance of new statistical methods and algorithms. It is important to note that computing and solving computational problems are integral components of all four of the previously mentioned areas of statistical science.

Although there is a wide range of activities that engage statistical scientists, the one common element central to all of them is statistical thinking. Good statistical thinking requires a nontrivial understanding of the real-world problem and the population for whom the research question is relevant. It involves judgments such as those about the relevance and representativeness of the data, about whether the underlying model assumptions are valid for the data at hand and about causality and the role of confounding variables as possible alternative explanations for observed results. Finally, an essential component of good statistical thinking is the ability to interpret and communicate the results of a statistical analysis so nonstatisticians can understand the findings.

Varian's prediction that "the sexy job in the next 10 years will be statisticians" was based on his recognition of the important role of statistical thinking in Big Data problems. As he explained, "The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it and to communicate it -- that is going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level."

Big Data is here to stay, and the promise for using it to advance science and help solve pressing societal problems is exciting. Some have argued there is a need for a new profession called "data science" that will make sense of Big Data. "But making sense of data," as Gil Press writes in "A Very Short History of Data Science," "has a long history and has been discussed by scientists, statisticians, librarians, computer scientists and others for years."

Data science is statistics. But isn't statistics by any other name just as sexy? Not really, if the result is confusion and misinformation for potential students and employers. Employers may be having trouble finding "data scientists" to fill their job openings, but it is not for lack of qualified candidates: university departments of statistics and biostatistics are training them appropriately. It is true that the tactics for managing and analyzing Big Data are changing and improving, but the strategies for working with Big Data, as well as small data, are still based on a rock-solid foundation of good statistical thinking.

Greenhouse is a statistics professor and director of the master's of statistical practice program in the department of statistics at Carnegie Mellon University. He is an elected Fellow of the American Statistical Association and of the American Association for the Advancement of Science.