What are important topics in statistics that every data scientist must know? originally appeared on Quora: the knowledge sharing network where compelling questions are answered by people with unique insights.
If I were to pick one topic in statistics that every data scientist should know about, I'd say linear models. They unify many common statistical tests (t-tests, ANOVA, ANCOVA), and have many useful extensions (mixed models, generalized linear models, lasso and ridge regression). They are the modeling tool that I reliably start with.
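To make the "unify" claim concrete, here is a minimal sketch in plain Python showing that a two-sample t-test (the equal-variance version) gives the same t statistic as the slope of a linear model fitted to a 0/1 group indicator. The data values and the helper names (`pooled_t_statistic`, `slope_t_statistic`) are made up for illustration and are not from the original answer.

```python
import math

# Hypothetical measurements for two groups (made-up numbers for illustration).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [6.2, 6.8, 5.9, 7.1, 6.5, 6.4]


def mean(values):
    return sum(values) / len(values)


def pooled_t_statistic(a, b):
    """Classic two-sample t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((v - ma) ** 2 for v in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def slope_t_statistic(x, y):
    """t statistic for the slope in the least-squares fit y = b0 + b1 * x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    residual_var = rss / (n - 2)
    return b1 / math.sqrt(residual_var / sxx)


# Encode group membership as a 0/1 indicator and regress the outcome on it.
x = [0.0] * len(group_a) + [1.0] * len(group_b)
y = group_a + group_b

t_from_test = pooled_t_statistic(group_a, group_b)
t_from_model = slope_t_statistic(x, y)
print(t_from_test, t_from_model)  # the two statistics agree up to rounding
```

With a 0/1 indicator, the fitted slope is exactly the difference in group means, and its standard error reduces to the pooled standard error, which is why the two numbers match.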
I'm typically not in the business of building models, but I use linear models all the time when trying to create visualizations that dig into what's going on. If you see a strong pattern in a plot, it's a good idea to make that pattern explicit with a model. You can then look at the residuals to see the subtler trends that remain. That's particularly useful when the initial graphic is dominated by a known and uninteresting pattern. I explore this idea in depth in the "Model building" chapter of R for Data Science.
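The residual idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are synthetic (a strong linear trend plus a much smaller periodic wiggle), and `fit_line` is a hypothetical helper, not code from R for Data Science.

```python
import math

# Synthetic data (an assumption for illustration): a dominant linear trend
# plus a much smaller periodic wiggle that is hard to see in the raw values.
xs = [i / 2 for i in range(40)]
ys = [3.0 * x + 0.5 * math.sin(x) for x in xs]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Make the strong pattern explicit with a model...
intercept, slope = fit_line(xs, ys)

# ...then subtract it, leaving the subtler structure in the residuals.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

raw_range = max(ys) - min(ys)
residual_range = max(residuals) - min(residuals)
print(slope, raw_range, residual_range)
```

Plotting `residuals` against `xs` instead of `ys` against `xs` would show the wiggle directly, because the trend that dominated the original scale has been removed.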
Bear in mind George Box's maxim that "all models are wrong; some models are useful." I think an important part of a statistical mindset is understanding that uncovering Truth is extremely difficult and that, even when it is possible, the truth may be too complicated to be practically useful.