What are the advantages of logistic regression over decision trees? originally appeared on Quora: the place to gain and share knowledge, empowering people to learn from others and better understand the world.
What are the advantages of logistic regression over decision trees? First off, you need to be clear about what exactly you mean by "advantages." People have argued the relative benefits of trees vs. logistic regression in the context of interpretability, robustness, etc.
But let's assume for now that all you care about is out-of-sample predictive performance. Again, you may need to specify what kind of predictive performance you need: accuracy, ranking, or probability estimation. In short: all else equal, trees may have a leg up on accuracy, whereas logistic regression tends to be better at ranking and probability estimation.
Theoretical Answer: No algorithm is, in general, "better" than another. The famous "No Free Lunch" theorem basically states that any two optimization algorithms are equivalent when their performance is averaged across all possible problems. But of course in reality, you do not want to solve all possible problems but one particular practical one.
Practical Answer: Who cares? If you already have your data set up for one of them, simply run both with a holdout set and compare which one does better, using whatever appropriate measure of performance you care about. The important caveat, however, is that I would not set the data up the same way for both: how you represent your features can make a huge difference in which model performs better on the exact same task and dataset. I have spent some time on this in a Quora question about feature construction.
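The holdout comparison above can be sketched in a few lines with scikit-learn. This is a minimal illustration on a synthetic dataset; all sizes and hyperparameters here are arbitrary choices for the sketch, not recommendations.

```python
# Sketch: fit both models on the same training split and compare them on a
# holdout set with AUC (swap in whatever performance measure you care about).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your real dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for name, model in [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(min_samples_leaf=50, random_state=0)),
]:
    model.fit(X_train, y_train)
    # Score on held-out data only; compare ranking quality via AUC.
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, round(results[name], 3))
```

Note that, per the caveat above, a fairer comparison would also tune the feature representation separately for each model rather than feeding both the identical design matrix.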
(Somewhat) Scientific Answer: While there is little one can say in formal scientific terms about relative expected performance that is not either hopeless (see the No Free Lunch argument) or close to a tautology (linear models perform better on linear problems), we do have some general understanding of why things (sometimes) work better. Most of those (theoretical) reasons center on the bias-variance tradeoff. But there is also some empirical work comparing various algorithms across many datasets and drawing conclusions about which types of problems tend to favor trees vs. logistic regression. My own work on the topic can be summarized simply as:
- If the signal-to-noise ratio is low (it is a "hard" problem), logistic regression is likely to perform best. In technical terms: whenever the AUC of the best model was below 0.8, logistic regression very clearly outperformed tree induction.
- You may have a low signal-to-noise ratio for a number of reasons: the problem is inherently unpredictable (think stock market), or the dataset is too small to "find the signal." The latter is an interesting case: we observe that the performance order of the two algorithms can cross, meaning logistic regression performs better on a small version of the dataset but is eventually beaten by the tree once the dataset gets large enough.
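The dataset-size effect in the last bullet can be illustrated with a quick sketch: train both models at increasing sample sizes on a synthetic problem that has a linear signal plus an interaction term. All sizes and coefficients here are illustrative assumptions; whether and where the curves actually cross depends entirely on the data.

```python
# Sketch: out-of-sample AUC of both models as training size grows.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.normal(size=(n, 4))
    # Linear part (easy for logistic) + interaction (needs a tree) + noise.
    score = X[:, 0] + X[:, 1] + 3.0 * X[:, 2] * X[:, 3] + rng.normal(size=n)
    return X, (score > 0).astype(int)

X_test, y_test = make_data(5000)
curve = {}
for n in (100, 1000, 10000):
    X_tr, y_tr = make_data(n)
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    dt = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X_tr, y_tr)
    curve[n] = {
        "logistic": roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]),
        "tree": roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]),
    }
    print(n, curve[n])
```

With little data, the tree overfits the interaction it cannot yet estimate reliably, while logistic regression at least captures the linear component; with enough data, the tree can recover the interaction that logistic regression structurally misses.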
Some other honorable mentions:
- Trees generally have a harder time producing calibrated probabilities. This can be helped somewhat with bagging and the Laplace correction.
- Trees tend to have problems when the base rate is very low; in the worst case, the tree will not split at all. While this might maximize accuracy, it is obviously useless for ranking or probability estimation. You can try to fix this with downsampling, but then your probability estimates are off.
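The downsampling distortion in the last bullet has a standard fix: if you keep all positives and only a fraction beta of the negatives, the probability the model learns on the downsampled data can be mapped back to the original scale with a prior-correction formula. The function below is a sketch of that correction; knowing the sampling fraction beta is the assumption it rests on.

```python
# Sketch: map a probability estimated on downsampled data back to the
# original distribution. Downsampling negatives by a factor beta inflates
# the odds by 1/beta, so we deflate them back:
#   odds = beta * odds_s  =>  p = beta*p_s / (beta*p_s - p_s + 1)
def correct_downsampled_probability(p_s: float, beta: float) -> float:
    """p_s: probability predicted on downsampled data.
    beta: fraction of negatives kept during downsampling (0 < beta <= 1)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Example: the model says 0.5, but only 10% of negatives were kept,
# so the true probability on the original data is much lower.
print(correct_downsampled_probability(0.5, 0.1))  # ≈ 0.091
```

With beta = 1 (no downsampling) the correction is the identity, as it should be; the smaller beta is, the more the raw predictions overstate the true positive probability.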