"Editor's note" for an old question on Quora: What are the advantages of different classification algorithms? Xavier Amatriain, a Netflix engineering director, recently gave a new answer, and in turn recommended the logic regression, SVM, decision tree integration and deep learning based on the principles of the Ames Razor, and talked about his different understandings. He does not recommend deep learning as a universal approach, which also echoes the question we discussed earlier: Whether deep learning can replace other machine learning algorithms.
What are the advantages of different classification algorithms? For example, there are a lot of training data sets, tens of thousands of instances, more than 100,000 of the characteristics, we choose which classification algorithm best? Xavier Amatriain, director of engineering at Netflix, argues that the algorithm should be chosen based on the principles of the Ames Razor (Occam's Razor) and that logistic regression should be considered first.
Choosing a reasonable algorithm depends on many factors, including:
- How many training instances are there?
- What is the dimensionality of the feature space?
- Do you expect the problem to be linearly separable?
- Are the features independent?
- Do you expect the features to be roughly linear?
- Will overfitting become a problem?
- What are the requirements of the system in terms of speed/performance/memory usage?
- ......
This seemingly daunting list does not answer the question directly, but we can approach the problem with Occam's Razor: use the simplest algorithm that satisfies the requirements, and only add complexity if absolutely necessary.
Logistic regression
As a general rule of thumb, I recommend considering logistic regression (LR, Logistic Regression) first. Logistic regression is a pretty well-behaved classification algorithm that can be trained whenever you expect the features to be roughly linear and the problem to be linearly separable. You can easily do some feature engineering to convert most nonlinear features into linear ones. Logistic regression is also quite robust to noise, and you can avoid overfitting and even do feature selection by using L2 or L1 regularization. Logistic regression can also be used in big data scenarios, as it is fairly efficient and can be distributed using, for example, ADMM. A final advantage of logistic regression is that its output can be interpreted as a probability. This is a nice side effect; for example, you can use it for ranking instead of classification.
Even if you do not expect logistic regression to work completely, you can still do yourself a favor by running a simple L2-regularized logistic regression as a baseline before turning to "fancier" approaches.
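To make that advice concrete, here is a minimal sketch of such a baseline, assuming scikit-learn and a synthetic dataset (the data, parameter values, and metric are illustrative assumptions, not part of the original answer); the probability output is also used for ranking rather than hard classification:

```python
# Minimal sketch of an L2-regularized logistic regression baseline (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Illustrative data: many instances and features, as in the question above.
X, y = make_classification(n_samples=20000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty="l2" gives the L2-regularized baseline; C controls regularization strength.
baseline = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
baseline.fit(X_train, y_train)

# The probabilistic output can be used to rank instances instead of classifying them.
scores = baseline.predict_proba(X_test)[:, 1]
print("baseline AUC:", roc_auc_score(y_test, scores))
```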
Now that you've set up a logistic regression baseline, what should you do next? I would basically recommend two possible directions: support vector machines (SVM) or decision tree ensembles. If I didn't know your specific problem, I would definitely choose the latter, but I'll start by describing why SVM may be worth considering.
Support Vector Machine
Support vector machines use a different loss function (hinge loss) than LR. They also have a different interpretation (maximum margin). However, in practice, an SVM with a linear kernel is not very different from logistic regression (if you are interested, you can see how Andrew Ng derives SVM from logistic regression in his Coursera machine learning course). One of the main reasons to use an SVM instead of logistic regression is that your problem may not be linearly separable. In that case, you will have to use an SVM with a nonlinear kernel (such as RBF). In fact, logistic regression can also be used with different kernels, but for practical reasons you are more likely to choose an SVM. Another reason to use SVM is a high-dimensional space. For example, SVMs have been reported to work better for text classification.
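To illustrate the linear-versus-nonlinear distinction, here is a minimal sketch, assuming scikit-learn and a synthetic non-linearly-separable toy dataset (both are illustrative assumptions), comparing a linear-kernel SVM with an RBF-kernel SVM:

```python
# Minimal sketch: linear vs. RBF kernel on a non-linearly-separable toy problem.
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import train_test_split

# make_circles produces a classic problem that no linear boundary can separate.
X, y = make_circles(n_samples=2000, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # struggles here
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))     # handles the circles
```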
Unfortunately, the main disadvantage of SVMs is that they can be painfully inefficient to train. Therefore, I would not recommend SVMs for any problem with a large number of training samples. Going further, I would not recommend SVMs for most "industrial-scale" applications. Anything beyond a toy/lab problem may be better solved with other algorithms.
Decision Tree Ensembles
Now for the third family of algorithms: decision tree ensembles. This basically covers two distinct algorithms: random forests (RF) and gradient-boosted decision trees (GBDT). I'll discuss the differences between them later; for now, let me compare them as a whole against logistic regression.
Tree ensembles have different advantages over LR. One major advantage is that they do not expect linear features, or even features that interact linearly. Something I did not mention about LR is that it can hardly handle categorical (binary) features. Tree ensembles, because they are just a combination of many decision trees, handle this very well. Another major advantage is that, because of how they are constructed (using bagging or boosting), they work well with high-dimensional spaces and large numbers of training instances.
As for the difference between RF and GBDT, it can be summarized simply: GBDT usually performs better, but it is harder to get right. More specifically, GBDT has more hyper-parameters to tune and is more prone to overfitting. RF works almost "out of the box", which is one reason it is so popular.
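As a concrete contrast, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (data and parameter values are illustrative assumptions), of a random forest used with near-default settings next to a gradient-boosted model with a few of its many hyper-parameters exposed:

```python
# Minimal sketch: RF "out of the box" vs. GBDT with several knobs to tune.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RF: usually works well with little tuning.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

# GBDT: often stronger, but learning rate, depth, subsampling, etc. all need tuning,
# and poor choices can overfit.
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                  max_depth=3, subsample=0.8, random_state=0)
gbdt.fit(X_train, y_train)

print("RF accuracy:  ", rf.score(X_test, y_test))
print("GBDT accuracy:", gbdt.score(X_test, y_test))
```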
Deep learning
Last but not least, the answer would be incomplete without at least a mention of deep learning. I would definitely not recommend this approach as a generic classification technique. However, you may have heard how well these methods perform in some particular cases (such as classification). If you have already gone through the previous steps and feel that your solution still has room for improvement, you may want to try a deep learning approach. The truth is that if you use an open-source tool (such as Theano), you can find out fairly quickly how these methods perform on your dataset.
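For completeness, here is a minimal sketch of trying a neural-network model only after the simpler baselines. The answer mentions Theano; scikit-learn's MLPClassifier is used here purely as a stand-in so the example stays self-contained, and the architecture and parameters are illustrative assumptions, not a recommendation:

```python
# Minimal sketch: a small multi-layer perceptron as a last, "fancier" step.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=20000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; neural networks generally want standardized inputs.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200, random_state=0),
)
mlp.fit(X_train, y_train)
print("MLP accuracy:", mlp.score(X_test, y_test))
```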
Summary
In summary, start with a baseline as simple as logistic regression, and only make things more complex if you need to. At that point, decision tree ensembles may be the right path to take, especially random forests, which are easy to tune. If you think there is still room for improvement, try GBDT, or get even fancier and go for deep learning.
You can also look at Kaggle competitions. If you search for the keyword "classification" and pick the competitions that are already finished, you can find similar problems and get a sense of which approaches win. At that point, you may realize that an ensemble of methods often wins. The only problem with such ensembles, of course, is that they require keeping all the independent methods working in parallel. That may be your final, fancy step.
Editorial review: Xavier Amatriain's reluctance to recommend deep learning as a general-purpose algorithm is not because deep learning is bad, but because deep learning adds complexity and cost without guaranteeing better results in every scenario than logistic regression, SVM, or decision tree ensembles. In fact, Xavier Amatriain's Netflix team has already begun to explore artificial neural networks and deep learning, hoping to use AWS cloud services and GPU-accelerated distributed neural networks to analyze which movies and shows viewers like, in order to personalize program recommendations.
Netflix recommendation system architecture (image from Xavier Amatriain's official Netflix blog)
Xavier Amatriain has since also shared ten lessons learned from Netflix's machine learning practice, which broadly include:
- More data needs to be paired with better models
- You may not need all of your big data
- A more complex model does not necessarily mean better results; perhaps your sample set is too simple
- Think carefully about your training data
- Learn to handle bias
- The UI is the only channel of contact between the algorithm and what matters most: the users
- The right evaluation approach is more important than data and models
- Distributed algorithms matter, but knowing at which level to distribute matters more
- Choose the right metric for automatic hyper-parameter optimization (see the sketch after this list)
- Not everything can be done offline; nearline processing is also an option
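As a small illustration of the hyper-parameter lesson above, here is a minimal sketch, assuming scikit-learn, where the scoring metric for an automatic search is chosen explicitly (the grid, data, and choice of ROC AUC are illustrative assumptions):

```python
# Minimal sketch: hyper-parameter search with an explicitly chosen metric.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",  # the metric the search optimizes; pick one that matches the goal
    cv=5,
)
search.fit(X, y)
print("best C:", search.best_params_, "best AUC:", search.best_score_)
```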
Classification algorithms in the eyes of a Netflix engineering director: deep learning has the lowest priority