Author: Xyzh
Link: https://www.zhihu.com/question/26726794/answer/151282052
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
I just came across an article today on this very question; its analysis of the pros and cons of each algorithm is quite pertinent.
https://zhuanlan.zhihu.com/p/25327755
Back in 2014, someone ran an experiment [1] comparing the real-world performance of 179 different classifiers on 121 datasets.
The paper is titled: Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
That experiment is a bit dated by now, so I'll combine my own understanding with some more recent experiments and talk through the options. The focus here is on classifiers.
For those too lazy to read on: there is no best classifier, only the most appropriate classifier.
Random forests are the strongest on average, but rank first on only 9.9% of the datasets; their advantage is having few weak spots.
SVMs come second on average, and rank first on 10.7% of the datasets.
Neural networks (13.2%) and boosting (~9%) also perform well.
The higher the data dimensionality, the more random forests outperform AdaBoost, but overall both still trail SVMs [2].
The more data there is, the stronger neural networks become.
Nearest neighbor (KNN)
The typical example is KNN. The idea: for the point to be classified, find the nearest few data points and decide its class according to their classes.
Its defining trait is that it follows the data completely; there is no mathematical model to speak of.
Applicable scenario:
When you need a model that is particularly easy to explain.
For example, when a recommendation algorithm needs to explain its reasoning to the user.
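A minimal KNN sketch with scikit-learn; the iris dataset and k = 5 here are arbitrary illustrative choices, not anything from the paper:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # each prediction is a vote of the 5 nearest points
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # accuracy on held-out data
```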
Bayesian
The typical example is naive Bayes; the core idea is to use conditional probabilities to compute the class of the point to be classified.
It is a relatively easy-to-understand model and is still used by spam filters today.
Applicable scenario:
When you need a model that is fairly easy to explain and whose different dimensions are only weakly correlated.
It can handle high-dimensional data quite efficiently, though the results may not always be satisfactory.
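A toy spam-filter sketch along these lines, assuming scikit-learn; the texts and labels are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "lowest price offer", "meeting at noon", "see you tomorrow"]
labels = [1, 1, 0, 0]                      # 1 = spam, 0 = normal mail (made-up toy data)

vec = CountVectorizer()
X = vec.fit_transform(texts)               # word-count features, one dimension per word
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["cheap offer now"])))   # per-word conditional probabilities decide the class
```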
Decision trees
The defining trait of a decision tree is that it always splits along features, and as the layers go deeper the partition becomes finer and finer.
Although end users rarely look at the generated tree itself, a data analyst can get an intuitive feel for the classifier's core logic by inspecting the upper levels of the tree.
As a simple example, when predicting a child's height, the first layer of the decision tree might split on the child's gender: boys go to the left subtree for further prediction, girls to the right subtree. This shows that gender has a strong effect on height.
Applicable scenario:
Because it produces a clear tree structure that chooses predictions based on features, a data analyst often reaches for a decision tree when wanting to better understand the data at hand.
At the same time, it is a relatively easy classifier to attack [3]. An attack here means artificially changing a few features so that the classifier misjudges the sample, as commonly seen in spam evading detection. Because a decision tree's final decision rests on individual conditions at the bottom of the tree, an attacker often only needs to change a handful of features to escape detection.
Constrained by its simplicity, the decision tree's greater usefulness is as the cornerstone of more powerful algorithms.
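A minimal sketch of that "inspect the upper structure" idea, assuming scikit-learn; the dataset and depth limit are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)
# Printing the tree shows which features the upper levels split on.
print(export_text(tree, feature_names=list(data.feature_names)))
```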
Random forest
Speaking of decision trees, you have to mention random forests. As the name suggests, a forest is just many trees.
Strictly speaking, a random forest is actually an ensemble algorithm. It randomly selects different features and training samples, generates a large number of decision trees, and then combines the results of those trees for the final classification.
Random forests are widely used in real-world analysis; relative to a single decision tree they greatly improve accuracy, and they also mitigate the decision tree's vulnerability to attack.
Applicable scenario:
When the data dimensionality is relatively low (a few dozen dimensions) and higher accuracy is required.
Because it achieves good results without much parameter tuning, random forest is the first thing to try when you have no idea which method to use.
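A sketch of that "just try it with near-default parameters" usage, assuming scikit-learn; the dataset (30 dimensions) and tree count are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)      # 30-dimensional data, in the "few dozen" range
rf = RandomForestClassifier(n_estimators=200, random_state=0)   # default settings otherwise
print(cross_val_score(rf, X, y, cv=5).mean())    # cross-validated accuracy
```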
SVM (Support vector machine)
The core idea of SVM is to find the boundary surface between the classes, so that the two classes of samples fall on opposite sides of that surface and lie as far away from it as possible.
The earliest SVMs used flat (linear) boundaries, which was quite limiting. But by using kernel functions we can map the flat boundary into a curved one, which greatly extends the range of problems SVM can handle.
SVMs extended in this way are used heavily and show excellent accuracy in practical classification.
Applicable scenario:
SVM has excellent performance on many data sets.
Relatively speaking, SVM tries to preserve the distance relationships between samples, so it is more resistant to attack.
Like random forest, it is an algorithm worth trying first as soon as you get the data.
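A sketch comparing a flat (linear) boundary with a kernel-mapped curved one, assuming scikit-learn; the dataset and settings are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
for kernel in ("linear", "rbf"):                 # flat boundary vs. kernel-mapped curved boundary
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(svm, X, y, cv=5).mean())
```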
Logistic regression (LR)
"Logistic regression" is too odd a name, so I'll just call it LR; since we're discussing classifiers here anyway, no other method goes by LR. As the name suggests, it is actually a variant of the regression family of methods.
The core of regression methods is to find the parameters that make a function fit the sample values as closely as possible. For example, linear regression finds the best a and b for the function f(x) = ax + b.
LR does not fit a linear function directly; what it fits is a probability (the linear score is squashed through a logistic/sigmoid function), so the value of f(x) now reflects the probability that the sample belongs to the class.
Applicable scenario:
LR is also a basic building block of many classification algorithms; its advantage is that its output naturally falls between 0 and 1 and carries a probabilistic interpretation.
Because it is essentially a linear classifier, it does not handle correlated features very well.
Although its accuracy is only so-so, it wins on model clarity: the probability theory behind it stands up to scrutiny, and the fitted parameters directly represent each feature's influence on the result. It is also a good tool for understanding the data.
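A sketch of using LR as a data-understanding tool, assuming scikit-learn; the dataset is arbitrary and the "top 5" cutoff is my own choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(data.data, data.target)

print(lr.predict_proba(data.data[:3]))           # outputs fall between 0 and 1
coefs = lr.named_steps["logisticregression"].coef_[0]
# On standardized inputs, a larger |coefficient| means the feature pushes the prediction harder.
top = sorted(zip(data.feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)[:5]
print(top)
```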
Discriminant analysis
Discriminant analysis is mostly used over in statistics, so I'm not very familiar with it; I had a close friend in the statistics department give me a quick catch-up lesson, so I'm passing on what I just learned.
The typical example of discriminant analysis is linear discriminant analysis, LDA for short.
(Note: don't confuse it with latent Dirichlet allocation. That is also abbreviated LDA, but it's a completely different thing.)
The core idea of LDA is to project high-dimensional samples down to a lower dimension: for two classes, project onto one dimension; for three classes, project onto a two-dimensional plane. There are of course many possible projections; LDA's criterion is to keep samples of the same class as close together as possible and samples of different classes as far apart as possible. For future samples to be predicted, you project them the same way and can easily tell the classes apart.
Applicable scenario:
Discriminant analysis is suitable for reducing the dimensionality of high-dimensional data, and that dimensionality reduction lets us observe the sample distribution conveniently. Its correctness is backed by mathematical derivation, so it also stands up to scrutiny.
However, its classification accuracy is often not very high, so unless you're from a statistics department, treat it mainly as a dimensionality-reduction tool.
Also note that it assumes the samples are normally distributed, so don't bother trying it on data shaped like concentric circles.
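A minimal LDA sketch, assuming scikit-learn; with three classes the projection lands in two dimensions, as described above:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                      # 3 classes, 4 features
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_2d = lda.transform(X)                                # projected onto a 2-D plane (n_classes - 1)
print(X_2d.shape, lda.score(X, y))                     # low-dimensional view + classification accuracy
```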
Neural networks
Neural networks are incredibly hot right now. Their core idea is to use training samples to gradually refine the parameters. Take the height-prediction example again: suppose one input feature is gender (1: male; 0: female) and the output feature is height (1: tall; 0: short). When a training sample is a tall boy, the route from "male" to "tall" inside the network gets strengthened. Likewise, if a tall girl comes along, the route from "female" to "tall" gets strengthened.
Which routes end up strong in the final network is determined by our samples.
The advantage of neural networks is that they can have many layers. If the input were connected directly to the output, the model would be no different from LR. But by introducing a large number of intermediate layers, the network can capture relationships among many input features. Convolutional neural networks have a classic visualization of what the different layers learn, which I won't repeat here.
Neural networks appeared very early, but their accuracy depends on a large training set. Originally, limited by computer speed, their classification results long lagged behind classic algorithms like random forest and SVM.
Applicable scenario:
When the amount of data is huge and there are intrinsic relationships among the features.
Of course, neural networks today are not just classifiers; they can also be used to generate data, to do dimensionality reduction, and so on, none of which I'll go into here.
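A small multi-layer-perceptron sketch, assuming scikit-learn; the hidden-layer sizes are arbitrary and this is nowhere near a modern deep network:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers sit between input and output; remove them and this degenerates to roughly LR.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0))
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```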
Rule-based methods
I'm really not familiar with this one; I don't even know what the standard Chinese translation is.
The typical algorithm here is C5.0 Rules, a variant based on decision trees. Because a decision tree, being a tree structure, can still be somewhat hard to read, C5.0 Rules extracts the tree's decisions into small rules, each consisting of just two or three conditions.
Applicable scenario:
It is slightly less accurate than a decision tree and rarely used. It's probably only needed when you must provide explicit rules to explain a decision.
Boosting
The next several models all belong to the family of ensemble learning algorithms, built on one core idea: three cobblers together outmatch Zhuge Liang, i.e. many weak heads can beat one strong one.
When we combine a number of weaker classifiers, the result can be better than a single strong classifier.
The typical example is AdaBoost.
AdaBoost is built up step by step: starting from a very basic classifier, each round looks for the classifier that best handles the currently misclassified samples, and the new classifier is merged into the existing ensemble via a weighted sum.
Its advantage is built-in feature selection: it only uses the features found to be effective on the training set. This reduces the number of features that need to be computed at classification time and, to some extent, eases the problem of high-dimensional data being hard to understand.
In the most classic AdaBoost implementations, each weak classifier is actually a decision tree. This is what I meant earlier about the decision tree being the cornerstone of other algorithms.
Applicable scenario:
A good boosting algorithm is no less accurate than random forest. Although only one boosting method squeezed into the top 10 in the experiments of [1], it is still very strong in practice. Because it comes with feature selection it is very friendly to beginners, another "try it when you don't know what to use" algorithm.
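A minimal AdaBoost sketch, assuming scikit-learn, whose default weak learner is indeed a depth-1 decision tree; the dataset and number of rounds are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# The default weak learner is a depth-1 decision tree; each round re-weights the samples
# the current ensemble gets wrong and adds a new, weighted tree.
ada = AdaBoostClassifier(n_estimators=200, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```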
Bagging
Bagging also follows the idea of combining weak classifiers, and compared with boosting it is actually easier to understand. It first randomly resamples the training set, trains a number of weak classifiers on those resamples, and then decides the final classification by averaging or by voting.
Because it samples the training set randomly, bagging can avoid overfitting to some extent.
In [1], the strongest bagging algorithm is based on SVMs. If we're not too strict about definitions, random forest is itself a kind of bagging.
Applicable scenario:
Compared with the classic must-try algorithms, fewer people use bagging. Part of the reason is that bagging's performance depends heavily on the choice of parameters, and default parameters often don't give good results.
Although with proper tuning its results can beat decision trees and LR, the model also becomes more complex, so don't use it without a particular reason.
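A sketch of SVM-based bagging in the spirit of the result from [1], assuming a recent scikit-learn (older versions name the first argument base_estimator instead of estimator); all settings are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# 10 SVMs, each trained on a bootstrap resample of the training set, combined by voting.
bag = BaggingClassifier(estimator=make_pipeline(StandardScaler(), SVC()),
                        n_estimators=10, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```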
Stacking
I really don't know how to say this one in Chinese. What it does is build a new classifier on top of the outputs of multiple classifiers.
The new classifier is trained on the weak classifiers' predictions together with the training labels. This final layer generally uses LR.
Stacking performed poorly in [1], perhaps because the extra layer of classifiers introduced more parameters, or perhaps because of overfitting.
Applicable scenario:
Don't use it unless you have to.
(Correction: @Zhong reminds me that stacking is very popular on the data-mining competition site Kaggle, and I believe that with well-tuned parameters it can still help the results.
http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
That article gives a good introduction to the benefits of stacking. In a setting like Kaggle, where a tiny improvement means a different ranking, stacking is still very effective; but for ordinary business use, the improvement it brings is rarely worth the extra complexity.)
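A stacking sketch with an LR final layer, as described above, assuming scikit-learn; the choice of base classifiers is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm", make_pipeline(StandardScaler(), SVC()))],
    final_estimator=LogisticRegression(max_iter=1000))   # LR trained on the base predictions
print(cross_val_score(stack, X, y, cv=5).mean())
```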
Multi-expert model (mixture of experts)
This model has been very popular recently, mainly for merging the classification results of neural networks. I'm not very familiar with it either; if you're interested in neural networks and your training set is quite heterogeneous, it's worth looking into.
That basically covers the classifiers. Now let's go over some of the other terms mentioned in the question.
Maximum entropy model
The maximum entropy model is not itself a classifier; it is generally used to judge how good a model's predictions are.
From its point of view, a classifier's prediction amounts to assigning each class a probability for a given sample. For example, if the sample's feature is "gender: male", my classifier might output: tall (60%), short (40%).
If the sample really is tall, then we score 60% on it. The goal is to make the product of these scores over the samples as large as possible.
LR is in fact an algorithm that uses the maximum entropy model as its optimization objective [4].
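A tiny worked example of that scoring idea, with made-up probabilities; maximizing the product is the same as maximizing the sum of logs, which is the likelihood LR optimizes:

```python
import numpy as np

# Predicted P(tall) for three samples (made-up numbers) and whether each is actually tall.
p_tall = np.array([0.6, 0.8, 0.3])
y_true = np.array([1, 1, 0])

scores = np.where(y_true == 1, p_tall, 1 - p_tall)   # probability assigned to the true class
print(scores.prod())                                  # the product we want to maximize: 0.6 * 0.8 * 0.7
print(np.log(scores).sum())                           # equivalent log-likelihood
```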
EM
Like the maximum entropy model, EM is not a classifier but an idea, and many algorithms are implemented on top of it.
@Liu has already explained it very clearly, so I won't say more.
Hidden Markov model
This is a sequence-based prediction method; the core idea is to predict the next state from the previous one (or previous few) states.
It is called a "hidden" Markov model because in its setting we cannot observe the states themselves; we can only infer the likely states from the sequence of outputs the states generate.
Applicable scenario:
It can be used to predict a sequence, and it can also be used to generate one.
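A minimal Viterbi sketch in plain NumPy, recovering the most likely hidden-state sequence from an observed sequence; the transition and emission probabilities here are made up for illustration:

```python
import numpy as np

# Made-up model: 2 hidden states, 2 possible observations.
start = np.array([0.5, 0.5])          # P(initial state)
trans = np.array([[0.7, 0.3],         # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],          # P(observation | state)
                 [0.2, 0.8]])
obs = [0, 1, 1, 0]                    # the observed sequence

# Forward pass over log-probabilities, remembering the best predecessor of each state.
log_p = np.log(start) + np.log(emit[:, obs[0]])
back = []
for o in obs[1:]:
    scores = log_p[:, None] + np.log(trans) + np.log(emit[:, o])[None, :]
    back.append(scores.argmax(axis=0))
    log_p = scores.max(axis=0)

# Backtrack to recover the most likely hidden-state sequence.
state = int(log_p.argmax())
path = [state]
for ptr in reversed(back):
    state = int(ptr[state])
    path.append(state)
print(list(reversed(path)))
```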
Conditional random field
A typical example is the Linear-chain CRF.
@Aron has already covered how to use it in detail, so I won't embarrass myself, since I've never used it.
That's about it. If I find the time I might draw a diagram, which should make things clearer.
References:
[1] Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems?" J. Mach. Learn. Res. 15.1 (2014).
[2] Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. "An empirical evaluation of supervised learning in high dimensions." ICML '08.
[3] Wang, G., Wang, T., Zheng, H., and Zhao, B. Y. "Man vs. machine: Practical adversarial detection of malicious crowdsourcing workers." USENIX Security '14.
[4] http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf
Edited on 2017-03-14