Comparison of machine learning algorithms

Tags: svm

Original address: http://www.csuldw.com/2016/02/26/2016-02-26-choosing-a-machine-learning-classifier/

This post reviews the typical use cases and the advantages and disadvantages of several common machine learning algorithms.

There are many machine learning algorithms, spanning classification, regression, clustering, recommendation, image recognition, and more, and finding a suitable one is genuinely not easy, so in practice we usually experiment in a heuristic way. We generally start with algorithms that are widely accepted, such as SVM, GBDT, and AdaBoost; deep learning is very hot right now, and neural networks are also a good choice. If you care about accuracy, the best approach is to evaluate each algorithm with cross-validation, compare the results, tune the parameters so that each algorithm reaches its best performance, and finally pick the best one. But if you are just looking for a "good enough" algorithm for your problem, the tips below may help: by analyzing the advantages and disadvantages of each algorithm, it becomes much easier to choose among them.
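As a rough sketch of this heuristic workflow (assuming scikit-learn is available and using a built-in dataset as a stand-in for your own data; none of this code is from the original post):

```python
# Sketch: compare several widely used classifiers by 5-fold cross-validation.
# The dataset here is only a placeholder; swap in your own X and y.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "svm_rbf": SVC(kernel="rbf"),
    "gbdt": GradientBoostingClassifier(),
    "adaboost": AdaBoostClassifier(),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)   # cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

After picking a front-runner this way, you would still tune its parameters (for example with a grid search) before settling on it.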

Bias and Variance

In statistics, how good a model is is measured by its bias and variance, so let's first review what bias and variance mean:

    • Bias: describes the gap between the expected value $E[\hat{y}]$ of the predicted values and the true value $y$. The larger the bias, the further the predictions deviate from the real data.

    • Variance: describes the spread of the predicted values, i.e., their degree of dispersion around their expectation $E[\hat{y}]$. The larger the variance, the more scattered the predictions are.

The true error of the model is, roughly speaking, the sum of the two: $Error = Bias^2 + Variance$ (plus irreducible noise).

With a small training set, a classifier with high bias and low variance (for example, naive Bayes) has a big advantage over one with low bias and high variance (for example, KNN), because the latter will overfit. However, as your training set grows, the model becomes better at predicting the underlying data and the bias decreases, so the low-bias/high-variance classifier gradually shows its advantage (because it has a lower asymptotic error), while the high-bias classifier is then no longer able to provide an accurate model.

Of course, you can also think of this as the difference between a generative model (NB) and a discriminative model (KNN).

Why does naive Bayes have high bias and low variance?

Here is some intuition:

First, assume you know the relationship between the training set and the test set. Simply put, we learn a model on the training set and then use it on the test set, and how well it does is measured by the test-set error rate. Most of the time, however, we can only assume that the test set and the training set follow the same data distribution, without ever seeing the real test data. So how do you gauge the test error rate when you can only see the training error rate?

Because the training sample is small (or at least not big enough), the model obtained from the training set is never truly correct. (Even 100% accuracy on the training set does not mean the model captures the real data distribution; capturing the real data distribution is our goal, not merely fitting the limited data points of the training set.) Moreover, in practice training samples often carry some noise, so if you chase perfection on the training set with a very complex model, the model will treat the errors inside the training set as genuine properties of the data distribution and thus arrive at a wrong estimate of that distribution. In that case it performs terribly on the real test set (this phenomenon is called overfitting). But the model cannot be too simple either; otherwise, when the data distribution is fairly complex, the model will be unable to capture it (reflected in a high error rate even on the training set, a phenomenon called underfitting). Overfitting means the model used is more complex than the true data distribution, while underfitting means the model used is simpler than the true data distribution.

In the framework of statistical learning, when describing model complexity there is the view that Error = Bias + Variance. The error here can roughly be understood as the model's prediction error rate, made up of two parts: one part (Bias) comes from the model being too simple to estimate accurately, and the other part (Variance) comes from the model being so complex that it has more room to vary and greater uncertainty.

So naive Bayes is easy to analyze. It simply assumes that the features are independent of one another, which is a severely simplified model. For such a simple model, in most cases the bias part will be larger than the variance part, i.e., high bias and low variance.

In practice, to make the error as small as possible, we need to balance the proportions of bias and variance when choosing a model, that is, balance overfitting and underfitting.

The relationship between bias, variance, and model complexity is most intuitive as a picture: as the complexity of the model rises, the bias becomes smaller and the variance becomes larger.
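One quick way to see this tradeoff (a sketch with made-up data and scikit-learn, not from the original post) is to fit polynomial models of increasing degree and watch the gap between training and test error widen:

```python
# Sketch: underfitting vs. overfitting as model complexity (polynomial degree) grows.
# Low degree -> high bias (both errors high); high degree -> high variance
# (training error small, test error much larger).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # noisy target
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}  "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
```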

Common algorithm pros and cons

1. Naive Bayes

Naive Bayes belongs to the generative models (generative and discriminative models differ mainly in whether a joint distribution is required). It is very simple: you just do a bunch of counting. If the conditional independence assumption (a rather strict condition) holds, a naive Bayes classifier converges faster than a discriminative model such as logistic regression, so you need less training data. Even when the conditional-independence assumption does not hold, the NB classifier still performs well in practice. Its main disadvantage is that it cannot learn interactions between features; in terms of the R in mRMR, this is feature redundancy. To cite a classic example: although you like both Brad Pitt's and Tom Cruise's films, it cannot learn that you dislike films in which they appear together.

Advantages:

    • The naive Bayes model is rooted in classical mathematical theory, has a solid mathematical foundation, and stable classification efficiency.

    • Performs very well on small-scale data, can handle multi-class tasks, and is suitable for incremental training;

    • Not very sensitive to missing data; the algorithm is fairly simple and is often used for text categorization.

Disadvantages:

    • Requires calculating the prior probabilities;

    • There is an inherent error rate in the classification decision;

    • Sensitive to the form in which the input data is expressed.
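As a minimal sketch of naive Bayes applied to text categorization (the toy documents and labels below are purely illustrative assumptions):

```python
# Sketch: multinomial naive Bayes on a toy text-classification task.
# The model only counts word frequencies per class and applies Bayes' rule
# under the conditional-independence ("naive") assumption described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "great movie, loved the acting",
    "terrible plot and boring acting",
    "loved it, great fun",
    "boring, terrible, waste of time",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["great acting but boring plot"]))
```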

2. Logistic Regression

It is a discriminative model. There are many ways to regularize it (L0, L1, L2, etc.), and you don't have to worry about whether your features are correlated, as you would with naive Bayes. Compared with decision trees and SVMs, you also get a nice probabilistic interpretation, and you can easily update the model with new data (using an online gradient descent algorithm). Use it if you need a probabilistic framework (for example, to simply adjust the classification threshold, indicate uncertainty, or obtain confidence intervals), or if you want to quickly fold more training data into the model later.

The sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Advantages:

    • Simple to implement and widely used in industrial problems;

    • Very little computation is needed at classification time; it is fast and uses little storage;

    • Provides convenient, observable probability scores for samples;

    • For logistic regression, multicollinearity is not a problem; it can be handled by combining with L2 regularization.

Disadvantages:

    • When the feature space is very large, logistic regression does not perform very well;

    • Prone to underfitting; accuracy is generally not very high;

    • Cannot handle a large number of multi-class features or variables well;

    • Can only handle binary classification problems (softmax, derived from it, can be used for multi-class classification), and the data must be linearly separable;

    • Nonlinear features need to be transformed first.
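A small sketch of the "update the model with new data via online gradient descent" point above (scikit-learn's SGDClassifier with logistic loss is used here as one possible implementation; the dataset is synthetic):

```python
# Sketch: L2-regularized logistic regression trained online with SGD.
# partial_fit lets the model absorb data arriving in mini-batches.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = SGDClassifier(loss="log_loss", penalty="l2")  # loss is named "log" in older scikit-learn

for start in range(0, len(X), 100):                 # pretend data arrives in chunks
    batch_X, batch_y = X[start:start + 100], y[start:start + 100]
    clf.partial_fit(batch_X, batch_y, classes=np.array([0, 1]))

print(clf.predict_proba(X[:3]))                     # probability scores per sample
```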

3. Linear regression

Linear regression is used for regression, unlike logistic regression, which is used for classification. Its basic idea is to optimize the least-squares error with gradient descent; of course, you can also use the normal equation to solve for the parameters directly, which gives:

$\hat{w} = (X^{T}X)^{-1}X^{T}y$

In LWLR (locally weighted linear regression), the estimate of the parameters is:

$\hat{w} = (X^{T}WX)^{-1}X^{T}Wy$

where $W$ is a diagonal matrix assigning a weight to each training sample.

This shows that, unlike LR, LWLR is a non-parametric model, because every time a regression is performed the training samples must be traversed at least once.

Advantages: simple to implement, simple to compute.
Disadvantages: cannot fit nonlinear data.
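A short sketch of solving the normal equation above directly with NumPy (synthetic data; a pseudo-inverse is used for numerical stability):

```python
# Sketch: ordinary least squares via the normal equation w = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(100), rng.uniform(0, 1, (100, 2))]    # first column = bias term
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)         # noisy linear data

w_hat = np.linalg.pinv(X.T @ X) @ X.T @ y                # normal-equation solution
print("estimated weights:", w_hat)
```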

4. Nearest Neighbor Algorithm (KNN)

KNN is the nearest neighbor algorithm, and its main process is:

1. Compute the distance between the test sample and every training sample (common distance metrics include the Euclidean distance, Mahalanobis distance, etc.);
2. Sort all of these distance values;
3. Select the k samples with the smallest distances;
4. Vote using the labels of these k samples to obtain the final classification.

How to choose the optimal value of k depends on the data. In general, a larger k reduces the effect of noise during classification, but blurs the boundaries between categories. A good value of k can be obtained with various heuristic techniques, such as cross-validation. In addition, noise and irrelevant feature vectors reduce the accuracy of the k-nearest-neighbor algorithm.

The nearest-neighbor algorithm has strong consistency results. As the amount of data tends to infinity, the algorithm is guaranteed an error rate no more than twice the Bayes error rate. For some good values of k, the k-nearest-neighbor error rate is guaranteed not to exceed the Bayes theoretical error rate.
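A compact sketch of the steps listed above in plain NumPy (brute force, illustrative only):

```python
# Sketch: brute-force k-nearest-neighbor classification:
# distances -> sort -> take the k closest -> majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    distances = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest
    votes = Counter(y_train[nearest])                     # vote by label
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0
```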

The advantages of KNN algorithm

    • Mature theory and simple idea; can be used for both classification and regression;

    • Can be used for nonlinear classification;

    • Training time complexity is O(n);

    • Makes no assumptions about the data, has high accuracy, and is not sensitive to outliers;

Disadvantages

    • Computationally expensive;

    • Suffers from the class-imbalance problem (i.e., some classes have many samples while others have very few);

    • Requires a lot of memory;

5. Decision Tree

Easy to interpret. It can handle interactions between features without trouble, and it is non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (for example, a decision tree can easily handle the case where class A appears at the low end of feature dimension x, class B in the middle, and class A again at the high end). One of its drawbacks is that it does not support online learning, so the tree has to be rebuilt when new samples arrive. Another drawback is that it overfits easily, but this is exactly the entry point for ensemble methods such as random forests (RF) (or boosted trees). In addition, random forests are frequently the winner of many classification problems (usually just slightly behind SVMs); they are fast to train and easy to tune, and you don't have to worry about adjusting a pile of parameters as with an SVM, so they have always been popular.

An important point in decision trees is how to select the attribute to branch on, so pay attention to the formula for information gain and understand it thoroughly.

The information entropy is computed as follows:

$H = -\sum_{i=1}^{n} p_i \log_2 p_i$

where n means there are n classes (for example, in a 2-class problem, n = 2). Compute the probabilities $p_1$ and $p_2$ of the two classes among all samples; this gives the information entropy before branching on any attribute.

Now select an attribute $x_i$ to branch on, with the branching rule: if $x_i = v$, put the sample into one branch of the tree; otherwise put it into the other branch. Clearly, the samples in each branch are likely to include both classes; compute the entropies $H_1$ and $H_2$ of the two branches, and the total information entropy after branching $H' = p_1 H_1 + p_2 H_2$; the information gain is then $\Delta H = H - H'$. Using information gain as the criterion, test all attributes one by one and select the attribute that maximizes the gain as the branching attribute.
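A small sketch of the entropy and information-gain formulas above for a single binary split (toy data):

```python
# Sketch: information gain for splitting on attribute x_i == v.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature, v):
    left, right = labels[feature == v], labels[feature != v]
    p1, p2 = len(left) / len(labels), len(right) / len(labels)
    h_after = p1 * entropy(left) + p2 * entropy(right)   # H' = p1*H1 + p2*H2
    return entropy(labels) - h_after                     # deltaH = H - H'

y = np.array([0, 0, 1, 1, 1, 0])
x_i = np.array(["v", "v", "w", "w", "w", "v"])
print(information_gain(y, x_i, "v"))                     # perfect split -> gain = 1.0
```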

Advantages of the decision tree itself

    • Simple computation, easy to understand, and highly interpretable;

    • Fairly well suited to handling samples with missing attributes;

    • Able to handle irrelevant features;

    • Able to produce feasible and effective results for large data sources in a relatively short time.

Disadvantages

    • Prone to overfitting (random forests can reduce overfitting to a large extent);

    • Correlations between the data are ignored;

    • For data where the class sample counts are inconsistent, the information-gain results in a decision tree are biased toward features with more distinct values (any method that uses information gain has this drawback, e.g., RF).

5.1 AdaBoost

AdaBoost is an additive model: each new model is built based on the error rate of the previous model, paying more attention to misclassified samples and less attention to correctly classified ones. After successive iterations, a fairly good model is obtained. It is a typical boosting algorithm. Here is a summary of its pros and cons.

Advantages

    • AdaBoost is a classifier with very high accuracy.

    • Sub-classifiers can be constructed in various ways; the AdaBoost algorithm provides the framework.

    • When a simple classifier is used, the computed results are understandable, and constructing the weak classifiers is extremely simple.

    • Simple, with no feature selection required.

    • Not prone to overfitting.

For a summary of ensemble algorithms such as random forest and GBDT, refer to this article: Machine Learning - Summary of Ensemble Algorithms.

Disadvantages: fairly sensitive to outliers.
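A minimal sketch of AdaBoost with decision stumps as the weak classifiers (synthetic data; note that the estimator argument is named base_estimator in older scikit-learn versions):

```python
# Sketch: AdaBoost over decision stumps, evaluated by cross-validation.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a "stump" weak learner
    n_estimators=100,
)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```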

6. SVM (Support Vector Machine)

High accuracy, with good theoretical guarantees against overfitting, and even if the data are not linearly separable in the original feature space, it works very well as long as a suitable kernel function is chosen. It is especially popular for text classification problems, which are extremely high-dimensional. Unfortunately, it is memory-hungry, hard to interpret, and somewhat annoying to run and tune, whereas random forests neatly avoid these shortcomings and are more practical.

Advantages

    • Can solve high-dimensional problems, i.e., large feature spaces;

    • Can handle interactions of nonlinear features;

    • Does not rely on the entire dataset;

    • Can improve generalization ability;

Disadvantages

    • Not very efficient when there are many samples (observations);

    • There is no universal solution for nonlinear problems; it can sometimes be hard to find a suitable kernel function;

    • Sensitive to missing data;

Choosing the kernel is also tricky (LIBSVM comes with four built-in kernels: linear, polynomial, RBF, and sigmoid):

    • First, if the number of samples is smaller than the number of features, there is no need to choose a nonlinear kernel; simply using a linear kernel is enough;

    • Second, if the number of samples is larger than the number of features, a nonlinear kernel can be used to map the samples to a higher dimension, which generally gives better results;

    • Third, if the number of samples equals the number of features, a nonlinear kernel can be used, by the same reasoning as the second point.

In the first case, it is also possible to reduce the dimensionality of the data first and then use a nonlinear kernel, which is another option.
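A brief sketch of these heuristics in practice (scikit-learn's SVC on a stand-in dataset, with feature scaling, which SVMs generally benefit from):

```python
# Sketch: compare a linear kernel and an RBF kernel by cross-validation.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(model, X, y, cv=5).mean())
```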

7. Advantages and disadvantages of artificial neural networks

Advantages of artificial Neural networks:

    • High classification accuracy;

    • Strong parallel distributed processing ability, and strong distributed storage and learning ability;

    • Strong robustness and fault tolerance to noise; able to fully approximate complex nonlinear relationships;

    • Has the capability of associative memory.

Disadvantages of artificial Neural networks:

    • Neural networks require a large number of parameters, such as the network topology and the initial values of weights and thresholds;

    • The learning process cannot be observed, and the output is hard to interpret, which affects the credibility and acceptability of the results;

    • Training may take too long, and may even fail to reach the learning objective.

8. K-Means Clustering

For an article about k-means clustering, see this link: Machine Learning Algorithms - K-Means Clustering. The derivation of k-means has a very strong EM (expectation-maximization) flavor.

Advantages

    • The algorithm is simple and easy to implement;

    • For large data sets, the algorithm is relatively scalable and efficient, since its complexity is roughly O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Usually k << n. The algorithm typically converges to a local optimum.

    • The algorithm tries to find the k-way partition that minimizes the squared-error function. Clustering works well when the clusters are dense and roughly spherical, and the differences between clusters are clear.

Disadvantages

    • Demanding on data type; best suited to numerical data;

    • May converge to a local minimum, and converges slowly on large-scale data;

    • The value of k is fairly hard to choose;

    • Sensitive to the initial cluster-center values; different initial values may lead to different clustering results;

    • Not suitable for discovering clusters of non-convex shape, or clusters of very different sizes;

    • Sensitive to "noise" and outlier data; a small amount of such data can greatly affect the means.

Algorithm selection Reference

I previously translated several articles from abroad, one of which gave some simple algorithm-selection tips:

    1. The first choice should be logistic regression; even if it doesn't work well, its results can serve as a baseline against which to compare other algorithms;

    2. Then try decision trees (random forests) and see whether they significantly improve your model's performance. Even if you don't end up using them as the final model, you can use random forests to remove noisy variables and do feature selection;

    3. If the number of features and the number of observations are both particularly large, then when resources and time are sufficient (this premise is important), using SVM is an option.

Usually: "Gbdt>=svm>=rf>=adaboost>=other ...", now deep learning is very hot, many fields are used, it is based on neural networks, I am also learning, but the theoretical knowledge is not very thick, not deep enough to understand , there is no introduction here.

Algorithms are important, but good data beats a good algorithm, and designing good features pays off greatly. If you have a very large data set, then whichever algorithm you use may not matter much for classification performance (in that case, choose based on speed and ease of use).

