How to choose a machine learning algorithm

Source: Internet
Author: User
Tags: svm

This article is reposted from http://www.52ml.net/15063.html

If you just want to find a "good enough" algorithm for your problem, or a starting point, here are some good general guidelines.

1. Training set size

With a small training set, a high-bias/low-variance classifier (such as naive Bayes) has an advantage over a low-bias/high-variance classifier (such as k-nearest neighbors), because the latter is prone to overfitting. As the training set grows, however, the low-bias/high-variance classifiers start to win out (they have lower asymptotic error), because the high-bias classifiers are not powerful enough to provide accurate models.

To unpack "high variance" and "high bias": high variance means the test error is much larger than the training error (the model overfits); high bias means the model fits poorly even on the training data (for example, because the model is too simple). Roughly speaking, bias shows up as training error, while variance shows up as the gap between test error and training error. This distinction can also be related to the difference between generative models (naive Bayes, hidden Markov models), which first model the joint probability distribution and then derive the conditional distribution, and discriminative models (k-nearest neighbors, the perceptron, decision trees, logistic regression, maximum entropy, SVM, boosting, conditional random fields), which learn the conditional distribution or decision function directly.
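As a rough illustration (not from the original article), the following scikit-learn sketch compares a high-bias/low-variance classifier (Gaussian naive Bayes) with a low-bias/high-variance one (k-nearest neighbors) at increasing training set sizes; the dataset and parameters are arbitrary placeholders.

```python
# Illustrative sketch of the bias/variance trade-off: naive Bayes vs. k-NN
# as the training set grows. A large gap between train and test accuracy
# signals high variance; low accuracy on both signals high bias.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    sizes, train_scores, test_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.05, 1.0, 5), cv=5)
    print(name)
    for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
        print(f"  n={n:5d}  train acc={tr:.3f}  test acc={te:.3f}")
```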

2. Advantages and disadvantages of common algorithms

Naive Bayes (NB): estimates the prior probability P(y) and the conditional probability P(x|y); given x, it predicts the class y that maximizes P(y)P(x|y). (A small code sketch follows after the pros and cons below.)

  Advantages: very simple; you basically just do a bit of counting and arithmetic. If the conditional independence assumption actually holds, a naive Bayes classifier converges faster than discriminative models such as logistic regression, so you need less training data. Even when the assumption does not hold, naive Bayes still performs decently in practice. It is a good choice if you want something quick and easy that performs reasonably well.

  Disadvantages: its main weakness is that it cannot learn interactions between features (for example, it cannot learn that although you like films by Donnie Yen and films by Jiang Wen, you dislike the film they co-starred in).
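A minimal sketch of the decision rule described above, assuming scikit-learn's GaussianNB (which applies that rule under a Gaussian assumption on each feature); the dataset is just a stand-in.

```python
# Naive Bayes decision rule: predict the class y that maximizes P(y) * P(x|y).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)                      # estimates P(y) and P(x|y) from the data
print("test accuracy:", nb.score(X_test, y_test))
print("class posteriors for one sample:", nb.predict_proba(X_test[:1]))
```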

Logistic regression (LR): there are many ways to regularize the model, and unlike with NB's conditional independence assumption, you do not need to worry about whether the features are correlated. Unlike decision trees and support vector machines (SVMs), LR also gives you a nice probabilistic interpretation, and the model is easy to update with new training data (using an online gradient descent method). LR is worth using if you want probability information (for example, to easily adjust classification thresholds, to obtain classification uncertainty, or to get confidence intervals), or if you expect more data to arrive in the future that you want to fold into the model quickly.
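A hedged sketch of the two LR properties highlighted above: probability outputs and online updates. It assumes a recent scikit-learn where SGDClassifier with logistic loss (named "log_loss") is logistic regression trained by online gradient descent; batch sizes and thresholds are arbitrary.

```python
# Logistic regression via online gradient descent: partial_fit folds in new
# batches of data, and predict_proba gives class probabilities you can threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = SGDClassifier(loss="log_loss", penalty="l2")   # L2-regularized logistic regression

classes = np.unique(y)
for start in range(0, len(X), 100):                  # feed data as if it arrived over time
    clf.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)

proba = clf.predict_proba(X[:3])                     # probabilities, not just labels
print(proba)
print((proba[:, 1] > 0.3).astype(int))               # e.g. a custom classification threshold
```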

Decision Trees (DT)
DT is easy to understand and explain. It is non-parametric, so you do not have to worry about outliers or about whether the data is linearly separable (for example, a DT easily handles the case where samples of class A have a feature value that is either very small or very large, while samples of class B have that feature in the middle range).

The main disadvantage of DT is that it easily overfits, which is why ensemble methods such as random forests (RF) or boosted trees were developed. In addition, RF is often the winner on many classification problems (I personally believe it is generally better than SVM), it is fast and scalable, and it does not require tuning a large number of parameters the way SVM does, so RF has recently become a very popular algorithm.
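To make the single-tree-versus-forest point concrete, here is an illustrative scikit-learn sketch (not from the article); the synthetic dataset and forest size are placeholders.

```python
# Contrast a single decision tree with a random forest on the same data:
# the ensemble usually reduces the overfitting of individual trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("decision tree CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```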

Support Vector Machines (SVM)
High classification accuracy and good theoretical guarantees against overfitting; with an appropriate kernel function, it can also perform well when the data are not linearly separable in the original feature space. SVMs are very popular in text classification, where the dimensionality is usually very high.

Disadvantages: it requires a lot of memory, and tuning its parameters can be cumbersome.
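As a rough illustration of both points (kernels for non-linearly-separable data, and the parameter tuning that makes SVMs fiddly), here is a hedged scikit-learn sketch; the dataset and parameter grid are arbitrary.

```python
# An RBF-kernel SVM handles data that are not linearly separable, but C and
# gamma typically need tuning, e.g. by cross-validated grid search.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # not linearly separable

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```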

Better data is often more important than better algorithms, and extracting good features also requires a lot of effort. If your data set is very large, then the choice of classification algorithms may not have a significant impact on the final classification performance (so you can choose based on speed or ease of use).

If you really care about classification accuracy, you should try a variety of classifiers and pick the best performer based on cross-validation results. Or, take a lesson from the Netflix Prize (and Middle Earth) and use some sort of ensemble method to combine multiple classifiers, as sketched below.
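The following is a minimal sketch of that advice (illustrative, not from the original article): compare several classifiers by cross-validation, then combine them with a simple soft-voting ensemble; the dataset and model settings are placeholders.

```python
# Compare several classifiers by cross-validation, then combine them
# with a voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = [
    ("nb", GaussianNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),   # probability=True enables soft voting
]
for name, clf in models:
    print(name, cross_val_score(clf, X, y, cv=5).mean())

ensemble = VotingClassifier(estimators=models, voting="soft")
print("ensemble", cross_val_score(ensemble, X, y, cv=5).mean())
```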
