Comparison of several classical machine learning algorithms

Source: Internet
Author: User

This is a question and answer from Quora; I thought the reply was very good.

What are the advantages of different classification algorithms? For instance, if we have a large training data set with around 10,000 instances and more than 100,000 features, which classifier would be best for classifying the test data set?

Here are some general guidelines I've found over the years.

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN or logistic regression), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models.
(My understanding: NB is a high bias/low variance model, so it can achieve very good results on a small training set; but because the model itself lacks expressive power (higher bias), its classification performance stops improving as the data set grows.) You can also think of this as a generative model vs. discriminative model distinction.
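The trade-off above can be sketched empirically. This is a minimal illustration using scikit-learn on synthetic data (both the library and the data set are my assumptions; the answer names neither): compare Naive Bayes and logistic regression on a small and a larger slice of the same training set.

```python
# Sketch of the bias/variance trade-off: NB vs. logistic regression
# as the training set grows. Synthetic data, purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 2000):  # small vs. larger training set
    nb = GaussianNB().fit(X_train[:n], y_train[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(n, round(nb.score(X_test, y_test), 3), round(lr.score(X_test, y_test), 3))
```

On many such data sets the gap between the two narrows or flips as `n` grows, which is the "start to win out" effect the answer describes; the exact numbers depend on the data.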

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, a NB classifier still often performs surprisingly well in practice. A good bet if you want to do some kind of semi-supervised learning, or want something embarrassingly simple that performs pretty well.
(Advantages: simple model; converges fast under the independence assumption; works well on small data sets; suited to multi-class problems.) (Disadvantages: depends on the conditional independence assumption; high bias, so classification performance is limited as the data set grows.)
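The "just doing a bunch of counts" point can be made concrete with a multinomial NB on word counts. This is a toy sketch using scikit-learn (an assumption; the documents and labels below are made up for illustration):

```python
# Multinomial Naive Bayes on tiny toy documents: the features are
# literally word counts, and fitting is essentially counting them per class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free money now", "win money free", "meeting at noon", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # sparse word-count matrix
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["free money"])))  # -> ['spam'] on this toy data
```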
Advantages of Logistic Regression: Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you're unsure, or to get confidence intervals) or if you expect to receive more training data in the future and want to be able to quickly incorporate it into your model. (Advantages: has a probabilistic interpretation; easy to incorporate new data into the model (online updates).) (Disadvantage: can only handle linearly separable problems.)

Advantages of Decision Trees: Easy to interpret and explain (for some people -- I'm not sure I fall into this camp). Non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature X, class B in the mid-range of feature X, and A again at the high end). Their main disadvantage is that they easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of classification problems (usually slightly ahead of SVMs, I believe), they're fast and scalable, and you don't have to worry about tuning a bunch of parameters as you do with SVMs, so they seem to be quite popular these days. (Advantages: easy to interpret; non-parametric; no need to worry about linear separability.) (Disadvantage: easy to overfit.)
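The A / B / A pattern described above (class A at the low end of a feature, B in the middle, A again at the high end) is exactly what a tree handles with two splits and a linear model cannot. A minimal sketch, assuming scikit-learn and a made-up one-dimensional data set:

```python
# A decision tree easily fits the non-linear A/B/A pattern on one feature.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.linspace(0, 3, 300).reshape(-1, 1)
y = np.where((X[:, 0] > 1) & (X[:, 0] < 2), "B", "A")   # A | B | A bands

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[0.5], [1.5], [2.5]]))  # -> ['A' 'B' 'A']
```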

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space. Especially popular in text classification problems, where very high-dimensional spaces are the norm. Memory-intensive and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown. (Advantages: high accuracy; not prone to overfitting; can handle linearly non-separable problems.) (Disadvantages: large memory requirements; cumbersome parameter tuning.)

kNN: (Advantages: simple idea; can handle linearly non-separable data; insensitive to outliers.) (Disadvantage: high time and space complexity at classification time.)
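The kernel point can be sketched on data that is not linearly separable in the base feature space. This is an illustration with scikit-learn's `SVC` on concentric circles (both the library and the data set are my assumptions):

```python
# A linear SVM fails on concentric circles; an RBF kernel separates them,
# because the kernel implicitly maps the data into a richer feature space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print(linear.score(X, y), rbf.score(X, y))  # rbf should be near 1.0
```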

To go back to the particular question of logistic regression vs. decision trees (which I'll assume to be a question of logistic regression vs. random forests) and summarize a bit: both are fast and scalable, random forests tend to beat out logistic regression in terms of accuracy, but logistic regression can be updated online and gives you useful probabilities. And since you're at Square (not quite sure what an inference scientist is, other than the embodiment of fun) and possibly working on fraud detection: having probabilities associated with each classification might be useful if you want to quickly adjust thresholds to change false positive/false negative rates; and regardless of the algorithm you choose, if your classes are heavily imbalanced (as often happens with fraud), you should probably resample the classes or adjust your error metrics to make the classes more equal.
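The threshold-adjustment point can be sketched as follows, assuming scikit-learn and a synthetic imbalanced data set standing in for fraud: with predicted probabilities in hand, you trade false positives for false negatives just by moving the cutoff, without retraining anything.

```python
# With an imbalanced "fraud" class, lowering the decision threshold
# raises recall on the rare class (at the cost of more false positives).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]          # probability of the rare class

for threshold in (0.5, 0.1):                # default cutoff vs. a looser one
    flagged = proba >= threshold
    recall = (flagged & (y == 1)).sum() / (y == 1).sum()
    print(threshold, round(recall, 2))
```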

But...

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, your choice of classification algorithm might not really matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to combine them all!
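The "try a bunch of classifiers and pick by cross-validation" advice can be sketched in a few lines, again assuming scikit-learn and a synthetic data set:

```python
# Compare several classifiers by 5-fold cross-validation and keep the best.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)

models = {
    "naive_bayes": GaussianNB(),
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

(For the "combine them all" option, scikit-learn's `VotingClassifier` wraps the same models into one ensemble.)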
