Summary of the top ten data mining algorithms--core ideas, advantages and disadvantages, application fields

Last Update:2017-06-26 Source: Internet

Author: User

------------------------------------------------------------------------------------

Reprints are welcome; please include a link to the original:

http://blog.csdn.net/iemyxie/article/details/40736773

------------------------------------------------------------------------------------

This article summarizes only the core idea of each algorithm. For implementation details, see the other articles in this blog's "Data Mining Algorithm Learning" category, which is updated from time to time.

Drawing on a range of sources and my own understanding, I group the top ten algorithms as follows:

• Classification algorithms: C4.5, CART, AdaBoost, Naive Bayes, KNN, SVM

• Clustering algorithm: K-means

• Statistical learning: EM

• Association analysis: Apriori

• Link mining: PageRank

The EM algorithm can also be used for clustering. However, EM iterates slowly and runs much slower than K-means, while K-means clusters nearly as well as EM, so K-means is generally used for clustering instead of EM. The main use of EM is parameter estimation, so it is placed under statistical learning here. SVM has also contributed significantly to regression analysis and statistics, in addition to holding its place among classification algorithms; on balance, it is still placed under classification here. Readers who see the grouping differently are welcome to leave a comment to discuss.

Each algorithm is described below.

Classification algorithm--C4.5 (for a detailed explanation, see "Data Mining Algorithm Learning (5): C4.5 Algorithm")

• Core idea: builds a decision tree that classifies data using the information gain ratio as the splitting criterion

• Advantages: the resulting classification rules are easy to understand, and accuracy is high

• Disadvantages: constructing the tree requires repeated sequential scans and sorting of the data set, which makes the algorithm inefficient

• Application areas: clinical decision making, manufacturing, document analysis, bioinformatics, spatial data modeling, etc.
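
As a rough illustration of the splitting criterion, the sketch below computes the gain ratio of two attributes on a toy data set. The data and the function names (`entropy`, `gain_ratio`) are made up for this example; a full C4.5 implementation also handles continuous attributes, missing values, and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy that remains after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: one useful attribute and one unique row identifier
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
row_id = ["a", "b", "c", "d", "e", "f"]
buys = ["no", "no", "yes", "yes", "yes", "yes"]

print(gain_ratio(outlook, buys), gain_ratio(row_id, buys))
```

Both attributes separate the classes perfectly here, but the unique `row_id` has a larger split information and therefore a lower gain ratio. This is how the gain ratio discourages attributes with many distinct values.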

Classification algorithm--CART (for a detailed explanation, see "Data Mining Algorithm Learning (6): CART Algorithm")

• Core idea: recursively partitions the data with binary splits, using the minimum Gini index as the splitting criterion

• Advantages: the extracted rules are simple and easy to understand; robust to problems such as missing values and a large number of variables

• Disadvantages: each selected attribute may produce only two child nodes; when there are too many categories, the error may grow quickly

• Application areas: information distortion detection, identifying potential telecom customers, predicting loan risk, etc.
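
The following minimal sketch shows the Gini criterion for a single binary split on one numeric attribute. The data and the helper names (`gini`, `best_binary_split`) are assumptions for illustration; a real CART implementation applies this search recursively over all attributes and then prunes the tree.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_binary_split(xs, ys):
    """Return (threshold, weighted_gini) for the best split of the form x <= threshold."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: loan risk by applicant age
ages = [22, 25, 30, 35, 40, 52]
risk = ["high", "high", "high", "low", "low", "low"]
threshold, score = best_binary_split(ages, risk)
print(threshold, score)
```

Each split produces exactly two child nodes, which matches the restriction listed among the disadvantages above.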

Classification algorithm--AdaBoost (for a detailed explanation, see "Data Mining Algorithm Learning (8): AdaBoost Algorithm")

• Core idea: train different classifiers (weak classifiers) on the same training set, then combine these weak classifiers into a stronger final classifier (strong classifier)

• Advantages: high accuracy; simple, with no need for feature selection; rarely overfits in practice

• Disadvantages: training takes a long time; the results depend on the choice of weak classifier

• Application areas: widely used in face detection, object recognition, and other fields
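
To make the idea of combining weak classifiers concrete, here is a small sketch of AdaBoost that uses threshold stumps on one-dimensional inputs as the weak classifiers. The stump form, the toy data, and the function names are assumptions chosen for brevity, not the only way to set up the algorithm.

```python
import math

def train_adaboost(xs, ys, rounds):
    """AdaBoost with threshold stumps on 1-D inputs; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # entries are (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Pick the stump with the lowest weighted error
        for t in sorted(set(xs)):
            for polarity in (1, -1):
                preds = [polarity if x <= t else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity, preds)
        err, t, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Raise the weights of misclassified samples, lower the rest, renormalize
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (p if x <= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can classify perfectly
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = train_adaboost(xs, ys, rounds=3)
print([adaboost_predict(model, x) for x in xs])
```

On this data every individual stump misclassifies at least three points, while the weighted vote of three stumps fits the whole training set.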

Classification algorithm--Naive Bayes (for a detailed explanation, see "Data Mining Algorithm Learning (3): Naive Bayes Algorithm")

• Core idea: starting from an object's prior probability, use Bayes' formula to compute its posterior probability, that is, the probability that the object belongs to each class, and assign the object to the class with the maximum posterior probability

• Advantages: the algorithm is simple, few parameters need to be estimated, and it is not very sensitive to missing data

• Disadvantages: classification performance degrades when there are many attributes or the attributes are strongly correlated

• Application areas: spam filtering, text classification, news classification, query classification, product classification, etc.
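
The sketch below applies the maximum-posterior rule to a tiny word-based spam filter. The multinomial word model, the add-one (Laplace) smoothing, the toy messages, and the function names are all assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs is a list of word lists; returns the counts needed for prediction."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def nb_predict(model, words):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing the product
            score += math.log((word_counts[label][w] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy messages
docs = [["win", "money", "now"], ["cheap", "money"],
        ["meeting", "tomorrow"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
nb_model = train_naive_bayes(docs, labels)
print(nb_predict(nb_model, ["money", "now"]))
```

Working with log probabilities turns the product of per-word likelihoods into a sum, which avoids numerical underflow on longer messages.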

Classification algorithm--KNN

• Core idea: if most of a sample's k most similar samples in the feature space (that is, its k nearest neighbors) belong to one category, the sample is assigned to that category as well

• Advantages: simple; no parameters to estimate and no training required; well suited to multi-class problems

• Disadvantages: computationally expensive; poor interpretability, since it cannot produce rules the way a decision tree does

• Application areas: customer churn prediction, fraud detection, etc. (better suited to classifying rare events)
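
A minimal sketch of the majority vote among the k nearest neighbors, assuming Euclidean distance and a brute-force search. The points, labels, and the function name `knn_predict` are invented for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label the query by majority vote among its k nearest training points."""
    # Brute force: compute the distance to every training point, then sort
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
knn_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
knn_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(knn_points, knn_labels, (2, 2), k=3))
```

There is no training step; all the work happens at query time, which is why the computational cost listed above grows with the size of the stored data.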

Classification algorithm--SVM (for a detailed explanation, see "Data Mining Algorithm Learning (7): SVM Algorithm")

• Core idea: construct an optimal separating hyperplane that maximizes the margin between the two classes of samples, which gives good generalization ability on classification problems

• Advantages: good generalization; handles nonlinear problems while avoiding the curse of dimensionality; finds the global optimum

• Disadvantages: low computational efficiency; heavy resource usage during computation

• Application areas: remote sensing image classification, monitoring the operating status of wastewater treatment processes, etc.
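
The margin idea can be sketched with a linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a simplification chosen for brevity: production SVM solvers usually solve the dual quadratic program and support kernels for nonlinear problems. The hyperparameters, toy data, and function names here are assumptions.

```python
def train_linear_svm(points, labels, lam=0.01, epochs=100, lr=0.1):
    """Minimize lam/2 * |w|^2 + hinge loss by sub-gradient descent.

    Labels must be +1 or -1. Returns the weight vector w and bias b.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is inside the margin: move the hyperplane toward fitting it
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is outside the margin: only the regularizer acts
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable toy data
svm_points = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
svm_labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(svm_points, svm_labels)
print([svm_predict(w, b, x) for x in svm_points])
```

The regularization term shrinks `w`, and a smaller `w` corresponds to a wider margin, so the two update branches together trade off margin width against margin violations.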

Clustering algorithm--K-means (for a detailed explanation, see "Data Mining Algorithm Learning (1): K-means Algorithm")

• Core idea: given the number of clusters k and a database of n data objects, output k clusters that satisfy the minimum-variance criterion

• Advantages: runs fast

• Disadvantages: the number of clusters k is an input parameter, and an unsuitable k may return poor results

• Application areas: image segmentation; analyzing product similarity to categorize goods; segmenting a company's customers so that different business strategies can be applied
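
The usual iterative procedure for this objective (Lloyd's algorithm) can be sketched as follows. Seeding the centroids with the first k points is an assumption made to keep the example deterministic; practical implementations typically use random or k-means++ initialization. The data and function name are also invented.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two groups of three points
km_points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
             (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
km_centroids, km_clusters = kmeans(km_points, k=2)
print(km_centroids)
```

Running the same function with a different k shows the sensitivity mentioned above: the algorithm always returns k clusters, whether or not the data has k natural groups.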

Statistical learning--EM

• Core idea: alternate an expectation step (E-step) and a maximization step (M-step) to maximize the expected likelihood

• Advantages: simple and stable

• Disadvantages: slow to converge, needs many iterations, and easily gets stuck in a local optimum

• Application areas: parameter estimation; data clustering in computer vision
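
As one concrete instance of the E-step and M-step, the sketch below fits a mixture of two one-dimensional Gaussians. The mixture model, the starting values, the toy data, and the function names are assumptions for illustration; EM applies to many other latent-variable models as well.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, mean_a, mean_b, iterations=50):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    weights, means, variances = [0.5, 0.5], [mean_a, mean_b], [1.0, 1.0]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate the parameters from the responsibility-weighted data
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

# Hypothetical toy data drawn around 1.0 and 5.0
em_data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
em_weights, em_means, em_vars = em_two_gaussians(em_data, mean_a=0.0, mean_b=6.0)
print(em_means)
```

The result depends on the starting means, which reflects the local-optimum issue listed among the disadvantages.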

Association analysis--Apriori

• Core idea: mines association rules with a two-stage approach based on frequent itemsets

• Advantages: simple, easy to understand, low requirements on the data

• Disadvantages: heavy I/O load; generates an excessive number of candidate itemsets

• Application areas: consumer market price analysis, intrusion detection, mobile communications
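
The frequent-itemset stage can be sketched as a level-wise search: count candidates of size k, keep the frequent ones, and join them into candidates of size k+1. The transactions, the support threshold, and the function name are made up for this example, and the rule-generation stage is omitted.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # One pass over the data per level to count the candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical shopping baskets
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
frequent_sets = apriori(baskets, min_support=3)
print(len(frequent_sets))
```

Each level requires another pass over all transactions, and the join step can produce many candidates, which is the source of the I/O load and candidate explosion noted above.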

Link mining--PageRank

• Core idea: a page that many high-quality pages link to must itself be a high-quality page; this principle is used to determine the importance of all pages

• Advantages: completely independent of the query; relies only on the link structure of the web, so it can be computed offline

• Disadvantages: ignores the timeliness of web search. Old pages rank highly because they have existed for a long time and accumulated many in-links, while new pages carrying the latest information rank poorly because they have almost no in-links

• Application areas: web page ranking
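
A minimal sketch of computing the ranks by power iteration. The damping factor of 0.85, the handling of pages without out-links, the four-page link graph, and the function name are assumptions for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among its out-links
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: D is a new page with no in-links
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page D receives only the baseline share because nothing links to it, which mirrors the disadvantage described above for new pages.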

Updated on: 2014-12-10
