Summary of the top ten data mining algorithms--core ideas, advantages and disadvantages, application fields


This article only outlines the core idea of each algorithm. For implementation details, see the other articles under this blog's "Data Mining Algorithm Learning" category, which is updated from time to time. Please credit the source when reprinting, thank you.

Based on a range of references and my own understanding, the ten algorithms are grouped as follows:

• Classification algorithms: C4.5, CART, AdaBoost, Naive Bayes, KNN, SVM

• Clustering algorithm: K-means

• Statistical learning: EM

• Association analysis: Apriori

• Link mining: PageRank

The EM algorithm can also be used for clustering, but it iterates slowly and runs much slower than K-means, while the clustering quality of K-means is not noticeably worse than that of EM. For that reason K-means is generally used for clustering rather than EM. The main role of EM is parameter estimation, so it is placed under statistical learning. SVM has made significant contributions to regression analysis and statistics and also holds an established place among classification algorithms, so it is placed under classification. Readers who see the grouping differently are welcome to leave a comment.

Each algorithm is described below.

Classification algorithm--C4.5 (for a detailed explanation, see "Data Mining Algorithm Learning (5): C4.5 Algorithm")

• Core idea: classify data using the information gain ratio as the criterion for choosing split attributes

• Advantages: the resulting classification rules are easy to understand, and accuracy is high

• Disadvantages: while the tree is being built, the data set must be scanned and sorted many times, which makes the algorithm inefficient

• Application areas: clinical decision making, manufacturing, document analysis, bioinformatics, spatial data modeling, etc.
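
To make the information gain ratio concrete, here is a minimal Python sketch for a single categorical attribute. The function names and the toy data are illustrative assumptions, not taken from the original series, and the sketch covers only the split criterion, not tree construction or pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain ratio of splitting `labels` by the categorical attribute `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: both attributes separate the classes perfectly,
# but the ID-like attribute is penalised for having one value per row.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
row_id = ["a", "b", "c", "d"]
print(gain_ratio(outlook, play))  # 1.0
print(gain_ratio(row_id, play))   # 0.5
```

Dividing by the split information is what distinguishes the gain ratio from plain information gain: an attribute with a unique value per row no longer wins automatically.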

Classification algorithm--CART (for a detailed explanation, see "Data Mining Algorithm Learning (6): CART Algorithm")

• Core idea: recursively partition the data, using a Gini index estimate function based on minimum distance as the splitting criterion

• Advantages: the extracted rules are simple and easy to understand; robust to missing values and to large numbers of variables

• Disadvantages: each selected attribute may produce only two child nodes (binary splits); when there are too many classes, the error may grow quickly

• Application areas: information distortion identification, identifying potential telecom customers, forecasting loan risk, etc.
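
As a rough illustration of a Gini-based binary split, the sketch below scans candidate thresholds on one numeric attribute and keeps the one with the lowest weighted Gini impurity. It uses the plain Gini impurity rather than the minimum-distance variant mentioned above, and the names and data are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_binary_split(xs, labels):
    """Return (threshold, weighted_gini) of the best split of the form x <= threshold."""
    n = len(xs)
    best = (None, float("inf"))
    candidates = sorted(set(xs))
    # Try the midpoint between each pair of adjacent distinct values
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, labels) if x <= t]
        right = [y for x, y in zip(xs, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: two well-separated groups
xs = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_binary_split(xs, labels))  # (6.5, 0.0)
```

A full CART tree would apply this search to every attribute and recurse on the two resulting subsets.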

Classification algorithm--AdaBoost (for a detailed explanation, see "Data Mining Algorithm Learning (8): AdaBoost Algorithm")

• Core idea: train different classifiers (weak classifiers) on the same training set, then combine these weak classifiers into a stronger final classifier (strong classifier)

• Advantages: high accuracy; simple, with no feature selection required; less prone to overfitting than many other methods

• Disadvantages: training can take a long time, and the results depend on the choice of weak classifier

• Application areas: widely used in face detection, object recognition, and similar fields
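
The reweight-and-combine idea can be sketched with one-dimensional decision stumps as the weak classifiers. This is a minimal illustration under assumed names (`train_adaboost`, `adaboost_predict`) and assumed toy data with labels in {-1, +1}; it is not the implementation from the referenced article.

```python
import math

def train_adaboost(xs, ys, rounds=3):
    """Train stumps of the form 'polarity if x <= t else -polarity'.

    Returns a list of (alpha, threshold, polarity) tuples.
    """
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Pick the stump with the lowest weighted error
        for t in sorted(set(xs)):
            for polarity in (1, -1):
                preds = [polarity if x <= t else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity, preds)
        err, t, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, polarity))
        # Raise the weight of misclassified samples, lower the rest, renormalise
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return model

def adaboost_predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if x <= t else -pol) for alpha, t, pol in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single stump can classify correctly
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = train_adaboost(xs, ys, rounds=3)
print([adaboost_predict(model, x) for x in xs])
```

Each round concentrates weight on the samples the previous stumps got wrong, which is why the choice of weak classifier matters so much to the final result.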

Classification algorithm--Naive Bayes (for a detailed explanation, see "Data Mining Algorithm Learning (3): Naive Bayes Algorithm")

• Core idea: starting from the prior probability of an object, use Bayes' formula to compute its posterior probability, that is, the probability that the object belongs to each class, and assign the object to the class with the maximum posterior probability

• Advantages: the algorithm is simple, requires few parameters, and is relatively insensitive to missing data

• Disadvantages: when there are many attributes or the attributes are strongly correlated, classification performance degrades

• Application areas: spam filtering, text classification
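
For the spam-filtering use case, a minimal word-count Naive Bayes might look like the sketch below. Laplace (add-one) smoothing is an addition made here so that unseen words do not zero out a class; the function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of word lists. Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, y in zip(docs, labels):
        word_counts[y].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Return the class with the highest (log) posterior for the given words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for y, n_y in class_counts.items():
        score = math.log(n_y / total_docs)  # log prior P(y)
        total_words = sum(word_counts[y].values())
        for w in words:
            # Laplace-smoothed likelihood P(w | y), assuming words are independent
            score += math.log((word_counts[y][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Hypothetical training corpus
docs = [["win", "money", "now"], ["cheap", "money"],
        ["meeting", "at", "noon"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, ["money", "now"]))  # spam
print(predict_nb(model, ["meeting"]))       # ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.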

Classification algorithm--KNN

• Core idea: if most of the K samples most similar to a given sample (that is, its K nearest neighbors in feature space) belong to one category, then the sample belongs to that category as well

• Advantages: simple, no parameters to estimate, no training phase, well suited to multi-class problems

• Disadvantages: computationally expensive, poorly interpretable, and unable to produce rules the way a decision tree can

• Application areas: customer churn prediction, fraud detection, etc. (better suited to classifying rare events)
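
The majority vote among nearest neighbors fits in a few lines. The sketch below assumes Euclidean distance and Python 3.8+ (for `math.dist`); the function name and sample points are illustrative.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs. Returns the majority label among the k nearest points."""
    # "No training": every prediction scans and sorts the whole training set
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two groups in the plane
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # a
print(knn_predict(train, (5.5, 5.5)))  # b
```

The full scan per query is the source of the heavy computational cost listed above; spatial indexes such as k-d trees are a common way to reduce it.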

Classification algorithm--SVM (for a detailed explanation, see "Data Mining Algorithm Learning (7): SVM Algorithm")

• Core idea: construct an optimal separating hyperplane that maximizes the distance between the hyperplane and the closest samples of the two classes on either side of it, which gives good generalization ability on classification problems

• Advantages: good generalization ability; handles nonlinear problems while avoiding the curse of dimensionality; finds the global optimum

• Disadvantages: low computational efficiency and heavy resource usage during computation

• Application areas: remote sensing image classification, monitoring the operating state of sewage treatment processes, etc.
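
The maximum-margin idea can be approximated for the linear case by minimizing the regularized hinge loss with plain subgradient descent. The sketch below is only that: a linear, batch-gradient approximation with assumed hyperparameters (`lam`, `lr`, `epochs`) and assumed toy data, not a kernel SVM and not the quadratic-programming solver that production libraries use.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=500):
    """Minimise lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))) by subgradient descent.

    labels must be +1 or -1. Returns (w, b).
    """
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only samples inside the margin contribute
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(w, b, x):
    """Which side of the hyperplane w.x + b = 0 the point x lies on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable toy data
points = [(2, 2), (3, 3), (2, 3), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print([svm_predict(w, b, p) for p in points])
```

Only samples with margin below 1 influence the update, which mirrors the fact that the final hyperplane is determined by the support vectors alone.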

Clustering algorithm--K-means (for a detailed explanation, see "Data Mining Algorithm Learning (1): K-means Algorithm")

• Core idea: given the number of clusters k and a database of n data objects, output k clusters that satisfy the minimum-variance criterion

• Advantages: fast to run

• Disadvantages: the number of clusters k is an input parameter, and an unsuitable k may give poor results

• Application areas: image segmentation, grouping products by similarity, segmenting a company's customers so that different business strategies can be applied
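
The usual way to approach the minimum-variance criterion is Lloyd's iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below assumes Python 3.8+ and a deliberately naive initialization (the first k points) to keep it deterministic; names and data are illustrative.

```python
import math

def kmeans(points, k, iters=20):
    """Return (centroids, clusters) after at most `iters` Lloyd iterations."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append(tuple(sum(col) / len(c) for col in zip(*c)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two groups in the plane
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Because the result depends on both k and the starting centroids, practical implementations usually run several random restarts or use a smarter seeding scheme such as k-means++.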

Statistical learning--EM

• Core idea: alternate between an E-step (computing the expectation) and an M-step (maximizing it) to estimate parameters

• Advantages: simple and stable

• Disadvantages: slow to converge, needs many iterations, and easily gets stuck in a local optimum

• Application areas: parameter estimation, data clustering in computer vision
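
A common concrete instance of the E-step and M-step is fitting a mixture of Gaussians. The sketch below fits two one-dimensional Gaussians; the initialization (means at the data minimum and maximum), the variance floor, and the sample data are assumptions chosen to keep the example short and deterministic.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, iters=50):
    """Fit a two-component 1-D Gaussian mixture. Returns (means, variances, weights)."""
    n = len(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    mus = [min(data), max(data)]
    variances = [var_all, var_all]
    pis = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [pis[j] * normal_pdf(x, mus[j], variances[j]) for j in range(2)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate parameters from the responsibility-weighted data
        for j in range(2):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid a collapsing component
            pis[j] = nj / n
    return mus, variances, pis

# Hypothetical toy data drawn around 1.0 and 5.0
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mus, variances, pis = em_gmm_1d(data)
print(mus, pis)
```

Compared with K-means, each point is assigned to every component with a soft weight rather than to a single cluster, which is one reason an EM iteration costs more.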

Association analysis--Apriori

• Core idea: mine association rules with a two-stage approach based on frequent itemsets

• Advantages: simple, easy to understand, low requirements on the data

• Disadvantages: heavy I/O load, and it generates too many candidate itemsets

• Application areas: consumer market price analysis, intrusion detection, mobile communications
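
The first of the two stages, finding all frequent itemsets level by level, can be sketched as follows. The second stage (deriving rules from those itemsets) is not shown, and the function name, the absolute support threshold, and the basket data are assumptions for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset appearing in at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # One pass over the data per level: this is the I/O cost noted above
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market-basket data
baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"}, {"milk", "diapers", "beer"},
           {"bread", "milk", "diapers"}, {"bread", "milk", "beer"}]
result = apriori(baskets, min_support=3)
print(result[frozenset({"bread", "milk"})])  # 3
```

The prune step relies on the fact that any superset of an infrequent itemset is also infrequent, which is what keeps the candidate set from growing even faster.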

Link mining--PageRank

• Core idea: a page that is linked to by many high-quality pages must itself be a high-quality page; this principle is used to determine the importance of every page

• Advantages: completely independent of the query, relies only on the link structure of the web, and can be computed offline

• Disadvantages: ignores the timeliness of search results. Old pages rank very high because they have existed for a long time and accumulated many in-links, while new pages carrying the latest information rank very low because they have few in-links

• Application areas: web page ranking
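
The ranking can be computed offline by power iteration over the link graph. The sketch below assumes the commonly used damping factor of 0.85, assumes every link target also appears as a key of the input dictionary, and uses a made-up four-page graph.

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to. Returns {page: rank}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives a small baseline share (the "random jump")
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # A page splits its rank evenly among the pages it links to
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B and D; nothing links to D
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # C
print(min(rank, key=rank.get))  # D
```

Page D illustrates the weakness described above: with no in-links it stays at the baseline score no matter what it contains.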

This is an original article. Please credit the source when reprinting, thank you.
