Summary of 18 Classic Data Mining Algorithms

Last Update:2015-02-27 Source: Internet

Author: User

Tags id3 svm


All of the data mining code involved in this article is on my GitHub: https://github.com/linyiqun/DataMiningAlgorithm

It took me about two months to study the classic data mining algorithms and implement them in code. They cover decision classification, clustering, link mining, association mining, pattern mining, and more, and together they make a small introduction to the field of data mining. Here is a short summary of each algorithm, followed by a link to my corresponding blog post. I hope it helps you learn.

1. C4.5 algorithm. C4.5 and ID3 are both decision tree classification algorithms, and C4.5 is an improvement on ID3. ID3 chooses the splitting attribute by information gain, while C4.5 uses the gain ratio. Details Link: http://blog.csdn.net/androidlushangderen/article/details/42395865
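The difference between the two criteria can be sketched in a few lines. This is a minimal illustration, not the code from the linked post: the function names and the toy weather data are made up for the example. It shows why the gain ratio matters: an ID-like attribute with a unique value per row gets the highest information gain, but the gain ratio penalizes it.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of values."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for splitting on one attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Hypothetical toy data
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
row_id = [1, 2, 3, 4, 5, 6]  # unique per row, useless for prediction
play = ["no", "no", "yes", "yes", "no", "yes"]

gain_outlook, ratio_outlook = info_gain_and_ratio(outlook, play)
gain_id, ratio_id = info_gain_and_ratio(row_id, play)
# ID3 (gain) would prefer row_id; C4.5 (gain ratio) prefers outlook.
```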

2. CART algorithm. CART stands for Classification and Regression Tree. It builds a binary tree, using the Gini index (a measure similar to entropy) as the splitting criterion, and prunes the decision tree after it has been built. In my own implementation of the algorithm I used cost-complexity pruning.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/42558235
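As a rough sketch of the Gini-based binary split (pruning is not shown), the following picks the threshold on one numeric attribute that minimizes the weighted Gini index. The helper names and the toy data are assumptions for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try each threshold t (left: value <= t, right: value > t).

    Returns (threshold, weighted Gini index) of the best split.
    """
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value would leave the right side empty
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data
ages = [22, 25, 30, 35, 40, 50]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_binary_split(ages, buys)
```

A Gini index of 0 means both sides of the split are pure; 0.5 is the worst case for two classes.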

3. KNN (k-nearest neighbors) algorithm. Given a set of labeled training data and a new test point, find the k training points nearest to the test point and see which class holds the majority among them; the test point is assigned that class. The neighbors can also be given different weights: nearer points get larger weights, and farther points naturally get smaller ones. Details Link: http://blog.csdn.net/androidlushangderen/article/details/42613011
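A minimal sketch of the vote, with and without distance weighting, might look like this. The inverse-distance weight is one common choice and is an assumption here, as are the function name and the sample points.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3, weighted=True):
    """train: list of (point, label) pairs. Returns the predicted label."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        d = dist(point, query)
        # Assumed weighting scheme: 1/distance (small epsilon avoids division by zero)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical toy data: one nearby "A" point, two distant "B" points
train = [((1, 1), "A"), ((5, 5), "B"), ((5, 6), "B")]
plain = knn_predict(train, (2, 2), k=3, weighted=False)
by_weight = knn_predict(train, (2, 2), k=3, weighted=True)
```

With a plain majority vote the two distant "B" points win; with weighting, the single close "A" point outweighs them.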

4. Naive Bayes algorithm. Naive Bayes is one of the simpler classification algorithms in the Bayesian family. It relies on the important Bayes' theorem, which in one sentence is a rule for converting between conditional probabilities: it derives P(A|B) from P(B|A).

Details Link:http://blog.csdn.net/androidlushangderen/article/details/42680161
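To make the idea concrete, here is a small sketch of a categorical naive Bayes classifier. It scores each class by P(class) times the product of P(feature value | class). The add-one (Laplace) smoothing, the function names, and the toy data are assumptions added for the example and may differ from the linked implementation.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(y, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict_nb(model, row):
    """Return the class with the highest (unnormalized) posterior."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for y, n in class_counts.items():
        p = n / total  # prior P(y)
        for i, v in enumerate(row):
            # P(x_i | y) with add-one smoothing (an assumption of this sketch)
            p *= (feature_counts[(y, i)][v] + 1) / (n + len(values[i]))
        scores[y] = p
    return max(scores, key=scores.get)

# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
prediction = predict_nb(model, ("rain", "cool"))
```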

5. SVM (support vector machine) algorithm. A support vector machine classifies both linear and nonlinear data; nonlinear data is handled by means of a kernel function. One of the key steps is searching for the maximum-margin hyperplane.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/42780439

6. EM (expectation-maximization) algorithm. The EM algorithm alternates between two steps: an E-step (expectation) and an M-step (maximization). It is an algorithmic framework that, with each iteration, moves closer to the maximum likelihood or maximum a posteriori estimate of a statistical model's parameters.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/42921789
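As one concrete instance of the framework, the sketch below fits a mixture of two 1-D Gaussians. The choice of model, the starting values, the fixed iteration count, and the small floor on sigma are all assumptions made for this illustration; EM itself is more general.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def em_two_gaussians(data, mu=(0.0, 1.0), sigma=(1.0, 1.0), weight=(0.5, 0.5), iters=50):
    """Fit a two-component 1-D Gaussian mixture. Returns (mu, sigma, weight) lists."""
    mu, sigma, weight = list(mu), list(sigma), list(weight)
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate parameters to maximize the expected log-likelihood
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(sqrt(var), 1e-3)  # floor keeps the density finite
            weight[k] = nk / len(data)
    return mu, sigma, weight

# Hypothetical toy data: two well-separated groups around 1 and 5
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, sigmas, weights = em_two_gaussians(data, mu=(0.0, 6.0))
```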

7. Apriori algorithm. Apriori is an association rule mining algorithm. It mines frequent itemsets through join and prune operations, then derives association rules from the frequent itemsets; every derived rule must meet a minimum confidence threshold.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43059211
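The join and prune steps can be sketched as follows. This is a simplified illustration with made-up names and toy transactions; support is counted as an absolute number of transactions, which is an assumption of the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Hypothetical toy transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]
frequent = apriori(transactions, min_support=3)
conf_beer_diaper = confidence(frequent, {"beer"}, {"diaper"})
```

A rule would then be kept only if its confidence reaches the chosen minimum confidence.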

8. FP-Tree (frequent pattern tree) algorithm. This algorithm is also known as FP-growth. It overcomes Apriori's drawback of generating too many candidate itemsets: it recursively builds frequent pattern trees and then mines the trees. The subsequent process is the same as in Apriori.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43234309

9. PageRank (web page importance/ranking) algorithm. PageRank originated at Google. Its core idea is to use the number of inbound links to a web page as a measure of the page's quality. If a page contains multiple outbound links, its PR value is split evenly among them. PageRank is also vulnerable to link spam attacks.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43311943
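A minimal power-iteration sketch of the idea follows. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outbound links are assumptions of this example, as is the tiny link graph.

```python
def pagerank(links, damping=0.85, iters=100):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # A page's rank is split evenly across its outbound links
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Assumed convention: a page with no out-links spreads its rank everywhere
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical link graph
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

Page C, which has the most inbound links, ends up with the highest rank, while D, which nothing links to, keeps only the baseline share.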

10. HITS algorithm. HITS is another link analysis algorithm, and part of its principle is similar to PageRank. HITS introduces the concepts of an authority value and a hub value. Its results are affected by the user's query, it is generally used for link analysis on small data sets, and it is even more vulnerable to attack.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43311943
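The mutual reinforcement between authority and hub values can be sketched like this. The normalization by Euclidean length, the iteration count, and the sample graph are assumptions for illustration.

```python
from math import sqrt

def hits(links, iters=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority value: sum of the hub values of the pages linking in
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        # Hub value: sum of the authority values of the pages linked to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound
        a_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph (in practice built from the pages matching a query)
links = {
    "portal": ["news", "wiki", "blog"],
    "index": ["news", "wiki"],
    "blog": ["news"],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```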

11. k-means algorithm. k-means is a clustering algorithm. Here k is the number of clusters, so choosing it at the start is critical. The algorithm first picks k initial cluster centers, assigns each point to its nearest center by Euclidean distance, then takes the mean of the points in each cluster as the new cluster center, and repeats until it converges. Details Link: http://blog.csdn.net/androidlushangderen/article/details/43373159
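The loop described above can be sketched as follows. Passing the initial centers in explicitly is a simplification made for this example (real implementations usually pick them at random), and the names and points are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, centers, max_iters=100):
    """Cluster points around k centers. Returns (centers, clusters)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: the mean of each cluster becomes its new center
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data with two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
```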

12. BIRCH algorithm. The core of BIRCH is building a CF (clustering feature) tree. BIRCH scans the database and builds an initial CF-tree in memory, which can be seen as a multi-level compression of the data.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43532111

13. AdaBoost algorithm. AdaBoost is a boosting algorithm. It trains on the data multiple times to obtain several complementary classifiers, then combines them into a single, more accurate classifier. Details Link: http://blog.csdn.net/androidlushangderen/article/details/43635115
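A compact sketch of the training loop is shown below, using one-dimensional threshold classifiers (decision stumps) as the weak learners. The choice of stumps, the function names, and the ten-point toy data set are assumptions for the example.

```python
from math import exp, log

def train_adaboost(xs, ys, rounds=3):
    """xs: 1-D features, ys: labels in {-1, +1}.

    Returns a list of (alpha, threshold, sign) weak classifiers, where a
    stump predicts `sign` if x <= threshold and `-sign` otherwise.
    """
    n = len(xs)
    w = [1.0 / n] * n  # start with uniform sample weights
    model = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error
        best = None
        for t in sorted(set(xs)):
            for sign in (1, -1):
                preds = [sign if x <= t else -sign for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, sign, preds)
        err, t, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)      # weight of this classifier
        model.append((alpha, t, sign))
        # Raise weights of misclassified samples, lower the rest, renormalize
        w = [wi * exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict_adaboost(model, x):
    """Weighted vote of all weak classifiers."""
    score = sum(alpha * (sign if x <= t else -sign) for alpha, t, sign in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single stump can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = train_adaboost(xs, ys, rounds=3)
predictions = [predict_adaboost(model, x) for x in xs]
```

On this data each individual stump misclassifies at least three points, but the combination of three stumps fits all ten.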

14. GSP algorithm. GSP is a sequential pattern mining algorithm. It is also an Apriori-like algorithm: it performs join and prune operations as it runs, but the pruning check additionally applies time constraints and other conditions.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43699083

15. PrefixSpan algorithm. PrefixSpan is another sequential pattern mining algorithm. It does not generate candidate sets. Given an initial prefix pattern, it repeatedly moves elements from the suffix pattern onto the prefix pattern, and keeps mining recursively.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43766253
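The prefix-growing recursion can be sketched as below. To stay short, this version assumes every element of a sequence is a single item (no itemsets inside a sequence), which is a simplification compared with the full algorithm; names and data are illustrative.

```python
def prefixspan(sequences, min_support):
    """Return {pattern tuple: support} for sequences of single items."""
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per projected (suffix) sequence
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Projection: keep only what follows the first occurrence of item
            next_db = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, next_db)  # grow the prefix recursively

    mine((), [list(s) for s in sequences])
    return results

# Hypothetical toy sequences
sequences = [["a", "b", "c"], ["a", "c"], ["a", "b"], ["b", "c"]]
patterns = prefixspan(sequences, min_support=2)
```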

16. CBA (classification based on association rules) algorithm. CBA is an integrated mining algorithm: it builds on association rule mining and performs classification using the mined association rules. Data processing is needed only at the start of the algorithm, to convert the data into a transaction-like form.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43818787

17. RoughSets (rough set) algorithm. Rough set theory is a relatively novel idea in data mining. Here rough sets are used for attribute reduction: the upper and lower approximation sets are used to remove redundant attributes before the rules are output.

Details Link:http://blog.csdn.net/androidlushangderen/article/details/43876001
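The approximation sets and a simple redundancy check can be sketched as follows. The check used here (an attribute is dispensable if dropping it leaves the positive region unchanged) is one standard formulation and may differ from the linked implementation; the table and names are made up.

```python
def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    classes = {}
    for i, row in enumerate(rows):
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(i)
    return list(classes.values())

def approximations(rows, attrs, target):
    """Return (lower, upper) approximation of the index set `target`."""
    lower, upper = set(), set()
    for block in partition(rows, attrs):
        if block <= target:   # block lies entirely inside the target
            lower |= block
        if block & target:    # block overlaps the target
            upper |= block
    return lower, upper

def is_redundant(rows, attrs, attr, decision):
    """True if dropping attr leaves the positive region unchanged."""
    def positive_region(a):
        region = set()
        for d_block in partition(rows, [decision]):
            region |= approximations(rows, a, d_block)[0]
        return region
    rest = [a for a in attrs if a != attr]
    return positive_region(attrs) == positive_region(rest)

# Hypothetical decision table; rows 2 and 4 conflict on the decision "flu"
rows = [
    {"headache": "yes", "temp": "high",   "cough": "yes", "flu": "yes"},
    {"headache": "yes", "temp": "normal", "cough": "yes", "flu": "no"},
    {"headache": "no",  "temp": "high",   "cough": "yes", "flu": "yes"},
    {"headache": "no",  "temp": "normal", "cough": "yes", "flu": "no"},
    {"headache": "no",  "temp": "high",   "cough": "yes", "flu": "no"},
]
attrs = ["headache", "temp", "cough"]
has_flu = {i for i, r in enumerate(rows) if r["flu"] == "yes"}
lower, upper = approximations(rows, attrs, has_flu)
```

Rows in the lower approximation certainly have the flu given the attributes; rows in the upper approximation possibly do.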

18. gSpan algorithm. gSpan belongs to the field of graph mining and is mainly used to mine frequent subgraphs. Compared with other graph algorithms, subgraph mining is their prerequisite or foundation. gSpan uses concepts such as DFS codes, edge five-tuples, and rightmost-path subgraph extension, so the algorithm is rather abstract and complex. Details Link: http://blog.csdn.net/androidlushangderen/article/details/43924273


This article is an English version of an article originally published in Chinese on aliyun.com.
