In December 2006, the IEEE International Conference on Data Mining (ICDM), an authoritative international academic organization, selected the top ten classic data mining algorithms: C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
And it is not only the ten winners: any one of the 18 algorithms nominated in the selection could fairly be called a classic, and all of them have had a profound influence on the field of data mining.
1. C4.5
C4.5 is a decision-tree classification algorithm in machine learning whose core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following ways:
1) It selects attributes by information gain ratio, overcoming ID3's bias toward attributes with many values when selecting by information gain;
2) It prunes during tree construction;
3) It can discretize continuous attributes;
4) It can handle incomplete data.
The advantages of C4.5 are that the classification rules it produces are easy to understand and its accuracy is high.
Its disadvantage is that the data set must be scanned and sorted repeatedly while the tree is being built, which makes the algorithm inefficient.
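To make the gain-ratio criterion concrete, here is a minimal Python sketch of how C4.5 might score an attribute; the toy data, function names, and two-attribute layout are illustrative assumptions, not part of the original article:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """Information gain ratio of splitting on the attribute at attr_index."""
    total = entropy(labels)
    n = len(rows)
    # Partition the label lists by the attribute's value.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    gain = total - sum(len(p) / n * entropy(p) for p in partitions.values())
    split_info = entropy([row[attr_index] for row in rows])  # intrinsic value of the split
    return gain / split_info if split_info > 0 else 0.0

# Toy data: each row is (outlook, windy); the label is whether to play.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
labels = ["no", "no", "yes", "no", "yes"]
best = max(range(2), key=lambda i: gain_ratio(rows, labels, i))
print(best)  # index of the attribute with the highest gain ratio
```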
2. The k-means algorithm
The k-means algorithm is a clustering algorithm that partitions n objects into k groups (k < n) according to their attributes.
It is very similar to the expectation-maximization algorithm for mixtures of normal distributions, in that both try to find the centers of natural clusters in the data. It assumes that the object attributes come from a vector space, and its goal is to minimize the sum of squared errors within each group.
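A minimal sketch of Lloyd's k-means iteration, assuming NumPy is available; the toy points and defaults are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's-algorithm k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids, labels = kmeans(X, k=2)
print(labels)  # e.g. [0 0 1 1]: the two tight groups are recovered
```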
3. Support Vector Machines
The support vector machine (commonly abbreviated SVM in the literature) is a supervised learning method widely used in statistical classification and regression analysis.
Support vector machines map vectors into a higher-dimensional space and construct a maximum-margin hyperplane there. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between these two parallel hyperplanes.
The assumption is that the larger the distance or gap between the parallel hyperplanes, the smaller the total error of the classifier. An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition".
Van der Walt and Barnard compared support vector machines with other classifiers.
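For a quick illustration, here is how a maximum-margin classifier might be fit with scikit-learn's SVC, assuming scikit-learn is installed; the toy data are made up:

```python
from sklearn.svm import SVC

# Toy 2-D data: two classes roughly separated by the line x0 + x1 = 2.
X = [[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)  # linear kernel: a separating hyperplane in input space
clf.fit(X, y)
print(clf.support_vectors_)                   # the points that pin down the margin
print(clf.predict([[0.5, 0.5], [2.5, 2.5]]))  # -> [0 1]
```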
4. The Apriori algorithm
The Apriori algorithm is one of the most influential algorithms for mining frequent itemsets for Boolean association rules.
Its core is a recursive algorithm based on the two-phase frequent-itemset idea. In the taxonomy of association rules, the rules it mines are single-dimensional, single-level, Boolean association rules. Here, the itemsets whose support exceeds the minimum support are called frequent itemsets, or frequency sets for short.
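A minimal sketch of the two-phase idea in Python (generate candidates, then count support); the transactions and the min_support value are illustrative assumptions:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: returns every frequent itemset (frozenset) with its count."""
    n = len(transactions)
    k_sets = {frozenset([i]) for t in transactions for i in t}  # 1-item candidates
    frequent = {}
    while k_sets:
        # Counting phase: support of each current candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        survivors = {c: v for c, v in counts.items() if v / n >= min_support}
        frequent.update(survivors)
        # Generation phase: join surviving k-itemsets into (k+1)-item candidates.
        # By the Apriori property, supersets of infrequent itemsets never appear.
        prev = list(survivors)
        k_sets = {a | b for a, b in combinations(prev, 2) if len(a | b) == len(a) + 1}
    return frequent

transactions = [frozenset(t) for t in (["milk", "bread"], ["milk", "diaper"],
                                       ["milk", "bread", "diaper"], ["bread"])]
for itemset, count in apriori(transactions, min_support=0.5).items():
    print(set(itemset), count)
```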
5. Expectation-maximization (EM) algorithm
In statistical computation, the expectation-maximization (EM) algorithm finds maximum-likelihood estimates of the parameters in a probabilistic model, where the model depends on unobservable hidden variables (latent variables).
EM is often used for data clustering in machine learning and computer vision.
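As a sketch of how EM might cluster data, here is a minimal two-component 1-D Gaussian mixture fit with NumPy; the initialization and toy data are illustrative choices, not a reference implementation:

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization: put the two means at the extremes of the data.
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])  # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibility-weighted data.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])
print(em_gmm_1d(x))  # weights, means, stds close to (0.5, 0.5), (0, 6), (1, 1)
```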
6. PageRank
PageRank is an important part of Google's algorithm.
It was granted a U.S. patent in September 2001, and the patent holder is Larry Page, a co-founder of Google. Hence the "page" in PageRank refers not to a web page but to Page himself, meaning the ranking method is named after him.
PageRank measures a site's value according to the quantity and quality of its external and internal links. The concept behind PageRank is that each link to a page is a vote for that page: the more it is linked, the more votes it receives from other sites.
This is called "link popularity", a measure of how many other people are willing to link their sites to yours. The concept of PageRank is borrowed from academic citation: the more often a paper is cited, the more authoritative it is generally judged to be.
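The voting idea can be sketched with power iteration; the example below assumes NumPy, and the damping factor 0.85 and toy link graph are illustrative conventions, not Google's actual implementation:

```python
import numpy as np

def pagerank(links, d=0.85, iters=100):
    """Power-iteration PageRank over an adjacency dict {page: [pages it links to]}."""
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    # Column-stochastic link matrix: M[j, i] = 1/outdegree(i) if i links to j.
    M = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            M[idx[dst], idx[src]] = 1.0 / len(outs)
    r = np.full(n, 1.0 / n)  # start from a uniform distribution over pages
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r  # damping d models a random jump to any page
    return dict(zip(pages, r))

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))  # "c" scores highest: it receives links from both "a" and "b"
```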
7. AdaBoost
AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (a strong classifier). The algorithm works by changing the data distribution: the weight of each sample is adjusted according to whether it was classified correctly in the previous round and according to the accuracy of the previous overall classification.
The reweighted data set is then handed to the next weak classifier for training, and finally the classifiers obtained in each round are fused into the final decision classifier.
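Here is a minimal sketch of that reweighting loop, using single-feature threshold stumps as the weak classifier; the stump learner and toy data are illustrative assumptions:

```python
import numpy as np

def adaboost(X, y, rounds=10):
    """Minimal AdaBoost with threshold stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)   # uniform initial sample weights
    ensemble = []             # list of (alpha, feature, threshold, sign)
    for _ in range(rounds):
        best = None
        # Weak learner: exhaustively pick the best single-feature threshold stump.
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                for s in (1, -1):
                    pred = s * np.where(X[:, f] <= t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, s)
        err, f, t, s = best
        err = max(err, 1e-10)                  # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        pred = s * np.where(X[:, f] <= t, 1, -1)
        w *= np.exp(-alpha * y * pred)         # boost the weights of misclassified samples
        w /= w.sum()
        ensemble.append((alpha, f, t, s))
    return ensemble

def predict(ensemble, X):
    """Fuse the weak classifiers by their alpha-weighted vote."""
    votes = sum(a * s * np.where(X[:, f] <= t, 1, -1) for a, f, t, s in ensemble)
    return np.sign(votes)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
model = adaboost(X, y, rounds=5)
print(predict(model, X))  # -> [ 1.  1. -1. -1.]
```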
8. kNN: k-nearest neighbor classification
The k-nearest neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea is that if the majority of the k samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a particular category, then the sample belongs to that category as well.
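A minimal Python sketch of majority voting over Euclidean nearest neighbors; the toy points and k = 3 default are illustrative:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote of its k nearest training points (Euclidean)."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # -> "a"
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # -> "b"
```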
9. Naive Bayes
Among the many classification models, the two most widely used are the decision tree model and the naive Bayesian model (NBC).
The naive Bayesian model originates from classical mathematical theory, has a solid mathematical foundation, and delivers stable classification performance. At the same time, the NBC model needs only a small number of parameters to be estimated, is not very sensitive to missing data, and the algorithm itself is simple. In theory, the NBC model has the smallest error rate of any classification method.
In practice this is not always the case, because the NBC model assumes that attributes are independent of one another, an assumption that often fails in real applications and that hurts its classification accuracy. When the number of attributes is large or the correlations between attributes are strong, the NBC model is less efficient than the decision tree model; when attribute correlations are weak, the NBC model performs best.
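A minimal sketch of a categorical naive Bayes classifier with add-one (Laplace) smoothing; the toy weather data and the smoothing choice are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors, per-class attribute value counts, and attribute vocabularies."""
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (class, attr_index) -> Counter of attribute values
    vocab = defaultdict(set)        # attr_index -> set of values seen for that attribute
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(label, i)][v] += 1
            vocab[i].add(v)
    return priors, counts, vocab, len(labels)

def predict_nb(model, row):
    """Choose the class maximizing P(class) * prod_i P(value_i | class)."""
    priors, counts, vocab, n = model
    best_label, best_score = None, -1.0
    for label, class_count in priors.items():
        score = class_count / n
        for i, v in enumerate(row):
            # Laplace (add-one) smoothing so unseen values never zero out the product.
            score *= (counts[(label, i)][v] + 1) / (class_count + len(vocab[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rain", "mild")))  # -> "yes"
```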
10. CART: classification and regression trees
CART stands for Classification and Regression Trees.
There are two key ideas underlying a classification tree: the first is recursively partitioning the space of the independent variables; the second is pruning with validation data.
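A minimal sketch of the first idea, searching for the single split that minimizes Gini impurity; a full CART would apply this recursively to each side and then prune. The toy data are illustrative:

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Find the single-feature threshold that minimizes weighted Gini impurity."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between observed values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # -> (0.0, 3.0): splitting at x <= 3 separates the classes
```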
Source: http://blog.csdn.net/aladdina/
The write-ups of the ten algorithms are all reprinted articles found on the web; most of the content comes from Baidu Baike, with some from Wikipedia and other sites.