In December 2006, the IEEE International Conference on Data Mining (ICDM), an authoritative international academic forum, selected the top ten classic algorithms in the field of data mining: C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, Naive Bayes, and CART.
Nor is it only the ten winners: in fact, any of the 18 candidate algorithms considered in the selection can be regarded as a classic, as each has had a profound impact on the field of data mining.
1. C4.5
C4.5 is a decision-tree classification algorithm in machine learning. Its core is the earlier ID3 algorithm: C4.5 inherits the strengths of ID3 and improves on it in the following ways:
1) It selects attributes by information gain ratio, which corrects the bias toward attributes with many values;
2) it prunes during tree construction;
3) it can discretize continuous attributes;
4) it can handle incomplete data.
The C4.5 algorithm has these advantages: the generated classification rules are easy to understand, and accuracy is high. Its drawback is that building the tree requires the data set to be scanned and sorted repeatedly, which makes the algorithm inefficient.
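To make the decision-tree idea concrete, here is a minimal sketch assuming scikit-learn is available. Note that scikit-learn implements an optimized CART rather than C4.5 itself, but criterion="entropy" uses the same information-theoretic splitting family as ID3/C4.5:

```python
# Minimal entropy-based decision-tree sketch (illustrative, not C4.5 itself).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth acts as a simple pre-pruning control
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```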
2. The k-means algorithm
The k-means algorithm is a clustering algorithm that partitions n objects into k groups (k < n) according to their attributes. It is closely related to the expectation-maximization algorithm for mixtures of Gaussian distributions, in that both attempt to find the centers of the natural clusters in the data. It assumes that the object attributes form a vector space, and the goal is to minimize the total within-cluster sum of squared errors.
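As an illustration, here is a minimal NumPy sketch of Lloyd's algorithm, the standard iteration behind k-means; the function and parameter names are ours, not from the original text:

```python
# Minimal Lloyd's-algorithm sketch for k-means (illustrative).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centers by picking k distinct data points at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
```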
3. Support Vector Machines
Support vector machine (SVM) is a supervised learning method widely used in statistical classification and regression analysis. An SVM maps the input vectors into a higher-dimensional space and constructs a maximum-margin hyperplane there: two parallel hyperplanes are built on either side of the separating hyperplane, and the separating hyperplane is chosen to maximize the distance between them. The assumption is that the larger the distance (margin) between the parallel hyperplanes, the smaller the classifier's total error. An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition". Van der Walt and Barnard compared SVMs with other classifiers.
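A brief, hedged example follows, assuming scikit-learn is available; the RBF kernel performs the implicit mapping into a higher-dimensional space described above:

```python
# Illustrative maximum-margin classification with an RBF-kernel SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # C trades margin width against training errors
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```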
4. The Apriori algorithm
The Apriori algorithm is the most influential algorithm for mining the frequent itemsets of Boolean association rules. Its core is an iterative, level-wise algorithm based on the two-phase frequent-itemset idea. The association rules it produces are single-dimensional, single-level, Boolean association rules. All itemsets whose support is no lower than the minimum support threshold are called frequent itemsets.
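The two-phase idea, generating candidates from the previous level and then pruning by minimum support, can be sketched in pure Python; everything below is illustrative and the names are ours:

```python
# Minimal Apriori sketch (illustrative, not optimized).
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # level 1: frequent single items
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in level}

    k = 2
    while level:
        # candidate generation: join frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune candidates with any infrequent (k-1)-subset, then count support
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= min_support]
        frequent.update({s: support(s) for s in level})
        k += 1
    return frequent

data = [{"milk", "bread"}, {"milk", "diapers"},
        {"milk", "bread", "diapers"}, {"bread"}]
print(apriori(data, min_support=0.5))
```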
5. The Expectation-Maximization (EM) algorithm
In statistical computation, the expectation-maximization (EM) algorithm finds maximum-likelihood estimates of the parameters of a probabilistic model when the model depends on unobservable latent variables. EM is frequently used for data clustering in machine learning and computer vision.
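A compact NumPy sketch of EM for a two-component, one-dimensional Gaussian mixture follows; it is an illustration only, and all parameter names are ours:

```python
# Illustrative EM for a 1-D two-component Gaussian mixture.
import numpy as np

def em_gmm_1d(x, n_iter=50):
    # crude initialization of weights, means, and variances
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * pdf
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_gmm_1d(x))
```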
6. PageRank
PageRank is an important part of Google's algorithm. In September 2001 it was granted a United States patent, held by Larry Page, one of Google's co-founders. Thus the "Page" in PageRank refers not to a web page but to Page himself; the ranking method is named after him.
PageRank measures a site's value based on the quantity and quality of its external and internal links. The concept underlying PageRank is that each link to a page counts as a vote for that page: the more links, the more votes it receives from other sites. This is the so-called "link popularity", a measure of how many people are willing to link their sites to yours. The concept derives from citation frequency in academic papers: the more often a paper is cited by others, the more authoritative it is generally assumed to be.
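The "every link is a vote" idea reduces to a power iteration over a link matrix. Below is an illustrative NumPy sketch; the tiny four-page graph and the damping factor of 0.85 are assumptions for the demo:

```python
# Illustrative power-iteration PageRank (dangling nodes handled crudely).
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9):
    n = len(adj)
    # column-stochastic transition matrix: column j spreads page j's vote
    out = adj.sum(axis=0)
    M = adj / np.where(out == 0, 1, out)
    rank = np.full(n, 1.0 / n)
    while True:
        new = (1 - damping) / n + damping * M @ rank
        if np.abs(new - rank).sum() < tol:
            return new
        rank = new

# adj[i, j] = 1 means page j links to page i
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj))
```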
7. AdaBoost
AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: based on whether each sample was classified correctly in the previous round, and on the accuracy of the previous overall classification, it adjusts the weight of each sample. The re-weighted data set is passed to the next weak learner for training, and the classifiers obtained in each round are finally combined into the final decision classifier.
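A short sketch follows, assuming scikit-learn is available; decision stumps are the conventional weak classifiers. (In older scikit-learn releases the estimator parameter is spelled base_estimator.)

```python
# Illustrative AdaBoost over decision stumps.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each round re-weights the samples the previous stumps got wrong
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```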
8. KNN: K-Nearest Neighbor Classification
The k-nearest neighbor (KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea: if the majority of the k samples most similar to a given sample (that is, its nearest neighbors in feature space) belong to a certain category, then the sample belongs to that category as well.
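The majority-vote idea fits in a few lines of NumPy; the following is illustrative, and names such as knn_predict are ours:

```python
# Minimal k-nearest-neighbor classifier (illustrative).
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training sample
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = y_train[np.argsort(d)[:k]]
    # majority vote among the k nearest neighbors
    return np.bincount(nearest).argmax()

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> 1
```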
9. Naive Bayes
Among the many classification models, the two most widely used are the decision tree model and the naive Bayes classifier (naive Bayesian model, NBC). The naive Bayes model originates in classical mathematical theory, giving it a solid mathematical foundation and stable classification performance. The NBC model also needs to estimate only a small number of parameters, is not sensitive to missing data, and is comparatively simple. In theory the NBC model has the minimum error rate among classification methods, but in practice this is not always the case: the NBC model assumes that attributes are mutually independent, an assumption that often fails in real applications, and this hurts its classification accuracy. When the number of attributes is large or the correlations between attributes are strong, the NBC model classifies less well than the decision tree model; when attribute correlations are weak, the NBC model performs best.
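As an illustration, here is a short sketch with scikit-learn's GaussianNB (assumed available), which applies the attribute-independence assumption discussed above:

```python
# Illustrative Gaussian naive Bayes classification.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()  # estimates one mean/variance per class and feature
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```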
10. CART: Classification and Regression Trees
CART stands for classification and regression trees. Two key ideas underlie the classification tree: the first is to recursively partition the space of the independent variables; the second is to prune using validation data.
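Both ideas can be sketched with scikit-learn, whose tree implementation is an optimized CART: grow a regression tree by recursive partitioning, then pick the cost-complexity pruning strength ccp_alpha on held-out validation data. The synthetic data and the candidate alphas below are assumptions for the demo:

```python
# Illustrative CART regression tree with validation-based pruning.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# grow trees at several pruning strengths, choose by validation score
best_alpha, best_score = 0.0, -np.inf
for alpha in [0.0, 0.001, 0.01, 0.1]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score
print("chosen ccp_alpha:", best_alpha, "validation R^2:", round(best_score, 3))
```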
Source: http://blog.csdn.net/aladdina/
The summaries of the above ten algorithms were compiled from material reposted around the Internet; most of the content comes from Baidu Encyclopedia, with some from Chinese Wikipedia and other web pages.