Top 10 typical data mining algorithms

Source: Internet
Author: User
Tags: id3, svm

In December 2006, the IEEE International Conference on Data Mining (ICDM), an authoritative international academic conference, identified the top ten classic algorithms in the field of data mining: C4.5, K-means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, Naive Bayes, and CART.

It is not only the selected top ten that matter: in fact, all 18 algorithms that were nominated in the selection can be regarded as classic algorithms, and they have had a profound impact on the field of data mining.

 

1. C4.5

C4.5 is a classification decision tree algorithm in machine learning. Its core is the ID3 algorithm; C4.5 inherits the strengths of ID3 and improves on it in the following ways:

1) The information gain ratio is used to select attributes, which overcomes the bias toward attributes with many values;
2) pruning is performed during tree construction;
3) continuous attributes can be discretized;
4) incomplete data can be handled.

The C4.5 algorithm has the following advantages: the generated classification rules are easy to understand, and accuracy is high. Its disadvantage is that the dataset must be scanned and sorted repeatedly during tree construction, which makes the algorithm inefficient.
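
As a rough illustration of the gain-ratio criterion that distinguishes C4.5 from ID3, here is a minimal Python sketch (not the full tree-building algorithm); the toy weather rows, labels, and column indices are purely illustrative:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr_index, labels):
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):          # partition rows by the attribute's value
        groups.setdefault(row[attr_index], []).append(label)
    cond_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond_entropy
    split_info = entropy([row[attr_index] for row in rows])   # penalises many-valued attributes
    return gain / split_info if split_info > 0 else 0.0

# Toy example: pick the column with the highest gain ratio.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
print("best attribute index:", max(range(2), key=lambda i: gain_ratio(rows, i, labels)))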

 

2. The K-means algorithm

The K-means algorithm is a clustering algorithm that partitions n objects into k clusters (k < n) based on their attributes. It is very similar to the expectation-maximization algorithm for mixtures of normal distributions, because both try to find the centers of natural clusters in the data. It assumes that object attributes form a vector space, and the goal is to minimize the sum of squared errors within each cluster.
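
A minimal Python sketch of the standard Lloyd-style K-means loop follows; the 2-D points, the value of k, and the iteration limit are illustrative choices rather than part of any particular library:

import random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)                       # initial centers picked from the data
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                     # assign each point to its nearest center
            i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)                  # recompute each center as the cluster mean
        ]
        if new_centers == centers:                           # stop once the centers no longer move
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
centers, clusters = kmeans(points, k=2)
print(centers)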

 

3. Support Vector Machines

A Support Vector Machine (SVM) is a supervised learning method that is widely used in statistical classification and regression analysis. An SVM maps the input vectors into a higher-dimensional space, in which a maximum-margin hyperplane is constructed. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between the two parallel hyperplanes. The assumption is that the larger the distance, or margin, between the parallel hyperplanes, the smaller the total error of the classifier. An excellent guide is "A Tutorial on Support Vector Machines for Pattern Recognition" by C. J. C. Burges. Van der Walt and Barnard compared SVMs with other classifiers.
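
For a concrete example, the following sketch trains a maximum-margin classifier with scikit-learn's SVC class, assuming scikit-learn is installed; the toy points and labels are illustrative:

from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]   # two well-separated groups
y = [0, 0, 1, 1]

clf = SVC(kernel="linear", C=1.0)   # linear kernel: look for the separating hyperplane directly
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))            # expected: [0 1]
print(clf.support_vectors_)                              # the points that define the margin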

 

4. The Apriori algorithm

The Apriori algorithm is the most influential algorithm for mining frequent itemsets for Boolean association rules. Its core is a recursive algorithm based on the two-phase frequent-itemset idea. Within the classification of association rules, it belongs to single-dimension, single-level, Boolean association rules. Here, all itemsets whose support is no lower than the minimum support are called frequent itemsets (frequent sets).
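
The following minimal sketch illustrates the level-wise idea: grow candidate itemsets one item at a time and keep only those whose support meets the minimum. The transactions and minimum support are illustrative, and the classical candidate-pruning step is omitted for brevity:

def apriori(transactions, min_support):
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Join step: form k-item candidates from (k-1)-item frequent sets, then keep
        # only those candidates whose support is at least min_support.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

transactions = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"], ["milk", "eggs"]]
print(apriori(transactions, min_support=0.5))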

 

5. Maximum expectation (EM) Algorithm

In statistics, the maximum expectation (EM, expectation-maximization) algorithm is used to find maximum likelihood estimates of parameters in a probabilistic model, where the model depends on unobservable latent variables. EM is frequently used for data clustering in machine learning and computer vision.
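
Here is a minimal sketch of EM for a mixture of two one-dimensional Gaussians; for brevity the variances are fixed at 1 and the mixing weights are assumed equal, and the data and starting means are illustrative:

from math import exp, pi, sqrt

def normal_pdf(x, mu):
    return exp(-(x - mu) ** 2 / 2) / sqrt(2 * pi)   # unit-variance Gaussian density

def em_two_gaussians(data, mu1, mu2, iters=50):
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point (estimate of the latent variable).
        resp = [normal_pdf(x, mu1) / (normal_pdf(x, mu1) + normal_pdf(x, mu2)) for x in data]
        # M-step: update the means using the responsibilities as soft weights.
        mu1 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / sum(1 - r for r in resp)
    return mu1, mu2

data = [0.1, -0.2, 0.3, 4.9, 5.1, 5.3]
print(em_two_gaussians(data, mu1=0.0, mu2=1.0))   # the means drift toward the two clusters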

 

6. PageRank

PageRank is an important part of the Google algorithm. In September 2001 it was granted a United States patent; the inventor named on the patent is Larry Page, one of the founders of Google. Therefore, the "Page" in PageRank refers not to a web page but to Page himself; that is, the ranking method is named after Larry Page.

PageRank measures a site's value based on the number and quality of its external and internal links. The concept behind PageRank is that every link pointing to a page is a vote for that page: the more links a page receives, the more votes it gets from other sites. This is the so-called "link popularity", a measure of how many people are willing to link their websites to yours. The concept of PageRank is derived from citation frequency in academic papers: the more often a paper is cited by others, the more authoritative it is generally judged to be.
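
A minimal power-iteration sketch of this idea follows; the link graph, damping factor, and iteration count are illustrative and do not reflect Google's actual implementation:

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for p, outgoing in links.items():
            if not outgoing:                      # a dangling page shares its rank with everyone
                for q in pages:
                    new_rank[q] += damping * rank[p] / len(pages)
            else:
                for q in outgoing:                # each link passes on a share of the page's rank
                    new_rank[q] += damping * rank[p] / len(outgoing)
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))                             # "C" ends up with the highest rank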

 

7. AdaBoost

AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers to form a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: it sets the weight of each sample based on whether that sample was classified correctly in the previous round and on the accuracy of the previous overall classification. The re-weighted dataset is then passed to the next weak classifier for training, and finally the classifiers obtained in each round are combined into the final decision classifier.
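
The following sketch implements that re-weighting loop on one-dimensional data with simple threshold "stumps" as the weak classifiers; the data, the +1/-1 labels, and the number of rounds are illustrative:

from math import log, exp

def train_adaboost(xs, ys, rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n                        # start with a uniform sample distribution
    ensemble = []                                   # list of (alpha, threshold, direction)
    for _ in range(rounds):
        best = None
        for thr in xs:                              # pick the stump with the lowest weighted error
            for sign in (1, -1):
                preds = [sign if x >= thr else -sign for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign, preds)
        err, thr, sign, preds = best
        err = max(err, 1e-10)
        alpha = 0.5 * log((1 - err) / err)          # more accurate stumps get a larger vote
        ensemble.append((alpha, thr, sign))
        # Re-weight: misclassified samples gain weight, correctly classified ones lose weight.
        weights = [w * exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (sign if x >= thr else -sign) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]
model = train_adaboost(xs, ys)
print([predict(model, x) for x in xs])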

 

8. KNN: K-Nearest Neighbor Classification

The k-nearest neighbor (KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea of the method is: if most of the K samples most similar to a given sample (that is, its nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to that category.
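
A minimal sketch of this majority-vote rule follows; the training points, labels, and the choice of k are illustrative:

from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Sort training samples by squared Euclidean distance to the query point.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)), label)
        for p, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in dists[:k]]
    return Counter(nearest).most_common(1)[0][0]   # majority class among the k neighbors

train_points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (4.8, 5.0)]
train_labels = ["red", "red", "blue", "blue"]
print(knn_predict(train_points, train_labels, query=(1.1, 1.0), k=3))   # -> "red"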

 

9. Naive Bayes

Among the many classification models, the two most widely used are the decision tree model and the Naive Bayes model (naive Bayesian classifier, NBC). The Naive Bayes model originates from classical mathematical theory, has a solid mathematical foundation, and offers stable classification performance. At the same time, the NBC model requires very few estimated parameters, is not sensitive to missing data, and the algorithm is relatively simple. In theory, the NBC model has the minimum error rate compared with other classification methods. In practice this is not always the case, because the NBC model assumes that attributes are independent of each other, an assumption that often does not hold in real applications, which affects its classification accuracy. When the number of attributes is large or the correlation between attributes is significant, the classification performance of the NBC model is inferior to that of the decision tree model; when the attribute correlation is small, the NBC model performs best.
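
The following minimal sketch shows the conditional-independence assumption at work for categorical features; the toy rows, labels, and the simple Laplace smoothing are illustrative:

from collections import Counter, defaultdict

def train_nb(rows, labels):
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)            # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict_nb(class_counts, feature_counts, row):
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = count / total                        # prior P(class)
        for i, value in enumerate(row):              # times P(value | class) for each feature
            counts = feature_counts[(label, i)]
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)  # Laplace smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(*model, ("rain", "mild")))          # -> "yes"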

 

10. CART: Classification and Regression Trees

CART stands for Classification and Regression Trees. There are two key ideas behind classification trees: the first is to recursively partition the space of the independent variables; the second is to use validation data for pruning.
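
As a concrete example, the following sketch fits a small classification tree with scikit-learn's DecisionTreeClassifier, which implements an optimized CART-style algorithm (assumes scikit-learn is installed; the toy data and the depth limit standing in for pruning are illustrative):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[2.0, 1.0], [1.5, 2.0], [8.0, 7.5], [7.0, 8.0]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2)            # depth limit stands in for pruning here
tree.fit(X, y)
print(tree.predict([[1.8, 1.2], [7.5, 7.8]]))         # expected: [0 1]
print(export_text(tree, feature_names=["x1", "x2"]))  # shows the recursive binary splits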

 

Source: http://blog.csdn.net/aladdina/

The summaries of the above ten algorithms are compiled from material reposted on the Internet; most of the content comes from Baidu Encyclopedia, and a small part from Chinese Wikipedia and other web pages.

 

 

 
