Ten classical algorithms for data mining

Last Update: 2015-07-26 Source: Internet

Author: User

First, C4.5
C4.5 is a decision tree algorithm for classification in machine learning. A decision tree organizes decision nodes in a tree shape (in fact an inverted tree, with the root at the top). C4.5 is an improved version of ID3, the core decision tree algorithm, so anyone who understands how a decision tree is constructed already understands most of C4.5. Constructing a decision tree comes down to choosing, at each node, a good feature and split point to serve as that node's classification criterion.
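
The feature-selection step can be sketched as follows. This is a minimal illustration, assuming categorical features and using the gain ratio (information gain divided by split information), which is the criterion C4.5 uses in place of ID3's plain information gain. The feature names and toy rows are hypothetical, and continuous split points and pruning are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, feature):
    """Information gain of splitting on `feature`, divided by the split information."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    # Entropy of the partition itself; penalizes features with many values.
    split_info = -sum(len(g) / total * log2(len(g) / total) for g in groups.values())
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0

def best_feature(rows, labels):
    """Pick the feature with the highest gain ratio as the node's test."""
    return max(rows[0].keys(), key=lambda f: gain_ratio(rows, labels, f))

# Hypothetical toy data: feature names and values are made up for illustration.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
chosen = best_feature(rows, labels)
print(chosen)  # outlook
```

Here "outlook" separates the two classes perfectly while "windy" carries no information, so the former is chosen as the node's test.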

Second, the K-means algorithm
The K-means algorithm is a clustering algorithm that divides n objects into k partitions (k < n) according to their attributes. It is similar to the expectation-maximization algorithm for mixtures of normal distributions, in that both try to find the centers of natural clusters in the data. It assumes that the object attributes form a vector space, and its goal is to minimize the total squared error within each cluster.
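
A minimal sketch of the usual iteration (often called Lloyd's algorithm) is shown below. It assumes points are tuples of numbers, seeds the centers with the first k points, and runs a fixed number of iterations; those choices, along with the helper names, are illustrative assumptions rather than part of the original description.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to centers and recomputing centers."""
    centers = list(points[:k])  # assumption: seed with the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(points, k=2)
print(centers)
```

Each update step can only lower the within-cluster squared error, which is why the iteration settles on a (possibly local) minimum of the objective.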

Third, support vector machines
The support vector machine is usually abbreviated SVM (or SV machine). It is a supervised learning method that is widely used in statistical classification and regression analysis. A support vector machine maps vectors into a higher-dimensional space and builds a maximum-margin hyperplane in that space. Two parallel hyperplanes are constructed on either side of the hyperplane that separates the data, and the separating hyperplane is chosen so that the distance between the two parallel hyperplanes is as large as possible.
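
The margin idea can be illustrated without training anything. The sketch below assumes the common formulation in which the two parallel hyperplanes are w·x + b = +1 and w·x + b = -1, so the distance between them is 2/||w||. It only compares two hand-picked candidate hyperplanes on hypothetical toy data; it is not an SVM solver.

```python
from math import sqrt

def decision(w, b, x):
    """Signed score w.x + b of point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, b, points, labels):
    """True if every point lies on or outside its class's margin hyperplane."""
    return all(y * decision(w, b, x) >= 1 for x, y in zip(points, labels))

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))

# Hypothetical toy data: one point per class.
points = [(1.0, 1.0), (3.0, 3.0)]
labels = [-1, 1]

# Two hand-picked candidate separators (w, b); both satisfy the constraints.
candidates = {
    "diagonal": ((0.5, 0.5), -2.0),
    "vertical": ((1.0, 0.0), -2.0),
}
feasible = {name: wb for name, wb in candidates.items()
            if satisfies_margin(wb[0], wb[1], points, labels)}
best = max(feasible, key=lambda name: margin_width(feasible[name][0]))
print(best, margin_width(feasible[best][0]))
```

Both candidates separate the data, but the diagonal one leaves a wider gap between the classes, which is the property an SVM optimizes for.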

Fourth, the Apriori algorithm
The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Its core is a level-by-level recursive procedure based on the two-stage frequent itemset idea: first find the frequent itemsets, then derive rules from them. In terms of classification, the rules it mines are single-dimensional, single-level, Boolean association rules. Here, every itemset whose support is greater than the minimum support is called a frequent itemset.
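
The frequent-itemset stage can be sketched level by level as below. This is a minimal version that assumes support is measured as an absolute transaction count and treats a count equal to the threshold as frequent (a common convention); the transactions are hypothetical and rule generation is omitted.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=2)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The prune step is what keeps the search tractable: {bread, butter, milk} is discarded without counting because its subset {butter, milk} is not frequent.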

Fifth, the expectation-maximization (EM) algorithm
In statistical computation, the expectation-maximization (EM) algorithm finds maximum likelihood estimates of the parameters of a probabilistic model when that model depends on unobserved latent variables. EM is often used for data clustering in machine learning and computer vision.
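
As one concrete instance, the sketch below fits a mixture of two one-dimensional Gaussians, where the latent variable is which component produced each point. The model choice, the starting values, the fixed iteration count, and the variance floor are all assumptions made for illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_two_gaussians(data, mu, var, weight, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; `weight` is component 0's share."""
    mu, var = list(mu), list(var)
    n = len(data)
    for _ in range(iterations):
        # E-step: posterior probability that each point came from component 0.
        resp = []
        for x in data:
            p0 = weight * normal_pdf(x, mu[0], var[0])
            p1 = (1 - weight) * normal_pdf(x, mu[1], var[1])
            resp.append(p0 / (p0 + p1))
        # M-step: re-estimate the parameters from the soft assignments.
        n0 = sum(resp)
        n1 = n - n0
        mu[0] = sum(r * x for r, x in zip(resp, data)) / n0
        mu[1] = sum((1 - r) * x for r, x in zip(resp, data)) / n1
        # A small floor keeps a variance from collapsing to zero.
        var[0] = max(sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, data)) / n0, 1e-6)
        var[1] = max(sum((1 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, data)) / n1, 1e-6)
        weight = n0 / n
    return mu, var, weight

# Hypothetical toy data: five values near 1 and five values near 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mu, var, weight = em_two_gaussians(data, mu=[0.0, 6.0], var=[1.0, 1.0], weight=0.5)
print(mu, var, weight)
```

The estimated means end up close to 1 and 5 with roughly equal mixing weights, which matches how the toy data was constructed.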

Sixth, PageRank
PageRank is an important part of Google's algorithm. Its U.S. patent was granted in September 2001 and names Larry Page, one of Google's founders, as the inventor. The "Page" in PageRank therefore refers to Larry Page rather than to a web page; the ranking method is named after him. PageRank measures the value of a site based on the number and quality of the external and internal links that point to it. The idea behind PageRank is that each link to a page is a vote for that page: the more links a page receives, the more votes it has been given by other sites.
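
The voting idea can be sketched as a simple power iteration. The version below assumes the widely cited random-surfer formulation with a damping factor of 0.85, which the text above does not specify; the link graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy graph: pages a, b, and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

Page c collects votes from three pages and ranks highest. Page a ranks second even though only one page links to it, because that one vote comes from the highly ranked c; this is the "quality" part of the measure.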

Seventh, AdaBoost
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (a strong classifier). The algorithm works by changing the distribution of the data: it sets the weight of each sample according to whether that sample was classified correctly in the current round and according to the overall accuracy of the previous round. The reweighted data set is passed to the next classifier for training, and finally the classifiers obtained in each round are combined into the final decision classifier.
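
A minimal sketch of this reweighting loop is given below, assuming the standard binary AdaBoost update with labels in {+1, -1} and one-dimensional threshold rules (decision stumps) as the weak classifiers. The helper names and the toy data are hypothetical.

```python
from math import exp, log

def stump_predict(x, threshold, polarity):
    """Weak classifier: predict `polarity` below the threshold, its opposite otherwise."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)) + [max(xs) + 1]:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * log((1 - err) / err)     # this classifier's vote strength
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        weights = [w * exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single threshold can classify correctly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

Every individual stump misclassifies at least two of the six points, yet the weighted vote of three stumps classifies all of them correctly.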

Eighth, kNN: k-nearest neighbor classification
The k-nearest neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea is that if most of the k samples most similar to a given sample (that is, its nearest neighbors in the feature space) belong to one category, then the sample belongs to that category as well.
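
The rule translates almost directly into code. The sketch below assumes Euclidean distance as the similarity measure and a simple majority vote; the training points and labels are hypothetical. It uses `math.dist`, which requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """train is a list of (point, label) pairs; returns the majority label
    among the k training points closest to `query`."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two groups of labelled points.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
print(knn_classify(train, (2, 2)))  # a
print(knn_classify(train, (7, 7)))  # b
```

There is no training step: all the work happens at query time, which is why the method is simple to implement but can be slow on large data sets.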

Ninth, Naive Bayes
Among the many classification models, the two most widely used are the decision tree model and the naive Bayesian model (naive Bayesian classifier, NBC).
The naive Bayesian model originates from classical mathematical theory, so it has a solid mathematical foundation and stable classification efficiency. The NBC model also needs few parameters to be estimated, is not very sensitive to missing data, and is a relatively simple algorithm. In theory, the NBC model has the lowest error rate of any classification method.

In practice this is not always the case, because the NBC model assumes that the attributes are independent of one another. This assumption often does not hold, which hurts the classification accuracy of the NBC model. When there are many attributes or the correlation between attributes is strong, the NBC model classifies less well than the decision tree model. The NBC model performs best when the correlation between attributes is small.
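
The independence assumption is what makes the model so cheap to estimate: the class score is just the class prior multiplied by one per-attribute probability for each attribute. The sketch below assumes categorical attributes and adds Laplace (add-one) smoothing so that unseen values do not zero out a class; the smoothing, the function names, and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(rows, labels):
    """Count classes, and attribute values per (attribute index, class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values = defaultdict(set)            # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def predict_nb(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())

    def log_posterior(label):
        score = log(class_counts[label] / total)  # log prior
        for i, v in enumerate(row):
            count = value_counts[(i, label)][v]
            # Attributes are treated as independent given the class,
            # so their (smoothed) log probabilities simply add up.
            score += log((count + 1) / (class_counts[label] + len(values[i])))
        return score

    return max(class_counts, key=log_posterior)

# Hypothetical toy data: (weather, temperature) -> activity.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["out", "out", "in", "in"]
model = train_nb(rows, labels)
print(predict_nb(model, ("sunny", "mild")))  # out
print(predict_nb(model, ("rainy", "cool")))  # in
```

Only counts are stored, one table per attribute and class, which reflects the small number of parameters the model needs.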

Tenth, CART: classification and regression trees
CART stands for classification and regression trees. Two key ideas underlie the classification tree: the first is to recursively partition the space of the independent variables, and the second is to prune the resulting tree using validation data.
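
The first idea, recursive partitioning, can be sketched as follows. The sketch assumes numeric features, binary splits of the form "feature <= threshold", and Gini impurity as the split criterion, which is the usual choice for CART classification trees. Class labels are assumed to be strings so that leaves can be told apart from internal nodes. Pruning with validation data is not shown, and the toy data is hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means pure)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted Gini, feature index, threshold) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively partition the data; a leaf is the majority class label."""
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

# Hypothetical toy data: the first feature separates the two classes.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 6.0), (7.0, 5.0), (8.0, 4.0), (9.0, 6.0)]
labels = ["low", "low", "low", "high", "high", "high"]
tree = build_tree(rows, labels)
print(tree)  # (0, 3.0, 'low', 'high')
```

A fully grown tree like this tends to overfit on real data, which is where the second idea comes in: subtrees that do not improve accuracy on held-out validation data are cut back to leaves.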

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.