Research on the Ten Classical Algorithms in the Data Mining Field


Translator: July, January 15, 2011

-----------------------------------------

Reference:
In December 2006, ICDM (the IEEE International Conference on Data Mining), an internationally authoritative academic organization, selected the ten classical algorithms in the field of data mining:
C4.5, K-means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, Naive Bayes, and CART.
==============
Blogger's note:
1. The source literature is not the newest, but I have always been sensitive to algorithms and interested in the original text. While translating, I consulted some existing translations, but in my opinion their exposition is not accurate and stays in generalities.
I therefore offer this translation in the hope of giving readers a more authoritative and detailed document.
2. At the same time, you may pick one or two of these ten classical data mining algorithms for careful study and analysis when the occasion arises.
Some of my personal understanding has been added to this article; please judge for yourself.
---------------------------------------------------------------------


Here are the top ten classic algorithms chosen from the 18 candidates:

First, C4.5
C4.5 is a classification decision tree algorithm in machine learning.
It is an improved version of ID3, the core decision tree algorithm (a decision tree organizes decision nodes in the shape of a tree; in fact, an inverted tree), so anyone who basically understands how decision trees are built can construct one.
Building a decision tree essentially means choosing a good feature and split point as the classification criterion of the current node.

The improvements of C4.5 over ID3 are:
1. It selects attributes using the information gain ratio.
ID3 selects the attribute whose split yields the largest information gain. Information can be defined in many ways; ID3 uses entropy (a measure of impurity), i.e., the change in entropy value.

C4.5 uses the information gain ratio instead. Yes, the difference is exactly that one is the information gain and the other is the information gain ratio.
Generally speaking, a ratio is used for normalization. For example, take two runners: one starts at 10 m/s and is at 20 m/s one second later, while the other starts at 1 m/s and is at 2 m/s one second later.
If you simply take the difference in speed gained, the gap between the two is large; if you measure the relative increase instead, the two are the same (each doubled his speed). A sketch contrasting the two criteria in Python follows this list.

Therefore, C4.5 overcomes the bias that ID3 shows, when selecting attributes by information gain, toward attributes with many values.

2. Pruning during tree construction: when building the decision tree, it is best not to keep nodes with only a few elements hanging off them, otherwise overfitting easily results.
3. It can also handle non-discrete (continuous) data.
4. It can handle incomplete data.
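
To make the contrast in point 1 concrete, here is a minimal sketch of both criteria, assuming a small made-up list of (attribute value, class label) pairs; the function names are hypothetical, not part of any C4.5 implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a label list: a measure of impurity."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_and_gain_ratio(values, labels):
    """Information gain (ID3's criterion) and gain ratio (C4.5's criterion)
    for splitting `labels` by the categorical attribute `values`."""
    n = len(labels)
    remainder = 0.0   # entropy left after the split, weighted by subset size
    split_info = 0.0  # entropy of the split itself (penalizes many values)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        w = len(subset) / n
        remainder += w * entropy(subset)
        split_info -= w * log2(w)
    gain = entropy(labels) - remainder
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Toy data: attribute value and class label for six samples.
values = ['a', 'a', 'b', 'b', 'c', 'd']
labels = ['+', '+', '-', '-', '-', '-']
print(gain_and_gain_ratio(values, labels))
```

An attribute that shatters the data into many tiny subsets inflates the raw information gain, but its split information is also large, so the gain ratio pulls it back down.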


Second, the K-means algorithm
The K-means algorithm is a clustering algorithm that divides n objects into k partitions (k < n) according to their attributes.
It is similar to the expectation-maximization algorithm for mixtures of normal distributions (the fifth of the ten algorithms), as both try to find the centers of natural clusters in the data.
It assumes that the object attributes come from a vector space, and the goal is to minimize the sum of squared errors within each group.
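
A minimal sketch of the two alternating steps, assuming plain Python tuples for 2-D points and random initial centers (names and data are made up for illustration):

```python
import random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)  # pick k initial centers at random
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:
                centers[j] = tuple(sum(coord) / len(cluster)
                                   for coord in zip(*cluster))
    return centers

points = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.2), (8.0, 8.0), (9.0, 9.0)]
print(kmeans(points, k=2))
```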


Third, support vector machines
The support vector machine (SVM), sometimes abbreviated as SV machine in papers, is a supervised learning method widely used in statistical classification and regression analysis.
A support vector machine maps vectors into a higher-dimensional space and builds a maximum-margin hyperplane in that space.
Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen so as to maximize the distance between the two parallel hyperplanes.
The assumption is that the larger the distance or gap between the parallel hyperplanes, the smaller the total error of the classifier.
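
As a quick illustration, here is a minimal sketch using scikit-learn's SVC with a linear kernel, assuming scikit-learn is installed; the data are made up:

```python
from sklearn.svm import SVC

# Two well-separated clusters of 2-D points and their class labels.
X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear')   # a linear kernel gives a separating hyperplane
clf.fit(X, y)
print(clf.support_vectors_)  # the points that pin down the margin
print(clf.predict([[3, 3], [7, 7]]))
```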

An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition".
Van der Walt and Barnard compared support vector machines with other classifiers.


Fourth, the Apriori algorithm
The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules.
Its core is a recursive algorithm based on the two-stage frequent-itemset idea.

In the taxonomy of association rules, it mines single-dimensional, single-level, Boolean association rules.
Here, all itemsets whose support is no less than the minimum support are called frequent itemsets, or frequency sets.
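
Here is a minimal sketch of the level-wise candidate-generation-and-pruning loop, assuming a tiny made-up transaction list and an absolute minimum support count (names are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_support}
    result, k = set(freq), 2
    while freq:
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        # Count support and keep the frequent candidates.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}
        result |= freq
        k += 1
    return result

tx = [{'milk', 'bread'}, {'milk', 'diapers'},
      {'milk', 'bread', 'diapers'}, {'bread'}]
print(apriori(tx, min_support=2))
```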


Fifth, the expectation-maximization (EM) algorithm
In statistical computation, the expectation-maximization (EM) algorithm finds maximum likelihood estimates of the parameters of a probabilistic model, where the model depends on unobservable latent variables.

EM is frequently used for data clustering in machine learning and computer vision.
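
A minimal sketch of EM fitting a mixture of two 1-D Gaussians, assuming made-up data and starting guesses; the latent variable is which component generated each point:

```python
from math import exp, pi, sqrt

def gauss(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mu = [0.0, 4.0]    # initial means
var = [1.0, 1.0]   # initial variances
w = [0.5, 0.5]     # initial mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    r = []
    for x in data:
        p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
        s = sum(p)
        r.append([pk / s for pk in p])
    # M-step: re-estimate parameters from the responsibilities.
    for k in range(2):
        nk = sum(ri[k] for ri in r)
        mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
        var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / nk
        w[k] = nk / len(data)

print(mu, var, w)
```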


Sixth, PageRank
PageRank is an important part of Google's algorithm. The U.S. patent was granted in September 2001, and its named inventor is Larry Page, one of Google's founders.
Thus the "Page" in PageRank refers not to a web page but to Page himself; that is, the ranking method is named after him.

PageRank measures a site's value based on the quantity and quality of the site's external and internal links.
The concept behind PageRank is that each link to a page is a vote for that page: the more links it receives, the more it is being voted for by other sites.

This is the so-called "link popularity": a measure of how many people are willing to link their own site to yours.
The concept of PageRank is borrowed from the citation of academic papers: the more often a paper is cited, the more authoritative it is judged to be.
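
The voting idea can be sketched with a few lines of power iteration, assuming a tiny hypothetical link graph and the commonly used damping factor of 0.85 (this toy graph has no dangling pages, which a real implementation would have to handle):

```python
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # A page's rank comes from the pages linking to it, each vote
            # split evenly among that page's outgoing links.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
print(pagerank(links))
```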


Seventh, AdaBoost
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set,
and then assemble these weak classifiers to form a stronger final classifier (strong classifier).

The algorithm itself works by changing the data distribution: based on whether each sample in the training set was classified correctly in the current round, and on the accuracy of the previous overall classification, it determines a new weight for each sample.

The data set with the modified weights is passed to the next-level classifier for training, and finally the classifiers obtained from each round of training are combined into the final decision classifier.
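
Here is a minimal sketch of that reweighting loop with one-dimensional decision stumps, assuming labels in {-1, +1} and made-up toy data:

```python
from math import log, exp

def stump_predict(x, threshold, polarity):
    """A weak classifier: sign depends on which side of the threshold x is."""
    return polarity if x > threshold else -polarity

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n          # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for threshold in xs:
            for polarity in (+1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if stump_predict(x, threshold, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = max(err, 1e-10)                 # avoid division by zero
        alpha = 0.5 * log((1 - err) / err)    # weight of this weak classifier
        ensemble.append((alpha, threshold, polarity))
        # Reweight: misclassified samples get heavier, then normalize.
        w = [wi * exp(-alpha * y * stump_predict(x, threshold, polarity))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    s = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if s > 0 else -1

xs = [1, 2, 3, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])
```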


Eighth, kNN: k-nearest neighbor classification
The k-nearest neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms.

The idea of the method is: if the majority of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample belongs to that category as well.
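
A minimal sketch of majority voting among the k nearest neighbors, assuming squared Euclidean distance and a tiny made-up training set:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_tuple, label) pairs."""
    # Sort training samples by distance to the query point.
    by_distance = sorted(train, key=lambda item: sum(
        (a - b) ** 2 for a, b in zip(item[0], query)))
    # Majority vote among the k nearest neighbors decides the label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), 'red'), ((1, 2), 'red'),
         ((8, 8), 'blue'), ((9, 8), 'blue')]
print(knn_classify(train, query=(2, 2), k=3))
```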


Ninth, Naive Bayes
Among the many classification models, the two most widely used are the decision tree model and
the Naive Bayesian model (NBC).
The Naive Bayesian model originates in classical mathematical theory, has a solid mathematical foundation, and offers stable classification efficiency.

At the same time, the NBC model requires few parameters to estimate, is not very sensitive to missing data, and its algorithm is relatively simple.
In theory, the NBC model has the smallest error rate compared with other classification methods.

In practice, however, this is not always the case, because the NBC model assumes that the attributes are independent of one another; in real applications this assumption often does not hold, which affects the NBC model's classification accuracy. When the number of attributes is large or the correlation among attributes is significant, the efficiency of the NBC model is inferior to that of the decision tree model.

The NBC model performs best when the correlation among attributes is small.
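
A minimal sketch of a categorical Naive Bayes classifier, assuming a tiny made-up data set and add-one smoothing (the smoothing denominator here is one illustrative choice, not the only one):

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (feature_tuple, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    # feature_counts[label][position][value] = how often value appears
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in samples:
        for i, v in enumerate(features):
            feature_counts[label][i][v] += 1
    return class_counts, feature_counts, len(samples)

def classify_nb(model, features):
    class_counts, feature_counts, n = model
    best_label, best_score = None, float('-inf')
    for label, count in class_counts.items():
        # log P(label) plus sum of log P(feature_i | label): the naive
        # independence assumption lets each feature vote on its own.
        score = math.log(count / n)
        for i, v in enumerate(features):
            counts = feature_counts[label][i]
            score += math.log((counts[v] + 1) / (count + len(counts) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [(('sunny', 'hot'), 'no'), (('sunny', 'mild'), 'no'),
           (('rain', 'mild'), 'yes'), (('rain', 'cool'), 'yes')]
model = train_nb(samples)
print(classify_nb(model, ('rain', 'hot')))
```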


Tenth, CART: classification and regression trees
CART stands for Classification and Regression Trees. Two key ideas underlie the classification tree: the first
is recursively partitioning the space of the independent variables, and the second is pruning the tree with validation data. A sketch of the first idea follows.
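
Here is a minimal sketch of recursive binary splitting on one numeric feature using Gini impurity (validation-based pruning is omitted; names and data are made up):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label list: 0 means the node is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Find the threshold minimizing the weighted Gini impurity."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    return best

def grow(xs, ys, depth=0, max_depth=3):
    # Stop at a pure node, depth limit, or nothing left to split on.
    if gini(ys) == 0 or depth == max_depth or len(set(xs)) < 2:
        return Counter(ys).most_common(1)[0][0]   # leaf: majority label
    _, t = best_split(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (t, grow(*zip(*left), depth + 1, max_depth),
               grow(*zip(*right), depth + 1, max_depth))

xs = [1, 2, 3, 6, 7, 8]
ys = ['a', 'a', 'a', 'b', 'b', 'b']
print(grow(xs, ys))   # (threshold, left subtree, right subtree)
```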
OK, in the future I will pick one or two of these for detailed study and elaboration. That's all for now.
For the 18 candidate algorithms, see here:
http://www.cs.uvm.edu/~icdm/algorithms/CandidateList.shtml
