Algorithms in Data Mining


Original address: http://blog.csdn.net/taigw/article/details/19407297

At the 2006 ICDM (IEEE International Conference on Data Mining), the top ten algorithms in data mining were selected. They are:

1. C4.5

C4.5 is a family of algorithms for classification problems in machine learning and data mining. Its setting is supervised learning: given a dataset in which each tuple is described by a set of attribute values and belongs to exactly one of a set of mutually exclusive classes, the goal of C4.5 is to learn a mapping from attribute values to classes that can then be used to classify new instances whose class is unknown.

C4.5 was proposed by J. Ross Quinlan as an extension of his earlier ID3 algorithm. Both algorithms construct decision trees. A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Once a decision tree has been built, a tuple with an unknown class label is classified by tracing a path from the root node to a leaf node; the leaf holds the prediction for that tuple. An advantage of decision trees is that they require neither domain knowledge nor parameter setting, which makes them well suited to exploratory knowledge discovery.

Reference: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

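To make the attribute-selection step concrete, below is a minimal Python sketch of the gain-ratio criterion C4.5 uses to choose the splitting attribute. The toy dataset and the attribute names (outlook, windy) are invented for this example; a full C4.5 implementation would also handle continuous attributes, missing values, and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr_index, labels):
    """C4.5's split criterion: information gain normalized by split information."""
    n = len(rows)
    # Partition the label list by the attribute's values.
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy data: each row is (outlook, windy); the label says whether to play.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "no", "yes", "no"]
best = max(range(2), key=lambda i: gain_ratio(rows, i, labels))
print("best attribute to split on:", ["outlook", "windy"][best])  # -> windy
```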

2. k-means

The k-means algorithm is one of the most widely used partition-based clustering algorithms. It divides n objects into k clusters so that objects within a cluster have high similarity, where similarity is measured against the mean of the objects in the cluster. It resembles the expectation-maximization algorithm for mixtures of normal distributions in that both try to find the centers of the natural clusters in the data.

The algorithm first selects k objects at random, each of which initially represents the mean (center) of one cluster. Each remaining object is assigned to the nearest cluster according to its distance from each cluster center, after which the mean of each cluster is recomputed. This process repeats until the criterion function converges.

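Below is a minimal k-means sketch in Python with NumPy (random initialization, Euclidean distance); a production version would add better seeding such as k-means++ and explicit handling of empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random init, assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        # (An empty cluster would yield NaN here; real code must guard against it.)
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # criterion function has converged
            break
        centers = new_centers
    return centers, assign

# Two well-separated blobs as toy data.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, assign = kmeans(X, k=2)
print(centers)
```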

3. SVM

SVM (support vector machine) is a supervised learning method that is widely used in statistical classification and regression analysis.

Support vector machines map input vectors into a higher-dimensional space and construct a maximum-margin separating hyperplane there. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen so that the distance between the two parallel hyperplanes is maximized. The assumption is that the larger this distance (the margin) between the parallel hyperplanes, the smaller the classifier's overall error.

Reference: Christopher J. C. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition". Data Mining and Knowledge Discovery 2:121–167, 1998.

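As a brief illustration, here is a maximum-margin classifier trained with scikit-learn's SVC (this assumes scikit-learn is installed; the dataset is synthetic, and the RBF kernel supplies the implicit mapping to a higher-dimensional space):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, split into train and test sets.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # C trades margin width against training errors
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```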

4. Apriori

The Apriori algorithm deals with association rules (AR). Mining association rules is an important data mining task: it extracts valuable correlations between data items from large amounts of data. Typical questions answered by association rules are: "If a consumer buys product A, how likely is he to buy product B?" and "If he buys products C and D, what else is he likely to buy?"

Apriori is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Its core is a recursive procedure based on the two-stage frequent-set theory. In the taxonomy of association rules, it mines single-dimensional, single-level, Boolean rules. All itemsets whose support is at least the minimum support are called frequent itemsets (frequent sets); a frequent itemset that is not contained in any larger frequent itemset is called a maximal frequent itemset.
The basic idea for finding frequent itemsets in Apriori is to process the dataset in multiple passes. In the first pass, the algorithm simply counts the frequency of every single-element itemset and keeps those whose support is not below the minimum support; these are the frequent 1-itemsets. From the second pass onward, the algorithm loops until no new frequent itemsets are generated. In pass k, a set of candidate k-itemsets is produced from the frequent (k-1)-itemsets found in pass k-1; the database is then scanned to count the support of each candidate, and the candidates whose support meets the minimum support become the frequent k-itemsets.
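The compact Python sketch below shows Apriori's join, prune, and support-counting steps; the transactions and the minimum support are toy values invented for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Pass 1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    freq = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
    k = 2
    while freq[-1]:
        prev = freq[-1]
        # Join step: candidate k-itemsets from pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support of the surviving candidates against the database.
        freq.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return [s for level in freq for s in level]

tx = [{"milk", "bread"}, {"bread", "butter"}, {"milk", "bread", "butter"}, {"milk"}]
print(apriori(tx, min_support=0.5))
```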

5. EM (expectation-maximization algorithm)

In statistical computation, the EM algorithm finds maximum likelihood or maximum a posteriori estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM is often used for data clustering in machine learning and computer vision. The algorithm alternates between two steps. The first step computes the expectation (E): using the current parameter estimates, it computes the expected log-likelihood over the latent variables. The second step is maximization (M): it finds the parameter values that maximize the expected likelihood computed in the E-step. The parameter estimates found in the M-step are used in the next E-step, and the process keeps alternating until convergence.

Reference: Arthur Dempster, Nan Laird, and Donald Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

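Here is a minimal EM sketch for a two-component one-dimensional Gaussian mixture (it assumes NumPy and SciPy are available; the initial guesses and the synthetic data are made up for the example):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture: weights, means, std devs."""
    w = np.array([0.5, 0.5])             # initial mixing weights
    mu = np.array([x.min(), x.max()])    # crude initial means
    sigma = np.array([1.0, 1.0])         # initial standard deviations
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        dens = w * norm.pdf(x[:, None], mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters to maximize the expected likelihood.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 200)])
print(em_gmm(x))
```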

6. PageRank

PageRank, also known as page rank or page level, is a search engine technology based on the hyperlinks between web pages. It is one of the elements of Google's page ranking and is named after Google co-founder Larry Page. Google uses it to reflect the relevance and importance of web pages, and it is one of the factors often used in search engine optimization (SEO) to evaluate the effectiveness of page optimization. Google's founders Larry Page and Sergey Brin invented the technology at Stanford University in 1998.

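Below is a power-iteration sketch of PageRank on a tiny made-up link graph; the damping factor of 0.85 is the value commonly quoted for the original algorithm.

```python
import numpy as np

def pagerank(links, d=0.85, n_iter=100, tol=1e-9):
    """Power iteration on a dict mapping each page to the pages it links to."""
    nodes = sorted(links)
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    # Column-stochastic matrix: M[j, i] = 1/outdegree(i) if i links to j.
    M = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            M[idx[dst], idx[src]] = 1.0 / len(outs)
    r = np.full(n, 1.0 / n)  # start from a uniform rank vector
    for _ in range(n_iter):
        r_new = (1 - d) / n + d * M @ r
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return dict(zip(nodes, r))

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # C, linked from both A and B, ranks highest
```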

7. AdaBoost

AdaBoost, short for adaptive boosting, is a simple but very practical supervised machine learning algorithm. When discussing boosting algorithms one cannot avoid mentioning bagging; both combine a group of weak classifiers into a single classifier and are collectively referred to as ensemble methods. The intuition is similar to investing: "don't put all your eggs in one basket." Although each weak classifier by itself is not very accurate, a combination of many weak classifiers can produce quite good results. Ensemble methods can in general also combine classifiers of different types, but in AdaBoost and other boosting algorithms the weak classifiers are all of the same type. The two families differ in how the weak classifiers are combined: boosting combines them with weights, AdaBoost being the representative example, while bagging gives every weak classifier equal weight, the representative example being the random forest. In a random forest, each weak classifier is a decision tree, and the output class is decided by a majority vote over the classifications of the individual trees.

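A brief example with scikit-learn's AdaBoostClassifier follows (it assumes scikit-learn is installed; by default its weak learners are depth-1 decision trees, i.e. stumps, and the synthetic data is invented for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 weak classifiers combined with weights learned by the boosting procedure.
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```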

8. kNN

The k-nearest neighbor (kNN) classification algorithm is one of the simplest machine learning algorithms. The idea is: if the majority of the k samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain class, then the sample also belongs to that class. In the kNN algorithm, the selected neighbors are objects that have already been correctly classified.

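Below is a minimal kNN classifier sketch (Euclidean distance, majority vote); the tiny training set is made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]            # the majority class wins

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> "b"
```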

9. Naive Bayes

Naive Bayes classifier: Bayesian classification rests on probabilistic inference, that is, on how to complete reasoning and decision-making tasks when various conditions are uncertain and only their probabilities of occurrence are known. Probabilistic inference is the counterpart of deterministic reasoning. The naive Bayes classifier is based on the assumption that each feature of a sample is uncorrelated with the other features. For example, if a fruit is red, round, and roughly 4 inches in diameter, it can be judged to be an apple.

Although these features may in fact depend on one another, or some features may be determined by others, the naive Bayes classifier treats all of them as contributing independently to the probability distribution over whether the fruit is an apple. Because it relies on an exact natural probability model, the naive Bayes classifier can achieve very good classification results on supervised training sets. In many practical applications, the parameters of a naive Bayes model are estimated by maximum likelihood; in other words, the naive Bayes model can be used without adopting Bayesian probability or any other Bayesian method.

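A small text-classification example with multinomial naive Bayes is shown below (it assumes scikit-learn is installed; the documents and labels are invented). Word counts serve as features, each contributing independently to the class probability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches", "project status meeting"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["cheap meeting pills"]))  # likely "spam"
```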

10. CART

CART (classification and regression trees) is a decision tree method.

A decision tree consists of the following parts:

Root: the root node; a decision tree is built on the concept of a tree, so it must have a root.
Decision node: a node at which the data is tested further against one of its attributes.
Branch: a subtree generated iteratively from a decision node; each branch corresponds to one value of the attribute tested at that node.
End node: also called a leaf node; the node at which the decision is actually made, where the attribute-by-attribute judgment of a sample ends.
Decision tree categories:

Classification tree: after a test sample is processed by a classification tree, the result is the class (category) the sample belongs to.

Regression tree: if the output for the test data is numeric, consider using a regression tree.

The term CART covers both of the above. Trees used for regression (regression trees) are grown in much the same way as trees used for classification (classification trees); the difference lies in how the splits are decided.
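scikit-learn's decision trees use an optimized version of CART, so both variants can be shown briefly (this assumes scikit-learn is installed; the tiny datasets are invented for the example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = ["low", "low", "low", "high", "high", "high"]  # class labels
y_value = [1.1, 1.9, 3.2, 10.5, 11.2, 11.8]              # numeric targets

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)  # classification tree
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_value)   # regression tree
print(clf.predict([[2.5]]), reg.predict([[2.5]]))
```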

See also: Machine Learning (Dragon Star Program): http://bigeye.au.tsinghua.edu.cn/DragonStar2012/download.html

