Ten algorithms for data mining

Source: Internet
Author: User
Tags: id3

The ten classic data mining algorithms of the big-data era are not merely a "top ten" list: in fact, any of the 18 algorithms that were nominated could be called a classic, and all of them have had a far-reaching influence on the field of data mining.


1. C4.5
C4.5 is a classification decision tree algorithm in machine learning; its core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following respects:
1) It uses the information gain ratio to select attributes, overcoming the bias of plain information gain toward attributes with many values.
2) It prunes during tree construction.
3) It can discretize continuous attributes.
4) It can handle incomplete data.
The advantages of C4.5 are that the resulting classification rules are easy to understand and the accuracy is high. Its disadvantage is that the data set must be repeatedly scanned and sorted during tree construction, which makes the algorithm inefficient.
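The gain-ratio criterion in point 1) can be sketched in a few lines. This is a minimal illustration assuming categorical attributes, not the full C4.5 implementation; the function names are my own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain ratio of one categorical attribute.

    values: the attribute's value for each example
    labels: the class label for each example
    """
    total = len(labels)
    # Group labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Information gain = H(labels) - weighted H(labels | attribute).
    cond = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    # Split information penalises attributes with many values,
    # which is exactly the bias of plain information gain.
    split_info = -sum((len(g) / total) * math.log2(len(g) / total)
                      for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0
```

An attribute that perfectly predicts the class with two balanced values gets a gain ratio of 1.0.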
2. The k-means algorithm
The k-means algorithm is a clustering algorithm that partitions n objects into k clusters based on their attributes, with k < n.
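The two alternating k-means steps (assign each point to its nearest center, then move each center to its cluster mean) can be sketched in pure Python. The 2-D tuple representation, seed, and empty-cluster handling are my own assumptions for illustration:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 2-D points represented as (x, y) tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for c, cl in zip(centers, clusters):
            if cl:
                new_centers.append((sum(p[0] for p in cl) / len(cl),
                                    sum(p[1] for p in cl) / len(cl)))
            else:
                new_centers.append(c)  # keep an empty cluster's old center
        if new_centers == centers:     # converged
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated blobs the centers converge to the blob means after a few iterations.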
3. Support vector machines
The support vector machine (SVM, sometimes written SV machine in papers) is a supervised learning method widely used in statistical classification and regression analysis. Support vector machines map vectors into a higher-dimensional space and construct a maximum-margin hyperplane in that space.

Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between these two parallel hyperplanes. The assumption is that the larger the distance or margin between the parallel hyperplanes, the smaller the total error of the classifier. An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition". Van der Walt and Barnard compared support vector machines with other classifiers.
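As an illustration of the maximum-margin idea, below is a minimal linear SVM trained by stochastic sub-gradient descent on the hinge loss (a Pegasos-style scheme). Real SVM solvers use quadratic programming and kernels; treat this as a sketch under simplifying assumptions (linear kernel, no bias term), with hyperparameters chosen arbitrarily:

```python
import random

def train_linear_svm(xs, ys, lam=0.01, epochs=200, seed=0):
    """Linear SVM via stochastic sub-gradient descent on the hinge loss.

    xs: list of feature tuples; ys: labels in {-1, +1}.
    Returns the weight vector (no bias term, for brevity).
    """
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(xs)), len(xs)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = ys[i] * sum(wj * xj for wj, xj in zip(w, xs[i]))
            # Sub-gradient of (lam/2)*||w||^2 + hinge loss.
            if margin < 1:  # point inside the margin: pull w toward it
                w = [(1 - eta * lam) * wj + eta * ys[i] * xj
                     for wj, xj in zip(w, xs[i])]
            else:           # point outside the margin: only shrink w
                w = [(1 - eta * lam) * wj for wj in w]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

On linearly separable data the learned hyperplane classifies all training points correctly.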


4. The Apriori algorithm
The Apriori algorithm is one of the most influential algorithms for mining frequent itemsets for Boolean association rules. Its core is a recursive algorithm based on the two-stage frequent-set idea.

In terms of classification, these association rules are single-dimensional, single-level, Boolean association rules.

Here, all itemsets whose support is greater than the minimum support are called frequent itemsets, or frequent sets for short.
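The candidate-generation-and-count loop, together with the Apriori pruning property (every subset of a frequent itemset must itself be frequent), can be sketched as follows. This is a minimal in-memory version for illustration, using an absolute support count:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (frozensets) with their support counts."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate in one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: combine frequent k-itemsets into (k+1)-candidates;
        # prune any candidate with an infrequent k-subset (Apriori property).
        k += 1
        keys = list(survivors)
        current = {a | b for a in keys for b in keys if len(a | b) == k}
        current = {c for c in current
                   if all(frozenset(s) in survivors
                          for s in combinations(c, k - 1))}
    return frequent
```

For example, with five transactions and a minimum support of 3, {a, b} can be frequent while {a, b, c} is not.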


5. Maximum expectation (EM) algorithm
In statistical computing, the expectation–maximization (EM) algorithm finds maximum-likelihood estimates of the parameters of a probabilistic model, where the model depends on unobservable latent variables.

Maximum expectation is often used for data clustering in machine learning and computer vision.
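A toy EM run for a two-component one-dimensional Gaussian mixture makes the E step (compute responsibilities given parameters) and M step (re-estimate parameters given responsibilities) concrete. Simplifying assumptions of mine: equal fixed mixing weights, a crude min/max initialisation, and a variance floor to avoid degeneracy:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture with equal fixed weights.

    The component assignment of each point is the latent variable; the
    means and variances are the parameters being estimated.
    """
    mu = [min(data), max(data)]  # crude initialisation
    var = [1.0, 1.0]
    for _ in range(iters):
        # E step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M step: re-estimate means and variances from responsibilities.
        for k in range(2):
            w = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / w
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / w, 1e-6)
    return mu, var
```

On data drawn from two well-separated groups, the estimated means settle near the group centers.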


6. PageRank
PageRank is an important part of Google's algorithm.

PageRank was granted a U.S. patent in September 2001, and the patent holder is Larry Page, one of Google's founders. Thus the "Page" in PageRank refers not to a web page but to Page himself; that is, the ranking method is named after Larry Page.


PageRank measures a site's value based on the quantity and quality of its external and internal links.

The idea behind PageRank is that every link to a page counts as a vote for that page; the more it is linked to, the more votes it receives from other sites. This is "link popularity": a measure of how many people are willing to link their own site to yours.

The concept of PageRank is borrowed from academic citation: the more often a paper is cited, the more authoritative it is generally judged to be.
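The voting idea above corresponds to a simple power iteration, sketched below. This toy version assumes every page has at least one outgoing link (no dangling-node handling) and uses the common damping factor 0.85:

```python
def pagerank(links, d=0.85, iters=100):
    """Power iteration for PageRank.

    links: dict mapping each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    Returns a dict of scores summing (approximately) to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Each inbound link contributes a share of the linker's score,
            # split evenly among all of that linker's outgoing links.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        rank = new
    return rank
```

On the three-page graph a → {b, c}, b → {c}, c → {a}, page c collects the most votes and ends up with the highest score.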
7. AdaBoost
AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (the strong classifier). The algorithm works by changing the data distribution: it sets the weight of each sample according to whether that sample was classified correctly in the previous round and to the accuracy of the previous overall classification. The re-weighted data set is passed to the next classifier for training, and finally the classifiers from each round are fused into the final decision classifier.
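The re-weighting loop can be sketched with one-dimensional threshold stumps as the weak learners. This is a minimal illustration under my own simplifying choices (brute-force stump search, error clipping to avoid division by zero):

```python
import math

def train_adaboost(xs, ys, rounds=10):
    """AdaBoost with 1-D threshold stumps as weak learners.

    xs: list of real values; ys: labels in {-1, +1}.
    Returns a list of (alpha, threshold, polarity) weak classifiers.
    """
    n = len(xs)
    w = [1.0 / n] * n  # uniform initial sample weights
    ensemble = []
    for _ in range(rounds):
        # Pick the stump (threshold, polarity) with lowest weighted error.
        best = None
        for thr in xs:
            for pol in (1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if (pol if x >= thr else -pol) != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # clip for stability
        alpha = 0.5 * math.log((1 - err) / err)   # stump's vote weight
        ensemble.append((alpha, thr, pol))
        # Re-weight: misclassified samples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict_adaboost(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```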
8. kNN: k-nearest neighbor classification
The k-nearest neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms.

The idea of this approach is: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, then the sample also belongs to that category.
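The idea fits in a few lines; this sketch assumes 2-D points and Euclidean distance, with ties broken by insertion order:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.

    train: list of ((x, y), label) pairs; query: an (x, y) point.
    """
    # Sort the training set by squared Euclidean distance to the query.
    by_dist = sorted(train, key=lambda item: (item[0][0] - query[0]) ** 2
                                             + (item[0][1] - query[1]) ** 2)
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```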
9.NaiveBayes
In many classification models, the two most widely used classification models are decision tree Model (Decisiontreemodel) and naive Bayesian model (NAIVEBAYESIANMODEL,NBC). Naive Bayesian model originates from classical mathematics theory, has a solid mathematical foundation, and stable classification efficiency. At the same time, the NBC model required a very small number of expected references. Less sensitive to missing data, the algorithm is simpler. Theory. The NBC model has a minimum error rate compared to other classification methods.

However, this is not always the case, because the NBC model if the properties are independent of each other, this if in practical applications often is not established. This has a certain effect on the correct classification of the NBC model. The efficiency of the NBC model is inferior to the decision tree model when the number of attributes is more or the correlation between attributes is large.

And when the attribute dependency is small. The NBC model performs best.
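The independence assumption shows up directly in the code: the class score is just a product (here, a sum of logs) of per-attribute probabilities. A minimal categorical naive Bayes sketch with add-one (Laplace) smoothing, under my own representation choices:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Train a categorical naive Bayes model with add-one smoothing.

    rows: list of feature tuples; labels: one class label per row.
    """
    class_counts = Counter(labels)
    # feature_counts[class][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[y][i][v] += 1
    # Distinct values seen per feature, for the smoothing denominator.
    values = [{row[i] for row in rows} for i in range(len(rows[0]))]
    return class_counts, feature_counts, values, len(labels)

def predict_nb(model, row):
    class_counts, feature_counts, values, n = model
    best, best_score = None, -math.inf
    for y, cy in class_counts.items():
        # log P(y) + sum_i log P(x_i | y): the independence assumption.
        score = math.log(cy / n)
        for i, v in enumerate(row):
            num = feature_counts[y][i][v] + 1
            den = cy + len(values[i])
            score += math.log(num / den)
        if score > best_score:
            best, best_score = y, score
    return best
```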
10. CART: Classification and regression trees
CART (classification and regression trees) rests on two key ideas: the first is recursively partitioning the space of the independent variables, and the second is pruning the tree using validation data.
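One step of the recursive partitioning idea can be sketched as finding the threshold on a single numeric feature that minimises the weighted Gini impurity of the two children. Pruning with validation data is omitted; the use of Gini impurity is standard for CART classification trees, but the rest of the setup is my own simplification:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Best binary split on one numeric feature (one CART split step).

    Returns the threshold minimising the weighted child impurity,
    together with that impurity.
    """
    best_thr, best_imp = None, float('inf')
    for thr in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        if not left or not right:
            continue  # skip degenerate splits
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp
```

A full CART builder would apply this recursively to each child node and, for regression trees, replace Gini impurity with variance.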

