In December 2006, the IEEE International Conference on Data Mining (ICDM) identified ten classic algorithms in the field of data mining: C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
And it is not just the top ten: any of the 18 algorithms nominated for the selection could fairly be called a classic, and all of them have had a far-reaching impact on the field of data mining.
1. C4.5
C4.5 constructs a classifier in the form of a decision tree. A classifier is a data mining tool that takes a body of data needing categorization and tries to predict which category new data belongs to.
C4.5 is a decision tree classification algorithm from machine learning whose core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following ways:
1) It selects attributes using the information gain ratio, overcoming plain information gain's bias toward attributes with many values (illustrated in the sketch below);
2) it prunes during tree construction;
3) it can handle continuous attributes through discretization;
4) it can process incomplete data.
The C4.5 algorithm has the following advantages: the resulting classification rules are easy to understand and the accuracy is high. Its disadvantage is that during tree construction the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient.
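To make improvement 1) concrete, here is a minimal sketch of the gain-ratio criterion in plain Python. The toy attributes and labels are made up for illustration; a real C4.5 implementation also handles continuous attributes and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """Information gain from splitting on an attribute, normalized by the
    split's own entropy; this normalization is what distinguishes C4.5's
    criterion from ID3's plain information gain."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attr_index] for row in rows])
    return gain / split_info if split_info else 0.0

# Toy data (hypothetical): attribute 0 = outlook, attribute 1 = windy.
rows = [('sunny', 'no'), ('sunny', 'yes'), ('rain', 'no'), ('rain', 'yes')]
labels = ['play', 'stay', 'play', 'stay']
print(gain_ratio(rows, labels, 0))  # 0.0: outlook tells us nothing here
print(gain_ratio(rows, labels, 1))  # 1.0: windy perfectly predicts the label
```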
Why use the C4.5 algorithm?
Arguably, the best selling point of decision trees is that they are easy to interpret and explain. They are also fast, and they remain a popular algorithm, with output that is simple and understandable.
Where can I use it?
A popular open-source Java implementation can be found at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements its decision tree classifier with C4.5.
2. The k-means algorithm
The k-means algorithm is a clustering algorithm that creates multiple groups from a single data set, where the members of each group are relatively similar to one another.
It is a popular cluster analysis technique for exploring a data set. Cluster analysis is the family of algorithms that build groups whose members are more similar to one another than to non-members. In the world of cluster analysis, cluster and group mean the same thing.
k-means partitions n objects into k clusters (k < n) according to their attributes. It resembles the expectation-maximization algorithm for mixtures of normal distributions, in that both try to find the centers of natural clusters in the data. It assumes the object attributes form a vector space, and the goal is to minimize the sum of squared errors within each cluster.
Why use the K-means algorithm?
I think most people would agree: the key selling point of k-means is its simplicity. That simplicity means it is usually faster and more efficient than other algorithms, especially on large data sets.
The k-means algorithm is designed for continuous data; for discrete data you need a few tricks to make it work.
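As a minimal sketch, assuming SciPy (one of the implementations listed below), here is k-means applied to a made-up two-blob data set:

```python
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

# Toy 2-D data: two hypothetical blobs of points.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(50, 2)),
])

# Normalize features so no single attribute dominates the distance metric.
data = whiten(data)

# Run k-means with k=2; labels[i] is the cluster index of data[i].
centroids, labels = kmeans2(data, 2, minit='++')
print(centroids)
print(labels[:10])
```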
Where is k-means used?
There are many implementations of the k-means clustering algorithm available on the Web:
Apache Mahout
Julia
R
SciPy
Weka
MATLAB
SAS
3. Support vector machines
The support vector machine (usually abbreviated SVM in papers) finds a hyperplane that divides the data into two categories. At a high level, SVM performs a task similar to C4.5, except that it does not use decision trees at all. A hyperplane is a function, analogous to the equation of a line; in fact, for a simple classification task with only two attributes, the hyperplane can be a line.
SVM is a supervised learning method widely used for statistical classification and regression analysis. Support vector machines map vectors into a higher-dimensional space and construct a maximum-margin hyperplane in that space. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between the two parallel ones. The assumption is that the larger the distance or gap between the parallel hyperplanes, the smaller the total error of the classifier. An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition". Van der Walt and Barnard have compared support vector machines with other classifiers.
Why use SVM?
SVM and C4.5 are generally the first two classifiers worth trying. But according to the "no free lunch" theorem, no classifier is the best in all cases. In addition, the selection and interpretation of the kernel function is a weakness of the algorithm.
Where to use SVM?
As for implementations of SVM, the most popular are scikit-learn, MATLAB, and LIBSVM.
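As a minimal sketch using scikit-learn (mentioned above), with made-up two-attribute training points, so the separating hyperplane really is just a line:

```python
from sklearn import svm

# Toy training data (hypothetical): two attributes per point, two classes.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]

# A linear kernel keeps the separating hyperplane a straight line in input space.
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# Predict the class of a new point.
print(clf.predict([[0.8, 0.9]]))  # expected: class 1
```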
4. The Apriori algorithm
The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Apriori learns association rules from databases containing a large number of transactions. Association rule learning is a data mining technique for learning relationships among different variables in a database.
The core of the Apriori algorithm is a recursive method based on the two-stage frequent-set idea. Its association rules are, by classification, single-dimensional, single-level, Boolean association rules. Here, all itemsets whose support exceeds the minimum support are called frequent itemsets, or frequent sets.
The basic Apriori algorithm has three steps:
Join: scan the entire database and count how frequently each 1-itemset occurs.
Prune: the 1-itemsets that satisfy the support and confidence thresholds move on to the next round, which then looks for the 2-itemsets that occur together.
Repeat: the process is repeated for itemsets of each size, until itemsets reach the size we defined earlier. A sketch of this loop follows below.
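Here is a minimal sketch of the join/prune/repeat loop in plain Python. The transactions and support threshold are made up for illustration, and confidence checking on the resulting rules is omitted:

```python
# Toy transaction database (hypothetical shopping baskets).
transactions = [
    {'bread', 'milk'},
    {'bread', 'diapers', 'beer'},
    {'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers'},
    {'bread', 'milk', 'beer'},
]
min_support = 3  # absolute support: an itemset must appear in >= 3 baskets

def support(itemset):
    """Count the transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Join (k=1): scan the database and keep the frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{i} for i in items if support({i}) >= min_support]

# Prune and repeat: grow candidate itemsets one item at a time.
k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:", frequent)
    candidates = {frozenset(a | b) for a in frequent for b in frequent
                  if len(a | b) == k}
    frequent = [set(c) for c in candidates if support(c) >= min_support]
    k += 1
```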
Is the algorithm supervised or unsupervised?
Apriori is generally considered an unsupervised learning method, since it is often used to mine and discover interesting patterns and relationships.
That said, the Apriori algorithm can be modified to classify data that is already labeled.
Why use the Apriori algorithm?
It is easy to understand, easy to apply, and has many derivative algorithms. On the other hand, generating candidate itemsets can consume a great deal of memory, space, and time.
Implementations of the Apriori algorithm are available in many languages. The more popular ones are ARtool, Weka, and Orange.
5. The expectation-maximization (EM) algorithm
In data mining, the expectation-maximization (EM) algorithm is generally used, like k-means, as a clustering algorithm for knowledge discovery. It fits a probabilistic model that depends on unobserved latent variables. EM is frequently used for data clustering in machine learning and computer vision.
The essence of the algorithm:
by optimizing the likelihood, EM produces a model that assigns class labels to data points, which sounds just like a clustering algorithm!
Is EM a supervised or unsupervised algorithm?
Because we do not provide labeled class information, EM is an unsupervised learning algorithm.
Why use it?
One of the key selling points of the EM algorithm is that it is straightforward to implement. In addition, it not only optimizes the model parameters but can also iteratively guess missing data.
This makes the algorithm good at clustering and at producing models with parameters. Knowing the clusters and the model parameters, we can reason about what the members of a cluster have in common and which cluster new data belongs to.
The weaknesses of the EM algorithm:
First, the EM algorithm is fast in early iterations but slows down in later ones.
Second, the EM algorithm does not always find the optimal parameters; it easily falls into a local optimum instead of finding the global optimum.
Implementations of the EM algorithm can be found in Weka; the mclust package provides an R implementation; and scikit-learn's GMM module implements it as well.
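As a minimal sketch, assuming scikit-learn's Gaussian mixture model (GaussianMixture in current versions; older releases called the module GMM), on made-up one-dimensional data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from two hypothetical normal distributions.
rng = np.random.default_rng(0)
data = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=100),
    rng.normal(loc=3.0, scale=1.0, size=100),
]).reshape(-1, 1)

# EM alternates E-steps (soft cluster assignments) with M-steps
# (re-estimating means, variances, and mixture weights) to raise the likelihood.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gmm.means_.ravel())     # estimated component means, near -2 and 3
print(gmm.predict(data[:5]))  # cluster labels for the first few points
```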
6. PageRank
PageRank is an important part of Google's algorithm. It was granted a U.S. patent in September 2001, and the patent holder is Google co-founder Larry Page. So the "Page" in PageRank does not refer to a web page; it refers to Page himself, that is, the ranking method is named after Larry Page.
PageRank measures the value of a website based on the number and quality of its external and internal links. It is a link analysis algorithm designed to determine the relative importance of an object with respect to the other objects in a network.
The idea behind PageRank is that each link to a page is a vote for that page: the more links, the more votes it receives from other sites. This is called "link popularity", a measure of how many people are willing to link their site to yours. The concept is borrowed from academic citation: the more often a paper is cited, the more authoritative it is judged to be.
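For intuition, here is a minimal power-iteration sketch on a made-up four-page link graph; the damping factor of 0.85 is the value commonly quoted for PageRank:

```python
import numpy as np

# Toy link graph (hypothetical): adjacency[i][j] = 1 if page i links to page j.
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

damping = 0.85  # probability of following a link rather than jumping at random
n = adjacency.shape[0]
# Each page splits its vote evenly among its outgoing links.
transition = adjacency / adjacency.sum(axis=1, keepdims=True)

rank = np.full(n, 1.0 / n)  # start with equal rank everywhere
for _ in range(50):         # power iteration until roughly converged
    rank = (1 - damping) / n + damping * transition.T @ rank

print(rank)  # pages with more (and better-ranked) incoming links score higher
```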
Is the algorithm supervised or unsupervised?
PageRank is commonly used to discover the importance of web pages, and it is generally considered an unsupervised learning algorithm.
Why use PageRank?
The main selling point of PageRank is its robustness: because acquiring new relevant inbound links is difficult, the rankings are hard to manipulate.
Simply put, if you have any graph or network and want to understand the relative importance, priority, ranking, or relevance of its elements, try PageRank.
Where is it used? Google holds the PageRank trademark, but Stanford University holds the patent on the PageRank algorithm. If you want to use PageRank you may wonder whether you can: I am not a lawyer, so it is best to check with a real one, but it should be fine to use the algorithm as long as you are not in commercial competition with Google or Stanford.
Three implementations of PageRank are given:
1. C++ open source PageRank implementation
2. Python PageRank implementation
3. igraph, the network analysis package (R)
7. AdaBoost
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a stronger final classifier (a strong classifier). The algorithm works by changing the data distribution: it sets the weight of each sample based on whether that sample was classified correctly in the previous round and on the overall accuracy of the previous round. The reweighted data set is passed to the next-level classifier for training, and finally the classifiers obtained in each round of training are fused into the final decision classifier.
Why use AdaBoost?
The AdaBoost algorithm is simple, and programming it is relatively concise and straightforward.
Plus, it's fast!
Since each successive round of AdaBoost refines the weights of the weak learners, this is a very elegant algorithm for automatically tuning a classifier; all you have to do is specify the number of rounds to run.
The algorithm is flexible and versatile: AdaBoost can incorporate any learning algorithm, and it can handle a wide variety of data.
AdaBoost has many implementations and variants. Here are a few:
Scikit-learn
Icsiboost
GBM: Generalized Boosted Regression Models
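As a minimal sketch using scikit-learn (listed above) on made-up two-class data; as noted, the number of boosting rounds is essentially the only knob you must set:

```python
from sklearn.ensemble import AdaBoostClassifier

# Toy data (hypothetical): two attributes, two classes.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 2], [3, 2], [2, 3], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# By default each round fits a depth-1 decision tree (a "stump") and
# then reweights the samples that were misclassified; n_estimators is
# the number of boosting rounds.
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X, y)
print(clf.predict([[2.5, 2.5]]))  # expected: class 1
```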
8. kNN: k-nearest neighbor classification
The k-nearest neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea is this: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, then that sample belongs to the category too.
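A minimal sketch of that idea in plain Python, with made-up training points and Euclidean distance as the similarity measure:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are coordinate tuples.
    """
    # Sort training points by Euclidean distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two attributes, two classes.
train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'),
         ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
print(knn_classify(train, (1, 1)))  # expected: 'a'
```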
Why do we use KNN?
Ease of understanding and of implementation are the two key reasons we use it. Depending on the distance metric, kNN can be very accurate.
But that is only part of the story. Here are five things to be aware of:
1) The kNN algorithm can be computationally expensive when it must find the nearest points in a large data set.
2) Noisy data can throw off kNN classification.
3) Features with a larger range of values can dominate features with a smaller range, so feature scaling is important.
4) Because data processing is deferred (kNN is a lazy learner), kNN generally has greater storage requirements than an eager classifier.
5) Choosing a suitable distance metric is critical to kNN's accuracy.
Where is this method used?
There are many existing kNN implementations:
MATLAB k-nearest neighbor classification
scikit-learn KNeighborsClassifier
k-nearest neighbour classification in R
9. Naive Bayes
Naive Bayes is not a single algorithm but a family of classification algorithms that all share one common assumption:
every attribute of the data being classified is independent of all the other attributes, given the class.
What does independent mean? Two attributes are independent when the value of one has no effect on the value of the other.
Among the many classification models, the two most widely used are the decision tree model and the naive Bayesian model (NBC). The naive Bayesian model originates in classical mathematical theory, has a solid mathematical foundation, and classifies with stable efficiency. At the same time, the NBC model needs few parameters to be estimated, is not very sensitive to missing data, and is a comparatively simple algorithm.
In theory, the NBC model has the smallest error rate of any classification method. In practice this is not always so, because the NBC model assumes that attributes are independent of one another, an assumption that often fails to hold and that affects the model's classification accuracy. When the number of attributes is large or the correlations between attributes are strong, the NBC model is less efficient than the decision tree model; when attribute correlations are small, the NBC model performs best.
Why use Naive Bayes?
Naive Bayes involves only simple mathematics: counting, multiplying, and dividing.
Once the frequency tables have been computed, classifying an unknown fruit involves nothing more than computing a probability for each class and choosing the class with the maximum probability.
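A minimal sketch of that computation in plain Python; the fruit attributes and frequency counts are entirely made up for illustration:

```python
# Hypothetical frequency table: how many observed fruits of each class
# were long, sweet, or yellow (numbers invented for illustration).
counts = {
    'banana': dict(long=400, sweet=350, yellow=450, total=500),
    'orange': dict(long=0,   sweet=150, yellow=300, total=300),
    'other':  dict(long=100, sweet=150, yellow=50,  total=200),
}
n_total = sum(c['total'] for c in counts.values())

def naive_bayes_score(fruit, features):
    """P(class) times the product of P(feature | class), assuming independence."""
    c = counts[fruit]
    score = c['total'] / n_total    # the prior: how common the class is
    for f in features:
        score *= c[f] / c['total']  # per-feature likelihood (count, then divide)
    return score

# Classify an unknown fruit that is long, sweet, and yellow.
features = ['long', 'sweet', 'yellow']
scores = {fruit: naive_bayes_score(fruit, features) for fruit in counts}
print(max(scores, key=scores.get))  # expected: 'banana'
```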
Although the algorithm is simple, Naive Bayes is surprisingly accurate. For example, it has proven to be an effective algorithm for spam filtering.
Implementations of Naive Bayes can be found in Orange, scikit-learn, Weka, and R.
10. CART: classification and regression trees
CART stands for classification and regression trees. Two key ideas underlie these trees: the first is recursively partitioning the space of the independent variables, and the second is pruning the tree with validation data.
Compared with C4.5, CART chooses splits with the Gini index rather than the information gain ratio, always produces binary splits, supports regression as well as classification, and prunes with validation data (cost-complexity pruning) rather than C4.5's error-based pruning.
Is this a supervised or an unsupervised algorithm?
To construct a classification and regression tree model, a labeled training data set must be provided, so CART is a supervised learning algorithm.
Why use CART?
Most of the reasons for using C4.5 also apply to CART, since both are decision tree learning methods. As with C4.5, they are computationally fast, generally popular, and their output is human-readable.
scikit-learn implements the CART algorithm in its decision tree classifier; the R tree package also implements CART; and Weka and MATLAB have CART routines as well.
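As a minimal sketch using scikit-learn's decision tree, whose documentation describes it as an optimized version of CART, on made-up labeled data:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy labeled training data (hypothetical): two attributes, two classes.
X = [[0, 0], [1, 0], [0, 1], [2, 2], [2, 3], [3, 2]]
y = ['a', 'a', 'a', 'b', 'b', 'b']

# CART-style classification: binary splits chosen by Gini impurity.
clf = DecisionTreeClassifier(criterion='gini').fit(X, y)
print(export_text(clf))           # the learned tree, in readable form
print(clf.predict([[2.5, 2.5]]))  # expected: 'b'
```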