Ten classic algorithms in machine learning and Data Mining

Source: Internet
Author: User
Tags: id3, svm


Background:

The top-10 project began when Professor Wu (Xindong Wu) gave a talk in Hong Kong on the top 10 challenges in data mining. After the meeting, a professor from mainland China proposed a similar idea: a list of the top 10 algorithms. Professor Wu thought this was a good idea and started working on it. He approached a number of leading figures in data mining, all of whom thought the idea was worthwhile but did not want to run it themselves. There are several possible reasons: 1. they were genuinely busy; 2. it was easy to offend people; 3. the whole undertaking was complicated. In the end, Wu took on the responsibility together with Professor Vipin Kumar of the University of Minnesota. In the first step, they asked the fourteen researchers who had won the KDD and ICDM research awards to nominate candidate algorithms. One of them was very busy at the time, in the middle of moving from IBM to Microsoft; although Professor Wu did not mention his name, from the circumstances this was Rakesh Agrawal, the father of data mining in the author's eyes, and he did not submit nominations. The other thirteen provided their candidate lists. Eighteen algorithms were summarized and filtered from these nominations, covering areas such as classification, clustering, graph mining, association analysis, and rough sets. Because the list was restricted to algorithms, some very influential fields, such as neural networks and evolutionary algorithms, were not included: they are broad frameworks or ideas rather than single, specific algorithms. After the summary, Wu and Vipin Kumar disagreed. Wu wanted to delete some algorithms, such as naive Bayes, which he thought was too simple, while Vipin Kumar wanted to add some, for example a rule-based mining algorithm (this is roughly the idea; the author has forgotten exactly what Wu said). In the end, as a compromise, nothing was added or deleted. In the second step, a larger group of experts, including the nominated researchers themselves, voted, each with one vote, and the ten algorithms with the most votes became the final list. Some excellent algorithms (Professor Jiawei Han had three algorithms among the candidates, but none entered the top 10) did not make the final list because of questions of originality, influence, and so on.

When the results were announced at the conference, the organizers invited people to present the algorithms. Everyone was pleased to be on the list, but some were unhappy with the rankings (when the invitations were sent out, only the top-10 selection was mentioned, not the ranking). The speaker for CART was a technical consultant of the company that now owns the system (of the four statisticians who invented the algorithm, such as Breiman, two had already passed away because of their age, another had retired and could not come, and they had transferred all ownership of CART to a company). He apparently expected to be the first to speak and was unhappy to find himself last, even though being the tail of the phoenix, the last item on such a prestigious list, is still an honor. After the first lecture he was even more unhappy, because the first algorithm turned out to be C4.5: both CART and C4.5 are classic decision tree algorithms, CART appeared earlier than C4.5, and some of C4.5's ideas come directly or indirectly from CART. Seeing this, Professor Wu asked: which of the ten algorithms are the easiest to remember? The man said, "I know." Wu replied: the first one and the last one. Both smiled knowingly. See: http://blog.csdn.net/playoffs/article/details/5115336

 

 

The following are the top 10 classic algorithms selected from the 18 candidate algorithms:

For a more detailed introduction, see the PDF file: http://pan.baidu.com/share/link?shareid=474935&uk=2466280636


I. C4.5


C4.5 is a classification decision tree algorithm in machine learning. It is an improved version of the core decision tree algorithm ID3 (a decision tree organizes decision nodes in a tree structure, in effect an inverted tree), so it can be understood once one knows how ID3 constructs a decision tree. The construction method is to select a good feature and split point at each step and use it as the classification condition of the current node.
C4.5 has the following improvements compared with ID3:
1. It uses the information gain ratio to select attributes (see the sketch after this list).
ID3 selects attributes using the information gain of the split. The information here can be defined in many ways; ID3 uses entropy (an impurity measure), that is, the reduction in entropy. C4.5 instead uses the information gain ratio. The difference, then, is that one uses the gain and the other the gain ratio. In general, a ratio is used for normalization. For example, consider two runners: one starts at 10 m/s and reaches 20 m/s after 10 s, while the other starts at 1 m/s and reaches 2 m/s after 1 s.
If we compare the absolute increase in speed, the gap between the two is large (10 m/s versus 1 m/s). If we instead compare the rate of increase (the acceleration, 1 m/s² for both), the two runners are the same. By normalizing in this way, C4.5 overcomes ID3's bias toward attributes with many values when information gain is used to select attributes.
2. It prunes during tree construction. Nodes that cover only a few samples are not taken as the best splits; otherwise overfitting easily occurs.
3. It can also handle continuous (non-discrete) attributes.
4. It can handle incomplete data with missing values.
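To make the gain-ratio idea concrete, here is a minimal Python sketch (not from the original article) that computes the information gain ratio of one categorical attribute; the toy dataset and attribute values are invented purely for illustration.

```python
# Minimal sketch of C4.5's gain ratio for one categorical attribute.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attr_index, labels):
    """Information gain of splitting on one attribute, divided by the split information."""
    base = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    total = len(labels)
    cond = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = base - cond
    split_info = -sum(len(g) / total * log2(len(g) / total) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

# Invented toy example: attribute 0 = outlook, labels = play / no-play.
rows = [("sunny",), ("sunny",), ("overcast",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes", "no"]
print(gain_ratio(rows, 0, labels))
```

Dividing by the split information penalizes attributes that shatter the data into many small groups, which is exactly the bias of plain information gain that C4.5 corrects.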


II. The K-means algorithm


The K-means algorithm is a clustering algorithm that divides n objects into k clusters (k < n) based on their attributes. It is similar to the expectation-maximization algorithm for mixtures of normal distributions (number five on this list) in that both try to find natural cluster centers in the data. It assumes that the object attributes form vectors in a space, and the goal is to minimize the sum of squared errors within each cluster.
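As a rough illustration of this procedure, the following numpy sketch implements the basic Lloyd iteration of k-means; the synthetic data, the value of k and the convergence test are illustrative assumptions rather than anything from the text.

```python
# Minimal k-means (Lloyd's algorithm) sketch with numpy.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two synthetic blobs as a toy example.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
print(centers)
```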

III. Support Vector Machines


The Support Vector Machine (SVM) is a supervised learning method widely used in statistical classification and regression analysis. SVM maps vectors into a higher-dimensional space and constructs a maximum-margin hyperplane there. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between these two parallel hyperplanes. The assumption is that the larger the distance or gap between the parallel hyperplanes, the smaller the total error of the classifier. An excellent guide is "A Tutorial on Support Vector Machines for Pattern Recognition" by C. J. C. Burges. Van der Walt and Barnard compared SVMs with other classifiers.
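The maximum-margin idea can be sketched with a simple linear, soft-margin SVM trained by sub-gradient descent on the hinge loss. This is only an illustrative toy implementation under assumed hyperparameters (learning rate, regularization strength) and synthetic data; real applications would normally use an established solver.

```python
# Rough sketch: linear soft-margin SVM via hinge-loss sub-gradient descent.
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """X: (n, d) features, y: labels in {-1, +1}. Returns weights and bias."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                     # samples that violate the margin
        if mask.any():
            grad_w = lam * w - (y[mask][:, None] * X[mask]).mean(axis=0)
            grad_b = -y[mask].mean()
        else:
            grad_w, grad_b = lam * w, 0.0
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two linearly separable blobs as a toy example.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```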


IV. The Apriori algorithm


The Apriori algorithm is the most influential algorithm for mining the frequent itemsets of Boolean association rules. Its core is a level-wise, recursive search based on the two-phase frequent-itemset idea. The association rules it produces are, by classification, single-dimensional, single-level, Boolean association rules. Here, all itemsets whose support is no lower than the minimum support threshold are called frequent itemsets.
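The level-wise search can be sketched as follows; the transactions and the minimum-support threshold are invented for illustration, and the candidate-generation step is kept deliberately simple.

```python
# Compact sketch of Apriori frequent-itemset mining.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(items):
        return sum(items <= t for t in transactions) / n

    # frequent 1-itemsets
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # join frequent (k-1)-itemsets into k-item candidates, then prune by support
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
        k += 1
    return {tuple(sorted(s)): support(s) for s in frequent}

transactions = [{"bread", "milk"}, {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
print(apriori(transactions, min_support=0.5))
```

The pruning works because of the Apriori property: every subset of a frequent itemset must itself be frequent, so candidates built from infrequent sets can never qualify.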

V. Maximum expectation (EM) algorithm

 

In statistical computation, the maximum expectation (EM, expectation-maximization) algorithm finds maximum likelihood estimates of parameters in probabilistic models that depend on unobservable latent variables. EM is often used for data clustering in machine learning and computer vision.
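For concreteness, here is a minimal numpy sketch of EM for a two-component, one-dimensional Gaussian mixture; the initialization scheme and the synthetic data are assumptions made for the example.

```python
# Minimal EM sketch for a 1-D, two-component Gaussian mixture.
import numpy as np

def em_gmm(x, n_iter=50):
    # initial guesses for the mixing weights, means and variances
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(0)
x = np.hstack([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
print(em_gmm(x))
```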


VI. PageRank


PageRank is an important part of Google's algorithm. It was granted a United States patent in September 2001; the patent holder is Larry Page, one of Google's founders. Hence the "Page" in PageRank refers not to a web page but to Page himself, that is, the ranking method is named after Page. PageRank measures the value of a website based on the quantity and quality of its external and internal links. The idea behind PageRank is that each link to a page is a vote for that page: the more links it receives, the more votes it gets from other websites.
This is the so-called "link popularity", a measure of how many people are willing to link their website to yours. The concept of PageRank derives from citation frequency in academic papers: the more often a paper is cited by others, the higher its authority.
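The voting intuition corresponds to the power iteration sketched below on an invented toy link graph; the damping factor of 0.85 and the convergence tolerance are conventional but assumed values.

```python
# Small PageRank power-iteration sketch on a toy link graph.
import numpy as np

def pagerank(links, d=0.85, tol=1e-8):
    """links: dict mapping each page to the list of pages it links to."""
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    rank = np.full(n, 1.0 / n)
    while True:
        new = np.full(n, (1.0 - d) / n)
        for p, outs in links.items():
            if outs:                       # distribute this page's rank over its out-links
                share = d * rank[idx[p]] / len(outs)
                for q in outs:
                    new[idx[q]] += share
            else:                          # dangling page: spread its rank uniformly
                new += d * rank[idx[p]] / n
        if np.abs(new - rank).sum() < tol:
            return dict(zip(pages, new))
        rank = new

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))
```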

VII. AdaBoost
AdaBoost is an iterative algorithm. Its core idea is to train different weak classifiers on the same training set and then combine them into a stronger final classifier (a strong classifier). The algorithm works by re-weighting the data distribution: based on whether each sample in the training set was classified correctly in the current round and on the accuracy of the previous overall classification, it determines the weight of each sample.
The data set with the updated weights is passed to the next weak classifier for training, and the classifiers obtained in all rounds are finally combined into the decision classifier.
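The re-weighting scheme can be sketched with one-dimensional decision stumps as the weak classifiers; the toy data, the number of boosting rounds and the exhaustive stump search are illustrative assumptions.

```python
# Condensed AdaBoost sketch with 1-D decision stumps as weak learners.
import numpy as np

def adaboost(X, y, n_rounds=20):
    """X: (n, 1) features, y: labels in {-1, +1}. Returns a list of weighted stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initial sample weights
    model = []
    for _ in range(n_rounds):
        best = None
        for thresh in np.unique(X):            # exhaustive stump search
            for sign in (+1, -1):
                pred = sign * np.where(X[:, 0] > thresh, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, thresh, sign, pred)
        err, thresh, sign, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # classifier weight
        w *= np.exp(-alpha * y * pred)                      # re-weight the samples
        w /= w.sum()
        model.append((alpha, thresh, sign))
    return model

def predict(model, X):
    score = sum(alpha * sign * np.where(X[:, 0] > thresh, 1, -1)
                for alpha, thresh, sign in model)
    return np.sign(score)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (100, 1))
y = np.where(X[:, 0] ** 2 > 2, 1, -1)          # a non-linear toy target
model = adaboost(X, y)
print("training accuracy:", np.mean(predict(model, X) == y))
```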


VIII. KNN: K-nearest neighbor classification


The K-nearest neighbor (KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea is: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category.
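A minimal sketch of the majority-vote rule follows, using Euclidean distance and an invented toy training set; the distance metric and k are illustrative choices.

```python
# Minimal k-nearest-neighbor classification sketch.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)       # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)      # majority vote among neighbors
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))   # expected: "blue"
```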


IX. Naive Bayes


Among the many classification models, the two most widely used are the decision tree model and the naive Bayes classifier (NBC). The naive Bayes model originates from classical mathematical theory and has a solid mathematical foundation and stable classification performance. At the same time, the NBC model requires few parameters to estimate, is not very sensitive to missing data, and the algorithm is relatively simple. In theory, the NBC model has the minimum error rate compared with other classification methods. In practice this is not always the case, because the NBC model assumes that attributes are mutually independent, an assumption that often does not hold in real applications and that hurts its classification accuracy. When the number of attributes is large or the correlations between attributes are strong, the classification performance of the NBC model is inferior to that of the decision tree model; when the attribute correlations are weak, the NBC model performs best. Note: it is essentially a linear classifier; see http://www.rustle.us/?p=21
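The independence assumption translates into the simple counting scheme sketched below for categorical features with Laplace smoothing; the feature names and toy weather data are invented for illustration.

```python
# Small categorical naive Bayes sketch with Laplace smoothing.
from collections import Counter, defaultdict
from math import log

def train_nb(rows, labels):
    class_counts = Counter(labels)
    feat_counts = Counter()         # (class, feature_index, value) -> count
    values = defaultdict(set)       # feature_index -> set of observed values
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feat_counts[(y, j, v)] += 1
            values[j].add(v)
    return class_counts, feat_counts, values, len(labels)

def predict_nb(model, row):
    class_counts, feat_counts, values, n = model
    best, best_score = None, float("-inf")
    for c, cc in class_counts.items():
        score = log(cc / n)                              # log prior
        for j, v in enumerate(row):                      # add log likelihoods
            num = feat_counts[(c, j, v)] + 1             # Laplace smoothing
            den = cc + len(values[j])
            score += log(num / den)
        if score > best_score:
            best, best_score = c, score
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rain", "mild")))   # expected: "yes"
```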


X. CART: Classification and regression trees


CART stands for classification and regression trees. There are two key ideas behind the classification tree: the first is recursively partitioning the space of the independent variables; the second is pruning with validation data. You can pick one or two of these algorithms for detailed study and exposition later.
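The recursive-partitioning idea can be sketched as follows, here using the Gini index and omitting the validation-based pruning step; the depth limit and toy data are illustrative assumptions.

```python
# Compressed sketch of CART-style recursive binary splitting (no pruning).
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Find the (feature, threshold) minimizing the weighted Gini impurity."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    split = best_split(X, y)
    if split is None or depth == max_depth or gini(y) == 0.0:
        return np.bincount(y).argmax()                  # leaf: majority class
    _, j, t = split
    left = X[:, j] <= t
    return (j, t, build_tree(X[left], y[left], depth + 1, max_depth),
                  build_tree(X[~left], y[~left], depth + 1, max_depth))

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(build_tree(X, y))   # expected: a single split separating the two groups
```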

Reproduced from: http://aimit.blog.edu.cn/home.php?mod=space&uid=1555054&do=blog&id=593069
