10 big algorithms in data mining


1. C4.5 algorithm

2. K-means clustering algorithm

3. Support Vector Machine (SVM)

4. Apriori association rule algorithm

5. EM (expectation-maximization) algorithm

6. PageRank algorithm

7. AdaBoost iterative algorithm

8. kNN: k-nearest neighbors algorithm

9. Naive Bayes algorithm

10. CART classification algorithm

1. C4.5 algorithm

What does C4.5 do? C4.5 constructs a classifier in the form of a decision tree. To do this, it needs to be given a collection of data that has already been categorized.

Wait, what is a classifier? A classifier is a data mining tool that takes a large amount of data that needs to be categorized and tries to predict which category new data belongs to.

For example, suppose we have a data set containing information about many patients. We know various things about each patient, such as age, pulse, blood pressure, maximum oxygen uptake, family history, and so on. These are called data attributes.

Right now:

Given these attributes, we want to predict whether a patient will develop cancer. A patient falls into one of two categories: will get cancer or won't get cancer. The C4.5 algorithm tells us the predicted class of each patient.

Here is how it works:

Using a set of patient attributes and the corresponding known class for each patient, C4.5 constructs a decision tree that can predict the class of new patients from their attributes.

This is great, so what is a decision tree? Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path through the flowchart could be:

    • Does the patient have a history of cancer?
    • Does the patient express a gene highly correlated with those of cancer patients?
    • Does the patient have tumors?
    • Is the patient's tumor size greater than 5cm?

The basic principles are:

Each node of the flowchart asks a question about an attribute value, and depending on those values the patient is classified. You can find lots of examples of decision trees.

Is the algorithm supervised or unsupervised? This is supervised learning, because the training data is already labeled with classes. C4.5 doesn't figure out on its own whether a patient has cancer; that information comes from the already-labeled patient data.

How does the C4.5 algorithm differ from other decision tree systems?

First, C4.5 uses information gain when generating the decision tree.

Second, although other systems also include pruning, C4.5 uses a single-pass pruning process to mitigate overfitting. Pruning often improves the results.

Third, C4.5 can work with both continuous and discrete data. My understanding is that it does this by specifying ranges or thresholds that turn continuous data into discrete data.

Finally, incomplete data is dealt with in the algorithm's own ways.

Why use C4.5? Arguably, the best selling point of decision trees is that they are easy to interpret and explain. They are also fast, fairly popular, and their output is simple and human-readable.

Where is it used? A popular open-source Java implementation can be found at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements its decision tree classifier with C4.5.
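To get a rough feel for how this looks in practice, here is a minimal sketch using scikit-learn. One hedge up front: scikit-learn grows CART-style trees rather than true C4.5, but the entropy criterion mirrors C4.5's information-gain splitting, and the patient attributes and values below are invented for illustration.

```python
# Minimal sketch: a decision-tree classifier on made-up patient-style data.
# scikit-learn builds CART-style trees, not C4.5 itself; criterion="entropy"
# only approximates C4.5's information-gain-based splitting.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical attributes: [age, resting_pulse, systolic_bp]
X = [[63, 88, 150], [45, 70, 120], [58, 92, 160],
     [36, 64, 110], [70, 95, 170], [50, 72, 125]]
y = ["cancer", "no cancer", "cancer", "no cancer", "cancer", "no cancer"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

# The learned tree is the "flowchart": print it, then classify a new patient.
print(export_text(tree, feature_names=["age", "pulse", "bp"]))
print(tree.predict([[60, 90, 155]]))
```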

Classifiers are great, but make sure to check out the next algorithm, which is about clustering...

2. K-means clustering algorithm

What does it do? k-means creates k groups from a set of objects so that the members of each group are more similar to one another. It's a popular clustering technique for exploring a data set.

Wait, what is cluster analysis? Cluster analysis is a family of algorithms that build groups such that group members are more similar to each other than to non-members. In the world of cluster analysis, cluster and group mean the same thing.

For example, suppose we have a data set of patients. In cluster analysis, these are called observations. We know various things about each patient, such as age, blood pressure, blood type, maximum oxygen uptake and cholesterol level. Together these form a vector that describes a patient.

Please see:

You can basically think of the vector as a list of the numbers we know about the patient. That list can also be interpreted as coordinates in a multidimensional space: pulse is one dimension, blood type is another dimension, and so on.

You may have questions about:

Given this set of vectors, how do we cluster patients with similar age, pulse, and blood pressure data?

Want to know what the best part is?

You tell k-means how many clusters you want, and k-means handles the rest.

So how does it handle it? k-means has many variants optimized for particular types of data.

At a high level, k-means approaches the problem like this (a code sketch follows the list):

    1. k-means picks k points in the multidimensional space to represent each of the k clusters. These are called centroids.
    2. Every patient finds the closest of the k centroids. Patients closest to the same centroid gather around it, forming a cluster.
    3. We now have k clusters, and each patient is a member of exactly one cluster.
    4. k-means then finds the center of each of the k clusters based on its members (yes, using the patient vectors).
    5. That center becomes the cluster's new centroid.
    6. Since the centroids are now in different places, patients may find themselves closer to a different centroid. In other words, they may switch cluster membership.
    7. Steps 2-6 are repeated until the centroids stop moving and cluster membership stabilizes. This is called convergence.
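Here is the promised sketch of those steps, written directly in NumPy. It is illustrative only: the patient vectors are invented, empty clusters are not handled, and a real analysis would normally use one of the library implementations listed further below.

```python
# Bare-bones k-means mirroring steps 1-7 above (toy data, no production hardening).
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4-5: the mean of each cluster's members becomes the new centroid.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical patient vectors: [age, pulse, systolic_bp]
patients = np.array([[25.0, 60, 110], [30, 65, 115], [62, 90, 150], [65, 95, 155]])
print(kmeans(patients, k=2))
```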

Is the algorithm supervised or unsupervised? It depends, but in most cases k-means is classified as unsupervised. Apart from specifying the number of clusters, k-means "learns" the clusters on its own, without any information about which cluster an observation belongs to. k-means can also be semi-supervised.

Why use the K-means algorithm? I think most people agree with this:

The key selling point of k-means is its simplicity. That simplicity means it is usually faster and more efficient than other algorithms, especially on large data sets.

And it can be put to good use:

k-means can pre-cluster a massive data set, after which a more expensive cluster analysis can be run on each sub-cluster. k-means can also be used to rapidly "play" with k and explore whether there are overlooked patterns or relationships in the data set.

But using the K-means algorithm is not smooth sailing:

Two key weaknesses of k-means are its sensitivity to outliers and its sensitivity to the initial choice of centroids. One final thing to keep in mind is that k-means is designed for continuous data; for discrete data you'll need some small tricks to get it to work.

Where is k-means used? A ton of implementations of k-means clustering are available online:

• Apache Mahout

• Julia

• R

• SciPy

• Weka

• MATLAB

• SAS

If decision trees and clustering haven't impressed you yet, you're going to love the next algorithm.

3. Support Vector Machine

What does it do? Support Vector Machine (SVM) learns a hyperplane that splits the data into two classes. At a high level, SVM performs a task similar to C4.5, except that it doesn't use decision trees at all.

Whoa, a hyper-what? A hyperplane is a function, similar to the equation of a line. In fact, for a simple classification task with only two attributes, the hyperplane can be a line.

In fact, it turns out that:

SVM can use a trick to project your data into higher dimensions. Once projected into higher dimensions, SVM figures out the best hyperplane that separates your data into two classes.

Do you have an example? Of course. The simplest example: a bunch of red and blue balls on a table. If the balls aren't too mixed up, you could take a stick and, without moving the balls, separate them with it.

You see, when a new ball is added to the table, you can predict its color by which side of the stick it lands on.

What's the coolest part? The SVM algorithm can calculate the equation of this hyperplane.

What if things get more complicated? They usually are, of course. If the balls are mixed together, a straight stick won't do the job.

Here's the solution:

Quickly lift the table, throwing all the balls into the air. While the balls are in the air, and thrown up just right, you use a large sheet of paper to divide them.

You might think this is cheating. No, lifting the table is the equivalent of mapping your data into a higher-dimensional space. In this case, we go from the two-dimensional surface of the table to the three-dimensional space of balls in the air.

So how does SVM do this? By using a kernel function, we get a great way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, except that it now corresponds to the equation of a plane rather than a line. As Yuval points out, once we're dealing with the problem in three dimensions, the hyperplane has to be a plane rather than a line.

As for intuitive explanations of SVM, there are two great discussion threads on Reddit, in the ELI5 and ML subreddits.

So how do balls on a table or in the air map to real-world data? Each ball on the table has a location we can specify with coordinates. For example, a ball might be 20cm from the left edge and 50cm from the bottom edge; another way to describe the ball is with coordinates (x, y), or (20, 50). x and y are two dimensions of the ball.

Here's the thing: if we had a patient data set, each patient could be described by measurements such as pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.

Basically, SVM maps the data into a higher-dimensional space and then finds the hyperplane that separates the two classes.
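As a small, hedged illustration of that idea (the points and labels below are invented, and scikit-learn's SVC is just one of the implementations mentioned later), an RBF kernel plays the role of "lifting the table":

```python
# Minimal sketch: an SVM with an RBF kernel on toy 2-D "ball" data.
from sklearn.svm import SVC

# Hypothetical ball positions (x, y) on the table and their colors.
X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
y = ["red", "red", "red", "blue", "blue", "blue"]

clf = SVC(kernel="rbf", C=1.0)   # the kernel handles the higher-dimensional trick
clf.fit(X, y)

print(clf.predict([[2, 2], [9, 9]]))  # which side of the "stick" do new balls fall on?
print(clf.support_vectors_)           # the balls that "support" the hyperplane
```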

Margins are often associated with SVM. What is the margin? It's the distance between the hyperplane and the closest data point from either class. In the ball-and-table example, the margin is the distance between the stick and the nearest red ball and the nearest blue ball.

The key is that SVM tries to maximize the margin, so the separating hyperplane stays as far away from both the red and the blue balls as possible. This reduces the chance of misclassification.

So where does support vector machine get its name? In the ball-and-table example, the hyperplane is equidistant from the nearest red ball and the nearest blue ball. These balls, or data points, are called support vectors because they support the hyperplane.

Is this a supervised or unsupervised algorithm? SVM is supervised learning, because it starts from a data set whose classes it learns. Only then can SVM classify new data.

Why use SVM? SVM, along with C4.5, is generally one of the first two classifiers to try. Per the no-free-lunch theorem, no classifier is best in every case. In addition, kernel selection and interpretability are weak points of the algorithm.

Where is SVM used? There are many implementations of SVM; among the more popular are scikit-learn, MATLAB, and LIBSVM.

4. Apriori association rule algorithm

What does it do? The Apriori algorithm learns association rules from databases containing a large number of transactions.

What are association rules? Association rule learning is a data mining technique for learning relationships among variables in a database.

As an example of the Apriori algorithm: suppose we have a database full of supermarket transactions. You can think of the database as a giant spreadsheet in which each row is a customer transaction and each column represents a different grocery item.

Here's the cool part: by applying the Apriori algorithm, we can learn which items are purchased together, which is also called an association rule. What makes this powerful is that you can see which items are purchased together more frequently than others; the ultimate goal is to get shoppers to buy more. Items that are frequently purchased together are called itemsets.

For example, you might quickly notice that the combinations "fries + dip" and "fries + soda" come up together frequently. These are called 2-itemsets. With a large enough data set, it is much harder to "see" these relationships, especially when dealing with 3-itemsets or larger. That's exactly where Apriori helps!

You may be wondering how Apriori works. Before getting into the nuts and bolts of the algorithm, there are 3 things you need to define:

    1. The first is the size of your itemsets: is the pattern you want to see a 2-itemset, a 3-itemset, or something else?
    2. The second is your support, which is the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the support threshold is called a frequent itemset.
    3. The third is your confidence, the conditional probability of some item given the other items in the itemset. For example, given the fries in your itemset, there might be 67% confidence that soda is also in the itemset.

The basic Apriori algorithm has three steps (a toy sketch follows the list):

    1. Join: scan the whole database and count how often each 1-itemset occurs.
    2. Prune: the 1-itemsets that satisfy the support and confidence thresholds move on to the next round, where 2-itemsets are searched for.
    3. Repeat: iterate for each itemset level until we reach the itemset size defined earlier.
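Here is the toy sketch promised above. It is not a full Apriori implementation (there is no confidence calculation and it stops at 2-itemsets), and the transactions are made up, but it shows the join and prune steps in action:

```python
# Toy join/prune sketch for 1-itemsets and 2-itemsets with a support threshold.
from itertools import combinations

transactions = [
    {"fries", "dip", "soda"},
    {"fries", "soda"},
    {"fries", "dip"},
    {"bread", "butter"},
]
min_support = 0.5  # an itemset must appear in at least half of the transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Join + prune for 1-itemsets.
items = {item for t in transactions for item in t}
frequent_1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Join frequent 1-itemsets into candidate 2-itemsets, then prune again.
candidates_2 = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = [c for c in candidates_2 if support(c) >= min_support]

print([set(c) for c in frequent_1])
print([set(c) for c in frequent_2])  # e.g. {'fries', 'dip'} and {'fries', 'soda'}
```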

Is the algorithm supervised or unsupervised? Apriori is generally considered unsupervised, since it's often used to dig around and discover interesting patterns and relationships.

But wait, there's more... Apriori can also be modified to classify already-labeled data.

Why use the Apriori algorithm? It's easy to understand, easy to apply, and has a lot of derived algorithms.

But on the other hand ...

The algorithm can be quite memory-, space-, and time-intensive when generating itemsets.

Plenty of implementations of Apriori are available in various languages. Some of the more popular ones are ARtool, Weka, and Orange.

The next algorithm was the most difficult for me to get my head around; take a look...

5. EM (expectation-maximization) algorithm

What does the EM algorithm do? In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.

In statistics, the EM algorithm iterates and optimizes the likelihood of seeing the observed data while estimating the parameters of a statistical model with unobserved (latent) variables.

Okay, just a minute, let me explain.

I am not a statistician, so I hope my simplified explanation is both correct and helpful for understanding.

Here are some concepts that can help us understand the problem better.

What's a statistical model? I see a model as something that describes how observed data is generated. For example, the grades for an exam might fit a bell curve, so the assumption that the grade distribution follows a bell curve (also called a normal distribution) is the model.

Wait, what's a distribution? A distribution represents the probabilities of all measurable outcomes. For example, exam grades might fit a normal distribution. This normal distribution represents all the probabilities of the grades. In other words, given a grade, you can use the distribution to estimate how many exam takers are expected to get that grade.

This is good, what is the parameter of the model? As part of the model, the distribution properties are described by parameters. For example, a bell-shaped curve can be described by its mean and variance.

Using the exam example again: the distribution of grades on an exam (the measurable outcomes) follows a bell curve (that is, the distribution). The mean is 85, and the variance is 100.

So, all you need to describe a normal distribution are these two parameters:

    1. The mean
    2. The variance

So, what about likelihood? Going back to the bell curve example, suppose we've got a lot of grade data and are told the grades fit a bell curve. However, we weren't given all the grades, only a sample.

You can do this:

We don't know the mean or variance of all the grades, but we can estimate them from the sample. The likelihood is the probability of seeing the sampled grades given a bell curve with the estimated mean and variance.

In other words, given a set of measurable outcomes, let's estimate the parameters. Using these estimated parameters, the hypothetical probability of seeing those outcomes is called the likelihood.

Remember, this is the hypothetical probability of an existing score, not the probability of a future score.

You may wonder, what is the probability?

Again using the bell curve example: suppose we know the mean and variance, and we're told the grades fit a bell curve. The chance of observing certain grades, and how often they would be observed, is the probability.

More generally, given the parameters, let's calculate what results can be observed. That's what the probabilities do for us.

OK, so what's the difference between observed data and unobserved hidden data? Observed data is the data you've seen or recorded. Unobserved data is data that is missing. There are many reasons data could be missing (not recorded, ignored, etc.).

Here's the kicker: for data mining and clustering, what matters to us is treating a data point's cluster membership as missing data. We don't know the class, so treating this missing data properly is crucial for applying the EM algorithm to clustering.

Once again: the EM algorithm iterates and optimizes the likelihood of seeing the observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully that's easier to understand now.

The essence of the algorithm is:

By optimizing the likelihood, EM produces a great model that assigns class labels to data points -- sounds like clustering!

How does EM help with clustering? The EM algorithm begins by making a guess at the model parameters. Then it follows an iterative three-step loop (sketched in code after the list):

    1. E-step: based on the model parameters, it calculates the probability that each data point is assigned to each cluster.
    2. M-step: it updates the model parameters based on the cluster assignments from the E-step.
    3. Repeat until the model parameters and cluster assignments stabilize (also called convergence).
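As a hedged illustration of that loop, here is a sketch that clusters fake exam scores with a Gaussian mixture model; scikit-learn's GaussianMixture (the successor of the older GMM module mentioned below) runs the E and M steps internally until convergence. The data is simulated, so the recovered means are only approximate.

```python
# Illustrative sketch: EM-based clustering via a two-component Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Fake exam scores drawn from two hidden bell curves; the "missing" data is
# which curve each score came from.
scores = np.concatenate([rng.normal(70, 5, 100),
                         rng.normal(90, 3, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
print(gmm.means_.ravel())         # estimated means of the two bell curves
print(gmm.predict([[72], [91]]))  # which curve each new score most likely came from
```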

Is EM supervised or unsupervised? Since we don't provide labeled class information, this is an unsupervised learning algorithm.

Why use it? A key selling point of the EM algorithm is that it is straightforward to implement. In addition, not only can it optimize the model parameters, it can also repeatedly make guesses about the missing data.

This makes it great for clustering and for generating a model with parameters. Knowing the clusters and the model parameters, it's possible to reason about what the clusters have in common and which cluster new data belongs to.

But the EM algorithm is not without weakness ...

First, the EM algorithm runs fast in early iterations, but the later iterations are slower.

Second, the EM algorithm doesn't always find the optimal parameters; it can easily get stuck in a local optimum rather than finding the global optimum.

Implementations of the EM algorithm are available in Weka; the mclust package provides an R implementation; and scikit-learn's Gaussian mixture module implements it as well.

6. PageRank algorithm

What does the algorithm do? PageRank is a link analysis algorithm designed to determine the relative importance of an object linked within a network of objects.

So what is link analysis? It's a type of network analysis that explores the associations (also called links) among objects.

An example: the best-known use of PageRank is Google's search engine. Although the search engine no longer relies on it as heavily, PageRank remains one of the measures Google uses to gauge a web page's importance.

Explain:

Web pages on the World Wide Web link to each other. If rayli.net links to a page on CNN, CNN gets a vote indicating that rayli.net thinks the CNN page is relevant.

It's not over yet:

In turn, the vote from rayli.net is weighted by the importance and relevance of rayli.net itself. In other words, any web page that votes for rayli.net also increases rayli.net's relevance.

To sum up:

Votes and relevance are the idea behind PageRank. rayli.net's vote for CNN increases CNN's PageRank, and how much that vote raises CNN's PageRank depends on rayli.net's own PageRank.
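To make that vote-passing concrete, here is a bare-bones power-iteration sketch on an invented four-page link graph, using the commonly cited damping factor of 0.85. It is only meant to show how each page's rank is split among the pages it links to, not to reproduce Google's actual implementation.

```python
# Toy PageRank by power iteration on a hypothetical link graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # page -> pages it links to
n, d = 4, 0.85                               # number of pages, damping factor

rank = np.full(n, 1.0 / n)
for _ in range(100):
    new_rank = np.full(n, (1 - d) / n)
    for page, outlinks in links.items():
        for target in outlinks:
            # Each page splits its own rank ("vote") among the pages it links to.
            new_rank[target] += d * rank[page] / len(outlinks)
    if np.allclose(new_rank, rank):
        break
    rank = new_rank

print(rank)  # higher value = more "important" page in this toy graph
```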

So what do PageRank levels 0, 1, 2, 3 and so on mean? Although Google hasn't disclosed the exact meaning of PageRank, we can still get a general sense of it.

We can get a rough sense by looking at the PageRank of a few well-known websites.

See the pattern?

The ranking is a bit like a web page popularity contest, and our intuition about how popular and relevant these sites are roughly matches their rankings.

PageRank is just a special way to define these.

What other applications does PageRank have? PageRank was designed specifically for the World Wide Web.

Think about it: at its core, PageRank is really just a super-effective way to do link analysis, and the objects being linked don't have to be web pages.

Here are 3 innovative applications of PageRank:

    1. Dr. Stefano Allesina of the University of Chicago applied PageRank to ecology to determine which species are critical to a sustainable ecosystem.
    2. Twitter developed WTF (Who-to-Follow), a personalized PageRank-based recommendation engine for whom to follow.
    3. Bin Jiang of Hong Kong Polytechnic University used a variant of PageRank to predict pedestrian movement rates based on topological metrics of London.

Is the algorithm supervised or unsupervised? PageRank is commonly used to discover the importance of web pages, which is generally considered unsupervised learning.

Why use PageRank? Arguably, the main selling point of PageRank is its robustness, which comes from the difficulty of obtaining a relevant incoming link.

More simply, if you have any other graph or network and want to understand the relative importance, priority, ranking, or relevance of its elements, give PageRank a try.

Where is it used? Google holds the PageRank trademark, while Stanford University holds the patent on the PageRank algorithm. If you're wondering whether you can use PageRank: I'm not a lawyer, so it's best to check with a real one. But it should be fine to use the algorithm as long as you're not commercially competing with Google or Stanford.

Here are three implementations of PageRank:

1. C++ open-source PageRank implementation

2. Python PageRank implementation

3. igraph - the network analysis package (R)

7. AdaBoost iterative algorithm

What does the AdaBoost algorithm do? AdaBoost is a boosting algorithm used to build classifiers.

As you'll recall, a classifier takes a bunch of data and tries to predict or classify which class new data elements belong to.

But what does boosting mean? Boosting is an ensemble learning technique that takes multiple learning algorithms (such as decision trees) and combines them. The goal is to take a set of weak learners and combine them into a single strong learner.

What's the difference between a strong and a weak learner? A weak classifier is only slightly more accurate than random guessing. A popular example of a weak classifier is a decision tree with only one level.

A strong classifier, on the other hand, has much higher accuracy; a common example of a strong learner is SVM.

An example of AdaBoost: let's start with 3 weak learners and run 10 rounds of training on a training set of patient data. The data set contains details from the patients' medical records.

The question is, how do we predict whether a patient will get cancer? Here's how AdaBoost answers it:

In round 1, AdaBoost takes a sample of the training data and tests how accurate each learner is. The end result is that we find the best learner. In addition, misclassified samples are given a heavier weight, so that they have a higher probability of being picked in the next round.

One more thing: the best learner is also given a weight according to its accuracy and added to the ensemble (so right now there's only one classifier in it).

In round 2, AdaBoost again tries to find the best learner.

Here's the key part: the sample of patient training data is now drawn according to the heavier misclassification weights. In other words, patients who were misclassified before have a higher probability of appearing in the sample.

Why?

It's like getting to the second level of a video game and not having to start over when your character is killed. Instead, you start at level 2 and focus all your effort on getting to level 3.

Similarly, the first learner probably classified some patients correctly. Rather than trying to classify them again, the focus shifts to doing the best possible job on the patients who were misclassified.

The best learner is again weighted and added to the ensemble, the misclassified patients are weighted so they have a higher chance of being picked again, and we rinse and repeat.

At the end of the 10 rounds, we are left with a trained ensemble of weighted learners that were repeatedly trained on the data misclassified in the previous rounds.
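As a hedged sketch of the whole procedure (the patient attributes and labels below are invented), scikit-learn's AdaBoostClassifier uses a one-level decision tree as its default weak learner and exposes the per-round learner weights the text describes:

```python
# Illustrative sketch: 10 rounds of AdaBoost with decision-stump weak learners.
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical attributes: [age, resting_pulse, cholesterol]; 1 = cancer, 0 = no cancer
X = [[63, 88, 260], [45, 70, 180], [58, 92, 240],
     [36, 64, 170], [70, 95, 280], [50, 72, 190]]
y = [1, 0, 1, 0, 1, 0]

ensemble = AdaBoostClassifier(n_estimators=10)  # default weak learner: a depth-1 tree
ensemble.fit(X, y)

print(ensemble.predict([[60, 90, 250]]))  # the weighted committee's prediction
print(ensemble.estimator_weights_)        # weight given to each round's weak learner
```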

Is this a supervised or unsupervised algorithm? This is supervised learning, because each round trains the weak learners on an already labeled data set.

Why use AdaBoost? AdaBoost is simple, and the programming is relatively clean and straightforward.

Plus, it's fast! Weak learners are generally simpler than strong learners, and simpler means they run faster.

Here's another thing:

Since each successive round of AdaBoost refines the weights of the best learners, it's a very elegant algorithm that auto-tunes the classifier; all you have to do is specify the number of rounds to run.

Finally, the algorithm is flexible and versatile: AdaBoost can incorporate any learning algorithm, and it can work with a wide variety of data.

AdaBoost has a lot of implementations and variants. Here are a few:

• scikit-learn

• icsiboost

• gbm: Generalized Boosted Regression Models

If you like Mr. Rogers, you'll like the next algorithm...

8. kNN: k-nearest neighbors algorithm

What does it do? kNN, or k-nearest neighbors, is a classification algorithm. However, it differs from the classifiers described earlier because it's a lazy learner.

What's a lazy learner? A lazy learner doesn't do much during training other than store the training data. It only starts classifying when new, unlabeled data arrives.

An eager learner, on the other hand, builds a classification model during training; when new unlabeled data comes in, that kind of learner feeds the new data into the model.

So what are C4.5, SVM, and AdaBoost? Unlike kNN, they are all eager learners.

Here's why:

1. C4.5 builds a decision tree classification model during training.

2. SVM builds a hyperplane classification model during training.

3. AdaBoost builds an ensemble classification model during training.

So what does kNN do? kNN builds no such classification model. Instead, it just stores the labeled training data. When new unlabeled data comes in, kNN performs two basic steps (sketched in code after the list):

1. First, it looks at the closest labeled training data points -- in other words, the k nearest neighbors.

2. Second, using the neighbors' classes, kNN decides how the new data should be classified.
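Here is the sketch promised above: a from-scratch kNN with Euclidean distance and a simple majority vote, on invented patient vectors. In practice you would usually reach for one of the library implementations listed at the end of this section.

```python
# Minimal from-scratch kNN: Euclidean distance + majority vote (toy data).
import math
from collections import Counter

def knn_predict(train_X, train_y, new_point, k=3):
    # Step 1: find the k labeled training points closest to the new point.
    dists = sorted((math.dist(x, new_point), label)
                   for x, label in zip(train_X, train_y))
    neighbors = [label for _, label in dists[:k]]
    # Step 2: let the neighbors vote; the majority class wins.
    return Counter(neighbors).most_common(1)[0][0]

train_X = [[63, 88], [45, 70], [58, 92], [36, 64], [70, 95], [50, 72]]
train_y = ["cancer", "no cancer", "cancer", "no cancer", "cancer", "no cancer"]
print(knn_predict(train_X, train_y, [60, 90]))  # the majority of nearby points decides
```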

You may wonder... how does kNN figure out what's nearest? For continuous data, kNN uses a distance metric like Euclidean distance. The choice of metric largely depends on the data type; some will even learn a distance metric from the training data. There's a lot more detail and many papers on kNN distance metrics.

For discrete data, the idea is to transform the discrete data into continuous data. Two examples:

1. Using Hamming distance as a metric for the "closeness" of two strings.

2. Transforming discrete data into binary features.

These two Stack Overflow threads also have some suggestions for dealing with discrete data:

• KNN classification with categorical data

• Using k-NN in R with categorical values

How does kNN classify new data when the neighbors disagree? kNN has an easy time when all the neighbors are the same class. Intuitively, if all the nearby points agree, the new data point is very likely to fall into the same class.

I bet you can guess where it started to get into trouble ...

When the neighbors are not all the same class, how does kNN decide the classification?

There are usually two ways of dealing with this situation:

1. Take a simple majority vote of the neighbors. Whichever class has the most votes is the class the new data gets.

2. Take a similar vote, but give heavier weight to the neighbors that are closer (see the sketch below). A simple way to do this is to use the reciprocal distance: if a neighbor is 5 units away, its vote is weighted 1/5. As neighbors get farther away, the reciprocal distance gets smaller and smaller... which is exactly what we want.
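Here is the sketch referenced in option 2: a tiny reciprocal-distance vote. The neighbor list is invented; each neighbor simply contributes 1/distance to its own class.

```python
# Hedged sketch of an inverse-distance weighted vote among the k nearest neighbors.
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest points."""
    votes = defaultdict(float)
    for dist, label in neighbors:
        # A neighbor 5 units away votes with weight 1/5; closer neighbors count more.
        votes[label] += 1.0 / dist if dist > 0 else float("inf")
    return max(votes, key=votes.get)

print(weighted_vote([(5, "cancer"), (2, "no cancer"), (4, "no cancer")]))  # "no cancer"
```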

Is this a supervised or unsupervised algorithm? Since kNN is given a labeled data set, it's a supervised learning algorithm.

Why use kNN? Ease of understanding and ease of implementation are the two key reasons. Depending on the distance metric, kNN can be quite accurate.

But this is only part of the story, here are 5 things we need to be aware of:

1. kNN can get computationally expensive when trying to determine the nearest neighbors on a large data set.

2. Noisy data can throw off kNN classifications.

3. Features with a larger range of values can dominate the distance metric relative to features with a smaller range, so feature scaling is important.

4. Since data processing is deferred, kNN generally requires greater storage than an eager classifier.

5. Selecting a good distance metric is crucial to kNN's accuracy.

Where is it used? There are many existing kNN implementations:

• MATLAB k-nearest neighbor classification

• scikit-learn KNeighborsClassifier

• k-nearest neighbour classification in R

Spam or not? Hold that thought, and read the next algorithm first...

9. Naive Bayes algorithm

What does the algorithm do? Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption:

Every attribute of the data being classified is independent of all the other attributes, given the class.

What does it mean to be independent? When one property value does not have any effect on another property value, it is said that the two properties are independent.

As an example:

For example, say you have a patient data set containing attributes like pulse, cholesterol level, weight, height, and zip code. All the attributes would be independent if no attribute value had any effect on another. For this data set, it's reasonable to assume that a patient's height and zip code are independent, since a patient's height has nothing to do with their zip code. But let's not stop there; are the other attributes independent?

Unfortunately, the answer is no. Here are three attribute relationships that are not independent:

• If your height increases, your weight may increase.

• If your cholesterol level increases, your weight may increase.

• If your cholesterol level increases, your pulse may also increase.

In my experience, the properties of a dataset are generally not independent.

Which leads to the next question...

Why is the algorithm called naive? The assumption that all attributes of a data set are independent is the "naive" part -- and as the preceding example shows, the attributes are usually not independent.

What about Bayes? Thomas Bayes was an English statistician, and Bayes' theorem is named after him. Click the link to find out more about Bayes' theorem.

In a nutshell, the theorem lets us predict a class given a set of attributes, using probabilities.

The simplified equation for classification looks like this:

P(Class A | Attribute 1, Attribute 2) = [ P(Attribute 1 | Class A) × P(Attribute 2 | Class A) × P(Class A) ] / [ P(Attribute 1) × P(Attribute 2) ]

We're just doing a little bit of digging.

What does this equation mean? It computes the probability of Class A given Attributes 1 and 2. In other words, given the observed values of Attributes 1 and 2, the equation tells us how probable it is that the data belongs to Class A.

The equation reads: the probability of Class A given Attributes 1 and 2 is a fraction, where:

• The numerator is the probability of Attribute 1 given Class A, multiplied by the probability of Attribute 2 given Class A, multiplied by the probability of Class A.

• The denominator is the probability of Attribute 1 multiplied by the probability of Attribute 2.

For an example of Naive Bayes, here is a good one from a Stack Overflow thread (Ram's answer).

Here's the deal:

• We have a training data set of 1,000 fruits.

• Each fruit may be a banana, an orange, or something else (these are the classes).

• A fruit may be long, sweet, or yellow (these are the attributes).

What do we see in this training set?

• Of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.

• Of 300 oranges, none are long, 150 are sweet and 300 are yellow.

• Of the remaining 200 fruits, 100 are long, 150 are sweet and 50 are yellow.

Given only the length, sweetness and color of a fruit, without knowing its class, we can now calculate the probability of it being a banana, an orange or another fruit.

Suppose we are told that this unclassified fruit is long, sweet and yellow.

Here we calculate all probabilities in 4 steps:

Step 1: To calculate the probability that the fruit is a banana, let's first recognize that this looks familiar. It's the probability of the class Banana given the attributes Long, Sweet and Yellow; more concisely, P(Banana | Long, Sweet, Yellow).

This is really like the equation we discussed earlier.

Step 2: Starting with the numerator, let's plug everything into the formula:

• P(Long | Banana) = 400/500 = 0.8

• P(Sweet | Banana) = 350/500 = 0.7

• P(Yellow | Banana) = 450/500 = 0.9

• P(Banana) = 500/1000 = 0.5

Multiplying everything together, just as in the formula, we get:

0.8 × 0.7 × 0.9 × 0.5 = 0.252

Step 3: Ignore the denominator, since it is exactly the same when computing every other class.

Step 4: Do the analogous calculation for the other classes:

• P(Orange | Long, Sweet, Yellow) ∝ 0 × (150/300) × (300/300) × (300/1000) = 0 (none of the oranges are long)

• P(Other | Long, Sweet, Yellow) ∝ (100/200) × (150/200) × (50/200) × (200/1000) = 0.01875

Since 0.252 is greater than 0.01875, Naive Bayes classifies this long, sweet and yellow fruit as a banana.
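The arithmetic above is easy to reproduce in a few lines. This sketch just re-runs the fruit example's counts (ignoring the shared denominator, as in step 3); it is not a general Naive Bayes implementation.

```python
# Reproducing the fruit example: per-class score = product of attribute
# probabilities times the class prior (denominator omitted for all classes).
counts = {
    #          total, long, sweet, yellow
    "banana": (500, 400, 350, 450),
    "orange": (300,   0, 150, 300),
    "other":  (200, 100, 150,  50),
}
total_fruit = 1000

scores = {}
for fruit, (n, long_, sweet, yellow) in counts.items():
    # P(long|class) * P(sweet|class) * P(yellow|class) * P(class)
    scores[fruit] = (long_ / n) * (sweet / n) * (yellow / n) * (n / total_fruit)

print({fruit: round(s, 5) for fruit, s in scores.items()})
# -> {'banana': 0.252, 'orange': 0.0, 'other': 0.01875}
print(max(scores, key=scores.get))  # -> 'banana'
```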

Is this a supervised or unsupervised algorithm? Since Naive Bayes is given an already labeled training data set in order to build the frequency tables, it is a supervised learning algorithm.

Why use Naive Bayes? As the example above shows, Naive Bayes involves only simple arithmetic: counting, multiplying, and dividing.

Once the frequency tables have been calculated, classifying an unknown fruit just involves computing the probabilities for all the classes and then choosing the highest one.

Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it has proven to be an effective algorithm for spam filtering.

Implementations of Naive Bayes can be found in Orange, scikit-learn, Weka and R.

Finally, take a look at the tenth algorithm.

10. CART classification algorithm

What does the algorithm do? CART stands for classification and regression trees. It is a decision tree learning technique that outputs either a classification or a regression tree. Like C4.5, CART is a classifier.

Is a classification tree like a decision tree? A classification tree is one kind of decision tree. The output of a classification tree is a class.

For example, given a patient data set, you might try to predict whether each patient will get cancer. The class would be either "will get cancer" or "won't get cancer".

What's a regression tree? Unlike a classification tree, which predicts a class, a regression tree predicts a numeric or continuous value, such as the length of a patient's hospital stay or the price of a smartphone.

It's easier to remember this:

Classification trees output classes, regression trees output numbers.
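As a quick hedged illustration of that distinction (scikit-learn's trees are CART-based, as noted below; the patient data and hospital-stay targets are invented):

```python
# Classification tree outputs a class; regression tree outputs a number.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[63, 88], [45, 70], [58, 92], [36, 64]]   # [age, pulse]
will_get_cancer = ["yes", "no", "yes", "no"]   # class labels
days_in_hospital = [14, 2, 10, 1]              # continuous target

clf = DecisionTreeClassifier().fit(X, will_get_cancer)
reg = DecisionTreeRegressor().fit(X, days_in_hospital)

print(clf.predict([[60, 90]]))  # a class, e.g. "yes"
print(reg.predict([[60, 90]]))  # a number of days
```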

Since we've already covered how decision trees classify data, let's cut to the chase...

How do CART and C4.5 compare? Briefly: CART always produces binary splits and uses the Gini index (or variance, for regression) as its splitting criterion, while C4.5 allows multiway splits and splits on information gain; CART supports regression trees, which C4.5 does not; and the two use different pruning strategies (cost-complexity pruning for CART, error-based pruning for C4.5).

Is this a supervised or unsupervised algorithm? CART is a supervised learning algorithm, since it needs an already labeled training data set in order to construct the classification or regression tree model.

Why use CART? Most of the reasons for using C4.5 also apply to CART, since both are decision tree learning techniques.

As with C4.5, they are computationally fast, the algorithms are generally popular, and the output is readable.

scikit-learn implements CART in its decision tree classifier; the tree package in R also has an implementation of CART; and Weka and MATLAB have implementations as well.

Finally, Salford Systems has the only implementation of the original proprietary CART code, based on the theory of the world-renowned statisticians at Stanford and UC Berkeley who devised CART.

Collected at: http://blog.jobbole.com/89037/
