Big Data Era: A Summary of Knowledge Points on Data Mining with the Microsoft Sample Database (Microsoft Naive Bayes Algorithm)


This article continues the series on the Microsoft Decision Trees algorithm and the Microsoft Clustering algorithm, this time using a simpler analysis algorithm to mine the target customer group, again based on the Microsoft sample data. Interested readers can first refer to the walkthroughs of those two algorithms.

Application Scenario Introduction

The Microsoft Naive Bayes algorithm applies to the same scenario as the previous two algorithms, but it is simpler and lighter-weight than either of them.

The algorithm applies Bayes' theorem but deliberately ignores the dependencies between input attributes, which is what makes it "naive": each historical attribute value is treated on its own, with no regard for how the attributes relate to one another. This idealized assumption keeps the model simple, but it also brings limitations. The Microsoft implementation only works with discrete (or discretized) attributes; continuous values are not supported directly, and in our scenario the predicted outcome is a simple two-state value such as buy/not buy, yes/no, will/will not. It is rather like the Taiji diagram in the Book of Changes, where only two states need to be explained: "Taiji gives birth to the two forms, the two forms to the four images, the four images to the eight trigrams..." The simplest is the easiest to use, and also the fastest.
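To make the independence assumption concrete, the textbook form of the naive Bayes scoring rule (the standard statement, not Microsoft's internal notation) is:

    P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)

That is, the probability of the outcome C (buy / not buy) is estimated from the prior P(C) times the probability of each attribute value given C, each attribute contributing on its own with no interaction terms between attributes.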

Digressions aside, for the algorithm details you can refer to Microsoft's official documentation on the Microsoft Naive Bayes algorithm.

As with the scenario in the last two articles, the Naive Bayes algorithm can also be used to predict which customers are likely to buy bicycles, only more concisely. This article uses it to make that prediction and to see where the algorithm's strengths lie.

Technical preparation

(1) We again use Microsoft's sample data warehouse (AdventureWorksDW2008R2) and two fact tables: one holding the history of past bicycle purchases, and another holding the prospective customers we want to mine for likely bicycle buyers. You can refer to the previous articles for details.

(2) Visual Studio, SQL Server, and Analysis Services need no further introduction; selecting all components when installing the database is enough.

Now to the topic. We will continue with the solution from the last article, following these steps:

(1) Open the solution and go to the mining model designer.

We can see that the mining structure already contains the two algorithms used in the previous two articles: the decision tree algorithm and the clustering algorithm. Now we add the Naive Bayes algorithm.

(2) Right-click the Structure column, select "New Mining Model", and enter a name.

Click OK. A prompt box pops up, as shown in the picture:

What does it mean? As analyzed above, the Naive Bayes algorithm is the simplest kind of two-state predictor and is powerless against continuous values; as far as it is concerned the world has only two states, yes or no. The Age and Yearly Income columns are continuous attributes, so they will simply be ignored. Click "Yes".

With that, the new Naive Bayes model is added to the mining structure. It uses the same key column and the same predictable column as the decision tree model; the input columns can of course be changed.
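For readers who prefer scripts to the designer, roughly the same model can be added with DMX. This is only a sketch: the structure name [Targeted Mailing], the model name [TM_NaiveBayes], and the column list are assumptions based on the Adventure Works sample, so adjust them to whatever your solution actually uses.

    ALTER MINING STRUCTURE [Targeted Mailing]
    ADD MINING MODEL [TM_NaiveBayes]
    (
        [Customer Key],                -- same key column as the decision tree model
        [Commute Distance],
        [Gender],
        [Number Cars Owned],
        [Number Children At Home],
        [Region],
        [Bike Buyer] PREDICT           -- the two-state outcome we want to predict
    ) USING Microsoft_Naive_Bayes;
    -- Age and Yearly Income are left out: they are continuous columns,
    -- which Microsoft_Naive_Bayes does not accept.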

Next, deploy and process the mining model.
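If you were scripting the whole thing in DMX rather than deploying from the designer, training (processing) the structure would be done with an INSERT INTO statement along these lines; again a sketch, with the data source name and the vTargetMail column list assumed from the Adventure Works sample:

    INSERT INTO MINING STRUCTURE [Targeted Mailing]
    (
        [Customer Key], [Age], [Yearly Income],
        [Commute Distance], [Gender],
        [Number Cars Owned], [Number Children At Home],
        [Region], [Bike Buyer]
    )
    OPENQUERY([Adventure Works DW],
        'SELECT CustomerKey, Age, YearlyIncome,
                CommuteDistance, Gender,
                NumberCarsOwned, NumberChildrenAtHome,
                Region, BikeBuyer
         FROM dbo.vTargetMail');
    -- Processing the structure trains every model attached to it,
    -- including the new Naive Bayes model.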

Results analysis

Again we use the "Mining Model Viewer" to inspect the results. This time we select the new Naive Bayes model, which provides four tabs; we will go through them in turn. Straight to the screenshots:

This viewer is much friendlier: it brings together the "Dependency Network" tab from the decision tree algorithm and the "Attribute Profiles", "Attribute Characteristics", and "Attribute Discrimination" tabs from the clustering algorithm. That is one of the advantages of this algorithm: simple, feature-based prediction of a two-state outcome. It also has its shortcomings, which we analyze below:

As the dependency network shows, the attribute with the strongest influence on bicycle purchasing is "Number Cars Owned", followed by "Commute Distance"... The "Age" attribute, which the decision tree algorithm identified as the most powerful factor, is now gone, alas, simply because it is a continuous value; the same goes for yearly income. That will make this algorithm's accuracy slightly lower. On the other hand, the algorithm can do things the decision tree cannot; let's look at the "Attribute Profiles" panel:

Through this panel we can carry out a population-level feature analysis, which is something the decision tree algorithm cannot do; it is really a trait of the clustering algorithm. From the picture above you can see that households with one car or no car are somewhat more likely to buy a bike. The other attributes are read in the same way; for details you can refer to my earlier clustering algorithm summary.
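Incidentally, the numbers these viewer panels are built from can also be pulled out with a DMX content query. A minimal sketch, assuming the model is named [TM_NaiveBayes] as above:

    SELECT FLATTENED
        NODE_CAPTION,
        NODE_PROBABILITY
    FROM [TM_NaiveBayes].CONTENT;
    -- Lists the model's content nodes (the input attributes and their states)
    -- with their probabilities; the full conditional distributions live in
    -- the NODE_DISTRIBUTION column.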

"Attribute characteristics" and "attribute comparison" two panel results analysis is also an inheritance and clustering algorithm, the previous article we have described in detail, the following is only a cut figure sun:

Right: people with no children at home, located in North America, with a commute distance within 1 mile (or is it kilometers?) are more interested in buying bicycles.

Households with no car at home... tend toward 1 (will buy a bike), while those with two cars lean toward 0 (basically will not buy). Hah... common sense... I won't analyze the rest.

Now let's take a look at how accurate this algorithm is at predicting which people will buy our bikes.

Accuracy Verification

Finally, let's verify how accurate today's Naive Bayes algorithm is and how it compares with the decision tree and clustering algorithms from the previous two articles. We click into the Mining Accuracy Chart:

As you can see, the Naive Bayes algorithm's score is out, second only to the decision tree algorithm; the ranking is: decision tree algorithm, Naive Bayes algorithm, clustering algorithm. It seems the simple Naive Bayes algorithm is not so simple after all. Even though it discarded two major attributes, Age and Yearly Income (and Age was one of the more important attributes according to the decision tree analysis), it still beats the clustering algorithm comfortably with a score of 0.78. And as the analysis above shows, it also offers the clustering algorithm's tools, such as feature analysis and attribute discrimination.
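By way of illustration, once the model is processed you can score the prospective-customer table with a DMX prediction join like the one below. This is a sketch: the ProspectiveBuyer query, the data source name, and the column mappings are assumptions based on the Adventure Works sample.

    SELECT
        t.[ProspectiveBuyerKey],
        Predict([Bike Buyer])               AS PredictedBikeBuyer,
        PredictProbability([Bike Buyer], 1) AS ProbabilityOfBuying
    FROM [TM_NaiveBayes]
    PREDICTION JOIN
        OPENQUERY([Adventure Works DW],
            'SELECT ProspectiveBuyerKey, Gender,
                    NumberCarsOwned, NumberChildrenAtHome
             FROM dbo.ProspectiveBuyer') AS t
    ON  [TM_NaiveBayes].[Gender]                  = t.[Gender]
    AND [TM_NaiveBayes].[Number Cars Owned]       = t.[NumberCarsOwned]
    AND [TM_NaiveBayes].[Number Children At Home] = t.[NumberChildrenAtHome];
    -- PredictProbability([Bike Buyer], 1) gives the estimated probability
    -- that the case will buy a bike.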

By evaluating the three analysis algorithms we can already see which one best fits our application needs; each algorithm's score is easy to read off the chart. At this point, today's Microsoft Naive Bayes algorithm walkthrough could end here.

<------------------------------------------------------------ Ornate Split Line------------------------------------------------------------------------------------------>

But... I remember that in the last article on the clustering algorithm I mentioned, in passing, that if you split domestic IT practitioners from non-IT practitioners and predict based on the gender attribute... the result would make you shiver! So, could the bicycle purchases here also be related to gender? Usually boys are fonder of riding bicycles... well... I mean, usually... Let's see the result:

We use the highest-scoring decision tree algorithm to explore this question. In the mining model designer, right-click and choose "New Mining Model", select the decision tree algorithm, and give it a name:

Click OK. We will use this decision tree model to analyze the probability of men buying bicycles, so right-click the model and select "Set Model Filter", then set the filter to Gender = 'M', i.e. the men:

In the same way we build a decision tree model for the women, as shown in the following figure:
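For reference, the same filtered models can be created in DMX with a WITH FILTER clause. Another sketch, again assuming the [Targeted Mailing] structure and its Adventure Works column names:

    ALTER MINING STRUCTURE [Targeted Mailing]
    ADD MINING MODEL [TM_DecisionTree_Male]
    (
        [Customer Key],
        [Age],
        [Commute Distance],
        [Number Cars Owned],
        [Yearly Income],
        [Bike Buyer] PREDICT
    ) USING Microsoft_Decision_Trees
    WITH FILTER ([Gender] = 'M');
    -- The women's model is identical apart from the name
    -- and FILTER ([Gender] = 'F').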

There is not much more to explain here; let's verify the results directly and see whether the inference above holds any water.

Look at the following picture:

Uh... uh... er... don't get too excited... wow... The results of the gender-based prediction models are out. Judging from the scores, the men's decision tree already matches the model built on all the cases: both score 0.71. That means we could study only the male population and still capture the overall market rule, without spending all our energy on everything else... But the women's score soars to 0.84... sweat... Of all these mining models, the decision tree built only on the female population gives by far the most accurate results! That one model simply crushes every other analysis algorithm; whatever clustering, whatever Naive Bayes, they are all just passing clouds... merely passing clouds.

The analysis above confirms our inference: when it comes to buying bicycles, men and women really are different groups, and that cannot be captured just by analyzing all the facts together. Of course, men and women, the two kinds of humans on this Earth, differ greatly in behavior and characteristics, so whether or not they buy a bicycle naturally differs too, heh... at least that is the case in America, and the chart above validates the claim! So for different behaviors we can mine the populations separately, and the values we infer from the mining will then be closer to the facts.

If you are interested, you can also mine the married and unmarried groups separately and see whether marital status has any relationship with buying a bike.

Postscript

Well, that's the end of this article. Next we will use the results of the three data mining algorithms above to dig the groups likely to buy bikes out of the customer table and use them for precision marketing. To close, here are the links to the previous two summaries:

Microsoft Decision Tree Analysis Algorithm summary

Summary of Microsoft Clustering algorithms

Let me end with a line from Master Fan: brother, I don't want to know how I came to be, I just want to know how I'll be gone... Remember to recommend this post!

