Big Data Era: Data Mining Notes Based on the Microsoft Sample Database (Microsoft Clustering Algorithm)

Source: Internet
Author: User

This article continues from the previous one on the Microsoft Decision Trees algorithm. It uses another algorithm to mine the target customer group, again working from the Microsoft sample data, and gives a brief summary.

Application Scenario Introduction

In the previous article we used the Microsoft Decision Trees algorithm to analyze the customer attributes in past orders, and we obtained some important information. Here is a summary:

1. The most important factor affecting bicycle purchases is whether the household has a car, followed by age, and then region.

2. Reading the collapsed tree, the customers who most want to buy a bike fall mainly into two groups. The first: no car at home, age 45, not in North America, and no children at home (the American underdog type).

The second: has a car at home, age between 37 and 53, commuting distance of less than 10 miles, fewer than 4 children at home, and annual income of more than $58,000 (the American "tall, rich and handsome" type).

In fact, the main use of the decision tree algorithm is to rank the factors that influence some behavior. From it we learn that certain groups have a few prominent attributes, such as not having a car at home, or age. But if we want to analyze what is characteristic of one particular group, the decision tree cannot do it. Analyzing the attributes shared by a particular group is the job of today's subject, the Microsoft Clustering algorithm. Put simply: things are sorted into categories, and birds of a feather flock together. With the clustering algorithm we want to find out what attributes the customers who are going to buy bicycles have in common.

For example, walk into a public square in the evening and you will see a group of aunties, a group of children, a group playing basketball, and couples in the dark woods at the edge of the square. Each of these groups is different from the others. If you want to sell children's toys, the children's group is naturally the one you want to get close to.
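The "flock together" idea can be sketched in a few lines of Python. This is a minimal plain k-means sketch, not the implementation inside Analysis Services. The customer tuples (age, number of cars) and the starting centroids are invented for illustration, and real data would need its attributes scaled to comparable ranges first.

```python
from math import dist

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iterations=20):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical customers: (age, cars owned). Not taken from the sample database.
customers = [(30, 0), (32, 0), (35, 1), (60, 2), (62, 3), (65, 2)]
centroids, clusters = kmeans(customers, [customers[0], customers[3]])
print(clusters)
```

Each resulting cluster plays the role of one "group in the square": its members are closer to each other than to the members of the other group.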

Technical preparation

(1) We again use Microsoft's sample data warehouse (ADVENTUREWORKSDW2008R2) and two fact tables: one holds the historical bicycle purchase records, and the other holds the people we are going to mine, who may buy bicycles. You can refer to the previous article.

(2) VS, SQL Server and Analysis Services need no introduction; just select everything when installing the database.

Let's get to the point. We continue with the solution from last time and follow these steps:

(1) Open the solution and go to the Mining Models tab.

We can see that a decision tree model already exists; we will add another algorithm.

(2) Right-click the Structure column, select New Mining Model, and enter a name.

Click OK, and the new clustering model is added to the mining models. It uses the same key as the decision tree and the same predictable column; the input columns are also the same and can be changed.

Next, deploy and process the mining model.

Results analysis

Here again we use the Mining Model Viewer. For the mining model we choose "Clustering", which provides four tabs. We introduce them in turn below, going straight to the pictures:

Here too we select the group in which a bicycle purchase occurred. The darkest color marks the group most likely to buy bicycles, which we have indicated with an arrow. In the same way we can find the group least likely to buy a bike, namely "Cluster 4". The strength of the lines between clusters indicates how strong the relationship between them is. To make the clusters easier to remember we can rename them: select a cluster, right-click, and choose Rename.

Next we analyze what characterizes these groups. We care most about the group that most wants to buy bicycles; the group that does not want to buy can also be analyzed. As for the bystander groups, passers-by A, B and so on, they are just extras and we do not analyze them.

We open "Cluster Profiles" to take a look:

Ha... the characteristics of these groups are all on display. Work with data long enough and you develop a sharp eye for charts; you should keep a keen nose for data too.

Let's rearrange the columns of the "Cluster Profiles" view, laying them out horizontally in order of how much we care about them.

The first column in the figure lists the attributes, such as age, number of cars, and number of children at home. The second column is the legend for each attribute, and its form depends on the attribute's value type. There are generally two kinds. For a continuous attribute such as age, whose stored values usually fall between 1 and 100, the legend is a segmented bar running from small to large with a diamond in the middle. The size of the diamond represents how concentrated the group is on that attribute; here, for example, the customers are concentrated between 29 and 48 years of age:

If the attribute is discrete rather than continuous, it is shown as a bar chart in different colors; the technical term is histogram. The panel has a setting for the number of histogram bars, that is, the maximum number of attribute values displayed. For example: the number of children at home is generally divided into 0, 1, 2, 3, and other...

What?! The figure above has no bar for 3 children. The legend is also built by sampling the data and only shows the values with enough cases, indicating that families mostly have fewer than 3 children.
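The missing bar for 3 children can be illustrated with a small sketch. This is only one plausible reading of the viewer's behaviour, assuming it keeps the most frequent values up to the configured number of bars and lumps the rest into an "Other" bucket; the function name and the data are invented for illustration.

```python
from collections import Counter

def histogram_bars(values, max_bars):
    """Keep the max_bars most frequent values; fold the rest into 'Other'."""
    counts = Counter(values)
    shown = dict(counts.most_common(max_bars))
    remainder = sum(counts.values()) - sum(shown.values())
    if remainder:
        shown["Other"] = remainder
    return shown

# Hypothetical number of children per household in one cluster.
children = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
bars = histogram_bars(children, max_bars=3)
print(bars)
```

With three bars allowed, the rare value 3 no longer gets a bar of its own, which matches what the figure shows.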

Let's analyze the characteristics of the group that most wants to buy bikes:

The first thing you can see is that the age is in the 40s, 43.65 years on average.

I'm sweating... Minimum age: 29. Average age: 43.65. Maximum age: 81.79. Perhaps the data in the Microsoft sample database is not entirely reliable. Or perhaps people under 30 do not like to ride bicycles while people over 80 still buy them, or the store does not sell to customers under 30. It is also possible that young people do not buy and most of the older buyers are buying for their children. We will not analyze this further. Anyway, that is what the data says, and the picture is the proof!

For households with no car or only one car, the probability is above 0.3 in each case. For families with more than one car it is much lower... With four cars at home it is below 0.003, close to never buying...

The probability for families with a child at home is as high as 0.483... and families with no children do not buy bicycles at all... Well, well. This basically confirms my guess above: it seems most people buy bicycles for their children to ride, and those without children do not buy. With no children the probability of a bicycle purchase is 0.000. There is one more attribute worth studying, whether the customer owns a house. Look at the picture:

Well... most people who want to buy a bike own a house, which suggests they have a fixed residence; for them the probability is 0.854... Those without a house are pitifully few, at 0.146.
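Figures such as 0.854 and 0.146 can be read as the share of each attribute value among the members of one cluster. A minimal sketch of that calculation follows, assuming each case is a plain dictionary; the attribute name and the 17-to-3 split are invented and do not come from the sample database.

```python
from collections import Counter

def state_probabilities(cases, attribute):
    """Share of each value of `attribute` among the cases of one cluster."""
    counts = Counter(case[attribute] for case in cases)
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}

# Hypothetical cluster: 17 house owners and 3 non-owners.
cluster_members = [{"house_owner": "yes"}] * 17 + [{"house_owner": "no"}] * 3
probabilities = state_probabilities(cluster_members, "house_owner")
print(probabilities)
```

Read this way, the two values for one attribute always add up to 1, just as 0.854 and 0.146 do.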

Other attributes can be analyzed through this panel in the same way. We can work out the characteristics of the group we are after and market to it in a targeted way.

The above is only a partial analysis of "Cluster Profiles". VS also provides another panel that specifically lists the attribute characteristics: "Cluster Characteristics".

We click to open this panel:

The chart above lays bare the group we want to know about. Let's see. French occupation: technical staff; English occupation: Skilled Manual; all own a house; region: North America; age range: between 41 and 48; annual income: between 35459.9 and 57244.9; a child at home; and so on. Other groups can be analyzed in the same way; we will not show them here.

We can also pick a particular attribute and compare two groups on it. For that we use another panel, "Cluster Discrimination". I suddenly thought of the "gender" attribute: compare the IT industry with non-IT industries and the result would probably make you shudder... Heh, off topic. Look at the following picture:

Whoa... looking at the picture, one attribute value is especially interesting. Customers with an annual income between 10000 and 29950 basically are not going to buy bicycles, but for an annual income of 29950-1700000 the probability of wanting to buy a bicycle is much higher. Well... a bicycle is still a vehicle, and if you want to buy a vehicle you have to have money.
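The comparison of two groups on one attribute can be sketched as a difference of shares. This is a simplified stand-in, not the scoring the viewer actually uses; the function names, the income labels, and the counts are all invented for illustration.

```python
from collections import Counter

def state_shares(values):
    """Share of each attribute value within one group."""
    counts = Counter(values)
    return {state: n / len(values) for state, n in counts.items()}

def discrimination(group_a, group_b):
    """Per value: share in A minus share in B. Positive favors A, negative favors B."""
    a, b = state_shares(group_a), state_shares(group_b)
    return {state: a.get(state, 0.0) - b.get(state, 0.0) for state in sorted(set(a) | set(b))}

# Hypothetical income bands for two clusters of ten customers each.
likely_buyers = ["higher income"] * 8 + ["lower income"] * 2
unlikely_buyers = ["higher income"] * 3 + ["lower income"] * 7
scores = discrimination(likely_buyers, unlikely_buyers)
print(scores)
```

A value with a large positive score is typical of the first group, and a large negative score marks it as typical of the second, which is the kind of contrast the income bands show in the picture.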

Accuracy Verification

Finally, let's verify the accuracy of today's clustering algorithm and see how it differs from the decision tree algorithm of the previous article. We open the Mining Accuracy Chart:

We can see that today's clustering algorithm scores 0.72, a slight gap behind the 0.87 of the previous decision tree algorithm. Of course, the score alone cannot decide which of the two algorithms is better. Different mining requirements call for different mining models, and different mining models call for different mining algorithms.

A few points deserve special attention, though. The accuracy of a data analysis algorithm depends on how much underlying data there is: the larger the volume of data, the more accurate the analysis. This is also how the concept of big data came about. Without data, even the best algorithm is helpless; and once the data reaches a certain scale, individual inaccuracies get masked. That is what the era of big data means.

Of course, everything has to be backed by data; we cannot just make claims. The red line is the ideal model, and it backs up what I just said: by the time the overall population on the horizontal axis reaches 50%, the ideal model scores 100. What does 100 mean? Absolutely right! It means that what you are going to do next is something we can predict completely. Of course, when the amount of data is low, there is nothing we can do. Whatever data mining algorithm we use will in theory approach this red line (the ideal model) without ever surpassing it, and this process of getting closer is the driving force of our big data era.

And then there is the worst case, the random-guess model. Its hit rate is always 50%... Because a bicycle purchase has only two outcomes, buy or not buy, its chance of predicting correctly is always one half... 50%...
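The three lines of the accuracy chart can be reproduced on toy data. The sketch below computes a cumulative gains curve: cases are ranked by model score, and for each share of the population we record what fraction of all buyers has been captured. The labels and scores are invented, and it assumes half of the test cases are buyers, as the 50% figure above suggests.

```python
def cumulative_gains(labels, scores):
    """Fraction of all buyers captured after taking the top-k scored cases, k = 1..n."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    total_buyers = sum(labels)
    captured = 0
    curve = []
    for i in order:
        captured += labels[i]
        curve.append(captured / total_buyers)
    return curve

# Hypothetical test set: 1 = bought a bike, 0 = did not. Half are buyers.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

model_curve = cumulative_gains(labels, model_scores)
ideal_curve = cumulative_gains(labels, labels)  # a perfect ranking: all buyers first
random_curve = [(k + 1) / len(labels) for k in range(len(labels))]  # the diagonal

half = len(labels) // 2 - 1  # index of the 50% population mark
print(ideal_curve[half], model_curve[half], random_curve[half])
```

On this toy data the ideal ranking has found every buyer at the 50% mark, the random baseline has found half of them, and the model sits in between, which is the same ordering of lines the chart shows.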

If you are interested in big data, don't forget to click "Recommend".

The power of data mining: buddy, I knew you would do that!
