With the arrival of the big data era, the importance of data mining has become obvious. A handful of simple data mining algorithms form the lowest tier of this work, and this post uses Microsoft's sample database to give a brief summary of them.
Application Scenario Introduction
In fact, application scenarios for data mining are everywhere. Many environments can make use of it; we did not apply it before only because we had not learned how to use our data, or had not realized how important the data was. Now that the IT industry has entered the big data era, let's embrace it. Enough small talk. Here is one of the simplest scenarios: a sales vendor uses data mining on past sales records to predict a list of customers who might buy a product. I believe this is the kind of data many sales organizations want. Of course, there is much more to dig into, for example whether associations between products can drive linked sales (beer and diapers, shelf placement in stores, product layout in site navigation, and so on), or which attributes affect the value of each product. These by-products of the mining process end up in reports.
Technical preparation
A few technical prerequisites are worth mentioning.
(1) We use Microsoft's sample data warehouse (AdventureWorksDW2008R2) and two tables from it. One is an existing history table of bicycle purchases, which also contains attributes of the customers who bought, for mining. The other is the table of prospects we have collected, from which we want to pick out the people who are likely to buy a bike.
First, the structure of the historical sales table:
It contains a primary key, the customer's birthday, name, email, marital status, whether they own a house, whether they own a car, age, commute distance, and other attributes, plus a column recording whether the customer bought a bicycle. This design would not satisfy third normal form in an OLTP system, but this is OLAP. It is not a standard fact table either; a structure like this can be assembled with a view, and we won't go into those basics here.
Another table:
It is also a table of personal information that records attributes of people. It will not match the sales history record field for field, but it contains the same kind of attributes, such as birthday, age, and annual income. Our job is to find, in this table, the people who will buy a bicycle.
(2) The Visual Studio data mining tools, with the database installed and the services configured. Everyone knows this part, so there is not much to say. To reach our goal we will use three data mining algorithms, introduced briefly here:
Microsoft Decision Trees: for discrete attributes, the algorithm makes predictions based on the relationships between the input columns in the dataset. It uses the values, or states, of these columns to predict the state of the column designated as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column.
Microsoft Clustering: the algorithm uses iterative techniques to group the cases in a dataset into clusters that share similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions. Put simply, it finds the cases that have the same kind of attributes.
Microsoft Naive Bayes: a classification algorithm based on Bayes' theorem, provided by Microsoft SQL Server Analysis Services, that can be used for predictive modeling.
These algorithms are built on a number of underlying algorithms; we only need to remember where each one applies and what characterizes it. In the step-by-step analysis that follows I will summarize what each algorithm can do and what it can analyze.
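To make the Naive Bayes idea a little more concrete, here is a minimal sketch in plain Python. It is a textbook version with Laplace smoothing, not Microsoft's implementation, and the function names and toy attributes (`cars`, `age`, `buyer`) are assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count class frequencies and, per class, how often each attribute value occurs."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    values_seen = defaultdict(set)       # attribute -> distinct values seen in training
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            value_counts[(attr, row[target])][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen


def predict_proba(model, case):
    """Return P(class | case), assuming attributes are independent given the class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, value in case.items():
            count = value_counts[(attr, cls)][value]
            # Laplace smoothing so an unseen value does not zero out the product.
            score *= (count + 1) / (n_cls + len(values_seen[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Toy, made-up training cases (not taken from the sample database).
toy_rows = [
    {"cars": "0", "age": "young", "buyer": 1},
    {"cars": "0", "age": "young", "buyer": 1},
    {"cars": "0", "age": "old", "buyer": 1},
    {"cars": "2", "age": "old", "buyer": 0},
    {"cars": "2", "age": "young", "buyer": 0},
    {"cars": "0", "age": "old", "buyer": 0},
]
toy_model = train_naive_bayes(toy_rows, target="buyer")
print(predict_proba(toy_model, {"cars": "0", "age": "young"}))
```

The independence assumption is what makes the method "naive": each attribute contributes its own factor to the score, which keeps training down to simple counting.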
Now to the main topic. With a simple configuration process we can run through the entire data mining workflow, in the following steps.
1. Create a new project and configure the data source
There is nothing to analyze here. We create a data connection to the Microsoft sample database; the configuration is simple, and the instance name plus a user name and password are enough to connect to the data warehouse.
2. Create a data source view
Here we pick out the tables we want to mine; either tables or views can be selected. What we do here is very simple: we bring in the two tables described above from the database connection. The Visual Studio tooling makes this easy, and by following its prompts we can configure the data source view without trouble.
A tip for browsing the data in the two tables: right-click the selected table and choose "Explore Data". You can view the columns, analyze the proportions in the data, use an Excel-like pivot table or pivot chart, or use the chart tools, for example:
This step helps us understand the table data. By analyzing it we can see which columns are worth mining, and we can make a rough guess at which attributes may influence our target behavior (buying a bicycle). For example, common sense says the chance of buying a bike is fairly small for someone whose family owns a car, who is older than 60 or younger than 10, whose commute crosses several city districts (say, from Chaoyang to Haidian), or whose annual income runs into the millions. In the next step we use mining algorithms to check whether these experience-based assumptions are reasonable.
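The kind of pivot-style check described above can also be sketched outside the tool. The snippet below is a rough illustration only: the rows are made up, and the key names `NumberCarsOwned` and `BikeBuyer` are assumptions modeled on the columns mentioned in this post.

```python
from collections import defaultdict


def purchase_rate_by(rows, attribute, target="BikeBuyer"):
    """Group rows by one attribute and return the share of buyers in each group."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for row in rows:
        key = row[attribute]
        totals[key] += 1
        buyers[key] += row[target]  # target is 1 for a buyer, 0 otherwise
    return {key: buyers[key] / totals[key] for key in totals}


# Made-up sample rows, just to show the shape of the input.
sample_rows = [
    {"NumberCarsOwned": 0, "BikeBuyer": 1},
    {"NumberCarsOwned": 0, "BikeBuyer": 1},
    {"NumberCarsOwned": 0, "BikeBuyer": 0},
    {"NumberCarsOwned": 2, "BikeBuyer": 0},
    {"NumberCarsOwned": 2, "BikeBuyer": 1},
    {"NumberCarsOwned": 2, "BikeBuyer": 0},
    {"NumberCarsOwned": 2, "BikeBuyer": 0},
]

for cars, rate in sorted(purchase_rate_by(sample_rows, "NumberCarsOwned").items()):
    print(f"cars owned = {cars}: purchase rate = {rate:.2%}")
```

Running the same function over each candidate attribute gives a quick first impression of which columns separate buyers from non-buyers before any model is trained.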
3. Create a mining structure
We will go through this step in more detail, one sub-step at a time.
(1) In the solution's "Mining Structures" folder, right-click and create a new mining structure. Here we choose "From existing relational database or data warehouse" to define the mining structure.
(2) Click Next and select an algorithm. Several commonly used data mining algorithms are listed here; we select the "Microsoft Decision Trees" algorithm.
(3) Click Next and select the data source view to use.
(4) We choose the "vTargetMail" table above as the case table. The case table is simply our existing history table. Click Next.
(5) This opens the page for specifying the training data, shown below. Several columns here are important; first look at the picture:
A few columns here need to be configured by hand. Key column: this is our primary key column, which Visual Studio identifies on its own. Input columns: these are the columns we tick manually because we think they will influence the value we want to predict. Here we ticked age, commute distance, English education, English occupation, marital status, number of cars owned, total number of children, number of children at home, region, annual income, and so on. Our own inference may not always be accurate, but Visual Studio can also calculate from the data which columns are likely to have an influence and offer them for reference. To see that, click the "Suggest" button.
As you can see, Visual Studio has helpfully listed the columns likely to have an influence. The score estimates how strongly each column affects the purchase: age is the most influential factor, followed by number of cars owned, and then total number of children. These values are calculated from a sample. Tick Input and the corresponding columns will be selected. For the predictable column we select the bicycle purchase column "BikeBuyer". The first column here marks detail columns: the mining results support drill-through to details, and the columns needed to display those details can be ticked here.
In fact, we can already guess that buying a bicycle is closely related to age. The elderly and young children are probably not very likely to ride, heh. Of course this is only speculation, so let's keep digging.
(6) Click Next to reach the page that specifies each column's content type and data type, for example whether a column is continuous or discrete. You can click Detect to have them inferred.
(7) Click Next to configure two fairly important parameters of the data set: one is the proportion of the data used for the model versus testing, the other is the maximum number of cases. First look at the picture, then I will explain what they mean.
The explanation in Visual Studio is quite clear. The first value splits the data: one part is used to train the mining model, and the rest is held back to validate it. Put simply, we set aside part of the data as test data and later use it to check whether the model we built is correct. The second value is the maximum number of cases, where we can set a limit.
(8) Click Next, give the mining model a name, and tick "Allow drill through".
Our Microsoft Decision Trees mining model is now built; see the diagram:
The next step is to analyze the mining results, which is the best part. We will go through it gradually.
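The holdout idea from step (7) can be sketched in a few lines of Python. This is only an illustration: the function name, the 30% default, and the fixed seed are assumptions of mine, not values taken from the wizard.

```python
import random


def holdout_split(cases, test_fraction=0.3, max_test_cases=None, seed=42):
    """Shuffle the cases and hold back a fraction of them for testing.

    max_test_cases optionally caps the size of the test set.
    Returns (training_cases, test_cases).
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    if max_test_cases is not None:
        n_test = min(n_test, max_test_cases)
    return shuffled[n_test:], shuffled[:n_test]


train, test = holdout_split(range(100), test_fraction=0.3)
print(len(train), "training cases,", len(test), "test cases")
```

The model only ever sees the training cases; the held-back cases are what make a later accuracy check meaningful, because the model has not been fitted to them.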
Results analysis
Before analyzing, we first deploy the solution to the local Analysis Services database.
The mining model designer has several tabs. The first is our mining structure, and the second is the mining model, showing the input columns, predictable column, key column, and other settings we configured. There is not much to say here; see the picture:
Next we focus on the Mining Model Viewer, which shows the results we have mined and provides the options for building our analysis. First look at the diagram:
The whole view is a tree laid out horizontally from the root. From the legend we can see that red is the share with value 1 (bought a bicycle) and blue is the share that did not buy. One option here is particularly important: Background. By default it covers all cases, meaning the view reflects all the facts. Since our requirement is to analyze customers who buy bicycles, we choose the value 1 (will buy). If your requirement is different, you can instead choose the customers who do not buy and study that group's characteristics. Here we set the background to 1:
Look at the analysis results:
Across the tree, the darker a node's color, the more it contains of the value we are looking for. Each node represents a range of state values and can be expanded in turn to show its child nodes. The split closest to the root is the most important factor affecting the result. We can see that the number of cars in the family (Number Cars Owned) is the most important factor in deciding whether a customer buys a bicycle. Within this factor, customers with 0 cars are the most likely to buy, followed by customers with one car. Let's continue the analysis.
Hover the mouse over the Number Cars Owned = 0 node to view its details:
We can see that customers in this node buy a bicycle with a probability of 63.15%: out of 4,006 cases, 2,530 bought. These customers should be the vendor's favorites and the main focus for marketing, for obvious reasons. Of course, plenty of people with a car at home also want to buy a bicycle, and we cannot give up on them either. For now, let's expand this node and continue.
Here the age factor we guessed at earlier comes to the surface: for customers with no car at home who are under 45, the probability of buying a bicycle jumps to 73.49%, and the picture proves it. Someone under 45 with no car, in the United States, is probably an ordinary person on a modest budget, so buying a bike to ride is quite normal. Let's keep going.
Next, the third layer of factors starts to show, namely geographic location. Look at the picture: customers outside North America have a higher probability of buying, and among them those with 0 children at home reach a purchase rate of 92.5%. What more could a vendor ask for? Send your salespeople to these customers and wait for the results. Let's characterize this group: no car at home, under 45, not in North America, and no children at home. What kind of customer is this? Either a young person just getting by, or a well-off person with refined taste whose children have already grown up and left home. If you are a salesperson facing this group, all that is left is to collect your promotion and your raise.
Now let's analyze the other branch, customers who do have a car in the family. First look at the picture:
Result: for customers aged 37 to 53, with a commute of less than 10 miles, a number of children at home that is neither 3 nor 4, and an annual income above $58,000, the probability of buying is about 74.83%.
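A path through the tree is just a chain of conditions, so the branch described above can be written as a simple predicate. This is a hypothetical encoding: the field names, the numeric commute distance, and the inclusive age bounds are my assumptions, since the post only gives the ranges in words.

```python
def matches_car_owner_segment(customer):
    """Return True if a customer falls on the car-owner branch described above.

    Assumed fields (hypothetical): NumberCarsOwned, Age, CommuteMiles,
    ChildrenAtHome, YearlyIncome. Age bounds are treated as inclusive.
    """
    return (
        customer["NumberCarsOwned"] > 0
        and 37 <= customer["Age"] <= 53
        and customer["CommuteMiles"] < 10
        and customer["ChildrenAtHome"] not in (3, 4)
        and customer["YearlyIncome"] > 58000
    )


# Made-up customers for illustration.
in_segment = {
    "NumberCarsOwned": 1, "Age": 40, "CommuteMiles": 4,
    "ChildrenAtHome": 1, "YearlyIncome": 70000,
}
out_of_segment = dict(in_segment, ChildrenAtHome=3)

print(matches_car_owner_segment(in_segment))
print(matches_car_owner_segment(out_of_segment))
```

Writing the leaf as a predicate makes it easy to count how many rows of any table fall into the segment, which is a handy sanity check against the case counts the viewer reports.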
Here we can look at the Dependency Network view to see how all the factors affect the decision to buy a bicycle:
By dragging the slider on the left, you can see in turn how strongly each factor influences the purchase of a bicycle. The most important is the number of cars owned, the second is age, and the third is region.
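One common textbook way to rank how strongly each input column relates to the target is information gain. The sketch below uses it purely for illustration; I am not claiming it is the scoring method Analysis Services applies, and the toy rows and attribute names are made up.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting the rows on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Toy cases: "cars" separates buyers perfectly, "region" tells us nothing.
toy_cases = [
    {"cars": 0, "region": "A", "buyer": 1},
    {"cars": 0, "region": "B", "buyer": 1},
    {"cars": 2, "region": "A", "buyer": 0},
    {"cars": 2, "region": "B", "buyer": 0},
]

ranking = sorted(
    ["cars", "region"],
    key=lambda attr: information_gain(toy_cases, attr, "buyer"),
    reverse=True,
)
print(ranking)
```

An attribute with high gain would be a natural candidate for a split near the root of a tree, which matches the intuition that the first split reflects the most important factor.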
That is the analysis of the results inferred by the decision tree algorithm. The next step is to verify the accuracy of these results, add other algorithms for comparison, and then use the most accurate algorithm to find the prospects with the highest probability of buying a bike. For reasons of space we will cover those in detail in a following article. Below I post a few result charts for everyone to think about:
Data Mining Accuracy Chart:
We use the held-back test data to build a validation chart. It shows the ideal (best possible) model, the random-guess model as the worst case, and our decision tree prediction model, along with their probabilities. I won't analyze the chart's dimensions and values here; they are left for the reader to work through.
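The idea behind such a chart can be sketched as a cumulative gains calculation: rank the test cases by predicted probability and see what share of the real buyers is captured as we contact more of the list. The function and the scored cases below are made up for illustration and are not output from the tool.

```python
def cumulative_gains(scored_cases, steps=10):
    """scored_cases: list of (predicted_probability, actually_bought) pairs.

    Returns a list of (fraction_of_population, fraction_of_buyers_captured).
    """
    ranked = sorted(scored_cases, key=lambda pair: pair[0], reverse=True)
    total_buyers = sum(bought for _, bought in ranked)
    if total_buyers == 0:
        raise ValueError("need at least one actual buyer to compute gains")
    points = []
    for step in range(1, steps + 1):
        cutoff = len(ranked) * step // steps
        captured = sum(bought for _, bought in ranked[:cutoff])
        points.append((step / steps, captured / total_buyers))
    return points


# Made-up test cases: (model score, 1 if the customer really bought).
scored = [
    (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
    (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 1), (0.05, 0),
]

for population, captured in cumulative_gains(scored, steps=5):
    print(f"top {population:.0%} of customers -> {captured:.0%} of buyers")
```

A random model would capture buyers roughly in proportion to the population contacted, while an ideal model would capture all of them first; a useful model sits between those two lines.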
We can also draw a profit chart based on our predictive model:
The profit chart estimates how much profit acting on this analysis can bring us, which is of course what the vendor cares about.
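As a rough sketch of what a profit curve measures: contact customers in order of predicted probability and track revenue from actual buyers minus the cost of each contact and a fixed campaign cost. The parameter names and all the numbers below are my own assumptions for illustration.

```python
def profit_curve(scored_cases, revenue_per_sale, cost_per_contact, fixed_cost):
    """scored_cases: list of (predicted_probability, actually_bought) pairs.

    Returns a list of (customers_contacted, profit) when contacting the
    highest-scored customers first.
    """
    ranked = sorted(scored_cases, key=lambda pair: pair[0], reverse=True)
    curve = []
    buyers = 0
    for contacted, (_, bought) in enumerate(ranked, start=1):
        buyers += bought
        profit = buyers * revenue_per_sale - contacted * cost_per_contact - fixed_cost
        curve.append((contacted, profit))
    return curve


# Made-up scored test cases and made-up costs.
scored_for_profit = [
    (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
    (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 1), (0.05, 0),
]
curve = profit_curve(scored_for_profit, revenue_per_sale=15, cost_per_contact=3, fixed_cost=5)
best_contacted, best_profit = max(curve, key=lambda point: point[1])
print(f"best profit {best_profit} after contacting {best_contacted} customers")
```

The point of the curve is its peak: past a certain depth in the ranked list, each extra contact costs more than the sales it is expected to bring in.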
The last thing left is to go through the prospect table we set aside and identify the customers who are likely to buy a bicycle. This is the ultimate product of data mining: predicting what will happen next. We will cover this step in a later post.
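In outline, that final step amounts to scoring every prospect and keeping the highest-ranked ones. The sketch below only shows the shape of the idea: `toy_score` is a stand-in for a real model, its fallback value is a made-up placeholder, and the prospect fields are hypothetical.

```python
def rank_prospects(prospects, score, top_n=3):
    """Score each prospect and return the top_n as (score, prospect) pairs."""
    scored = [(score(p), p) for p in prospects]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return scored[:top_n]


def toy_score(prospect):
    """Stand-in for a trained model (an assumption for illustration only)."""
    if prospect["NumberCarsOwned"] == 0 and prospect["Age"] < 45:
        return 0.7349  # node probability read off the tree in this post
    return 0.30  # made-up placeholder for every other prospect


# Made-up prospects.
prospects = [
    {"Name": "A", "NumberCarsOwned": 0, "Age": 30},
    {"Name": "B", "NumberCarsOwned": 2, "Age": 50},
    {"Name": "C", "NumberCarsOwned": 0, "Age": 40},
]

for probability, prospect in rank_prospects(prospects, toy_score, top_n=2):
    print(prospect["Name"], probability)
```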
Conclusion: the big data era has arrived. As rank-and-file coders we should be ready at all times, both to keep our jobs and to keep our energy for writing code. Once we have trained up to a certain level, we can face big data calmly and shout: "You beast, let go of that data and let me handle it!" It's National Day, so I wish everyone a happy holiday.
Parts of this post are based on Microsoft's official examples. For details on the Microsoft Decision Trees algorithm, see http://technet.microsoft.com/zh-cn/library/ms175312.aspx
(Original) The big data era: a summary of data mining knowledge points based on the Microsoft sample database