Microsoft Data Mining Algorithms: Microsoft Decision Tree Analysis Algorithm (1)


Introduction:

The Microsoft Decision Trees algorithm is a classification and regression algorithm used for predictive modeling of both discrete and continuous attributes.

For discrete attributes, the algorithm makes predictions based on the relationships between the input columns in the dataset. It uses the values of these columns (also called states) to predict the state of a column that is designated as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column. For example, in a scenario for predicting which customers might purchase a bicycle, if nine out of ten younger customers bought a bicycle but only two out of ten older customers did, the algorithm infers that age is a good predictor of bicycle purchase. The decision tree makes its predictions based on this tendency toward a particular outcome.

For continuous attributes, the algorithm uses linear regression to determine where to split the decision tree.
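To make the bicycle example concrete, here is a minimal sketch using scikit-learn rather than the SSAS implementation, with made-up numbers that mirror the 9-of-10 versus 2-of-10 example above; a one-level tree trained on this toy data splits on age, which is exactly the kind of relationship the algorithm looks for.

```python
# Illustrative sketch only (scikit-learn, not SSAS): 9 of 10 younger customers
# bought a bike, 2 of 10 older ones did, so a decision tree splits on age.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data mirroring the example in the text: age in years,
# bike_buyer = 1 if the customer bought a bicycle.
ages = [[25]] * 10 + [[60]] * 10
bike_buyer = [1] * 9 + [0] * 1 + [1] * 2 + [0] * 8

tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(ages, bike_buyer)

# The single split lands between the two age groups, i.e. age is the predictor.
print(export_text(tree, feature_names=["age"]))
```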

If more than one column is set as a predictable column, or if the input data contains nested tables that are set to be predictable, the algorithm generates a separate decision tree for each predictable column.

The principle of the algorithm:

The Microsoft Decision Trees algorithm builds a data mining model by creating a series of splits in the tree. These splits are represented as "nodes". Whenever an input column is found to be significantly correlated with the predictable column, the algorithm adds a node to the model. How the algorithm determines a split depends mainly on whether the column it predicts is continuous or discrete.

The Microsoft Decision Trees algorithm uses feature selection to guide the selection of the most useful attributes. Feature selection is used by all SQL Server Data Mining algorithms to improve performance and the quality of the analysis. It matters because it prevents unimportant attributes from consuming processor time. If you use too many input or predictable attributes when you design a data mining model, the model can take a very long time to process or even run out of memory. The methods used to decide whether to split the tree include industry-standard metrics for entropy and Bayesian networks. For more information about selecting meaningful attributes and how those attributes are scored and ranked, see Feature Selection (Data Mining).
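As a hedged illustration of the entropy metric mentioned above (only one of the possible scoring methods, and not the exact SSAS formula), the snippet below computes the information gain of the age split from the earlier toy example.

```python
# Entropy-based information gain for the 9/10 vs 2/10 bike-buyer example.
# Numbers are the toy example from earlier, not real AdventureWorks data.
from math import log2

def entropy(p_yes: float) -> float:
    """Shannon entropy of a binary outcome with P(yes) = p_yes."""
    if p_yes in (0.0, 1.0):
        return 0.0
    return -(p_yes * log2(p_yes) + (1 - p_yes) * log2(1 - p_yes))

# Before the split: 11 buyers out of 20 customers.
parent = entropy(11 / 20)

# After splitting on age: young group 9/10 buyers, older group 2/10 buyers,
# each group holding half of the cases.
children = 0.5 * entropy(9 / 10) + 0.5 * entropy(2 / 10)

print(f"information gain of splitting on age: {parent - children:.3f}")
```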

A common problem with data mining models is that the model becomes too sensitive to small differences in the training data, which is known as over-fitting or over-training. An overfitted model does not generalize well to other datasets. To keep the model from overfitting any particular set of data, the Microsoft Decision Trees algorithm uses techniques for controlling the growth of the tree. For a more in-depth explanation of how the algorithm works, see the Microsoft Decision Trees Algorithm Technical Reference.
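The sketch below is not the SSAS growth-control mechanism itself; it only illustrates on synthetic data, with scikit-learn, why restricting tree growth (depth, minimum leaf size) reduces over-fitting: the unrestricted tree fits the training data almost perfectly but does worse on held-out data.

```python
# Illustration of growth control on synthetic data (not the SSAS mechanism):
# compare an unrestricted tree with a depth/leaf-size-limited one.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for params in ({}, {"max_depth": 4, "min_samples_leaf": 50}):
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_tr, y_tr)
    print(params or "unrestricted",
          "train:", round(tree.score(X_tr, y_tr), 3),
          "test:", round(tree.score(X_te, y_te), 3))
```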

Technical preparation

A few prerequisites are worth mentioning first.

(1) We use Microsoft's sample data warehouse (AdventureWorksDW2008R2), and in particular two tables: one is a historical table of bicycle purchases, containing a number of attributes of the customers who bought bicycles that are useful for mining; the other is a table of prospective customers from which we want to identify the people who are likely to buy a bicycle.

Structure of the historical sales table:

It contains a primary key, plus attributes such as the customer's birth date, name, email, marital status, whether they own a house, whether they own a car, age, and commute distance, together with a column indicating whether or not the customer bought a bicycle. Of course, this design does not follow third normal form as an OLTP schema would, but this is OLAP; nor is it a textbook fact table. Such a structure can be assembled with a view, and we will not cover those basics here.

Another table:

It is also a customer information table recording attributes of prospective customers. It does not hold exactly the same information as the sales history table, but it shares a common set of attributes, such as birth date, age, and yearly income. What we want to do is find, from this table, the people who are likely to buy a bicycle.

(2) The VS data mining tools, with the database installed and the services configured. This is all familiar and there is nothing special to say, but to achieve our goal we will use three data mining algorithms, briefly introduced below.

Microsoft Decision Trees: for discrete attributes, the algorithm predicts relationships between the input columns in the dataset. It uses the values (states) of these columns to predict the state of the column designated as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column.

Microsoft Clustering: the algorithm uses iterative techniques to group the cases in a dataset into clusters with similar characteristics. These groupings are useful for exploring data, identifying anomalies, and creating predictions. Put simply, it finds the cases whose attributes are alike.

Microsoft Naive Bayes: the Microsoft Naive Bayes algorithm is a classification algorithm based on Bayes' theorem, provided by Microsoft SQL Server Analysis Services, and can be used for predictive modeling.

Each of these is backed by well-known underlying algorithms; for our purposes we only need to remember their application scenarios and characteristics. In the step-by-step walkthrough below I will summarize what each algorithm can do and what it can analyze.
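Since the Microsoft algorithms run inside Analysis Services, the following sketch only illustrates the three roles described above with scikit-learn stand-ins on hypothetical (age, yearly income) data; the buying rule and all numbers are invented purely to give the models something to learn.

```python
# Scikit-learn stand-ins for the three roles described above (not the
# Microsoft implementations), on fabricated (age, income) -> bike_buyer data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=300)
income = rng.integers(20_000, 120_000, size=300)
# Made-up rule just to have a learnable signal: younger, modest-income
# customers are more likely to buy.
bike_buyer = ((age < 45) & (income < 80_000)).astype(int)
X = np.column_stack([age, income])

# Classification / prediction: decision tree and naive Bayes.
print("tree acc :", DecisionTreeClassifier().fit(X, bike_buyer).score(X, bike_buyer))
print("bayes acc:", GaussianNB().fit(X, bike_buyer).score(X, bike_buyer))

# Segmentation: clustering groups similar cases without using the label.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```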

Now let's get to the main topic. With a simple sequence of configuration steps we can carry out the entire data mining process, as follows.

1. Create a new project and configure the data source

There is not much to analyze here: create a data connection based on the Microsoft sample database. The configuration is simple; supply the instance name, user name, and password to connect to the data warehouse.

2. Create a data source view

Here we filter out the tables we want to mine; we can choose tables or views. What we do here is very simple: bring in the two tables above through the database connection. The VS tooling makes this easy; by following its prompts we can configure the data source view without trouble.

A quick tip for browsing the data in the two tables: right-click a selected table and choose "Browse Data" to view its columns, analyze the distribution of values, and use a pivot-table-like view (similar to Excel's) or the chart tools, for example:

This step helps us understand the table data. By exploring it we can see which column attributes in the table are worth mining, and we can also make some common-sense guesses about which attributes might influence our target behavior (buying a bicycle). For example: a family that already owns a car, customers older than 60 or younger than 10, a commute that crosses several urban districts (say, commuting from Chaoyang to Haidian), a yearly income in the millions, and so on, all suggest a fairly low chance of buying a bicycle. In the next steps we use the mining algorithms to verify whether these experience-based assumptions are reasonable.
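For readers who prefer to poke at the data in code rather than through the pivot view, here is a rough stand-in for the same exploration. The frame below is fabricated, and the column names ("Age", "BikeBuyer") are assumptions that merely mirror the case table; adjust them to your actual schema.

```python
# A rough code stand-in for the "Browse Data" / pivot exploration, on
# fabricated data with assumed column names.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(18, 80, size=500),
    "BikeBuyer": rng.integers(0, 2, size=500),
})

# Bucket age and look at the bicycle-purchase rate per bucket -- the same kind
# of eyeballing the pivot view gives us.
df["age_band"] = pd.cut(df["Age"], bins=[0, 30, 45, 60, 120])
print(df.groupby("age_band", observed=True)["BikeBuyer"].agg(["count", "mean"]))
```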

3. Create a mining structure

In this step we go into a little more detail and explain the process.

(1) In the solution, right-click the mining structures folder and create a new mining structure; here we choose "from an existing database or Data Warehouse" to define the data mining structure.

(2) Click Next and select an algorithm. Several commonly used data mining algorithms are listed here; we select the "Microsoft Decision Trees" algorithm.

(3) Click Next and select the data source view to use.

(4) We choose the "vtargetmail" table described above as the case table; the so-called case table is simply our existing history table. Click Next.

(5) Next we specify the training data, shown below. Several of the columns here are important; first look at the picture.

A few columns need to be configured by ourselves. The key column is our primary key column, which VS can identify on its own. The input columns are chosen according to the target we want to predict: we manually tick the columns we think will affect the predictable column, here age, commute distance, education, occupation, marital status, number of cars owned, total number of children, number of children at home, region, yearly income, and so on. Of course, our own inference may not always be accurate, but VS can also calculate from the data which columns are likely to have an impact, as a reference for our selection; here we click the "Suggest" button.

See, VS has already laid out the likely influential columns for you. The score here estimates how strongly each column affects bicycle purchase: age is the most influential factor, followed by the number of family cars, and then the total number of children in the family; note that these values are computed from a sample of the data. Clicking Input selects the corresponding columns. Our predictable column is the bicycle purchase column, "Bikebuyer". The first column in the grid marks detail columns: the mining results support drill-through to details, so here we can check the columns whose details we want displayed.
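As a rough analogue of what the "Suggest" button does (this is not the SSAS scoring formula), candidate columns can be ranked by mutual information with the target. The data, the label rule, and the column names below are fabricated for illustration only.

```python
# Rank candidate input columns by mutual information with the target.
# Fabricated data; column names only mirror AdventureWorks-style attributes.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "NumberCarsOwned": rng.integers(0, 5, size=n),
    "TotalChildren": rng.integers(0, 6, size=n),
    "YearlyIncome": rng.integers(20_000, 150_000, size=n),
})
# Fabricated label loosely driven by age and cars, so the ranking finds something.
df["BikeBuyer"] = ((df["Age"] < 45) & (df["NumberCarsOwned"] < 2)).astype(int)

candidates = ["Age", "NumberCarsOwned", "TotalChildren", "YearlyIncome"]
scores = mutual_info_classif(df[candidates], df["BikeBuyer"], random_state=0)
print(pd.Series(scores, index=candidates).sort_values(ascending=False))
```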

In fact, we can already guess that buying a bicycle is fairly strongly related to age; the elderly and young children are presumably not very likely to ride. Of course, this is just speculation, so let's keep digging.

(6) Click Next to reach the step that specifies the columns' content and data types. It shows, for each column, whether its values are continuous or discrete; you can click Detect to have VS infer them.

(7) Click Next to reach two important parameters for the dataset: one controls how the data is split between building and testing the model, the other is the maximum number of cases. Look at the picture first, and then I will explain what they mean.

As VS explains, the first value controls how the dataset is divided: part of the data is used to build the data mining model, and the rest is held out to validate its correctness. Put simply, we set aside some of the data as test data so that later we can use it to check whether the mining model we built is correct. The second value is the maximum number of cases in the testing set, which lets us put a cap on it.
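The two settings amount to an ordinary holdout split: reserve part of the cases for testing, optionally capped at a maximum count. Here is a small sketch of the same idea on fabricated data, not run through SSAS.

```python
# Holdout split with a test-percentage and a cap on test cases, on
# fabricated data (illustrative only).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"Age": rng.integers(18, 80, size=5000),
                   "BikeBuyer": rng.integers(0, 2, size=5000)})

test_percentage = 0.30      # analogous to the percentage of data for testing
max_test_cases = 1000       # analogous to the maximum-number-of-cases cap

train_df, test_df = train_test_split(df, test_size=test_percentage, random_state=0)
test_df = test_df.head(max_test_cases)   # apply the cap

print(len(train_df), "training cases,", len(test_df), "test cases")
```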

(8) Click Next, give the data mining model a name, and tick the option to allow drill-through to details.

With that, our Microsoft Decision Trees mining model has been built; see the diagram.

The next step is to analyze the mining results, which is the most interesting part; we will work through it step by step.

Results analysis

Before analyzing, we first deploy the solution to the local Analysis Services database.

The mining model designer has several tabs: the first is the mining structure; the second is the mining models tab, showing the input columns, predictable column, key column, and the other settings we configured. There is nothing more to say here; see the picture.

Then we focus on the Mining Model Viewer tab, which contains the mined results and the views from which we build our analysis; first look at the diagram:

The display is a tree structure laid out horizontally. From the legend we can see that red corresponds to the value 1, the proportion of customers who bought a bicycle, and blue to those who did not. One option here matters more than the rest: the background. The default is all values, meaning the coloring reflects every case. Since our requirement is to analyze the customers who buy bicycles, we set the background to the value 1 (will buy) for the analysis; of course, if the requirement were different, we could just as well study the characteristics of the group that does not buy. Here we choose a background of 1:

Look at the analysis results:

In the overall tree, the deeper the color of a node, the more concentrated the value we are after. Each node represents an interval of a state value, and each node can be expanded in turn to show the states of its child nodes. The node closest to the root represents the most important factor affecting the result. We can see that the number of cars owned is the most important factor in deciding whether someone buys a bicycle, and within that split the customers who own zero cars have the highest probability of buying, followed by the customers who do own a car. Let's analyze further:

Hover the mouse over the node with Number Cars Owned = 0 to view the details:

We can see that these customers buy a bicycle with a probability of 63.15%: there are 4,006 cases, of which 2,530 made a purchase. This group should be the manufacturers' favorite and their main target for further mining, for reasons you can guess. Of course, plenty of people who do have a car at home also want to buy a bicycle, and we cannot give them up either; for now we click open this node and analyze further.
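A quick sanity check of those tooltip figures: the raw ratio of buyers to cases in this node is about 63.2%; the viewer's 63.15% may differ marginally if the reported probability is smoothed against the overall distribution, which is an assumption on my part.

```python
# Sanity-checking the tooltip: 2,530 buyers out of 4,006 cases in the
# "0 cars owned" node. The viewer's 63.15% may be slightly smoothed.
cases, buyers = 4006, 2530
print(f"raw purchase rate: {buyers / cases:.2%}")
```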

And there it is: the age factor we speculated about has surfaced. For customers with no car at home and an age under 45, the probability of buying a bicycle soars to 73.49%, and the picture is there to prove it. Under 45 and without a car, in the US that is pretty much the ordinary-guy bracket, so buying a bicycle to get around is perfectly normal. Let's keep analyzing.

Next, the third-level factor surfaces as well: region. As the picture shows, customers located outside North America have a higher purchase probability, and within that group those with zero children at home reach a purchase rate of 92.5%. Manufacturers, what are you waiting for; when you meet this kind of customer, just let your sales team sit back and collect the results. Let's profile this segment: no car at home, under 45 years old, not in North America, and no children at home... What kind of customers are these? Either plain ordinary-guy types, or genuinely well-off types with taste whose children have already married and moved out. Well, if sales are facing this group of customers, all that is left is the promotion, the raise, and the happy ending...

Now let's analyze the other branch, the customers who do have a car at home; first look at the picture:

Result: customers aged 37 to 53, with a commute of less than 10 miles, whose number of children is neither 3 nor 4, and whose yearly income is above $58,000, have roughly a 74.83% chance of buying.

Here we can also look at the Dependency Network diagram to see how all the factors influence the bicycle-buying behavior:

By dragging the slider on the left, you can see in turn how strongly each factor influences the purchase of bicycles: the most important is the number of cars owned, the second is age, then region, and so on.

The above is the analysis of the results inferred by the decision tree algorithm. The next step is to verify the accuracy of these results, add several other algorithms for comparison, and then, using the algorithm with the best accuracy, find the group of customers with the highest probability of buying a bicycle. For reasons of space, we will continue that detailed analysis in the next article; below I show a few of the result charts for you to ponder:

Data Mining Accuracy Chart:

We use the remaining test data to build a validation chart. It shows the ideal best model, the worst-case random-guess model, and, with its probabilities, our decision tree prediction model in between. I will not walk through the chart's axes and values here; explore them yourself.
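For readers who want to see what sits behind such a chart, here is a hedged sketch of how a cumulative-gains (lift) curve is computed from predicted probabilities on test data; the labels and scores below are synthetic, not the AdventureWorks results.

```python
# Cumulative-gains (lift) computation on synthetic labels and scores,
# illustrating the idea behind the accuracy chart (not the SSAS chart itself).
import numpy as np

def cumulative_gains(y_true, y_score):
    """Fraction of all buyers captured among the top-ranked cases."""
    order = np.argsort(-np.asarray(y_score))
    hits = np.cumsum(np.asarray(y_true)[order])
    return hits / hits[-1]

# Synthetic test labels and model scores, just to exercise the function.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = y_true * 0.5 + rng.random(200) * 0.5   # loosely correlated with the label

gains = cumulative_gains(y_true, y_score)
top_20 = int(0.2 * len(gains))
print(f"top 20% of ranked cases captures {gains[top_20 - 1]:.0%} of all buyers "
      f"(random would capture about 20%, the ideal model much more)")
```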

We can also draw a profit chart based on our predictive model:

The so-called profit chart shows how much profit this analysis can bring us, which, of course, is what the manufacturer really cares about.
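A profit chart layers simple economics on top of the same ranking: assume a cost per contacted customer and a revenue per customer who buys (both figures below are invented), then compute cumulative profit as we work down the ranked list. The sketch re-creates the same synthetic scores as the previous snippet.

```python
# Cumulative profit along a ranked list of customers, on synthetic data with
# invented cost/revenue figures (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                 # synthetic test labels
y_score = y_true * 0.5 + rng.random(200) * 0.5        # synthetic model scores

cost_per_contact = 5.0        # assumed marketing cost per customer contacted
revenue_per_buyer = 40.0      # assumed revenue when a contacted customer buys

order = np.argsort(-y_score)
cum_buyers = np.cumsum(y_true[order])
contacted = np.arange(1, len(y_true) + 1)
profit = cum_buyers * revenue_per_buyer - contacted * cost_per_contact

best = int(np.argmax(profit))
print(f"profit peaks after contacting {best + 1} customers: {profit[best]:.0f}")
```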

Data modeling material is taken from the original post: Big Data Era: A Summary of Data Mining Knowledge Points Based on the Microsoft Sample Database (Microsoft Decision Tree Analysis Algorithm)

