The latest data mining capabilities in Microsoft SSAs are required in a project, although the data mining capabilities in SSAS have never been understood in the past when projects were often used in the SSAS cube (that is, Cube). So through the project demand this Dongfeng recently learned the next data mining for SSAS, here first write a blog to do a brief summary.
When it comes to data mining, we first need to know what SSAS data mining can do and why data mining is needed. Let's take a look at an example, assuming we now have a table in our database called Customersboughtcarssurvey, which records information about a company's customers buying a vehicle.
CREATE TABLE [dbo].[Customersboughtcarssurvey]( [CustomerID] [int] IDENTITY(1,1) not NULL,--primary Key ID is self-increment int type [Name] [nvarchar]( -)NULL,--Customer Name [ Age] [int] NULL,--Customer Age [Sex] [Char](1)NULL,--Customer Gender [Nation] [nvarchar]( -)NULL,--Country of Customer [ City] [nvarchar]( -)NULL,--The city where the customer resides [yearlysalary] [float] NULL,--Customer Annual Income [Boughtcar] [bit] NULL,--whether the customer has purchased a car CONSTRAINT [Pk_customersboughtcarssurvey] PRIMARY KEY CLUSTERED ( [CustomerID] ASC) with(Pad_index= OFF, Statistics_norecompute= OFF, Ignore_dup_key= OFF, Allow_row_locks= on, Allow_page_locks= on) on [PRIMARY]) on [PRIMARY]
Now if there are 10000 rows of records in this table, of which the [customerid],[name],[age],[sex],[nation],[city],[yearlysalary] columns in these 10000 records are valued, only the columns [ BOUGHTCAR] Only 4000 rows of data are recorded and the remaining 6000 rows are null values, because 4,000 customers in table Customersboughtcarssurvey have done market research, so they know if they have bought a car, So these 4,000 customers have a value on column [Boughtcar], but 6,000 customers haven't done a market survey, so the [Boughtcar] column with 6000 rows of data is null. Now all we have to do is use an algorithm to predict which of the 6,000 customers are likely to buy a car in the future? How likely are you to buy a car?
So around this question, let's look at the structure of the table Customersboughtcarssurvey, and we can see that the fact that the customer is buying a car may depend on [age],[sex],[nation],[city],[yearlysalary] The values of these columns, such as small-age people are unlikely to buy a car, men are more likely to buy a car than women (just for example, no other meaning ...) , people with a high annual income may be more likely to buy cars than people with low annual incomes. So simply, we now need to find an algorithm that infers the column based on the values of each row [age],[sex],[nation],[city],[yearlysalary] in table Customersboughtcarssurvey [ Boughtcar] is the value 1 or 0, if the probability of 1 appears high or low? If the inferred result is 1 and the probability is greater than 80%, then the likelihood of the customer buying a car is very high, for the company should focus on the customer.
So now our problem has become that we need a function logic that infers the result of a column [Boughtcar] based on the value of the columns [age],[sex],[nation],[city],[yearlysalary], using a mathematical formula to express the
[Boughtcar] =function ([age],[sex],[nation],[city],[yearlysalary])
The data mining model in SSAS is able to put the above formula into implementation, the function in the formula is a functional logic, the function logic in SSAS is the nine model of data mining algorithm:
- Microsoft Decision Tree Analysis algorithm
- Microsoft Cluster analysis algorithm
- Microsoft Naive Bayes algorithm
- Microsoft Time Series algorithm
- Microsoft Association Rule Analysis algorithm
- Microsoft sequential analysis and clustering algorithms
- Microsoft Neural network analysis algorithm
- Microsoft Linear regression analysis algorithm
- Microsoft Logistic regression analysis algorithm
The introduction of the nine major models of a blogger did a detailed introduction, I am also learning, here is recommended: Big Data era: In layman's Microsoft Data Mining algorithm series.
Selected The above nine models of a mining model, then the next thing to do is to drill the mining model data training to improve the accuracy of the mining model logarithmic prediction, in layman's data training is to make the above formula function logic better correct, can be more accurate formula equals to the left [ The value of Boughtcar]. In the example of this article we said earlier that there are 4000 rows of data in the table Customersboughtcarssurvey [Boughtcar] is a value, the 4000 rows of data is to participate in the market research customers, we have to use these 4000 rows of data to do data training, Improve the accuracy of the data mining model algorithm and then predict the value of the column [Boughtcar] in the remaining 6000 rows of data.
The process of data training is basically this, the training data is divided into two parts, the first part of the data to find a law to obtain an algorithm, and then according to the algorithm to calculate the value of another part of the data, and then compared with the real value, the accuracy of the algorithm is obtained. In our example is the table Customersboughtcarssurvey in 4000 rows [Boughtcar] has the value of the data to do the data training, 4000 rows 30% of the data as the first part of the data to do a logical analysis to get the algorithm, Then the algorithm is calculated to calculate the value of [Boughtcar] of the remaining 70% data in 4000 rows of data, then compare it with the real value, get the accuracy rate, if the accuracy rate can accept our mining model is constructed, We can use the mining model to derive the value of [Boughtcar] from the data in the 6000 rows [Boughtcar] of the table Customersboughtcarssurvey that have no value. If the accuracy is too low to accept, that shows that 4000 rows of data to do data training is not enough, the company's market research department to investigate more customers to prepare more [Boughtcar] column value data to do data training, or choose other mining models from nine large models to see whether the accuracy of the prediction can be improved. This process can be represented by the following diagram.
After understanding the concept of data training, let's look at how to build a data mining structure and mining model in SSAS, where the data mining structure is the container for the mining model, and the mining structure defines which data columns are used for the mining model, and a mining structure can contain multiple mining models. Demonstrates how to establish a mining structure in SSAS.
When we build a mining structure, we can build a mining model by the way, and we build a mining model of decision tree algorithm at the same time.
Then we have to choose which tables to use for the mining structure, a mining structure to select a case table (the cases column), and if a column of the case table does not meet the needs of the mining structure, you can also select several tables associated with the case table as nested tables (in nested column). In this case, we have selected only one case table, Customersboughtcarssurvey.
We then define the input and prediction columns for the decision tree model that we just selected, and at least one column in the mining model is selected as the key Value column, and the key Value column uniquely identifies a row of data in the mining model just like the primary key in the database table, in this case our key-Value column is CustomerID (the key column). And then the input column is equivalent to the parameter we mentioned earlier in the formula function, so we chose [age],[sex],[nation],[city],[yearlysalary] as the input column (in input column). The Prediction column is the return value to the left of the equals sign in the formula above, so we chose column Boughtcar as the prediction column (predictable column). Of course, a column can be either an input column or a prediction column, which is equivalent to a data need to pass in a function for data processing as a return value, so this case data is both an input column and a prediction column.
Then we have to choose how much data to do the algorithm analysis, is equivalent to the previous Data Training section mentioned in the need to take part of the data to derive a mining model algorithm, we mentioned in 4000 rows [boughtcar] column has the value of the data 30% to derive the mining model algorithm, So we configured the test data scale to be 30% (30% is also the default value).
Finally we have a name for the defined mining structure and the mining model, and the whole definition is over.
After building the mining structure we can deploy it to the SSAS server, and then look at the accuracy of the algorithm for the current mining model, the exact rate of the mining model we built in this example is only 16.67% very low, and one important reason is that we are using too little data for data training. This number can be significantly improved after the amount of data trained is increased.
Finally, we can call DMX statements on the built mining model to do the data prediction, we use the graphical designer to construct the data prediction query, the structure on the left is the mining model column, the right table structure represents the data row we want to predict (equivalent to the 6000 lines we mentioned earlier in this article [Boughtcar] Columns with no values). The middle line represents the mapping of the input column and the predictive data structure column of the mining model, which allows the data column of the forecast to be passed into the input column of the mining model to derive the value of the predicted column, which we can manually edit.
Of course, we can not use the designer to manually write DMX statements to do queries
Finally, we can get three rows of rows [bought Car] values that are originally null by the query and calculate the predicted values by the mining model:
Microsoft SQL Server Analysis Service Data mining technology