Objective
This article continues our summary series on the Microsoft data mining algorithms. The earlier articles introduced the main algorithms in detail, and for easier reference I have put together an index, "Big Data era: Easy to learn Microsoft Data Mining algorithm summary serial", which interested readers can consult. The algorithm summarized here is the Microsoft Sequence Clustering algorithm. It extends the association rules algorithm from the previous article: it mines the association results at a finer granularity, uncovering the order in which items occur within each kind of case, and those ordering rules can then be used to guide customers' purchases.
Application Scenario Introduction
As the name suggests, the Microsoft Sequence Clustering algorithm first groups the cases with a clustering algorithm and then mines the sequences of the cases within each cluster. Its focus is on the ordering rules between cases. The Microsoft Association Rules algorithm introduced earlier focuses on the relationships between cases and ignores the order in which they occur. Put simply, association rules study "the relationship between the chicken and the egg", while sequence clustering studies "which came first, the chicken or the egg". In the previous article we dug out the groups of products with the strongest associations, such as mountain bikes, tires and tubes, bicycles, water bottles, and bottle cages. A customer who buys some of these products is very likely to buy the related ones as well. But in what order are these products bought?
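To make the "chicken and egg" distinction concrete, here is a minimal Python sketch. It is not how SSAS implements either algorithm; the orders are made-up data and the function names are hypothetical. It only contrasts counting unordered co-occurrences (the association view) with counting ordered steps (the sequence view).

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each list holds the items of one order, in purchase order.
orders = [
    ["Mountain-200", "Water Bottle", "Mountain Bottle Cage"],
    ["Water Bottle", "Mountain Bottle Cage"],
    ["Mountain Bottle Cage", "Water Bottle"],
]

def unordered_pairs(orders):
    """Association view: count how often two items share an order, ignoring order."""
    counts = Counter()
    for items in orders:
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts

def ordered_transitions(orders):
    """Sequence view: count how often item b is bought immediately after item a."""
    counts = Counter()
    for items in orders:
        for a, b in zip(items, items[1:]):
            counts[(a, b)] += 1
    return counts

pair_counts = unordered_pairs(orders)
step_counts = ordered_transitions(orders)
```

In this toy data the bottle and the cage appear together in all three orders, so the association view sees one strong pair. The sequence view additionally shows that the bottle comes before the cage twice and after it only once.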
Common scenarios for the Microsoft Sequence Clustering algorithm include:
1. Clickstreams generated as users browse a website, used to predict user behavior
2. Event logs leading up to an incident (such as a server outage or a database deadlock), used to predict the next occurrence of the incident
3. Records of the order in which users buy products or add them to the shopping cart, used to recommend the best products by priority
In fact, the algorithm is similar to the clustering algorithm but works at a finer granularity: after clustering, it also mines the order of the cases within each cluster.
Technical preparation
(1) The Microsoft sample data warehouse (AdventureWorksDW2008R2), using the same two tables as the previous article: vAssocSeqLineItems and vAssocSeqOrders. The two tables have a typical one-to-many relationship: vAssocSeqOrders is the order table and vAssocSeqLineItems is the order detail table, and they are linked through OrderNumber. For details, see the previous post: Microsoft Association Rules analysis algorithm
(2) VS2008, SQL Server, Analysis Services
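As a rough illustration of how the one-to-many detail rows become ordered sequences, the sketch below groups rows by OrderNumber and sorts them by LineNumber. The rows are invented, and the third field (called Model here) is an assumption about how the product is named in the detail table.

```python
from collections import defaultdict

# Hypothetical detail rows: (OrderNumber, LineNumber, Model).
# The values are made up; only the shape mirrors the order detail table.
line_items = [
    ("SO1", 2, "Sport-100"),
    ("SO1", 1, "Water Bottle"),
    ("SO2", 1, "Women's Mountain Shorts"),
    ("SO2", 2, "Long-Sleeve Logo Jersey"),
]

def build_sequences(rows):
    """Group rows by order and sort each group by line number."""
    grouped = defaultdict(list)
    for order_number, line_number, model in rows:
        grouped[order_number].append((line_number, model))
    return {
        order: [model for _, model in sorted(items)]
        for order, items in grouped.items()
    }

sequences = build_sequences(line_items)
```

Each order ends up as one list of products in purchase order, which is the shape a sequence-mining step would work on.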
Operation Steps
(1) We continue with the solution from the previous article and reuse its data source view; see the diagram:
In fact, the most important column for this application is LineNumber, which records the order in which the products were purchased and is what the ordering rules are derived from. Let's take a look at the data:
(2) Create a new mining structure
Now let's build the data mining model. Creating one takes only a few simple steps, which are described in my previous posts, so here are just the key ones:
Select the algorithm, then click Next, select the data source view, and select the Case table and nested table:
Click Next to set the input and output columns:
Then we give the model a name:
Next we deploy the mining model and process it. These steps are simple and need no further introduction.
Results analysis
After deployment, we view the results in the Mining Model Viewer. Let's look directly at the diagram:
The panels should look familiar. Readers of my earlier articles will notice that the result panels for this algorithm are the same as those for the Microsoft Clustering algorithm, with one new panel added: State transitions. Below we look briefly at each panel. For anything that is unclear, refer to the previous article; the focus here is on what the new panel does.
Let's pick one product to analyze. We choose the water bottle (Water Bottle), which stood out most in the earlier association rules article:
We rename two groups: the group most likely to buy water bottles and the group least likely to buy them. We focus the cluster analysis on these two groups and look at the second panel:
We can see that the two groups are strongly related to region: the group most likely to buy is in North America, while very few people in the Pacific region buy. If you scroll down, you can see a strong relationship with income as well: few high-income customers buy water bottles (presumably they just buy bottled water), whereas many more lower-income customers do. The data in Microsoft's sample database looks quite realistic.
Next, let's look at the "Classification features" panel:
We select a group to view its details. The rows I have ticked are the focus of this article's algorithm. [Start] -> Women's Mountain Shorts means that the first product a customer in this group is most likely to put in the shopping basket is Women's Mountain Shorts (why women's, I wonder?). [Start] -> Water Bottle has the same meaning: the first item put in the basket is the water bottle. There are also several other high-probability attributes: the customers are all in North America and have moderate incomes.
The product order shown in the figure is the sequence inferred by the Microsoft Sequence Clustering algorithm; that is, purchases in this group are most likely to occur in this order. Take the first one above:
Women's Mountain Shorts, then Long-Sleeve Logo Jersey
Now for the group least likely to buy water bottles: Pacific region, high income, and they go straight for a bike (Mountain-200) or a patch kit.
We can compare the two groups directly in the "category comparison" panel:
I won't walk through this panel here; it has already been covered thoroughly.
Let's take a look at the last panel: State transitions.
This panel shows the state transitions between products. The colors indicate the characteristics of the group, and the links show how likely it is to move from one product to another. You can drag the slider on the left to filter what is shown. For example, Water Bottle is linked first to Sport-100: after buying a water bottle, the customer is most likely to buy Sport-100 next. Likewise, after buying Women's Mountain Shorts, the customer is most likely to buy Long-Sleeve Logo Jersey.
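The idea behind a state transition view can be sketched as a first-order Markov chain estimated from ordered purchases. The snippet below is a simplified illustration under that assumption, not the SSAS implementation; the sequences are invented, and the "[Start]" state is borrowed from the panel's notation.

```python
from collections import Counter, defaultdict

START = "[Start]"  # pseudo-state marking the beginning of every sequence

def transition_probabilities(sequences):
    """Estimate P(next item | current item) by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = START
        for item in seq:
            counts[prev][item] += 1
            prev = item
    probs = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[state] = {item: n / total for item, n in next_counts.items()}
    return probs

# Hypothetical purchase sequences for one cluster of customers.
sequences = [
    ["Water Bottle", "Sport-100"],
    ["Water Bottle", "Sport-100"],
    ["Water Bottle", "Mountain Bottle Cage"],
    ["Women's Mountain Shorts", "Long-Sleeve Logo Jersey"],
]
probs = transition_probabilities(sequences)
```

With this toy data, three of the four sequences start with Water Bottle, and two of those three continue to Sport-100, which is the kind of link strength the slider in the panel filters on.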
It is interesting to analyze the characteristics of other groups and the order of purchase.
Exporting prediction results
At this step we use the model to make predictions directly. Go to the Mining Model Prediction panel:
Set up the prediction function as follows. Source: Prediction Function; Field: PredictSequence; Criteria/Argument: drag v Assoc Seq Line Items in directly. Click Run to view the results:
Let's look at the results:
In the results, the first product put into the shopping basket is the Mountain-200 mountain bike. By default, with no criteria added, the output contains only this one item, the prediction for the first step. Real requirements are rarely this simple. For example, we sometimes need to distinguish purchase sequences by region, because the "classification features" panel showed that the purchase order differs greatly between regions.
We select "Single Query Input", click "Region", select the Europe area:
We can see that the most likely first purchase for this group has changed to Touring-1000. Of course, we can also write the DMX statement directly:
This approach is more flexible, and we can also return the products together with their probabilities:
In fact, SSAS has its own dedicated language, DMX, for flexibly querying and manipulating the result sets it produces. We won't explain it here; I will cover it separately when there is time.
Of course, we can also mine an existing customer list to predict which product each customer is most likely to choose next. We select an existing case table and nested table, and then design the statement:
Let's look at the results:
As you can see, the Microsoft Sequence Clustering algorithm in this article has predicted the products that different customers are likely to buy, in order. The results are strictly ordered: for example, customer number 18239 is most likely to buy Water Bottle first and then Sport-100.
Let's look at the purchase behavior that the customer has already taken:
Look at order SO51176: the customer has already bought two products. What will he buy next? As predicted above: Water Bottle, and then Sport-100.
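The "what comes next" question can be sketched as a greedy walk over a table of transition probabilities: start from the last product already bought and repeatedly follow the most likely next step. This is only an illustration of the idea, not what PredictSequence does internally; the table and its numbers are hypothetical.

```python
def predict_next(transitions, last_item, steps=2):
    """Follow the most probable transition from last_item, up to `steps` times."""
    predicted = []
    current = last_item
    for _ in range(steps):
        options = transitions.get(current)
        if not options:
            break  # no known continuation from this product
        current = max(options, key=options.get)
        predicted.append(current)
    return predicted

# Hypothetical transition table; the probabilities are invented for illustration.
transitions = {
    "Mountain-200": {"Water Bottle": 0.6, "Patch Kit": 0.4},
    "Water Bottle": {"Sport-100": 0.7, "Mountain Bottle Cage": 0.3},
}
prediction = predict_next(transitions, "Mountain-200")
```

Here the walk from the hypothetical last purchase goes to Water Bottle and then Sport-100, and it stops early whenever a product has no recorded continuation.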
The next step is to save the results to the database. With a little simple code, the purchase intentions of this group of customers can be extracted, and then you can take the results to your boss.
Original address: (original) Big Data era: a summary of knowledge points based on Microsoft Case Database Data Mining (Microsoft Sequential analysis and Clustering algorithm)