Http://www.cnblogs.com/captain_ccc/articles/4093652.html
This article continues the series summarizing the Microsoft data mining algorithms. The previous articles mainly covered inference and prediction based on discrete or continuous state values, using three algorithms: Microsoft Decision Trees, Microsoft Clustering, and Microsoft Naive Bayes. A follow-up article on result prediction was added later, and the application scenarios involved were introduced in those articles as well; interested readers can go back to them. This article summarizes the Microsoft Time Series algorithm. It is one of the more important data mining algorithms, because every forecast about the future runs along a timeline, and that timeline is exactly what the time series algorithm focuses on.
Introduction to Application Scenarios
With the previous articles we can already predict which factors influence a behavior, and use those factors to identify our best customer group (those who will buy a bicycle). That is what the algorithms described earlier are good at. But doesn't it feel like too little information to get out of all that data? Many questions cannot be answered by those algorithms alone, and they happen to be exactly the ones senior management cares about, such as:
1. As a data analyst, can you predict next year's sales from past sales? How would you solve this? Someone will say: easy, I'll just take the average of last year's sales. But what if there is less than a year of data? What if the forecast is needed for next January? ...
2. Can you forecast the peak sales season from past sales? In real estate, for example, there is the saying "golden September, silver October". Rules like this are the distilled experience of veteran salespeople, but can you guarantee that your company has such people? Even if it does, can you be sure they are right? Even if they are right, does the rule hold for other products? For other regions? ... These are the things the data will tell us later.
3. Do different regions follow the same sales pattern? In other words, should they share the same sales strategy? Which marketing strategy suits which type of product? Do the sales of different products influence each other, is there any joint selling, and would bundled sales make sense for us?
These are the problems the Microsoft Time Series algorithm can solve, and they are its application scenarios. Enough small talk; let's get to the topic of this article.
Technical Preparation
(1) We again use the sample data warehouse provided by Microsoft (AdventureWorksDW2008R2). We only need one object from it, which is actually a view: vTimeSeries. It records the monthly sales summaries of the previous years; we analyze this data in detail below.
(2) VS2008, SQL Server, Analysis Services. There is nothing much to introduce; just select them when installing the database. Some time ago someone asked me why his VS had no template for creating a data mining project, so I'll mention it here: VS is Microsoft's flagship development tool and is updated far more often than the database. To develop a data mining solution, open the VS shortcut found under the SQL Server folder in the Start menu.
Operation Steps
(1) Create a new solution, then the data source, then the data source view. These are simple steps; if anything is unclear, see the previous articles. Let's look directly at the picture:
We give the solution a name, then pick from the data source the table we want to mine and give it a friendly name: Salesbyarea. As you can see, this table records the monthly sales quantity and sales amount of previous years. Below is a rough analysis of its data.
(2) Preview the data and analyze the structure and content of the source data
Here we examine the data we are about to analyze: first see what it contains and whether it meets the data requirements of the time series algorithm. Again we right-click and choose "Browse Data", select random sampling, and set the sample size to 5000 rows. The exact steps are covered in the previous articles and are not repeated here; let's look directly at the figure:
There are only a few columns and the content is quite simple: bike model and region, time index, sales quantity, sales amount, year, month, and reporting date. Judging by the reporting date, a report is produced on the 25th of each month, once per month. The Microsoft Time Series algorithm has the following requirements for its data:
1. The data to be analyzed must contain a time column, and its values must be continuous. This is easy to understand: without continuous values nothing can be inferred, because the data itself follows no pattern.
2. The data to be analyzed must contain a uniquely identifying value, that is, a primary key in the traditional sense. Every algorithm needs this.
From the data above, the reporting date and the first column, bike model and region (ModelRegion), can together form a composite primary key that satisfies the second requirement, because one model in one region can only produce one sales value at a given time.
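The composite-key requirement can be checked before building the model. The following is a minimal sketch in Python, assuming the rows of the view have been exported as tuples that start with the ModelRegion value and the reporting date; the sample rows and the function name find_duplicate_keys are made up for illustration.

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return the (model_region, reporting_date) pairs that occur more than once."""
    counts = Counter((row[0], row[1]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical sample rows: (ModelRegion, reporting date, quantity)
rows = [
    ("M200 Europe", "2008-05-25", 40),
    ("M200 Europe", "2008-06-25", 45),
    ("M200 North America", "2008-06-25", 60),
]

duplicates = find_duplicate_keys(rows)
print(duplicates or "composite key is unique")
```

An empty result means each model-and-region value appears at most once per reporting date, which is what the second requirement asks for.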
Now let's examine the time column to see whether the first requirement is met. We choose the pivot table view, which works the same way as a PivotTable in Excel, so there is not much to it. We drag the detail data into the middle area, use the reporting date for the columns, and the bike model and region (ModelRegion) for the rows. Let's look at the data:
As we can see, the history contains sales records from 2005 to 2008. 2006 and 2007 have a record for every month, while 2005 and 2008 each have only six months of data. Six months for 2008 is normal, because the Microsoft sample database AdventureWorksDW2008R2 was produced up to that date, which means our forecast starts after these records. The six months in 2005 show that the data begins there, which is no problem. We continue to scroll down:
Whoa... Several of the products further down have no sales records in 2005 and 2006. There are two possibilities. The first is that these products were only introduced later, so having no earlier data is normal. The other, more extreme, is that their sales in those two years were 0. A case like this has to be confirmed with the business side. For us analysts, the point is that the sales records must not contain null values: no sales should be shown as 0, not as empty.
We click on a year to drill down to the months and take a detailed look at the values.
The data really does start in July 2005 and end in June 2008, and the monthly data is continuous: every month from start to finish has a value. We scroll down:
Indeed, the products further down have been sold from July 2007 onwards, and their end date is also the end of June 2008.
From the analysis above, the data in this table meets the requirements of the Microsoft Time Series algorithm. It has a continuous time dimension, although several products do not start at the common start date. The time series algorithm allows this, as long as every series ends on the same date and its time intervals are continuous.
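The two conditions just stated (no missing months inside a series, and a common end date across series) can also be verified with a few lines of code. This is a minimal sketch in Python under the assumption that each row is reduced to (series name, year, month); the helper names and the sample series are hypothetical.

```python
from collections import defaultdict

def month_index(year, month):
    """Map a calendar month to a running integer so gaps are easy to spot."""
    return year * 12 + (month - 1)

def monthly(name, start, end):
    """Rows (name, year, month) from start to end inclusive; start/end are (year, month)."""
    first, final = month_index(*start), month_index(*end)
    return [(name, i // 12, i % 12 + 1) for i in range(first, final + 1)]

def check_series(rows):
    """Return (series_with_gaps, series_with_different_end) for rows of (name, year, month)."""
    months = defaultdict(set)
    for name, year, month in rows:
        months[name].add(month_index(year, month))
    last = max(max(m) for m in months.values())
    with_gaps = sorted(n for n, m in months.items() if max(m) - min(m) + 1 != len(m))
    wrong_end = sorted(n for n, m in months.items() if max(m) != last)
    return with_gaps, wrong_end

# Made-up sample: one series from July 2005, one introduced in July 2007,
# both ending in June 2008.
rows = monthly("M200 Europe", (2005, 7), (2008, 6)) + monthly("T1000 Europe", (2007, 7), (2008, 6))
print(check_series(rows))
```

A later start date does not trip either check, which matches the point above: series may begin at different times as long as they are gap-free and end together.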
Of course, the source data could also be analyzed in other ways; we won't go into them here.
(3) Create a new mining structure
Right-click Mining Structures, create a new mining structure, then click Next, and Next again. The steps are not repeated here; see the previous articles if anything is unclear. We choose the Microsoft Time Series algorithm; see the picture:
Click Next. There are a few key settings here; let's look at the figure:
Here we set the model-and-region column and the reporting date as key columns, and the sales quantity and sales amount columns as both input and predictable. These two columns are the input values for our historical analysis, and they are also the output columns we want to predict. You can also let the wizard suggest columns, but here we already know what we want. Click Next.
We set aside 30% of the data to test accuracy later, give the structure a name, forecasting, then click Next.
(4) Parameter configuration
Several parameters of the Microsoft Time Series algorithm are important and need to be configured individually. We introduce one of them here:
PERIODICITY_HINT: This parameter gives the algorithm a hint about how often the data pattern repeats. Put simply, it is the repetition interval of the time series. For example, the timeline in this article advances month by month and the cycle is one year, so we set this parameter to 12 (written with braces, {12}, in the parameter dialog), meaning the pattern repeats every 12 months.
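When the cycle length is not known from business knowledge, a rough estimate can be read off the autocorrelation of the series. The sketch below is only an illustration in Python, not the way the Microsoft algorithm detects periodicity; the synthetic series and the function names are assumptions made for the example.

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return num / denom

def dominant_period(values, max_lag):
    """Lag >= 2 whose autocorrelation is a local peak, choosing the highest such peak."""
    r = [autocorrelation(values, k) for k in range(max_lag + 1)]
    peaks = [k for k in range(2, max_lag) if r[k] > r[k - 1] and r[k] > r[k + 1]]
    return max(peaks, key=lambda k: r[k]) if peaks else None

# Synthetic monthly series: 36 months with a yearly cycle (made-up data)
series = [100 + 20 * math.sin(2 * math.pi * t / 12) for t in range(36)]
print(dominant_period(series, max_lag=24))
```

For this synthetic series the peak sits at lag 12, which is the value one would then pass as the periodicity hint. Real sales data with a trend would usually need detrending first.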
Then we deploy and process the mining model. Next comes the result analysis.
Result Analysis
After deployment, we view the results in the Mining Model Viewer. Let's look at the chart directly:
The figure above shows the result of the Microsoft Time Series algorithm. For this algorithm the Mining Model Viewer offers two panels: Charts and Model. We go through both below; the chart viewer is the one used most often. The chart area is divided in two: the first half shows the historical data, and the shaded area behind it is the prediction region. On the right is a drop-down box for filtering the series. On the horizontal axis, the historical range is labeled from July 25, 2005 to November 25, 2007 and is drawn with solid lines. The area after it is the forecast, which runs from July 25, 2008 to November 25, 2008 and is drawn with dashed lines.
Hmm... with everything shown at once it is not very easy on the eyes.
Let's pick one product. We select the sales of M200 Europe and M200 North America; see the picture below:
Clicking on the line in the chart, we can see that within this two-year period the sales peaks of this bike fall in May and December, the so-called peak season. Nothing special about that: May is spring, and in America May should be spring too. Spring suits the outdoors, so it is a good time to buy a bike. What we really care about is when next year's peak season and off-season will be, because we can act on that, for example by increasing inventory for the peak season and reducing it for the off-season. Let's see which month is the 2008 peak season for the M200:
See, July 2008 will be the peak season for this product, and September the off-season.
That is the picture for Europe, but North America is different: there the peak falls in September 2008. So the two regions differ considerably, which is something experience alone would not reveal. The North American off-season also arrives earlier; see the following figure:
The charts for these two series also show that sales keep rising. These are so-called sunrise products that sell well, and profits should keep improving. We can click to see their predicted sales amount in 2008. Let's look at the picture:
The line chart shows that sales quantity and sales amount are correlated. Well, obviously: as the quantity sold increases, the amount increases. But there is an interesting detail in the figure. Before June 25, 2006 the quantity and amount lines coincide, and after that they separate. That means something. From then on the quantity grows faster than the amount. What does that mean? More units are sold, yet the amount grows less. Why? The reason is simple: the price of the product dropped. So the quantity goes up while the amount, in relative terms, falls behind.
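The reasoning above rests on the identity amount = quantity × average unit price: if quantity grows faster than amount, the average price must have fallen. A small worked example in Python, with numbers that are entirely made up:

```python
def average_unit_price(amount, quantity):
    """Average price per unit implied by a sales amount and a quantity."""
    return amount / quantity

# Made-up monthly figures: (quantity sold, sales amount)
before = (100, 200_000.0)
after = (150, 240_000.0)

quantity_growth = after[0] / before[0] - 1
amount_growth = after[1] / before[1] - 1
price_change = (
    average_unit_price(after[1], after[0]) / average_unit_price(before[1], before[0]) - 1
)
print(f"quantity {quantity_growth:+.0%}, amount {amount_growth:+.0%}, unit price {price_change:+.0%}")
```

Here quantity rises by half while the amount rises by a fifth, so the implied average unit price is a fifth lower than before.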
In any case, this product sells better and better over time, and the sales amount keeps growing too, especially with the big jump at the end of 2007; I guess some effective measure was taken then. The figure also shows that August 2008 will bring a large sales amount: the predicted value is 2,267%. Hand this forecast to the boss and he will laugh himself awake in his sleep...
Do all products sell like this? Let's look at some of the others:
Hmm... we found a product that is not doing so well: the T1000. As the picture shows, it was only launched in August 2007. Sales rose right after launch, but then slowly started to shrink. Ouch... The forecast for 2008 is that sales of this product stay flat, with a tendency to plunge. If you were the manager looking at these numbers, what would you think? Maybe just beat a retreat.
If the curve is not intuitive enough, we can change the number of prediction steps, which changes how the line is drawn and gives a clearer view of the future trend. Adjusting this setting also changes the forecast range.
Right... this T1000 looks set to be dead and buried by 2011. It even looks as if the value could go negative, that is, selling at a loss just to make some noise. Of course, the farther out the forecast, the less accurate the algorithm becomes. After all, nobody can predict very far into the future, because too many factors keep changing.
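Why accuracy drops with distance can be illustrated with a deliberately simplified model. The sketch below assumes a plain AR(1) process, which is not the model the Microsoft algorithm actually fits, and uses made-up parameter values; it only shows how the standard deviation of the forecast error grows with the number of steps ahead.

```python
import math

def forecast_std(phi, sigma, horizon):
    """Std. dev. of the h-step-ahead forecast error of an AR(1) process
    x_t = phi * x_(t-1) + e_t, where e_t has standard deviation sigma."""
    variance = sigma ** 2 * sum(phi ** (2 * i) for i in range(horizon))
    return math.sqrt(variance)

# Hypothetical parameters: phi = 0.9, one-step error std. dev. of 10 units
for h in (1, 3, 6, 12, 36):
    print(h, round(forecast_std(phi=0.9, sigma=10.0, horizon=h), 1))
```

Each additional step adds more accumulated error, so a value predicted three years out, like the 2011 figure above, deserves far less trust than next month's.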
Now let's look at the other panel VS provides, "Model". For each series it builds a decision tree from the data, showing how the values of each series change as the timeline progresses. For details, see my earlier article on the Microsoft Decision Trees algorithm.
The picture above shows that for the R250, August 22, 2007 acts as a dividing line: the sales values before it are far greater than those after it. What is the reason? What happened? That has to be clarified with the business side. All we can see from the data is that it happened, and breaks like this usually have a major cause. For example: on September 30 this year, China released a new housing policy. If the curve were a house price forecast, this factor would show up on that day. Or: last week Beijing had persistent smog. If the curve were a sales forecast for face masks, that would be the cause of such a node.
We won't analyze this panel in detail. It presents its results in the same way as the decision tree analysis; interested readers can refer to my previous article.
So far we have only walked through the mining process and used the line charts to analyze trends and sales patterns for a few products. The most important step is still missing: tell me the actual sales amount and sales quantity for each month of next year. In an era where data does the talking, a trend chart alone is of limited use. After all, I can produce one with any charting software on the market, probably a prettier one than this.
The next article solves this problem through prediction, giving concrete figures for the sales amount and sales quantity of each month next year and beyond.