Big Data Era: Data Mining with the Microsoft Sample Database, a Summary of Key Points (Microsoft Time Series Algorithm)

Last Update:2014-11-02 Source: Internet

Author: User


Objective

This article continues the series summarizing the Microsoft data mining algorithms. The earlier articles focused on predicting discrete or continuous state values and covered three algorithms: Microsoft Decision Trees, Microsoft Clustering, and Microsoft Naive Bayes, followed by a summary of how to work with prediction results. Their application scenarios were introduced in those articles, which interested readers can look up. This article summarizes the Microsoft Time Series algorithm. It is one of the more important data mining algorithms, because every forecast is about the future, and every forecast therefore runs along a timeline. That timeline is what the time series algorithm focuses on.

Application Scenario Introduction

With the algorithms from the earlier articles we can already work out which factors influence a given behavior and use those factors to pick out our best customers (those who will buy a bicycle). Still, that can feel like too little information to get out of so much data. Many questions cannot be answered with those algorithms alone, and they happen to be exactly the questions senior management cares about. For example:

1. As a data analyst, can you forecast next year's sales from past sales? How would you go about it? Someone will say: easy, I'll just take the average of last year's sales. But what if there is less than a year of data? What if the forecast is only for next January?

2. Can you predict the peak sales season from past sales? The real estate industry has its saying "golden September, silver October", which sums up the experience of senior sales staff. But can you be sure your company has such people? Even if it does, are they right? Even if they are right, does the rule hold for a different product? For a different region? Later on we will let the data answer these questions.

3. Do sales follow the same pattern in different regions? In other words, should every region use the same sales strategy? Which strategy suits which type of product? Do the sales of different products affect one another? Are any of them suitable for bundled sales?

The Microsoft Time Series algorithm can answer these questions, and that is its application scenario. Enough preamble; let's get to the topic of this article.

Technical preparation

(1) As before, we use the sample data warehouse provided by Microsoft (AdventureWorksDW2008R2). This time we need only one table, or more precisely one view, vTimeSeries. It records monthly sales totals for previous years. We will analyze this data in more detail below.

(2) VS2008, SQL Server, and Analysis Services need no introduction; just select all components when installing the database. A while ago someone asked me why his Visual Studio had no template for creating a new data mining project, so it is worth mentioning here. Visual Studio, as Microsoft's flagship development tool, is released far more often than the database. To develop a data mining solution, you need to start Visual Studio from the shortcut in the SQL Server folder of the Start menu.

Operation Steps

(1) Create a new solution, then a data source, then a data source view. These steps are simple; if anything is unclear, see the earlier articles.

We name the solution, pick the table we want to mine from the data source, add it to the data source view, and give it a name: SalesByArea. This table holds the sales quantities and sales amounts of previous years. Below we take a rough look at its data.

(2) Preview the data and analyze the structure of the source data

Before mining we need to look at the data itself and check whether it meets the data requirements of the time series algorithm. As before, we right-click the table, choose "Explore Data", and select random sampling with a sample size of 5000 rows. The details are covered in the earlier articles and are not repeated here.

The table has only a few columns and its content is simple: bicycle model and region, time index, sales quantity, sales amount, year, month, and reporting date. Judging by the reporting date, one record is generated per month, on the 25th. The Microsoft Time Series algorithm places two requirements on its data:

1. The data must contain a time column, and its values must be continuous. This is easy to understand: without a continuous series there is no pattern in the data to extrapolate from.

2. Each row of the data must be uniquely identifiable, that is, it needs a primary key in the traditional sense. Every algorithm requires this.

In our data, the reporting date combined with the first column, bicycle model and region (ModelRegion), forms a composite primary key that meets the second requirement, because one model in one region can have only one sales value at any given time.
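The uniqueness requirement can be checked before the data ever reaches the mining wizard. The following is a minimal Python sketch, not part of the original walkthrough: the column names `ModelRegion`, `ReportingDate`, and `Quantity` are assumed spellings of the columns described above, and the rows are invented.

```python
from collections import Counter

def duplicate_keys(rows, key_fields=("ModelRegion", "ReportingDate")):
    """Return the composite keys that occur more than once in rows."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Invented sample rows; the column names are assumptions, not the view's schema.
rows = [
    {"ModelRegion": "M200 Europe", "ReportingDate": "2005-07-25", "Quantity": 10},
    {"ModelRegion": "M200 Europe", "ReportingDate": "2005-08-25", "Quantity": 12},
    {"ModelRegion": "M200 North America", "ReportingDate": "2005-07-25", "Quantity": 7},
]

# An empty list means (ModelRegion, ReportingDate) identifies every row.
print(duplicate_keys(rows))
```

If the list is not empty, the rows it names have to be merged or corrected before the two columns can serve as the key.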

Now let's look at the time column in detail to see whether the first requirement is met. We choose the pivot table tab, which works just like a pivot table in Excel. We drag the detail data into the middle area, put the reporting date on the columns and the bicycle model and region (ModelRegion) on the rows, and look at the data.

The table holds sales records from 2005 to 2008. For 2006 and 2007 there is a record for every month of the year, while 2005 and 2008 each have only half a year of data. Half a year for 2008 is expected: that is when the Microsoft sample database AdventureWorksDW2008R2 was generated, and it is the sales after this point that we will predict. Half a year for 2005 simply means the data starts there, which is no problem. We continue scrolling down.

Some of the products further down have no sales records at all in 2005 and 2006. There are two possible explanations. The first is that these products only went on sale later, in which case the missing data is normal. The other, more extreme one is that the products sold nothing in those two years. In a case like this we have to confirm with the business side before processing the data. For us as analysts the rule is that a sales record must not contain null values: where a product was on sale but sold nothing, the value should be 0, not empty.

We click on a year to expand it into months and look at the values in detail.

The data does start in July 2005 and end in June 2008, and within that range there is a value for every month with no gaps. We scroll down again.

Indeed, the products further down are sold from July 2007 onward, and their data also ends in June 2008.

The analysis above shows that the table meets the data requirements of the Microsoft Time Series algorithm. It has a continuous time dimension. A few products start later than the others, but the algorithm allows this, as long as every series ends on the same date on the timeline and has no gaps in between.
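The two conditions just stated, no gaps inside a series and a common end date across series, are easy to verify mechanically. Here is a small Python sketch under stated assumptions: the helper names are made up for illustration, "Product B" is a hypothetical late-starting series, each series is assumed to be non-empty, and the months are generated rather than read from the view.

```python
def month_index(year, month):
    """Map a (year, month) pair to a single increasing integer."""
    return year * 12 + (month - 1)

def monthly_range(start, end):
    """All (year, month) pairs from start to end, inclusive."""
    first, last = month_index(*start), month_index(*end)
    return [(i // 12, i % 12 + 1) for i in range(first, last + 1)]

def check_series(series):
    """series maps a name to a non-empty list of (year, month) pairs.

    Returns (gaps, end_months): the names of series with missing or repeated
    months, and the set of distinct final months across all series.
    """
    gaps, end_months = [], set()
    for name, months in series.items():
        idx = sorted(month_index(y, m) for y, m in months)
        # A gap-free monthly series advances by exactly one step each time.
        if any(b - a != 1 for a, b in zip(idx, idx[1:])):
            gaps.append(name)
        end_months.add(idx[-1])
    return sorted(gaps), end_months

series = {
    "M200 Europe": monthly_range((2005, 7), (2008, 6)),
    "Product B": monthly_range((2007, 7), (2008, 6)),  # hypothetical late starter
}
gaps, end_months = check_series(series)
print(gaps, len(end_months) == 1)
```

A series that starts late passes the check; a series with a missing month, or one that stops early, does not.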

The source data can of course be analyzed in other ways, which we will not go into here.

(3) Create a new mining structure

Right-click Mining Structures and create a new mining structure, then click Next, and Next again. These steps are covered in the earlier articles and are not repeated here. For the algorithm we choose Microsoft Time Series.

Click Next. There are a few key settings on the page that follows.

We use the model and region column (ModelRegion) together with the reporting date as the key columns. The sales quantity and sales amount columns are marked as both input and output: they are the input values for analyzing history, and they are also the columns we want to predict. The wizard can also suggest columns, but here we already know what we want. Click Next.

We hold back 30% of the data to test accuracy later, give the structure a name, forecasting, and click Next.

(4) Parameter configuration

The Microsoft Time Series algorithm has several important parameters that need to be configured explicitly. We introduce the key one here.

PERIODICITY_HINT: this parameter gives the algorithm a hint about how often patterns in the data repeat. Put simply, it is the number of time steps after which the series repeats itself. The timeline in this article advances one month at a time and the cycle is one year, so we set the parameter to 12, meaning the pattern repeats every 12 months. In the parameter dialog the value is written in braces, as {12}.
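To make the idea of a repetition interval concrete, the sketch below estimates the period of a series from its autocorrelation. This is only an illustration in Python on a synthetic series; it is not how the Microsoft algorithm determines periodicity, and the function name is made up.

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation of values at the given lag."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# Synthetic monthly series: four years with a yearly cycle. Not real sales data.
values = [100 + 30 * math.sin(2 * math.pi * t / 12) for t in range(48)]

# Lag 1 is skipped: neighbouring months of a smooth series are always
# strongly correlated, which says nothing about seasonality.
best_lag = max(range(2, 25), key=lambda lag: autocorrelation(values, lag))
print(best_lag)
```

For monthly data with a yearly pattern the strongest lag is 12, which is the value passed as the hint.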

Next we deploy and process the mining model, and then analyze the results.

Results analysis

Once the project is deployed, we look at the results in the Mining Model Viewer.

The Mining Model Viewer for the Microsoft Time Series algorithm offers two tabs, Charts and Model, and we go through both below. The chart tab is the one used most often. Its chart area is divided into two parts: the first part shows the historical data the model analyzed, and the shaded part after it is the forecast. On the right is a drop-down list for filtering series. On the horizontal axis the historical range runs from 25 July 2005 to 25 November 2007 and is drawn with solid lines; the forecast range runs from 25 July 2008 to 25 November 2008 and is drawn with dashed lines.

With every series drawn at once, the chart is not very readable.

Let's pick one product and look at the sales of M200 Europe and M200 North America.

Clicking on the line in the middle of the chart shows that in both regions this bike's sales peak every year in May and December, the so-called peak season. Nothing surprising there: May is spring in Europe and in North America alike, spring suits outdoor activity, and it is only natural that bikes sell well. What we care about more is when next year's peak and off seasons will fall, because we can act on that, for example by stocking more for the peak season and less for the off season. So which month is the 2008 peak for the M200?

July 2008 will be the peak season for this product, and September the off season.
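Reading the peak and off-season months off a forecast amounts to taking the maximum and minimum of the predicted values. A short Python sketch with invented numbers (chosen only so that the shape matches the reading above, not taken from the model):

```python
def peak_and_trough(monthly):
    """monthly is a list of (label, value); return the labels of max and min."""
    peak = max(monthly, key=lambda item: item[1])[0]
    trough = min(monthly, key=lambda item: item[1])[0]
    return peak, trough

# Invented forecast values for one series; only the shape matters here.
forecast = [("2008-07", 120), ("2008-08", 95), ("2008-09", 60),
            ("2008-10", 80), ("2008-11", 90)]

print(peak_and_trough(forecast))  # ('2008-07', '2008-09')
```

Running the same function per region makes it easy to compare regions side by side instead of reading each chart by eye.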

That is the picture for Europe. North America is different: there the 2008 peak falls in September. Sales in the two regions differ quite a lot, so relying on experience alone will not give the right answer. The off season in North America also arrives earlier.

The charts for both series also show that this product's sales are growing strongly. It is a so-called sunrise product, a good seller whose profits should keep improving. We can click on the chart to read off its predicted sales amount for 2008.

The line chart shows that sales quantity and sales amount are related. That much is obvious: as the quantity goes up, the amount goes up. But there is an interesting detail. Up to 25 June 2006 the two lines coincide, and after that they separate, with the quantity line gradually rising above the amount line. What does that mean? The product sells more units, yet relative to the quantity it brings in less money. The reason is simple: the price was reduced. The price cut pushed the quantity up, while the amount fell behind in relative terms.
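The price-cut reading can be checked directly by dividing amount by quantity for each period. A minimal Python sketch with invented figures, assuming both columns are available as plain lists:

```python
def average_unit_price(quantity, amount):
    """Average price per unit for each period; None where nothing was sold."""
    return [a / q if q else None for q, a in zip(quantity, amount)]

# Invented monthly figures: quantity keeps rising, amount rises more slowly.
quantity = [10, 12, 15, 20]
amount = [1000.0, 1200.0, 1350.0, 1600.0]

prices = average_unit_price(quantity, amount)
print(prices)  # [100.0, 100.0, 90.0, 80.0]
```

A falling unit price while the quantity rises is exactly the pattern that makes the two lines drift apart.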

Either way, the product's sales have grown over time, slowly at first. The sales amount keeps increasing as well, with a particularly large jump at the end of 2007, which suggests that some effective measure was taken then. The forecast also shows strong sales in August 2008, where the predicted sales amount reads 2,267%. Take this forecast to the boss, and the boss will wake up laughing in his sleep.

Not every product sells this well. Let's expand a few others.

Here is a product that is not doing well: the T1000. It was launched in August 2007. Its sales rose right after launch but then gradually shrank. The forecast for 2008 shows sales flat at best, with a downward trend. What would you do as a manager seeing this? Look for a way out, or simply retreat?

If the curve is not intuitive enough, we can change the number of prediction steps. This changes how smooth the line looks and gives a more intuitive view of the forecast. Adjusting this setting also changes the forecast range.

The T1000 looks set to be dead and buried by 2011. The forecast may even turn negative, which would mean selling at a loss. Of course, the further ahead we predict, the less accurate the algorithm becomes. Nobody can predict far into the future, because too many factors keep changing.
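Why a long-range forecast can drift below zero is easiest to see with a plain trend line. The Python sketch below fits a least-squares line to invented declining sales and extends it; this is a deliberately simple stand-in, not the method the Microsoft Time Series algorithm uses, and the function names are made up.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...: (slope, intercept)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate(values, steps):
    """Extend the fitted line by the given number of future steps."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [slope * (n + k) + intercept for k in range(steps)]

history = [50, 45, 40, 35, 30]      # invented declining sales
near = extrapolate(history, 3)      # three steps ahead
far = extrapolate(history, 12)      # twelve steps ahead

# First future step (counted from 1) at which the prediction is below zero.
first_negative = next(k for k, v in enumerate(far, start=1) if v < 0)
print(near, first_negative)
```

Nothing in the fitted trend knows that sales cannot be negative, so values that far out are better read as "close to zero" than taken literally.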

Now let's look at the other tab provided by Visual Studio, Model. For each series it shows a decision tree built from the data, which describes how the values of that series depend on other factors as the timeline advances. For details, see my earlier article on the Microsoft Decision Trees algorithm.

For the R250, the tree uses 22 August 2007 as a dividing line: sales before that date are far higher than sales after it. Why? What happened? That is a question for the business side; the data can only show us that it happened. A break like this is usually driven by some major factor. For example, on 30 September of this year a new mortgage policy was released in China; if the curve were a house price forecast, that factor would show up on that day. Or take last week's persistent smog in Beijing; if the curve were a sales forecast for face masks, the smog would be the cause of such a node.
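A dividing line like this is what a tree finds when it looks for the point in time that best separates high values from low ones. The sketch below shows the simplest version of that idea in Python: a single split chosen to minimise squared error. The sales figures are invented and the function is illustrative, not the viewer's actual computation.

```python
def best_split(values):
    """Index i that minimises the total squared error when values[:i] and
    values[i:] are each replaced by their own mean."""
    def sse(part):
        mean = sum(part) / len(part)
        return sum((v - mean) ** 2 for v in part)
    return min(range(1, len(values)),
               key=lambda i: sse(values[:i]) + sse(values[i:]))

# Invented series with a clear break after the fourth point.
sales = [90, 95, 88, 92, 30, 28, 35, 31]

split = best_split(sales)
print(split)  # 4: the first point of the low-sales regime
```

The split only says when the level changed. Why it changed still has to come from the business side.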

We will not analyze this tab in detail. It presents its results the way the decision tree viewer does; interested readers can refer to my earlier article.

So far we have walked through the whole mining process and used the line charts to analyze trends and sales problems for some products. The most important step is still missing: stating what the sales amount and sales quantity will actually be in a given month of a given year. In an age where data does the talking, a trend chart alone is of limited use. After all, any charting software on the market can draw one, and probably a prettier one.

The next article solves this problem: it predicts the actual sales quantity and sales amount for the coming year, down to individual months. With that report in hand you can go to the boss with confidence; what happens next is up to him.

Conclusion

What to write in a conclusion? Let's sum up what data mining means. The whole process uses data and mathematics to infer and predict the unknown. The mathematics we have today is sufficient to produce predictions, and the data accumulated over nearly a decade of rapid growth in IT and the Internet is enough to meet the data requirements. As data storage gets cheaper and converting between structured and unstructured data gets cheaper, we find ourselves in an ocean of data. What most urgently needs to change is us: our way of thinking has to advance. That is the meaning of the big data era.

Finally, here are the earlier articles in this series:

Microsoft Decision Tree Analysis Algorithm summary

Summary of Microsoft Clustering algorithms

Microsoft Naive Bayes Analysis algorithm

Microsoft Algorithm Results Prediction Chapter

If you found this post useful, please don't hesitate to recommend it.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
