Machine Learning System Design Study Notes (2)

Last Update:2014-09-18 Source: Internet

Author: User

A real example: predict the future traffic based on the past traffic of a company's servers.

Procedure:

1. Read data:

The server traffic is recorded in a tab-separated file, web_traffic.tsv, in the following format:

1 2272
2 nan
3 1386
4 1365
5 1488
6 1337
7 1883
8 2283
9 1335
10 1025
11 1139
12 1477

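In the data, the first column is the hour index and the second column is the number of hits in that hour. We use SciPy to read the file into an array for scientific computing. The array functions used here are NumPy functions that older SciPy releases re-export under the scipy namespace; with a newer SciPy, import numpy and call them there instead. The code is as follows: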

import scipy as sp
data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")
print(data[:10])

The last line of the code prints the first 10 rows of data. The result is as follows:

[[ 1.00000000e+00  2.27200000e+03]
 [ 2.00000000e+00             nan]
 [ 3.00000000e+00  1.38600000e+03]
 [ 4.00000000e+00  1.36500000e+03]
 [ 5.00000000e+00  1.48800000e+03]
 [ 6.00000000e+00  1.33700000e+03]
 [ 7.00000000e+00  1.88300000e+03]
 [ 8.00000000e+00  2.28300000e+03]
 [ 9.00000000e+00  1.33500000e+03]
 [ 1.00000000e+01  1.02500000e+03]]
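
As a minimal sketch of the loading step, the snippet below parses the twelve sample rows shown earlier from an in-memory string instead of from the data file, so it runs without web_traffic.tsv. It calls numpy.genfromtxt directly, on the assumption that the reader's SciPy no longer offers the sp.genfromtxt alias. The name SAMPLE is made up for this illustration.

```python
import io

import numpy as np

# Hypothetical stand-in for the first rows of web_traffic.tsv (tab-separated).
SAMPLE = (
    "1\t2272\n2\tnan\n3\t1386\n4\t1365\n5\t1488\n6\t1337\n"
    "7\t1883\n8\t2283\n9\t1335\n10\t1025\n11\t1139\n12\t1477\n"
)

# genfromtxt accepts a file-like object as well as a file name.
data = np.genfromtxt(io.StringIO(SAMPLE), delimiter="\t")

print(data.shape)  # (rows, columns)
print(data[:3])    # the missing value in row 2 is parsed as nan
```

Each row becomes one row of a two-dimensional float array, which is why the missing entry shows up as nan rather than as an empty cell.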

For more on array operations, see the following tutorial: http://wiki.scipy.org/Tentative_NumPy_Tutorial

2. Data cleansing and preprocessing

The data contains some invalid values. Notice the nan in the second row of the output? It marks a missing measurement. Let's split the data into its two columns and count the invalid entries in the sample dataset.

x = data[:, 0]  # hour index
y = data[:, 1]  # hits in that hour
sp.sum(sp.isnan(y))

The result is 8, that is, 8 of the 743 records are invalid. We filter them out. BTW, the NumPy tools make this very convenient.

x = x[~sp.isnan(y)]
y = y[~sp.isnan(y)]
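
The filtering step can be sketched on a small made-up sample. The values below are illustrative only, and the snippet uses numpy directly (assumed to be what the sp. aliases resolve to).

```python
import numpy as np

# Small made-up sample: hour index and hits, with two missing values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2272.0, np.nan, 1386.0, np.nan, 1488.0])

mask = np.isnan(y)           # True where the hit count is missing
n_invalid = int(mask.sum())  # number of invalid records

# Filter both arrays with the same mask so the (x, y) pairs stay aligned.
x_clean = x[~mask]
y_clean = y[~mask]

print(n_invalid)
print(x_clean, y_clean)
```

Computing the mask once, before either array is reassigned, avoids depending on the order of the two assignments: if y were filtered first, the mask recomputed for x would no longer have the right length.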

To get a more intuitive impression, we can visualize the data. This is my first time using Matplotlib, a plotting library. It is similar to MATLAB, which I have used to draw many plots.
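
A scatter plot of this kind might be produced as in the sketch below. The data is synthetic (generated to have an upward trend), and the title, axis labels, and output file name are made up for illustration; none of them come from the original notes.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: hourly hits over four weeks with an upward trend plus noise.
rng = np.random.default_rng(0)
x = np.arange(1, 4 * 7 * 24 + 1)
y = 1000 + 2.5 * x + rng.normal(0, 200, size=x.size)

fig, ax = plt.subplots()
ax.scatter(x, y, s=5)
ax.set_title("Web traffic (synthetic example)")
ax.set_xlabel("Time")
ax.set_ylabel("Hits/hour")
# Label the x axis in weeks instead of hours.
ax.set_xticks([w * 7 * 24 for w in range(5)])
ax.set_xticklabels(["week %i" % w for w in range(5)])
ax.grid(True)
fig.savefig("web_traffic_scatter.png")  # hypothetical output file name
```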

The data shows a clear upward trend, but how can we make predictions?

The Matplotlib plotting tutorial is at http://matplotlib.org/users/pyplot_tutorial.html, but to my surprise it is blocked by the firewall. Why on earth would anyone block a scientific software site? I will have to find a workaround. Fxxx GFW! For pyplot usage, see http://matplotlib.org/api/pyplot_api.html, which also needs a workaround.

3. Use the correct model and learning method

We don't yet know what the model is. We need to find one and use the fitted model to predict the future!

Looking at the figure above, the first thing that came to mind was my undergraduate course on numerical approximation. The core of numerical approximation is to find the pattern underlying existing data, that is, a fitting function. Reading on, I found that the example in the book is a typical numerical approximation method, although I don't remember that course covering the concepts of iteration and learning. Let's continue.

Assuming the function is f, how can we decide whether it is a good model? A common practice is to measure the error between the sample data and the function's values. To keep positive and negative differences from cancelling out, the differences are squared before being summed. The error function is defined as follows:

def error(f, x, y):
    return sp.sum((f(x) - y) ** 2)
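
To make the error function concrete, here is a tiny example that can be checked by hand. The data points and the candidate model f(x) = 2x + 1 are made up for illustration, and numpy stands in for the sp. aliases.

```python
import numpy as np


def error(f, x, y):
    """Sum of squared differences between the model f(x) and the data y."""
    return np.sum((f(x) - y) ** 2)


# Made-up points and a hypothetical candidate model f(x) = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 6.0, 7.0])


def f(x):
    return 2 * x + 1


# f(x) - y is [0, 0, -1, 0], so the squared error is 1.0.
print(error(f, x, y))
```

Only the third point is off, by one unit, and squaring keeps that miss positive regardless of its sign.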

What does f look like? The simplest choice is a straight line, f(x) = ax + b, so the task is to determine a and b. SciPy has a polyfit function that lets us take a shortcut: it finds the a and b for which the error defined above is smallest (that is, the best fit to the data).

fp1, residuals, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)
print(fp1)

fp1 is an array with two elements, the values of a and b.

The printed value is [2.59619213, 989.02487106].

We obtain the linear function f(x) = 2.59619213x + 989.02487106.
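
As a sanity check on how polyfit reports its result, the sketch below fits made-up data that lies exactly on y = 3x + 5 and recovers the slope and intercept. It uses numpy.polyfit, assumed to be the function behind the sp.polyfit alias.

```python
import numpy as np

# Made-up data lying exactly on the line y = 3x + 5.
x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0

# Degree 1 asks for a straight line; full=True also returns fit diagnostics.
fp1, residuals, rank, sv, rcond = np.polyfit(x, y, 1, full=True)

# Coefficients come highest power first: slope a, then intercept b.
a, b = fp1
print(a, b)
```

The ordering matters when reading the printed array: the first number multiplies x and the second is the constant term.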

What is its error? Do you still remember the error function?

We construct a function using the following code:

f1 = sp.poly1d(fp1)
print(error(f1, x, y))

The result is 317389767.34. Is that good or bad? It is hard to tell from the number alone, so let's draw a picture. Add the following code:
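
The sketch below shows, on made-up noisy data, how poly1d turns the fitted coefficients into a callable model and how the error function relates to the residuals value that polyfit returns with full=True. The data and random seed are arbitrary, and numpy stands in for the sp. aliases.

```python
import numpy as np


def error(f, x, y):
    return np.sum((f(x) - y) ** 2)


# Made-up noisy data scattered around a straight line.
rng = np.random.default_rng(42)
x = np.arange(50, dtype=float)
y = 2.5 * x + 100.0 + rng.normal(0.0, 10.0, size=x.size)

fp1, residuals, rank, sv, rcond = np.polyfit(x, y, 1, full=True)
f1 = np.poly1d(fp1)  # callable model built from the coefficients

print(f1.order)         # degree of the polynomial
print(error(f1, x, y))  # sum of squared errors of the fit
print(residuals[0])     # the same quantity, as reported by polyfit
```

Because the value is a sum over all points, it grows with the size of the dataset and the scale of y, which is one reason a single large number is hard to judge in isolation.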

import matplotlib.pyplot as plt
fx = sp.linspace(0, x[-1], 1000)  # generate x values for plotting
plt.plot(fx, f1(fx), linewidth=4)  # plot the fitted curve
plt.legend(["d=%i" % f1.order], loc="upper left")  # legend

The figure is as follows:

From the figure, it is obvious that from week 4 onward this line no longer represents the data points. Is the value 317389767.34 good? Every fit has some error, so the number alone says little. Let's keep it as a baseline and see whether we can find a better model. Clearly, a linear function is not a good choice for describing this data. Next we will iterate and try different models. How? That continues in the next section.

BTW, how nice it would have been to have this book when I was in college! Today's students are really lucky to have so many good resources to learn from.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.