Machine Learning System Design Study Notes (2)

Source: Internet
Author: User

A real example: predict future traffic from the past traffic of a company's servers.

Procedure:

1. Read data:

The server traffic is recorded in a tab-separated file, web_traffic.tsv, in the following format:

1	2272
2	nan
3	1386
4	1365
5	1488
6	1337
7	1883
8	2283
9	1335
10	1025
11	1139
12	1477

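The first column is the hour index and the second column is the number of hits in that hour. We use SciPy to read the file into an array suited to scientific computing. (genfromtxt is really a NumPy function that older SciPy releases re-export; newer releases deprecate these aliases, so there you would call numpy.genfromtxt instead.) The code is as follows: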

import scipy as sp
data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")
print(data[:10])

The code above prints the first 10 rows of the data. The result is as follows:

[[ 1.00000000e+00  2.27200000e+03]
 [ 2.00000000e+00             nan]
 [ 3.00000000e+00  1.38600000e+03]
 [ 4.00000000e+00  1.36500000e+03]
 [ 5.00000000e+00  1.48800000e+03]
 [ 6.00000000e+00  1.33700000e+03]
 [ 7.00000000e+00  1.88300000e+03]
 [ 8.00000000e+00  2.28300000e+03]
 [ 9.00000000e+00  1.33500000e+03]
 [ 1.00000000e+01  1.02500000e+03]]

For how to work with these arrays, see the following page: http://wiki.scipy.org/Tentative_NumPy_Tutorial
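As a minimal sketch of the loading step, the snippet below parses a few tab-separated rows from an in-memory string rather than the real web_traffic.tsv, so it runs without the data file. It calls NumPy's genfromtxt directly (an assumption about the environment, since the scipy-level aliases are deprecated in newer releases), and the sample string is just the first five hours of the sample data.

```python
import io
import numpy as np

# First five hours of the sample data, tab-separated; hour 2 has a missing value.
sample = "1\t2272\n2\tnan\n3\t1386\n4\t1365\n5\t1488\n"

# genfromtxt accepts a file-like object, so StringIO stands in for the file here.
data = np.genfromtxt(io.StringIO(sample), delimiter="\t")

print(data.shape)  # (rows, columns)
print(data[:3])    # first three rows; the missing value is parsed as nan
```

With the real file, the first argument would simply be the path to web_traffic.tsv.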

2. Data cleansing and preprocessing

The data contains some invalid values. Notice the nan in the second row of the output above: it marks a missing measurement. Let's count the invalid records in the dataset.

x = data[:,0]  # hours
y = data[:,1]  # hits in each hour
print(sp.sum(sp.isnan(y)))

The result is 8, meaning 8 of the 743 records are invalid. We filter them out. BTW, the NumPy tools are very convenient for this.

x = x[~sp.isnan(y)]
y = y[~sp.isnan(y)]
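The counting and filtering steps can be sketched on a tiny made-up array. The values below are hypothetical, and the snippet uses NumPy directly. Computing the mask once and applying it to both arrays avoids an ordering pitfall: if y were filtered first, a mask recomputed from the cleaned y would no longer match the length of x.

```python
import numpy as np

# Hypothetical miniature of the traffic data: column 0 is the hour, column 1 the hits.
data = np.array(
    [[1, 2272], [2, np.nan], [3, 1386], [4, np.nan], [5, 1488]], dtype=float
)
x = data[:, 0]
y = data[:, 1]

invalid = np.isnan(y)            # boolean mask, True where the hit count is missing
n_invalid = int(invalid.sum())   # number of invalid records

# Apply the same mask to both arrays so hours and hits stay aligned.
x_clean = x[~invalid]
y_clean = y[~invalid]

print(n_invalid, x_clean, y_clean)
```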

To get a more intuitive impression, we can visualize the data. This is my first time using Matplotlib, a plotting library. Its plotting interface is similar to MATLAB's, which is used to draw all kinds of figures.

 

The data shows a clear upward trend, but how can we make predictions?

The matplotlib plotting tutorial is at http://matplotlib.org/users/pyplot_tutorial.html, but to my surprise it is blocked by the firewall. I have no idea why anyone would block a scientific software site! I will have to find a workaround. For usage of the pyplot package, see http://matplotlib.org/api/pyplot_api.html, which also needs a workaround to reach.
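A scatter plot of this kind could be produced roughly as follows. This is only a sketch: the data is a synthetic stand-in rather than the real measurements, and the titles, axis labels, and output file name are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the cleaned traffic data (not the real measurements).
x = np.arange(1, 25, dtype=float)
y = 1000 + 2.5 * x

fig, ax = plt.subplots()
ax.scatter(x, y, s=10)                  # one small dot per hourly record
ax.set_title("Web traffic (synthetic sample)")
ax.set_xlabel("Time (hours)")
ax.set_ylabel("Hits/hour")
ax.grid(True)
fig.savefig("traffic_scatter.png")      # hypothetical output file name
```

In an interactive session you would drop the backend line and call plt.show() instead of saving to a file.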

 

3. Use the correct model and Learning Method

We don't yet know what the right model is. We need to find one and then use the fitted model to predict the future!

Looking at the figure above, my first thought was of an undergraduate course of mine: numerical approximation. The core of numerical approximation is finding the underlying law in existing data, that is, a fitting function. Reading on, I found that the example in the book is a typical numerical approximation method, although as I remember it the course had no concept of iteration or learning. Let's continue.

 

Suppose the function is f. How can we tell whether it is a good model? A common practice is to measure the error between the sample data and the function's values. To keep positive and negative differences from cancelling out, the differences are squared before being summed. The error function is defined as follows:

def error(f, x, y):
    return sp.sum((f(x) - y) ** 2)
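To make the error measure concrete, here is a small worked example. The toy data and the candidate model f(x) = 2x are hypothetical, and NumPy's sum is used in place of the scipy alias.

```python
import numpy as np

def error(f, x, y):
    """Sum of squared differences between the model values f(x) and the observations y."""
    return np.sum((f(x) - y) ** 2)

# Hypothetical toy data and candidate model f(x) = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 6.0])
f = lambda v: 2.0 * v

# The differences f(x) - y are 0, -1, 0, so the squared error is 0 + 1 + 0.
print(error(f, x, y))  # 1.0
```

Because the differences are squared, a model that is off by -1 is penalized exactly as much as one that is off by +1.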

 

What does f look like? The simplest choice is a straight line, f(x) = Ax + B, so the task is to determine A and B. SciPy has a polyfit function that gives us a shortcut: it finds the A and B that minimize the error defined above (that is, the best fit to the data).

fp1, residuals, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)
print(fp1)

fp1 is a one-dimensional array with two elements, the values of A and B.

The printed value is [2.59619213, 989.02487106].

We obtain the linear function f(x) = 2.59619213x + 989.02487106.
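To see what polyfit returns without the traffic data, the sketch below fits hypothetical points that lie exactly on a known line, so the recovered coefficients can be checked by eye. It calls numpy.polyfit directly.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Degree 1 asks for a straight line; coefficients are returned highest power first.
a, b = np.polyfit(x, y, 1)
print(a, b)  # approximately 2 and 1, up to floating-point rounding
```

The ordering matters: the first coefficient is the slope A and the second is the intercept B.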

What is its error? Remember the error function defined earlier?

We build a callable model from the coefficients with poly1d and evaluate its error using the following code:

f1 = sp.poly1d(fp1)
print(error(f1, x, y))
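One detail worth checking is how this error relates to the residuals value that polyfit returns when full=True: for a fit like this one, it is the same sum of squared differences. The sketch below verifies this on hypothetical noisy data (seeded so it is reproducible), using NumPy directly.

```python
import numpy as np

def error(f, x, y):
    """Sum of squared differences between the model values f(x) and the observations y."""
    return np.sum((f(x) - y) ** 2)

# Hypothetical noisy data scattered around the line y = 3x + 5.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

coeffs, residuals, rank, sv, rcond = np.polyfit(x, y, 1, full=True)
f1 = np.poly1d(coeffs)   # callable polynomial built from the fitted coefficients

print(f1.order)          # degree of the polynomial
print(error(f1, x, y))   # squared error computed by hand
print(residuals[0])      # squared error as reported by polyfit
```

The two printed error values agree up to floating-point rounding, so the residuals output can serve as a quick cross-check of the error function.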

The result is 317389767.34. Is that good or bad? The number alone does not tell us much, so let's draw a picture. Add the following code:

import matplotlib.pyplot as plt
fx = sp.linspace(0, x[-1], 1000)  # generate x-values for plotting
plt.plot(fx, f1(fx), linewidth=4)  # plot the curve
plt.legend(["d=%i" % f1.order], loc="upper left")  # legend

The figure is as follows:

From the figure, the line clearly fails to represent the data points from week 4 onward. So is an error of 317389767.34 good? Every fit has some error, so the absolute value tells us little by itself; its use is as a baseline against which to compare other models. Clearly a linear function is not a good choice for describing this data. Next we will try other approaches. How? Continued in the next section.

 

BTW, how nice it would have been to have this book when I was in college! Today's students are really lucky to have so many good resources to learn from.

 
