"Scikit-learn" Using Python for machine learning experiments

Overview

This article is the first in a series of small machine learning experiments using the Python programming language. The main contents are as follows:

  1. Reading and cleaning the data
  2. Exploring the characteristics of the input data
  3. Deciding how the data should be presented to the learning algorithm
  4. Choosing the right model and learning algorithm
  5. Assessing the performance of the result
Reading and Cleaning the Data

When you read data, you will face invalid or missing values, and handling them well is more art than exact science. Doing this part properly makes the data usable by more machine learning algorithms and thus increases the probability of success.

Chewing Data Efficiently with NumPy and Intelligently with SciPy

Python is a highly optimized interpreted language, yet it is much slower than C and similar languages for heavy numerical algorithms. So why do so many scientists and companies still bet on Python in computationally intensive areas? Because Python can easily delegate numerical computation tasks to low-level extensions written in C or FORTRAN. NumPy and SciPy are the main representatives of this approach.
NumPy provides efficient data structures such as multidimensional arrays, and SciPy provides many algorithms that operate on these arrays. Whether it is matrix manipulation, linear algebra, optimization, clustering, or even fast Fourier transforms, the toolbox meets the requirements.
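
To make this concrete, here is a minimal illustration (not from the original article) of a vectorized NumPy operation: the per-element work happens in compiled C routines, with no explicit Python loop.

import numpy as np

# Element-wise arithmetic on a whole array in one expression;
# the loop runs in compiled C code, not in the Python interpreter.
a = np.arange(1000000)
b = a * 2 + 1
print(b[:5])   # -> [1 3 5 7 9]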


Reading in the Data

Here we use page-click data as an example: the first column is the hour, and the second column is the number of clicks in that hour.

import scipy as sp

data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")

Preprocessing and Cleaning the Data

Once you have the data structures in place to store and process the information, you may need more data to ensure good predictive performance, or you may have too much data and need to think about how best to sample it.
Refining the raw data before training can be very useful; sometimes a simple algorithm on refined data outperforms an advanced algorithm on raw data. This workflow is called feature engineering. Be creative and intelligent here, and you will immediately see the results.

Since the dataset may contain invalid values (NaN), we first count how many there are:

hours = data[:, 0]
hits = data[:, 1]
sp.sum(sp.isnan(hits))

Then we filter them out as follows:

# cleaning the data
hours = hours[~sp.isnan(hits)]
hits = hits[~sp.isnan(hits)]
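
As a quick sanity check (an added line, not in the original snippet), we can confirm that no invalid entries remain:

print(sp.sum(sp.isnan(hits)))   # should now print 0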

To get an intuitive feel for the data, we visualize it using Matplotlib's pyplot package.

import matplotlib.pyplot as plt

plt.scatter(hours, hits)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)], ["week %i" % w for w in range(10)])
plt.autoscale(tight=True)
plt.grid()
plt.show()

The resulting scatter plot looks like this:

[Figure: "Web traffic over the last month", hits/hour plotted against time in weeks]

Choosing the Right Learning Algorithm

Choosing a good learning algorithm is not as simple as picking one of the three or four algorithms in your toolbox; there are many more you may never have seen. It is a well-considered process of balancing different performance and functional requirements, such as the trade-off between execution speed and accuracy, or between scalability and ease of use.

Now that we have an intuitive understanding of the data, the next step is to find an actual model and use it to extrapolate future traffic.

Selecting a Model by Approximation Error

To select the right model from among many, we need a way to measure its predictive performance, and for that we use the approximation error. Here we define the error as the sum of squared differences between the predicted values and the true values:

def error(f, x, y):
    return sp.sum((f(x) - y) ** 2)

where f represents a predictive function.
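
For instance, a trivial baseline (a hypothetical model added here for illustration) is a degree-0 polynomial that always predicts the mean number of hits:

# Hypothetical baseline: a constant model that always predicts the mean
f_mean = sp.poly1d([hits.mean()])
print(error(f_mean, hours, hits))   # any useful model should beat this error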

Fitting the Data with a Straight Line

We now assume that the model underlying the data is a straight line. How, then, do we fit the data so as to minimize the approximation error?
SciPy's polyfit() function solves exactly this problem: given the x- and y-axis data and the desired order (for a straight line, the order is 1), it finds the parameters of the model that minimizes the approximation error.

fp1, residuals, rank, sv, rcond = sp.polyfit(hours, hits, 1, full=True)

fp1 holds the model parameters returned by polyfit; for a straight line, these are its slope and intercept.
If polyfit's full parameter is True, it also returns additional information about the fitting process. Of it, only residuals interests us: it is precisely the approximation error of the fitted line.
We then draw the line on the plot:

# fit straight line model
fp1, residuals, rank, sv, rcond = sp.polyfit(hours, hits, 1, full=True)
fStraight = sp.poly1d(fp1)

# draw fitting straight line
fx = sp.linspace(0, hours[-1], 1000)  # generate x-values for plotting
plt.plot(fx, fStraight(fx), linewidth=4)
plt.legend(["d=%i" % fStraight.order], loc="upper left")
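
As a sanity check (added here, not in the original text), the residuals value returned by polyfit should match what our own error() function computes for the fitted line:

# residuals is the sum of squared residuals of the fit,
# i.e. exactly the quantity our error() function computes
print(residuals)                       # e.g. [ 3.17389767e+08]
print(error(fStraight, hours, hits))   # 317389767.34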

Fitting the Data with Higher-Order Curves

Is a straight line a good fit? Its approximation error is 317,389,767.34, but does that mean our prediction is good or bad?
We might as well fit the data with higher-order curves and see whether they produce better results.

fCurve3p = sp.polyfit(hours, hits, 3)
fCurve3 = sp.poly1d(fCurve3p)
print("Error of Curve3 line: %f" % error(fCurve3, hours, hits))

fCurve10p = sp.polyfit(hours, hits, 10)
fCurve10 = sp.poly1d(fCurve10p)
print("Error of Curve10 line: %f" % error(fCurve10, hours, hits))

fCurve50p = sp.polyfit(hours, hits, 50)
fCurve50 = sp.poly1d(fCurve50p)
print("Error of Curve50 line: %f" % error(fCurve50, hours, hits))

The approximation errors are:

Error of Straight line: 317389767.34
Error of Curve2 line: 179983507.878
Error of Curve3 line: 139350144.032
Error of Curve10 line: 121942326.364
Error of Curve50 line: 109504587.153


Let's take a closer look at the experimental results and ask whether our prediction curves actually fit the data well. In particular, watch what happens as the polynomial order rises from 10 to 50: the model hugs the data so tightly that it fits not only the underlying pattern but also the noise, producing a wildly oscillating curve. This phenomenon is called overfitting.

Summary

From the small experiment above we can see that the straight-line fit is too simple, while the polynomials of order 10 to 50 fit too much. Does that make the 2nd- or 3rd-order polynomial the best answer? Yet we also find that if we use those as predictors, they grow without bound beyond the data. On reflection, it seems we still do not truly understand the data.
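
To see that unbounded growth concretely, we can evaluate the fitted models far beyond the observed range (a small sketch using the variables defined above; the exact numbers depend on the data):

# Evaluate every model at week 10, well past the ~4.5 weeks of data.
# The high-order polynomials typically explode to absurd values here.
future = 10 * 7 * 24   # hour index of week 10
for model in (fStraight, fCurve3, fCurve10, fCurve50):
    print("order %2d predicts %.1f hits/hour" % (model.order, model(future)))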

Measuring Performance

As a machine learning beginner, you can make many mistakes when measuring a learner's performance. Testing against your own training data looks like the easy answer, but it is misleading, and when the training data is uneven, the data itself can decide the success or failure of your predictions.

Looking Back at the Data

Analyzing the data carefully, there seems to be an obvious inflection point between week 3 and week 4, so we separate the data at week 3.5 and train a new line on each part.

inflection = int(3.5 * 7 * 24)  # the index of week 3.5, the inflection point
time1 = hours[:inflection]
value1 = hits[:inflection]
time2 = hours[inflection:]
value2 = hits[inflection:]
fStraight1p = sp.polyfit(time1, value1, 1)
fStraight1 = sp.poly1d(fStraight1p)
fStraight2p = sp.polyfit(time2, value2, 1)
fStraight2 = sp.poly1d(fStraight2p)

Clearly, these two lines describe the characteristics of the data much better. Although their approximation error is larger than that of the high-order polynomial curves, this kind of fit captures the trend of the data better. Where the high-order polynomial curves overfit, very low-order curves that describe the data poorly underfit. So, to describe the data well while avoiding both overfitting and underfitting, a 2nd-order curve can also be used to fit the data.
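
We can quantify that comparison with a short sketch (using the variables defined above) that sums the errors of the two segments:

# Combined error of the two-line model: each line is evaluated
# only on its own segment of the data
total_error = error(fStraight1, time1, value1) + error(fStraight2, time2, value2)
print("Error of the two-line model: %f" % total_error)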

Training and Testing

We have now trained a model: the two fitted lines. To verify the model's accuracy, we can hold out part of the training data at the start and use it as test data, rather than judging the model by the approximation error on the training set alone.
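
A minimal hold-out split might look like the following sketch (illustrative only; the 70/30 split ratio and the choice to split the post-inflection data are assumptions, not from the original article):

# Randomly hold out 30% of the post-inflection data as a test set
frac = 0.3                                   # assumed test fraction
split_idx = int(frac * len(time2))
shuffled = sp.random.permutation(list(range(len(time2))))
test = sorted(shuffled[:split_idx])          # held-out indices
train = sorted(shuffled[split_idx:])         # training indices
fTrain = sp.poly1d(sp.polyfit(time2[train], value2[train], 1))
print("Test error: %f" % error(fTrain, time2[test], value2[test]))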

Conclusion

This article is an introductory machine learning experiment, and it makes two main points:
1. To train a learner, you must understand and refine the data, shifting your attention from the algorithm to the data.
2. Learn how to conduct machine learning experiments without confusing training and testing data.
In the future I will pick up the pace and keep learning and practicing.

References

Richert, W. and Coelho, L. P. Building Machine Learning Systems with Python.

When reprinting, please credit the author, Jason Ding, and the source:
GitHub homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)

"Scikit-learn" Using Python for machine learning experiments

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.