Stock prediction in practice: linear regression


    • Machine learning: Predicting Google stock using Scikit-learn's linear regression

This is the first article in the Machine Learning series.

This article uses scikit-learn's linear regression in Python to predict the trend of Google's stock. Please do not expect this example to make you a stock-market expert. Here is how to do it, step by step.

Preparing data

The data used in this article comes from www.quandl.com. With the quandl library for Python, you can fetch the data we need in a few lines of code. This article uses the free data. Use the following code to get it:

    import quandl

    df = quandl.get('WIKI/GOOGL')

WIKI/GOOGL is the ID of the dataset, which can be looked up on the site. However, I found that the new version of Quandl requires users to register on the website and then use their credentials to read data. The WIKI/GOOGL dataset used here is served through the old interface and does not require credentials.

With the code above, we fetch the data and store it in the variable df. By default, Quandl returns the data as a Pandas DataFrame, so you can inspect its contents with the usual DataFrame functions. For example, print(df.head()) prints the first few rows of the table.
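Because the download needs network access, the inspection step can also be tried on a stand-in table. The sketch below builds a small DataFrame with a date index and a few columns named in the same style; the values and the reduced column set are made up for illustration and are not real GOOGL data.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded table: the values are invented,
# only the column naming style and the date index mimic the real data.
dates = pd.date_range("2004-08-19", periods=5, freq="B")
sample = pd.DataFrame(
    {
        "Open": [100.0, 101.5, 110.8, 111.2, 104.8],
        "Close": [100.3, 108.3, 109.4, 104.9, 106.0],
        "Adj. Open": [50.0, 50.8, 55.4, 55.6, 52.4],
        "Adj. Close": [50.2, 54.2, 54.7, 52.5, 53.0],
        "Adj. Volume": [44.7e6, 22.8e6, 18.3e6, 15.2e6, 9.2e6],
    },
    index=dates,
)
sample.index.name = "Date"

print(sample.head(3))            # first three rows
print(sample.columns.tolist())   # available column names
print(sample.shape)              # (number of rows, number of columns)
```

The same three calls work unchanged on the real df.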

Preprocessing data

The printed table shows that the dataset provides a number of columns, such as Open for the opening price, Close for the closing price, and Volume for the day's trading volume. Columns with the Adj. prefix should be the values after ex-rights adjustment.

We do not need all of the fields. Our goal is to predict the movement of the stock, so the quantity to study is the stock price at a given moment, which gives us something to compare against. We therefore describe the stock price with the adjusted closing price, Adj. Close, and choose it as the variable to be predicted.

Next, think about which variables are related to the stock price. The following code selects several fields that may affect changes in Adj. Close as features for the regression and preprocesses them. For the detailed steps, read the comments.

    import math
    import numpy as np
    from sklearn import preprocessing

    # Name of the column to predict
    forecast_col = 'Adj. Close'
    # Number of days to forecast: 1% of the length of the dataset
    forecast_out = int(math.ceil(0.01 * len(df)))

    # Keep only the following fields of df
    df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

    # Construct two new columns
    # hl_pct: percentage difference between the day's high and the close
    df['hl_pct'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
    # pct_change: percentage change between the open and the close
    df['pct_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

    # These are the actual feature fields
    df = df[['Adj. Close', 'hl_pct', 'pct_change', 'Adj. Volume']]

    # scikit-learn cannot handle missing data, so replace it with
    # an unlikely value; -99999 is used here
    df.fillna(-99999, inplace=True)

    # The label column holds the prediction target: the Adj. Close column
    # shifted up by forecast_out rows (1% of the data)
    df['label'] = df[forecast_col].shift(-forecast_out)

    # Build X and y for the model, and X_lately for the actual prediction
    X = np.array(df.drop(['label'], axis=1))
    # TODO: there is an open question about this step (discussed below)
    X = preprocessing.scale(X)
    # The last 1% of rows have no label after the shift,
    # so they serve as the input for the prediction
    X_lately = X[-forecast_out:]
    X = X[:-forecast_out]
    # Drop the rows whose label is empty
    df.dropna(inplace=True)
    y = np.array(df['label'])

In the code above, it is hard to see how the label column is generated and what it is used for. In fact, the i-th element of this column is the (i + forecast_out)-th element of the Adj. Close column. In plain words: each value in this column is the actual closing price forecast_out days later. Using this column as the supervision target of the linear regression model lets the model learn the pattern, which we then use to predict.
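A toy example may make the shift easier to see. The sketch below uses six made-up closing prices and a forecast_out of 2 (rather than 1% of the data) so the alignment is visible at a glance; the names toy, labeled and unlabeled are only for illustration.

```python
import pandas as pd

# Toy closing prices (made-up values) for six consecutive days
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
forecast_out = 2  # look two rows ahead instead of 1% of the data

# Row i of 'label' receives the 'Adj. Close' value of row i + forecast_out
toy["label"] = toy["Adj. Close"].shift(-forecast_out)
print(toy)

# The last forecast_out rows have no future value yet, so their label is NaN
unlabeled = toy[toy["label"].isna()]
labeled = toy.dropna()
print(len(labeled), len(unlabeled))
```

The rows in labeled correspond to the training data (X and y) in the article, and the rows in unlabeled correspond to X_lately.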

In addition, the line X = preprocessing.scale(X) standardizes X: each feature column is shifted and rescaled to zero mean and unit variance (this does not make the data normally distributed). (PS: This processing changes the values of X, so I do not understand why it is done or why it does not affect what the model learns. If you know the answer, please leave a comment.)
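On the question of why scaling does not hurt the fit: standardization shifts and rescales each column, and ordinary least squares with an intercept can absorb such a change into its coefficients, so the fitted values stay the same as long as the inputs used for prediction are scaled in the same way. The article's code satisfies this because X is scaled before X_lately is sliced off. The sketch below checks this on made-up data; the variable names and numbers are assumptions for illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Made-up features on very different scales (think price vs. volume)
X_raw = np.column_stack([rng.normal(50, 5, 200), rng.normal(1e7, 2e6, 200)])
y_toy = 3.0 * X_raw[:, 0] + 1e-6 * X_raw[:, 1] + rng.normal(0, 1, 200)

# After scaling, each column has mean 0 and standard deviation 1
X_scaled = preprocessing.scale(X_raw)

model_raw = LinearRegression().fit(X_raw, y_toy)
model_scaled = LinearRegression().fit(X_scaled, y_toy)
pred_raw = model_raw.predict(X_raw)
pred_scaled = model_scaled.predict(X_scaled)

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
# The coefficients differ, but the fitted values agree
print(model_raw.coef_, model_scaled.coef_)
print(np.allclose(pred_raw, pred_scaled))
```

Scaling matters more for models that are sensitive to feature magnitude, such as regularized regression or SVMs; for plain linear regression it mainly makes the coefficients comparable across features.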

Linear regression

The data is ready. We can now build a linear regression model and train it on the data.

    # scikit-learn removed cross_validation in version 0.20
    # (deprecated since 0.18); use model_selection instead
    from sklearn import preprocessing, model_selection, svm
    from sklearn.linear_model import LinearRegression

    # Before starting, split X and y into two parts:
    # one for training, the other for testing
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

    # Create scikit-learn's linear regression object
    clf = LinearRegression(n_jobs=-1)
    # Train
    clf.fit(X_train, y_train)
    # Evaluate accuracy on the test data
    accuracy = clf.score(X_test, y_test)
    # Make the predictions
    forecast_set = clf.predict(X_lately)

    print(forecast_set, accuracy)

The lines above are the whole training and prediction process with scikit-learn's linear regression. We compute the model's accuracy on the test data, and obtain the predictions forecast_set by feeding X_lately to the model.

My run produced the following result: [output screenshot not included]

Note that this accuracy does not mean the model gets 97 out of 100 days right. For LinearRegression, score returns the coefficient of determination (R²), a statistical measure of how well the linear model explains the variation in the data. I may discuss this value further in a later article.
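To make the meaning of this number concrete, the sketch below recomputes R² by hand and compares it with score. The data is made up (one feature with a noisy linear relationship), and the names X_demo, y_demo and r2_manual are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Made-up data: one feature with a noisy linear relationship
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2.0 * X_demo[:, 0] + 1.0 + rng.normal(0, 1.0, size=100)

# Train on the first 80 rows, evaluate on the remaining 20
model = LinearRegression().fit(X_demo[:80], y_demo[:80])
X_te, y_te = X_demo[80:], y_demo[80:]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residual_ss = np.sum((y_te - model.predict(X_te)) ** 2)
total_ss = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - residual_ss / total_ss

print(r2_manual, model.score(X_te, y_te))
```

A value near 1 means the model's errors are small compared with the spread of the prices themselves. For a slowly moving series such as a stock price, that can be achieved without the forecasts being useful for trading.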

Plotting the trend

Finally, we use matplotlib to visualize the data. For the detailed steps, see the code comments.

    import matplotlib.pyplot as plt
    from matplotlib import style
    import datetime

    # Change the matplotlib style
    style.use('ggplot')

    one_day = 86400
    # Add a Forecast column to df to hold the predicted results
    df['Forecast'] = np.nan
    # Time index of the last row of df
    last_date = df.iloc[-1].name
    last_unix = last_date.timestamp()
    next_unix = last_unix + one_day

    # Iterate over the predictions and append one row to df for each.
    # Every field in these rows is np.nan except Forecast
    for i in forecast_set:
        next_date = datetime.datetime.fromtimestamp(next_unix)
        next_unix += one_day
        # [np.nan for _ in range(len(df.columns) - 1)] builds NaN values
        # for every column except Forecast,
        # and [i] is a list that contains only the forecast value.
        # The two lists are concatenated into a new row,
        # which is appended to the bottom of df under its date
        df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]

    # Start plotting
    df['Adj. Close'].plot()
    df['Forecast'].plot()
    plt.legend(loc=4)
    plt.xlabel('Date')
    plt.ylabel('Price')
    plt.show()

Running the code produces a chart of the price history followed by the forecast.

The red part is the collected data, and the blue part is the forecast data.
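The loop in the article builds each future date by adding 86400 seconds and converting with datetime.fromtimestamp, which depends on the machine's local time zone. As an alternative sketch (not the author's method), the future dates can be generated with pd.date_range and joined with pd.concat. The names history, toy_forecast and combined, and all values, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up history and forecasts standing in for df and forecast_set
history = pd.DataFrame(
    {"Adj. Close": [100.0, 101.0, 102.5]},
    index=pd.date_range("2017-01-02", periods=3, freq="D"),
)
toy_forecast = np.array([103.0, 103.4])

# One calendar day after the last known date, one row per forecast value
future_index = pd.date_range(
    history.index[-1] + pd.Timedelta(days=1),
    periods=len(toy_forecast),
    freq="D",
)
future = pd.DataFrame({"Forecast": toy_forecast}, index=future_index)

# Rows are stacked by date; cells with no data become NaN
combined = pd.concat([history, future])
print(combined)
```

Like the original loop, this steps through calendar days, so weekends are included. Passing freq="B" to pd.date_range would use business days instead.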

