Using machine learning to predict weather (Part II)

Source: Internet
Author: User
Tags: jupyter notebook, statsmodels
Overview

In this article, we continue exploring the use of machine learning to predict the weather of Lincoln, Nebraska, using the data obtained from the Weather Underground website in the previous article.
In the previous article, we explored how to collect, organize, and clean the data. In this article, we will use that data to build a linear regression model for predicting the weather. To build the linear regression model, I'm going to use two of the most important machine learning libraries in Python: Scikit-learn and Statsmodels.
In the third article, we will use Google TensorFlow to build a neural network model and compare its predictions with those of the linear regression model.
This article involves many mathematical concepts and terms. If you find them laborious to follow, I suggest you first look up the related concepts to get a basic understanding.

Get Data

In this GitHub repository, there is a Jupyter notebook file called Underground api.ipynb, which records the acquisition and cleaning of the data used in this article and the next one. You will also find a file called End-part1_df.pkl; if you don't want to re-collect the data, you can use that file directly and then run the following code to turn the data into a pandas DataFrame.

import pickle
with open('end-part1_df.pkl', 'rb') as fp:
    df = pickle.load(fp)

If running the above code produces the error No module named 'pandas.indexes', then the version of the pandas library you are using is different from mine (v0.18.1). To avoid this error, I have also provided a CSV file. You can get it from the GitHub repository above and read the data with the following code.

import pandas as pd
df = pd.read_csv('end-part2_df.csv').set_index('date')
Linear Regression Algorithm

The goal of a linear regression model is to use a series of linearly related data to predict a possible outcome y (the dependent variable) from one or more predictors x (the independent variables), ultimately establishing a model (a mathematical formula) that, given arbitrary predictor values x, calculates the corresponding outcome y.
The general formula for linear regression is:

ŷ = β0 + β1x1 + β2x2 + ... + βpxp + ε
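To make the formula concrete, here is a minimal sketch with made-up coefficients (these are illustrative numbers, not values fitted from our data) showing how a fitted model turns predictor values into a prediction ŷ:

import numpy as np

# hypothetical coefficients b0 (intercept), b1, b2 - purely illustrative
betas = np.array([2.0, 0.5, -0.3])
# predictor vector with a leading 1 so the dot product includes the intercept
x = np.array([1.0, 10.0, 4.0])

y_hat = betas.dot(x)  # 2.0 + 0.5*10.0 + (-0.3)*4.0 = 5.8
print(y_hat)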

For a more detailed explanation of the regression formula itself, see the Baidu Encyclopedia entry on linear regression models.

Selecting Feature Data for the Model

The key assumption of linear regression is that there is a linear relationship between the dependent variable and each independent variable. For our data, that means computing the Pearson correlation coefficient between the temperature and each of the other variables. The Pearson correlation coefficient (r) measures the linear correlation between two equal-length arrays and takes values from -1 to 1. Values from 0 to 1 indicate increasingly positive correlation: two data series are positively correlated when the values in one series increase together with the values in the other, and the more equally those increases track each other, the closer the value gets to 1. Values from 0 to -1 indicate increasingly negative (inverse) correlation: as the values in one series increase, the corresponding values in the other series decrease, and when the two series vary by equal amounts in opposite directions, the value approaches -1. Pearson correlation values close to zero imply a weak linear relationship, weakening further as the value approaches zero.
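To see what these extremes look like in practice, here is a small toy demonstration (made-up arrays, not our weather data) using NumPy's corrcoef() function:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.corrcoef(x, 2 * x + 1)[0, 1])    #  1.0: rises in lockstep with x
print(np.corrcoef(x, -3 * x + 7)[0, 1])   # -1.0: falls in lockstep with x
y_flat = np.array([2.0, 1.0, 3.0, 1.0, 2.0])
print(np.corrcoef(x, y_flat)[0, 1])       #  0.0: no linear relationship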
As to how strong a correlation coefficient must be to matter, statisticians and statistics books differ. However, a generally accepted set of correlation-strength classifications runs roughly as follows:

0.8 - 1.0: very strong
0.6 - 0.8: strong
0.4 - 0.6: moderate
0.2 - 0.4: weak
0.0 - 0.2: very weak
To evaluate the correlations in this data, I will call the corr() method of the pandas DataFrame object. Using corr(), I can select the column I am interested in ('meantempm') and then call sort_values() on the returned pandas Series, which outputs the correlation values ordered from the most negative to the most positive.

df.corr()[['meantempm']].sort_values('meantempm')



When choosing the features to include in this linear regression model, I want to err on the side of excluding variables with moderate or low correlation coefficients, so I am going to drop every attribute whose correlation value has an absolute value less than 0.6. Also, since the 'mintempm' and 'maxtempm' variables are for the same day as the 'meantempm' prediction target, I will drop those as well (i.e., if I already knew the day's highest and lowest temperatures, I would already have my answer). With this information, I can now create a new DataFrame containing only the variables I am interested in.
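As an aside, the same filtering could be done programmatically rather than by hand. This is a hedged sketch; it assumes df still carries the same-day 'mintempm' and 'maxtempm' columns from Part 1:

# keep predictors whose absolute correlation with meantempm is at least 0.6,
# excluding the target itself and the same-day min/max temperatures
corrs = df.corr()['meantempm'].abs()
keep = corrs[corrs >= 0.6].index.drop(
    ['meantempm', 'mintempm', 'maxtempm'], errors='ignore')
print(sorted(keep))  # should roughly match the hand-picked list below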

predictors = ['meantempm_1',  'meantempm_2',  'meantempm_3',
              'mintempm_1',   'mintempm_2',   'mintempm_3',
              'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
              'maxdewptm_1',  'maxdewptm_2',  'maxdewptm_3',
              'mindewptm_1',  'mindewptm_2',  'mindewptm_3',
              'maxtempm_1',   'maxtempm_2',   'maxtempm_3']
df2 = df[['meantempm'] + predictors]
Visualizing the Data Relationships

Because most people, myself included, are much more accustomed to using visuals to evaluate and validate patterns, I will graph each of the selected predictors against 'meantempm' to demonstrate the linear relationships in the data. To do this, I will make use of Matplotlib's pyplot module.
For this diagram, I want the 'meantempm' variable to be the consistent y-axis across all 18 predictor plots, which calls for a grid of charts. Pandas does have a useful plotting function called scatter_matrix(), but it is usually only practical with up to about five variables, because it renders the plots as an NxN matrix (18x18 in our case), which makes it difficult to see details in the data. Instead, I will create a grid structure of six rows and three columns to avoid sacrificing the clarity of the charts.

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# manually set the parameters of the figure to an appropriate size
# (the original value was garbled in this copy; any large size works)
plt.rcParams['figure.figsize'] = [16, 22]

# call subplots, specifying the grid structure we desire and that
# the y axes should be shared
fig, axes = plt.subplots(nrows=6, ncols=3, sharey=True)

# since it would be nice to loop through the features to build this plot,
# rearrange our predictor names into a 2D array of 6 rows and 3 columns
arr = np.array(predictors).reshape(6, 3)

# use enumerate to loop over the arr 2D array of rows and columns
# and create a scatter plot of meantempm vs. each feature
for row, col_arr in enumerate(arr):
    for col, feature in enumerate(col_arr):
        axes[row, col].scatter(df2[feature], df2['meantempm'])
        if col == 0:
            axes[row, col].set(xlabel=feature, ylabel='meantempm')
        else:
            axes[row, col].set(xlabel=feature)

plt.show()

[Figure: a 6x3 grid of scatter plots, one for each predictor vs. meantempm]

As can be seen from the plots above, all of the remaining predictors show a good linear relationship with the response variable ('meantempm'). It is also worth noting that the relationships look uniformly, randomly distributed; that is, there are no fan or cone shapes, and the spread of the values appears roughly constant. A uniform random distribution of the spread around the line is another important assumption of linear regression using the ordinary least squares algorithm.
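One quick way to check that assumption, once a model has been fitted, is to plot the residuals against the fitted values and look for a shapeless, even band around zero. This is a minimal sketch assuming model is the fitted statsmodels OLS result built in the next section:

import matplotlib.pyplot as plt

# residuals vs. fitted values: fans or cones here would signal non-constant variance
plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('fitted meantempm')
plt.ylabel('residual')
plt.show()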

Using Stepwise Regression to Build a Robust Model

A robust linear regression model should use statistically meaningful, significant indicators as its predictors. To select statistically significant features, I will use the Python statsmodels library. But before jumping into the Statsmodels library, I would first like to explain some of the theoretical meaning and purpose behind this approach.
A key aspect of using statistical methods such as linear regression in an analytics project is establishing and running hypothesis tests to validate the significance of the assumptions made about the data under study. Many hypothesis tests have been developed to test the robustness of a linear regression model against various assumptions. One such hypothesis test evaluates the significance of each of the included predictor variables.
The formal definition of the hypothesis about the βj parameters is as follows:

H0: βj = 0 (the null hypothesis: the predictor has no effect on the outcome variable's value)
Ha: βj ≠ 0 (the alternative hypothesis: the predictor has a significant effect on the outcome variable's value)

By using tests of probability to evaluate the likelihood that each βj is significant beyond simple random chance at a selected threshold α, we can be more stringent about the data we keep and thereby ensure the robustness of the model.
However, in many datasets, interactions between the variables can cause some of these simple hypothesis tests to behave contrary to expectations. To test for the effect of such interactions on the significance of any one variable in a linear regression model, a technique known as stepwise regression is often applied: variables are added to or removed from the model, and the effect of each change on the resulting model is evaluated. In this article, I will use a technique known as backward elimination, which begins with a fully loaded model that contains all of the variables I'm interested in.

The backward elimination workflow is as follows:

1. Select a significance level α against which to judge whether a variable stays in the model.
2. Fit the model with all the candidate predictor variables.
3. Evaluate the p-values of the βj coefficients and identify the one with the greatest p-value. If that p-value is greater than α, proceed to step 4; if not, you have your final model.
4. Remove the predictor identified in step 3.
5. Fit the model again, this time without the removed variable, and loop back to step 3. (One way to automate this loop is sketched right after this list.)
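Before walking through those steps by hand, here is a minimal sketch of how the whole loop could be automated with statsmodels. This is my own illustrative helper, not code from the original workflow; it assumes X already includes a 'const' column and y is the outcome series:

import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    # repeatedly fit OLS and drop the predictor with the largest p-value
    # until every remaining p-value is at or below alpha
    X = X.copy()
    while True:
        model = sm.OLS(y, X).fit()
        pvals = model.pvalues.drop('const', errors='ignore')  # always keep the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return model, X
        X = X.drop(worst, axis=1)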

Below we use Statsmodels to follow the steps above to build our model.

# import the relevant module
import statsmodels.api as sm

# separate our predictor variables (X) from the outcome variable (y)
X = df2[predictors]
y = df2['meantempm']

# add a constant to the predictor variable set to represent the B0 intercept
X = sm.add_constant(X)

# peek at the first few rows and columns (pandas v0.18 indexer)
X.ix[:5, :5]

# (1) select a significance value
alpha = 0.05

# (2) fit the model
model = sm.OLS(y, X).fit()

# (3) evaluate the coefficients' p-values
model.summary()

Calling the summary() function produces the following output:

[Output: OLS regression results summary table]

Granted, the call to summary() prints a lot of information to the screen. In this article, we focus on only two or three of those values:

P>|t| - this is the p-value I mentioned above, which I will use to evaluate the hypothesis tests. This is the value we use to decide whether to eliminate a variable in the stepwise backward elimination technique.
R-squared - a measure of how much of the overall variance in the outcome our model can explain. Adj. R-squared is the same as R-squared, except that for multiple linear regression this value is penalized by the number of variables included, to account for the level of overfitting.

This is not to say that the other values in this output are without worth; on the contrary, they touch on the more esoteric idiosyncrasies of linear regression, which we simply don't have time to get into here. For a full explanation of them, I will defer you to an advanced regression textbook, such as Kutner's Applied Linear Regression Models, 5th Edition, as well as the statsmodels documentation.
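As a small aside, you do not have to read these values off the printed table; the fitted statsmodels results object exposes them directly as attributes:

# the same quantities, pulled straight from the fitted results object
print(model.pvalues)       # the P>|t| column, as a pandas Series
print(model.rsquared)      # R-squared
print(model.rsquared_adj)  # adjusted R-squared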

# (3) cont. - identify the predictor with the greatest p-value and assess if it is > our selected alpha.
#             based off the table, it is clear that meandewptm_3 has the greatest p-value and that
#             it is greater than our alpha of 0.05

# (4) - use pandas' drop function to remove this column from X
X = X.drop('meandewptm_3', axis=1)

# (5) fit the model
model = sm.OLS(y, X).fit()

model.summary()

[Output: updated OLS regression results summary table]

Out of respect for your reading time, and to keep the article to a reasonable length, I am going to omit the remaining elimination cycles required to build each new model, evaluate p-values, and remove the least significant value. Instead, I will jump straight to the last cycle and provide you with the final model. After all, the main goal here was to describe the process and the reasoning behind it. Below you will find the output of the final model I converged on after applying the backward elimination technique. You can see from the output that all the remaining predictors have p-values significantly below our α of 0.05. Also noteworthy are the R-squared values in the final output. Two points stand out: (1) the R-squared and Adj. R-squared values are nearly equal, which suggests there is minimal risk that our model is overfitted, and (2) the value of 0.894 is interpreted to mean that our final model explains about 90% of the observed variation in the outcome variable, 'meantempm'.

model = sm.OLS(y, X).fit()
model.summary()

[Output: final OLS regression results summary table]
Predicting the Weather Using the Scikit-learn Linear Regression Module

Now that we have gone through the steps to select statistically meaningful predictors (features), we can use Scikit-learn to create a prediction model and test its ability to predict the mean temperature. Scikit-learn is a very mature machine learning library that is widely used in both industry and academia. One thing that is very impressive about Scikit-learn is that it maintains a remarkably consistent "fit", "predict", and "test" API across many numerical techniques and algorithms, which makes it very simple to use. In addition to this consistent API design, Scikit-learn also comes with several useful tools for processing data that are common to many machine learning projects.
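To illustrate that consistent API on something tiny before applying it to our weather data, here is a self-contained toy example (made-up numbers, not our dataset):

import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([2.1, 3.9, 6.2, 8.1])            # roughly y = 2x

toy_model = LinearRegression().fit(X_toy, y_toy)  # "fit"
print(toy_model.predict(np.array([[5.0]])))       # "predict": roughly 10
print(toy_model.score(X_toy, y_toy))              # "test": R^2 close to 1.0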

We will begin by importing the train_test_split() function from Scikit-learn's sklearn.model_selection module to divide our dataset into a training set and a test set. I will split them into 80% training and 20% testing, and assign a random_state of 12 to ensure you get the same random selection of data as I do. This random_state parameter is very useful for reproducibility of results.

from sklearn.model_selection import train_test_split

# first drop the const column, because unlike statsmodels, scikit-learn will add that for us
X = X.drop('const', axis=1)

# split the data into an 80% training set and a 20% testing set, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

The next step is to build the regression model using the training dataset. To do this, I will import and use the LinearRegression class from the sklearn.linear_model module. As mentioned earlier, Scikit-learn scores major usability points by implementing the common fit() and predict() functions.

from sklearn.linear_model import LinearRegression

# instantiate the regressor class
regressor = LinearRegression()

# build the model by fitting the regressor to the training data
regressor.fit(X_train, y_train)

# make a prediction set using the test set
prediction = regressor.predict(X_test)

# evaluate the prediction accuracy of the model
from sklearn.metrics import mean_absolute_error, median_absolute_error
print("The Explained Variance: %.2f" % regressor.score(X_test, y_test))
print("The Mean Absolute Error: %.2f degrees celsius" % mean_absolute_error(y_test, prediction))
print("The Median Absolute Error: %.2f degrees celsius" % median_absolute_error(y_test, prediction))

The Explained Variance: 0.90
The Mean Absolute Error: 2.69 degrees celsius
The Median Absolute Error: 2.17 degrees celsius

As you can see from the few lines of code above, using Scikit-learn to build a linear regression prediction model is quite simple.

To get an interpretative understanding of the model's validity, I used the regressor model's score() function to determine that the model explains about 90% of the variance observed in the outcome variable (mean temperature). Additionally, I used the mean_absolute_error() and median_absolute_error() functions of the sklearn.metrics module to determine that, on average, the predicted value is about 3 degrees Celsius off, and half of the time it is off by about 2 degrees Celsius.
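To make the difference between those two error metrics concrete, here is a small toy example with made-up numbers (not our model's predictions): a single large miss pulls the mean absolute error up, while the median absolute error stays put.

import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error

y_true = np.array([10.0, 12.0, 14.0, 20.0])  # made-up actuals
y_pred = np.array([11.0, 12.5, 13.0, 26.0])  # one large miss (6 degrees)

errors = np.abs(y_true - y_pred)              # [1.0, 0.5, 1.0, 6.0]
print(mean_absolute_error(y_true, y_pred))    # 2.125, pulled up by the big miss
print(median_absolute_error(y_true, y_pred))  # 1.0, robust to the outlier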

Summary

In this article, I demonstrated how to use the linear regression machine learning algorithm to predict future mean weather temperatures based on the data collected in the previous article. I demonstrated how to use the Statsmodels library to select statistically significant predictors based on sound statistical methods. I then used that information to fit a prediction model based on Scikit-learn's LinearRegression class to a training subset of the data. Using this fitted model, I could then predict the expected values for the inputs from the test subset and evaluate the accuracy of the predictions.

Related article: Using machine learning to predict weather (Part I)
