Simple linear regression implemented using PHP (2)


At the end of Part 1 of this series, the author noted three elements missing from the SimpleLinearRegression class. In this article, Paul Meagher fills these gaps with PHP-based probability functions, demonstrates how to integrate output methods into the SimpleLinearRegression class, and creates graphical output. He does so by building a data research tool designed to explore in depth the information contained in small and medium-sized datasets. (In Part 1, the author demonstrated how to develop and implement the core of a simple linear regression algorithm package using PHP as the implementation language.)


In Part 1 of this two-part series ("Simple linear regression with PHP"), I explained why a math library is useful for PHP. I also demonstrated how to develop and implement a simple linear regression algorithm using PHP as the implementation language.

The goal of this article is to show you how to use the SimpleLinearRegression class discussed in Part 1 to build a significant data research tool.

Brief review: Concepts
The basic goal behind simple linear regression modeling is to find the line of best fit through a two-dimensional plane of paired X and Y values (that is, X and Y measurements). Once this line is found using the least squares method, various statistical tests can be performed to determine how well the line fits the observed Y values.

The linear equation (y = mx + b) has two parameters that must be estimated from the supplied X and Y data: the slope (m) and the y-axis intercept (b). Once the two parameters are estimated, you can plug the observed X values into the linear equation and examine the predicted Y values that the equation generates.

To estimate m and b using the least squares method, we need to find the estimates of m and b that minimize the difference between the observed and predicted Y values across all X values. The difference between an observed value and a predicted value is called the error (yi - (mxi + b)). If you square each error value and then sum the squared residuals, the result is a number called the sum of squared prediction errors. Using the least squares method to determine the line of best fit means finding the estimates of m and b that make this sum as small as possible.
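The quantity being minimized can be made concrete in a few lines of PHP. The following is a minimal sketch, not code from the SimpleLinearRegression class; the function name sumSquaredErrors and the sample data are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not part of the SimpleLinearRegression class):
// returns the sum of squared prediction errors for a candidate line y = m*x + b.
function sumSquaredErrors(array $X, array $Y, float $m, float $b): float
{
    $sse = 0.0;
    foreach ($X as $i => $x) {
        $error = $Y[$i] - ($m * $x + $b);   // observed Y minus predicted Y
        $sse  += $error * $error;
    }
    return $sse;
}

// Made-up sample data, for illustration only.
$X = [1, 2, 3, 4];
$Y = [2, 4, 5, 8];

// A candidate line that lies closer to the data yields a smaller sum.
echo sumSquaredErrors($X, $Y, 1.9, 0.0), "\n";
echo sumSquaredErrors($X, $Y, 1.0, 0.0), "\n";
```

Comparing the two printed sums shows why the first candidate line would be preferred under the least squares criterion.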

Two basic methods can be used to find the estimates of m and b that satisfy the least squares criterion. In the first, you use a numerical search procedure to try different values of m and b and evaluate them, eventually settling on the estimates that yield the smallest sum of squared errors. The second is to use calculus to derive equations for estimating m and b. I will not go into the calculus involved in deriving these equations, but I did use these analytical equations in the SimpleLinearRegression class to find the least squares estimates of m and b (see the getSlope() and getYIntercept() methods in the SimpleLinearRegression class).
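For readers who want to see the analytical approach in code, here is a minimal sketch using the standard textbook closed-form estimates. It is not the body of getSlope() or getYIntercept(), which this article does not show; the function name leastSquaresFit and the sample data are assumptions.

```php
<?php
// Hypothetical helper, not the article's getSlope()/getYIntercept() code.
// Uses the standard closed-form least squares estimates:
//   m = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
//   b = meanY - m * meanX
function leastSquaresFit(array $X, array $Y): array
{
    $n = count($X);
    if ($n < 2 || $n !== count($Y)) {
        throw new InvalidArgumentException('Need at least two paired observations.');
    }
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += ($x - $meanX) ** 2;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All X values are identical.');
    }
    $m = $sxy / $sxx;
    return ['slope' => $m, 'intercept' => $meanY - $m * $meanX];
}

// Made-up sample data, for illustration only.
$fit = leastSquaresFit([1, 2, 3, 4], [2, 4, 5, 8]);
printf("y = %.2fx + %.2f\n", $fit['slope'], $fit['intercept']);
```

Because the estimates come from a formula rather than a search, the result is obtained in a single pass over the data.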

Even if you have equations for finding the least squares estimates of m and b, it does not follow that plugging these parameters into the linear equation gives you a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining prediction error is acceptable.

We can use a statistical decision procedure to evaluate the alternative hypothesis that a straight line fits the data. This procedure is based on calculating a T statistic and using a probability function to obtain the probability of observing so large a value by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values, one of which is the T statistic, which measures how well the linear equation fits the data. If the fit is good, the T statistic tends to be large. If the T value is small, you should replace your linear equation with a default model that assumes the mean of the Y values is the best predictor (because the mean of a set of values is often a useful predictor of the next observed value).

To test whether the T statistic is large enough that you need not use the mean of Y as the best predictor, you calculate the probability of obtaining the T statistic by chance. If the probability is low, you can reject the null hypothesis that the mean is the best predictor and be correspondingly confident that a simple linear model fits the data well. (For more information about calculating the probability of the T statistic, see Part 1.)
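As a rough illustration of where such a T value comes from, the sketch below computes the usual T statistic for the slope, t = m / SE(m). The function name slopeTStatistic is an assumption, the code is not taken from the SimpleLinearRegression class, and it assumes more than two observations and a fit that is not perfect (otherwise the standard error is zero).

```php
<?php
// Hypothetical helper: T statistic for the slope, t = m / SE(m),
// where SE(m) = sqrt((SSE / (n - 2)) / sum((x - meanX)^2)).
// Assumes n > 2 and a non-zero sum of squared errors.
function slopeTStatistic(array $X, array $Y): float
{
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += ($x - $meanX) ** 2;
    }
    $m = $sxy / $sxx;             // least squares slope
    $b = $meanY - $m * $meanX;    // least squares intercept

    $sse = 0.0;
    foreach ($X as $i => $x) {
        $sse += ($Y[$i] - ($m * $x + $b)) ** 2;
    }
    $standardError = sqrt(($sse / ($n - 2)) / $sxx);
    return $m / $standardError;
}

// Made-up sample data, for illustration only.
printf("T = %.3f\n", slopeTStatistic([1, 2, 3, 4], [2, 4, 5, 8]));
```

The resulting value would then be passed, together with n - 2 degrees of freedom, to a Student T probability function.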

Returning to the statistical decision procedure: it tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the alternative hypothesis of a linear model has to be established through theoretical and statistical arguments.

The data research tool you will build implements a statistical decision procedure for linear models (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish a linear model. A data research tool can be classified as a decision-support tool for knowledge workers who study patterns in small and medium-sized datasets.

From a learning perspective, simple linear regression modeling is worth studying because it is the gateway to understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression lay the groundwork for understanding multiple regression, factor analysis, and time series analysis.

Simple linear regression is also a versatile modeling technique. You can use it to model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations can linearize the data so that simple linear regression can model it. The resulting linear model is expressed as a linear formula relating the transformed values.
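A logarithmic transformation can be sketched as follows. This is an illustrative example under stated assumptions, not code from the article: the data is generated from a made-up exponential curve, and fitLine is a hypothetical helper implementing the standard least squares formulas.

```php
<?php
// Minimal least squares helper, included so the sketch is self-contained.
function fitLine(array $X, array $Y): array
{
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += ($x - $meanX) ** 2;
    }
    $m = $sxy / $sxx;
    return [$m, $meanY - $m * $meanX];   // [slope, intercept]
}

// Made-up curved data generated from y = 2 * exp(0.5 * x).
$X = [0, 1, 2, 3, 4];
$Y = array_map(fn($x) => 2 * exp(0.5 * $x), $X);

// Linearize: ln(y) = ln(a) + k * x, then fit a straight line to (x, ln y).
$logY = array_map('log', $Y);
[$k, $lnA] = fitLine($X, $logY);

// Undo the transformation to express the model on the original scale.
$a = exp($lnA);
printf("y = %.3f * exp(%.3f * x)\n", $a, $k);
```

The fitted line describes ln(y), so the intercept has to be exponentiated before the model can be read on the original scale.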

Probability Functions

In the previous article, I sidestepped the problem of implementing probability functions in PHP by using R to obtain the probability values. I was not completely satisfied with this solution, so I began to investigate what would be required to develop PHP-based probability functions.

I started searching for information and code online. One source of both is the probability functions in the book Numerical Recipes in C (http://www.library.cornell.edu/nr/bookc#.html). I re-implemented some of the probability function code (the gammln.c and betai.c functions) in PHP, but I was still not satisfied with the results. The code seemed bulkier than some other implementations. I also needed inverse probability functions.

Fortunately, I came across John Pezzullo's Interactive Statistical Calculation pages. John's page on probability distribution functions has all the functions I need, implemented in JavaScript for ease of learning.

I ported the Student T and Fisher F functions to PHP. I changed the API slightly to conform to a Java naming style and embedded all the functions in a class named Distribution. A nice feature of this implementation is the doCommonMath method, which is reused by all the functions in the library. Other tests that I did not take the trouble to implement (the normality and chi-square tests) also use the doCommonMath method.

Another aspect of this port is worth noting. In JavaScript, you can assign dynamically determined values to instance variables, for example:

var PiD2 = pi() / 2;


You cannot do this in PHP. Only simple constant values can be assigned to instance variables in the class declaration. Hopefully this will be addressed in PHP 5.

Note that the code in Listing 1 does not define any instance variables. This is because, in the JavaScript version, they are dynamically assigned values.
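One common way around this restriction is to assign computed values in the constructor instead of in the property declaration. The sketch below is an illustration only; the class name TrigConstants is hypothetical, and the __construct syntax assumes PHP 5 or later rather than the PHP 4 style used in this article's listings.

```php
<?php
// Hypothetical class for illustration; it is not part of the Distribution class.
class TrigConstants
{
    // A property default must be a constant expression, so a function call
    // such as pi() / 2 is not allowed in the declaration itself.
    public $piD2;

    public function __construct()
    {
        // Computed values can be assigned while the object is constructed.
        $this->piD2 = pi() / 2;
    }
}

$c = new TrigConstants();
echo $c->piD2, "\n";
```

The Distribution class in Listing 1 takes a different route and simply calls pi() / 2 inline where the value is needed.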

Listing 1. Implementing probability functions

<?php

// Distribution.php

// Copyright John Pezzullo
// Released under same terms as PHP.
// PHP Port and OO'fying by Paul Meagher

class Distribution {

  // Series summation shared by all of the distribution functions.
  function doCommonMath($q, $i, $j, $b) {

    $zz = 1;
    $z  = $zz;
    $k  = $i;

    while ($k <= $j) {
      $zz = $zz * $q * $k / ($k - $b);
      $z  = $z + $zz;
      $k  = $k + 2;
    }
    return $z;
  }

  // Probability of a T value at least as extreme as $t
  // with $df degrees of freedom.
  function getStudentT($t, $df) {

    $t  = abs($t);
    $w  = $t / sqrt($df);
    $th = atan($w);

    if ($df == 1) {
      return 1 - $th / (pi() / 2);
    }

    $sth = sin($th);
    $cth = cos($th);

    if (($df % 2) == 1) {
      return 1 - ($th + $sth * $cth * $this->doCommonMath($cth * $cth, 2, $df - 3, -1))
                 / (pi() / 2);
    } else {
      return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
    }

  }

  // Finds the T value whose probability is $p by repeatedly halving
  // the step size $dv until it drops below 1e-6.
  function getInverseStudentT($p, $df) {

    $v  = 0.5;
    $dv = 0.5;
    $t  = 0;

    while ($dv > 1e-6) {
      $t  = (1 / $v) - 1;
      $dv = $dv / 2;
      if ($this->getStudentT($t, $df) > $p) {
        $v = $v - $dv;
      } else {
        $v = $v + $dv;
      }
    }
    return $t;
  }

  function getFisherF($f, $n1, $n2) {
    // Implemented but not shown
  }

  function getInverseFisherF($p, $n1, $n2) {
    // Implemented but not shown
  }

}
?>
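The search used by getInverseStudentT() can be tried in isolation. The sketch below restates it as standalone functions so that it runs without the Distribution class; the names tailProbabilityDf1 and invertTail are assumptions, and only the one-degree-of-freedom case of getStudentT() is reproduced.

```php
<?php
// Student T probability for 1 degree of freedom, taken from the
// $df == 1 branch of getStudentT() in Listing 1.
function tailProbabilityDf1(float $t): float
{
    return 1 - atan(abs($t)) / (pi() / 2);
}

// Same search as getInverseStudentT(), written against any decreasing
// tail function. $v in (0, 1) maps to $t = 1/$v - 1 in (0, infinity).
function invertTail(callable $tail, float $p): float
{
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0.0;
    while ($dv > 1e-6) {
        $t  = (1 / $v) - 1;
        $dv = $dv / 2;
        if ($tail($t) > $p) {
            $v -= $dv;   // probability still too large: move toward larger t
        } else {
            $v += $dv;   // probability too small: move toward smaller t
        }
    }
    return $t;
}

$t = invertTail('tailProbabilityDf1', 0.05);
printf("t = %.3f\n", $t);
```

Searching over $v rather than $t keeps the search interval bounded even though T values themselves have no upper limit.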




Output methods
Now that you have implemented probability functions in PHP, the only remaining challenge in developing a PHP-based data research tool is to design a way to display the analysis results.

A simple solution is to display the values of all instance variables on the screen as needed. In the first article, I did just that when I showed the linear equation, T value, and T probability for the Burnout Study. Being able to access a specific value for a specific purpose is helpful, and SimpleLinearRegression supports this usage.

However, another way to output results is to systematically group parts of the output. If you study the output of the major statistical packages used for regression analysis, you will find that they tend to group output in the same way. They often include a Summary Table, an Analysis of Variance table, a Parameter Estimates table, and R Values. Similarly, I created output methods with the following names:

showTableSummary()
showAnalysisOfVariance()
showParameterEstimates()
showRValues()

I also have a method for displaying the linear prediction formula (getFormula()). Many statistical packages do not output the formula, but expect users to construct it from the output of the methods listed above. This is partly because the final form of the formula you use to model your data may differ from the default formula for the following reasons:

1. The Y-axis intercept has no meaningful interpretation.
2. The input values may have been transformed, and you may need to undo the transformation to obtain the final interpretation.

All these methods assume that the output medium is a web page. Because you may want to output these summary values to a medium other than a web page, I decided to wrap these output methods in a class that inherits from the SimpleLinearRegression class. The code in Listing 2 demonstrates the general logic of the output class. To make that logic stand out, the code implementing the various show methods has been removed.

Listing 2. Demonstrating the general logic of the output class

<?php

// HTML.php

// Copyright 2003, Paul Meagher
// Distributed under GPL

include_once "slr/SimpleLinearRegression.php";

class SimpleLinearRegressionHTML extends SimpleLinearRegression {

  // PHP 4-style constructor: delegates to the parent class constructor.
  function SimpleLinearRegressionHTML($X, $Y, $conf_int) {
    SimpleLinearRegression::SimpleLinearRegression($X, $Y, $conf_int);
  }

  function showTableSummary($x_name, $y_name) { }

  function showAnalysisOfVariance() { }

  function showParameterEstimates() { }

  function showFormula($x_name, $y_name) { }

  function showRValues() { }
}

?>




The constructor of this class is merely a wrapper around the SimpleLinearRegression class constructor. This means that if you want to display HTML output from a SimpleLinearRegression analysis, you should instantiate the SimpleLinearRegressionHTML class rather than instantiating the SimpleLinearRegression class directly. The advantages are that the SimpleLinearRegression class is not cluttered with many unused methods, and that classes for other output media can be defined more freely (perhaps implementing the same API for different media types).
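To illustrate the idea of one API across several output media, here is a minimal sketch of a plain-text counterpart. Everything in it is hypothetical: RegressionStandIn only stands in for SimpleLinearRegression so that the example runs on its own, its property names are assumptions rather than the real class's members, and the method returns a string instead of echoing it.

```php
<?php
// Stand-in for SimpleLinearRegression so this sketch runs on its own.
// The property names below are assumptions, not the real class's API.
class RegressionStandIn
{
    public $slope;
    public $yIntercept;

    public function __construct(float $slope, float $yIntercept)
    {
        $this->slope = $slope;
        $this->yIntercept = $yIntercept;
    }
}

// Hypothetical plain-text counterpart to SimpleLinearRegressionHTML:
// same method name, different output medium.
class RegressionText extends RegressionStandIn
{
    public function showFormula(string $xName, string $yName): string
    {
        return sprintf("%s = %.2f * %s + %.2f\n",
            $yName, $this->slope, $xName, $this->yIntercept);
    }
}

$report = new RegressionText(1.9, 0.0);
echo $report->showFormula('distance', 'loss');
```

A caller written against showFormula() would not need to change when switching between an HTML subclass and a text subclass.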

Graphic output
The output methods implemented so far display summary values in HTML format. It is also appropriate to display a scatter plot or line plot of the data in GIF, JPEG, or PNG format.

Rather than write the code for generating line and scatter plots myself, I thought it best to use a PHP-based graphics library named JpGraph. JpGraph is being actively developed by Johan Persson, whose project website describes it as follows:


JpGraph makes it easy to draw both "quick and dirty" graphs with a minimum of code and complex, professional graphs that require very fine-grained control. JpGraph is equally well suited to scientific and business graphs.

The JpGraph distribution includes a large number of sample scripts that can be customized to specific needs. Using JpGraph for the data research tool was as simple as finding a sample script that did something similar to what I needed and adapting it to my specific needs.

The script in Listing 3 is extracted from the sample data research tool (explore.php). It demonstrates how to call the library and how to feed data from the SimpleLinearRegression analysis into the Line and Scatter classes. The comments in this code were written by Johan Persson (the JpGraph code base is well documented).

Listing 3. Detail from the sample data research tool explore.php

<?php
// Snippet extracted from explore.php script

include("jpgraph/jpgraph.php");
include("jpgraph/jpgraph_scatter.php");
include("jpgraph/jpgraph_line.php");

// Create the graph
$graph = new Graph(300, 200, 'auto');
$graph->SetScale("linlin");

// Setup title
$graph->title->Set("$title");
$graph->img->SetMargin(50, 20, 20, 40);
$graph->xaxis->SetTitle("$x_name", "center");
$graph->yaxis->SetTitleMargin(30);
$graph->yaxis->title->Set("$y_name");

$graph->title->SetFont(FF_FONT1, FS_BOLD);

// Make sure that the X-axis is always at the
// bottom of the plot and not just at Y=0 which is
// the default position
$graph->xaxis->SetPos('min');

// Create the scatter plot with some nice colors
$sp1 = new ScatterPlot($slr->Y, $slr->X);
$sp1->mark->SetType(MARK_FILLEDCIRCLE);
$sp1->mark->SetFillColor("red");
$sp1->SetColor("blue");
$sp1->SetWeight(3);
$sp1->mark->SetWidth(4);

// Create the regression line
$lplot = new LinePlot($slr->PredictedY, $slr->X);
$lplot->SetWeight(2);
$lplot->SetColor('navy');

// Add the plots to the graph
$graph->Add($sp1);
$graph->Add($lplot);

// ... and stroke
$graph_name = "temp/test.png";
$graph->Stroke($graph_name);
?>
<img src='<?php echo $graph_name ?>' vspace='15'>


Data research script
The data research tool consists of a single script (explore.php) that calls methods of the SimpleLinearRegressionHTML class and the JpGraph library.

The script uses simple processing logic. The first part of the script performs basic validation on the submitted form data. If the form data passes validation, the second part of the script is executed.

The second part of the script contains the code that analyzes the data and displays the summary results in HTML and graphic formats. Listing 4 shows the basic structure of the explore.php script:

Listing 4. Structure of explore.php

<?php

// explore.php

if (!empty($x_values)) {
  $X    = explode(",", $x_values);
  $numX = count($X);
}

if (!empty($y_values)) {
  $Y    = explode(",", $y_values);
  $numY = count($Y);
}

// Display data entry form if variables not set

if ( (empty($title)) OR (empty($x_name)) OR (empty($x_values)) OR
     (empty($y_name)) OR (empty($conf_int)) OR (empty($y_values)) OR
     ($numX != $numY) ) {

  // Omitted code for displaying entry form

} else {

  include_once "slr/SimpleLinearRegressionHTML.php";
  $slr = new SimpleLinearRegressionHTML($X, $Y, $conf_int);

  echo "$title";

  $slr->showTableSummary($x_name, $y_name);
  echo "<br><br>";

  $slr->showAnalysisOfVariance();
  echo "<br><br>";

  $slr->showParameterEstimates($x_name, $y_name);
  echo "<br>";

  $slr->showFormula($x_name, $y_name);
  echo "<br><br>";

  $slr->showRValues($x_name, $y_name);
  echo "<br>";

  include("jpgraph/jpgraph.php");
  include("jpgraph/jpgraph_scatter.php");
  include("jpgraph/jpgraph_line.php");

  // The code for displaying the graphics is inline in the
  // explore.php script. The code for these two line plots
  // finishes off the script:

  // Omitted code for displaying scatter plus line plot
  // Omitted code for displaying residuals plot

}

?>
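Listing 4 relies on form fields being available as global variables (the register_globals setting, which was removed in PHP 5.4), and it only checks that the two lists have the same length. The sketch below shows one way the validation step might be written against an explicit array such as $_POST. The function names parseValues and readObservations are hypothetical and the sample values are made up.

```php
<?php
// Hypothetical helper: turns "1.5, 2, 3" into [1.5, 2.0, 3.0],
// or returns null if any entry is not numeric.
function parseValues(string $csv): ?array
{
    $values = [];
    foreach (explode(',', $csv) as $item) {
        $item = trim($item);
        if (!is_numeric($item)) {
            return null;
        }
        $values[] = (float) $item;
    }
    return $values;
}

// Returns [X, Y] when both lists are valid and equal in length, else null.
function readObservations(array $form): ?array
{
    $X = parseValues($form['x_values'] ?? '');
    $Y = parseValues($form['y_values'] ?? '');
    if ($X === null || $Y === null || count($X) !== count($Y)) {
        return null;
    }
    return [$X, $Y];
}

// In a web request, $_POST would be passed instead of this made-up array.
$observations = readObservations([
    'x_values' => '1, 2, 3',
    'y_values' => '2.5, 4.1, 6.2',
]);
var_dump($observations !== null);
```

When readObservations() returns null, the script would fall back to displaying the entry form, as in the first branch of Listing 4.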


Fire loss study
To demonstrate how to use the data research tool, I will use data from a hypothetical fire loss study. This study relates the amount of fire damage in major residential areas to their distance from the nearest fire station. Insurance companies, for example, would be interested in this relationship for the purpose of determining insurance premiums.

The data for the study is shown in the input screen in Figure 1.

Figure 1. Input screen showing the study data



After the data is submitted, it is analyzed and the results are displayed. The first result set displayed is the Table Summary, as shown in Figure 2.

Figure 2. Table Summary is the first result set displayed



The Table Summary displays the input data in tabular form, along with additional columns showing the predicted Y value for each observed X value, the difference between the observed and predicted Y values, and the lower and upper limits of the confidence interval for the predicted Y.

Figure 3 shows the three high-level data summary tables that follow the Table Summary.

Figure 3. Three high-level data summary tables that follow the Table Summary



The Analysis of Variance table shows how the variation in the Y values can be partitioned into two major sources: variance explained by the model (see the Model row) and variance the model cannot explain (see the Error row). A large F value means that the linear model captures most of the variation in the Y values. This table is more useful in a multiple regression setting, where each independent variable has its own row in the table.
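The partition behind this table can be sketched in a few lines. The code below is an illustration using the standard sums-of-squares definitions, not the body of showAnalysisOfVariance(); the function name analysisOfVariance and the sample data are assumptions.

```php
<?php
// Hypothetical helper: partitions the variation in Y for a fitted line
// y = m*x + b into model and error sums of squares and computes F.
function analysisOfVariance(array $X, array $Y, float $m, float $b): array
{
    $n = count($Y);
    $meanY = array_sum($Y) / $n;

    $ssModel = 0.0;
    $ssError = 0.0;
    foreach ($X as $i => $x) {
        $predicted = $m * $x + $b;
        $ssModel  += ($predicted - $meanY) ** 2;   // explained by the line
        $ssError  += ($Y[$i] - $predicted) ** 2;   // left unexplained
    }
    // Simple linear regression: 1 model degree of freedom, n - 2 for error.
    $f = ($ssModel / 1) / ($ssError / ($n - 2));

    return ['ssModel' => $ssModel, 'ssError' => $ssError, 'f' => $f];
}

// Made-up data together with its least squares line y = 1.9x + 0.
$table = analysisOfVariance([1, 2, 3, 4], [2, 4, 5, 8], 1.9, 0.0);
printf("SS model = %.2f, SS error = %.2f, F = %.2f\n",
    $table['ssModel'], $table['ssError'], $table['f']);
```

The larger the model sum of squares is relative to the error sum of squares, the larger F becomes.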

The Parameter Estimates table shows the estimated Y-axis intercept (Intercept) and slope (Slope). Each row includes a T value and the probability of observing a T value that extreme (see the Prob > T column). The Prob > T value for the slope can be used to decide whether to reject the linear model.

If the probability of a T value is less than 0.05 (or a similarly small probability), you can reject the null hypothesis, because such an extreme value is unlikely to be observed by chance. Otherwise, you must retain the null hypothesis.

In the fire loss study, the probability of obtaining a T value of 12.57 by chance is less than 0.00000 (effectively zero at the displayed precision). This means that the linear model is a useful predictor of Y values (better than the mean of the Y values) over the range of X values observed in this study.

The final report shows the correlation coefficients, or R values. They can be used to assess how well the linear model fits the data. A high R value indicates a good fit.
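For reference, a correlation coefficient of this kind can be computed as in the sketch below. It uses the standard Pearson formula and is not the body of showRValues(); the function name correlation and the sample data are assumptions, as is the guess that the report also includes the squared value.

```php
<?php
// Hypothetical helper: Pearson correlation coefficient between X and Y.
function correlation(array $X, array $Y): float
{
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = $sxx = $syy = 0.0;
    foreach ($X as $i => $x) {
        $dx = $x - $meanX;
        $dy = $Y[$i] - $meanY;
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    return $sxy / sqrt($sxx * $syy);
}

// Made-up sample data, for illustration only.
$r = correlation([1, 2, 3, 4], [2, 4, 5, 8]);
printf("R = %.4f, R squared = %.4f\n", $r, $r * $r);
```

In simple linear regression, the squared value equals the share of the variation in Y that the fitted line explains.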

Each summary report provides answers to various analytical questions about the relationship between the linear model and the data. See the textbooks by Hamilton, Neter, or Pedhazur for more advanced treatments of regression analysis (see references).

The final report elements to be displayed are the scatter plot and line plot of the data, as shown in Figure 4.

Figure 4. Final report elements: scatter plot and line plot



Most people are familiar with reading line plots (such as the first graph in this series), so I will not comment on it other than to say that the JpGraph library can generate high-quality scientific graphs for the Web. It also does a good job when you feed it scatter or line data.

The second plot relates the residuals (observed Y minus predicted Y) to your predicted Y values. This is an example of a plot used by advocates of Exploratory Data Analysis (EDA) to help maximize the analyst's ability to detect and understand patterns in data. Experts can use this plot to answer questions about:

Possible outliers or overly influential cases
Possible curvilinear relationships (use a transformation?)
Non-normal residual distribution
Non-constant error variance, or heteroscedasticity
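The points shown in a residual plot of this kind can be computed as in the following sketch. It is an illustration only: the function name residualPoints and the sample data are assumptions, and the code is not taken from explore.php.

```php
<?php
// Hypothetical helper: pairs each predicted Y with its residual
// (observed Y minus predicted Y) for a fitted line y = m*x + b.
function residualPoints(array $X, array $Y, float $m, float $b): array
{
    $points = [];
    foreach ($X as $i => $x) {
        $predicted = $m * $x + $b;
        $points[] = [
            'predicted' => $predicted,
            'residual'  => $Y[$i] - $predicted,
        ];
    }
    return $points;
}

// Made-up data together with its least squares line y = 1.9x + 0.
$points = residualPoints([1, 2, 3, 4], [2, 4, 5, 8], 1.9, 0.0);
foreach ($points as $p) {
    printf("predicted %.1f, residual %+.1f\n", $p['predicted'], $p['residual']);
}
```

Plotting the residual against the predicted value for each point gives the kind of picture in which the patterns listed above would show up.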

This data research tool can easily be extended to generate more types of graphs (histograms, box plots, and quartile plots), all of which are standard EDA tools.

Math library architecture
My hobbyist interest in mathematics has kept me interested in math libraries over the last few months. Exploring like this prompts me to think about how to organize my code base so that it can grow as anticipated in the future.

For the time being, I am using the directory structure in Listing 5:

Listing 5. A directory structure that is easy to grow

phpmath/

  burnout_study.php
  explore.php
  fire_study.php
  navbar.php

  dist/
    Distribution.php
    fisher.php
    student.php
    source.php

  jpgraph/
    etc...

  slr/
    SimpleLinearRegression.php
    SimpleLinearRegressionHTML.php

  temp/




For example, in future work on multiple regression, this library will be extended to include a matrix directory to hold PHP code for performing matrix operations (a requirement for more advanced forms of regression analysis). I will also create an mr directory to hold the PHP code that implements the input methods, logic, and output methods for multiple regression analysis.

Note that the directory structure contains a temp directory. You must set the permissions on this directory so that the explore.php script can write its output graphs to it. Keep this in mind when you try to install the phpmath_002.tar.gz source code. Also, read the instructions for installing JpGraph on the JpGraph project website (see references).

Finally, you can move all the software classes to a directory outside the Web root if you:


Set a global PHP_MATH constant to the location outside the Web root, and
Make sure that this defined constant is prefixed to all required or included file paths.

In the future, the PHP_MATH setting will be made through a configuration file for the entire PHP math library.
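The two steps above might look like the following fragment. The directory shown is a hypothetical location chosen for illustration, and the fragment only runs once the library files actually exist there.

```php
<?php
// Hypothetical location outside the Web root; adjust to your installation.
define('PHP_MATH', '/usr/local/lib/phpmath/');

// Prefix every required or included path with the constant.
require_once PHP_MATH . 'slr/SimpleLinearRegressionHTML.php';
require_once PHP_MATH . 'dist/Distribution.php';
```

Keeping the path in one constant means that relocating the library requires changing a single line.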

What have you learned?
In this article, you learned how to use the SimpleLinearRegression class to develop a data research tool for small and medium-sized datasets. Along the way, I also developed native probability functions for use with the SimpleLinearRegression class, and extended the class with HTML output methods and graph generation code based on the JpGraph library.

From a learning perspective, simple linear regression modeling is worth further study, because it has proven to be the gateway to understanding more advanced forms of statistical modeling. A solid understanding of simple linear regression will serve you well when you move on to more advanced techniques (such as multiple regression or multivariate analysis of variance).

Even though simple linear regression uses only one variable to describe or predict the variation in another variable, looking for simple linear relationships among all the study variables is often the first step in exploratory data analysis. Just because the data is multivariate does not mean that it must be studied using multivariate tools. In fact, starting with a basic tool such as simple linear regression is a good way to begin exploring the patterns in the data.

This series examined two applications of simple linear regression analysis. In this article, I studied the strong linear relationship between "distance to the fire station" and "fire loss". In the first article, I studied the linear relationship between "social concentration" and a measure called the "burnout index"; this relationship is weaker, but still significant. (As an exercise, you may find it interesting to use the data research tool discussed in this article to re-examine the noisier data from the first study. You may notice that the y-axis intercept is negative, which means that when "social concentration" is 0, the predicted burnout index is -29.50. Does this make sense? When modeling a phenomenon, you should ask yourself whether the equation should include a y-axis intercept at all and, if so, what role the intercept plays in the linear equation.)

Further study of simple linear regression might include these topics:


* When you might want to omit the intercept from your equation, and the alternative calculation formulas that can be used
* When and how to use power, logarithmic, and other transformations to linearize data so that simple linear regression can be used to model it
* Other visualization methods that can be used to assess the adequacy of your modeling assumptions and give you clearer insight into patterns in the data

These are some of the more advanced topics for students who want to study simple linear regression further. The references contain links to articles on these advanced topics.
