Simple linear regression implemented using PHP. In Part 1 of this two-part series ("Simple linear regression with PHP"), I explained why a math library is useful to PHP. I also demonstrated how to develop and implement a simple linear regression algorithm using PHP as the implementation language.
The goal of this article is to show you how to use the SimpleLinearRegression class discussed in Part 1 to build an important data research tool.
Brief review: Concept
The basic goal behind simple linear regression modeling is to find the best-fitting straight line through a two-dimensional plane of paired X and Y values (that is, X and Y measurements). Once this line has been found using the least squares method, various statistical tests can be performed to determine how far the observed Y values deviate from the line.
A linear equation (y = mx + b) has two parameters that must be estimated from the X and Y data provided: the slope (m) and the y-axis intercept (b). Once these two parameters are estimated, you can enter observed X values into the linear equation and examine the predicted Y values the equation generates.
To estimate the m and b parameters with the least squares method, we need to find the estimates of m and b that minimize the difference between the observed and predicted Y values across all X values. The difference between an observed value and its predicted value is known as the error, y_i - (mx_i + b). If we square each error value and then sum these squared residuals, the result is a number called the squared error of prediction. Using the least squares method to determine the best-fitting line means finding the estimates of m and b that give the smallest squared error of prediction.
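As a rough illustration of the quantity being minimized, the following sketch computes the sum of squared errors for a candidate line. The function name and the sample data are my own assumptions (they are not part of the SimpleLinearRegression class), and the code is written for PHP 7 or later.

```php
<?php
// Sum of squared errors for a candidate line y = m*x + b.
// sumSquaredErrors() and the data below are illustrative assumptions.
function sumSquaredErrors(array $x, array $y, float $m, float $b): float
{
    $sse = 0.0;
    foreach ($x as $i => $xi) {
        $error = $y[$i] - ($m * $xi + $b); // observed minus predicted
        $sse += $error * $error;
    }
    return $sse;
}

$x = [1, 2, 3, 4];
$y = [3, 5, 7, 9]; // these points lie exactly on y = 2x + 1

echo sumSquaredErrors($x, $y, 2.0, 1.0), "\n"; // 0: the line passes through every point
echo sumSquaredErrors($x, $y, 1.0, 1.0), "\n"; // 30: a worse candidate line
```

The least squares estimates are the values of m and b for which this function returns its smallest possible value.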
Two basic methods can be used to find the estimates of m and b that satisfy the least squares criterion. In the first method, you use a numerical search procedure that tries different values of m and b and evaluates them, eventually settling on the estimates that produce the minimum squared error. The second method uses calculus to derive equations for estimating m and b. I am not going to discuss the calculus involved in deriving these equations in depth, but I did use these analytical equations in the SimpleLinearRegression class to find the least squares estimates of m and b (see the getSlope() and getYIntercept() methods in the SimpleLinearRegression class).
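The analytical route can be sketched with the standard textbook formulas: the slope is the sum of cross-deviations divided by the sum of squared X deviations, and the intercept follows from the means. This is a minimal sketch under that assumption; leastSquaresFit() is a hypothetical helper, not the code of getSlope() or getYIntercept().

```php
<?php
// Closed-form least squares estimates for y = m*x + b.
// leastSquaresFit() is a hypothetical helper, not part of SimpleLinearRegression.
function leastSquaresFit(array $x, array $y): array
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two equal-length arrays with at least 2 points.');
    }
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0; // sum of (x - meanX) * (y - meanY)
    $sxx = 0.0; // sum of (x - meanX)^2
    foreach ($x as $i => $xi) {
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
        $sxx += ($xi - $meanX) ** 2;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All X values are identical; the slope is undefined.');
    }

    $m = $sxy / $sxx;           // slope
    $b = $meanY - $m * $meanX;  // y-axis intercept
    return ['slope' => $m, 'intercept' => $b];
}

$fit = leastSquaresFit([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("m = %.2f, b = %.2f\n", $fit['slope'], $fit['intercept']); // m = 0.60, b = 2.20
```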
Even though you have equations for finding the least squares estimates of m and b, it does not follow that plugging these parameters into the linear equation gives a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining squared error of prediction is acceptable.
We can use a statistical decision procedure to evaluate the alternative hypothesis that a straight line fits the data. The procedure is based on computing a T statistic and using a probability function to find the probability of observing a value that large by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values, one of which is the T statistic, which can be used to measure how well the linear equation fits the data. If the fit is good, the T statistic tends to be large. If the T value is small, you should replace your linear equation with a default model that assumes the mean of the Y values is the best predictor (because the mean of a set of values is often a useful predictor of the next observed value).
To test whether the T statistic is large enough to reject the mean of Y as the best predictor, you need to compute the probability of obtaining that T statistic by chance. If the probability is low, you can reject the null hypothesis that the mean is the best predictor and be correspondingly more confident that a simple linear model fits the data well. (For more information about calculating the probability of a T statistic, see Part 1.)
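To make the T statistic more concrete, here is a minimal sketch using the common textbook definition for the slope: t = m / SE(m), where SE(m) = sqrt((SSE / (n - 2)) / Sxx). I am assuming this definition for illustration; Part 1 should be consulted for exactly how the SimpleLinearRegression class computes its value. The function name is hypothetical.

```php
<?php
// Textbook t statistic for the slope of a simple linear regression.
// slopeTStatistic() is a hypothetical helper; the formula is an assumption
// about the usual definition, not a copy of the class's code.
function slopeTStatistic(array $x, array $y): float
{
    $n = count($x);
    if ($n < 3 || $n !== count($y)) {
        throw new InvalidArgumentException('Need at least 3 paired observations.');
    }
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxx = 0.0;
    $sxy = 0.0;
    foreach ($x as $i => $xi) {
        $sxx += ($xi - $meanX) ** 2;
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All X values are identical.');
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    // Sum of squared errors around the fitted line
    $sse = 0.0;
    foreach ($x as $i => $xi) {
        $sse += ($y[$i] - ($m * $xi + $b)) ** 2;
    }
    if ($sse == 0.0) {
        throw new RangeException('Perfect fit: the t statistic is unbounded.');
    }

    $stdErr = sqrt(($sse / ($n - 2)) / $sxx); // standard error of the slope
    return $m / $stdErr;
}

// With 5 observations the statistic has n - 2 = 3 degrees of freedom.
printf("t = %.2f\n", slopeTStatistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])); // t = 2.12
```

The resulting value, together with its degrees of freedom, is what gets passed to a Student T probability function to decide whether the slope differs from zero by more than chance would explain.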
To return to the statistical decision procedure: it tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the alternative hypothesis of a linear model has to be established through both theoretical and statistical arguments.
The data research tool you will build implements a statistical decision procedure for linear models (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish a linear model. It can be classified as a decision-support tool for knowledge workers who study small to medium-sized data sets.
From a learning perspective, simple linear regression modeling is worth studying because it is the gateway to understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression lay the groundwork for understanding multiple regression, factor analysis, and time series analysis.
Simple linear regression is also a versatile modeling technique. You can use it to model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations can linearize the data so that it can be modeled with simple linear regression. The resulting linear model is expressed as a linear formula in terms of the transformed values.
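As an example of such a transformation, the sketch below fits an exponential curve y = a * exp(k * x) by regressing ln(y) on x, so that the slope estimates k and the intercept estimates ln(a). The data are synthetic and the fitLine() helper is hypothetical; both are assumptions made for illustration.

```php
<?php
// Fit y = a * exp(k * x) by applying simple linear regression to ln(y).
// fitLine() is a hypothetical helper returning [slope, intercept].
function fitLine(array $x, array $y): array
{
    $n = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;
    $sxx = 0.0;
    $sxy = 0.0;
    foreach ($x as $i => $xi) {
        $sxx += ($xi - $meanX) ** 2;
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
    }
    $m = $sxy / $sxx;
    return [$m, $meanY - $m * $meanX];
}

// Synthetic curvilinear data generated from y = 2 * exp(0.5 * x)
$x = [0, 1, 2, 3, 4];
$y = array_map(function ($xi) {
    return 2.0 * exp(0.5 * $xi);
}, $x);

// The log transform makes the relationship linear: ln(y) = ln(a) + k * x
$logY = array_map('log', $y);
[$k, $lnA] = fitLine($x, $logY);
$a = exp($lnA); // convert the intercept back to the original scale

printf("y = %.2f * exp(%.2f * x)\n", $a, $k); // y = 2.00 * exp(0.50 * x)
```

Note that the fitted formula is linear in ln(y), so predictions have to be converted back with exp() before they are compared with the original measurements.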
Probability Functions
In the previous article, I sidestepped the problem of implementing probability functions in PHP by using R to obtain the probability values. I was not completely satisfied with that solution, so I began to look into what would be required to develop PHP-based probability functions.
I started searching online for information and code. One source was the probability functions in the book Numerical Recipes in C. I re-implemented some of the probability function code (the gammln.c and betai.c functions) in PHP, but I was still not satisfied with the result. Compared with some other implementations, the code seemed somewhat bulky. I also needed inverse probability functions.
Fortunately, I came across John Pezzullo's Interactive Statistical Calculation pages. John's page on probability distribution functions has all the functions I needed, implemented in JavaScript to make them easy to learn from.
I ported the Student T and Fisher F functions to PHP. I changed the API slightly to conform to the Java naming style and embedded all the functions in a class named Distribution. A nice feature of this implementation is the doCommonMath method, which all the functions in the library reuse. Other tests that I did not take the time to implement (the normality and chi-square tests) also use the doCommonMath method.
Another aspect of this port is worth noting. In JavaScript, you can assign a dynamically determined value to an instance variable, for example:
var PiD2 = pi() / 2
You cannot do this in PHP. Only simple constant values can be assigned to instance variables. Hopefully this will be addressed in PHP5.
Note that the code in Listing 1 does not define the instance variables; this is because they are dynamically assigned in the JavaScript version.
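A common workaround is to assign computed values in the constructor, because a property default has to be a constant expression and cannot contain a function call. The sketch below shows this pattern; the TrigConstants class and its property names are hypothetical and are not part of the Distribution class.

```php
<?php
// TrigConstants is a hypothetical class used only to show the pattern.
class TrigConstants
{
    // "public $piD2 = pi() / 2;" would be a compile error, because a
    // property default cannot contain a function call.
    public $piD2;

    // A constant expression is accepted as a default in PHP 5.6 and later.
    public $halfPi = M_PI / 2;

    public function __construct()
    {
        // Dynamically determined values can be assigned here instead.
        $this->piD2 = pi() / 2;
    }
}

$constants = new TrigConstants();
var_dump($constants->piD2 === $constants->halfPi); // bool(true)
```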
Listing 1. Implementing probability functions
    // ... (the start of this listing, through the first part of
    // getStudentT(), is missing from the source)
          ... $this->doCommonMath($cth * $cth, 2, $df - 3, -1) / (pi() / 2);
        } else {
          return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
        }
      }

      function getInverseStudentT($p, $df) {
        $v  = 0.5;
        $dv = 0.5;
        $t  = 0;
        // Bisection search on $v, with $t = (1 / $v) - 1
        while ($dv > 1e-6) {
          $t  = (1 / $v) - 1;
          $dv = $dv / 2;
          if ($this->getStudentT($t, $df) > $p) {
            $v = $v - $dv;
          } else {
            $v = $v + $dv;
          }
        }
        return $t;
      }

      function getFisherF($f, $n1, $n2) {
        // implemented but not shown
      }

      function getInverseFisherF($p, $n1, $n2) {
        // implemented but not shown
      }

    }
    ?>
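For readers who want something they can run, here is a self-contained sketch of a two-tailed Student T probability and its inverse. It uses the well-known finite trigonometric series for the t distribution and the same bisection idea that is visible in getInverseStudentT above. All function names are hypothetical, and seriesSum() is only my assumption about the role doCommonMath plays, since its body is not shown in the listing.

```php
<?php
// Finite series helper: 1 + sum of terms built up for k = i, i+2, ..., j.
// (Assumed to play the role of doCommonMath; not the author's code.)
function seriesSum(float $q, int $i, int $j, int $b): float
{
    $term = 1.0;
    $sum  = 1.0;
    for ($k = $i; $k <= $j; $k += 2) {
        $term = $term * $q * $k / ($k - $b);
        $sum += $term;
    }
    return $sum;
}

// Two-tailed probability P(|T| >= |t|) for $df degrees of freedom.
function studentTTwoTailed(float $t, int $df): float
{
    $theta = atan(abs($t) / sqrt($df));
    if ($df === 1) {
        return 1 - $theta / (M_PI / 2);
    }
    $s = sin($theta);
    $c = cos($theta);
    if ($df % 2 === 1) {
        // Odd degrees of freedom
        return 1 - ($theta + $s * $c * seriesSum($c * $c, 2, $df - 3, -1)) / (M_PI / 2);
    }
    // Even degrees of freedom
    return 1 - $s * seriesSum($c * $c, 1, $df - 3, -1);
}

// Inverse: the t value whose two-tailed probability is $p.
// Bisection on v in (0, 1), where t = 1/v - 1 maps v to (0, infinity).
function inverseStudentT(float $p, int $df): float
{
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0.0;
    while ($dv > 1e-6) {
        $t = 1 / $v - 1;
        $dv /= 2;
        if (studentTTwoTailed($t, $df) > $p) {
            $v -= $dv; // probability too high: move toward larger t
        } else {
            $v += $dv; // probability too low: move toward smaller t
        }
    }
    return $t;
}

printf("p = %.4f\n", studentTTwoTailed(1.0, 1));  // p = 0.5000
printf("t = %.3f\n", inverseStudentT(0.05, 5));
```

With one degree of freedom the t distribution is the Cauchy distribution, which is why a t value of 1 gives a two-tailed probability of exactly one half; that makes a convenient sanity check.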
Graphic Output
The output methods you have implemented so far display summary values in HTML format. It is also appropriate to display scatter plots or line plots of the data in GIF, JPEG, or PNG format.
Rather than writing the code to generate line and scatter plots myself, I thought it best to use a PHP-based graphics library called JpGraph. JpGraph is being actively developed by Johan Persson, whose project website describes it as follows:
JpGraph makes it easy to draw both "quick and dirty" graphs with a minimum of code and complex, professional graphs that require very fine-grained control. JpGraph is equally well suited to scientific and business graphs.
The JpGraph distribution includes a large number of sample scripts that can be customized to specific needs. Using JpGraph for the data research tool was simple: I only had to find a sample script that did something similar to what I needed and then adapt it.
The script in Listing 3 is extracted from the sample data research tool (explore.php) and demonstrates how to call the library and how to feed data from the SimpleLinearRegression analysis into the Line and Scatter classes. The comments in this code were written by Johan Persson (the JpGraph code base is well documented).
Listing 3. Details of functions from the sample data research tool explore.php
    // ... (the start of this listing, including the creation of $graph,
    // is missing from the source)
    $graph->SetScale("linlin");

    // Setup title
    $graph->title->Set("$title");
    $graph->img->SetMargin(50, 20, 20, 40);
    $graph->xaxis->SetTitle("$x_name", "center");
    $graph->yaxis->SetTitleMargin(30);
    $graph->yaxis->title->Set("$y_name");
    $graph->title->SetFont(FF_FONT1, FS_BOLD);

    // make sure that the X-axis is always at the
    // bottom at the plot and not just at Y = 0 which is
    // the default position
    $graph->xaxis->SetPos('min');

    // Create the scatter plot with some nice colors
    $sp1 = new ScatterPlot($slr->Y, $slr->X);
    $sp1->mark->SetType(MARK_FILLEDCIRCLE);
    $sp1->mark->SetFillColor("red");
    $sp1->SetColor("blue");
    $sp1->SetWeight(3);
    $sp1->mark->SetWidth(4);

    // Create the regression line
    $lplot = new LinePlot($slr->PredictedY, $slr->X);
    $lplot->SetWeight(2);
    $lplot->SetColor('navy');

    // Add the plots to the graph
    $graph->Add($sp1);
    $graph->Add($lplot);

    // ... and stroke
    $graph_name = "temp/test.png";
    $graph->Stroke($graph_name);
    ?>
    <img ... vspace='15'>
Data Research script
The data research tool consists of a single script (explore.php) that calls the methods of the SimpleLinearRegressionHTML class and the JpGraph library.
The script uses simple processing logic. The first part of the script performs basic validation of the submitted form data. If the form data passes validation, the second part of the script is executed.
The second part of the script contains the code that analyzes the data and displays the summary results in HTML and graphic format. Listing 4 shows the basic structure of the explore.php script:
Listing 4. explore.php structure
    // ... (the start of this listing is missing from the source, and the
    // markup inside each echo string below was lost as well)
      $slr->showTableSummary($x_name, $y_name);
      echo "...";

      $slr->showAnalysisOfVariance();
      echo "...";

      $slr->showParameterEstimates($x_name, $y_name);
      echo "...";

      $slr->showFormula($x_name, $y_name);
      echo "...";

      $slr->showRValues($x_name, $y_name);
      echo "...";

      include("jpgraph/jpgraph.php");
      include("jpgraph/jpgraph_scatter.php");
      include("jpgraph/jpgraph_line.php");

      // The code for displaying the graphics is inline in the
      // explore.php script. The code for these two line plots
      // finishes off the script:

      // Omitted code for displaying scatter plus line plot

      // Omitted code for displaying residuals plot

    }
    ?>