Simple linear regression implemented using PHP: (1)



PHP lacks a powerful tool: a language-based math library. In this two-part series, Paul Meagher hopes to inspire PHP developers to develop and implement PHP-based math libraries by providing an example of how to develop an analytical model library. In Part 1, he demonstrates how to use PHP as the implementation language to develop the core of a simple linear regression package. In Part 2, he adds features to the package to make it a useful data analysis tool for small and medium-sized datasets.

Introduction
Compared with other open source languages (such as Perl and Python), the PHP community lacks a strong effort to develop math libraries.

One reason for this may be that a large number of mature mathematical tools already exist, which may discourage the community from developing PHP tools of its own. For example, I have studied S System, a powerful tool with an impressive set of statistical libraries designed specifically for analyzing datasets; in 1998 it won an ACM award for its language design. If S or its open source cousin R is only a shell_exec call away, why bother implementing the same statistical computing functionality in PHP? For more information about S System, its ACM award, or R, see References.

Isn't that a waste of developer energy? If the motivation for developing a PHP math library is to save developer energy and use the best tool for the job, then PHP's current direction makes sense.

On the other hand, teaching motives may encourage the development of PHP math libraries. For about 10% of people, mathematics is an interesting topic worth exploring. For those who are also proficient in PHP, developing a PHP math library can enhance the process of learning math. In other words, do not just read the chapter on T tests; also implement a class that calculates the corresponding intermediate values and displays them in a standard format.

Through guidance and practice, I hope to show that developing a PHP math library is not a very difficult task and that it can be an interesting technical and learning challenge. In this article, I provide an example of a PHP math library, named SimpleLinearRegression, that demonstrates a general approach to developing PHP math libraries. Let's start by discussing the general principles that guided my development of the SimpleLinearRegression class.

Guiding principles
I used six general principles to guide the development of the SimpleLinearRegression class.

1. Create a class for each analysis model.
2. Use backward chaining for development.
3. Expect a large number of getters.
4. Store intermediate results.
5. Prefer a verbose API.
6. Perfection is not a goal.

Let us study these guidelines one by one in more detail.

Create a class for each analysis model
Each major analysis test or process should have a PHP class with the same name as the test or process. The class contains input functions, functions for calculating intermediate and summary values, and output functions (which display the intermediate and summary values on the screen in text or graphic format).

Use backward chaining for development
In mathematical programming, the target of the coding effort is usually the standard output values that an analysis process (such as MultipleRegression, TimeSeries, or ChiSquared) is expected to generate. From a problem-solving perspective, this means you can use backward chaining to develop mathematical methods.

For example, the summary output screen displays one or more summary statistics. These summary statistics depend on the calculation of intermediate statistics, which may in turn depend on intermediate statistics at a deeper level, and so on. This backward-chaining approach to development leads to the next principle.

Expect a large number of getters
Most of the development work on a math class involves calculating intermediate and summary values. In practice, this means you should not be surprised if your class contains many getter methods that calculate intermediate and summary values.

Store intermediate results
Store intermediate calculation results in the result object so that you can use them as input for subsequent calculations. This principle is implemented in the S language design. In the current context, it is implemented by choosing instance variables to represent the calculated intermediate and summary results.
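
The first four principles can be illustrated together with a much smaller class than SimpleLinearRegression. The following is only a sketch: the class name SampleVariance and its methods are hypothetical and are not part of the author's package. Working backward from the desired summary value (the variance), each intermediate value gets its own getter, and the constructor stores every result in an instance variable so that later getters can reuse it.

```php
<?php
// Hypothetical example class; not part of the SimpleLinearRegression package.
class SampleVariance
{
    public $n;                    // number of observations
    public $X = array();          // input values
    public $Mean;                 // intermediate value
    public $SumSquaredDeviation;  // intermediate value
    public $Variance;             // summary value

    public function __construct(array $X)
    {
        $this->X = $X;
        $this->n = count($X);
        // Working backward: the variance needs the sum of squared
        // deviations, which in turn needs the mean, so compute in that order.
        $this->Mean = $this->getMean();
        $this->SumSquaredDeviation = $this->getSumSquaredDeviation();
        $this->Variance = $this->getVariance();
    }

    public function getMean()
    {
        return array_sum($this->X) / $this->n;
    }

    public function getSumSquaredDeviation()
    {
        $sum = 0.0;
        foreach ($this->X as $x) {
            $sum += pow($x - $this->Mean, 2);
        }
        return $sum;
    }

    public function getVariance()
    {
        return $this->SumSquaredDeviation / ($this->n - 1);
    }
}

$sv = new SampleVariance(array(2, 4, 4, 4, 5, 5, 7, 9));
printf("Mean: %01.2f, Variance: %01.2f\n", $sv->Mean, $sv->Variance);
```

Because every result is kept as an object property, an output routine could later display the mean and the sum of squared deviations without recomputing them.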

Prefer a verbose API
When creating a naming scheme for the member functions and instance variables of the SimpleLinearRegression class, I found that long names (such as getSumSquaredError rather than getYY2) make it easier to understand what a function does and what a variable means.

I did not abandon abbreviated names entirely. However, when I use a short name, I have to provide a comment that fully describes its meaning. In my opinion, heavily abbreviated naming schemes are common in mathematical programming, but they make it harder to understand a mathematical routine and to verify that it is proceeding correctly step by step.

Perfection is not a goal
The goal of this coding exercise is not to develop a highly optimized and rigorous math engine for PHP. In the early stages, the emphasis should be on learning how to implement meaningful analyses and tests, and on solving the problems that arise along the way.


Instance variables
When modeling a statistical test or process, you need to decide which instance variables to declare.

The choice of instance variables can be determined by listing the intermediate and summary values that the analysis process generates. Each intermediate and summary value can have a corresponding instance variable that holds the value as an object property.

I used this kind of analysis to determine which variables to declare for the SimpleLinearRegression class in Listing 1. A similar analysis can be performed for MultipleRegression, ANOVA, or TimeSeries processes.

Listing 1. Instance variables of the SimpleLinearRegression class

<?php

// Copyright 2003, Paul Meagher
// Distributed under GPL

class SimpleLinearRegression {

  var $n;
  var $X = array();
  var $Y = array();
  var $ConfInt;
  var $Alpha;
  var $XMean;
  var $YMean;
  var $SumXX;
  var $SumXY;
  var $SumYY;
  var $Slope;
  var $YInt;
  var $PredictedY = array();
  var $Error = array();
  var $SquaredError = array();
  var $TotalError;
  var $SumError;
  var $SumSquaredError;
  var $ErrorVariance;
  var $StdErr;
  var $SlopeStdErr;
  var $SlopeTVal;        // T value of Slope
  var $YIntStdErr;
  var $YIntTVal;         // T value for Y Intercept
  var $R;
  var $RSquared;
  var $DF;               // Degrees of Freedom
  var $SlopeProb;        // Probability of Slope Estimate
  var $YIntProb;         // Probability of Y Intercept Estimate
  var $AlphaTVal;        // T Value for given alpha setting
  var $ConfIntOfSlope;

  var $RPath = "/usr/local/bin/R";  // Your path here

  var $format = "%01.2f";  // Used for formatting output

}

?>


Constructor
The constructor method of the SimpleLinearRegression class accepts an X vector and a Y vector, each with the same number of values. You can also set the confidence interval for your predicted Y values; the default is 95%.

The constructor begins by verifying that the data is in a form suitable for processing. Once the input vectors pass the "equal size" and "more than one value" tests, the core of the algorithm is executed.
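
The two checks can also be sketched as a standalone function. The function name checkVectors is hypothetical, and throwing an InvalidArgumentException instead of calling die() is a design choice made for this sketch, not something taken from the author's class; an exception lets the calling script decide how to report the problem.

```php
<?php
// Hypothetical helper; the article's constructor performs these checks inline.
function checkVectors(array $X, array $Y)
{
    $numX = count($X);
    $numY = count($Y);
    if ($numX != $numY) {
        throw new InvalidArgumentException("Size of X and Y vectors must be the same.");
    }
    if ($numX <= 1) {
        throw new InvalidArgumentException("Size of input array must be at least 2.");
    }
    return $numX;  // the number of paired observations
}

try {
    checkVectors(array(1, 2, 3), array(4, 5));
} catch (InvalidArgumentException $e) {
    echo "Error: ", $e->getMessage(), "\n";
}
```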

Executing the core of the algorithm involves calling a series of getter methods that calculate the intermediate and summary values of the statistical process. The return value of each method call is assigned to an instance variable of the class. Storing the results this way makes the intermediate and summary values available to routines later in the chain of calculations, and to the output methods of the class that display them. Listing 2 shows the constructor.

Listing 2. The constructor calls the class's getter methods

<?php

// Copyright 2003, Paul Meagher
// Distributed under GPL

class SimpleLinearRegression {

  // Instance variables as declared in Listing 1 go here.

  // Named __construct so that it also runs on PHP 8, where a method
  // named after the class is no longer treated as the constructor.
  function __construct($X, $Y, $ConfidenceInterval = 95) {

    $numX = count($X);
    $numY = count($Y);

    if ($numX != $numY) {
      die("Error: Size of X and Y vectors must be the same.");
    }
    if ($numX <= 1) {
      die("Error: Size of input array must be at least 2.");
    }

    $this->n = $numX;
    $this->X = $X;
    $this->Y = $Y;

    $this->ConfInt = $ConfidenceInterval;
    $this->Alpha   = (1 + ($this->ConfInt / 100)) / 2;

    $this->XMean           = $this->getMean($this->X);
    $this->YMean           = $this->getMean($this->Y);
    $this->SumXX           = $this->getSumXX();
    $this->SumYY           = $this->getSumYY();
    $this->SumXY           = $this->getSumXY();
    $this->Slope           = $this->getSlope();
    $this->YInt            = $this->getYInt();
    $this->PredictedY      = $this->getPredictedY();
    $this->Error           = $this->getError();
    $this->SquaredError    = $this->getSquaredError();
    $this->SumError        = $this->getSumError();
    $this->TotalError      = $this->getTotalError();
    $this->SumSquaredError = $this->getSumSquaredError();
    $this->ErrorVariance   = $this->getErrorVariance();
    $this->StdErr          = $this->getStdErr();
    $this->SlopeStdErr     = $this->getSlopeStdErr();
    $this->YIntStdErr      = $this->getYIntStdErr();
    $this->SlopeTVal       = $this->getSlopeTVal();
    $this->YIntTVal        = $this->getYIntTVal();
    $this->R               = $this->getR();
    $this->RSquared        = $this->getRSquared();
    $this->DF              = $this->getDF();
    $this->SlopeProb       = $this->getStudentProb($this->SlopeTVal, $this->DF);
    $this->YIntProb        = $this->getStudentProb($this->YIntTVal, $this->DF);
    $this->AlphaTVal       = $this->getInverseStudentProb($this->Alpha, $this->DF);
    $this->ConfIntOfSlope  = $this->getConfIntOfSlope();

    return true;
  }

}

?>


The method names and their order were derived by combining backward chaining with an undergraduate statistics textbook that explains step by step how to calculate the intermediate values. I formed each method name by adding the "get" prefix to the name of the intermediate value I needed to calculate.

Fitting the model to the data
The SimpleLinearRegression process is used to produce a straight line that fits the data. A line has the following standard equation, where b is the Y-axis intercept and m is the slope:

y = b + mx

In PHP, this equation looks like Listing 3:

Listing 3. The PHP equation for fitting the model to the data

$PredictedY[$i] = $YIntercept + $Slope * $X[$i];


The SimpleLinearRegression class uses the least squares method to derive estimates of the Y-axis intercept and slope parameters. These estimated parameters are used to construct a linear equation (see Listing 3) that models the relationship between the X and Y values.

Using the derived linear equation, you can obtain the predicted Y value for each X value. If the linear equation fits the data well, the observed Y values will be close to the predicted values.
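
The least squares estimates can be sketched in a few lines. This is an illustration under stated assumptions, not the author's code: the function names meanOf, fitLine, and predictY are hypothetical, and SumXX and SumXY are assumed to be the sum of squared deviations of X from its mean and the sum of cross-products of the X and Y deviations, which are the usual textbook definitions.

```php
<?php
// Sketch of least squares estimation; all function names are hypothetical.
function meanOf(array $values)
{
    return array_sum($values) / count($values);
}

function fitLine(array $X, array $Y)
{
    $XMean = meanOf($X);
    $YMean = meanOf($Y);
    $SumXX = 0.0;  // sum of squared deviations of X from its mean
    $SumXY = 0.0;  // sum of cross-products of the X and Y deviations
    foreach ($X as $i => $x) {
        $SumXX += ($x - $XMean) * ($x - $XMean);
        $SumXY += ($x - $XMean) * ($Y[$i] - $YMean);
    }
    $Slope = $SumXY / $SumXX;
    $YInt  = $YMean - $Slope * $XMean;
    return array($YInt, $Slope);
}

function predictY(array $X, $YInt, $Slope)
{
    $PredictedY = array();
    foreach ($X as $i => $x) {
        $PredictedY[$i] = $YInt + $Slope * $x;
    }
    return $PredictedY;
}

// Points that lie exactly on the line y = 1 + 2x.
$X = array(1, 2, 3, 4);
$Y = array(3, 5, 7, 9);
list($YInt, $Slope) = fitLine($X, $Y);
$PredictedY = predictY($X, $YInt, $Slope);
printf("y = %01.2f + %01.2fx\n", $YInt, $Slope);
```

With data that lie exactly on a line, the predicted values coincide with the observed ones; with real data, the differences between them are the errors that the later getters summarize.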

How to determine whether the fit is good
The SimpleLinearRegression class generates a considerable number of summary values. One important summary value is the T statistic, which measures how well the linear equation fits the data. If the fit is good, the T statistic is usually large. If the T statistic is small, the linear equation should be replaced with a model that assumes the mean of the Y values is the best predictor (the mean of a set of values is often a useful predictor of the next observed value, which makes it the default model).
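
A sketch of how the T statistic for the slope might be calculated follows. It assumes the standard textbook formulas (error variance equal to the sum of squared errors divided by n - 2, and a slope standard error equal to the square root of the error variance divided by SumXX); the article does not show the class's own getters, so the function name slopeTValue and these formulas are assumptions rather than the author's implementation.

```php
<?php
// Sketch only: hypothetical function using the standard textbook formulas.
function slopeTValue(array $X, array $Y)
{
    $n = count($X);
    $XMean = array_sum($X) / $n;
    $YMean = array_sum($Y) / $n;

    $SumXX = 0.0;
    $SumXY = 0.0;
    foreach ($X as $i => $x) {
        $SumXX += pow($x - $XMean, 2);
        $SumXY += ($x - $XMean) * ($Y[$i] - $YMean);
    }
    $Slope = $SumXY / $SumXX;
    $YInt  = $YMean - $Slope * $XMean;

    // Sum of squared differences between observed and predicted Y values.
    $SumSquaredError = 0.0;
    foreach ($X as $i => $x) {
        $SumSquaredError += pow($Y[$i] - ($YInt + $Slope * $x), 2);
    }

    $DF = $n - 2;                                  // degrees of freedom
    $ErrorVariance = $SumSquaredError / $DF;
    $SlopeStdErr = sqrt($ErrorVariance / $SumXX);  // standard error of the slope
    return $Slope / $SlopeStdErr;
}

$T = slopeTValue(array(1, 2, 3, 4, 5), array(2, 4, 5, 4, 5));
printf("T: %01.2f\n", $T);
```

The T value grows when the slope is large relative to its standard error, which is why a large T value indicates that the line predicts better than the mean alone.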

To test whether the T statistic is large enough to rule out the mean of the Y values as the best predictor, you need to calculate the probability of obtaining that T statistic by chance. If this probability is very low, you can reject the null hypothesis that the mean is the best predictor and, correspondingly, be confident that the simple linear model fits the data well.

So how can we calculate the probability of a T statistic?

Calculating the probability of the T statistic
Since PHP lacks math routines for calculating the probability of a T statistic, I decided to hand this task to the statistical computing package R (see www.r-project.org in References) to obtain the required values. I also want to draw attention to this package because:

1. R offers many ideas that PHP developers might emulate in a PHP math library.
2. With R, you can check whether the values obtained from a PHP math library agree with those obtained from a mature, freely available open source statistical package.

The code in Listing 4 shows how easy it is to hand off to R to obtain a value.

Listing 4. Handing off to the R statistical computing package to obtain a value

<?php

// Copyright 2003, Paul Meagher
// Distributed under GPL

class SimpleLinearRegression {

  var $RPath = "/usr/local/bin/R";  // Your path here

  function getStudentProb($T, $df) {
    $Probability = 0.0;
    // dt() is R's density function for Student's t distribution.
    $cmd = "echo 'dt($T, $df)' | $this->RPath --slave";
    $result = shell_exec($cmd);
    // R prints a single value as "[1] value"; split off the "[1]" index.
    list($LineNumber, $Probability) = explode(" ", trim($result));
    return $Probability;
  }

  function getInverseStudentProb($alpha, $df) {
    $InverseProbability = 0.0;
    // qt() is R's quantile function for Student's t distribution.
    $cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave";
    $result = shell_exec($cmd);
    list($LineNumber, $InverseProbability) = explode(" ", trim($result));
    return $InverseProbability;
  }

}

?>


Note that the path to the R executable is set once, as an instance variable, and used in both functions. The first function returns the probability associated with a T statistic based on Student's T distribution, and the second, inverse function calculates the T statistic that corresponds to a given alpha setting. The getStudentProb method is used to assess the fit of the linear model; the getInverseStudentProb method returns an intermediate value that is used to calculate the confidence interval of each predicted Y value.
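
Both functions rely on the way R prints a single number: an index marker, "[1]", followed by the value. The sketch below isolates that parsing step so it can be checked without R installed. The function name parseRScalar is hypothetical, and returning null for unexpected output is a choice made for this sketch rather than behavior of the author's class.

```php
<?php
// Hypothetical helper: extracts the number from R output such as "[1] 0.975".
function parseRScalar($output)
{
    $parts = preg_split('/\s+/', trim((string) $output));
    if (count($parts) != 2 || !is_numeric($parts[1])) {
        return null;  // unexpected output, for example when R could not be run
    }
    return (float) $parts[1];
}

$value   = parseRScalar("[1] 2.068658\n");
$missing = parseRScalar(null);  // shell_exec returns null when there is no output

var_dump($value);
var_dump($missing);
```

Checking for a null result before using the value keeps a missing or misconfigured R installation from silently producing a probability of zero.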

Because of limited space, I cannot describe every function in this class in detail. If you want to understand the terms and steps involved in simple linear regression analysis, I encourage you to consult an undergraduate statistics textbook.
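
As one illustration of a function not detailed here, a confidence interval for the slope is conventionally the slope plus or minus the critical T value times the standard error of the slope. The sketch assumes that getConfIntOfSlope follows this convention; the function names below are hypothetical. The inputs are illustrative: 8.87 is the slope reported later in this article, 1.47 is that slope divided by its reported T value (8.87 / 6.03, rounded), and 2.069 is the tabled T value for a cumulative probability of 0.975 with 23 degrees of freedom.

```php
<?php
// Hypothetical functions illustrating the conventional formulas.
function alphaFromConfInt($ConfInt)
{
    // A 95% two-sided interval leaves 2.5% in each tail, so the
    // cumulative probability passed to the inverse function is 0.975.
    return (1 + ($ConfInt / 100)) / 2;
}

function confIntOfSlope($Slope, $SlopeStdErr, $AlphaTVal)
{
    $margin = $AlphaTVal * $SlopeStdErr;
    return array($Slope - $margin, $Slope + $margin);
}

$Alpha = alphaFromConfInt(95);
list($lower, $upper) = confIntOfSlope(8.87, 1.47, 2.069);
printf("Alpha: %01.3f, interval: %01.2f to %01.2f\n", $Alpha, $lower, $upper);
```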

Burnout study
To demonstrate how to use this class, I can use data from a study of burnout in social services. Michael Leiter and Kimberly Ann Meechan studied the relationship between a measure of burnout called the Exhaustion Index and an independent variable called Concentration. Concentration is the proportion of a person's social contacts that come from their work environment.

To study the relationship between the Exhaustion Index values and the Concentration values of the individuals in their sample, load these values into appropriately named arrays and instantiate the class with them. After the class is instantiated, display some of the summary values it generates to assess how well the linear model fits the data.

Listing 5 shows the script that loads the data and displays the summary values:

Listing 5. Script for loading data and displaying summary values

<?php

// BurnoutStudy.php

// Copyright 2003, Paul Meagher
// Distributed under GPL

include "SimpleLinearRegression.php";

// Load data from burnout study

$Concentration = array( 20, 60, 38, 88, 79, 87,
                        68, 12, 35, 70, 80, 92,
                        /* two rows of values are missing from this copy of the listing */
                        96 );

$ExhaustionIndex = array( 100, 525, 300, 980, 310, 900,
                          410, 296, 120, 501, 920, 810,
                          506, 493, 892, 527, 600, 855,
                          709, 791, 718, 684, 141, 400, 970 );

$slr = new SimpleLinearRegression($Concentration, $ExhaustionIndex);

// Format the summary values for display
$YInt      = sprintf($slr->format, $slr->YInt);
$Slope     = sprintf($slr->format, $slr->Slope);
$SlopeTVal = sprintf($slr->format, $slr->SlopeTVal);
$SlopeProb = sprintf("%01.6f", $slr->SlopeProb);

?>













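<table border="1">
<tr><th>Equation:</th><td><?php echo "Exhaustion = $YInt + ($Slope * Concentration)"; ?></td></tr>
<tr><th>T:</th><td><?php echo $SlopeTVal; ?></td></tr>
<tr><th>Prob &gt; T:</th><td><?php echo $SlopeProb; ?></td></tr>
</table>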



Running the script in a Web browser produces the following output:

Equation: Exhaustion = -29.50 + (8.87 * Concentration)
T: 6.03
Prob > T: 0.000005


The last row of this table shows that the probability of obtaining such a large T value by chance is very low. We can conclude that the simple linear model predicts better than simply using the mean of the exhaustion values.

Knowing the concentration of a person's workplace contacts can be used to predict the level of burnout they may be experiencing. The equation tells us that for every one-unit increase in the concentration value, the exhaustion value of a person in the social services field increases by nearly nine units (the slope is 8.87). This is further evidence that, to reduce potential burnout, individuals in social services should consider making friends outside their workplace.
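
To make the equation concrete, the reported coefficients can be plugged into a one-line prediction function. The function name predictExhaustion and the Concentration values of 50 and 51 are illustrative; the coefficients are the ones shown in the script output.

```php
<?php
// Coefficients as reported in the script output; the function name and
// the input values are illustrative.
function predictExhaustion($Concentration, $YInt = -29.50, $Slope = 8.87)
{
    return $YInt + $Slope * $Concentration;
}

// Predicted Exhaustion Index for a Concentration of 50.
printf("%01.2f\n", predictExhaustion(50));  // prints 414.00

// The difference between neighboring predictions is the slope.
printf("%01.2f\n", predictExhaustion(51) - predictExhaustion(50));  // prints 8.87
```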

This is only a rough sketch of what these results mean. To fully understand the implications of this dataset, you would want to study the data in more detail to make sure this is the correct interpretation. In the next article, I will discuss what other analyses should be performed.

What have you learned?
First, you do not have to be a rocket scientist to develop a PHP-based math package. By adhering to standard object-oriented techniques and explicitly adopting a backward-chaining approach to problem solving, you can use PHP to implement some of the more basic statistical processes with relative ease.

From a teaching point of view, I think this exercise is very useful, if only because it requires you to think about statistical tests or routines at both higher and lower levels of abstraction. In other words, a good way to supplement your study of a statistical test or process is to implement it as an algorithm.

Implementing a statistical test often requires going beyond the information given and creatively discovering and solving problems. It is also a good way to find gaps in your knowledge of a subject.

On the negative side, you found that PHP lacks built-in support for sampling distributions, which are needed to implement most statistical tests. You need to hand off to R to obtain these values, but I worry that you will not have the time or the interest to install R. Native PHP implementations of some common probability functions would solve this problem.
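
As a hedged illustration of what a native PHP probability function might look like, the sketch below computes a two-tailed probability for a T statistic by integrating the Student's T density numerically with Simpson's rule. It is not the author's implementation: the function names are hypothetical, the degrees of freedom are assumed to be a positive integer, and the result is a tail probability, which is a different quantity from the density value that Listing 4 requests from R with dt.

```php
<?php
// Sketch only: hypothetical functions; $df is assumed to be a positive integer.

// Natural log of Gamma(k / 2) for a positive integer k, built up from
// Gamma(1/2) = sqrt(pi) and Gamma(1) = 1 using Gamma(x + 1) = x * Gamma(x).
function logGammaOfHalf($k)
{
    $x = ($k % 2 == 0) ? 1.0 : 0.5;
    $logGamma = ($k % 2 == 0) ? 0.0 : 0.5 * log(M_PI);
    while ($x < $k / 2 - 1e-9) {
        $logGamma += log($x);
        $x += 1.0;
    }
    return $logGamma;
}

// Density of Student's T distribution with $df degrees of freedom at $t.
function studentDensity($t, $df)
{
    $logConst = logGammaOfHalf($df + 1) - logGammaOfHalf($df)
              - 0.5 * log($df * M_PI);
    return exp($logConst - (($df + 1) / 2) * log(1 + $t * $t / $df));
}

// Probability of a T value at least as extreme as $t in either direction.
// $intervals must be even for Simpson's rule.
function twoTailedStudentProb($t, $df, $intervals = 2000)
{
    $upper = abs($t);
    $h = $upper / $intervals;
    $sum = studentDensity(0.0, $df) + studentDensity($upper, $df);
    for ($i = 1; $i < $intervals; $i++) {
        $sum += (($i % 2 == 1) ? 4 : 2) * studentDensity($i * $h, $df);
    }
    $area = $sum * $h / 3;  // area under the density between 0 and |t|
    return 1 - 2 * $area;
}

printf("%01.4f\n", twoTailedStudentProb(2.0687, 23));
```

Comparing values like this one against R is a practical way to check such a function before relying on it.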

Another problem is that this class generates many intermediate and summary values, but the summary output does not actually make use of them. I provided some rudimentary output, but it is neither adequate nor well organized enough to let you fully interpret the results of the analysis. In fact, at this point I have no idea how to integrate output methods into this class. This needs to be addressed.

Finally, understanding the data requires more than looking at summary values. You also need to understand how the individual data points are distributed. One of the best ways to do this is to plot your data as a chart. Again, I do not know much about this, but if I want to use this class to analyze real data, I need to solve this problem.

In the next article in this series, I will implement some probability functions in native PHP code, extend the SimpleLinearRegression class with several output methods, and generate a report that presents the intermediate and summary values in table and graph form, making it easier to draw conclusions from the data. Stay tuned for the next installment!


References

1. See Statistics, 9th edition (Prentice-Hall), a popular university textbook by James T. McClave and Terry Sincich. The algorithm steps used in this article and the "burnout study" example are taken from this book.
2. Check the PEAR repository, which currently contains a small number of low-level PHP math classes. Eventually, it would be good to see PEAR contain packages that implement standard high-level numerical methods (such as SimpleLinearRegression, MultipleRegression, TimeSeries, ANOVA, FactorAnalysis, FourierAnalysis, and others).
3. View the complete source code of the author's SimpleLinearRegression class.
4. Take a look at the Numerical Python project, which extends Python with a full-featured scientific array language and sophisticated indexing facilities. With this extension, math operations come close to the performance you would expect from a compiled language.
5. Explore the many math resources available for Perl, including the index of CPAN math modules, the CPAN algorithm modules, and the Perl Data Language (PDL), which aims to give Perl the ability to compactly store and quickly manipulate large N-dimensional data arrays.
6. For more information about John Chambers's S programming language, see his publications and the links to his research projects at Bell Labs. You can also learn about the ACM award it won for language design in 1998.
7. R is a language and environment for statistical computing and graphics, similar to the award-winning S System. R provides statistical and graphical techniques such as linear and nonlinear modeling, statistical tests, time series analysis, classification, and clustering. Learn about R on the R Project home page.
8. If you are new to PHP, read Amol Hatwar's developerWorks series "Develop robust code with PHP": "Part 1: Introduction to the advanced architecture", "Part 2: Effective use of variables" (September 2002), and "Part 3: Writing reusable functions".
