Simple linear regression implemented in PHP: Part 1



PHP is missing a powerful tool: a language-based math library. In this two-part series, Paul Meagher hopes to inspire PHP developers to develop and implement a PHP-based math library by providing an example of how to develop an analysis model library. In Part 1, he demonstrates how to use PHP as the implementation language to develop and implement the core of a simple linear regression algorithm package. In Part 2, the author adds features to the package that make it a useful data analysis tool for small and medium-sized datasets.

Introduction
Compared to other open source languages such as Perl and Python, the PHP community lacks a strong effort to develop a math library.

One reason for this may be that a large number of mature mathematical tools already exist, which may discourage the community from developing its own PHP tools. For example, I have studied a powerful tool called the S System, which has an impressive set of statistical libraries, was designed specifically to analyze datasets, and won an ACM award in 1998 for its language design. If S or its open source cousin R is just a shell_exec call away, why bother implementing the same statistical computing functionality in PHP? For more information on the S System, its ACM award, or R, see Resources.

Isn't this a waste of developer energy? If the motivation for not developing a PHP math library is to conserve developer effort and use the best tool for the job, then PHP's current course makes sense.

On the other hand, pedagogical motivations may encourage the development of a PHP math library. For about 10% of people, mathematics is an interesting subject to explore. For those who are also skilled at PHP, developing a PHP math library can enhance the process of learning math. In other words, do not just read the chapter on t-tests; implement a class that calculates the relevant intermediate values and displays them in a standard format.

By way of instruction and example, I want to show that developing a PHP math library is not a difficult task and that it can be an interesting technical and learning challenge. In this article, I provide a sample PHP math library, called SimpleLinearRegression, that demonstrates a general approach to developing PHP math libraries. Let's start by discussing the general principles that guided me in developing the SimpleLinearRegression class.

Guiding principles
I used six general principles to guide the development of the SimpleLinearRegression class.

1. Create a class for each analysis model.
2. Use backward chaining to develop the class.
3. Expect a large number of getters.
4. Store intermediate results.
5. Prefer a verbose API.
6. Perfection is not a goal.

Let us examine these guidelines in more detail.

Create a class for each analysis model
Each major analytical test or procedure should have a PHP class with the same name as the test or procedure. The class contains input methods, methods that calculate intermediate and summary values, and output methods (which display the intermediate and summary values on screen in text or graphic format).

Use backward chaining to develop the class
In mathematical programming, the target of the coding is usually the standard output values that an analysis procedure (such as MultipleRegression, TimeSeries, or ChiSquared) is expected to produce. From a problem-solving perspective, this means that you can use backward chaining to develop the methods of a math class.

For example, the summary output screen displays one or more summary statistics. These summary statistics depend on the calculation of intermediate statistics, which may in turn depend on a deeper level of intermediate statistics, and so on. This backward-chaining approach to development leads to the next principle.

Expect a large number of getters
Most of the work in developing a math class involves calculating intermediate and summary values. In practice, this means you should not be surprised if your class contains many getter methods for calculating intermediate and summary values.

Store intermediate results
Intermediate results are stored in the result object so that you can use them as input to subsequent computations. This principle is implemented in the design of the S language. In the current context, it is implemented by choosing instance variables to represent the calculated intermediate values and summary results.
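The getter and storage principles can be sketched with a much smaller class than SimpleLinearRegression. The class below, SampleVariance, is hypothetical and is not part of the author's package; it only illustrates how a constructor might call getters in dependency order and store each result in an instance variable for later steps to reuse.

```php
<?php
// Hypothetical illustration only; SampleVariance is not part of the
// author's SimpleLinearRegression package.
class SampleVariance
{
    public $X = array();
    public $n;
    public $Mean;                  // intermediate value
    public $SumSquaredDeviations;  // intermediate value
    public $Variance;              // summary value

    public function __construct(array $X)
    {
        if (count($X) < 2) {
            throw new InvalidArgumentException("Need at least 2 values.");
        }
        $this->X = array_values($X);
        $this->n = count($this->X);
        // Getters run in dependency order; each stored result feeds the next.
        $this->Mean = $this->getMean();
        $this->SumSquaredDeviations = $this->getSumSquaredDeviations();
        $this->Variance = $this->getVariance();
    }

    public function getMean()
    {
        return array_sum($this->X) / $this->n;
    }

    public function getSumSquaredDeviations()
    {
        // Relies on $this->Mean having been stored already.
        $sum = 0.0;
        foreach ($this->X as $x) {
            $sum += ($x - $this->Mean) * ($x - $this->Mean);
        }
        return $sum;
    }

    public function getVariance()
    {
        // Sample variance: divide by n - 1.
        return $this->SumSquaredDeviations / ($this->n - 1);
    }
}

$sv = new SampleVariance(array(1, 2, 3, 4, 5));
printf("%01.2f\n", $sv->Variance); // prints 2.50
```

Because every intermediate value is kept as a property, an output method added later could display the mean and the sum of squared deviations without recomputing them.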

Prefer a verbose API
When devising a naming scheme for the member functions and instance variables of the SimpleLinearRegression class, I found that using longer names (a name like getSumSquaredError rather than getYY2) to describe member functions and instance variables made it much easier to understand what a function does and what a variable means.

I did not abandon abbreviated names altogether; however, when I use an abbreviated name, I try to provide a comment that fully explains its meaning. My view is that highly abbreviated naming schemes, although common in mathematical programming, make it harder than it needs to be to understand a math routine and to verify that it does what it should, step by step.

Perfection is not a goal
The goal of this coding exercise is not necessarily to develop a highly optimized and rigorous math engine for PHP. In the early stages, the emphasis should be on learning to implement meaningful analytic tests and on working through the challenges in this area.


Instance variables
When modeling a statistical test or procedure, you need to decide which instance variables to declare.

The choice of instance variables can be determined by listing the intermediate and summary values generated by the analysis procedure. Each intermediate value and summary value can have a corresponding instance variable, with the value stored as an object property.

I used this kind of analysis to determine which variables to declare for the SimpleLinearRegression class in Listing 1. A similar analysis can be performed for MultipleRegression, ANOVA, or TimeSeries procedures.

Listing 1. Instance variables of the SimpleLinearRegression class
<?php

// Copyright 2003, Paul Meagher
// Distributed under GPL

class SimpleLinearRegression {

  var $n;
  var $X = array();
  var $Y = array();
  var $ConfInt;
  var $Alpha;
  var $XMean;
  var $YMean;
  var $SumXX;
  var $SumXY;
  var $SumYY;
  var $Slope;
  var $YInt;
  var $PredictedY = array();
  var $Error = array();
  var $SquaredError = array();
  var $TotalError;
  var $SumError;
  var $SumSquaredError;
  var $ErrorVariance;
  var $StdErr;
  var $SlopeStdErr;
  var $SlopeTVal;       // T value of slope
  var $YIntStdErr;
  var $YIntTVal;        // T value for Y intercept
  var $R;
  var $RSquared;
  var $DF;              // Degrees of freedom
  var $SlopeProb;       // Probability of slope estimate
  var $YIntProb;        // Probability of Y intercept estimate
  var $AlphaTVal;       // T value for given alpha setting
  var $ConfIntOfSlope;

  var $RPath = "/usr/local/bin/R";  // Your path here

  var $format = "%01.2f";  // Used for formatting output

}
?>


Constructor
The constructor method of the SimpleLinearRegression class accepts an X vector and a Y vector, which must contain the same number of values. You can also set the confidence interval for your estimated Y values, which defaults to 95%.

The constructor starts by validating that the data is in a form suitable for processing. Once the input vectors pass the "equal size" and "more than one value" tests, the core part of the algorithm is executed.

Performing this task involves calculating the intermediate and summary values of the statistical procedure through a series of getter methods. The return value of each method call is assigned to an instance variable of the class. Storing the results in this way ensures that intermediate and summary values are available to the routines called later in the chain of calculations. You can also display these results by calling the output methods of the class, as described in Listing 2.

Listing 2. Calling class output methods
<?php

// Copyright 2003, Paul Meagher
// Distributed under GPL

// PHP 4-style constructor (a method named after the class).
// PHP 8 no longer treats such a method as a constructor; use __construct() there.
function SimpleLinearRegression($X, $Y, $ConfidenceInterval = "95") {

  $numX = count($X);
  $numY = count($Y);

  // The two input vectors must be the same size.
  if ($numX != $numY) {
    die("Error: Size of X and Y vectors must be the same.");
  }
  // A line cannot be fitted to fewer than two points.
  if ($numX <= 1) {
    die("Error: Size of input array must be at least 2.");
  }

  $this->n = $numX;
  $this->X = $X;
  $this->Y = $Y;

  $this->ConfInt = $ConfidenceInterval;
  $this->Alpha   = (1 + ($this->ConfInt / 100)) / 2;

  // Each getter may rely on the values stored by the getters before it.
  $this->XMean           = $this->getMean($this->X);
  $this->YMean           = $this->getMean($this->Y);
  $this->SumXX           = $this->getSumXX();
  $this->SumYY           = $this->getSumYY();
  $this->SumXY           = $this->getSumXY();
  $this->Slope           = $this->getSlope();
  $this->YInt            = $this->getYInt();
  $this->PredictedY      = $this->getPredictedY();
  $this->Error           = $this->getError();
  $this->SquaredError    = $this->getSquaredError();
  $this->SumError        = $this->getSumError();
  $this->TotalError      = $this->getTotalError();
  $this->SumSquaredError = $this->getSumSquaredError();
  $this->ErrorVariance   = $this->getErrorVariance();
  $this->StdErr          = $this->getStdErr();
  $this->SlopeStdErr     = $this->getSlopeStdErr();
  $this->YIntStdErr      = $this->getYIntStdErr();
  $this->SlopeTVal       = $this->getSlopeTVal();
  $this->YIntTVal        = $this->getYIntTVal();
  $this->R               = $this->getR();
  $this->RSquared        = $this->getRSquared();
  $this->DF              = $this->getDF();
  $this->SlopeProb       = $this->getStudentProb($this->SlopeTVal, $this->DF);
  $this->YIntProb        = $this->getStudentProb($this->YIntTVal, $this->DF);
  $this->AlphaTVal       = $this->getInverseStudentProb($this->Alpha, $this->DF);
  $this->ConfIntOfSlope  = $this->getConfIntOfSlope();

  return true;
}

?>


The method names and their order were derived by combining backward chaining with an undergraduate statistics textbook that explains step by step how to calculate the intermediate values. Each method name is formed by adding a "get" prefix to the name of the intermediate value I need to compute.

Fitting the model to the data
The SimpleLinearRegression procedure is used to produce a line that fits the data, where the line has the following standard equation:

y = b + mx

The PHP form of this equation looks similar to Listing 3:

Listing 3. The PHP equation that fits the model to the data
$PredictedY[$i] = $YIntercept + $Slope * $X[$i];


The SimpleLinearRegression class uses the least-squares criterion to derive estimates of the Y-intercept and slope parameters. These estimated parameters are used to construct a linear equation (see Listing 3) that models the relationship between the X and Y values.

Using the derived linear equation, you can obtain a predicted Y value for each X value. If the linear equation fits the data well, the observed Y values will be close to the predicted values.
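As a rough sketch of the least-squares step, the function below computes the slope and Y-intercept from the usual textbook sums of deviations and then applies the equation from Listing 3. The function name leastSquaresFit and the tiny dataset are assumptions made for illustration; the author's class appears to spread the same work across separate getters such as getSumXX, getSumXY, getSlope, and getYInt.

```php
<?php
// Illustrative sketch; leastSquaresFit is a hypothetical helper, not the
// author's code.
function leastSquaresFit(array $X, array $Y)
{
    $X = array_values($X);
    $Y = array_values($Y);
    $n = count($X);
    if ($n !== count($Y) || $n < 2) {
        throw new InvalidArgumentException("X and Y need the same size (at least 2).");
    }

    $XMean = array_sum($X) / $n;
    $YMean = array_sum($Y) / $n;

    // Sums of squared deviations and cross-products around the means.
    $SumXX = 0.0;
    $SumXY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $SumXX += ($X[$i] - $XMean) * ($X[$i] - $XMean);
        $SumXY += ($X[$i] - $XMean) * ($Y[$i] - $YMean);
    }
    if ($SumXX == 0.0) {
        throw new InvalidArgumentException("All X values are identical.");
    }

    // Least-squares estimates of the two parameters.
    $Slope = $SumXY / $SumXX;
    $YInt  = $YMean - $Slope * $XMean;

    // Predicted Y for each observed X, as in Listing 3.
    $PredictedY = array();
    for ($i = 0; $i < $n; $i++) {
        $PredictedY[$i] = $YInt + $Slope * $X[$i];
    }

    return array('Slope' => $Slope, 'YInt' => $YInt, 'PredictedY' => $PredictedY);
}

$fit = leastSquaresFit(array(1, 2, 3, 4), array(3, 5, 7, 9));
printf("y = %01.2f + %01.2f * x\n", $fit['YInt'], $fit['Slope']);
// prints: y = 1.00 + 2.00 * x
```

The sample points lie exactly on a line, so the predicted values equal the observed ones; with real data the two differ, and those differences are the errors the class goes on to summarize.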

Determining how well the model fits
The SimpleLinearRegression class generates quite a few summary values. An important one is the T statistic, which can be used to measure how well a linear equation fits the data. If the fit is very good, the T statistic tends to be large. If the T statistic is small, you should replace the linear equation with a model that assumes the mean of the Y values is the best predictor (that is, the mean of a set of values is often a useful predictor of the next observation, which makes it the default model).

To test whether the T statistic is large enough to reject the mean of the Y values as the best predictor, you need to compute the probability of obtaining that T statistic by chance. If that probability is low, you can reject the assumption that the mean is the best predictor and, correspondingly, be confident that the simple linear model fits the data well.
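The T statistic for the slope can be sketched as the slope divided by its standard error. The formula used here, with the slope's standard error taken as sqrt(SSE / (n - 2)) / sqrt(SumXX), is the standard textbook one and is only an assumption about what the author's getSlopeStdErr and getSlopeTVal methods compute; the helper name slopeTValue is hypothetical.

```php
<?php
// Illustrative sketch; slopeTValue is a hypothetical helper.
function slopeTValue(array $X, array $Y)
{
    $X = array_values($X);
    $Y = array_values($Y);
    $n = count($X);
    if ($n !== count($Y) || $n < 3) {
        throw new InvalidArgumentException("X and Y need the same size (at least 3).");
    }

    $XMean = array_sum($X) / $n;
    $YMean = array_sum($Y) / $n;

    $SumXX = 0.0;
    $SumXY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $SumXX += ($X[$i] - $XMean) * ($X[$i] - $XMean);
        $SumXY += ($X[$i] - $XMean) * ($Y[$i] - $YMean);
    }
    if ($SumXX == 0.0) {
        throw new InvalidArgumentException("All X values are identical.");
    }

    $Slope = $SumXY / $SumXX;
    $YInt  = $YMean - $Slope * $XMean;

    // Sum of squared differences between observed and predicted Y.
    $SumSquaredError = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $Error = $Y[$i] - ($YInt + $Slope * $X[$i]);
        $SumSquaredError += $Error * $Error;
    }

    $DF          = $n - 2;                        // two estimated parameters
    $StdErr      = sqrt($SumSquaredError / $DF);  // standard error of estimate
    $SlopeStdErr = $StdErr / sqrt($SumXX);

    if ($SlopeStdErr == 0.0) {
        return INF; // perfect fit: the statistic is unbounded
    }
    return $Slope / $SlopeStdErr;
}

printf("%01.2f\n", slopeTValue(array(1, 2, 3, 4, 5), array(2, 4, 5, 4, 5)));
// prints 2.12
```

A larger absolute value means the slope is large relative to the noise around the line, which is what makes the default "mean only" model less plausible.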

So, how do you calculate the probability of a T statistic?

Calculating the probability of a T statistic
Because PHP lacks math routines for computing the probability of a T statistic, I decided to hand this task off to the statistical computing package R (see www.r-project.org in Resources) to obtain the necessary values. I also want to draw your attention to this package because:

1. R offers many ideas that PHP developers might emulate in a PHP math library.
2. With R, you can check whether the values obtained from a PHP math library agree with those obtained from a mature, freely available open source statistics package.
The code in Listing 4 shows how easy it is to hand a computation off to R to obtain a value.

Listing 4. Handing off to the R statistical computing package to obtain a value
<?php

// Copyright 2003, Paul Meagher
// Distributed under GPL

class SimpleLinearRegression {

  var $RPath = "/usr/local/bin/R";  // Your path here

  function getStudentProb($T, $df) {
    $Probability = 0.0;
    // Pipe a one-line R expression into R and capture what it prints.
    $cmd = "echo 'dt($T, $df)' | $this->RPath --slave";
    $result = shell_exec($cmd);
    // R prints an index followed by the value; split on the space.
    list($LineNumber, $Probability) = explode(" ", trim($result));
    return $Probability;
  }

  function getInverseStudentProb($alpha, $df) {
    $InverseProbability = 0.0;
    $cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave";
    $result = shell_exec($cmd);
    list($LineNumber, $InverseProbability) = explode(" ", trim($result));
    return $InverseProbability;
  }

}

?>


Note that the path to the R executable is set here and used in both functions. The first function returns the probability value associated with a T statistic based on Student's t distribution, and the second, inverse function calculates the T statistic corresponding to a given alpha setting. The getStudentProb method is used to assess the fit of the linear model; the getInverseStudentProb method returns an intermediate value used to calculate the confidence interval for each predicted Y value.
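Both methods in Listing 4 rely on R printing a single value prefixed by an index such as [1], which explode then splits on the space. A slightly more defensive way to read that output might look like the sketch below. The helper name parseRScalar is hypothetical, and the sample string only illustrates the shape of R's output; it is not a captured result.

```php
<?php
// Illustrative sketch; parseRScalar is a hypothetical helper.
// Returns the number printed by R as a float, or null if the output
// does not look like a single numeric value.
function parseRScalar($output)
{
    // Optional "[1]" style index, then one number (plain or scientific).
    $pattern = '/^\s*(?:\[\d+\]\s+)?(-?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)\s*$/';
    if (preg_match($pattern, (string) $output, $matches)) {
        return (float) $matches[1];
    }
    return null;
}

echo parseRScalar("[1] 2.068658\n"), "\n";      // prints 2.068658
var_dump(parseRScalar(""));                      // NULL (for example, R not found)
var_dump(parseRScalar("Error: object not found")); // NULL
```

Returning null when the output is empty or unexpected lets the caller detect a missing or misconfigured R installation instead of silently carrying on with a non-numeric value.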

Because of space limitations, I cannot elaborate on all the functions in this class. If you want to understand the terminology and procedures involved in simple linear regression analysis, I encourage you to consult an undergraduate statistics textbook.

Burnout study
To demonstrate how to use this class, I can use data from a burnout study. Michael Leiter and Kimberly Ann Meechan studied the relationship between a measure of burnout called the Exhaustion Index and an independent variable called Concentration. Concentration refers to the proportion of a person's social contacts that come from their work environment.

To study the relationship between the Exhaustion Index and Concentration values in their sample, load the values into appropriately named arrays and instantiate the class with these arrays. Once the class is instantiated, display some of the summary values it generates to assess how well the linear model fits the data.

Listing 5 shows the script that loads the data and displays the summary values:

Listing 5. Script for loading the data and displaying the summary values
<?php

// burnoutstudy.php

// Copyright 2003, Paul Meagher
// Distributed under GPL

include "simplelinearregression.php";

// Load data from burnout study

$Concentration = array(20,60,38,88,79,87,
                       68,12,35,70,80,92,
                       77,86,83,79,75,81,
                       75,77,77,77,17,85,96);

$ExhaustionIndex = array(100,525,300,980,310,900,
                         410,296,120,501,920,810,
                         506,493,892,527,600,855,
                         709,791,718,684,141,400,970);

$SLR = new SimpleLinearRegression($Concentration, $ExhaustionIndex);

// Format the summary values for display.
$YInt      = sprintf($SLR->format, $SLR->YInt);
$Slope     = sprintf($SLR->format, $SLR->Slope);
$SlopeTVal = sprintf($SLR->format, $SLR->SlopeTVal);
$SlopeProb = sprintf("%01.6f", $SLR->SlopeProb);

?>

<table border='1' cellpadding='5'>
<tr>
  <th align='right'>Equation:</th>
  <td><?php echo "Exhaustion = $YInt + ($Slope * Concentration)"; ?></td>
</tr>
<tr>
  <th align='right'>T:</th>
  <td><?php echo $SlopeTVal; ?></td>
</tr>
<tr>
  <th align='right'>Prob > T:</th>
  <td><?php echo $SlopeProb; ?></td>
</tr>
</table>


Running the script through a Web browser produces the following output:

Equation: Exhaustion = -29.50 + (8.87 * Concentration)
T: 6.03
Prob > T: 0.000005


The last line of this table indicates that the probability of obtaining such a large T value by chance is very low. It can be concluded that the simple linear model has better predictive power than the mean of the exhaustion values alone.

Knowing how concentrated a person's social contacts are in the workplace can be used to predict the level of burnout they may be experiencing. The equation tells us that for each 1-unit increase in concentration, the exhaustion value of a person in the social services field increases by about 8.87 units. This further suggests that, to reduce potential burnout, individuals in social services should consider making friends outside their workplace.
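To make this interpretation concrete, the fitted equation can be applied to a new concentration value. The function name predictExhaustion and the input value of 50 are assumptions chosen for illustration; the coefficients are the rounded values reported in the output above.

```php
<?php
// Illustrative sketch; predictExhaustion is a hypothetical helper that
// applies the fitted line using the rounded coefficients from the output.
function predictExhaustion($Concentration, $YInt = -29.50, $Slope = 8.87)
{
    return $YInt + $Slope * $Concentration;
}

// Predicted exhaustion for a concentration of 50.
printf("%01.2f\n", predictExhaustion(50)); // prints 414.00

// One extra unit of concentration changes the prediction by the slope.
printf("%01.2f\n", predictExhaustion(51) - predictExhaustion(50)); // prints 8.87
```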

This is just a rough sketch of what these results might mean. To fully explore the implications of this dataset, you would want to study the data in more detail to make sure that this is the correct interpretation. In the next article, I'll discuss what other analyses should be done.

What have you learned?
First, you don't have to be a rocket scientist to develop a meaningful PHP-based math package. By adhering to standard object-oriented techniques and explicitly adopting a backward-chaining problem-solving approach, it is relatively easy to implement some of the more basic statistical procedures in PHP.

From a teaching point of view, I think this exercise is very useful, if only because it requires you to think about statistical tests or routines at both higher and lower levels of abstraction. In other words, a good way to supplement your study of a statistical test or procedure is to implement it as an algorithm.

Implementing a statistical test typically requires you to go beyond the information given and to creatively solve and discover problems. It is also a good way to discover gaps in your understanding of a subject.

On the downside, you find that PHP lacks built-in means for sampling distributions, which are needed to implement most statistical tests. You need to hand off to R to obtain these values, but I am afraid you may not have the time or interest to install R. Native PHP implementations of some common probability functions could solve this problem.

Another problem is that the class generates many intermediate and summary values, but the summary output does not really take advantage of them. I have provided some rudimentary output, but it is neither sufficient nor well organized enough for you to adequately interpret the results of the analysis. In fact, I have not yet worked out how output methods should be integrated into the class. This needs to be addressed.

Finally, making sense of data requires more than looking at summary values. You also need to understand how the individual data points are distributed. One of the best ways to do this is to plot your data. Again, I have not addressed this here, but I need to solve this problem if I want to use this class to analyze real data.

In the next article in this series, I'll use native PHP code to implement some probability functions, extend the SimpleLinearRegression class with several output methods, and generate a report that presents intermediate and summary values in tabular and graphical form, making it easier to draw conclusions from the data. Stay tuned!


Resources

1. See the popular university textbook Statistics, 9th edition, by James T. McClave and Terry Sincich (Prentice-Hall), which is the reference for the algorithmic steps and the burnout study example used in this article.
2. Check out the PEAR repository, which currently contains a small number of low-level PHP math classes. Eventually, it would be nice to see PEAR include packages that implement standard higher-level numerical methods (such as SimpleLinearRegression, MultipleRegression, TimeSeries, ANOVA, FactorAnalysis, FourierAnalysis, and others).
3. View all the source code for the author's SimpleLinearRegression class.
4. Learn about the Numerical Python project, which extends Python with a powerful scientific array language and sophisticated indexing. With this extension, mathematical operations come close to what people expect from a compiled language.
5. Explore the many math resources available for Perl, including the CPAN math modules, the modules in the algorithm section of CPAN, and the Perl Data Language (PDL), which is designed to give Perl the ability to compactly store and quickly manipulate large N-dimensional data arrays.
6. For more information on John Chambers' S programming language, see his publications and the links to his various research projects at Bell Labs. You can also learn about the ACM award for language design that it received in 1998.
7. R is a language and environment for statistical computing and graphics, similar to the award-winning S System. R provides statistical and graphical techniques such as linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, and so on. Learn about R at the R Project home page.
8. If you are new to PHP, read Amol Hatwar's developerWorks series "Developing robust code in PHP": "Part 1: Farsighted introduction" (August 2002), "Part 2: Effective use of variables" (September 2002), and "Part 3: Writing reusable functions" (November 2002).
