The importance of PHP's mathematical library for simple linear regression

Last Update:2016-06-01 Source: Internet

Author: User

Brief introduction

Compared with other open source languages such as Perl and Python, the PHP community has made no strong effort to develop math libraries.

One reason may be that a large number of mature mathematical tools already exist, which may have discouraged the community from developing its own PHP tools. For example, I have studied a powerful tool, the S System, which has an impressive set of statistical libraries designed for analyzing datasets and which won an ACM award in 1998 for its language design. If S or its open source counterpart R is only a shell_exec call away, why bother implementing the same statistical computing functionality in PHP? For more information about the S System, its ACM award, or R, see the related references.

Isn't that a waste of developer energy? If the motivation for developing a PHP math library is to conserve developer energy and use the best tool for the job, then PHP's current course makes sense.

On the other hand, developing PHP math libraries may be worth encouraging for teaching reasons. For about 10% of people, mathematics is an interesting subject to explore. For those who are also proficient in PHP, developing a PHP math library can enhance the process of learning mathematics. In other words, don't just read the chapter on T tests: also implement a class that calculates the corresponding intermediate values and displays them in a standard format.

Through both instruction and exercise, I hope to show that developing a PHP math library is not a difficult task and that it can be an interesting technical and learning challenge. In this article, I provide an example of a PHP math library called SimpleLinearRegression, which demonstrates a general approach that you can use to develop PHP math libraries. Let's start by discussing the general principles that guided me in developing the SimpleLinearRegression class.

Guiding Principles

I used six general principles to guide the development of the SimpleLinearRegression class.

Create a class for each analysis model.
Use backward chaining to develop the class.
Expect lots of getters.
Store intermediate results.
Prefer a verbose API.
Perfection is not a goal.

Let's examine these guidelines in more detail.

Create a class for each analysis model

Each major analysis test or procedure should have a PHP class with the same name as the test or procedure. The class contains input functions, functions that calculate intermediate and summary values, and output functions (which display the intermediate and summary values on screen in text or graphic format).

Use backward chaining to develop the class

In mathematical programming, the target of the coding is usually the standard output that an analysis procedure (such as MultipleRegression, TimeSeries, or ChiSquared) is expected to produce. From a problem-solving perspective, this means that you can use backward chaining to develop the methods of a math class.

For example, the summary output screen shows one or more summary statistics. These summary statistics depend on the calculation of intermediate statistics, which in turn may depend on deeper intermediate statistics, and so on. This backward-chaining approach to development leads to the next principle.

Expect lots of getters

Most of the development work for a math class involves calculating intermediate and summary values. In practice, this means you should not be surprised if your class contains many getter methods that calculate intermediate and summary values.

Store intermediate results

Store intermediate results inside the result object so that you can use them as input to subsequent calculations. This principle is built into the design of the S language. In the present context, it is implemented by choosing instance variables to represent the computed intermediate values and summary results.
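The backward chaining, getter, and storage principles can be sketched together on a smaller problem. The SampleVariance class below is a hypothetical illustration and is not part of the article's library: it is worked out backward from its summary value (the variance needs a sum of squared deviations, which in turn needs the mean), gives each value its own getter, and stores each result in an instance variable so that later getters can reuse it.

```php
<?php
// Hypothetical example class; all names are illustrative.
class SampleVariance {
    public $X = array();
    public $n;
    public $Mean;                 // intermediate value
    public $SumSquaredDeviations; // intermediate value
    public $Variance;             // summary value

    function __construct($X) {
        if (count($X) < 2) {
            throw new InvalidArgumentException("Need at least 2 values.");
        }
        $this->X = $X;
        $this->n = count($X);
        // Order matters: each getter relies on the values stored before it.
        $this->Mean = $this->getMean();
        $this->SumSquaredDeviations = $this->getSumSquaredDeviations();
        $this->Variance = $this->getVariance();
    }

    function getMean() {
        return array_sum($this->X) / $this->n;
    }

    function getSumSquaredDeviations() {
        $sum = 0.0;
        foreach ($this->X as $x) {
            $sum += pow($x - $this->Mean, 2);
        }
        return $sum;
    }

    function getVariance() {
        // Unbiased sample variance: divide by n - 1.
        return $this->SumSquaredDeviations / ($this->n - 1);
    }
}

$sv = new SampleVariance(array(2, 4, 4, 4, 5, 5, 7, 9));
printf("Mean: %01.2f, Variance: %01.2f\n", $sv->Mean, $sv->Variance);
```

Because every intermediate value stays on the object, an output method or a later calculation can read $sv->Mean directly instead of recomputing it.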

Prefer a verbose API

When choosing names for the member functions and instance variables of the SimpleLinearRegression class, I found that longer, descriptive names (such as getSumSquaredError rather than getYY2) made it easier to understand what a function does and what a variable represents.

I did not give up abbreviated names altogether, but whenever I used one, I tried to add a comment that fully explains its meaning. My view is that highly abbreviated naming schemes are common in mathematical programming, but they make it needlessly difficult to follow a mathematical routine step by step and verify that it is correct.

Perfection is not a goal

The goal of this coding exercise is not necessarily to develop a highly optimized and rigorous math engine for PHP. In the early stages, the emphasis should be on learning to implement meaningful analytic tests and on solving the problems that arise along the way.

Instance variables

When modeling a statistical test or procedure, you need to decide which instance variables to declare.

You can choose the instance variables by listing the intermediate and summary values that the analysis procedure generates. Each intermediate and summary value can have a corresponding instance variable that holds the value as an object property.

I used this kind of analysis to decide which variables to declare for the SimpleLinearRegression class in Listing 1. You can perform a similar analysis for a MultipleRegression, ANOVA, or TimeSeries procedure.

Listing 1. Instance variables for the SimpleLinearRegression class

<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
class SimpleLinearRegression {
    var $n;
    var $X = array();
    var $Y = array();
    var $ConfInt;
    var $Alpha;
    var $XMean;
    var $YMean;
    var $SumXX;
    var $SumXY;
    var $SumYY;
    var $Slope;
    var $YInt;
    var $PredictedY = array();
    var $Error = array();
    var $SquaredError = array();
    var $TotalError;
    var $SumError;
    var $SumSquaredError;
    var $ErrorVariance;
    var $StdErr;
    var $SlopeStdErr;
    var $SlopeTVal;       // T value of Slope
    var $YIntStdErr;
    var $YIntTVal;        // T value for Y Intercept
    var $R;
    var $RSquared;
    var $DF;              // Degrees of Freedom
    var $SlopeProb;       // Probability of Slope Estimate
    var $YIntProb;        // Probability of Y Intercept Estimate
    var $AlphaTVal;       // T Value for given Alpha setting
    var $ConfIntOfSlope;

    var $RPath = "/usr/local/bin/R"; // Your path here

    var $format = "%01.2f"; // Used for formatting output

}
?>

Constructor

The constructor of the SimpleLinearRegression class accepts an X vector and a Y vector, each with the same number of values. You can also set the confidence interval for the predicted Y values, which defaults to 95%.

The constructor begins by verifying that the data are in a form suitable for processing. Once the input vectors pass the "equal size" and "more than one value" tests, the core of the algorithm is executed.

This work involves calculating the intermediate and summary values of the statistical procedure through a series of getter methods. The return value of each method call is assigned to an instance variable of the class. Storing the results this way ensures that both intermediate and summary values are available to the routines called later in the chain of calculations. You can also display these results by calling the output methods of the class, as described in Listing 2.

Listing 2. Calling the class output method

<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
// Constructor of the SimpleLinearRegression class. The original used a
// PHP 4-style constructor named after the class; PHP 8 requires __construct.
function __construct($X, $Y, $ConfidenceInterval = "95") {
    $numX = count($X);
    $numY = count($Y);
    if ($numX != $numY) {
        die("Error: Size of X and Y vectors must be the same.");
    }
    if ($numX <= 1) {
        die("Error: Size of input array must be at least 2.");
    }

    // Property names are case-sensitive and must match Listing 1.
    $this->n = $numX;
    $this->X = $X;
    $this->Y = $Y;

    $this->ConfInt = $ConfidenceInterval;
    $this->Alpha = (1 + ($this->ConfInt / 100)) / 2;
    $this->XMean = $this->getMean($this->X);
    $this->YMean = $this->getMean($this->Y);
    $this->SumXX = $this->getSumXX();
    $this->SumYY = $this->getSumYY();
    $this->SumXY = $this->getSumXY();
    $this->Slope = $this->getSlope();
    $this->YInt = $this->getYInt();
    $this->PredictedY = $this->getPredictedY();
    $this->Error = $this->getError();
    $this->SquaredError = $this->getSquaredError();
    $this->SumError = $this->getSumError();
    $this->TotalError = $this->getTotalError();
    $this->SumSquaredError = $this->getSumSquaredError();
    $this->ErrorVariance = $this->getErrorVariance();
    $this->StdErr = $this->getStdErr();
    $this->SlopeStdErr = $this->getSlopeStdErr();
    $this->YIntStdErr = $this->getYIntStdErr();
    $this->SlopeTVal = $this->getSlopeTVal();
    $this->YIntTVal = $this->getYIntTVal();
    $this->R = $this->getR();
    $this->RSquared = $this->getRSquared();
    $this->DF = $this->getDF();
    $this->SlopeProb = $this->getStudentProb($this->SlopeTVal, $this->DF);
    $this->YIntProb = $this->getStudentProb($this->YIntTVal, $this->DF);
    $this->AlphaTVal = $this->getInverseStudentProb($this->Alpha, $this->DF);
    $this->ConfIntOfSlope = $this->getConfIntOfSlope();
    return true;
}
?>

The method names and their order were derived through a combination of backward chaining and consulting an undergraduate statistics textbook that shows, step by step, how to calculate the intermediate values. I derived each method name by adding a "get" prefix to the name of the intermediate value I needed to calculate.

Fitting the model to the data

The SimpleLinearRegression procedure produces a line that fits the data, where the line has the following standard equation:

y = b + mx

In PHP, the equation looks similar to Listing 3:

Listing 3. PHP equation that fits the model to the data

$PredictedY[$i] = $YIntercept + $Slope * $X[$i];

The SimpleLinearRegression class uses the least squares criterion to derive estimates of the Y-intercept and slope parameters. These estimated parameters are used to construct a linear equation (see Listing 3) that models the relationship between the X and Y values.

Using the derived linear equation, you can obtain the predicted Y value for each X value. If the linear equation fits the data well, the observed Y values tend to lie close to the predicted values.
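The article does not list the getter bodies, so the following is only a minimal sketch of the textbook least squares estimates, assuming that SumXX is the sum of squared deviations of X from its mean and SumXY is the sum of cross products of the X and Y deviations. The function name fitLine and its return format are illustrative and not part of the class.

```php
<?php
// Least squares fit of y = b + m*x for two equally sized, 0-indexed arrays.
function fitLine(array $X, array $Y): array {
    $n = count($X);
    if ($n !== count($Y) || $n < 2) {
        throw new InvalidArgumentException("X and Y must have the same size (at least 2).");
    }
    $XMean = array_sum($X) / $n;
    $YMean = array_sum($Y) / $n;

    $SumXX = 0.0; // sum of squared deviations of X from its mean
    $SumXY = 0.0; // sum of cross products of the X and Y deviations
    for ($i = 0; $i < $n; $i++) {
        $SumXX += ($X[$i] - $XMean) * ($X[$i] - $XMean);
        $SumXY += ($X[$i] - $XMean) * ($Y[$i] - $YMean);
    }
    if ($SumXX == 0.0) {
        throw new InvalidArgumentException("All X values are identical; the slope is undefined.");
    }

    $Slope = $SumXY / $SumXX;         // m
    $YInt = $YMean - $Slope * $XMean; // b

    $PredictedY = array();
    for ($i = 0; $i < $n; $i++) {
        $PredictedY[$i] = $YInt + $Slope * $X[$i];
    }
    return array('Slope' => $Slope, 'YInt' => $YInt, 'PredictedY' => $PredictedY);
}

// Toy data lying exactly on y = 1 + 2x, so every prediction equals its observation.
$fit = fitLine(array(1, 2, 3, 4), array(3, 5, 7, 9));
printf("y = %01.2f + %01.2f x\n", $fit['YInt'], $fit['Slope']);
```

With real data the predictions will not match the observations exactly; the gaps between them are the errors that the class stores and summarizes.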

How to determine whether the fit is good

The SimpleLinearRegression class generates quite a few summary values. An important one is the T statistic, which measures how well the linear equation fits the data. If the fit is good, the T statistic tends to be large. If the T statistic is small, you should replace the linear equation with a model that assumes the mean of the Y values is the best predictor (the mean of a set of values is often a useful prediction for the next observation, which makes it the default model).

To test whether the T statistic is large enough to rule out the mean of the Y values as the best predictor, you need to calculate the probability of obtaining that T statistic by chance. If this probability is low, you can reject the null hypothesis that the mean is the best predictor and correspondingly conclude that the simple linear model fits the data well.

So, how do you calculate the probability of a T statistic?

Calculating the probability of the T statistic

Because PHP lacks native math routines for calculating the probability of a T statistic, I decided to hand this task off to the statistical computing package R (see www.r-project.org) to obtain the necessary values. I also want to draw attention to this package because:

R offers many ideas that PHP developers might emulate in a PHP math library.
With R, you can check whether the values obtained from a PHP math library agree with those obtained from a mature, free, open source statistics package.

The code in Listing 4 demonstrates how easy it is to hand off to R to obtain a value.

Listing 4. Handing off to the R statistical computing package to obtain a value

<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
class SimpleLinearRegression {

    var $RPath = "/usr/local/bin/R"; // Your path here

    function getStudentProb($T, $df) {
        $Probability = 0.0;
        $cmd = "echo 'dt($T, $df)' | $this->RPath --slave";
        $result = shell_exec($cmd);
        // R prints something like "[1] 0.05"; split off the "[1]" label.
        list($LineNumber, $Probability) = explode(" ", trim($result));
        return $Probability;
    }

    function getInverseStudentProb($alpha, $df) {
        $InverseProbability = 0.0;
        $cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave";
        $result = shell_exec($cmd);
        list($LineNumber, $InverseProbability) = explode(" ", trim($result));
        return $InverseProbability;
    }
}
?>

Note that the path to the R executable is set once and used in both functions. The first function returns the probability value associated with the T statistic, based on Student's t-distribution, while the second performs the inverse calculation and returns the T statistic that corresponds to the given alpha setting. The getStudentProb method is used to assess how well the linear model fits; the getInverseStudentProb method returns an intermediate value that is used to calculate confidence intervals.
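The body of getConfIntOfSlope is not shown in the article, so the sketch below is only an assumption about how the alpha-level t value might be used, based on the textbook interval "estimate plus or minus critical t times standard error". With the default 95% setting, the constructor computes Alpha as (1 + 0.95) / 2 = 0.975, so qt(0.975, df) is the two-sided critical value. The function name confIntOfSlope and the sample numbers are hypothetical.

```php
<?php
// Hypothetical helper: confidence interval for the slope estimate.
// $AlphaTVal is the critical t value for the chosen alpha and degrees of freedom.
function confIntOfSlope(float $Slope, float $SlopeStdErr, float $AlphaTVal): array {
    $margin = $AlphaTVal * $SlopeStdErr;
    return array($Slope - $margin, $Slope + $margin);
}

// Illustrative numbers only: slope 2.0, standard error 0.5, critical t value 2.0.
list($lower, $upper) = confIntOfSlope(2.0, 0.5, 2.0);
printf("Slope is between %01.2f and %01.2f\n", $lower, $upper);
```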

Because of limited space, I cannot explain every function in this class one by one. If you want to understand the terms and steps involved in simple linear regression analysis, I encourage you to consult an undergraduate statistics textbook.

Burnout study

To demonstrate how to use this class, I use data from a study of burnout in the public sector. Michael Leiter and Kimberly Ann Meechan studied the relationship between a measure of burnout called the Exhaustion Index and an independent variable called Concentration. Concentration refers to the proportion of people's social contacts that come from their work environment.

To study the relationship between the Exhaustion Index values and the Concentration values of the individuals in their sample, load these values into appropriately named arrays and instantiate the class with them. After instantiating the class, display some of the summary values it generates to assess how well the linear model fits the data.

Listing 5 shows the script that loads the data and displays the summary values:

Listing 5. Script that loads the data and displays summary values

<?php
// burnoutstudy.php
// Copyright 2003, Paul Meagher
// Distributed under GPL
include "simplelinearregression.php";

// Load data from burnout study
$Concentration = array(20, 60, 38, 88, 79, 87,
                       68, 12, 35, 70, 80, 92,
                       77, 86, 83, 79, 75, 81,
                       75, 77, 77, 77, 17, 85, 96);

$ExhaustionIndex = array(100, 525, 300, 980, 310, 900,
                         410, 296, 120, 501, 920, 810,
                         506, 493, 892, 527, 600, 855,
                         709, 791, 718, 684, 141, 400, 970);

$SLR = new SimpleLinearRegression($Concentration, $ExhaustionIndex);
$YInt = sprintf($SLR->format, $SLR->YInt);
$Slope = sprintf($SLR->format, $SLR->Slope);
$SlopeTVal = sprintf($SLR->format, $SLR->SlopeTVal);
$SlopeProb = sprintf("%01.6f", $SLR->SlopeProb);

// Display the summary values
echo "Equation: Exhaustion = $YInt + ($Slope * Concentration)\n";
echo "T: $SlopeTVal\n";
echo "Prob > T: $SlopeProb\n";
?>

Running the script from a Web browser produces the following output:

Equation: Exhaustion = -29.50 + (8.87 * Concentration)
T: 6.03
Prob > T: 0.000005

The last line of the output shows that the probability of obtaining such a large T value by chance is very low. We can conclude that a simple linear model predicts Exhaustion values better than the mean of the Exhaustion values alone.
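The equation and the T value in this output do not depend on R, so they can be cross-checked natively. The sketch below assumes the usual textbook definitions (error variance equals the sum of squared errors divided by n - 2, and the standard error of the slope equals the square root of the error variance divided by SumXX); the function name slopeTStatistic is illustrative and not part of the class. Under those assumptions it should agree with the first two lines of the output.

```php
<?php
// Data from the burnout study (Listing 5).
$Concentration = array(20, 60, 38, 88, 79, 87, 68, 12, 35, 70, 80, 92, 77,
                       86, 83, 79, 75, 81, 75, 77, 77, 77, 17, 85, 96);
$ExhaustionIndex = array(100, 525, 300, 980, 310, 900, 410, 296, 120, 501,
                         920, 810, 506, 493, 892, 527, 600, 855, 709, 791,
                         718, 684, 141, 400, 970);

// Hypothetical helper: slope, intercept, and T value of the slope.
function slopeTStatistic(array $X, array $Y): array {
    $n = count($X);
    $XMean = array_sum($X) / $n;
    $YMean = array_sum($Y) / $n;

    $SumXX = 0.0;
    $SumXY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $SumXX += ($X[$i] - $XMean) * ($X[$i] - $XMean);
        $SumXY += ($X[$i] - $XMean) * ($Y[$i] - $YMean);
    }
    $Slope = $SumXY / $SumXX;
    $YInt = $YMean - $Slope * $XMean;

    // Sum of squared differences between observed and predicted Y values.
    $SumSquaredError = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $Error = $Y[$i] - ($YInt + $Slope * $X[$i]);
        $SumSquaredError += $Error * $Error;
    }

    $DF = $n - 2; // two parameters (slope and intercept) are estimated
    $ErrorVariance = $SumSquaredError / $DF;
    $SlopeStdErr = sqrt($ErrorVariance / $SumXX);

    return array('Slope' => $Slope, 'YInt' => $YInt,
                 'SlopeTVal' => $Slope / $SlopeStdErr, 'DF' => $DF);
}

$check = slopeTStatistic($Concentration, $ExhaustionIndex);
printf("Equation: Exhaustion = %01.2f + (%01.2f * Concentration)\n",
       $check['YInt'], $check['Slope']);
printf("T: %01.2f (DF = %d)\n", $check['SlopeTVal'], $check['DF']);
```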

Knowing the Concentration of a person's workplace contacts lets you predict how much burnout they are likely to experience. The equation tells us that for every 1-unit increase in Concentration, the Exhaustion value of a person in the social services field increases by more than 8 units (the slope is 8.87). This is further evidence that, to reduce potential burnout, people in the social services field should consider making friends outside their workplace.

This is only a rough sketch of what these results might mean. To explore the implications of this data set fully, you would want to study the data in more detail to make sure that this interpretation is correct. In the next article I will discuss what other analyses should be performed.

What have you learned?

First, you don't have to be a rocket scientist to develop meaningful PHP-based math packages. By sticking to standard object-oriented techniques and explicitly using backward chaining as a problem-solving method, it is relatively easy to implement some of the more basic statistical procedures in PHP.

From a teaching standpoint, I think this exercise is useful if only because it requires you to think about a statistical test or routine at both higher and lower levels of abstraction. In other words, a good way to supplement your study of a statistical test or procedure is to implement it as an algorithm.

Implementing a statistical test often requires going beyond the information given and creatively solving and discovering problems. It is also a good way to discover gaps in your understanding of a subject.

On the downside, you saw that PHP lacks built-in support for sampling distributions, which most statistical tests require. You have to hand off to R to get these values, but you may not have the time or inclination to install R. Native PHP implementations of some common probability functions would solve this problem.
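As one hedged illustration of such a native function: Listing 4 asks R for dt(), which is R's density function for Student's t-distribution. The sketch below computes the same density in plain PHP for integer degrees of freedom, using the recurrence for the gamma function at whole and half-integer arguments. The function names logGammaHalf and studentDensity are illustrative, and the code has not been validated against R across its full range.

```php
<?php
// ln Gamma(k/2) for a positive integer k, built up from Gamma(1/2) = sqrt(pi)
// or Gamma(1) = 1 with the recurrence Gamma(x + 1) = x * Gamma(x).
function logGammaHalf(int $k): float {
    $x = ($k % 2 === 0) ? 1.0 : 0.5;
    $logGamma = ($k % 2 === 0) ? 0.0 : 0.5 * log(M_PI);
    $target = $k / 2;
    while ($x < $target - 1e-9) {
        $logGamma += log($x);
        $x += 1.0;
    }
    return $logGamma;
}

// Density of Student's t-distribution at $t with $df degrees of freedom.
// Computed in log space to avoid overflow for larger $df.
function studentDensity(float $t, int $df): float {
    if ($df < 1) {
        throw new InvalidArgumentException("Degrees of freedom must be at least 1.");
    }
    $logDensity = logGammaHalf($df + 1) - logGammaHalf($df)
                - 0.5 * log($df * M_PI)
                - (($df + 1) / 2) * log(1 + ($t * $t) / $df);
    return exp($logDensity);
}

// With 1 degree of freedom the density at 0 is 1 / pi.
printf("%01.6f\n", studentDensity(0.0, 1));
```

A cumulative probability or an inverse (quantile) function would need more work, for example numerical integration of this density, which is beyond this sketch.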

Another problem is that the class generates many intermediate and summary values, but the summary output does not actually take advantage of them. I have provided some rudimentary output, but it is neither sufficient nor well enough organized for you to interpret the results of the analysis fully. In fact, I have not yet worked out how output methods should be integrated into the class. This needs to be addressed.

Finally, making sense of data takes more than looking at summary values. You also need to understand how the individual data points are distributed. One of the best ways to do this is to plot your data in a chart. Again, I have not covered this here, but if you want to use this class to analyze real data, you will need to solve this problem.



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
