PHP: a data research tool for simple linear regression

Source: Internet
Author: User
   Concept

The basic goal behind simple linear regression modeling is to find the best-fitting straight line through a two-dimensional plane of paired X and Y values (that is, your X and Y measurements). Once this line has been found with the least-squares method, you can perform various statistical tests to determine how far the observed Y values deviate from it.

The linear equation (y = mx + b) has two parameters that must be estimated from the supplied X and Y data: the slope (m) and the y-axis intercept (b). Once these two parameters are estimated, you can plug observed X values into the linear equation to obtain the predicted Y values.

To estimate the m and b parameters with the least-squares method, we need to find the estimates of m and b that make the differences between the observed and predicted Y values, taken over all values of X, as small as possible. The difference between an observed value and its predicted value is called the error (y_i - (m x_i + b)). If we square each error and then sum these squared residuals, the result is the sum of squared prediction errors. Using the least-squares method to determine the best-fitting line means finding the estimates of m and b that minimize this sum.

Two basic approaches can be used to find estimates of m and b that satisfy the least-squares criterion. The first is a numerical search that tries different values of m and b and evaluates each pair to find the one with the smallest sum of squared errors. The second is to use calculus to derive equations for m and b. I am not going to discuss in depth the calculus involved in deriving these equations, but I did use the resulting analytical equations in the SimpleLinearRegression class to find m and b (see the getSlope() and getYIntercept() methods in the SimpleLinearRegression class).
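As an illustration of the analytical approach, here is a minimal sketch of the standard closed-form least-squares estimates, m = Sxy / Sxx and b = mean(Y) - m * mean(X). The function name estimateLine and the sample data are assumptions made for this example; this is not the code of the SimpleLinearRegression class, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper, not the article's SimpleLinearRegression class.
// Returns the least-squares slope (m) and y-intercept (b) for paired data.
// Both arrays are assumed to be list-style (keys 0..n-1).
function estimateLine(array $x, array $y): array
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        throw new InvalidArgumentException('Need at least two paired observations.');
    }

    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0; // sum of (x_i - meanX) * (y_i - meanY)
    $sxx = 0.0; // sum of (x_i - meanX)^2
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;
        $sxy += $dx * ($y[$i] - $meanY);
        $sxx += $dx * $dx;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All X values are identical.');
    }

    $m = $sxy / $sxx;          // slope
    $b = $meanY - $m * $meanX; // y-intercept

    return ['m' => $m, 'b' => $b];
}

// Made-up data lying exactly on y = 2x + 1.
$fit = estimateLine([1, 2, 3, 4], [3, 5, 7, 9]);
printf("m = %.3f, b = %.3f\n", $fit['m'], $fit['b']);
```

Because the sample points lie exactly on a line, the estimates recover its slope and intercept. With noisy data the same formulas give the line that minimizes the sum of squared errors.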

Even if you have equations for the least-squares estimates of m and b, it does not follow that plugging these parameters into the linear equation gives a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining prediction variance is acceptable.

We can use a statistical decision procedure to decide whether to reject the alternative hypothesis that a straight line fits the data. This procedure is based on computing a T statistic and using a probability function to find the probability of observing a value that large by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values, one of which is the T statistic, which can be used to measure how well the linear equation fits the data. If the fit is good, the T statistic tends to be large. If the T value is small, you should replace your linear equation with a default model, which assumes that the mean of the Y values is the best predictor (because the mean of a set of values is often a useful predictor of the next observation).

To test whether the T statistic is large enough to rule out using the mean of the Y values as the best predictor, you compute the probability of obtaining that T statistic by chance. If the probability is low, you can reject the null assumption that the mean is the best predictor and be correspondingly confident that a simple linear model fits the data well. (For more information about calculating the probability of the T statistic, see Part 1.)
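To make the quantity being tested concrete, the sketch below computes a T statistic for the slope using the usual textbook formula: the slope divided by its standard error, with n - 2 degrees of freedom. It assumes this is the same T statistic that SimpleLinearRegression reports. The function name slopeTStatistic and the sample data are invented for the example, and PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper; uses the standard t statistic for a regression
// slope, not code taken from the SimpleLinearRegression class.
function slopeTStatistic(array $x, array $y): array
{
    $n = count($x);
    if ($n < 3 || $n !== count($y)) {
        throw new InvalidArgumentException('Need at least three paired observations.');
    }

    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxx = 0.0;
    $sxy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sxx += ($x[$i] - $meanX) ** 2;
        $sxy += ($x[$i] - $meanX) * ($y[$i] - $meanY);
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All X values are identical.');
    }

    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    // Sum of squared errors around the fitted line.
    $sse = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sse += ($y[$i] - ($m * $x[$i] + $b)) ** 2;
    }

    $df = $n - 2;                              // degrees of freedom
    $standardError = sqrt(($sse / $df) / $sxx); // standard error of the slope
    if ($standardError == 0.0) {
        throw new RuntimeException('Perfect fit: the T statistic is undefined.');
    }

    return ['t' => $m / $standardError, 'df' => $df];
}

// Made-up sample data.
$result = slopeTStatistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("t = %.3f, df = %d\n", $result['t'], $result['df']);
```

The T value and degrees of freedom are the two inputs a Student T probability function needs in order to say how likely such a value is by chance.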

To return to the statistical decision procedure: it tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the alternative hypothesis of a linear model has to be established through theoretical and statistical arguments.

The data research tool built in this article implements the statistical decision procedure for a linear model (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish a linear model. Data research tools can be classified as decision-support tools that help knowledge workers carry out focused research on small to medium-sized data sets.

From a learning perspective, simple linear regression modeling is worth studying because it is a necessary step toward understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression lay the groundwork for understanding multiple regression, factor analysis, and time series.

Simple linear regression is also a versatile modeling technique. You can use it to model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations can linearize the data so that simple linear regression can be used to model it. The resulting linear model is expressed as a linear formula relating the transformed values.
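A small sketch of the transformation idea follows, assuming data that follows an exponential curve y = a * exp(k * x). Taking the natural log of Y gives ln(y) = ln(a) + k * x, which is linear in x. The helper fitLine and the generated data are hypothetical and exist only for this example.

```php
<?php
// Hypothetical helper (not from the article): ordinary least-squares fit.
// Returns [slope, intercept] for list-style arrays of equal length.
function fitLine(array $x, array $y): array
{
    $n = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sxy += ($x[$i] - $meanX) * ($y[$i] - $meanY);
        $sxx += ($x[$i] - $meanX) ** 2;
    }
    $slope = $sxy / $sxx;
    return [$slope, $meanY - $slope * $meanX];
}

// Made-up curved data generated from y = 3 * exp(0.5 * x).
$x = [0, 1, 2, 3, 4];
$y = array_map(function ($v) { return 3 * exp(0.5 * $v); }, $x);

// Log-transform Y so the relationship becomes ln(y) = ln(a) + k * x.
$logY = array_map('log', $y);
list($k, $lnA) = fitLine($x, $logY);

// Back-transform the intercept to recover a in y = a * exp(k * x).
$a = exp($lnA);
printf("k = %.3f, a = %.3f\n", $k, $a);
```

The fitted slope and intercept describe the transformed values, so the intercept has to be back-transformed with exp() before it can be read on the original scale.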

   Probability Functions

In the previous article, I used R to obtain the probability values, which sidestepped the problem of implementing the probability functions in PHP. I was not completely satisfied with this solution, so I began to look into what it would take to develop PHP-based probability functions.

I started by searching online for information and code. One source of both is the book Numerical Recipes in C. I re-implemented some of its probability function code (the gammln.c and betai.c functions) in PHP, but I was still not satisfied with the result. Compared with some other implementations, it seemed to take a bit more code. In addition, I also needed inverse probability functions.

Fortunately, I came across John Pezzullo's Interactive Statistical Calculation. John's site on probability distribution functions has all the functions I needed, implemented in JavaScript to make them easy to learn from.

I ported the Student T and Fisher F functions to PHP. I made some changes to the API to conform to the Java naming style and embedded all the functions in a class named Distribution. A nice feature of this implementation is the doCommonMath method, which is reused by all the functions in the library. Other tests that I did not take the trouble to implement (the normality and chi-square tests) also use the doCommonMath method.

Another aspect of this port is worth noting. In JavaScript, you can assign a dynamically determined value to an instance variable, for example:

            var PiD2 = pi() / 2            

You cannot do this in PHP. Only simple constant values can be assigned to instance variables where they are declared. Hopefully this will be resolved in PHP5.
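One common workaround, shown here as an assumption rather than something the Distribution class does (it calls pi() inline instead), is to declare the instance variable without a value and assign the computed value in the constructor. The class name HalfPiExample is hypothetical, and the sketch uses PHP 5 or later syntax.

```php
<?php
// Hypothetical class for illustration only.
class HalfPiExample
{
    // Declared without a value: a function call such as pi() is not
    // allowed in a property declaration.
    public $piD2;

    public function __construct()
    {
        // The dynamically determined value is assigned at construction time.
        $this->piD2 = pi() / 2;
    }
}

$example = new HalfPiExample();
printf("%.6f\n", $example->piD2); // prints 1.570796
```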

Note that the code in Listing 1 does not define any instance variables. This is because, in the JavaScript version, they are dynamically assigned values.

Listing 1. Implementing probability functions

    <?php
    // Distribution.php
    // Copyright John Pezullo
    // Released under same terms as PHP.
    // PHP Port and OOfying by Paul Meagher
    class Distribution {

        function doCommonMath($q, $i, $j, $b) {
            $zz = 1;
            $z  = $zz;
            $k  = $i;
            while ($k <= $j) {
                $zz = $zz * $q * $k / ($k - $b);
                $z  = $z + $zz;
                $k  = $k + 2;
            }
            return $z;
        }

        function getStudentT($t, $df) {
            $t  = abs($t);
            $w  = $t / sqrt($df);
            $th = atan($w);
            if ($df == 1) {
                return 1 - $th / (pi() / 2);
            }
            $sth = sin($th);
            $cth = cos($th);
            if (($df % 2) == 1) {
                return 1 - ($th + $sth * $cth * $this->doCommonMath($cth * $cth, 2, $df - 3, -1))
                    / (pi() / 2);
            } else {
                return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
            }
        }

        function getInverseStudentT($p, $df) {
            $v  = 0.5;
            $dv = 0.5;
            $t  = 0;
            while ($dv > 1e-6) {
                $t  = (1 / $v) - 1;
                $dv = $dv / 2;
                if ($this->getStudentT($t, $df) > $p) {
                    $v = $v - $dv;
                } else {
                    $v = $v + $dv;
                }
            }
            return $t;
        }

        function getFisherF($f, $n1, $n2) {
            // implemented but not shown
        }

        function getInverseFisherF($p, $n1, $n2) {
            // implemented but not shown
        }
    }
    ?>

   Output Methods

Now that the probability functions are implemented in PHP, the only remaining challenge in developing a PHP-based data research tool is to design methods for displaying the analysis results.

A simple solution is to display the values of all instance variables on the screen as needed. In the first article, this is what I did when showing the linear equation, T value, and T probability for the Burnout Study. Being able to access a specific value for a specific purpose is helpful, and SimpleLinearRegression supports this usage.

However, another approach to outputting results is to group parts of the output systematically. If you study the output of the major statistical software packages used for regression analysis, you will find that they tend to group their output in the same way. They typically have a Summary Table, an Analysis of Variance table, a Parameter Estimates table, and an R Value. Similarly, I have created some output methods with the following names:
  • ShowSummaryTable()
