A data exploration tool that addresses the output and probability function shortcomings

Part 1 of this series ended by noting three elements missing from the simple linear regression class. In this article, author Paul Meagher uses PHP-based probability functions to fill these gaps, demonstrates how to integrate output methods into the SimpleLinearRegression class, and creates graphical output. He addresses these problems by building a data exploration tool designed to dig into the information contained in small and medium-sized datasets. (In Part 1, the author demonstrated how to develop and implement the core of a simple linear regression algorithm package using PHP as the implementation language.)

In the first installment of this two-part series, "Simple linear regression with PHP," I explained why a math library is useful for PHP. I also demonstrated how to develop and implement the core portion of a simple linear regression algorithm using PHP as the implementation language.

The goal of this article is to show you how to build a substantial data exploration tool using the SimpleLinearRegression class discussed in Part 1.

Brief review: The concept
The basic goal behind simple linear regression modeling is to find the best-fitting line in a two-dimensional plane made up of pairs of x and y values (that is, x and y measurements). Once this line is found using the least-squares method, a variety of statistical tests can be performed to determine how far the line deviates from the observed y values.

A linear equation (y = mx + b) has two parameters that must be estimated from the supplied x and y data: the slope (m) and the y-intercept (b). Once these two parameters are estimated, you can plug the observed x values into the linear equation and examine the predicted y values that the equation generates.

To estimate the m and b parameters using the least-squares method, you find the estimates of m and b that minimize the difference between the observed and predicted y values across all values of x. The difference between an observed and a predicted value is called the error (y_i - (m*x_i + b)). If you square each error value and then sum these squared residuals, the result is a number called the sum of squared errors. Using the least-squares method to determine the best-fitting line involves finding the estimates of m and b that minimize this sum.

Two basic methods can be used to find the estimates of m and b that satisfy the least-squares criterion. The first is a numerical search procedure that tries different values of m and b, evaluates them, and eventually settles on the estimates that produce the smallest sum of squared errors. The second is to use calculus to derive equations for estimating m and b. I won't delve into the calculus involved in deriving these equations, but I did use these analytic equations in the SimpleLinearRegression class to find the least-squares estimates of m and b (see the getSlope() and getYIntercept() methods of the SimpleLinearRegression class).
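As a rough illustration of the analytic approach, the sketch below computes the slope, the y-intercept, and the sum of squared errors with the standard closed-form least-squares equations. The function name sketchLeastSquares and the sample data are my own assumptions for illustration; this is not the code of the SimpleLinearRegression class.

```php
<?php
// Hypothetical helper for illustration; not the article's SimpleLinearRegression class.
// Returns the least-squares slope (m), y-intercept (b), and sum of squared errors.
function sketchLeastSquares(array $x, array $y): array
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two equal-length lists of at least two values.');
    }
    $x = array_values($x);
    $y = array_values($y);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0; // sum of cross-products of deviations from the means
    $sxx = 0.0; // sum of squared deviations of x from its mean
    for ($i = 0; $i < $n; $i++) {
        $sxy += ($x[$i] - $meanX) * ($y[$i] - $meanY);
        $sxx += ($x[$i] - $meanX) ** 2;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All x values are identical; the slope is undefined.');
    }

    $m = $sxy / $sxx;          // analytic estimate of the slope
    $b = $meanY - $m * $meanX; // analytic estimate of the y-intercept

    $sse = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $error = $y[$i] - ($m * $x[$i] + $b); // observed minus predicted
        $sse += $error ** 2;
    }
    return ['slope' => $m, 'intercept' => $b, 'sse' => $sse];
}

// Made-up data that lies exactly on y = 2x + 1.
$fit = sketchLeastSquares([1, 2, 3, 4], [3, 5, 7, 9]);
printf("y = %.2fx + %.2f, SSE = %.4f\n", $fit['slope'], $fit['intercept'], $fit['sse']);
```

Because the sample points lie exactly on a line, the sum of squared errors is zero here; with real measurements it would be positive.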

Even when you have equations that yield the least-squares estimates of m and b, plugging those parameters into a linear equation does not guarantee a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining prediction error is acceptable.

You can use a statistical decision procedure to decide whether to reject the alternative hypothesis that a straight line fits the data. This procedure is based on computing a T statistic and uses a probability function to find the probability of observing so large a value by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values, one of which is the T statistic, which measures how well the linear equation fits the data. If the fit is good, the T statistic tends to be large. If the T value is small, you should replace your linear equation with a default model that assumes the mean of the y values is the best predictor (since the mean of a set of values is often a useful predictor of the next observation).

To test whether the T statistic is large enough that you need not fall back on the mean of the y values as the best predictor, you need to compute the probability of obtaining the T statistic by chance. If that probability is low, you can reject the null hypothesis that the mean is the best predictor and be confident that the simple linear model fits the data well. (See Part 1 for more on computing the probability of the T statistic.)
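To make the T statistic concrete, here is a minimal sketch that uses the textbook formula for the slope, t = m / SE(m), where SE(m) is the square root of the mean squared error divided by the sum of squared x deviations. It is an assumption on my part that the SimpleLinearRegression class computes its T value this way; the function name and data are made up.

```php
<?php
// Hypothetical helper: T statistic for the slope of a simple linear regression.
function sketchSlopeT(array $x, array $y): float
{
    $n = count($x);
    if ($n < 3 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two equal-length lists of at least three values.');
    }
    $x = array_values($x);
    $y = array_values($y);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sxy += ($x[$i] - $meanX) * ($y[$i] - $meanY);
        $sxx += ($x[$i] - $meanX) ** 2;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All x values are identical.');
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    $sse = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sse += ($y[$i] - ($m * $x[$i] + $b)) ** 2;
    }
    if ($sse == 0.0) {
        throw new InvalidArgumentException('Perfect fit; the T statistic is undefined.');
    }

    $mse = $sse / ($n - 2);             // error variance with n - 2 degrees of freedom
    $standardError = sqrt($mse / $sxx); // standard error of the slope
    return $m / $standardError;
}

printf("t = %.4f\n", sketchSlopeT([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]));
```

The resulting value would then be passed, together with n - 2 degrees of freedom, to a Student T probability function to obtain the probability discussed above.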

Returning to the statistical decision procedure: it tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the case for a linear model has to be established through both theoretical and statistical arguments.

The data exploration tool that you will build implements the statistical decision procedure for linear models (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish a linear model. A data exploration tool can be classed as a decision support tool that helps knowledge workers study patterns in small and medium-sized datasets.

From a learning standpoint, simple linear regression modeling is worth studying because it is the gateway to understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression lay a good foundation for understanding multiple regression, factor analysis, and time series analysis.

Simple linear regression is also a versatile modeling technique. You can use it to model curvilinear data by transforming the raw data, typically with a logarithmic or power transformation. These transformations can linearize the data so that it can be modeled with simple linear regression. The resulting linear model is expressed as a linear formula relating the transformed values.
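As one example of such a transformation, the following sketch fits a power curve y = a * x^b by regressing log(y) on log(x) and then undoing the transformation. The function name, the choice of a power model, and the data are assumptions for illustration, and the approach only applies when every x and y value is strictly positive.

```php
<?php
// Hypothetical helper: fit y = a * x^b by simple linear regression on logged values.
// Assumes all x and y values are strictly positive.
function sketchPowerFit(array $x, array $y): array
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two equal-length lists of at least two values.');
    }
    $lx = array_map('log', array_values($x)); // transformed x values
    $ly = array_map('log', array_values($y)); // transformed y values
    $meanX = array_sum($lx) / $n;
    $meanY = array_sum($ly) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sxy += ($lx[$i] - $meanX) * ($ly[$i] - $meanY);
        $sxx += ($lx[$i] - $meanX) ** 2;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All x values are identical.');
    }
    $slope = $sxy / $sxx;
    $intercept = $meanY - $slope * $meanX;

    // The fitted line is log(y) = log(a) + b * log(x), so a = exp(intercept).
    return ['a' => exp($intercept), 'b' => $slope];
}

// Made-up data generated from y = 3 * x^2.
$fit = sketchPowerFit([1, 2, 4, 8], [3, 12, 48, 192]);
printf("y = %.2f * x^%.2f\n", $fit['a'], $fit['b']);
```

Note that the model's slope and intercept describe the logged values; the final step converts them back into parameters you can interpret on the original scale.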

Probability functions

In the previous article, I sidestepped the problem of implementing probability functions in PHP by relying on R to obtain the probability values. I wasn't completely satisfied with this solution, so I began to investigate what it would take to develop probability functions in PHP.

I started searching the Internet for information and code. One source of both is the probability functions in the book Numerical Recipes in C (http://www.library.cornell.edu/nr/bookcpdf.html). I re-implemented some of the probability function code (the gammln.c and betai.c functions) in PHP, but I was not satisfied with the results. The code seemed somewhat longer than some other implementations. In addition, I also needed inverse probability functions.

Luckily, I stumbled upon John Pezzullo's Interactive Statistical Calculation Pages. John's site has a page on probability distribution functions with all the functions I needed, implemented in JavaScript to make them easy to learn from.

I ported the Student T and Fisher F functions to PHP. I changed the API a bit to conform to Java naming style and embedded all the functions in a class called Distribution. A nice feature of this implementation is the doCommonMath method, which is reused by all the functions in this library. Other tests that I did not take the effort to implement (the normal and chi-square tests) also use the doCommonMath method.

Another aspect of this port is also worth noting. In JavaScript, you can assign a dynamically computed value to an instance variable, for example:

var PiD2 = pi() / 2

You cannot do this in PHP. You can only assign simple constant values to instance variables. Hopefully this shortcoming will be resolved in PHP5.

Note that the code in Listing 1 does not define instance variables. This is because, in the JavaScript version, they are dynamically assigned values.
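One common workaround, sketched here, is to declare the property without a value and assign the computed value in the constructor. The class name DistributionSketch is hypothetical and the property name simply mirrors the JavaScript variable; this is not the Distribution class from the article.

```php
<?php
// Hypothetical skeleton showing constructor-time initialization of a computed value.
class DistributionSketch
{
    // Declared without a default: a function call such as pi() / 2
    // is not accepted as a property initializer.
    public $PiD2;

    public function __construct()
    {
        // Assign the dynamically computed value when the object is created.
        $this->PiD2 = pi() / 2;
    }
}

$d = new DistributionSketch();
echo $d->PiD2, "\n";
```

The trade-off is that the value is recomputed for every instance, which is negligible for a single division.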

Listing 1. Implementing the probability functions
<?php

// distribution.php

// Copyright John Pezzullo
// Released under same terms as PHP.
// PHP Port and OO'fying by Paul Meagher

class Distribution {

  function getFisherF($f, $n1, $n2) {
    // Implemented but not shown
  }

  function getInverseFisherF($p, $n1, $n2) {
    // Implemented but not shown
  }

}
?>
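Since Listing 1 omits the function bodies, the sketch below shows what a two-tailed Student T probability function can look like. It uses a standard finite trigonometric series for the t distribution with integer degrees of freedom. The function names are hypothetical, and it is only my assumption that the shared helper plays a role similar to the doCommonMath method described above; treat it as an illustration rather than the article's Distribution class.

```php
<?php
// Shared series helper (hypothetical name): sums terms from index $i to $j in steps of 2.
function sketchCommonMath(float $q, int $i, int $j, int $b): float
{
    $zz = 1.0;
    $z = $zz;
    for ($k = $i; $k <= $j; $k += 2) {
        $zz = $zz * $q * $k / ($k - $b);
        $z += $zz;
    }
    return $z;
}

// Two-tailed probability of observing |T| >= |t| with $n degrees of freedom.
function sketchStudentT(float $t, int $n): float
{
    if ($n < 1) {
        throw new InvalidArgumentException('Degrees of freedom must be at least 1.');
    }
    $halfPi = M_PI / 2;
    $theta = atan(abs($t) / sqrt($n));
    if ($n === 1) {
        return 1 - $theta / $halfPi;
    }
    $sinTheta = sin($theta);
    $cosTheta = cos($theta);
    if ($n % 2 === 1) {
        // Odd degrees of freedom
        return 1 - ($theta + $sinTheta * $cosTheta
            * sketchCommonMath($cosTheta * $cosTheta, 2, $n - 3, -1)) / $halfPi;
    }
    // Even degrees of freedom
    return 1 - $sinTheta * sketchCommonMath($cosTheta * $cosTheta, 1, $n - 3, -1);
}

printf("p = %.5f\n", sketchStudentT(2.12132, 3));
```

Two easy sanity checks: a T value of 0 gives a probability of 1 for any degrees of freedom, and with 1 degree of freedom a T value of 1 gives 0.5.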

Output methods
Now that you have implemented the probability functions in PHP, the only remaining challenge in developing a PHP-based data exploration tool is to design methods for displaying the results of the analysis.

The simple solution is to display the values of all instance variables to the screen as needed. In the first article, I did just that when I showed the linear equation, T value, and T probability for the burnout study. Being able to access specific values for specific purposes is helpful, and SimpleLinearRegression supports this kind of usage.

Another approach to outputting results, however, is to systematically group parts of the output. If you study the output of the major statistical packages used for regression analysis, you will find that they tend to group their output in the same way. They typically include a Summary Table, an Analysis of Variance table, a Parameter Estimates table, and R Values. Similarly, I created output methods that correspond to each of these groupings.

I also have a method for displaying the linear prediction formula (getFormula()). Many statistical packages do not output the formula but instead expect the user to construct it from the output of the methods above. This is partly because the final form of the formula you use to model your data may differ from the default formula for the following reasons:

1. The y-intercept has no meaningful interpretation.
2. The input values may have been transformed, and you may need to untransform them to arrive at the final interpretation.

All of these methods assume that the output medium is a Web page. Considering that you might want to output these summary values in media other than the Web, I decided to wrap these output methods in a class that inherits from the SimpleLinearRegression class. The code in Listing 2 is intended to demonstrate the general logic of the output class. To make that general logic stand out, the code implementing the various show methods has been removed.

Listing 2. Demonstrating the general logic of the output class
<?php

// html.php

// Copyright 2003, Paul Meagher
// Distributed under GPL

include_once "slr/simplelinearregression.php";

class SimpleLinearRegressionHTML extends SimpleLinearRegression {

  // Constructor and show methods not shown

}
?>

The constructor for this class is just a wrapper around the SimpleLinearRegression class constructor. This means that if you want to display HTML output for a SimpleLinearRegression analysis, you should instantiate the SimpleLinearRegressionHTML class rather than instantiating the SimpleLinearRegression class directly. The advantage is that the SimpleLinearRegression class is not cluttered with many unused methods, and you are freer to define classes for other output media (perhaps implementing the same API for different media types).

Graphical output
So far, the output methods you have implemented display summary values in HTML format. It is also appropriate to display scatter plots or line plots of the data in GIF, JPEG, or PNG format.

Rather than writing the code to generate line and scatter plots myself, I decided it was best to use a PHP-based graphics library called JpGraph. JpGraph is being actively developed by Johan Persson, whose project website describes it this way:

Whether for "quick and dirty" graphs with minimal code or for complex professional graphs that require very fine-grained control, JpGraph makes drawing easier. JpGraph is equally suited to scientific and business types of graphs.

The JpGraph distribution includes a large number of example scripts that can be customized for specific needs. Using JpGraph for the data exploration tool was as simple as finding an example script that did something similar to what I needed and then adapting it to my specific requirements.

The script in Listing 3 is extracted from the sample data exploration tool (explore.php) and demonstrates how to invoke the library and populate the line and scatter classes with data from the SimpleLinearRegression analysis. The comments in this code were written by Johan Persson (the JpGraph code base is well documented).

Listing 3. Excerpt from the sample data exploration tool explore.php
<?php

// Snippet extracted from explore.php script

include ("jpgraph/jpgraph.php");
include ("jpgraph/jpgraph_scatter.php");
include ("jpgraph/jpgraph_line.php");

// Create the graph
$graph = new Graph(300, 200, 'auto');
$graph->SetScale("linlin");

The data exploration script
The data exploration tool consists of a single script (explore.php) that invokes methods of the SimpleLinearRegressionHTML class and the JpGraph library.

The script uses simple processing logic. The first part of the script performs basic validation on the submitted form data. If the form data passes validation, the second part of the script is executed.

The second part of the script contains the code that analyzes the data and displays the summary results in HTML and graphical formats. The basic structure of the explore.php script is shown in Listing 4:

<?php

// Display entry data entry form if variables not set

if (empty($title) or empty($x_name) or empty($x_values) or
    empty($y_name) or empty($conf_int) or empty($y_values) or
    ($numX != $numY)) {

  // Omitted code for displaying entry form

} else {

  include_once "slr/simplelinearregressionhtml.php";
  $SLR = new SimpleLinearRegressionHTML($X, $Y, $conf_int);

  include ("jpgraph/jpgraph.php");
  include ("jpgraph/jpgraph_scatter.php");
  include ("jpgraph/jpgraph_line.php");

  // The code for displaying the graphics is inline in the
  // explore.php script. The code for these two line plots
  // finishes off the script:

  // Omitted code for displaying scatter plus line plot
  // Omitted code for displaying residuals plot

}

?>
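The excerpt above uses $X, $Y, $numX, and $numY without showing where they come from. One plausible way to derive them from the submitted text fields is sketched below. The helper name sketchParseValues, the comma-separated input format, and the sample strings are all assumptions; the actual explore.php script may do this differently.

```php
<?php
// Hypothetical helper: turn "1, 2, 3" into [1.0, 2.0, 3.0].
// Returns null if any piece is not numeric.
function sketchParseValues(string $raw): ?array
{
    $values = [];
    foreach (explode(',', $raw) as $piece) {
        $piece = trim($piece);
        if (!is_numeric($piece)) {
            return null;
        }
        $values[] = (float) $piece;
    }
    return $values;
}

// Assumed field contents; in a real script these would come from the submitted form.
$x_values = '1, 2, 3, 4';
$y_values = '2.1, 3.9, 6.2, 8.1';

$X = sketchParseValues($x_values);
$Y = sketchParseValues($y_values);
$numX = $X === null ? 0 : count($X);
$numY = $Y === null ? 0 : count($Y);

// Mirror the kind of check in the listing: both lists must parse and have equal length.
$isValid = $numX > 0 && $numX === $numY;
echo $isValid ? "run analysis\n" : "show entry form\n";
```

Rejecting non-numeric input up front keeps bad values from reaching the regression code, where they would silently distort the sums.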

The fire loss study
To demonstrate how to use the data exploration tool, I will use data from a hypothetical fire loss study. The study relates the amount of fire loss in major residential fires to the distance from the nearest fire station. Insurance companies, for example, would be interested in this relationship for the purpose of setting premiums.

The data for this study are shown in the input screen in Figure 1.

Figure 1. Input screen showing the study data

Once the data is submitted, it is analyzed and the results of the analysis are displayed. The first result set displayed is the Table Summary, shown in Figure 2.

Figure 2. The Table Summary is the first result set displayed

The Table Summary displays the input data in tabular form along with additional columns showing, for each observed x value, the predicted y value, the difference between the predicted and observed y values, and the lower and upper bounds of the confidence interval for the predicted y value.

Figure 3 shows the three high-level summary tables that follow the Table Summary.

Figure 3. Three high-level summary tables shown after the Table Summary

The Analysis of Variance table shows how the variation in the y values is partitioned into two major sources: the variation explained by the model (see the Model row) and the variation the model cannot explain (see the Error row). A large F value means that the linear model captures most of the variation in the y measurements. This table is more useful in a multiple regression setting, where each independent variable has its own row in the table.
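The partition described above can be sketched in a few lines. The function below computes the model and error sums of squares and the F value for a simple linear regression using the usual textbook definitions; the function name and data are made up, and I am assuming, not asserting, that the tool's Analysis of Variance table is built from the same quantities.

```php
<?php
// Hypothetical helper: partition the variation in y into model and error components.
function sketchAnova(array $x, array $y): array
{
    $n = count($x);
    if ($n < 3 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two equal-length lists of at least three values.');
    }
    $x = array_values($x);
    $y = array_values($y);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sxy += ($x[$i] - $meanX) * ($y[$i] - $meanY);
        $sxx += ($x[$i] - $meanX) ** 2;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All x values are identical.');
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    $ssModel = 0.0; // variation explained by the line
    $ssError = 0.0; // variation the line cannot explain
    for ($i = 0; $i < $n; $i++) {
        $predicted = $m * $x[$i] + $b;
        $ssModel += ($predicted - $meanY) ** 2;
        $ssError += ($y[$i] - $predicted) ** 2;
    }
    if ($ssError == 0.0) {
        throw new InvalidArgumentException('Perfect fit; F is undefined.');
    }

    $msModel = $ssModel / 1;        // one predictor, so 1 degree of freedom
    $msError = $ssError / ($n - 2); // n - 2 degrees of freedom
    return ['ss_model' => $ssModel, 'ss_error' => $ssError, 'f' => $msModel / $msError];
}

$table = sketchAnova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("SS model = %.2f, SS error = %.2f, F = %.2f\n",
    $table['ss_model'], $table['ss_error'], $table['f']);
```

With a single predictor, the F value equals the square of the slope's T value, which is why the table adds little beyond the T test in the simple case.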

The Parameter Estimates table shows the estimated y-intercept (Intercept) and slope (Slope). Each row includes a T value and the probability of observing a T value that extreme (see the Prob > T column). The Prob > T for the slope can be used to decide whether to reject the linear model.

If the probability of the T value is less than 0.05 (or a similarly small threshold), you can reject the null hypothesis, because a value this extreme is unlikely to be observed by chance. Otherwise, you must retain the null hypothesis.

In the fire loss study, the probability of obtaining a T value as large as 12.57 by chance rounds to 0.00000. This means that the linear model is a useful predictor (better than the mean of the y values) of the y values corresponding to the range of x values observed in the study.

The final report shows the correlation coefficients, or R values. They can be used to assess how well the linear model fits the data. A high R value indicates a good fit.
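For reference, the correlation coefficient can be computed directly from the same sums of deviations used for the slope. The sketch below uses the standard Pearson formula; the function name and data are assumptions for illustration, not code from the tool.

```php
<?php
// Hypothetical helper: Pearson correlation coefficient r between x and y.
function sketchCorrelation(array $x, array $y): float
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two equal-length lists of at least two values.');
    }
    $x = array_values($x);
    $y = array_values($y);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    $syy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;
        $dy = $y[$i] - $meanY;
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    if ($sxx == 0.0 || $syy == 0.0) {
        throw new InvalidArgumentException('r is undefined when x or y is constant.');
    }
    return $sxy / sqrt($sxx * $syy);
}

$r = sketchCorrelation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("r = %.4f, r squared = %.4f\n", $r, $r ** 2);
```

The squared value is the proportion of the variation in y that the line accounts for, which ties the R values back to the Analysis of Variance table.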

Each summary report provides answers to various analytical questions about the relationship between the linear model and the data. Consult the textbooks by Hamilton, Neter, or Pedhazur for more advanced treatments of regression analysis (see Resources).

The final report elements displayed are the scatter plot and line plot of the data, shown in Figure 4.

Figure 4. Final report elements: scatter plot and line plot

Most people are familiar with reading a line plot (the first graph in this set), so I won't comment on it other than to say that the JpGraph library can produce high-quality scientific graphs for the Web. It also does a good job when you feed it scatter or line data.

The second graph relates the residuals (observed y minus predicted y) to your predicted y values. This is an example of the kind of graph used by advocates of Exploratory Data Analysis (EDA) to help maximize the analyst's ability to detect and understand patterns in the data. An expert can use this graph to answer questions about:

Possible outliers or overly influential cases
Possible curvilinear relationships (use a transformation?)
Non-normal residual distributions
Non-constant error variance, or heteroscedasticity

You can easily extend this data exploration tool to generate more types of graphs (histograms, box plots, and quartile plots) that are standard EDA tools.

Math library architecture
My hobby interest in mathematics has kept me interested in math libraries over the past several months. This exploration has driven me to think about how to organize my code base so that it can accommodate future growth.

For the time being, I have adopted the directory structure in Listing 5:

Listing 5. A directory structure that is easy to grow
phpmath/

For example, future work on multiple regression will involve extending the library to include a matrix directory to hold the PHP code that performs matrix operations (which more advanced forms of regression analysis require). I will also create an mr directory to hold the PHP code that implements the input methods, logic, and output methods for multiple regression analysis.

Note that this directory structure includes a temp directory. You must set the permissions on this directory so that the explore.php script can write its output graphs to it. Keep this in mind when you try to install the phpmath_002.tar.gz source code. Also, please read the instructions for installing JpGraph on the JpGraph project website (see Resources).

Finally, you can move all the software classes to a location outside the Web root if you do the following:

Make a global PHP_MATH constant available that points to the non-Web-root location, and
Ensure that this defined constant is prefixed to all required or included file paths.

In the future, the PHP_MATH setting will be handled through a configuration file for the entire PHP math library.
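A minimal sketch of the two steps above might look like this. The directory /usr/local/lib/phpmath/ and the helper sketchPhpMathPath are assumptions chosen for illustration; only the idea of prefixing a defined constant to include paths comes from the text.

```php
<?php
// Assumed location outside the Web root; adjust to your own installation.
if (!defined('PHP_MATH')) {
    define('PHP_MATH', '/usr/local/lib/phpmath/');
}

// Hypothetical helper that prefixes the constant to a library-relative path.
function sketchPhpMathPath(string $relative): string
{
    return rtrim(PHP_MATH, '/') . '/' . ltrim($relative, '/');
}

// In a real script you would then write, for example:
// include_once sketchPhpMathPath('slr/simplelinearregressionhtml.php');
echo sketchPhpMathPath('slr/simplelinearregressionhtml.php'), "\n";
```

Routing every include through one constant means the library can be relocated by changing a single definition.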

What have you learned?
In this article, you learned how to use the SimpleLinearRegression class to develop a data exploration tool for small and medium-sized datasets. Along the way, I also developed native probability functions for use with the SimpleLinearRegression class and extended the class with HTML output methods and graph generation code based on the JpGraph library.

From a learning standpoint, simple linear regression modeling deserves further study because it has proven to be the gateway to understanding more advanced forms of statistical modeling. You will benefit from a solid understanding of simple linear regression before delving into more advanced techniques such as multiple regression or multivariate analysis of variance.

Even though simple linear regression uses only one variable to describe or predict the variation in another, looking for simple linear relationships among all the study variables is often the first step in exploratory data analysis. Just because the data is multivariate does not mean you have to study it with multivariate tools. In fact, starting with a basic tool such as simple linear regression is a good way to begin exploring patterns in the data.

This series has examined two applications of simple linear regression analysis. In this article, I studied the strong linear relationship between "distance to fire station" and "fire loss." In the first article, I studied the linear relationship between "social concentration" and a measure called the "burnout index," which was comparatively weak but still evident. (As an exercise, it might be interesting to re-examine the noisier data from the first study with the data exploration tool discussed in this article. You may notice that the y-intercept is negative, which means that when "social concentration" is 0, the predicted burnout index is -29.50. Does that make sense? When modeling a phenomenon, you should ask yourself whether the equation should include an optional y-intercept and, if so, what role the y-intercept plays in the linear equation.)

Further study of simple linear regression might include these topics:

* When you might want to omit the intercept from your equation, and the alternative calculation formulas you can use when you do
* When and how to linearize data using power, logarithmic, and other transformations so that it can be modeled with simple linear regression
* Other visualization methods that can be used to assess the adequacy of your modeling assumptions and gain clearer insight into patterns in the data
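To give a flavor of the first topic, here is a sketch of the standard least-squares slope for a line forced through the origin (y = mx), where the estimate is the sum of x*y divided by the sum of x squared. The function name and data are made up for illustration.

```php
<?php
// Hypothetical helper: least-squares slope for a model with no intercept (y = m * x).
function sketchSlopeThroughOrigin(array $x, array $y): float
{
    $n = count($x);
    if ($n < 1 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two equal-length, non-empty lists.');
    }
    $x = array_values($x);
    $y = array_values($y);

    $sumXY = 0.0;
    $sumXX = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sumXY += $x[$i] * $y[$i]; // raw products, not deviations from the means
        $sumXX += $x[$i] * $x[$i];
    }
    if ($sumXX == 0.0) {
        throw new InvalidArgumentException('All x values are zero; the slope is undefined.');
    }
    return $sumXY / $sumXX;
}

printf("y = %.4fx\n", sketchSlopeThroughOrigin([1, 2, 3], [2, 4, 6.3]));
```

Unlike the usual formulas, this one works with raw values instead of deviations from the means, so other summary statistics (such as the error degrees of freedom) change as well.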

These are some of the more advanced topics for students of simple linear regression. Resources contains links to several articles on advanced topics that you can consult for more information about regression analysis.

The standard PHP installation provides many of the resources needed to develop serious math-based applications. I hope this series of articles inspires other developers to implement math routines in PHP for the fun of it, for the technical challenge, or for the learning experience.

Related download: the source code used in this article

Resources
1. See the popular university textbook Statistics, 9th edition, by James T. McClave and Terry Sincich (Prentice-Hall, online), which is the source of the algorithm steps and the "burnout study" example used in this series.
2. Check out the PEAR repository, which currently contains a small number of low-level PHP math classes. Eventually, it would be nice to see PEAR contain packages that implement standard higher-level numerical methods (such as SimpleLinearRegression, MultipleRegression, TimeSeries, ANOVA, FactorAnalysis, FourierAnalysis, and others).
3. View all the source code for the author's SimpleLinearRegression class.
4. Learn about the Numerical Python project, which extends Python with a powerful scientific array language and sophisticated indexing facilities. With this extension, math operations come close to what you would expect from a compiled language.
5. Explore the many math resources available for Perl, including the CPAN math modules, the modules in the algorithm section of CPAN, and the Perl Data Language (PDL), which is designed to give Perl the ability to compactly store and quickly manipulate large N-dimensional data arrays.
6. For more on John Chambers' S programming language, see the links to his publications and his various research projects at Bell Labs. You can also learn about the 1998 ACM award for language design.
7. R is a language and environment for statistical computing and graphics, similar to the award-winning S system. R provides statistical and graphical techniques such as linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, and so on. Learn about R at the R Project home page.
8. If you are new to PHP, read Amol Hatwar's developerWorks series "Developing robust code in PHP": "Part 1: A farsighted introduction" (August 2002), "Part 2: Effective use of variables" (September 2002), and "Part 3: Writing reusable functions" (November 2002).
9. Visit John Pezzullo's excellent site, which is devoted to Web pages that perform statistical calculations. The PHP-based probability functions are based on the code found on John's probability functions page.
10. Visit the Digital Library of Mathematical Functions for more information about the Handbook of Mathematical Functions by M. Abramowitz and I. A. Stegun (also known as AMS55).
11. Visit the JpGraph site for a wealth of information about PHP's premier OO graphics library.
12. Read the Engineering Statistics Handbook, published by the National Institute of Standards and Technology (NIST), which contains several very good chapters on exploratory data analysis.
13. If you are interested in learning more about regression topics, try the following useful references:

L. C. Hamilton (1992). Regression with Graphics. Pacific Grove, California: Brooks/Cole Publishing Company.
J. Neter, M. H. Kutner, and W. Wasserman (1990). Applied Linear Regression Models (3rd edition). Chicago: Irwin.
E. J. Pedhazur (1982). Multiple Regression in Behavioral Research. New York, New York: Holt, Rinehart and Winston.

14. Read Cameron Laird's article "Open source in the biosciences." PHP needs better math tools to participate in this growing market (developerWorks, November 2002).
15. Check out Rweb, a Web-based interface to R.
