A data research tool for addressing output and probability function deficiencies

   Brief review: Concept
The basic goal behind simple linear regression modeling is to find the best-fitting straight line through a two-dimensional plane of paired X and Y values (that is, X and Y measurements). Once this line is found using the least-squares method, various statistical tests can be performed to determine how far the line deviates from the observed Y values.
  
The linear equation (y = mx + b) has two parameters that must be estimated from the supplied X and Y data: the slope (m) and the y-intercept (b). Once these two parameters are estimated, you can plug observed X values into the linear equation and examine the predicted Y values that the equation generates.
  
To estimate the m and b parameters with the least-squares method, we need to find the estimates of m and b that minimize the difference between the observed and predicted values of Y across all values of X. The difference between an observed value and its predicted value is called the error (y_i - (m*x_i + b)). If you square each error value and then sum these squared residuals, the result is a number called the sum of squared prediction errors. Using the least-squares method to determine the best-fitting line means finding the estimates of m and b that make this sum as small as possible.
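As a minimal sketch of this calculation, the standalone function below squares each error and sums the results for a candidate line. The name sumSquaredErrors and the sample data are made up for illustration; they are not part of the SimpleLinearRegression class.

```php
<?php
// Hypothetical helper for illustration; not part of SimpleLinearRegression.
// Returns the sum of squared errors for the candidate line y = m*x + b.
function sumSquaredErrors(array $X, array $Y, float $m, float $b): float
{
    $sum = 0.0;
    foreach ($X as $i => $x) {
        $error = $Y[$i] - ($m * $x + $b);  // observed minus predicted
        $sum += $error * $error;
    }
    return $sum;
}

// Made-up data that lies close to the line y = 2x.
$X = [1, 2, 3, 4];
$Y = [2.1, 3.9, 6.2, 7.8];

$good = sumSquaredErrors($X, $Y, 2.0, 0.0);  // candidate close to the data
$poor = sumSquaredErrors($X, $Y, 1.0, 0.0);  // candidate that fits worse

printf("good: %.2f, poor: %.2f\n", $good, $poor);
```

The candidate with the smaller sum is the better fit, and the least-squares estimates are the values of m and b for which no other line gives a smaller sum.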
  
Two basic approaches can be used to find the estimates of m and b that satisfy the least-squares criterion. In the first, you use a numerical search procedure that tries different values of m and b and evaluates each pair until it arrives at the estimates with the smallest sum of squared errors. The second approach uses calculus to derive equations for estimating m and b. I am not going to discuss the calculus involved in deriving these equations, but I did use the resulting analytic equations in the SimpleLinearRegression class to find the least-squares estimates of m and b (see the getSlope() and getYIntercept() methods in the SimpleLinearRegression class).
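For reference, the analytic equations in their standard textbook form are m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and b = mean(y) - m * mean(x). The sketch below applies them in a standalone function; the name leastSquaresFit is hypothetical, and the code is not necessarily how getSlope() and getYIntercept() are written.

```php
<?php
// Standalone sketch of the standard least-squares formulas.
// The function name is hypothetical, not one of the article's class methods.
function leastSquaresFit(array $X, array $Y): array
{
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;

    $sxy = 0.0;  // sum of cross-products of deviations
    $sxx = 0.0;  // sum of squared deviations of X
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += ($x - $meanX) * ($x - $meanX);
    }

    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;
    return ['slope' => $m, 'intercept' => $b];
}

// Made-up data lying exactly on y = 2x + 1.
$fit = leastSquaresFit([1, 2, 3, 4], [3, 5, 7, 9]);
printf("y = %.2fx + %.2f\n", $fit['slope'], $fit['intercept']);
```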
  
Having equations for the least-squares estimates of m and b does not mean that plugging these parameters into the linear equation yields a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining prediction error is acceptable.
  
We can use a statistical decision procedure to reject the alternative hypothesis that a straight line fits the data. This procedure is based on computing a T statistic and using a probability function to find the probability of observing a value that large by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values, one of which is the T statistic, which can be used to measure how well the linear equation fits the data. If the fit is good, the T statistic tends to be large. If the T value is small, you should replace your linear equation with a default model that assumes the mean of the Y values is the best predictor (because the mean of a set of values is often a useful predictor of the next observed value).
  
To test whether the T statistic is large enough that you need not use the mean of Y as the best predictor, you compute the probability of obtaining that T statistic by chance. If the probability is low, you can reject the null assumption that the mean is the best predictor and be correspondingly confident that the simple linear model fits the data well. (For more information about calculating the probability of a T statistic, see Part 1.)
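As a rough sketch of where the T value comes from, the textbook T statistic for the slope divides the slope estimate by its standard error, sqrt((SSE / (n - 2)) / sum((x - mean(x))^2)). The function name slopeTStatistic and the data below are made up, and the SimpleLinearRegression class may organize this calculation differently.

```php
<?php
// Sketch of the textbook T statistic for the slope of a fitted line.
// The function name is hypothetical, not a method of the article's class.
function slopeTStatistic(array $X, array $Y): float
{
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += ($x - $meanX) * ($x - $meanX);
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    // Sum of squared errors around the fitted line.
    $sse = 0.0;
    foreach ($X as $i => $x) {
        $error = $Y[$i] - ($m * $x + $b);
        $sse += $error * $error;
    }

    // Standard error of the slope, using n - 2 degrees of freedom.
    $stdErrSlope = sqrt(($sse / ($n - 2)) / $sxx);
    return $m / $stdErrSlope;
}

$t = slopeTStatistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("t = %.4f\n", $t);
```

The resulting value, together with the degrees of freedom (n - 2 in the standard formulation), is what a probability function such as getStudentT() would take as input.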
  
Returning to the statistical decision procedure: it tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the alternative hypothesis of a linear model has to be established through theoretical and statistical arguments.
  
You have built a data research tool that implements a statistical decision procedure for linear models (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish a linear model. The tool can be classified as a decision-support tool for knowledge workers who study small to medium-sized data sets.
  
From a learning standpoint, simple linear regression modeling is worth studying because it is a necessary step toward understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression are the foundation for understanding multiple regression, factor analysis, and time series.
  
Simple linear regression is also a versatile modeling technique. You can model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations can linearize the data so that it can be modeled with simple linear regression. The resulting linear model is expressed as a linear formula in terms of the transformed values.
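The following sketch illustrates the idea with made-up exponential data: taking the logarithm of Y turns the curve into a straight line, which an ordinary least-squares fit can then recover. The helper fitLine is hypothetical and stands in for whatever regression routine you use.

```php
<?php
// Hypothetical helper: ordinary least-squares fit returning [slope, intercept].
function fitLine(array $X, array $Y): array
{
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += ($x - $meanX) * ($x - $meanX);
    }
    $slope = $sxy / $sxx;
    return [$slope, $meanY - $slope * $meanX];
}

// Made-up curved data generated from y = 2 * exp(x).
$X = [0, 1, 2, 3];
$Y = array_map(function ($x) { return 2 * exp($x); }, $X);

// Log-transform Y so the relationship becomes linear: log(y) = log(2) + x.
$logY = array_map(function ($y) { return log($y); }, $Y);
list($slope, $intercept) = fitLine($X, $logY);

// Undo the transformation to express the model in the original units.
$a = exp($intercept);
printf("y = %.2f * exp(%.2f * x)\n", $a, $slope);
```

Note that the fitted slope and intercept describe log(y), not y, so the intercept has to be exponentiated before it can be read in the original units.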
  
   Probability Functions
In the previous article, I sidestepped the problem of implementing probability functions in PHP by relying on R to obtain the probability values. I was not completely satisfied with this solution, so I began to investigate what it would take to develop PHP-based probability functions.
  
I started searching online for information and code. One source for both was the probability function code in the book Numerical Recipes in C. I re-implemented some of that code (the gammln.c and betai.c functions) in PHP, but I was still not satisfied with the results. The code seemed somewhat longer than some other implementations. In addition, I also needed inverse probability functions.
  
Fortunately, I came across John Pezzullo's Interactive Statistical Calculation site. John's page on probability distribution functions has all the functions I need, and they are implemented in JavaScript for ease of learning.
  
I ported the Student T and Fisher F functions to PHP. I changed the API slightly to conform to Java naming style and embedded all the functions in a class named Distribution. A great feature of this implementation is the doCommonMath method, which is reused by all the functions in this library. Other tests that I did not take the effort to implement (the normality and chi-square tests) also use the doCommonMath method.
  
Another aspect of this port is worth noting. In JavaScript, you can assign dynamically determined values to instance variables, for example:
  
var PiD2 = pi() / 2
  
You cannot do this in PHP. Only simple constant values can be assigned to instance variables in their declarations. Hopefully this will be addressed in PHP5.
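One common workaround, sketched here in PHP 5 or later syntax with a hypothetical class name, is to declare the instance variable without a value and assign the computed value in the constructor.

```php
<?php
// Hypothetical class for illustration; not the article's Distribution class.
class HalfPiExample
{
    // A declaration such as "public $piD2 = pi() / 2;" is rejected by PHP,
    // because a property default may not call a function.
    public $piD2;

    public function __construct()
    {
        // Computed values can be assigned while the object is being built.
        $this->piD2 = pi() / 2;
    }
}

$example = new HalfPiExample();
printf("%.4f\n", $example->piD2);
```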
  
Note that the code in Listing 1 does not define any instance variables. This is because in the JavaScript version they are assigned dynamically computed values.
  
   Listing 1. Implementing probability functions
     
<?php

// Distribution.php

// Copyright John Pezzullo
// Released under same terms as PHP.
// PHP Port and OO'fying by Paul Meagher

class Distribution {

    // Series summation shared by all of the probability functions.
    function doCommonMath($q, $i, $j, $b) {

        $zz = 1;
        $z  = $zz;
        $k  = $i;

        while ($k <= $j) {
            $zz = $zz * $q * $k / ($k - $b);
            $z  = $z + $zz;
            $k  = $k + 2;
        }
        return $z;
    }

    // Two-tailed probability of a T value at least as extreme as $t
    // with $df degrees of freedom.
    function getStudentT($t, $df) {

        $t  = abs($t);
        $w  = $t / sqrt($df);
        $th = atan($w);

        if ($df == 1) {
            return 1 - $th / (pi() / 2);
        }

        $sth = sin($th);
        $cth = cos($th);

        if (($df % 2) == 1) {
            // Odd degrees of freedom
            return 1 - ($th + $sth * $cth * $this->doCommonMath($cth * $cth, 2, $df - 3, -1))
                / (pi() / 2);
        } else {
            // Even degrees of freedom
            return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
        }

    }

    // T value whose probability is $p, found by a halving search on $v,
    // where $t = 1/$v - 1.
    function getInverseStudentT($p, $df) {

        $v  = 0.5;
        $dv = 0.5;
        $t  = 0;

        while ($dv > 1e-6) {
            $t  = (1 / $v) - 1;
            $dv = $dv / 2;
            if ($this->getStudentT($t, $df) > $p) {
                $v = $v - $dv;
            } else {
                $v = $v + $dv;
            }
        }
        return $t;
    }

    function getFisherF($f, $n1, $n2) {
        // Implemented but not shown
    }

    function getInverseFisherF($p, $n1, $n2) {
        // Implemented but not shown
    }

}
?>
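To see how the halving search in getInverseStudentT() behaves, the standalone sketch below applies the same loop to the case of one degree of freedom, where getStudentT() reduces to the closed form 1 - atan(t) / (pi / 2). The function names tailProbDf1 and inverseTailProbDf1 are hypothetical and exist only for this illustration.

```php
<?php
// Two-tailed probability for df = 1, the closed form used in the
// first branch of getStudentT().
function tailProbDf1(float $t): float
{
    return 1 - atan(abs($t)) / (pi() / 2);
}

// Same search as getInverseStudentT(): $v moves within (0, 1) in steps
// that halve on each pass, and $t = 1/$v - 1 maps it onto (0, infinity).
function inverseTailProbDf1(float $p): float
{
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0.0;

    while ($dv > 1e-6) {
        $t  = (1 / $v) - 1;
        $dv = $dv / 2;
        if (tailProbDf1($t) > $p) {
            $v = $v - $dv;  // probability too high: move toward a larger t
        } else {
            $v = $v + $dv;  // probability too low: move toward a smaller t
        }
    }
    return $t;
}

$t = inverseTailProbDf1(0.5);
printf("%.3f\n", $t);
```

Because atan(1) is pi / 4, a probability of 0.5 corresponds to a T value of 1, which gives a quick sanity check on the search.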
  
   Output methods
Now that the probability functions are implemented in PHP, the only remaining challenge in developing a PHP-based data research tool is to design a way to display the analysis results.
  
A simple solution is to display the values of all instance variables on the screen as needed. In the first article, I did just that when I showed the linear equation, T value, and T probability for the Burnout Study. Being able to access specific values for specific purposes is helpful, and SimpleLinearRegression supports this usage.
  
However, another approach to outputting results is to group parts of the output systematically. If you study the output of the major statistical software packages used for regression analysis, you will find that they tend to group their output in the same way. They often include a Summary Table, an Analysis of Variance table, a Parameter Estimates table, and R Values. Similarly, I have created output methods with the following names:
  
showTableSummary()
showAnalysisOfVariance()
showParameterEstimates()
showRValues()
I also have a method for displaying the linear prediction formula (getFormula()). Many statistical software packages do not output the formula and instead expect users to construct it from the output of the methods above. This is partly because the final form of the formula you use to model your data may differ from the default formula for the following reasons:
  
The y-intercept has no meaningful interpretation, or
The input values may have been transformed, and you may need to undo the transformation to arrive at the final interpretation.
All these methods assume that the output medium is a web page. Considering that you may want to output these summary values to a medium other than a web page, I decided to wrap the output methods in a class that extends the SimpleLinearRegression class. The code in Listing 2 is intended to demonstrate the general logic of the output class. To make that logic stand out, the code that implements the various show methods has been removed.
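As a rough illustration of why inheritance helps here, the sketch below layers two output classes over one small base class, one for a web page and one for a plain-text medium. All class names, properties, and the showFormula() signature are hypothetical stand-ins and are not taken from the SimpleLinearRegression code.

```php
<?php
// Hypothetical stand-in for the regression class; it only stores two values.
class LineModel
{
    public $slope;
    public $yIntercept;

    public function __construct(float $slope, float $yIntercept)
    {
        $this->slope = $slope;
        $this->yIntercept = $yIntercept;
    }
}

// Output class for a web page.
class LineModelHTML extends LineModel
{
    public function showFormula(string $xName, string $yName): string
    {
        return sprintf("<p>%s = %.2f * %s + %.2f</p>",
            $yName, $this->slope, $xName, $this->yIntercept);
    }
}

// Output class for a plain-text medium such as a terminal.
class LineModelText extends LineModel
{
    public function showFormula(string $xName, string $yName): string
    {
        return sprintf("%s = %.2f * %s + %.2f\n",
            $yName, $this->slope, $xName, $this->yIntercept);
    }
}

$html = new LineModelHTML(0.6, 2.2);
$text = new LineModelText(0.6, 2.2);

echo $html->showFormula('x', 'y'), "\n";
echo $text->showFormula('x', 'y');
```

The calculations live in one place, and each output medium gets its own subclass, so supporting a new medium does not require touching the statistical code.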
  
   Listing 2. Demonstrating the general logic of the output class
     
<?php

// HTML.php

// Copyright 2003, Paul Meagher
// Distributed under GPL

include_once "slr/SimpleLinearRegression.php";

class SimpleLinearRegressionHTML extends SimpleLinearRegression {

    // PHP 4 style constructor that passes its arguments to the parent class.
    function SimpleLinearRegressionHTML($X, $Y, $conf_int) {
        SimpleLinearRegression::SimpleLinearRegression($X, $Y, $conf_int);
    }

    function showTableSummary($x_name, $y_name) {
        // Implementation removed; the listing is cut off at this point in the source.
    }

}
