Table of contents
- 1 Basic description of the algorithm
- 2 Application scenario of the algorithm
- 3 Advantages and disadvantages of the algorithm
- 4 Input data, intermediate results, and output of the algorithm
- 5 Code reference for the algorithm
- 6 Sharing
Correlation and regression: How's My Line?
Author: Bai Ningsu
October 25, 2015 22:16:07
Abstract: The "Statistics in the Eyes of Programmers" series collects the notes that the author and the team compiled while studying statistics together. When statistics is first mentioned, many people assume it belongs to economics or mathematics and has nothing to do with computing. It is true that statistics has traditionally played a significant role in those disciplines. However, with the development of science and technology and the spread of machine intelligence, statistics is becoming more and more important in computing as well. This series follows the book "In-depth Statistics" (the articles lean towards code and assume some background from the reader; the PPT shared at the end can be used for study). As Mr. Wu argues in "The Beauty of Mathematics", statistical and mathematical models play an important role in machine intelligence. World-class challenges such as speech recognition, part-of-speech analysis, and machine translation found their key to success in statistics. This is especially true of natural language processing, which is why learning statistics and mathematical modeling matters. Finally, thanks to the team for their participation. (This article is original; please cite the source when reprinting: Correlation and regression: How's My Line?)
Table of contents of the series
"Statistics in the Eyes of Programmers (1)" Information visualization: First impressions
"Statistics in the Eyes of Programmers (2)" Measuring central tendency: dispersion, variability, and range
"Statistics in the Eyes of Programmers (3)" Probability calculation: Seize the opportunity
"Statistics in the Eyes of Programmers (4)" Application of discrete probability distributions: Using expectations wisely
"Statistics in the Eyes of Programmers (5)" Permutations: sorting, ranking, arranging
"Statistics in the Eyes of Programmers (6)" Geometric, binomial, and Poisson distributions: Staying discrete
"Statistics in the Eyes of Programmers (7)" Application of the normal distribution: The beauty of normality
"Statistics in the Eyes of Programmers (8)" Application of statistical sampling: Drawing samples
"Statistics in the Eyes of Programmers (9)" Population and sample estimation: Making predictions
"Statistics in the Eyes of Programmers (10)" Application of hypothesis testing: Research evidence
"Statistics in the Eyes of Programmers (11)" Application of the chi-square distribution
"Statistics in the Eyes of Programmers (12)" Correlation and regression: How's My Line?
1 Basic description of the algorithm
1.1 Algorithm description
To understand how two variables (an independent variable and a dependent variable) are related, linear regression is applied to the bivariate data. It yields the best-fit line and the correlation coefficient, and the line is then used to estimate the value of one variable from the value of the other.
1.2 Definitions
Suppose the bivariate data are distributed as follows:
X | X1 | X2 | X3 | X4 | ...... | Xn-2 | Xn-1 | Xn
Y | Y1 | Y2 | Y3 | Y4 | ...... | Yn-2 | Yn-1 | Yn
Each value of X corresponds one-to-one to a value of Y, and the two variables are linearly related.
1.3 Explanation of symbols
X: the independent variable
Y: the dependent variable
Xi: the i-th value of the independent variable
Yi: the i-th value of the dependent variable
1.4 Calculation method
1. Assume the equation of the best-fit line is $y = a + bx$, where $b$ is the slope and $a$ is the intercept.
2. Calculate the means of the independent variable x and the dependent variable y: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
3. Use least-squares regression to find the slope of the best-fit line: $b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$
4. Calculate the intercept of the best-fit line: $a = \bar{y} - b\bar{x}$
5. Substitute the slope and the intercept to obtain the equation of the best-fit line: $y = a + bx$
6. Calculate the standard deviations of the independent variable x and the dependent variable y: $s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$, $s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}$
7. Calculate the correlation coefficient: $r = \frac{b\,s_x}{s_y}$
8. Use the correlation coefficient to judge how well the best-fit line fits the data, according to the following rules:
(1) The closer the absolute value of the correlation coefficient is to 1, the better the best-fit line fits the data, and the line can be used for prediction.
(2) The closer the absolute value of the correlation coefficient is to 0, the worse the best-fit line fits the data, and the line is not recommended for prediction (the predicted result may be inaccurate).
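The steps above can be sketched compactly in Java. This is only an illustration under stated assumptions, not the author's implementation (which is listed in section 5.3): the class name `LineFitSketch` and its method names are hypothetical, the sample points are made up, no intermediate rounding is applied, and the code follows the convention of the source code in section 5.3, where b is the slope and a is the intercept.

```java
/** Minimal sketch of steps 1-7; LineFitSketch and its methods are hypothetical names. */
class LineFitSketch {

    /** Arithmetic mean of the values (step 2). */
    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** Sample standard deviation, dividing by n - 1 (step 6). */
    static double stdDev(double[] v) {
        double m = mean(v);
        double sumSq = 0;
        for (double d : v) {
            sumSq += (d - m) * (d - m);
        }
        return Math.sqrt(sumSq / (v.length - 1));
    }

    /** Returns {a, b, r} for the least-squares line y = a + b*x. */
    static double[] fit(double[] x, double[] y) {
        double xAvg = mean(x);
        double yAvg = mean(y);
        double num = 0;
        double den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xAvg) * (y[i] - yAvg);
            den += (x[i] - xAvg) * (x[i] - xAvg);
        }
        double b = num / den;                  // step 3: slope
        double a = yAvg - b * xAvg;            // step 4: intercept
        double r = b * stdDev(x) / stdDev(y);  // step 7: correlation coefficient
        return new double[] {a, b, r};
    }

    public static void main(String[] args) {
        // Made-up points that lie exactly on y = 1 + 2x
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] abr = fit(x, y);
        System.out.println("a = " + abr[0] + ", b = " + abr[1] + ", r = " + abr[2]);
    }
}
```

Because the made-up points lie exactly on a line, the sketch returns a = 1, b = 2 and r = 1 up to floating-point error, which is the "perfect fit" end of rule (1).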
2 Application scenario of the algorithm
2.1 Algorithm description in this scenario
Case description: A sample records the expected hours of sunshine and the number of concert listeners for several concerts. The data are used to estimate ticket sales from the hours of sunshine expected on the day of a concert.
2.2 Algorithm definition under this scenario
Case definition: The following bivariate data give the expected sunny hours and the number of concert listeners:
Sunny hours (hours) | 1.9 | 2.5 | 3.2 | 3.8 | 4.7 | 5.5 | 5.9 | 7.2
Number of concert listeners (hundreds) | 22 | 33 | 30 | 42 | 38 | 49 | 42 | 55
If 4.3 hours of sunshine are expected on the day of the concert, how many listeners can be expected?
2.3 Symbolic interpretation of the algorithm in this scenario
Sunny hours: the independent variable
Number of listeners: the dependent variable
2.4 Algorithm calculation method under this scenario
1. Assume the equation of the best-fit line is $y = a + bx$, where $b$ is the slope and $a$ is the intercept.
2. Calculate the means of the sunny hours and the number of listeners: $\bar{x} = 4.3375$, $\bar{y} = 38.875$
3. Use least-squares regression to find the slope of the best-fit line: $b = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} \approx 5.336$
4. Calculate the intercept of the best-fit line: $a = \bar{y} - b\bar{x} \approx 15.728$
5. Substitute the slope and the intercept to obtain the equation of the best-fit line: $y = 15.728 + 5.336x$
6. Calculate the standard deviations of the sunny hours and the number of listeners: $s_x \approx 1.813$, $s_y \approx 10.562$
7. Calculate the correlation coefficient: $r = \frac{b\,s_x}{s_y} \approx 0.916$
8. Use the correlation coefficient to judge how well the best-fit line fits the data and obtain the prediction:
Since r is close to 1, there is a strong positive correlation between the number of concert listeners and the expected sunny hours. In other words, for the data available, the best-fit line gives a reasonably good estimate of concert attendance from the expected sunny hours.
When 4.3 hours of sunshine are expected on the day of the concert, the best-fit line equation estimates that about 3868 people will attend.
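The estimate can be reproduced with a few lines of Java. The sketch below is an illustration, not the author's program: `ConcertEstimate` and its members are hypothetical names, the data are the values listed in the main method of section 5.3, and no intermediate rounding is applied, so the last digit can differ slightly from a calculation done with rounded values.

```java
/** Sketch: fit y = a + b*x to the concert data and estimate attendance; hypothetical class. */
class ConcertEstimate {

    // Data as listed in the main method of section 5.3
    static final double[] HOURS = {1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2};
    static final double[] LISTENERS = {22, 33, 30, 42, 38, 49, 42, 55}; // in hundreds

    /** Returns {a, b} of the least-squares line y = a + b*x. */
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double xAvg = 0;
        double yAvg = 0;
        for (int i = 0; i < n; i++) {
            xAvg += x[i];
            yAvg += y[i];
        }
        xAvg /= n;
        yAvg /= n;
        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - xAvg) * (y[i] - yAvg);
            den += (x[i] - xAvg) * (x[i] - xAvg);
        }
        double b = num / den;
        return new double[] {yAvg - b * xAvg, b};
    }

    /** Evaluates the line {a, b} at x. */
    static double estimate(double[] line, double x) {
        return line[0] + line[1] * x;
    }

    public static void main(String[] args) {
        double[] line = fitLine(HOURS, LISTENERS);
        System.out.println("intercept a = " + line[0]);
        System.out.println("slope b = " + line[1]);
        System.out.println("estimate at 4.3 h = " + estimate(line, 4.3) + " hundred listeners");
    }
}
```

With these inputs the slope comes out near 5.34 and the estimate near 38.7 hundred listeners, in line with the figure quoted above.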
3 Advantages and disadvantages of the algorithm
3.1 Advantages of this algorithm
Advantage: It uncovers the linear correlation pattern in bivariate data and provides a basis and a result for prediction.
3.2 Disadvantages of this algorithm
Disadvantage: The estimate is only reliable within the range of the data already available and does not necessarily hold outside that range. In the concert case, the line may not give a good estimate of the audience when 8 hours of sunshine are expected, because the sample only covers 1.9 to 7.2 hours.
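One way to make this limitation explicit in code is to check whether a test value lies inside the observed range before trusting the prediction. The helper below is a hypothetical addition, not part of the author's class; `RangeGuard` and `withinObservedRange` are assumed names.

```java
import java.util.Arrays;

/** Sketch of a range check before predicting; RangeGuard is a hypothetical helper. */
class RangeGuard {

    /** True if x lies within [min, max] of the observed independent-variable values. */
    static boolean withinObservedRange(double[] observedX, double x) {
        // For an empty array both bounds are NaN, so the comparison below is false
        double min = Arrays.stream(observedX).min().orElse(Double.NaN);
        double max = Arrays.stream(observedX).max().orElse(Double.NaN);
        return x >= min && x <= max;
    }

    public static void main(String[] args) {
        double[] hours = {1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2};
        System.out.println("4.3 h inside range: " + withinObservedRange(hours, 4.3));
        System.out.println("8.0 h inside range: " + withinObservedRange(hours, 8.0));
    }
}
```

For the concert data, 4.3 hours falls inside the observed range while 8 hours does not, so an estimate for 8 hours would be an extrapolation.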
3.3 Scenarios the algorithm suits
Cases where the variables are causally related, so that the value of the effect can be estimated from the value of the cause, for example hours of sunlight and rice yield.
3.4 Scenarios the algorithm does not suit
Cases where there is no causal relationship between the variables, for example a person's weight and height.
3.5 Data types the algorithm applies to
The algorithm works on values of type double. Results are rounded to three decimal places by default, and the number of decimal places can be set by the caller.
4 Input data, intermediate results, and output of the algorithm
4.1 Input data of the algorithm
@param twoVarData double[][], the sample data
@param testiVar double, the test value of the independent variable
@param fraDigits int, the number of decimal places to keep in the results
4.2 Intermediate results of the algorithm
@param iVarAvg double, the mean of the independent variable
@param dVarAvg double, the mean of the dependent variable
@param slope double, the slope of the best-fit line
@param tangentDistance double, the intercept of the best-fit line
@param Sx double, the standard deviation of the independent variable
@param Sy double, the standard deviation of the dependent variable
4.3 Output of the algorithm
@return result String[4], the string values {slope, intercept, best-fit line equation, correlation coefficient}
@return estValue double, the estimate of the dependent variable
5 Code reference for the algorithm
5.1 Basic description of the class and its methods
Class source: see the source program Statistics.src.CorrelationAndRegression
The class applies correlation and regression to bivariate data [independent variable and dependent variable]: it calculates the best-fit line [y=a+bx], uses that line to capture the linear relationship between the two variables, and predicts the value of the dependent variable from the value of the independent variable.
5.2 Class and method interfaces
See the source program: Statistics.src.CorrelationAndRegression
CorrelationAndRegression.java contains the following methods:
calculateAvgValue(double[] varData, int fraDigits) // calculate the mean of a variable
calculateSlope(double[][] twoVarData, int fraDigits) // calculate the slope of the best-fit line
calculateTangentDistance(double slope, double iVarAvg, double dVarAvg, int fraDigits) // calculate the intercept of the best-fit line
calculateStandardDeviation(double[] varData, int fraDigits) // calculate the standard deviation of a variable
calculateCorrelationCoefficient(double slope, double Sx, double Sy, int fraDigits) // calculate the correlation coefficient
calculateEstimatedValue(double slope, double tangentDistance, double testiVar, int fraDigits) // calculate the estimated value of the dependent variable
analyze(double[][] twoVarData, int fraDigits) // run the regression analysis and return the results
Helper class called: Statistics.src.utils.ScoreUtil
Methods in ScoreUtil.java:
getFractionDigits(double, int) // round a value to the given number of decimal places
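The source of ScoreUtil is not reproduced in this article. Judging only from how it is called (it takes a double and an int and returns a String that is parsed back with Double.valueOf), a stand-in could look like the sketch below. The class name `ScoreUtilSketch` and the choice of half-up rounding are assumptions, not the author's code.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical stand-in for utils.ScoreUtil; the rounding mode is an assumption. */
class ScoreUtilSketch {

    /** Rounds value to fraDigits decimal places (half-up) and returns it as a string. */
    static String getFractionDigits(double value, int fraDigits) {
        // BigDecimal.valueOf uses the double's canonical string form, which avoids
        // binary artifacts such as 4.33749999...; it throws for NaN and infinities
        return BigDecimal.valueOf(value)
                .setScale(fraDigits, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(getFractionDigits(4.3375, 3));
        System.out.println(getFractionDigits(38.875, 2));
        System.out.println(getFractionDigits(2.0, 3));
    }
}
```

Returning a string keeps trailing zeros for display, which is presumably why the calling code converts the result back with Double.valueOf before continuing the calculation.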
5.3 Source Code
import utils.ScoreUtil;

/**
 * Correlation and regression
 * @description Uses correlation and regression to calculate the best-fit line for bivariate data [independent and dependent variables],
 * uses the best-fit line [y=a+bx] to capture the relationship between the two variables, and then estimates the value of the dependent variable from the value of the independent variable.
 * Application scenario: for example, predicting concert attendance from the weather, or estimating plant growth from the hours of light.
 * Limitations: estimates rely on the existing data and may not apply outside the range of that data.
 * For example, data from 2000 to 2010 cannot reliably predict values after 2010.
 * @author candymoon
 * @date 2015-8-13 4:19:57 PM
 */
public class CorrelationAndRegression {
    /**
     * Calculates the mean of the variable values (formula: mean = sum / count)
     * @param varData variable data values
     * @param fraDigits number of decimal places to keep in the result
     * @return mean
     */
    public static double calculateAvgValue(double[] varData, int fraDigits) {
        double avgValue = 0;
        int len = varData.length; // array length
        for (int i = 0; i < len; i++) {
            avgValue += varData[i];
        }
        // Calculate the mean
        avgValue = avgValue / len;
        // Round the result to fraDigits decimal places
        String avgValue_String = ScoreUtil.getFractionDigits(avgValue, fraDigits);
        avgValue = Double.valueOf(avgValue_String);
        return avgValue;
    }
    /**
     * Calculates the slope b of the best-fit line (y = a + bx)
     * @description The slope of the best-fit line (y=a+bx) is calculated as: b=∑((x-xAvg)(y-yAvg))/∑(x-xAvg)^2,
     * where x is the independent variable, xAvg is the mean of the independent variable, y is the dependent variable, and yAvg is the mean of the dependent variable.
     * @param twoVarData bivariate data ([0] independent variable, [1] dependent variable)
     * @param fraDigits number of decimal places to keep in the result
     * @return slope
     */
    public static double calculateSlope(double[][] twoVarData, int fraDigits) {
        double slope = 0;   // slope
        double iVarAvg = 0; // mean of the independent variable
        double dVarAvg = 0; // mean of the dependent variable
        // Calculate the means of the independent and dependent variables
        iVarAvg = calculateAvgValue(twoVarData[0], fraDigits);
        dVarAvg = calculateAvgValue(twoVarData[1], fraDigits);
        int iVarLen = twoVarData[0].length;
        double numerator = 0;   // numerator
        double denominator = 0; // denominator
        // Calculate the numerator and denominator of the formula
        for (int i = 0; i < iVarLen; i++) {
            double x = twoVarData[0][i];
            double y = twoVarData[1][i];
            numerator += (x - iVarAvg) * (y - dVarAvg);
            denominator += (x - iVarAvg) * (x - iVarAvg);
        }
        // Calculate the slope
        slope = numerator / denominator;
        // Round the result to fraDigits decimal places
        String slope_String = ScoreUtil.getFractionDigits(slope, fraDigits);
        slope = Double.valueOf(slope_String);
        return slope;
    }
    /**
     * Calculates the intercept a of the best-fit line (y = a + bx)
     * @description Because the best-fit line passes through the point (mean of the independent variable, mean of the dependent variable), the formula is:
     * a=dVarAvg-b*iVarAvg
     * @param slope slope
     * @param iVarAvg mean of the independent variable
     * @param dVarAvg mean of the dependent variable
     * @param fraDigits number of decimal places to keep in the result
     * @return intercept
     */
    public static double calculateTangentDistance(double slope, double iVarAvg, double dVarAvg, int fraDigits) {
        double tanDis = 0; // intercept
        // Calculate the intercept
        tanDis = dVarAvg - (slope * iVarAvg);
        // Round the result to fraDigits decimal places
        String tanDis_String = ScoreUtil.getFractionDigits(tanDis, fraDigits);
        tanDis = Double.valueOf(tanDis_String);
        return tanDis;
    }
    /**
     * Calculates the standard deviation of the variable data
     * @description The standard deviation is calculated as: standard deviation = sqrt(∑(var-varAvg)^2/(n-1))
     * @param varData variable data
     * @param fraDigits number of decimal places to keep in the result
     * @return standard deviation
     */
    public static double calculateStandardDeviation(double[] varData, int fraDigits) {
        double standardDev = 0;   // standard deviation
        double varAvg = 0;        // mean of the variable
        double sumSquaredDev = 0; // sum of squared deviations from the mean
        // Calculate the mean of the variable
        varAvg = calculateAvgValue(varData, fraDigits);
        int varDataLen = varData.length;
        for (int i = 0; i < varDataLen; i++) {
            double var = varData[i];
            sumSquaredDev += (var - varAvg) * (var - varAvg);
        }
        // Calculate the standard deviation
        standardDev = Math.pow(sumSquaredDev / (varDataLen - 1), 0.5);
        // Round the result to fraDigits decimal places
        String standardDev_String = ScoreUtil.getFractionDigits(standardDev, fraDigits);
        standardDev = Double.valueOf(standardDev_String);
        return standardDev;
    }
    /**
     * Calculates the correlation coefficient
     * @description The correlation coefficient indicates how well the best-fit line fits the bivariate data. It is calculated as: r = (b*Sx)/Sy;
     * r lies in the range [-1, 1]. The closer |r| is to 1, the better the fit; the closer |r| is to 0, the worse the fit.
     * (1) r = -1 means perfect negative correlation; (2) r = 1 means perfect positive correlation; (3) r = 0 means no linear correlation.
     * @param slope slope
     * @param Sx standard deviation of the independent variable
     * @param Sy standard deviation of the dependent variable
     * @param fraDigits number of decimal places to keep in the result
     * @return correlation coefficient
     */
    public static double calculateCorrelationCoefficient(double slope, double Sx, double Sy, int fraDigits) {
        double r = 0; // correlation coefficient
        r = (slope * Sx) / Sy;
        // Round the result to fraDigits decimal places
        String r_String = ScoreUtil.getFractionDigits(r, fraDigits);
        r = Double.valueOf(r_String);
        return r;
    }
    /**
     * Calculates the estimated value of the dependent variable
     * @description Solves the best-fit line equation y=a+bx
     * @param slope slope of the best-fit line
     * @param tangentDistance intercept of the best-fit line
     * @param testiVar test value of the independent variable
     * @param fraDigits number of decimal places to keep in the result
     * @return estimate of the dependent variable
     */
    public static double calculateEstimatedValue(double slope, double tangentDistance, double testiVar, int fraDigits) {
        double estValue = 0; // estimate
        // Solve the best-fit line equation y = a + bx
        estValue = tangentDistance + (slope * testiVar);
        // Round the result to fraDigits decimal places
        String estValue_String = ScoreUtil.getFractionDigits(estValue, fraDigits);
        estValue = Double.valueOf(estValue_String);
        return estValue;
    }
    /**
     * Finds the slope, intercept, best-fit line, and correlation coefficient
     * @param twoVarData bivariate data ([0] independent variable, [1] dependent variable)
     * @param fraDigits number of decimal places to keep in the results
     * @return string result array {slope, intercept, best-fit line, correlation coefficient}
     */
    public static String[] analyze(double[][] twoVarData, int fraDigits) {
        String[] result = new String[4]; // string result array
        String bestFittingLine = "";
        double iVarAvg = 0; // mean of the independent variable
        double dVarAvg = 0; // mean of the dependent variable
        double b = 0;       // slope
        double a = 0;       // intercept
        double r = 0;       // correlation coefficient
        double Sx = 0;      // standard deviation of the independent variable
        double Sy = 0;      // standard deviation of the dependent variable
        // Calculate the means of the independent and dependent variables
        iVarAvg = calculateAvgValue(twoVarData[0], fraDigits);
        dVarAvg = calculateAvgValue(twoVarData[1], fraDigits);
        // Calculate the slope
        b = calculateSlope(twoVarData, fraDigits);
        // Calculate the intercept
        a = calculateTangentDistance(b, iVarAvg, dVarAvg, fraDigits);
        // Calculate the standard deviations of the independent and dependent variables
        Sx = calculateStandardDeviation(twoVarData[0], fraDigits);
        Sy = calculateStandardDeviation(twoVarData[1], fraDigits);
        // Calculate the correlation coefficient
        r = calculateCorrelationCoefficient(b, Sx, Sy, fraDigits);
        // Assemble the best-fit line equation
        bestFittingLine = "y=" + a + "+" + b + "x";
        // Add the results to the result array
        result[0] = b + "";
        result[1] = a + "";
        result[2] = bestFittingLine + "";
        result[3] = r + "";
        return result;
    }
    /**
     * Runs the analysis on the concert data
     * @param args not used
     */
    public static void main(String[] args) {
        /*
         * iVar is an abbreviation for independent variable
         * dVar is an abbreviation for dependent variable
         */
        double[] iVar = {1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2}; // independent variable values
        double[] dVar = {22, 33, 30, 42, 38, 49, 42, 55};         // dependent variable values
        double[][] twoVarData = new double[][]{iVar, dVar};       // bivariate data
        int fraDigits = 3;     // number of decimal places to keep in the results
        double testiVar = 4.3; // test value of the independent variable
        String[] result = analyze(twoVarData, fraDigits);
        double estValue = 0;   // estimate
        System.out.println("-------------output result---------------");
        System.out.println(" Slope: " + result[0] + " Intercept: " + result[1]);
        System.out.println(" Best fit line: " + result[2]);
        System.out.println(" Correlation coefficient: " + result[3]);
        estValue = calculateEstimatedValue(Double.valueOf(result[0]), Double.valueOf(result[1]), testiVar, fraDigits);
        System.out.println(" Test value of the independent variable: " + testiVar);
        System.out.println(" Estimated value of the dependent variable: " + estValue);
        System.out.println("----------------------------------");
    }
}
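The program above derives r from the rounded slope and the rounded standard deviations. A simple cross-check is the algebraically equivalent direct formula r = ∑((x-xAvg)(y-yAvg)) / sqrt(∑(x-xAvg)^2 * ∑(y-yAvg)^2), which follows from substituting the formulas for b, Sx and Sy into r = (b*Sx)/Sy. The sketch below is an illustration under that assumption; `PearsonCheck` is a hypothetical class that is not part of the author's source, and it applies no intermediate rounding.

```java
/** Sketch: correlation coefficient from sums of deviations; PearsonCheck is a hypothetical name. */
class PearsonCheck {

    /** Returns r = Sxy / sqrt(Sxx * Syy) for paired samples x and y. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double xAvg = 0;
        double yAvg = 0;
        for (int i = 0; i < n; i++) {
            xAvg += x[i];
            yAvg += y[i];
        }
        xAvg /= n;
        yAvg /= n;
        double sxy = 0;
        double sxx = 0;
        double syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xAvg;
            double dy = y[i] - yAvg;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Same data as the main method above
        double[] iVar = {1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2};
        double[] dVar = {22, 33, 30, 42, 38, 49, 42, 55};
        System.out.println("r = " + pearson(iVar, dVar));
    }
}
```

For the concert data this gives r close to 0.916, which should agree with the program's value up to the rounding it applies to intermediate results.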
6 Sharing
PPT: http://yunpan.cn/cfwawexctkmed (access password: 291e)
Open source code: Http://yunpan.cn/cFWAFPNrvn6PV (access password: 8208)