"Statistics in the Programmer's Eye (12)" Correlation and regression: How's My Line?

    • Directory
    • 1 Basic description of the algorithm
    • 2 Application scenario of the algorithm
    • 3 Advantages and disadvantages of the algorithm
    • 4 Input data, intermediate results, and output of the algorithm
    • 5 Code reference for the algorithm
    • 6 Shares
Correlation and regression: How's My Line?


Author Bai Ningsu
October 25, 2015 22:16:07


Abstract: The "Statistics in the Programmer's Eye" series is a compilation of the notes the author and the team took while studying together. When statistics is first mentioned, many people assume it belongs to economics or mathematics and has little to do with computing. It is true that statistics has traditionally played its most visible role in those disciplines. However, with the development of science and technology and the spread of machine intelligence, statistics is becoming more and more important to machine intelligence as well. This series follows the book "in-depth statistics" (the articles lean towards code and assume some background from the reader; see the PPT shared at the end). As Mr. Wu argues in "The Beauty of Mathematics", statistical and mathematical models play an important role in machine intelligence: world-class challenges such as speech recognition, part-of-speech analysis, and machine translation all found their key to success in statistics. This is especially true of natural language processing, which makes learning statistics and mathematical modeling particularly worthwhile. Finally, thanks to the team for their participation. (This article is original; please cite the source when reprinting: Correlation and regression: How's My Line?)

Table of contents


"Statistics in the Programmer's Eye (1)" Information visualization: First Impressions

"Statistics in the Programmer's Eye (2)" Measures of central tendency: dispersion, variability, strong distance

"Statistics in the Programmer's Eye (3)" Probability calculation: Seize the opportunity

"Statistics in the Programmer's Eye (4)" Application of discrete probability distributions: using expectations wisely

"Statistics in the Programmer's Eye (5)" Permutation: Sort, rank, row

"Statistics in the Programmer's Eye (6)" Geometric distribution, binomial distribution and Poisson distribution: persisting discrete

"Statistics in the Programmer's Eye (7)" Application of the normal distribution: the beauty of normality

"Statistics in the Programmer's Eye (8)" Application of statistical sampling: sample extraction

"Statistics in the Programmer's Eye (9)" Population and sample estimates: Making predictions

"Statistics in the Programmer's Eye (10)" Application of hypothesis testing: Research evidence

"Statistics in the Programmer's Eye (11)" Application of the chi-square distribution

"Statistics in the Programmer's Eye (12)" Correlation and regression: How's My Line?


1 Basic description of the algorithm

1.1 Algorithm description


To understand how two variables (an independent variable and a dependent variable) are related, linear regression is applied to the two-variable data to obtain the best-fit line and the correlation coefficient. The line is then used to estimate the value of one variable from the value of the other.


1.2 Definitions


Suppose the two-variable data are distributed as follows:


X | X1 | X2 | X3 | X4 | ...... | Xn-2 | Xn-1 | Xn
Y | Y1 | Y2 | Y3 | Y4 | ...... | Yn-2 | Yn-1 | Yn


Each value of X corresponds to exactly one value of Y, and the two variables are linearly related.


1.3 Explanation of symbols


X: the independent variable
Y: the dependent variable
Xi: the i-th value of the independent variable
Yi: the i-th value of the dependent variable


1.4 Calculation method


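1. Suppose the equation of the best-fit line is $y = a + bx$, where $b$ is the slope and $a$ is the intercept (the same notation as the code in section 5).



2. Calculate the means of the independent variable x and the dependent variable y:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$



3. Use least-squares regression to find the slope of the best-fit line:

$b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$



4. Calculate the intercept of the best-fit line. Because the line passes through the point $(\bar{x}, \bar{y})$:

$a = \bar{y} - b\bar{x}$



5. Substitute the slope and the intercept to obtain the equation of the best-fit line:

$y = a + bx$



6. Calculate the sample standard deviations of the independent variable x and the dependent variable y:

$s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}, \qquad s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}$



7. Calculate the correlation coefficient:

$r = \frac{b\,s_x}{s_y}$



8. Use the correlation coefficient to judge how well the best-fit line fits the data, according to the following rules:



(1) The closer the absolute value of the correlation coefficient is to 1, the better the best-fit line fits the data, and the line can be used for prediction.



(2) The closer the absolute value of the correlation coefficient is to 0, the worse the best-fit line fits the data, and the line is not recommended for prediction (the predicted result may be inaccurate).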
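The eight steps above can be sketched in a few lines of Java. This is a minimal illustration, not the article's own class: the names `LeastSquaresSketch`, `Fit`, `fit` and `predict` are made up for this example, the line is written as y = a + bx with slope b and intercept a, and no intermediate rounding is applied.

```java
import java.util.Locale;

/** Minimal least-squares sketch; all names here are illustrative assumptions. */
final class LeastSquaresSketch {

    /** Holds the fitted line y = a + b*x and the correlation coefficient r. */
    static final class Fit {
        final double intercept; // a
        final double slope;     // b
        final double r;         // correlation coefficient

        Fit(double intercept, double slope, double r) {
            this.intercept = intercept;
            this.slope = slope;
            this.r = r;
        }

        /** Step 5: evaluate the fitted line at x. */
        double predict(double x) {
            return intercept + slope * x;
        }
    }

    /** Step 2: arithmetic mean. */
    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    static Fit fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equally long arrays with at least 2 points");
        }
        double xAvg = mean(x);
        double yAvg = mean(y);
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - xAvg;
            double dy = y[i] - yAvg;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0 || syy == 0) {
            throw new IllegalArgumentException("x and y must each contain at least two distinct values");
        }
        double b = sxy / sxx;                          // step 3: slope
        double a = yAvg - b * xAvg;                    // step 4: intercept
        double sX = Math.sqrt(sxx / (x.length - 1));   // step 6: sample standard deviations
        double sY = Math.sqrt(syy / (x.length - 1));
        double r = b * sX / sY;                        // step 7: correlation coefficient
        return new Fit(a, b, r);
    }

    public static void main(String[] args) {
        // Points that lie exactly on y = 1 + 2x, so the fit should recover a = 1, b = 2 and |r| = 1.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        Fit f = fit(x, y);
        System.out.printf(Locale.ROOT, "a=%.3f b=%.3f r=%.3f%n", f.intercept, f.slope, f.r);
    }
}
```

Step 8 then amounts to inspecting `Math.abs(f.r)` before trusting `f.predict(...)`.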


2 Application scenario of the algorithm

2.1 Algorithm description under this scenario


Case description: Given sample data relating the number of expected sunny hours to concert attendance, use the data to estimate ticket sales from the number of sunny hours expected on the day of a concert.


2.2 Algorithm definition under this scenario


Case definition: The two-variable data below give the expected sunny hours and the corresponding number of concert listeners:


Sunny hours (hours) | 1.9 | 2.5 | 3.2 | 3.8 | 4.7 | 5.5 | 5.9 | 7.2
Number of concert listeners (hundreds) | 22 | 33 | 30 | 42 | 38 | 49 | 42 | 55


If 4.3 hours of sunshine are expected on the day of the concert, how many people will be in the audience?


2.3 Symbolic interpretation of the algorithm in this scenario


Sunny hours: the independent variable (x)

Number of listeners: the dependent variable (y)


2.4 Algorithm calculation method under this scenario


1. Suppose the equation of the best-fit line is $y = a + bx$, where $b$ is the slope and $a$ is the intercept. The values below are computed from the eight data pairs as listed in the main method of section 5.3.



2. Calculate the means of the sunny hours and the number of listeners:

$\bar{x} = \frac{34.7}{8} = 4.3375, \qquad \bar{y} = \frac{311}{8} = 38.875$



3. Use least-squares regression to find the slope of the best-fit line:

$b = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} = \frac{122.8375}{23.01875} \approx 5.336$



4. Calculate the intercept of the best-fit line:

$a = \bar{y} - b\bar{x} \approx 38.875 - 5.336 \times 4.3375 \approx 15.73$



5. Substitute the slope and the intercept to obtain the equation of the best-fit line:

$y \approx 15.73 + 5.336x$



6. Calculate the sample standard deviations of the sunny hours and the number of listeners:

$s_x = \sqrt{\frac{23.01875}{7}} \approx 1.813, \qquad s_y = \sqrt{\frac{780.875}{7}} \approx 10.562$



7. Calculate the correlation coefficient:

$r = \frac{b\,s_x}{s_y} \approx \frac{5.336 \times 1.813}{10.562} \approx 0.916$



8. Use the correlation coefficient to judge how well the best-fit line fits the data, and obtain the prediction:



Since r is close to 1, there is a strong positive correlation between the number of concert listeners and the expected sunny hours. In other words, based on the available data, the best-fit line gives a reasonably good estimate of concert attendance from the expected sunny hours.



When 4.3 hours of sunshine are expected on the day of the concert, the best-fit line equation estimates an audience of about 3868 people.


3 Advantages and disadvantages of the algorithm

3.1 Advantages of this algorithm


Advantage: It uncovers the linear correlation pattern in two-variable data and yields a fitted line and results that can be used for prediction.


3.2 Disadvantages of this algorithm


Disadvantage: The estimate is only reliable within the range of the data already available and does not necessarily hold outside that range. In the concert example, the line may not give a good estimate of the audience size when 8 hours of sunshine are expected, because 8 hours lies outside the observed range of 1.9 to 7.2 hours.
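One way to make this limitation explicit in code is to check whether a test value lies inside the observed range before trusting the estimate. The sketch below is only an illustration of that idea; the class and method names are hypothetical and are not part of the article's source.

```java
/** Sketch: flag test values that fall outside the observed range of the independent variable. */
final class RangeGuardSketch {

    /** Returns true when x lies within [min, max] of the observed values. */
    static boolean withinObservedRange(double[] observedX, double x) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : observedX) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return x >= min && x <= max;
    }

    public static void main(String[] args) {
        double[] sunnyHours = {1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2};
        // 4.3 lies inside 1.9..7.2, so an estimate is an interpolation.
        System.out.println(withinObservedRange(sunnyHours, 4.3));
        // 8.0 lies outside the range, so any estimate is an extrapolation and should be treated with caution.
        System.out.println(withinObservedRange(sunnyHours, 8.0));
    }
}
```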


3.3 Scenarios this algorithm suits


There is a causal relationship between the variables, so the value of the effect can be estimated from the value of the cause, for example hours of light and rice yield.


3.4 Scenarios this algorithm does not suit


There is no causal relationship between the variables, for example a person's weight and height.


3.5 Data types this algorithm applies to


This algorithm works on data of type double. Results are kept to three decimal places by default, and the number of decimal places can be set by the caller.


4 Input data, intermediate results, and output of the algorithm

4.1 Input data of this algorithm

* @param twoVarData double[][], the sample data
 
* @param testiVar double, test value of the independent variable
 
* @param fraDigits int, number of decimal places to keep in the results

4.2 Intermediate results of this algorithm


* @param iVarAvg double, mean of the independent variable
 
* @param dVarAvg double, mean of the dependent variable
 
* @param slope double, slope of the best-fit line
 
* @param tangentDistance double, intercept of the best-fit line
 
* @param Sx double, standard deviation of the independent variable
 
* @param Sy double, standard deviation of the dependent variable



4.3 Output of this algorithm


* @return result String[4], the values {slope, intercept, best-fit line equation, correlation coefficient} as strings
 
* @return estValue double, estimated value of the dependent variable
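As a design alternative to a String[4], the same outputs could be carried in a small typed object, which avoids parsing strings back into numbers. The sketch below is hypothetical: `RegressionSummarySketch` and its members do not exist in the article's source, and the numbers in `main` are made-up values chosen only to show the formatting.

```java
import java.util.Locale;

/** Hypothetical typed container for the outputs listed in section 4.3. */
final class RegressionSummarySketch {
    final double slope;       // b
    final double intercept;   // a
    final double correlation; // r

    RegressionSummarySketch(double slope, double intercept, double correlation) {
        this.slope = slope;
        this.intercept = intercept;
        this.correlation = correlation;
    }

    /** Best-fit line as text; %+.3f always prints the sign, so a negative slope renders as "-". */
    String equation() {
        return String.format(Locale.ROOT, "y=%.3f%+.3fx", intercept, slope);
    }

    /** Estimated value of the dependent variable for a given test value. */
    double estimate(double testiVar) {
        return intercept + slope * testiVar;
    }

    public static void main(String[] args) {
        // Illustrative values only, not results from the article's data.
        RegressionSummarySketch s = new RegressionSummarySketch(2.0, 1.0, 0.9);
        System.out.println(s.equation());    // y=1.000+2.000x
        System.out.println(s.estimate(3.0)); // 7.0
    }
}
```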


5 Code reference for the algorithm

5.1 Basic description of the class and methods


Class source: see the source program Statistics.src.CorrelationAndRegression



The class uses the principles of correlation and regression to calculate the best-fit line for two-variable data (an independent variable and a dependent variable). The best-fit line [y=a+bx] captures the linear relationship between the two variables, so the value of the dependent variable can be predicted from the value of the independent variable.


5.2 Class and method invocation interfaces


See source program: Statistics.src.CorrelationAndRegression



CorrelationAndRegression.java contains the following methods:



calculateAvgValue(double[] varData, int fraDigits) // calculates the mean of a variable



calculateSlope(double[][] twoVarData, int fraDigits) // calculates the slope of the best-fit line



calculateTangentDistance(double slope, double iVarAvg, double dVarAvg, int fraDigits) // calculates the intercept of the best-fit line



calculateStandardDeviation(double[] varData, int fraDigits) // calculates the standard deviation of a variable



calculateCorrelationCoefficient(double slope, double Sx, double Sy, int fraDigits) // calculates the correlation coefficient



calculateEstimatedValue(double slope, double tangentDistance, double testiVar, int fraDigits) // calculates the estimated value of the dependent variable



analyze(double[][] twoVarData, int fraDigits) // runs the regression analysis and returns the results



Helper class called: Statistics.src.utils.ScoreUtil



Methods in ScoreUtil.java:



getFractionDigits(double, int) // keeps the given number of decimal places of a numeric value
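The source of ScoreUtil is not shown in this article. As a rough idea of what such a helper might look like, the sketch below rounds half-up with BigDecimal and returns the value as text; this is an assumption about its behaviour, not the actual implementation, and the class name `FractionDigitsSketch` is made up.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical stand-in for ScoreUtil; the real class may round or format differently. */
final class FractionDigitsSketch {

    /** Rounds value half-up to fraDigits decimal places and returns it as a plain decimal string. */
    static String getFractionDigits(double value, int fraDigits) {
        // BigDecimal.valueOf uses the double's shortest decimal text, which avoids binary noise
        // such as 38.87499999...; it throws NumberFormatException for NaN or infinite input.
        return BigDecimal.valueOf(value)
                .setScale(fraDigits, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(getFractionDigits(38.875, 2));  // 38.88
        System.out.println(getFractionDigits(0.91622, 3)); // 0.916
    }
}
```

The returned string can be turned back into a number with `Double.valueOf`, which is how the methods in section 5.3 use the helper.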


5.3 Source Code
import utils.ScoreUtil;

/**
 * Correlation and regression
 * @description Uses the principles of correlation and regression to calculate the best-fit line for two-variable data (independent and dependent variable),
 * and uses the best-fit line [y=a+bx] to uncover the linear relationship between the two variables, so that the value of the dependent variable can be predicted from the value of the independent variable.
 * Application scenarios: for example, predicting concert attendance from the weather, or estimating plant growth from hours of light.
 * Limitations: estimates can only be based on the existing data and may not hold outside the range of that data.
 * For example, data covering 2000 to 2010 cannot be used to predict values after 2010.
 * @author candymoon
 * @date 2015-8-13 4:19:57 PM
 */
public class CorrelationAndRegression {
    /**
     * Calculates the mean of the variable values (formula: mean = sum / count)
     * @param varData variable data values
     * @param fraDigits number of decimal places to keep in the result
     * @return mean
     */
    public static double calculateAvgValue(double[] varData, int fraDigits) {
        double avgValue = 0;
        int len = varData.length; // array length
        for (int i = 0; i < len; i++) {
            avgValue += varData[i];
        }
        // Calculate the mean
        avgValue = avgValue / len;
        // Keep fraDigits decimal places in the result (rounded)
        String avgValue_String = ScoreUtil.getFractionDigits(avgValue, fraDigits);
        avgValue = Double.valueOf(avgValue_String);
        return avgValue;
    }
    /**
     * Calculates the slope b of the best-fit line (y = a + bx)
     * @description The slope of the best-fit line (y=a+bx) is calculated as: b=∑((x-xAvg)(y-yAvg))/∑(x-xAvg)^2
     * , where x is the independent variable, xAvg is the mean of the independent variable, y is the dependent variable, and yAvg is the mean of the dependent variable.
     * @param twoVarData two-variable data: row 0 holds the independent variable, row 1 the dependent variable
     * @param fraDigits number of decimal places to keep in the result
     * @return slope
     */
    public static double calculateSlope(double[][] twoVarData, int fraDigits) {
        double slope = 0;   // slope
        double iVarAvg = 0; // mean of the independent variable
        double dVarAvg = 0; // mean of the dependent variable
        // Calculate the mean of the independent variable and the mean of the dependent variable
        iVarAvg = calculateAvgValue(twoVarData[0], fraDigits);
        dVarAvg = calculateAvgValue(twoVarData[1], fraDigits);
        int iVarLen = twoVarData[0].length;
        double numerator = 0;   // numerator
        double denominator = 0; // denominator
        // Calculate the numerator and denominator of the formula
        for (int i = 0; i < iVarLen; i++) {
            double x = twoVarData[0][i];
            double y = twoVarData[1][i];
            numerator += (x - iVarAvg) * (y - dVarAvg);
            denominator += (x - iVarAvg) * (x - iVarAvg);
        }
        // Calculate the slope
        slope = numerator / denominator;
        // Keep fraDigits decimal places in the result (rounded)
        String slope_String = ScoreUtil.getFractionDigits(slope, fraDigits);
        slope = Double.valueOf(slope_String);
        return slope;
    }
    /**
     * Calculates the intercept a of the best-fit line (y = a + bx)
     * @description Because the best-fit line passes through the point (mean of the independent variable, mean of the dependent variable), the formula is:
     * a=dVarAvg-b*iVarAvg
     * @param slope slope
     * @param iVarAvg mean of the independent variable
     * @param dVarAvg mean of the dependent variable
     * @param fraDigits number of decimal places to keep in the result
     * @return intercept
     */
    public static double calculateTangentDistance(double slope, double iVarAvg, double dVarAvg, int fraDigits) {
        double tanDis = 0; // intercept
        // Calculate the intercept
        tanDis = dVarAvg - (slope * iVarAvg);
        // Keep fraDigits decimal places in the result (rounded)
        String tanDis_String = ScoreUtil.getFractionDigits(tanDis, fraDigits);
        tanDis = Double.valueOf(tanDis_String);
        return tanDis;
    }
    /**
     * Calculates the standard deviation of the variable data
     * @description The sample standard deviation is calculated as: standard deviation = (∑(var-varAvg)^2/(n-1))^(1/2)
     * @param varData variable data
     * @param fraDigits number of decimal places to keep in the result
     * @return standard deviation
     */
    public static double calculateStandardDeviation(double[] varData, int fraDigits) {
        double standardDev = 0;  // standard deviation
        double varAvg = 0;       // mean of the variable
        double sumOfSquares = 0; // sum of squared deviations from the mean
        // Calculate the mean of the variable
        varAvg = calculateAvgValue(varData, fraDigits);
        int varDataLen = varData.length;
        for (int i = 0; i < varDataLen; i++) {
            double value = varData[i];
            sumOfSquares += (value - varAvg) * (value - varAvg);
        }
        // Calculate the standard deviation
        standardDev = Math.sqrt(sumOfSquares / (varDataLen - 1));
        // Keep fraDigits decimal places in the result (rounded)
        String standardDev_String = ScoreUtil.getFractionDigits(standardDev, fraDigits);
        standardDev = Double.valueOf(standardDev_String);
        return standardDev;
    }
    /**
     * Calculates the correlation coefficient
     * @description The correlation coefficient indicates how well the best-fit line fits the two-variable data. It is calculated as: r = (b*Sx)/Sy;
     * r lies in the range [-1, 1]. The closer |r| is to 1, the better the fit; the closer |r| is to 0, the worse the fit.
     * (1) r = -1 means perfect negative correlation; (2) r = 1 means perfect positive correlation; (3) r = 0 means no linear correlation.
     * @param slope slope
     * @param Sx standard deviation of the independent variable
     * @param Sy standard deviation of the dependent variable
     * @param fraDigits number of decimal places to keep in the result
     * @return correlation coefficient
     */
    public static double calculateCorrelationCoefficient(double slope, double Sx, double Sy, int fraDigits) {
        double r = 0; // correlation coefficient
        r = (slope * Sx) / Sy;
        // Keep fraDigits decimal places in the result (rounded)
        String r_String = ScoreUtil.getFractionDigits(r, fraDigits);
        r = Double.valueOf(r_String);
        return r;
    }
    /**
     * Calculates the estimated value of the dependent variable
     * @description Evaluates the best-fit line equation y=a+bx
     * @param slope slope of the best-fit line
     * @param tangentDistance intercept of the best-fit line
     * @param testiVar test value of the independent variable
     * @param fraDigits number of decimal places to keep in the result
     * @return estimated value of the dependent variable
     */
    public static double calculateEstimatedValue(double slope, double tangentDistance, double testiVar, int fraDigits) {
        double estValue = 0; // estimate
        // Evaluate the best-fit line equation y = a + bx
        estValue = tangentDistance + (slope * testiVar);
        // Keep fraDigits decimal places in the result (rounded)
        String estValue_String = ScoreUtil.getFractionDigits(estValue, fraDigits);
        estValue = Double.valueOf(estValue_String);
        return estValue;
    }
    /**
     * Finds the slope, intercept, best-fit line, and correlation coefficient
     * @param twoVarData two-variable data: row 0 holds the independent variable, row 1 the dependent variable
     * @param fraDigits number of decimal places to keep in the results
     * @return string result array {slope, intercept, best-fit line, correlation coefficient}
     */
    public static String[] analyze(double[][] twoVarData, int fraDigits) {
        String[] result = new String[4]; // string result array
        String bestFittingLine = "";
        double iVarAvg = 0; // mean of the independent variable
        double dVarAvg = 0; // mean of the dependent variable
        double b = 0;       // slope
        double a = 0;       // intercept
        double r = 0;       // correlation coefficient
        double Sx = 0;      // standard deviation of the independent variable
        double Sy = 0;      // standard deviation of the dependent variable
        // Calculate the means of the independent and dependent variables
        iVarAvg = calculateAvgValue(twoVarData[0], fraDigits);
        dVarAvg = calculateAvgValue(twoVarData[1], fraDigits);
        // Calculate the slope
        b = calculateSlope(twoVarData, fraDigits);
        // Calculate the intercept
        a = calculateTangentDistance(b, iVarAvg, dVarAvg, fraDigits);
        // Calculate the standard deviations of the independent and dependent variables
        Sx = calculateStandardDeviation(twoVarData[0], fraDigits);
        Sy = calculateStandardDeviation(twoVarData[1], fraDigits);
        // Calculate the correlation coefficient
        r = calculateCorrelationCoefficient(b, Sx, Sy, fraDigits);
        // Assemble the best-fit line equation
        bestFittingLine = "y=" + a + "+" + b + "x";
        // Add the results to the result array
        result[0] = b + "";
        result[1] = a + "";
        result[2] = bestFittingLine;
        result[3] = r + "";
        return result;
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        /*
         * iVar is an abbreviation of independent variable
         * dVar is an abbreviation of dependent variable
         */
        double[] iVar = {1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2}; // independent variable values
        double[] dVar = {22, 33, 30, 42, 38, 49, 42, 55};         // dependent variable values
        double[][] twoVarData = new double[][]{iVar, dVar};       // two-variable data
        int fraDigits = 3;     // number of decimal places to keep in the results
        double testiVar = 4.3; // test value of the independent variable
        String[] result = analyze(twoVarData, fraDigits);
        double estValue = 0;   // estimate
        System.out.println("-------------output result---------------");
        System.out.println(" Slope: " + result[0] + " Intercept: " + result[1]);
        System.out.println(" Best fit line: " + result[2]);
        System.out.println(" Correlation coefficient: " + result[3]);
        estValue = calculateEstimatedValue(Double.valueOf(result[0]), Double.valueOf(result[1]), testiVar, fraDigits);
        System.out.println(" Test value of the independent variable: " + testiVar);
        System.out.println(" Estimated value of the dependent variable: " + estValue);
        System.out.println("----------------------------------");
    }

}
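The formula r = (b*Sx)/Sy used above is algebraically the same as Pearson's r computed directly from the deviation sums, because b = Sxy/Sxx and the (n-1) factors in Sx and Sy cancel, leaving r = Sxy/sqrt(Sxx*Syy). The sketch below cross-checks this on the concert data from main; the class name is made up for this example, and no intermediate rounding is applied, so the last digits may differ slightly from the output of the class above.

```java
import java.util.Locale;

/** Sketch: Pearson's r from the deviation sums, as a cross-check for r = b*Sx/Sy. */
final class PearsonCheckSketch {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double xAvg = 0, yAvg = 0;
        for (int i = 0; i < n; i++) {
            xAvg += x[i];
            yAvg += y[i];
        }
        xAvg /= n;
        yAvg /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xAvg;
            double dy = y[i] - yAvg;
            sxy += dx * dy; // sum of cross products of deviations
            sxx += dx * dx; // sum of squared x deviations
            syy += dy * dy; // sum of squared y deviations
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] iVar = {1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2};
        double[] dVar = {22, 33, 30, 42, 38, 49, 42, 55};
        System.out.printf(Locale.ROOT, "r = %.3f%n", pearson(iVar, dVar)); // r = 0.916
    }
}
```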




6 Shares

PPT: http://yunpan.cn/cfwawexctkmed (access password: 291e)

Open source code: http://yunpan.cn/cFWAFPNrvn6PV (access password: 8208)