Summing up, covariance is actually the average of the multiplication of data deviations for any two dimensions.
The meaning and calculation formula of covariance
Children who have studied probability statistics know that the most basic concept in statistics is the mean, variance, or standard deviation of a sample. First we give you a set of n samples, and then give the formula description of these concepts, these high school students who have learned maths should know, around the area.
Mean value:
Standard deviation:
Variance:
Obviously, the mean value describes the middle point of the sample collection, which tells us that the information is very limited,
The standard deviation gives us a description of the average distance from each sample point of the sample set to the mean. Take these two sets as an example, [0,8,12,20] and [8,9,11,12], the average value of two sets is 10, but obviously two sets the difference is very large, the standard deviation of the two, the former is 8.3, the latter is 1.8, obviously the latter is more concentrated, so its standard deviation is smaller, the standard deviation is described in this "Scatter degree". The reason for dividing by n-1 instead of dividing by N is that it allows us to better approximate the overall standard deviation with a smaller set of samples, which is the statistically so-called "unbiased estimate ".
And the variance is just the square of the standard deviation.
Why is covariance required?
The above statistics seem to have been described almost, but we should note that the standard deviation and variance is generally used to describe one-dimensional data, but in real life we often encounter data sets that contain multidimensional data, the simplest of us to go to school when you have to count a number of subjects test scores. In the face of such datasets, of course we can calculate their variance independently of each dimension, but usually we want to know more, for example, if a boy's wretched degree is related to his popularity with a girl, hehe. Covariance is a statistic used to measure the relationship of two random variables, We can imitate the definition of variance:
To measure the degree to which each dimension deviates from its mean, the standard deviation can be defined as follows:
What is the significance of the results of the covariance? If the result is positive, then the two are positive correlation (from covariance can lead to the definition of "correlation coefficient"), that is, the more wretched a person is more popular with girls, hey, that must ~ negative results for negative correlation, the more wretched girls more annoying, maybe? If it is 0, it is statistically said to be "mutually independent".
From the definition of covariance we can also see some obvious properties, such as:
Covariance is more than the covariance matrix
The wretched and popular problem mentioned in the previous section is a typical two-dimensional problem, and covariance can only deal with two-dimensional problems, that is, the number of dimensions is more natural than the need to calculate multiple covariance, such as n-dimensional datasets need to calculate
Covariance, it is natural for us to think of using matrices to organize this data. Give the definition of the covariance matrix:
This definition is still very easy to understand, we can give a simple three-dimensional example, assuming that the dataset has three dimensions, then the covariance matrix is
As can be seen, the covariance matrix is a symmetric matrix, and the diagonal is the variance on each dimension.
Matlab Covariance actual combat covariance matrix calculates the covariance between different dimensions, rather than between different samples. This I will combine the following example illustrates that the following demonstration will use MATLAB, in order to illustrate the principle of computation, do not directly invoke the MATLAB cov function (blue part of MATLAB code). first, randomly generate a 10*3-dimensional integer matrix as the sample set, 10 is the number of samples, and 3 is the dimension of the sample.
1 |
mysample = Fix (rand (10,3) *50) |
  According to the formula, it is necessary to calculate the mean for the covariance, that is, the mean or column by row, I always bothered with this problem from the start. In particular, we have emphasized that the covariance matrix is the covariance between different dimensions, and always keep this in mind. Each row of the sample matrix is a sample, each column is a dimension, so we want to calculate the mean by column . For the sake of convenience, we first assign values for three dimensions: |
23 |
DIM1 = Mysample ( :, 1);d im2 = Mysample (:, 2);d im3 = Mysample (:, 3); |
Calculate the covariance of dim1 and dim2,dim1 with DIM3,DIM2 and dim3:
123 |
Sum ((Dim1-mean (DIM1)). * (Dim2-mean (DIM2)))/(Size (mysample,1)-1)% get 74.5333sum ((Dim1-mean (DIM1)). * (Dim3-mean (d) IM3))/(Size (mysample,1)-1)% get -10.0889sum ((Dim2-mean (DIM2))/(Dim3-mean (DIM3))/(Size (mysample,1)-1)% get -106.4000 |
It's much easier to figure this out, and the diagonal of the covariance matrix is the variance on each dimension, which we'll calculate in turn:
123 |
STD (DIM1) ^2% get 108.3222std (dim2) ^2% get 260.6222std (dim3) ^2% get 94.1778 |
In this way, we get all the data needed to compute the covariance matrix and invoke the COV function from Matlab to verify:
Does the data we calculate be the same?
Update: Suddenly discovered today, the original covariance matrix can also be computed, first let the sample matrix center, that is, each dimension minus the mean of the dimension, so that the average value of each dimension is 0, and then directly with the new sample matrix by its transpose, and then divided by (N-1). In fact, this method is also from the previous formula channel, but it is not very intuitive to understand, but in the abstract formula derivation is still very common! Also gives the MATLAB code implementation:
12 |
X = Mysample-repmat (mean (mysample), 10, 1); The%-centric sample matrix makes each dimension mean 0C = (X ' *x)./(Size (x,1)-1); |
Summarize
The key to understanding the covariance matrix is to keep in mind that it calculates the covariance between the different dimensions, not between different samples, to get a sample matrix, we first want to be clear is a row is a sample or a dimension, the heart clear this entire calculation process will go down the river, so that will not be confused ~
--ppt examples of covariance and correlation coefficients
http://download.csdn.net/detail/goodshot/5087550
Covariance matrix (reprint)