Definition and calculation of covariance and covariance matrices

Source: Internet
Author: User

the meaning and calculation formula of covariance

Children who have studied probability statistics know that the most basic concept in statistics is the mean, variance, or standard deviation of a sample. First we give you a set of n samples, and then give the formula description of these concepts, these high school students who have learned maths should know, around the area.

mean value:


Standard deviation:


Variance:


Obviously, the mean value describes the middle point of the sample collection, which tells us that the information is very limited,

The standard deviation gives us a description of the average distance from each sample point of the sample set to the mean. Take these two sets as an example, [0,8,12,20] and [8,9,11,12], the average value of two sets is 10, but obviously two sets the difference is very large, the standard deviation of the two, the former is 8.3, the latter is 1.8, obviously the latter is more concentrated, so its standard deviation is smaller, the standard deviation is described in this "Scatter degree". The reason for dividing by n-1 instead of dividing by N is that it allows us to better approximate the overall standard deviation with a smaller set of samples, which is the statistically so-called "unbiased estimate".

And the variance is just the square of the standard deviation.



Why the covariance is required.

The above statistics seem to have been described almost, but we should note that the standard deviation and variance is generally used to describe one-dimensional data, but in real life we often encounter data sets that contain multidimensional data, the simplest of us to go to school when you have to count a number of subjects test scores. In the face of such datasets, of course we can calculate their variance independently of each dimension, but usually we want to know more, for example, if a boy's wretched degree is related to his popularity with a girl, hehe. Covariance is a statistic used to measure the relationship of two random variables, We can imitate the definition of variance:


To measure the degree to which each dimension deviates from its mean, the standard deviation can be defined as follows:


What is the significance of the results of the covariance? If the result is positive, then the two are positive correlation (from covariance can lead to the definition of "correlation coefficient"), that is, the more wretched a person is more popular with girls, hey, that must ~ negative results for negative correlation, the more wretched girls more annoying, maybe. If it is 0, it is statistically said to be "mutually independent".

From the definition of covariance we can also see some obvious properties, such as:



Covariance is more than the covariance matrix

The wretched and popular problem mentioned in the previous section is a typical two-dimensional problem, and covariance can only deal with two-dimensional problems, that is, the number of dimensions is more natural than the need to calculate multiple covariance, such as n-dimensional datasets need to calculate


Covariance, it is natural for us to think of using matrices to organize this data. Give the definition of the covariance matrix:


This definition is still very easy to understand, we can give a simple three-dimensional example, assuming that the dataset has three dimensions, then the covariance matrix is


As can be seen, the covariance matrix is a symmetric matrix, and the diagonal is the variance on each dimension.

matlab covariance Combat

The above-mentioned content is relatively easy, covariance matrix seems to be very simple, but the actual combat is very easy to confuse people. It is important to be clear that the covariance Matrix calculates the covariance between different dimensions, rather than between different samples. This I will combine the following example shows that the following demonstration will use MATLAB, in order to illustrate the principle of computation, do not directly invoke the MATLAB cov function (blue part of MATLAB code).

Firstly, a 10*3-dimensional integer matrix is randomly generated as a sample set, 10 is the number of samples, and 3 is the dimension of the sample.

1

mysample = Fix (rand (10,3) *50)


According to the formula, it is necessary to calculate the mean for the covariance, which is the average or column by row, and I always bothered with this problem from the start. In particular, we have emphasized that the covariance matrix is the covariance between different dimensions, and always keep this in mind. Each row of the sample matrix is a sample, each column is a dimension, so we are going to calculate the mean by columns . To describe the convenience, we first assign the data of three dimensions:

23

DIM1 = Mysample (:, 1);d im2 = Mysample (:, 2);d im3 = Mysample (:, 3);

Calculate the covariance of dim1 and dim2,dim1 with DIM3,DIM2 and dim3:

123

Sum ((Dim1-mean (DIM1)). * (Dim2-mean (DIM2)))/(Size (mysample,1)-1)% get 74.5333sum ((Dim1-mean (DIM1)). * (Dim3-mean (d)   IM3))/(Size (mysample,1)-1)% get -10.0889sum ((Dim2-mean (DIM2))/(Dim3-mean (DIM3))/(Size (mysample,1)-1)% get -106.4000

It's much easier to figure this out, and the diagonal of the covariance matrix is the variance on each dimension, which we'll calculate in turn:

123

STD (DIM1) ^2% get 108.3222std (dim2) ^2% get 260.6222std (dim3) ^2% get 94.1778

In this way, we get all the data needed to compute the covariance matrix and invoke the COV function from Matlab to verify:

1

CoV (mysample)


The data we calculate are not the same.

Update: Suddenly discovered today, the original covariance matrix can also be computed, first let the sample matrix center, that is, each dimension minus the mean of the dimension, so that the average value of each dimension is 0, and then directly with the new sample matrix by its transpose, and then divided by (N-1). In fact, this method is also from the previous formula channel, but the understanding is not very intuitive, but in the abstract formula derivation is still very common. Also gives the MATLAB code implementation:

12

X = Mysample-repmat (mean (mysample), 10, 1); The%-centric sample matrix makes each dimension mean 0C = (X ' *x)./(Size (x,1)-1);

Summary

The key to understanding the covariance matrix is to keep in mind that it calculates the covariance between the different dimensions, not between different samples, to get a sample matrix, we first want to be clear is a row is a sample or a dimension, the heart clear this entire calculation process will go down the river, so that will not be confused ~

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.