Concept and calculation of covariance matrix

Source: Internet
Author: User

The key to understanding the covariance matrix is to remember that it measures the covariance between different dimensions, not between different samples. Given a sample matrix, the first thing to settle is whether a row is a sample or a dimension. Once that is clear, the rest of the calculation follows naturally and you will not get confused.

Talking about covariance matrix

I was reading a paper today and ran into the covariance matrix again. It gave me a lot of trouble back when I was studying pattern classification, and it turns out I still did not really understand it, so I looked it up properly and decided to write it down right away, hehe. I will walk through the covariance matrix step by step, in my own words.

Basic concepts of statistics

Anyone who has studied probability and statistics knows that the most basic statistics of a sample are its mean, variance, and standard deviation. Suppose we are given a set of n samples X_1, ..., X_n. The formulas for these concepts are listed below; anyone who took high school maths should know them, so I will only go through them briefly.

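Mean value: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$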

The mean describes the middle point of the sample set, which on its own tells us very little. The standard deviation describes the typical distance from each sample point to the mean. Take the two sets [0,8,12,20] and [8,9,11,12] as an example. Both have a mean of 10, but they are clearly very different: the standard deviation of the first is about 8.3 and that of the second is about 1.8. The second set is more concentrated, so its standard deviation is smaller. This "degree of scatter" is exactly what the standard deviation describes. We divide by n-1 rather than n because this lets a small sample better approximate the population: with n-1 the sample variance is what statistics calls an "unbiased estimate" of the population variance. The variance is simply the square of the standard deviation.
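To make the numbers above concrete, here is a small MATLAB sketch that applies the n-1 formula to the two example sets and compares the result with the built-in std. The variable names (a, b, std_a and so on) are my own and are only for illustration.

```matlab
% Two sample sets with the same mean but very different spread.
a = [0 8 12 20];
b = [8 9 11 12];

mean_a = mean(a);   % 10
mean_b = mean(b);   % 10

% Sample standard deviation written out from the formula (divide by n-1).
std_a = sqrt(sum((a - mean_a).^2) / (numel(a) - 1));   % about 8.33
std_b = sqrt(sum((b - mean_b).^2) / (numel(b) - 1));   % about 1.83

% The built-in std uses the same n-1 normalization by default.
fprintf('std_a = %.4f (built-in %.4f)\n', std_a, std(a));
fprintf('std_b = %.4f (built-in %.4f)\n', std_b, std(b));

% Variance is the square of the standard deviation.
var_a = std_a^2;    % 208/3, about 69.33
```

Both sets report a mean of 10, so the mean alone cannot tell them apart; only the spread does.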

Why is covariance required?

The statistics above seem to cover most of what we need, but note that standard deviation and variance describe one-dimensional data. In real life we often meet data sets with several dimensions; the simplest example is the scores for several subjects that get tallied up at school. For such a data set we can of course compute the variance of each dimension independently, but usually we want to know more, for example whether how sleazy a boy is has anything to do with how popular he is with girls, hehe. Covariance is a statistic that measures the relationship between two random variables. We can define it by imitating the definition of variance:

$\mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n-1}$

To measure how the two dimensions deviate from their means together, the covariance can be defined as follows:

$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$

What does the value of the covariance mean? If it is positive, the two variables are positively correlated (the definition of the "correlation coefficient" can be derived from covariance): the sleazier a boy is, the more popular he is with girls. Hey, that must be right ~ If it is negative, they are negatively correlated: the sleazier he is, the more girls dislike him. Maybe? If it is 0, the two are said to be "uncorrelated", meaning there is no linear relationship between them. Note that this is weaker than being statistically independent: independent variables have zero covariance, but zero covariance does not guarantee independence.
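The sign interpretation can be tried out on a few made-up vectors. The sketch below is only an illustration: covxy is a helper name I introduce here, and the data is invented. Note that a covariance of 0 only rules out a linear relationship; the last example has y fully determined by x and still gets 0.

```matlab
% Sample covariance of two equally long vectors, straight from the definition.
% covxy is a hypothetical helper name used only in this sketch.
covxy = @(x, y) sum((x - mean(x)) .* (y - mean(y))) / (numel(x) - 1);

x = [1 2 3 4 5];

c_pos = covxy(x, 2*x + 1);    % y rises with x: positive (5)
c_neg = covxy(x, 10 - x);     % y falls as x rises: negative (-2.5)

% Zero covariance without independence: u.^2 is a function of u,
% yet the covariance is 0 because the relationship is not linear.
u = [-2 -1 0 1 2];
c_zero = covxy(u, u.^2);      % 0

% The correlation coefficient rescales covariance into [-1, 1].
r = c_pos / (std(x) * std(2*x + 1));   % 1 (up to rounding) for an exact increasing line
```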

From the definition of covariance we can also see some obvious properties, such as:

$\mathrm{cov}(X, X) = \mathrm{var}(X)$
$\mathrm{cov}(X, Y) = \mathrm{cov}(Y, X)$
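Two properties that follow directly from the definition are cov(X,X) = var(X) and cov(X,Y) = cov(Y,X). A quick numerical check, using invented data and variable names of my own, might look like this:

```matlab
% Made-up data for two dimensions measured on the same 5 samples.
x = [2 4 4 7 9];
y = [1 3 2 8 6];
n = numel(x);

cov_xx = sum((x - mean(x)) .* (x - mean(x))) / (n - 1);
cov_xy = sum((x - mean(x)) .* (y - mean(y))) / (n - 1);
cov_yx = sum((y - mean(y)) .* (x - mean(x))) / (n - 1);

% Covariance of a variable with itself is its variance.
same_as_var = abs(cov_xx - var(x)) < 1e-12;

% Swapping the two variables does not change the covariance.
is_symmetric = abs(cov_xy - cov_yx) < 1e-12;
```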


Many covariances make a covariance matrix

The sleaziness and popularity problem from the previous section is a typical two-dimensional problem, and a single covariance can only handle two dimensions. With more dimensions we naturally need to compute several covariances: an n-dimensional data set needs $\frac{n!}{(n-2)! \cdot 2} = \frac{n(n-1)}{2}$ of them (here n is the number of dimensions, not the number of samples). It is natural to organize these values in a matrix. The definition of the covariance matrix is:

$C_{n \times n} = (c_{i,j}), \quad c_{i,j} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$

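This definition is easy to understand. As a simple example, suppose the data set has three dimensions {x, y, z}. Then the covariance matrix is

$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$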

As can be seen, the covariance matrix is a symmetric matrix, and the diagonal is the variance on each dimension.
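The definition can be turned into a double loop over the dimensions. The following is a minimal sketch on an invented 4-by-3 matrix A (rows are samples, columns are dimensions); it is meant to mirror the definition, not to be an efficient implementation.

```matlab
% Rows are samples, columns are dimensions (values are made up).
A = [2 8 1;
     4 6 3;
     6 5 2;
     8 1 6];
[n, d] = size(A);

% Fill C entry by entry: C(i,j) = cov(Dim_i, Dim_j).
C = zeros(d, d);
for i = 1:d
    for j = 1:d
        di = A(:, i) - mean(A(:, i));
        dj = A(:, j) - mean(A(:, j));
        C(i, j) = sum(di .* dj) / (n - 1);
    end
end

num_pairs = nchoosek(d, 2);                          % 3 distinct off-diagonal covariances
is_symmetric = isequal(C, C');                       % true
diag_is_var = max(abs(diag(C)' - var(A))) < 1e-12;   % diagonal holds the variances
```

Because the matrix is symmetric, only the diagonal and the nchoosek(d, 2) entries above it actually need to be computed.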

Covariance in practice with MATLAB

Everything so far is fairly easy and the covariance matrix looks simple, but in practice it is very easy to get confused. The important thing to be clear about is that the covariance matrix holds the covariance between different dimensions, not between different samples. I will show this with an example in MATLAB. To illustrate how the computation works, I compute everything by hand first and only call MATLAB's cov function at the end to verify.
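A quick way to see why orientation matters is to look at the shape of the result. This sketch calls cov directly, purely to show the sizes; the matrix A is invented for the illustration.

```matlab
% 5 samples (rows) of 2 dimensions (columns); values are made up.
A = [1 2;
     2 4;
     3 5;
     4 9;
     5 10];

C_dims    = cov(A);    % 2-by-2: covariance between the 2 dimensions
C_samples = cov(A');   % 5-by-5: treats each sample as a variable, rarely what we want

size_dims    = size(C_dims);      % [2 2]
size_samples = size(C_samples);   % [5 5]
```

If the result has one row per sample instead of one row per dimension, the matrix was most likely passed in the wrong orientation.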

First, randomly generate a 10-by-3 integer matrix as the sample set, where 10 is the number of samples and 3 is the number of dimensions of each sample.

MySample = fix(rand(10,3)*50)

According to the formula, computing the covariance requires the mean. But is that the mean of each row or of each column? This question bothered me from the start. We have stressed that the covariance matrix holds the covariance between different dimensions, so keep that in mind. Each row of the sample matrix is a sample and each column is a dimension, so we compute the mean by columns. For convenience, first assign the data of the three dimensions to separate variables:

dim1 = MySample(:,1);
dim2 = MySample(:,2);
dim3 = MySample(:,3);

Calculate the covariance of dim1 and dim2, dim1 and dim3, and dim2 and dim3 (the values in the comments are from my run; yours will differ because the data is random):

sum((dim1-mean(dim1)).*(dim2-mean(dim2)))/(size(MySample,1)-1) % gives 74.5333
sum((dim1-mean(dim1)).*(dim3-mean(dim3)))/(size(MySample,1)-1) % gives -10.0889
sum((dim2-mean(dim2)).*(dim3-mean(dim3)))/(size(MySample,1)-1) % gives -106.4000

With that done, the rest is easy. The diagonal of the covariance matrix is the variance of each dimension, which we compute in turn:

std(dim1)^2 % gives 108.3222
std(dim2)^2 % gives 260.6222
std(dim3)^2 % gives 94.1778

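We now have every value needed to fill in the covariance matrix. Call MATLAB's cov function to verify:

cov(MySample)

Are the values the same as the ones we calculated by hand? They should be: the off-diagonal entries are the three covariances above and the diagonal entries are the three variances.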
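The comparison can also be done programmatically instead of by eye. The sketch below regenerates a random sample set (so the numbers will not match the ones quoted above), assembles the matrix from the hand-computed pieces, and measures the largest difference from cov. The names c12, c13, c23, C_manual, and max_diff are my own.

```matlab
MySample = fix(rand(10, 3) * 50);   % fresh random sample set, rows are samples
n = size(MySample, 1);
dim1 = MySample(:, 1);
dim2 = MySample(:, 2);
dim3 = MySample(:, 3);

% Pairwise covariances from the definition.
c12 = sum((dim1 - mean(dim1)) .* (dim2 - mean(dim2))) / (n - 1);
c13 = sum((dim1 - mean(dim1)) .* (dim3 - mean(dim3))) / (n - 1);
c23 = sum((dim2 - mean(dim2)) .* (dim3 - mean(dim3))) / (n - 1);

% Variances on the diagonal, covariances mirrored off the diagonal.
C_manual = [std(dim1)^2, c12,         c13;
            c12,         std(dim2)^2, c23;
            c13,         c23,         std(dim3)^2];

% Largest absolute difference from the built-in result.
max_diff = max(max(abs(C_manual - cov(MySample))));
fprintf('largest difference from cov: %g\n', max_diff);
```

The difference should be zero or at the level of floating-point rounding, so a tolerance check is safer than testing for exact equality.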

Update: I just realized today that the covariance matrix can also be computed another way. First center the sample matrix, that is, subtract from each dimension the mean of that dimension, so that every dimension has mean 0. Then multiply the transpose of the centered matrix by the centered matrix itself and divide by (N-1), where N is the number of samples. This method actually follows from the earlier formula. It is less intuitive, but it is very common in abstract formula derivations. Here is the MATLAB implementation:

X = MySample - repmat(mean(MySample), 10, 1); % center the sample matrix so that each dimension has mean 0
C = (X'*X)./(size(X,1)-1);
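As a sanity check on the centered-matrix method, the following sketch compares it with cov on a freshly generated sample set. It also shows bsxfun as one alternative to repmat for the centering step; that substitution is my own suggestion, not part of the method described above.

```matlab
MySample = fix(rand(10, 3) * 50);   % rows are samples, columns are dimensions
n = size(MySample, 1);

% Center each column, then form the 3-by-3 matrix X' * X / (n - 1).
X = MySample - repmat(mean(MySample), n, 1);
C = (X' * X) / (n - 1);

% Same centering without building the repeated mean matrix.
% (On MATLAB R2016b or later, MySample - mean(MySample) also works.)
X2 = bsxfun(@minus, MySample, mean(MySample));

col_means_zero = max(abs(mean(X))) < 1e-12;              % every dimension now has mean 0
matches_cov = max(max(abs(C - cov(MySample)))) < 1e-9;   % agrees with the built-in cov

% Note the order of the product: X * X' would be 10-by-10 and would
% compare samples with each other rather than dimensions.
```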
