I. Basic concepts of statistics
The most basic concepts in statistics are the sample mean, variance, and standard deviation. Given a set of n samples $\{X_1, X_2, \ldots, X_n\}$, the formulas for these concepts are as follows:
Mean:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$

Variance:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
The mean describes the center of the sample set, but the information it gives us is limited, while the standard deviation describes the average distance from each sample point to the mean.
Take these two sets as an example: [0, 8, 12, 20] and [8, 9, 11, 12]. Both have a mean of 10, but the two sets are obviously very different: the standard deviation of the first is about 8.3 and that of the second is about 1.8. The second set is clearly more concentrated, so its standard deviation is smaller; this "degree of scatter" is exactly what the standard deviation describes. The reason for dividing by n-1 rather than n is that it lets us better approximate the population standard deviation from a small sample, the so-called "unbiased estimate". The variance is simply the square of the standard deviation.
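As a quick check, a minimal MATLAB sketch of these two sets (the variable names a and b are just for illustration):

    a = [0 8 12 20];
    b = [8 9 11 12];
    mean(a)   % 10
    mean(b)   % 10
    std(a)    % ~8.33 (std normalizes by n-1 by default)
    std(b)    % ~1.83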
II. Why covariance is needed
Standard deviation and variance describe one-dimensional data, but in real life we often meet datasets with several dimensions; the simplest example is the scores of several school subjects that everyone has had to tally at some point. Faced with such a dataset, we can of course compute the variance of each dimension independently, but usually we want to know more, for example whether a boy's degree of wretchedness has any connection to how popular he is with girls. Covariance is the statistic used to measure the relationship between two random variables, and we can model it on the definition of variance:

$$\mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n-1}$$
To measure how much two dimensions deviate from their respective means together, covariance can be defined like this:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
What does the result of the covariance tell us? If it is positive, the two variables are positively correlated (the "correlation coefficient" can be derived from the covariance), meaning that the more wretched a person is, the more popular he is with girls. If it is negative, the two are negatively correlated: the more wretched, the more the girls are put off. If it is 0, there is no linear relationship between the two; wretchedness and popularity with girls are unrelated, which statistically is called being "uncorrelated" (strictly speaking, zero covariance does not by itself imply full independence).
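For reference, the correlation coefficient mentioned above is just the covariance rescaled by the two standard deviations, which confines it to a fixed range:

$$\rho_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}, \qquad -1 \le \rho_{XY} \le 1$$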
From the definition of covariance we can also read off some obvious properties, such as:

$$\mathrm{cov}(X, X) = \mathrm{var}(X), \qquad \mathrm{cov}(X, Y) = \mathrm{cov}(Y, X)$$
III. The covariance matrix
The wretchedness-versus-popularity problem above is a typical two-dimensional problem, and covariance can only handle two dimensions at a time. With more dimensions we naturally need to compute many covariances: an n-dimensional dataset requires $\frac{n!}{(n-2)! \cdot 2} = \binom{n}{2}$ of them, so it is natural to organize the results in a matrix. Here is the definition of the covariance matrix:

$$C_{n \times n} = (c_{i,j}), \qquad c_{i,j} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$$
This definition is easy to understand. As a three-dimensional example, suppose the dataset has three dimensions x, y, and z; then the covariance matrix is:

$$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$
As you can see, the covariance matrix is symmetric, and its diagonal entries are the variances of the individual dimensions.
IV. Hands-on with covariance in MATLAB
It is important to be clear that the covariance matrix computes the covariance between different dimensions, not between different samples. The demo below uses MATLAB; to illustrate how the computation works, we will not call MATLAB's cov function directly at first:
First, randomly generate a 10x3 integer matrix as the sample set: 10 is the number of samples and 3 is the number of dimensions.
Figure 1: Creating the sample set in MATLAB
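The commands would look something like the following sketch (the variable name MySample is an assumption):

    % generate a 10x3 matrix of random integers: 10 samples, 3 dimensions
    MySample = fix(rand(10, 3) * 50)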
According to the formula, computing the covariance requires the means first, and once again: the covariance matrix computes covariance between different dimensions, so always keep this in mind. Each row of the sample matrix is a sample and each column is a dimension, so the means must be taken column by column. For convenience, we first assign each dimension's data to its own variable:
Figure 2: Assigning each dimension's data to its own variable
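A sketch of the assignment, using the dim1/dim2/dim3 names from the text (each column of MySample is one dimension):

    dim1 = MySample(:, 1);   % first dimension (column 1)
    dim2 = MySample(:, 2);   % second dimension (column 2)
    dim3 = MySample(:, 3);   % third dimension (column 3)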
Calculate the covariance of dim1 and dim2, of dim1 and dim3, and of dim2 and dim3:
Figure 3: Computing the three covariances
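One way to compute the three covariances by hand from the definition (the helper variables n, c12, c13, c23 are assumptions; note that the division is by n-1):

    n = size(MySample, 1);   % number of samples (10)
    c12 = sum((dim1 - mean(dim1)) .* (dim2 - mean(dim2))) / (n - 1)   % cov(dim1, dim2)
    c13 = sum((dim1 - mean(dim1)) .* (dim3 - mean(dim3))) / (n - 1)   % cov(dim1, dim3)
    c23 = sum((dim2 - mean(dim2)) .* (dim3 - mean(dim3))) / (n - 1)   % cov(dim2, dim3)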
The elements on the diagonal of the covariance matrix are the variances of the individual dimensions, so we compute these variances in turn:
Figure 4: Computing the variances on the diagonal
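A sketch of the variance computation, again dividing by n-1 (the names v1, v2, v3 are assumptions):

    v1 = sum((dim1 - mean(dim1)).^2) / (n - 1)   % var(dim1), same as std(dim1)^2
    v2 = sum((dim2 - mean(dim2)).^2) / (n - 1)   % var(dim2)
    v3 = sum((dim3 - mean(dim3)).^2) / (n - 1)   % var(dim3)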
With that, we have all the values needed to fill in the covariance matrix. We can also call MATLAB's built-in cov function to get the covariance matrix directly:
Figure 5: Computing the sample covariance matrix directly with MATLAB's cov function
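A sketch of the check: assemble the matrix from the hand-computed values and compare it with cov(MySample) (the names carry over from the sketches above):

    C_manual = [v1  c12 c13;
                c12 v2  c23;
                c13 c23 v3]
    C = cov(MySample)   % MATLAB's built-in result; should match C_manual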
The result is exactly the same as the matrix assembled from the values computed above.
V. Summary
The key to understanding the covariance matrix is to keep in mind that it measures the covariance between different dimensions, not between different samples. Given a sample matrix, the first thing to establish is whether a row is a sample or a dimension; once that is clear, the whole calculation flows naturally and there is no room for confusion.
Original address:
http://pinkyjie.com/2010/08/31/covariance/