first, the basic concept of statistics
The most basic concept in statistics is the mean, variance and standard deviation of the sample. First, we give a set of n samples, and the formula for these concepts is described below:
Mean value:
Standard deviation:
Variance:
The mean value describes the middle point of the sample set, which tells us that the information is finite, and the standard deviation describes the average of the distance from each sample point of the sample set to the mean value.
Take these two sets for example, [0, 8, 12, 20] and [8, 9, 11, 12], two sets of the mean is 10, but obviously two sets of the difference is very large, calculate the standard deviation, the former is 8.3, the latter is 1.8, it is obvious that the latter is more concentrated, so its standard deviation is described in this "Dispersion degree". This is divided by n-1 rather than n because it allows us to better approximate the overall standard deviation with a smaller set of samples, known as the "unbiased estimate" statistically. And the variance is just the square of the standard deviation.
Second, why the need for covariance
Standard deviation and variance are generally used to describe one-dimensional data, but in real life we often encounter data sets containing multidimensional data, the simplest is that everyone in school will inevitably have to count a number of subjects in the exam results. In the face of such a dataset, we can of course compute its variance independently of each dimension, but usually we want to know more about the extent to which a boy's degree of indecency is related to the popularity of a girl. Covariance is such a statistic used to measure the relationship of two random variables that we can emulate the definition of variance:
To measure the degree to which each dimension deviates from its mean, the covariance can be defined as:
What is the point of covariance results? If the result is positive, the two are positively correlated (the definition of "correlation coefficient" can be drawn from covariance), which means that the more wretched a person is, the more popular The girl is. If the result is negative, it means that the two are negatively correlated, the more wretched the girl the more annoying. If 0, then there is no relationship between the two, the wretched not wretched and girls like there is no association between, is the statistical said "mutual independence."
We can also see some obvious properties from the definition of covariance, such as:
third, covariance matrix
The wretched and popular questions mentioned above are typical two-dimensional problems, and covariance can only deal with two-dimensional problems, the number of dimensions of the natural need to calculate multiple covariance, such as n-dimensional data sets need to compute covariance, that naturally we think of using matrices to organize the data. The definition of covariance matrix is given:
This definition is very easy to understand, we can give a three-dimensional example, assuming that the dataset has three dimensions, the covariance matrix is:
Thus, the covariance matrix is a symmetric matrix, and the diagonal is the variance of each dimension.
four, Matlab covariance actual combat
It is important to be clear that the covariance matrix calculates the covariance between different dimensions, not between the different samples. The following demo will use MATLAB, in order to illustrate the calculation principle, not directly invoke the MATLAB cov function:
First, a 10*3-dimensional integer matrix is randomly generated as a sample set, 10 is the number of samples, and 3 is the dimension of the sample.
Figure 1 using MATLAB to generate a sample set
According to the formula, the calculation of covariance needs to calculate the mean value, in particular, the covariance matrix is to calculate the covariance between different dimensions, always keep this in mind. Each row of the sample matrix is a sample, each column is a dimension, so we have to calculate the mean value by column. To describe the convenience, we first assign the data of three dimensions to each:
Figure 2 assigns the data of three dimensions to each value
Calculates the covariance of dim1 and dim2,dim1 with DIM3,DIM2 and dim3:
Figure 3 calculates three covariance
The elements on the diagonal of the covariance matrix are the variance of each dimension, and we calculate these variances in turn:
Figure 4 calculates the variance on the diagonal
In this way, we get all the data needed to compute the covariance matrix, and can call the cov function of Matlab to get the covariance matrix directly:
Fig. 5 The covariance matrix of the sample is directly computed using the COV function of MATLAB
The result of the calculation is exactly the same as when the previous data has been filled in with the matrix.
v. Summary
The key to understanding the covariance matrix is to keep in mind that its calculations are covariance between different dimensions rather than between different samples. To get a sample matrix, the first thing to be clear is whether a row is a sample or a dimension, the mind clear the entire calculation process will be downstream, so it will not be confused.
Original address:
http://pinkyjie.com/2010/08/31/covariance/