Understanding covariance and the Mahalanobis distance

Source: Internet
Author: User

This post collects material from several good blog articles.

Basic concepts of statistics

Variance: The mean describes the center point of a sample set, but by itself it tells us very little. The standard deviation describes the average distance from each sample point to the mean of the set. Take two sets as an example, [0, 8, 12, 20] and [8, 9, 11, 12] (these values reproduce the statistics quoted below): the means of both sets are 10, but the two sets are obviously very different. Computing the standard deviations, the former is 8.3 and the latter is 1.8. The latter is clearly more concentrated, so its standard deviation is smaller. The standard deviation describes this "degree of dispersion". The reason for dividing by n-1 rather than n is that this better approximates the population standard deviation from a small sample set; this is the so-called "unbiased estimation" in statistics. The variance is simply the square of the standard deviation.
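As a quick check, here is a small NumPy sketch. It assumes the two sets are [0, 8, 12, 20] and [8, 9, 11, 12], which reproduce the mean of 10 and the standard deviations 8.3 and 1.8 quoted above:

```python
import numpy as np

# Two sets with the same mean but very different spreads
# (assumed values consistent with the statistics in the text).
a = np.array([0, 8, 12, 20])
b = np.array([8, 9, 11, 12])

print(a.mean(), b.mean())       # both 10.0
# Sample standard deviation divides by n-1 (ddof=1), the
# "unbiased estimation" mentioned in the text.
print(round(a.std(ddof=1), 1))  # 8.3
print(round(b.std(ddof=1), 1))  # 1.8
```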

Why is covariance required?

The statistics above may seem to cover everything, but note that standard deviation and variance describe one-dimensional data. In real life we often encounter multidimensional data sets; the simplest example is the scores students get in multiple subjects at school. Facing such a data set, we can compute the variance of each dimension independently, but usually we want to know more, for example: is there any connection between a boy's wretchedness and his popularity with girls? Covariance is exactly the statistic used to measure the relationship between two random variables.

What is the significance of the covariance result? If the result is positive, the two variables are positively correlated (from covariance one can derive the definition of the "correlation coefficient"). That is to say, the more wretched a boy is, the more popular he is with girls. If the result is negative, they are negatively correlated: the more wretched, the more annoying girls find him. Could that be? If the value is 0, the two variables are uncorrelated (note that zero covariance does not in general imply independence).
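A small sketch of this sign interpretation, with made-up paired data (the variable names and values are purely illustrative):

```python
import numpy as np

# Hypothetical paired data: x = "wretchedness score",
# y = "popularity score" for five boys.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

# Sample covariance (divides by n-1), implementing
# cov(x, y) = E[(x - E[x]) * (y - E[y])].
cov_xy = np.cov(x, y, ddof=1)[0, 1]
print(cov_xy)  # positive -> the two variables tend to move together
```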

Many covariances together form the covariance matrix.

The wretchedness-and-popularity problem of the previous section is a typical two-dimensional problem, and covariance itself only handles a pair of variables. When there are more dimensions, we need to compute many covariances (for an n-dimensional data set, n(n-1)/2 distinct pairs), so it is natural to organize them in a matrix.

The following is the introduction to the covariance matrix from Wikipedia:

In statistics and probability theory, the covariance matrix (or variance-covariance matrix) is a matrix whose each element is the covariance between a pair of elements of a random vector. It is the natural extension from a scalar random variable to a high-dimensional random vector.

Assume that X is a column vector composed of N scalar random variables,

X = (X_1, X_2, \ldots, X_N)^T,

and let \mu_i = E(X_i) be the expected value of the i-th element. The (i, j) entry of the covariance matrix \Sigma is defined as the covariance:

\Sigma_{ij} = \mathrm{cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]

That is:

\Sigma = E[(X - \mu)(X - \mu)^T]

The (i, j) element of the matrix is the covariance between X_i and X_j. This generalizes the notion of variance from a scalar random variable to a high-dimensional random vector.
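A minimal sketch of estimating a covariance matrix from data with NumPy (the synthetic data and the rows-as-observations layout are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples of a 3-dimensional random vector X (rows = observations).
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]  # make dimensions 0 and 1 correlated

# np.cov treats rows as variables by default; rowvar=False matches our layout.
S = np.cov(X, rowvar=False, ddof=1)
print(S.shape)  # (3, 3)
# Diagonal entries are variances, off-diagonal entries are covariances,
# and the matrix is symmetric: S[i, j] == S[j, i].
print(np.allclose(S, S.T))  # True
```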

Although the covariance matrix is simple, it is a very powerful tool in many fields. From it one can derive a transformation matrix that completely decorrelates the data. Seen from a different angle, that is to say, we can find a set of optimal bases to express the data in a compact manner. (For the complete proof, see Ruili.)
In statistics this method is called Principal Component Analysis (PCA), and in image processing it is known as the Karhunen-Loève transform (KL transform).
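A small sketch of the decorrelation claim: projecting the data onto the eigenvectors of its covariance matrix yields a basis in which the transformed covariance is (approximately) diagonal, which is the core of PCA / the KL transform. The synthetic 2-D data here is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
X[:, 1] += 0.9 * X[:, 0]              # correlated 2-D data

S = np.cov(X, rowvar=False)            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigendecomposition (S is symmetric)

# Center the data, then change basis to the eigenvectors.
Y = (X - X.mean(axis=0)) @ eigvecs
S_Y = np.cov(Y, rowvar=False)
print(np.round(S_Y, 3))                # off-diagonal entries near 0
```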

 

The Mahalanobis distance was proposed by the Indian statistician P. C. Mahalanobis to represent the covariance distance of data. It is an effective way to compute the similarity between an unknown sample and a sample set. Unlike the Euclidean distance, it takes into account the correlations between the various features (for example, a piece of information about height carries a piece of information about weight, because the two are associated) and it is scale-invariant, i.e. independent of the measurement scale.
For a multivariate vector x = (x_1, x_2, \ldots, x_N)^T with mean \mu = (\mu_1, \mu_2, \ldots, \mu_N)^T and covariance matrix \Sigma, the Mahalanobis distance is defined as

D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}

The Mahalanobis distance can also be defined as a measure of dissimilarity between two random vectors x and y drawn from the same distribution with covariance matrix \Sigma:

d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, the result may be called the "normalized Euclidean distance":

d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2 / \sigma_i^2}

where \sigma_i is the standard deviation of x_i.
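A minimal sketch of the two special cases above, using a hypothetical `mahalanobis` helper built directly on the general formula:

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance sqrt((x - y)^T S^{-1} (x - y))."""
    d = x - y
    # Solving S z = d is numerically preferable to forming S^{-1} explicitly.
    return float(np.sqrt(d @ np.linalg.solve(S, d)))

x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])

# Identity covariance: reduces to the ordinary Euclidean distance.
print(mahalanobis(x, y, np.eye(2)))  # equals np.linalg.norm(x - y)

# Diagonal covariance: the normalized Euclidean distance
# sqrt(sum((x_i - y_i)^2 / sigma_i^2)).
S = np.diag([4.0, 9.0])              # sigma_1 = 2, sigma_2 = 3
print(mahalanobis(x, y, S))          # sqrt((2/2)^2 + (3/3)^2) = sqrt(2)
```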

From the definition of the Mahalanobis distance we obtain the following properties:
- The Mahalanobis distance between two points is independent of the measurement units of the original data.
- The Mahalanobis distance between two points computed from standardized data is the same as the one computed from centered data (that is, the original data minus the mean).
- It can eliminate the interference of correlations between variables.
- It satisfies the four basic axioms of a distance: non-negativity, identity of indiscernibles, symmetry, and the triangle inequality.
Note: I don't quite understand the physical meaning of the Mahalanobis distance. What role does the inverse of the covariance matrix play? My intuition is that the inverse of the covariance acts as a normalization: for example, when the covariance matrix is diagonal, the larger an eigenvalue λ is, the smaller its inverse, which to some extent normalizes away the effect of the different variables' scales. Corrections welcome.

Advantages and disadvantages of the Mahalanobis distance:

1) The Mahalanobis distance is computed with respect to a population, as can be seen from the covariance matrix in its definition. That is, if we take the same two samples and place them into two different populations, the Mahalanobis distance computed between them will usually differ, unless the covariance matrices of the two populations happen to be the same.

2) When computing the Mahalanobis distance, the number of samples must be greater than the dimension of the samples; otherwise the inverse of the sample covariance matrix does not exist. In that case the Euclidean distance is used instead.

3) There is another situation where the number of samples is greater than the dimension but the inverse of the covariance matrix still does not exist, for example when three sample points happen to be collinear in a two-dimensional plane. In this case the Euclidean distance is used as well.

4) In practice the condition "number of samples greater than the dimension" is easily met, and the degenerate situation described in 3) is rare, so in most cases the Mahalanobis distance can be computed without trouble. However, the computation of the Mahalanobis distance is unstable, and the source of the instability is the covariance matrix; this is also the biggest difference between the Mahalanobis distance and the Euclidean distance.

Advantages: it is not affected by units, since the Mahalanobis distance between two points is independent of the measurement units of the original data; the distance computed from standardized data is the same as the one computed from centered data (the original data minus the mean); and the Mahalanobis distance can eliminate the interference of correlations between variables.
Disadvantage: it tends to exaggerate the role of variables with only slight variation. If d_ij denotes the distance between the i-th sample and the j-th sample, then for all i, j, and k, d_ij should satisfy the following four conditions: ① d_ij = 0 if and only if i = j; ② d_ij > 0; ③ d_ij = d_ji (symmetry); ④ d_ij ≤ d_ik + d_kj (triangle inequality). The Euclidean distance clearly satisfies these four conditions, and so do many other functions; the Mahalanobis distance used here is one of them. The Mahalanobis distance between the i-th and j-th samples is computed as

d_{ij} = [(x_i - x_j)^T S^{-1} (x_i - x_j)]^{1/2}

where T denotes transpose, x_i and x_j are the vectors composed of the m indicators of samples i and j respectively, and S is the sample covariance matrix.
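The computation described above, including the fallback to the Euclidean distance when the sample covariance matrix is singular, can be sketched as follows (the function name and the fallback policy are illustrative, not a standard API):

```python
import numpy as np

def pairwise_mahalanobis(X, i, j):
    """d_ij = [(x_i - x_j)^T S^{-1} (x_i - x_j)]^(1/2), with S the sample
    covariance of X (rows = samples).

    Falls back to the Euclidean distance when S is singular, e.g. too few
    samples or exactly collinear samples, as the text suggests.  Note that
    nearly-collinear samples may not raise and instead give unstable values.
    """
    S = np.cov(X, rowvar=False, ddof=1)
    d = X[i] - X[j]
    try:
        return float(np.sqrt(d @ np.linalg.solve(S, d)))
    except np.linalg.LinAlgError:
        return float(np.linalg.norm(d))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))   # 50 samples > 3 dimensions, so S is invertible
print(pairwise_mahalanobis(X, 0, 1))
```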
