Data Analysis, Part 5: Measuring Similarity and Dissimilarity with Distance


Cluster analysis divides objects into clusters according to their dissimilarity: a cluster is a collection of data objects, and the goal is to make objects within the same cluster similar to one another and dissimilar to objects in other clusters. Similarity and dissimilarity are evaluated from the attribute values of the data objects and usually involve distance measures. Similarity and dissimilarity are negatively correlated and are collectively referred to as proximity.

In cluster analysis, the first step of a clustering algorithm is to measure the distances between the objects in the data set. In practice this means: make the data matrix (which stores the data objects) dimensionless, then apply a distance algorithm to obtain a dissimilarity matrix (which stores the dissimilarity values between the data objects).

Note: before calculating distances, first make the data dimensionless (standardize it).

One. The Data Matrix and the Dissimilarity Matrix

Suppose we have n objects (such as people) described by p attributes (also called dimensions or features, such as age, height, weight, or gender). They are recorded as x1 = (x11, x12, ..., x1p), x2 = (x21, x22, ..., x2p), and so on, where xij is the value of the j-th attribute of object xi; xi is also called the object's feature vector. The set of all xi forms the data matrix, and the matrix composed of the distances between each pair of objects is called the dissimilarity matrix. Typically, clustering algorithms operate on these two data structures.

1. Data Matrix

The data matrix (or object-by-attribute structure) stores the n data objects in the form of a relational table, i.e. an n×p matrix (n objects × p attributes):
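The matrix itself is missing from the source (it was an image); reconstructed in LaTeX, it has one row per object and one column per attribute:

```latex
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
```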

2. Dissimilarity Matrix

The dissimilarity matrix (or object-by-object structure) stores the pairwise proximities between the n objects, usually represented as an n×n matrix:
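The matrix image is missing from the source; reconstructed, it is typically shown in lower-triangular form:

```latex
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots &   \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
```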

where d(i, j) is the measured dissimilarity or "difference" between objects i and j. In general, d(i, j) is non-negative: it is close to 0 when objects i and j are highly similar or "near" each other, and the larger the value, the more the objects differ. Note that d(i, i) = 0, that is, an object differs from itself by 0. In addition, d(i, j) = d(j, i). (For readability, the d(j, i) entries are often not shown, since the matrix is symmetric.)

A data matrix consists of two different types of entities or "things": rows (representing objects) and columns (representing attributes). The data matrix is therefore often called a two-mode matrix. The dissimilarity matrix contains only one type of entity and is therefore called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix; for distance-based dissimilarity, the dist() function in the stats package converts a data matrix into a dissimilarity matrix.

Two. Distance Measures for Numeric Attributes

In cluster analysis the data objects must be divided into different classes, so how are objects assigned to different classes? The similarity between objects is the basis for judging whether they belong to the same class. There are generally two kinds of metrics for object similarity: distances and similarity coefficients. Distances are generally used to measure the similarity between observations, while similarity coefficients are generally used to measure the similarity between variables. A distance treats each observation as a point in an m-dimensional space and measures a distance defined in that space. Distance-based clustering groups points that are close to each other into the same class and points that are far apart into different classes. Choosing the distance used to assess the similarity between samples is a key step of cluster analysis.

In cluster analysis, the Euclidean method is most often used to compute the dissimilarity of numeric attributes; Euclidean distance is a distance measure that describes the difference between data objects. Besides Euclidean distance, there are also the Chebyshev, Manhattan, Lance (Canberra), and other distances. The Lance distance is dimensionless, which overcomes the Minkowski distance's sensitivity to the units of each index; it is also insensitive to large outliers, which makes it particularly suitable for highly skewed data, but it does not consider the correlation between variables. Both the Minkowski distance and the Lance distance assume that the variables are independent of each other, i.e. they measure distance in an orthogonal space. In practice, however, variables are often correlated; to overcome the influence of correlation between variables, the Mahalanobis distance can be used.

Below are six commonly used distance measures. With the exception of the Mahalanobis distance, the other five require the data to be made dimensionless first.

1. Euclidean Distance

The most commonly used distance measure in clustering algorithms; it represents the straight-line distance between two points in space:
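The formula image is missing from the source; the standard definition for two objects i and j with p numeric attributes is:

```latex
d(i,j) = \sqrt{\sum_{k=1}^{p} \left( x_{ik} - x_{jk} \right)^2}
```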

Because the components of the feature vectors may have inconsistent dimensions, each component usually needs to be standardized so that the result is independent of units. For example, using Euclidean distance on height (cm) and weight (kg), two indicators with different units, may invalidate the results.

Disadvantage: the correlation between components is not considered; the result can be distorted when multiple components reflect a single underlying feature.

2. Maximum Distance (Chebyshev)

Have you ever played chess? The king can move to any of the 8 adjacent squares in one step, so how many steps does the king need to move from square (x1, y1) to square (x2, y2)? You will find that the minimum number of steps is always max(|x2 − x1|, |y2 − y1|).
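In general, the maximum (Chebyshev) distance over p attributes is:

```latex
d(i,j) = \max_{1 \le k \le p} \left| x_{ik} - x_{jk} \right|
```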

3. Manhattan Distance

So named because it measures the distance between two points in a city laid out in blocks. Imagine driving from one intersection in Manhattan to another: is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is the origin of the name; it is also known as the city block distance. For example, going 2 blocks south and 3 blocks across makes a total of 5 blocks.
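The formula image is missing from the source; the standard definition is:

```latex
d(i,j) = \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right|
```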

4. Lance Distance (Canberra Distance)

The Lance distance overcomes the influence of dimension but does not consider the correlation between indexes.
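The source's formula image is missing; one common form of the Lance (Canberra) distance is:

```latex
d(i,j) = \sum_{k=1}^{p} \frac{\left| x_{ik} - x_{jk} \right|}{\left| x_{ik} \right| + \left| x_{jk} \right|}
```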

5. Minkowski Distance

The Minkowski distance requires a parameter p:
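The formula image is missing from the source; the standard definition, where p is the Minkowski exponent and the sum runs over the attributes k, is:

```latex
d(i,j) = \left( \sum_{k} \left| x_{ik} - x_{jk} \right|^{p} \right)^{1/p}
```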

where p is a variable parameter; depending on its value, the Minkowski distance represents a whole class of distances:

    • When p = 1, it is the Manhattan distance
    • When p = 2, it is the Euclidean distance
    • When p → ∞, it is the maximum (Chebyshev) distance
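These special cases can be checked numerically. The sketch below is a plain-Python illustration (not from the article, which works in R):

```python
# Minkowski distance in plain Python, to check the special cases
# p = 1 (Manhattan), p = 2 (Euclidean), p -> infinity (maximum distance).
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.5]

manhattan = minkowski(x, y, 1)                     # 3 + 2 + 0.5
euclidean = minkowski(x, y, 2)                     # sqrt(9 + 4 + 0.25)
chebyshev = max(abs(a - b) for a, b in zip(x, y))  # limiting case

print(manhattan)                                     # 5.5
print(abs(minkowski(x, y, 100) - chebyshev) < 1e-6)  # True
```

Already at p = 100 the Minkowski distance is numerically indistinguishable from the maximum distance, because the largest component difference dominates the sum.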

The Minkowski distance has two disadvantages: (1) it treats the dimensions (scales) of the components, i.e. their "units", as the same; (2) it does not consider that the distributions (mean, variance, etc.) of the components may differ.

6. Mahalanobis Distance

Advantages of the Mahalanobis distance: it is dimension-independent and excludes the interference of correlations between variables. Its disadvantage: it cannot treat different features differently and may exaggerate weak features.
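The formula image is missing from the source; the Mahalanobis distance between vectors x and y, given a covariance matrix S, is:

```latex
d(x, y) = \sqrt{(x - y)^{\mathsf{T}} S^{-1} (x - y)}
```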

Applicable occasions:

1. Measuring the degree of difference between two random variables x and y that follow the same distribution, given their covariance matrix C.
2. Measuring the degree of difference between x and the mean vector of a class, in order to decide which class a sample belongs to; in this case, y is the class mean vector.

Implementing the Mahalanobis distance in R:

dist_mashi <- function(a, b, S) {
    # a, b: numeric vectors; S: the covariance matrix of the data
    # (base R also provides the built-in mahalanobis() function)
    as.numeric(sqrt(t(a - b) %*% solve(S) %*% (a - b)))
}
Three. The Dissimilarity of Nominal Attributes

Nominal attributes (also called categorical attributes) divide the data into a finite number of groups. How do we compute the difference between objects described by nominal attributes? The dissimilarity between two objects i and j can be computed from the mismatch rate:
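The formula image is missing from the source; the standard mismatch-rate definition is:

```latex
d(i,j) = \frac{p - m}{p}
```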

where m is the number of matches (i.e., the number of attributes on which objects i and j have the same value), and p is the total number of attributes describing the objects. We can increase the effect of m by assigning it a larger weight.

For example, consider a table with only one nominal attribute; then p equals 1, and m is either 0 or 1:

When objects i and j match, the dissimilarity d(i, j) = 0; when they do not match, d(i, j) = 1. This yields the dissimilarity matrix:
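The mismatch rate can be sketched in a few lines of Python (an illustration, not from the article):

```python
# Mismatch-rate dissimilarity for nominal attributes:
# d(i, j) = (p - m) / p, where m is the number of matching attributes.
def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p

# Objects with one nominal attribute, as in the example above:
print(nominal_dissimilarity(["red"], ["red"]))   # 0.0  (match)
print(nominal_dissimilarity(["red"], ["blue"]))  # 1.0  (mismatch)
```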

Four. The Dissimilarity of Mixed-Type Attributes

How do I calculate the dissimilarity between objects of a mixed attribute type? A preferable approach is to process all attribute types together, combine different attributes into a single dissimilarity matrix, and convert all meaningful attributes to a common interval [0.0, 1.0] .

Suppose the data set contains attributes of p mixed types; the dissimilarity d(i, j) between objects i and j is defined as:
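The source's formula is missing here; the standard Gower-style combination sums the per-attribute dissimilarities d_ij^(f), weighted by indicators δ_ij^(f) that are 0 when attribute f is missing for either object (or is an asymmetric 0–0 match) and 1 otherwise:

```latex
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
```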

For numeric attributes, normalize first: map the variable values to the interval [0.0, 1.0] so that all attribute values lie in [0.0, 1.0], and then compute the distances of the numeric attributes.

Five. Computing the Dissimilarity Matrix with R Functions

When using R to calculate distances, the commonly used functions are dist() in the stats package and daisy() in the cluster package. dist() computes the dissimilarity matrix of numeric attributes, while daisy() computes the dissimilarity matrix of (symmetric or asymmetric) binary attributes, nominal attributes, ordinal attributes, numeric attributes, and mixed-type attributes.

1. Computing the Dissimilarity Matrix with dist()

The dist() function in the stats package computes the distances between numeric observations:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

The function computes and returns the distance matrix obtained by applying the specified distance measure to the rows of the data matrix x.

Parameter comment:

    • method: the distance measure; the default value is "euclidean". Available methods are "euclidean", "maximum", "manhattan", "canberra", "binary", and "minkowski", i.e. Euclidean, maximum, Manhattan, Canberra (Lance), binary, and Minkowski distance.
    • diag: logical; whether to print the diagonal of the distance matrix.
    • upper: logical; whether to print the upper triangle of the distance matrix.
    • p: the power of the Minkowski distance.

dist() returns a lower-triangular distance object; use the as.matrix() function to obtain the full matrix, whose entries can then be accessed with standard bracket indexing:

d <- dist(x)
m <- as.matrix(d)
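For readers working outside R, the same computation can be sketched in Python (a minimal analogue of dist() plus as.matrix(), not part of the article):

```python
# Build the full symmetric Euclidean distance matrix of a data matrix,
# mirroring what R's dist() + as.matrix() produce.
from math import dist  # Euclidean distance between two points (Python 3.8+)

X = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 2.0],
]

n = len(X)
D = [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]

print(D[0][1])             # 5.0  (sqrt(3^2 + 4^2))
print(D[0][0])             # 0.0  (an object differs from itself by 0)
print(D[0][1] == D[1][0])  # True (the matrix is symmetric)
```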

2. Computing the Dissimilarity Matrix with daisy()

The daisy() function in the cluster package computes the pairwise dissimilarities between the observations in a data set. When the original variables are of mixed types, or when metric = "gower" is set, daisy() uses Gower's formula to compute the dissimilarity matrix of the data set.

daisy(x, metric = c("euclidean", "manhattan", "gower"),
      stand = FALSE, type = list(), weights = rep.int(1, p), ...)

Parameter comment:

    • x: a numeric matrix or data frame. Columns of numeric type are treated as interval-scaled variables, columns of factor type as nominal attributes, and ordered factors as ordinal variables; other variable types need to be specified in the type parameter.
    • metric: character; valid values are "euclidean" (default), "manhattan", and "gower".
    • stand: logical; whether to standardize the data column-wise before computing the dissimilarities.
    • type: a list specifying the types of the variables in x. Valid list components are "ordratio" (ratio-scaled variables to be treated as ordinal), "logratio" (ratio-scaled variables to be log-transformed), "asymm" (asymmetric binary attributes), and "symm" (symmetric binary and nominal attributes).
    • weights: a numeric vector of length p = ncol(x), used with mixed-type variables (or metric = "gower") to specify a weight for each variable; the default weight is 1.

Function Description:

daisy() handles nominal, ordinal, and binary attribute data by using the Gower dissimilarity coefficient (Gower, 1971). If x contains variables of nominal, ordinal, or binary type, the function ignores the metric and stand parameters and computes the distances using the Gower coefficient. For purely numeric data, you can also compute the dissimilarity matrix by setting metric = "gower"; in that case each variable is first normalized with (x − min)/(max − min), scaling the data to the interval [0.0, 1.0].
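The (x − min)/(max − min) normalization mentioned above can be sketched as follows (a Python illustration, not daisy()'s actual implementation):

```python
# Min-max scaling: maps a numeric column onto the interval [0.0, 1.0],
# as the Gower approach requires before combining attributes.
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

heights_cm = [150.0, 165.0, 180.0]
print(min_max_scale(heights_cm))  # [0.0, 0.5, 1.0]
```

After this step, every numeric attribute contributes to the dissimilarity on the same [0.0, 1.0] scale, regardless of its original unit.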

