Talking About Clustering
Introduction
The goal of cluster analysis is to group data by similarity. Clustering is a data-processing method we often turn to when confronted with a large amount of information: it divides the original data into distinct parts, improves our macroscopic understanding of the data, and lays the foundation for deeper analysis.
Clustering algorithms have wide application in industry. In the book The Beauty of Mathematics, for example, Wu Jun describes how Google applied clustering to news classification. In recent years, especially with the boom in machine learning, clustering algorithms have received great attention in both academia and industry; in 2014, Science published a new clustering algorithm, clustering by fast search and find of density peaks (see the references).
From a statistical point of view, clustering is a method of simplifying data by building models of it.
From a machine learning standpoint, a cluster corresponds to a hidden pattern, and clustering is the unsupervised process of searching for such clusters.
Clustering and Classification (Discriminant Analysis)
Judging by the names alone, clustering and classification do not seem very different, but in statistics and machine learning they differ greatly. The most intuitive, concise summary of cluster analysis is the old saying "birds of a feather flock together".
In machine learning terms, the biggest difference between clustering and classification is whether the learning is supervised, that is, whether there is a training set. A classification (discriminant) method has a training set: a model is trained on it and then used to map every input to an output, and judging that output achieves the classification, giving the model the ability to classify unseen data. Clustering usually has no training set; there are no labeled samples in advance, and the data must be modeled directly.
In statistics, cluster analysis is a multivariate technique that classifies research objects according to their characteristics: similar individuals are grouped into the same class, so that individuals within a class are highly homogeneous while individuals in different classes are highly heterogeneous. Classification (discriminant analysis) is a statistical technique for discrimination and grouping: given the group membership of a certain number of cases and the known values of other multivariate variables, the quantitative relation between group membership and those variables is determined and a discriminant function is built; that relation can then be used to assign cases whose group membership is unknown.
In other words, although clustering and classification have similar names, they are entirely different concepts and operate on different kinds of data.

Main Clustering Ideas and Methods
To talk about clustering, let us first set up a simple scenario: suppose we need to cluster the students in a sports class. How would we divide the selected students? To keep the process simple, assume each student has only two attributes: age and weight. The rest of this article introduces the main clustering ideas and algorithms in detail, with examples in R.

1 Hierarchical Clustering
Hierarchical clustering is one of the most common clustering methods. Its main idea is as follows:
Closeness between samples is measured by distance: the samples closest to each other are merged into a class, and this process is repeated until the clustering is complete. The process is usually bottom-up, as in the hierarchical clustering diagram above: at the start (height 0 on the y axis) every sample is its own class, and as clustering proceeds the samples merge step by step until they finally form a single class. (The other direction is top-down, i.e. divisive clustering, which readers can look up on their own.)
To find the nearest, farthest, or average distance at each specific step, all pairwise distances must be computed over and over; this requires a double loop, and each iteration merges only two subclasses, which is very slow. Nevertheless, this method is the one generally used in everyday work, and the hierarchical clustering function hclust() in R implements it.
We have been referring to distance, so what is a distance here? Below we introduce the various distances used in statistics to compute sample similarity.

1.1 Similarity Calculation

1.11 Distance Concepts
Euclidean Distance
Euclidean distance is the most common distance measure; it measures the absolute distance between points in a multidimensional space. The formula is as follows:
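For two n-dimensional samples x = (x_1, ..., x_n) and y = (y_1, ..., y_n):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$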
Because the computation is based on the absolute magnitude of each dimension's features, the Euclidean measure requires every dimension to be on the same scale; using the Euclidean distance on indicators with different units, such as height (cm) and weight (kg), may invalidate the results.
Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance: it is the general form of a whole family of distance measures. The formula is as follows:
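$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$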
The parameter p here is variable; the Euclidean distance is obtained when p = 2.
Manhattan Distance
The Manhattan distance comes from the city blocks of Manhattan: it sums the distances along each dimension, i.e. it is the measure obtained when p = 1 in the Minkowski distance above, as follows:
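$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$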
Chebyshev Distance
The Chebyshev distance comes from the king's moves in chess. The king can move one step to any of the 8 surrounding squares, so how many steps at minimum does it take to go from square A (x1, y1) to square B (x2, y2)? The answer is max(|x2 - x1|, |y2 - y1|). Extended to multidimensional space, the Chebyshev distance is in fact the Minkowski distance as p tends to infinity:
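$$d(x, y) = \max_{i} |x_i - y_i| = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$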
In fact, the Manhattan, Euclidean, and Chebyshev distances above are all Minkowski distances for special values of p.
1.12 Similarity Measurement
A similarity measure computes how alike two individuals are. In contrast to a distance measure, the smaller the value of a similarity measure, the less similar the individuals are and the greater the difference between them.
Vector Space Cosine Similarity
Cosine similarity measures the difference between two individuals by the cosine of the angle between their vectors in a vector space. Compared with distance measures, cosine similarity focuses on the direction of the two vectors rather than their distance or length. The formula is as follows:
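$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$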
Pearson Correlation Coefficient
This is the correlation coefficient r of correlation analysis: it is the cosine of the angle between x and y after each has been centered by its mean. The formula is as follows:
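$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$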
1.2 Clustering Process
Assuming there are n samples to be clustered, the basic steps for hierarchical clustering are:
1. (Initialization) Assign each sample to its own class, and calculate the distance between every pair of classes, i.e. the similarity between samples;
2. Following a chosen rule, select the pair of classes whose distance meets the merging requirement and merge them;
3. Recalculate the similarity between the newly generated class and each of the old classes;
4. Repeat steps 2 and 3 until all sample points are merged into one class, then stop.
A careful reader will notice the detail in step 2: the merge happens "following a chosen rule". Here we briefly introduce the selection rules commonly used in the clustering process; a sketch of the whole procedure follows the list of rules below.
Single method:
Take the shortest distance between a point x in one class and a point y in the other, and use it as the distance between the two classes. Then merge the pair of classes with the shortest distance.
Complete method:
Take the longest distance between a point x in one class and a point y in the other, and use it as the distance between the two classes. Then merge the pair of classes with the shortest distance.
Average method:
Take the average distance over all pairs of points across the two classes, and use it as the distance between the two classes. Then merge the pair of classes with the shortest distance.
Centroid method:
Take the center point (i.e. the centroid) of each of the two classes, and use the squared difference between the two centroids as the distance.
Ward minimum variance method:
This method always merges the two classes whose merger causes the smallest increase in the total within-class sum of squared deviations. That is, starting from n single-sample classes, every merge increases the total within-class sum of squares; Ward's method chooses the pair of classes whose merge increases it the least.
In R, this method comes in two variants, ward.D and ward.D2; the difference is that ward.D2 squares the dissimilarities before the clusters are updated.
In practice, the Ward and average methods are the ones most commonly chosen.
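To make the four steps and the double loop mentioned earlier concrete, here is a minimal from-scratch sketch of agglomerative clustering with the single method. The function name naive_single_linkage is invented for this illustration; in practice you would simply call hclust().

# Naive single-linkage agglomerative clustering; for illustration only
naive_single_linkage <- function(x) {
  d <- as.matrix(dist(x))                # step 1: all pairwise distances
  clusters <- as.list(seq_len(nrow(d)))  # every sample starts as its own class
  while (length(clusters) > 1) {
    # step 2: double loop over class pairs to find the closest pair (single method)
    best_i <- 1; best_j <- 2; best_d <- Inf
    for (i in seq_along(clusters)) {
      for (j in seq_along(clusters)) {
        if (i < j) {
          dij <- min(d[clusters[[i]], clusters[[j]]])
          if (dij < best_d) { best_d <- dij; best_i <- i; best_j <- j }
        }
      }
    }
    cat(sprintf("merging at distance %.6f\n", best_d))
    # step 3: merge; for the single method, the min() above already serves as
    # the recomputed distance between the new class and the old ones
    clusters[[best_i]] <- c(clusters[[best_i]], clusters[[best_j]])
    clusters[[best_j]] <- NULL
  }                                      # step 4: repeat until one class remains
  invisible(clusters)
}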
After choosing a suitable method, we can start the clustering itself.
Following the student example from earlier in this article, let us perform the clustering.
First, we find the distance matrix of the data
The function we use in R is dist(), whose default distance is the Euclidean distance. Its signature is:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
Here x is the data on which distances are computed; method selects the distance measure; diag controls whether the diagonal elements are shown; upper controls whether the upper triangle of the matrix is shown; and p is the exponent p of the Minkowski distance.
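As a quick sketch, suppose our student data looked like the hypothetical data frame below; the ages and weights are invented for this illustration. The distance matrix is then one call away:

# Hypothetical students: age in years, weight in kg (values invented)
students <- data.frame(
  age    = c(12, 13, 15, 17, 14, 16, 18, 16),
  weight = c(30, 38, 47, 65, 55, 50, 75, 69)
)
d <- dist(students, method = "euclidean")  # lower triangle of the distance matrix
print(round(d, 6))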
Once we have the distance matrix, we use the simplest rule, the single method, to complete the clustering, i.e. classes are merged by minimum inter-class distance.
The function we use here is hclust().
The default parameters are:
hclust(d, method = "complete", members = NULL)
Here d is the distance matrix to be supplied and method is the clustering method to be used; for a more detailed introduction, see the help page ?hclust in R.
At the same time, the function rect.hclust(tree, k = 3) can be used to mark the desired classes on the dendrogram, where tree is the clustering result returned by hclust() and k = n is the desired number of classes; the dendrogram must be plotted before calling it.
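Putting the pieces together, a minimal end-to-end sketch, reusing the hypothetical students data frame from above, might be:

d  <- dist(students, method = "euclidean")  # distance matrix
hc <- hclust(d, method = "single")          # single method: merge by minimum distance
plot(hc)                                    # draw the dendrogram
rect.hclust(hc, k = 3)                      # box the k = 3 desired classes on the plot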
From the distance matrix, we see that the smallest distance, 3.162287, is between samples 3 and 6, so they form the first class, at the lowest height in the dendrogram. The next smallest distance, 4.123106, is between samples 4 and 8. Following this principle, points are merged step by step until they all join a single class, which completes the clustering process.
References:
Deng Haiyan. The difference between cluster analysis and discriminant analysis [J]. Wuhan Journal, 2006, (1).
Rodriguez A, Laio A. Clustering by fast search and find of density peaks [J]. Science, 2014, 344(6191): 1492-1496.
Wu Jun. The Beauty of Mathematics.
Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques (3rd edition, English).
Robert I. Kabacoff (author), Ko Tao (translator). R in Action (Turing programming books).
Blog: http://blog.csdn.net/jwh_bupt/article/details/7685809
Blog: http://www.cnblogs.com/emanlee/archive/2012/02/28/2371273.html
Blog: http://blog.csdn.net/yillc/article/details/6746509