Summary
Clustering is a form of unsupervised learning: it does not rely on predefined classes or on training instances that carry class labels. It groups similar objects into the same cluster, so it learns by observation rather than from labeled examples, somewhat like a fully automated classification. Put plainly, clustering can be taken literally: it is the process of gathering identical, similar, close, or related object instances into one class. Common clustering algorithms in machine learning include the K-means algorithm, the expectation-maximization algorithm (EM; see "EM algorithm principle"), the spectral clustering algorithm (see "Machine learning algorithm review-spectral clustering"), and artificial neural network algorithms. This article introduces the K-means and bisecting K-means clustering algorithms.
(a) What is clustering
As the saying goes, "birds of a feather flock together." If you know the labels of a crowd beforehand (for example "artistic", "ordinary", or "goofy"), a supervised classification algorithm can assign each person to a definite category. If you do not know the labels, you can only group people by their characteristics (such as hobbies, education, and occupation). That is what a clustering algorithm does.
Clustering is unsupervised learning (it does not rely on predefined classes or on training instances with class labels). It groups similar objects into the same cluster, learning by observation rather than from examples, somewhat like a fully automated classification. A cluster is a set whose objects are highly similar to one another, while objects in different sets are highly dissimilar. Cluster identification gives meaning to the clustering result and tells us what the clusters are about. Typically, the centroid of a cluster can stand in for all the data in that cluster when making decisions. Clustering can be applied to almost any kind of object; the more similar the objects within a cluster, the better the clustering.
From the machine learning point of view, clusters correspond to hidden patterns. The biggest difference between clustering and classification is that in classification the training instances carry category labels, whereas in clustering the categories must be determined automatically by the learning algorithm. Clustering is also called unsupervised classification because it produces the same kind of result as classification, except that the classes are not predefined.
Cluster analysis is exploratory. Nobody has to supply a classification criterion in advance; the analysis groups the sample data automatically. Different clustering methods often lead to different conclusions, and different researchers analyzing the same data may not get consistent results. In practice, cluster analysis is one of the main tasks of data mining. It can be used as a stand-alone tool to learn how the data is distributed, to observe the characteristics of each cluster, and to focus further analysis on particular clusters. It can also serve as a preprocessing step for other algorithms, such as classification and qualitative induction algorithms.
Cluster analysis tries to put similar objects in the same cluster and dissimilar objects in different clusters, so whether two objects count as "similar" depends on the chosen similarity measure. Many similarity measures exist, and which one to use depends on the application; choosing an appropriate one improves the performance of the clustering algorithm. For the similarity measures commonly used in machine learning, see the post "Similarity measurement in machine learning".
A clustering algorithm usually merges the input data either around central points or hierarchically. In both cases it tries to find the internal structure of the data so that the data can be grouped by what it has most in common. The goal is to make objects of the same class as similar as possible and objects of different classes as dissimilar as possible. Many clustering methods exist. By their basic idea they can be divided into five categories: hierarchical clustering algorithms, partitioning clustering algorithms, constraint-based clustering algorithms, machine-learning clustering algorithms, and clustering algorithms for high-dimensional data; see "Various clustering algorithm comparison". The basic process of a clustering algorithm includes feature selection, similarity measurement, clustering criterion, clustering algorithm, and result verification; for details see "Clustering algorithm learning note (i)--foundation".
Put plainly, clustering can be taken literally: it is the process of gathering identical, similar, close, or related object instances into one class. In simple terms, if a dataset contains n instances, they can be divided into m categories according to some criterion so that the instances within a category are related and the instances in different categories are not. This gives a clustering model. To decide the class of a new sample point, compute the similarity between the point and each of the m categories and assign the point to the most similar one.
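As a minimal sketch of assigning a new sample to its most similar category, the snippet below assumes that each category is summarized by a centroid and that similarity is measured by Euclidean distance. Both are assumptions made for illustration, and the name `nearest_cluster` and the centroid values are hypothetical.

```python
import math


def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids of m = 2 categories in a 2-feature space.
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(nearest_cluster((1.0, 2.0), centroids))  # index of the first centroid
print(nearest_cluster((9.0, 8.0), centroids))  # index of the second centroid
```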
Since clustering can be regarded as unsupervised classification, it has many application scenarios, including user segmentation, text categorization, and image recognition. When little prior information about the data (such as a statistical model) is available and the user wants to make as few assumptions about the data as possible, clustering is a suitable way to examine the intrinsic relationships among the data points, evaluate their structure, and make decisions about it.
Common clustering algorithms in machine learning include the K-means algorithm, the expectation-maximization algorithm (EM; see "EM algorithm principle"), the spectral clustering algorithm (see "Machine learning algorithm review-spectral clustering"), and artificial neural network algorithms. This article covers the K-means clustering algorithm.
(b) K-means clustering algorithm
1. Understanding the K-means clustering algorithm
The K-means algorithm is one of the simplest clustering algorithms and belongs to the partitioning family. It seeks to minimize the sum of squared errors (SSE) between the data points in each of the k clusters and the centroid of their cluster. SSE is also the criterion used to evaluate the final clustering result of K-means.
The basis of the K-means algorithm is the minimum sum-of-squared-errors criterion. The cost function is:
$$J = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^{2}$$
Here $x^{(i)}$ is the i-th sample, $c^{(i)}$ is the index of the cluster it is assigned to, and $\mu_{c^{(i)}}$ is the centroid of that cluster. We want the clustering model with the lowest cost. Intuitively, the more similar the samples within each cluster, the smaller their squared error relative to the cluster centroid. Summing the squared errors over all clusters makes it possible to check how good a clustering into K clusters is. A smaller SSE means that the data points are closer to their centroids and the clustering is better. Because the error is squared, points far from their centroid carry more weight. One sure way to reduce SSE is to increase the number of clusters, but this defeats the purpose: the goal is to improve cluster quality while keeping the number of clusters unchanged.
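The sum of squared errors can be computed directly from the points, the centroids, and each point's cluster index. The following is a small sketch in plain Python; the function name `sse` and the sample data are made up for illustration.

```python
def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for x, c in zip(points, labels):
        mu = centroids[c]
        total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total


# Illustrative data: two points in cluster 0, one point in cluster 1.
points = [(1.0, 1.0), (3.0, 1.0), (10.0, 10.0)]
centroids = [(2.0, 1.0), (10.0, 10.0)]
labels = [0, 0, 1]
print(sse(points, centroids, labels))  # 1 + 1 + 0 = 2.0
```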
The K-means clustering algorithm is so named because it finds k different clusters, and the center of each cluster is the mean of the feature values of the samples in that cluster. K-means finds k clusters in a given dataset. The number of clusters k is given by the user, and each cluster is described by its centroid, the point at the center of the cluster. K-means needs numerical data for similarity measurement; nominal data can be mapped to binary values first. Its advantage is that it is easy to implement. Its disadvantages are that it may converge to a local minimum and that it converges slowly on very large datasets.
Assume the training dataset X is an m x n matrix, where m is the number of samples and n is the number of features per sample. The result of the K-means clustering algorithm is a k x n matrix, where k is the user-specified number of clusters. Each row is a vector of length n that describes one cluster's centroid: its j-th element (j = 0, 1, ..., n-1) is the mean of feature j over all samples in that cluster.
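To make the shape of the result concrete, here is a hedged sketch that builds the k x n centroid matrix from an m x n dataset and a list of cluster indices. The function name `compute_centroids` is hypothetical, and the sketch assumes that every cluster contains at least one sample.

```python
def compute_centroids(X, labels, k):
    """Return a k x n list of lists: row c holds the per-feature mean of cluster c."""
    n = len(X[0])
    sums = [[0.0] * n for _ in range(k)]
    counts = [0] * k
    for x, c in zip(X, labels):
        counts[c] += 1
        for j in range(n):
            sums[c][j] += x[j]
    # Assumption: no cluster is empty, so counts[c] is never zero.
    return [[s / counts[c] for s in sums[c]] for c in range(k)]


# m = 4 samples, n = 2 features, k = 2 clusters (illustrative values).
X = [[1.0, 2.0], [3.0, 4.0], [9.0, 18.0], [11.0, 22.0]]
labels = [0, 0, 1, 1]
print(compute_centroids(X, labels, 2))  # [[2.0, 3.0], [10.0, 20.0]]
```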
2. Algorithm process
The K-means algorithm works as follows. First, k initial points are chosen at random as centroids. Then each point in the dataset is assigned to a cluster: the nearest centroid is found for each point, and the point is assigned to the cluster of that centroid. After this step, the centroid of each cluster is updated to the mean of the features of all samples in the cluster. The process iterates until the cluster assignment of every data point stops changing or the maximum number of iterations (maxiteration) is reached. The performance of K-means is affected by the choice of similarity measure; the most common choice is Euclidean distance. The pseudo-code for this procedure is as follows:
***************************************************************
Create k points as the starting centroids
While the cluster assignment of any point changes (maintain a flag "changed" inside the loop; a maximum number of iterations max can also be set)
    For each data point in the dataset
        For each centroid
            Calculate the distance between the centroid and the data point
        Assign the data point to the nearest cluster (if the cluster of any point changes, set changed = True)
    For each cluster, calculate the mean of all points in the cluster and use it as the new centroid
***************************************************************
The loop terminates when no point's cluster assignment changes or when the maximum number of iterations is reached.
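The procedure above can be sketched in plain Python as follows. This is one possible reading of the pseudo-code, not a reference implementation: the names `kmeans`, `max_iter`, and `seed` are assumptions, the initial centroids are k distinct data points drawn at random, Euclidean distance is used, and a cluster that becomes empty simply keeps its previous centroid.

```python
import math
import random


def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: use k distinct data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(X, k)]
    labels = [-1] * len(X)
    for _ in range(max_iter):
        changed = False
        # Assignment step: attach every point to its nearest centroid.
        for i, x in enumerate(X):
            best = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break  # no assignment changed, so the centroids are already up to date
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [x for x, c in zip(X, labels) if c == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Illustrative data: two well-separated groups of three points each.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(X, 2)
print(centroids)
print(labels)
```

Which cluster receives index 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers.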
The K initial centroids can be chosen at random, but the usual ways to improve performance are:
(1) Choose points that are far apart. Method: pick the first point at random, then repeatedly compute the distances to the points already chosen and select the point with the maximum distance. When k is large, the computational cost of this method grows, so it is better suited to the initialization used by the bisecting K-means clustering algorithm.
(2) Use a hierarchical clustering method to find K clusters. TBD
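Method (1) can be read in more than one way. The sketch below takes one common reading, a farthest-first selection: each new centroid is the data point whose distance to its nearest already-chosen centroid is largest. The function name `farthest_point_init` and the `seed` parameter are hypothetical.

```python
import math
import random


def farthest_point_init(X, k, seed=0):
    """Pick k initial centroids that are far apart (farthest-first selection)."""
    rng = random.Random(seed)
    centroids = [X[rng.randrange(len(X))]]  # the first centroid is random
    while len(centroids) < k:
        # For every point, the distance to its nearest chosen centroid.
        nearest = [min(math.dist(x, c) for c in centroids) for x in X]
        # The next centroid is the point that is farthest from all chosen ones.
        centroids.append(X[nearest.index(max(nearest))])
    return centroids


X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(farthest_point_init(X, 2))
```

Each added centroid requires one pass over all points and all chosen centroids, which is the cost the text refers to when k is large.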
3. Feature value processing
The K-means clustering algorithm requires numerical data for similarity measurement; nominal data can be mapped to binary values so that similarity can be measured.
In addition, a sample usually has several features, each with its own domain and range of values, and they influence the distance calculation differently: a feature with large values will overwhelm a feature with small values. To treat the features fairly, the feature values must be scaled. The simplest way is to normalize every feature so that each dimension is mapped to the interval [0, 1], which can also reduce the number of iterations and speed up convergence.
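Mapping each dimension to [0, 1] is usually done with min-max scaling. A minimal sketch follows; the function name `min_max_scale` is hypothetical, and a feature with a constant value is mapped to 0.0 here as an arbitrary choice to avoid division by zero.

```python
def min_max_scale(X):
    """Rescale every feature (column) of X to the interval [0, 1]."""
    cols = list(zip(*X))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    scaled = []
    for row in X:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant feature -> 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# The second feature is about 1000 times larger than the first before scaling.
X = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
print(min_max_scale(X))  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```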
4. Selection of K values
As mentioned earlier, the number of clusters K in K-means clustering is a user-defined parameter. How can the user know whether K is a good choice, and whether the generated clusters are good? As with the K value of the K-nearest-neighbor classification algorithm, K-means can use cross-validation to find the value with the lowest error rate; see section 2.3, on determining the K value, of "Machine Learning Classic algorithm and Python implementation of the--k nearest neighbor (KNN) algorithm".
When K is lower than the real number of clusters, SSE (or another dispersion measure such as average diameter) rises rapidly. So you can cluster several times with different values of K and compare the results to find the best one. Typically K = 1, 2, 4, 8, ... is tried, doubling each time. Cross-validation identifies a value V such that a good clustering lies between V/2 and V, and binary search is then used to find the best K in [V/2, V].
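The doubling scan over K can be sketched as below. This only computes the SSE for K = 1, 2, 4, ...; it does not perform cross-validation or the final binary search. The helper `kmeans_sse` is a compact K-means written just for this sketch, and all names and the sample data are assumptions.

```python
import math
import random


def kmeans_sse(X, k, max_iter=100, seed=0):
    """Run a basic K-means and return the final SSE (compact helper for this sketch)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(X, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(x, centroids[j])) for x in X]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [x for x, c in zip(X, labels) if c == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return sum(math.dist(x, centroids[c]) ** 2 for x, c in zip(X, labels))


def sse_for_doubling_k(X, k_max):
    """Return {K: SSE} for K = 1, 2, 4, ... up to k_max (and at most len(X))."""
    results = {}
    k = 1
    while k <= min(k_max, len(X)):
        results[k] = kmeans_sse(X, k)
        k *= 2
    return results


X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for k, value in sse_for_doubling_k(X, 4).items():
    print(k, round(value, 3))
```

On data with two clear groups, the SSE drops sharply from K = 1 to K = 2 and changes much less afterwards, which is the signal the comparison looks for.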
5. Python implementations of the K-means algorithm
: TBD
The K-means algorithm may converge and still give a poor clustering, because it converges to a local minimum rather than the global minimum (a local minimum is a result that is acceptable but not the best; the global minimum is the best possible result). To reduce this problem, another algorithm called bisecting K-means (binary K-means) was proposed.
(c) Bisecting K-means (binary K-means) clustering algorithm
As the name suggests, bisecting K-means runs K-means with k=2 on a dataset (or sub-dataset) each time, and the sub-dataset to split is chosen by a criterion. The algorithm first treats all points as one cluster and splits that cluster in two. It then iterates: among all clusters, one is selected according to SSE and split again by K-means with k=2, until the user-specified number of clusters is reached. There are two ways to select the cluster to split according to SSE:
(1) Choose the cluster whose split reduces the total SSE the most. This requires tentatively bisecting every cluster, computing the total SSE after each split, and taking the difference from the SSE before the split (the SSE must fall, of course). The cluster with the largest difference is the one that is actually split.
The pseudo-code form of the binary K-means algorithm under this scheme is as follows:
***************************************************************
Treat all data points as one cluster
While the number of clusters is less than k
    For each cluster
        Calculate the total error
        Run K-means clustering (k=2) on the cluster
        Calculate the total error after splitting the cluster in two
    Select the cluster whose split gives the smallest total error and split it
***************************************************************
(2) Another approach is to always split the cluster with the largest SSE, until the number of clusters reaches the user-specified number. The procedure is similar to (1); the only difference is that each iteration selects the cluster with the largest SSE.
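Strategy (1) can be sketched in plain Python as follows. This is an illustrative reading of the pseudo-code rather than a definitive implementation; the names `two_means`, `cluster_sse`, and `bisecting_kmeans` are hypothetical, Euclidean distance is assumed, and a trial split that leaves one half empty is skipped.

```python
import math
import random


def two_means(points, rng, max_iter=100):
    """Split `points` into two groups with a basic K-means (k=2)."""
    centroids = [list(p) for p in rng.sample(points, 2)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in (0, 1):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [[p for p, c in zip(points, labels) if c == j] for j in (0, 1)]


def cluster_sse(points):
    """SSE of one cluster relative to its own mean."""
    centroid = [sum(col) / len(points) for col in zip(*points)]
    return sum(math.dist(p, centroid) ** 2 for p in points)


def bisecting_kmeans(X, k, seed=0):
    """Return a list of k clusters (lists of points), built by repeated bisection."""
    rng = random.Random(seed)
    clusters = [list(X)]
    while len(clusters) < k:
        best = None  # (total SSE after the split, cluster index, the two halves)
        for i, cluster in enumerate(clusters):
            if len(cluster) < 2:
                continue  # a single point cannot be split
            halves = two_means(cluster, rng)
            if not halves[0] or not halves[1]:
                continue  # degenerate split, skip it
            others = sum(cluster_sse(c) for j, c in enumerate(clusters) if j != i)
            total = others + cluster_sse(halves[0]) + cluster_sse(halves[1])
            if best is None or total < best[0]:
                best = (total, i, halves)
        if best is None:
            break  # nothing left that can be split
        _, i, halves = best
        clusters[i:i + 1] = halves  # replace the chosen cluster by its two halves
    return clusters


X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for cluster in bisecting_kmeans(X, 2):
    print(cluster)
```

To obtain strategy (2) instead, the inner loop would pick the cluster with the largest `cluster_sse` and split only that one, which avoids the trial split of every cluster.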
Python implementation of binary K-means clustering algorithm
: TBD
Reference
Similarity measurement in machine learning
A survey of common methods for similarity calculation
A classical algorithm for machine learning and Python implementation--clustering and K-means and two-K-means clustering algorithm