9. Clustering
Content
9. Clustering
9.1 Supervised learning and unsupervised learning
9.2 K-means algorithm
9.3 Optimization Objective
9.4 Random Initialization
9.5 Choosing the number of Clusters
9.1 Supervised learning and unsupervised learning
We have learned many machine learning algorithms, including linear regression, logistic regression, neural networks, and support vector machines. These algorithms all have one thing in common: every training sample is labeled. For example, when using linear regression to predict house prices, each training sample consists of one or more features (such as area, floor, etc.) together with a label, the house price. When using logistic regression, neural networks, or support vector machines for classification, the training samples are likewise labeled with a class: in spam classification we use existing spam (labeled 1) and non-spam (labeled 0); in digit recognition the features are the pixel values and the label is the digit itself. Algorithms that learn from labeled training samples are called supervised learning. The training set for supervised learning can be written uniformly in the following form, where $x$ is the feature vector and $y$ is the label:

$$\{(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \ldots,\ (x^{(m)}, y^{(m)})\}$$
Obviously, not all data in real life is labeled (or the labels may be unknown). We therefore need to study unlabeled training samples in order to reveal the intrinsic structure and regularities of the data. This kind of learning is called unsupervised learning. The training set for unsupervised learning thus takes the following form, containing only the features:

$$\{x^{(1)},\ x^{(2)},\ \ldots,\ x^{(m)}\}$$
Figure 9-1 shows the difference between supervised learning and unsupervised learning. Panel (1) shows classification of labeled samples, with one class on each side of the boundary (one drawn as circles, the other as crosses); panel (2) shows clustering of unlabeled samples (all drawn with the same marker) based on the features $x_1$ and $x_2$.
Figure 9-1 An example of the difference between supervised learning and unsupervised learning
Unsupervised learning also has many applications. One clustering example: given a collection of papers, group them according to features of each paper such as word frequency, sentence length, and page count. There are many other applications of clustering, as shown in Figure 9-2. A non-clustering example is the cocktail-party algorithm, which recovers useful signals (information) from noisy data; for instance, at a noisy cocktail party you can still notice someone calling your name. The cocktail-party algorithm can therefore be used for speech recognition (see Wikipedia).
There is more discussion on the difference between supervised learning and unsupervised learning on Quora.
Figure 9-2 Application of some clustering
9.2 K-means Algorithm
The basic idea of clustering is to partition the samples in a dataset into several (usually disjoint) subsets, each of which is called a "cluster". After partitioning, each cluster may correspond to some concept (property). For example, clustering papers into 2 clusters by page count, sentence length, and other features might yield one cluster consisting mostly of master's theses and another consisting mostly of graduation theses.
The K-means algorithm is a widely used clustering algorithm. Its steps are described below:
- Randomly initialize K points, called cluster centroids;
- Cluster assignment: for each sample, assign it to the cluster centroid closest to it;
- Move centroids: for each cluster, compute the mean of all samples assigned to that cluster and move the cluster centroid to that mean;
- Repeat steps 2 and 3 until the desired clusters are found (i.e., the optimization objective is reached; see Section 9.3).
Figure 9-3 illustrates the case where both the number of features and the number of clusters K are 2.
Figure 9-3 K-means algorithm demonstration
Following the description above, the K-means algorithm can be formalized as follows.
Input:
- K (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)
Algorithm:
Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
Repeat {
    for i = 1 to m
        $c^{(i)}$ := index (from 1 to K) of cluster centroid closest to $x^{(i)}$
    for k = 1 to K
        $\mu_k$ := average (mean) of points assigned to cluster $k$
}
In the above algorithm, the first loop corresponds to the cluster assignment step: we construct a vector $c$ such that $c^{(i)}$ equals the index of the cluster of $x^{(i)}$, that is, the index of the cluster centroid nearest to $x^{(i)}$. Mathematically:

$$c^{(i)} := \arg\min_{k} \left\| x^{(i)} - \mu_k \right\|^2$$
The second loop corresponds to the move-centroids step, which moves each cluster centroid to the mean of its cluster. Mathematically:

$$\mu_k := \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)}$$

where $C_k$ denotes the set of samples assigned to cluster $k$.
If a cluster centroid has no samples assigned to it, we can either reinitialize that centroid or remove it directly.
After several iterations the algorithm converges, that is, further iterations no longer change the clusters.
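To make the steps concrete, here is a minimal NumPy sketch of K-means (the function name k_means, the seed parameter, and the use of Python/NumPy are illustrative choices; the course does not prescribe an implementation). It follows the two loops described above, initializes the centroids as K randomly chosen training samples (see Section 9.4), and reinitializes a centroid whenever its cluster becomes empty:

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """Minimal K-means sketch: X is an (m, n) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Initialize centroids as K distinct, randomly chosen training samples.
    centroids = X[rng.choice(m, size=K, replace=False)]

    for _ in range(max_iters):
        # Cluster assignment: index of the nearest centroid for each sample.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(distances, axis=1)

        # Move centroids: mean of the samples assigned to each cluster.
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
            else:
                # Empty cluster: reinitialize this centroid to a random sample.
                new_centroids[k] = X[rng.integers(m)]

        # Converged: further iterations would no longer change the clusters.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return c, centroids
```

For example, `c, centroids = k_means(X, 2)` assigns each row of an (m, n) matrix X to one of 2 clusters.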
In some applications the samples vary continuously and there are no apparent clusters, but we can still use the K-means algorithm to divide the samples into K subsets for later use. For example, T-shirt size codes can be chosen by clustering people's heights and weights, as shown in Figure 9-4.
Figure 9-4 K-means for non-separated clusters
9.3 Optimization Objective
Let us restate the variables used in the K-means algorithm:
- $c^{(i)}$ = index of the cluster (1, 2, ..., K) to which example $x^{(i)}$ is currently assigned
- $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
- $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
Using these variables, we define the cost function as follows:

$$J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
So our optimization objective is

$$\min_{c^{(1)}, \ldots, c^{(m)},\ \mu_1, \ldots, \mu_K} J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right)$$
Comparing this with the algorithm described in Section 9.2, one can see that:
- In the cluster assignment step, J is minimized with respect to $c^{(1)}, \ldots, c^{(m)}$ (holding $\mu_1, \ldots, \mu_K$ fixed);
- In the move-centroids step, J is minimized with respect to $\mu_1, \ldots, \mu_K$ (holding $c^{(1)}, \ldots, c^{(m)}$ fixed).
Note that in the K-means algorithm the cost function can never increase; it should always decrease or stay the same (unlike gradient descent, where the cost can sometimes go up, for example if the learning rate is too large).
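As a small companion to the k_means sketch above, here is one way the distortion J could be computed (the function name distortion is illustrative, not from the course notes):

```python
import numpy as np

def distortion(X, c, centroids):
    """Cost J: mean squared distance from each sample x(i)
    to the centroid of the cluster it is assigned to."""
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```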
9.4 Random Initialization
Here is a recommended method for initializing the cluster centroids:
- Make sure that K < m, i.e., the number of clusters is less than the number of samples;
- Randomly select K training samples;
- Set the K cluster centroids equal to these K training samples.
The K-means algorithm may get stuck in a local optimum. To reduce the chance of this, we can run the K-means algorithm multiple times with different random initializations. The procedure then takes the following form (using 100 runs as an example: a tradeoff between efficiency and accuracy):
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get $c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K$.
    Compute cost function (distortion) $J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right)$.
}
Pick the clustering that gave the lowest cost $J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right)$.
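A sketch of this multiple-restart procedure, assuming the k_means and distortion helpers sketched earlier in this section are in scope (the name best_k_means and the default of 100 restarts are illustrative):

```python
import numpy as np

def best_k_means(X, K, n_restarts=100):
    """Run K-means with several random initializations and keep
    the clustering with the lowest distortion J."""
    best_J, best = np.inf, None
    for seed in range(n_restarts):
        c, centroids = k_means(X, K, seed=seed)  # different random initialization each run
        J = distortion(X, c, centroids)
        if J < best_J:
            best_J, best = J, (c, centroids)
    return best, best_J
```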
9.5 Choosing the number of Clusters
Choosing the value of K is usually subjective and ambiguous; that is, there is no method that guarantees one value of K is better than the others. However, there are some methods worth referring to.
The elbow method: plot the cost J as a function of the number of clusters K. The value of J should decrease as K increases and then level off; choose the value of K at which J begins to level off, as shown in Figure 9-5 (1).
However, this curve is often gradual, with no obvious "elbow", as shown in Figure 9-5 (2).
Figure 9-5 The cost J as a function of the number of clusters K
Note: as K increases, J should always decrease; if it does not, one likely cause is that K-means has fallen into a bad local optimum.
For some other methods, see Wikipedia.
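For illustration, a minimal sketch of the elbow method using the hypothetical best_k_means helper from the previous sketch, where X is assumed to be the (m, n) training matrix (using multiple restarts reduces the risk of the bad local optima mentioned in the note above):

```python
import matplotlib.pyplot as plt

# Run K-means for a range of K values and plot the distortion J against K.
Ks = range(1, 11)
costs = [best_k_means(X, K)[1] for K in Ks]

plt.plot(list(Ks), costs, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Cost J (distortion)")
plt.title("Elbow method")
plt.show()
```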
Of course, sometimes we should determine the value of K based on the subsequent (later/downstream) purpose. Taking again the example of choosing T-shirt sizes from people's heights and weights: if we want to divide T-shirt sizes into the 3 types S/M/L, then K should be 3; if we want the 5 types XS/S/M/L/XL, then K should be 5. This is shown in Figure 9-6.
Figure 9-6 Two different cases of dividing T-shirt size
"Recommended reading" discusses the disadvantages of the K-mean algorithm
Stanford Machine Learning Notes - 9. Clustering