Machine Learning Notes (IX): Clustering Algorithms and Practice (K-means, DBSCAN, DPeak, Spectral Clustering)

Things at school kept me busy this week, so this post is a few days late. This time let's talk about clustering algorithms.

First of all, recall that machine learning methods are broadly divided into supervised learning and unsupervised learning. In supervised learning we are given both the data and its labels, and we train a classifier on them to achieve good classification performance; logistic regression, decision trees and SVM, which we covered earlier, are all supervised models. In unsupervised learning we only have the data, with no labels, and we build a model that decides which samples belong together. This process is what we usually call clustering.

Today I will introduce several of the most common clustering algorithms: the K-means algorithm, density-based clustering (DBSCAN), the density peak algorithm (DPeak) and spectral clustering, roughly in order from easy to hard. I will explain the principles as I understand them, and I hope you find it useful.

====================================================================

K-means algorithm.

In principle, the K-means algorithm implicitly assumes that the data come from a mixture of K Gaussian distributions that share the same σ, with N1, N2, ..., NK samples in each component and means μ1, μ2, ..., μK. Under this assumption, the likelihood of a sample xi belonging to its own cluster k is

p(xi | cluster k) = N(xi; μk, σ²I) ∝ exp(−‖xi − μk‖² / (2σ²))

This routine is familiar by now: take the log-likelihood, maximize it, and attach a minus sign so it can serve as a loss function. Since every cluster shares the same σ, the constant terms drop out and we are left with the K-means loss function

J(μ1, ..., μK) = Σ_{k=1..K} Σ_{xi ∈ Ck} ‖xi − μk‖²

Setting the derivative of this loss with respect to each μk to zero gives the best updated cluster center

μk = (1 / Nk) Σ_{xi ∈ Ck} xi

i.e. the mean of the samples currently assigned to cluster k.

So we get the so-called K-means algorithm.

1 Choose K initial category centers.

2 Assign each sample to the category whose center is closest.

3 Update each category center to the mean of all points assigned to that category.

4 Repeat steps 2 and 3 until a termination condition is met (number of iterations, rate of change of the cluster centers, MSE, etc.). A minimal code sketch of these steps follows below.
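
To make the four steps concrete, here is a minimal NumPy sketch of them (my own illustration, not code from the original post; the function and variable names are mine):

import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    # step 1: pick k initial centers at random from the samples
    rng = np.random.RandomState(seed)
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign every sample to its nearest center
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 3: move each center to the mean of the points assigned to it
        new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step 4: stop early once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

In practice you would normally call sklearn.cluster.KMeans instead, as in the experiments at the end of this post.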

Now let's look back at the problems with the K-means algorithm.

First, as I mentioned at the start, K-means assumes the data follow a mixture of Gaussians with equal σ, so the final clusters are always roughly circular regions. This greatly limits its applicability: if your data come in more exotic shapes, such as fans or rings, you will find that K-means performs poorly on them.

Secondly, you have to supply the number of clusters K. That is fine if you have some prior knowledge, but what if you are starting completely blind? You guess, or you try several values and pick the best one according to some evaluation criterion. There is also the choice of the initial cluster centers: the K-means result is sensitive to the initial values. For example, suppose the samples form three clusters and you happen to place two initial centers inside one cluster and the third center between the other two clusters; the likely outcome is that those two clusters get merged into one class while the remaining cluster is forcibly split into two. To address this sensitivity to initialization, the k-means++ algorithm was proposed: pick the first cluster center at random, then compute the distance from every point to the nearest chosen center and use it as a weight when sampling the next center. To a certain extent this avoids unreasonable initial centers; see the sketch below.
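
As a rough sketch of the k-means++ seeding just described (my own illustration; note that the canonical k-means++ uses the squared distance to the nearest chosen center as the sampling weight):

import numpy as np

def kmeanspp_init(data, k, seed=0):
    rng = np.random.RandomState(seed)
    # first center: a uniformly random sample
    centers = [data[rng.randint(len(data))]]
    for _ in range(k - 1):
        # squared distance from each sample to its closest chosen center
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        # sample the next center with probability proportional to that distance
        probs = d2 / d2.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers)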

Finally, there is a problem similar to the one we saw with SVM: with the linearly separable SVM, a single outlier can shift the separating hyperplane and weaken generalization. K-means updates cluster centers using the mean, so in the same way one outlier can pull the new cluster center quite far off, and since the outlier keeps being included in subsequent updates, the final result suffers.

Having listed so many shortcomings it may sound as if K-means is useless, but it remains one of the most classic clustering algorithms: it is simple and fast, it has a clear advantage when handling large datasets, and it can also serve as a building block inside other clustering algorithms.

====================================================================

DBSCAN algorithm

The K-means algorithm above mainly clusters data that form circular regions, so its range of application is relatively narrow. Density-based clustering compensates for this disadvantage and can handle clusters of arbitrary shape. The algorithm has two parameters to tune, a radius σ and a minimum count m. Let's first introduce a few of its concepts.

Core object: an object whose σ-neighborhood contains at least m objects is called a core object.

Directly density-reachable: if an object lies in the σ-neighborhood of a core object, we say it is directly density-reachable from that core object.

Density-reachable (density-connected): if objects A and B are directly density-reachable, and objects B and C are also directly density-reachable, then we say A and C are density-reachable from each other; this is also known as being density-connected.

The DBSCAN algorithm works like this: first find a core object and, starting from it, collect the objects that are directly density-reachable from it; then from each of those objects keep looking for further points that are directly density-reachable, until no more objects can be added, at which point one cluster is complete. We can also say that a cluster is simply the set of all points that are density-reachable from one another. A quick usage example is given below.
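
As a quick illustration (not from the original post), sklearn.cluster.DBSCAN implements exactly this expansion procedure; its eps parameter plays the role of the radius σ and min_samples the role of m. Here is a sketch on a ring-shaped dataset that K-means would struggle with:

from sklearn.cluster import DBSCAN
import sklearn.datasets as ds

# two concentric rings: a shape K-means cannot separate
X, _ = ds.make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# points labelled -1 are the ones left out as noise (outliers)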

So where do its advantages lie?

First, it makes no assumption about the shape of a cluster: as long as the points are density-reachable from one another we group them into the same cluster, so no matter how exotic the shape, the points end up in the same cluster.

Second, consider the outliers: they deviate a lot from the normal objects, so they are not core objects, nor are they density-reachable from other points, and in the end they are simply left out and not assigned to any cluster. So the DBSCAN algorithm also does a certain amount of outlier removal for free. Of course, note that if σ is too large or m too small, some outliers may still sneak into a cluster, or separate groups may get merged into one class; in short, you still have to tune the parameters against the clustering results to find the best partition.

====================================================================

DPeak algorithm

The density peak algorithm can be regarded as an extension built on top of the two algorithms above; its main strengths are determining cluster centers and excluding outliers.

How does it work? First we fix a radius r. For every sample, we count the number of samples in its radius-r neighborhood and record it as the local density ρ. In the second step, for each sample we compute the minimum distance to any point whose density is higher than its own and record it as σ. With these two quantities we can do the screening, which splits into the following four cases:

1 ρ is very small and σ is very large. There are very few samples around this one, yet the nearest denser point is very far away. What does that tell us? It is an outlier far from the normal samples, doing its own thing in some remote corner; kick it out without hesitation.

2 ρ is very large and σ is very large. There are many samples around this one, and you have to travel very far to find a point of even higher density. This point is surrounded by its cluster like a star; it is the king of the cluster, and we tend to pick it as a cluster center.

3 ρ is small and σ is small. There are few samples around this one, but a denser point is not far away; this indicates a point on the edge, typically at the boundary of a cluster.

4 ρ is very large and σ is very small. There are many samples around this one, but an even denser point sits close by. This only happens near the center of a cluster: unfortunately, you may be a core member of the cluster, but you cannot be its king.

So, based on each sample's ρ and σ, we can roughly determine its role: we throw the outliers (the big villains) out of the sample set, pick the points with large ρ and large σ as cluster centers, and then run the K-means or DBSCAN algorithm to obtain a fairly good clustering. A rough sketch of computing ρ and σ follows below.
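
Here is a rough sketch of computing ρ and σ for every sample, following the description above (my own illustration; the value stored for the globally densest point is a common convention, not something specified in the post):

import numpy as np

def dpeak_stats(data, r):
    # pairwise distances between all samples
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # rho: how many samples fall inside the radius-r neighborhood (excluding the point itself)
    rho = (dist < r).sum(axis=1) - 1
    n = len(data)
    sigma = np.empty(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        if len(denser) == 0:
            # the globally densest point: take its distance to the farthest sample
            sigma[i] = dist[i].max()
        else:
            # minimum distance to any point of higher density
            sigma[i] = dist[i, denser].min()
    return rho, sigma

Samples with large ρ and large σ are the cluster-center candidates (case 2), while samples with tiny ρ but large σ are the outliers to discard (case 1).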

====================================================================

Spectral clustering

This is the last clustering algorithm I want to cover. I realized today that clustering really does not involve that many formulas, but typing them all out is still tiring!

Spectral clustering is a clustering method based on graph theory, and I hope to explain what it does in a fairly intuitive way.

First, some concepts. To talk about spectral clustering we have to say what a spectrum is, and introduce the similarity matrix, the degree matrix and the Laplacian matrix.

Spectrum: the set of eigenvalues of a square matrix is called its spectrum (for a non-square matrix, it is the set of eigenvalues of the matrix left-multiplied by its transpose); the largest of them in absolute value is called the spectral radius.

Similarity matrix W: an n*n matrix whose entry in row i, column j is some similarity measure between the i-th sample and the j-th sample.

Degree matrix D: an n*n diagonal matrix whose i-th diagonal entry is the sum of all values in row i of the similarity matrix.

Laplacian matrix L: the degree matrix minus the similarity matrix, L = D − W.
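
For example, with a Gaussian similarity between samples (my own choice of similarity function; the post does not fix one), the three matrices can be built as follows:

import numpy as np

def build_laplacian(data, gamma=1.0):
    # similarity matrix W: Gaussian similarity between every pair of samples
    sq_dist = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=2)
    W = np.exp(-gamma * sq_dist)
    np.fill_diagonal(W, 0)
    # degree matrix D: diagonal matrix of the row sums of W
    D = np.diag(W.sum(axis=1))
    # Laplacian matrix L = D - W
    return D - W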

Now, based on these concepts, consider the following situation. Suppose we have M samples that all belong to the same class. We can build their M*M similarity matrix W, which is symmetric, and from it obtain the degree matrix and the Laplacian matrix. The neat thing is that this Laplacian matrix is positive semi-definite, which can be shown as follows: for any vector x,

xᵀ L x = xᵀ D x − xᵀ W x = Σ_i d_i x_i² − Σ_{i,j} w_ij x_i x_j = ½ Σ_{i,j} w_ij (x_i − x_j)² ≥ 0

So all the eigenvalues of the Laplacian matrix are greater than or equal to 0. From the definition of the degree matrix we know its diagonal entries equal the row sums of the similarity matrix, so multiplying the Laplacian by an all-ones vector gives the all-zero column vector. Therefore the Laplacian has a 0 eigenvalue, and the corresponding eigenvector is the all-ones vector, which means

L_M · 1_M = (D − W) · 1_M = 0

Similarly, if there is another class of N samples, the same reasoning gives, for its Laplacian,

L_N · 1_N = 0

Assuming that the samples of these two classes have no similarity with each other, the similarity matrix of the two sample sets mixed together is block diagonal, and so is the combined Laplacian:

L = [ L_M   0  ]
    [  0   L_N ]

We already know that L_M and L_N each have the all-ones vector as an eigenvector for eigenvalue 0, so L has two eigenvectors for eigenvalue 0:

u1 = (1_M, 0_N)ᵀ,  u2 = (0_M, 1_N)ᵀ

Looking at these two eigenvectors of the smallest eigenvalue 0, the two classes can obviously be separated: the rows belonging to the first class differ from the rows belonging to the second. So by clustering the eigenvectors belonging to the k smallest eigenvalues of the Laplacian matrix, we can determine the class of the corresponding samples. Of course, reality is never as clean as the derivation: eigenvectors made purely of 0s and 1s are rare, because there is always some entanglement between samples of different classes, but clustering on these eigenvectors is still enough to determine the class of each sample. This is what we call spectral clustering.

Finally, let's summarize the spectral clustering procedure.

1 Compute the similarity matrix, the degree matrix and the Laplacian matrix.

2 Compute the eigenvectors corresponding to the k smallest eigenvalues of the Laplacian matrix.

3 Stack these k eigenvectors into a new n*k matrix and cluster its row vectors.

4 The cluster assigned to each row vector is the cluster of the corresponding original sample. A minimal sketch of these steps is given below.
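
Putting the four steps together, here is a minimal sketch (my own, again assuming a Gaussian similarity; sklearn.cluster.SpectralClustering packages the same idea):

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(data, k, gamma=1.0):
    # step 1: similarity matrix, degree matrix and Laplacian matrix
    sq_dist = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=2)
    W = np.exp(-gamma * sq_dist)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # step 2: eigenvectors of the k smallest eigenvalues (L is symmetric, so eigh)
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]            # eigh returns eigenvalues in ascending order
    # steps 3-4: cluster the rows of the n*k matrix; row i's label is sample i's label
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)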

====================================================================

Well, that is roughly everything I wanted to say about the algorithms. Now for everyone's favourite hands-on part. The iris data we have used to death before is not suitable today, so as a first step let's generate some data of our own and take a look.

import numpy as np
import matplotlib
import matplotlib.colors
import matplotlib.pyplot as plt
import sklearn.datasets as ds

# build the data: 500 samples drawn from 4 blobs
n = 500
centers = 4
data, y = ds.make_blobs(n, centers=centers, random_state=0)

# plot the original data distribution
matplotlib.rcParams['font.sans-serif'] = [u'SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
cm = matplotlib.colors.ListedColormap(list('rgbm'))
plt.scatter(data[:, 0], data[:, 1], c=y, cmap=cm)
plt.title(u'Raw data distribution')
plt.grid()
plt.show()

The results are as follows


OK, let's try it first with K-means.


# K-means
from sklearn.cluster import KMeans

model = KMeans(n_clusters=4, init='k-means++')
y_pre = model.fit_predict(data)
plt.scatter(data[:, 0], data[:, 1], c=y_pre, cmap=cm)
plt.title(u'K-means clustering')
plt.grid()
plt.show()

The result is


The classification result is quite good: apart from a few mistakes among points that sit very close to the boundary between clusters, the overall result is satisfactory. But remember what I said earlier: K-means carries the prior assumption that the data follow Gaussian distributions with equal variance. So let's deliberately make the variances of the blobs different and see whether the clustering quality suffers.

# data with unequal variances
data2, y2 = ds.make_blobs(n, centers=centers, cluster_std=(2, 2, 5, 8), random_state=0)
plt.scatter(data2[:, 0], data2[:, 1], c=y2, cmap=cm)
plt.title(u'Raw data distribution')
plt.grid()
plt.show()

model2 = KMeans(n_clusters=4, init='k-means++')
y_pre2 = model2.fit_predict(data2)
plt.scatter(data2[:, 0], data2[:, 1], c=y_pre2, cmap=cm)
plt.title(u'K-means clustering')
plt.grid()
plt.show()

The results are as follows




Sure enough, the clustering just carves the data into round-ish lumps and cannot recover the two classes that overlap in the middle of the original data, which confirms the limitation of K-means we discussed earlier.

Well, I have typed a lot of words today... I wish you a pleasant weekend, have a nice day~
