Using Scikit-learn to study spectral clustering




In the summary of the principles of spectral clustering, we went over the theory behind the algorithm. Here we summarize how to use spectral clustering in scikit-learn.


1. Scikit-learn Spectral Clustering Overview


In scikit-learn, sklearn.cluster.SpectralClustering implements spectral clustering based on the Ncut graph cut; graph-cut clustering based on RatioCut is not implemented. For building the similarity matrix, only the K-nearest-neighbor method and the fully connected method are provided; there is no similarity matrix based on the $\epsilon$-neighborhood method. For the final clustering step, two algorithms are available: K-Means and discretize.



Among the parameters of SpectralClustering, we mainly need to tune the parameters that build the similarity matrix and the number of clusters, both of which have a large influence on the clustering result. Of course, the other parameters should also be understood so that their defaults can be changed when necessary.


2. Important parameters of SpectralClustering and notes on parameter tuning


Below we introduce the important parameters of SpectralClustering, together with notes on how to tune them.



1) n_clusters: The dimensionality we reduce to when cutting the graph in spectral clustering (the $k_1$ of Section 7 of the principles article), and also the number of clusters produced by the final clustering step (the $k_2$ of Section 7 of the principles article). In other words, scikit-learn's spectral clustering unifies these two parameters into one, which reduces the number of parameters to tune. Although this value is optional, it is generally recommended to tune it and choose the optimal value.
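As a quick illustration, here is a minimal sketch (on a toy make_blobs dataset made up for this snippet) showing that the single n_clusters argument determines how many distinct labels come out:

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)

# n_clusters fixes both the dimension of the spectral embedding used for the
# graph cut and the number of clusters produced by the final clustering step.
labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)
print(np.unique(labels))  # expected: [0 1 2]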



2) affinity: How we build the similarity matrix. There are three classes of choices. The first is 'nearest_neighbors', i.e. the K-nearest-neighbor method. The second is 'precomputed', i.e. a user-supplied similarity matrix; in this case you must supply the similarity matrix yourself, and the matrix passed in when fitting is interpreted as the similarity matrix rather than as a feature matrix. The third class is the fully connected method, which builds the similarity matrix with a kernel function; several built-in kernels are available, and a custom kernel can also be supplied. The most common choice is the built-in Gaussian kernel 'rbf'. Other popular kernels are 'linear' (the linear kernel), 'poly' (the polynomial kernel) and 'sigmoid' (the sigmoid kernel). If one of these kernels is chosen, the corresponding kernel parameters must be tuned via the separate parameters described below. I have not used custom kernel functions, so there is not much to say about them here. The default affinity is the Gaussian kernel 'rbf'. In general, the default Gaussian kernel is the recommended way to build the similarity matrix.
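The following sketch shows the three ways of specifying affinity; the rbf_kernel helper from sklearn.metrics.pairwise is used here only to construct an example precomputed similarity matrix:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)

# 1) Fully connected graph with the default Gaussian kernel.
labels_rbf = SpectralClustering(n_clusters=3, affinity='rbf', gamma=1.0).fit_predict(X)

# 2) K-nearest-neighbor similarity graph.
labels_knn = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                                n_neighbors=10).fit_predict(X)

# 3) Precomputed similarity matrix: fit_predict receives the
#    (n_samples, n_samples) similarity matrix instead of the feature matrix.
S = rbf_kernel(X, gamma=1.0)
labels_pre = SpectralClustering(n_clusters=3, affinity='precomputed').fit_predict(S)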



3) Kernel parameter gamma: If the affinity parameter uses the polynomial kernel 'poly', the Gaussian kernel 'rbf', or the sigmoid kernel 'sigmoid', then this parameter needs to be tuned.



In the polynomial kernel this parameter corresponds to the $\gamma$ in $K(x, z) = (\gamma x \bullet z + r)^d$. Cross-validation is generally needed to select a suitable combination of $\gamma, r, d$.



In the Gaussian kernel this parameter corresponds to the $\gamma$ in $K(x, z) = exp(-\gamma ||x-z||^2)$. The appropriate $\gamma$ is generally chosen by cross-validation.



In the sigmoid kernel this parameter corresponds to the $\gamma$ in $K(x, z) = tanh(\gamma x \bullet z + r)$. A suitable combination of $\gamma, r$ is generally chosen by cross-validation.



The default value of $\gamma$ is 1.0. If affinity is 'nearest_neighbors' or 'precomputed', this parameter is meaningless.
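To make the role of gamma concrete, here is a small sketch that computes the Gaussian similarity matrix by hand with sklearn.metrics.pairwise.rbf_kernel and compares it with the affinity_matrix_ attribute of a fitted model (this assumes affinity_matrix_ stores exactly the kernel matrix, which is how scikit-learn documents it):

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=100, n_features=4, centers=3, random_state=0)

gamma = 0.1
# K(x, z) = exp(-gamma * ||x - z||^2) for every pair of samples.
S = rbf_kernel(X, gamma=gamma)

sc = SpectralClustering(n_clusters=3, affinity='rbf', gamma=gamma, random_state=0).fit(X)
print(np.allclose(S, sc.affinity_matrix_))  # expected: True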



4) Kernel parameter degree: If the affinity parameter uses the polynomial kernel 'poly', this parameter needs to be tuned. It corresponds to the $d$ in $K(x, z) = (\gamma x \bullet z + r)^d$. The default is 3. Cross-validation is generally needed to select a suitable combination of $\gamma, r, d$.



5) Kernel parameter coef0: If the affinity parameter uses the polynomial kernel 'poly' or the sigmoid kernel, this parameter needs to be tuned.



In the polynomial kernel this parameter corresponds to the $r$ in $K(x, z) = (\gamma x \bullet z + r)^d$. Cross-validation is generally needed to select a suitable combination of $\gamma, r, d$.



In the sigmoid kernel this parameter corresponds to the $r$ in $K(x, z) = tanh(\gamma x \bullet z + r)$. A suitable combination of $\gamma, r$ is generally chosen by cross-validation.



coef0 defaults to 1.
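For reference, here is a hedged sketch of how gamma, degree and coef0 are wired up when a polynomial or sigmoid kernel is chosen for the fully connected method; whether these kernels give a sensible (non-negative) similarity matrix depends on the data, so this only illustrates the parameter plumbing:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)

# Polynomial kernel: K(x, z) = (gamma * x . z + coef0) ** degree
labels_poly = SpectralClustering(n_clusters=3, affinity='poly',
                                 gamma=0.5, degree=3, coef0=1).fit_predict(X)

# Sigmoid kernel: K(x, z) = tanh(gamma * x . z + coef0)
labels_sig = SpectralClustering(n_clusters=3, affinity='sigmoid',
                                gamma=0.01, coef0=1).fit_predict(X)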



6) kernel_params: If the affinity parameter is a custom (callable) kernel function, the kernel's parameters need to be passed through this parameter as a dict of keyword arguments.
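A sketch of a custom kernel, assuming (as with sklearn.metrics.pairwise_kernels) that a callable affinity is evaluated on pairs of individual sample vectors and that kernel_params is forwarded to it as keyword arguments; the kernel and its width parameter are made up for illustration:

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)

def my_kernel(x, z, width=1.0):
    # Hypothetical similarity between two individual samples x and z.
    return np.exp(-np.sum((x - z) ** 2) / width)

sc = SpectralClustering(n_clusters=3, affinity=my_kernel,
                        kernel_params={'width': 2.0})
labels = sc.fit_predict(X)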



7) n_neighbors: If the affinity parameter is set to 'nearest_neighbors', i.e. the K-nearest-neighbor method, this parameter specifies the K of the KNN algorithm. The default is 10. It needs to be tuned according to the distribution of the samples. If affinity does not use 'nearest_neighbors', this parameter can be ignored.



8) eigen_solver: The solver used to compute the eigenvalues and eigenvectors for the dimensionality reduction. There are four choices: None, 'arpack', 'lobpcg', and 'amg'. If the number of samples is not particularly large, you can ignore this parameter and use None, i.e. brute-force dense eigendecomposition. If the sample size is very large, one of the other solvers is needed to speed up the eigendecomposition. This parameter mainly affects speed rather than the clustering result.



9) eigen_tol: If eigen_solver uses 'arpack', the stopping tolerance of the eigendecomposition is specified through eigen_tol.
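A short sketch of these two solver-related parameters; the values are only illustrative, and for data of this size the default None solver is perfectly adequate:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=4, centers=3, random_state=0)

# 'arpack' uses an iterative eigensolver; eigen_tol sets its stopping
# tolerance (smaller values mean a more accurate, slower decomposition).
sc = SpectralClustering(n_clusters=3, eigen_solver='arpack', eigen_tol=1e-7)
labels = sc.fit_predict(X)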



10) assign_labels: The method used for the final clustering step; two algorithms are available, K-Means and discretize. In general the default K-Means clustering works better. However, since K-Means results depend on the choice of initial values and may differ from run to run, you can use 'discretize' if you need reproducible results.



11) n_init: The number of times K-Means is run with different initial value combinations when it is used for the final step, keeping the best result; it has exactly the same meaning as n_init in the KMeans class. The default is 10, and the default is generally fine. If your n_clusters value is large, you can increase this value appropriately.
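A short sketch contrasting the two label-assignment strategies; random_state is passed in the K-Means case so that run-to-run results are repeatable:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=0)

# Default: K-Means on the spectral embedding, restarted n_init times,
# keeping the best of the 10 runs.
labels_km = SpectralClustering(n_clusters=3, assign_labels='kmeans',
                               n_init=10, random_state=0).fit_predict(X)

# Discretization-based assignment, which does not depend on K-Means
# initialization.
labels_disc = SpectralClustering(n_clusters=3,
                                 assign_labels='discretize').fit_predict(X)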



As can be seen from the introduction above, apart from the final number of clusters n_clusters, the main choices are the similarity-matrix construction method (affinity) and the corresponding similarity-matrix parameters. Once a construction method is chosen, parameter tuning is a matter of cross-selecting the corresponding parameters: for the K-nearest-neighbor method, n_neighbors needs to be tuned; for the Gaussian kernel 'rbf', the most common choice of the fully connected method, gamma needs to be tuned.
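As a counterpart to the Gaussian-kernel tuning shown in the next section, here is a sketch of tuning n_neighbors for the K-nearest-neighbor affinity on the same kind of blob data, again scored with the Calinski-Harabasz index; note that scikit-learn may warn that the k-NN graph is not fully connected for small n_neighbors:

from sklearn import metrics
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=6, centers=5,
                  cluster_std=[0.4, 0.3, 0.4, 0.3, 0.4], random_state=11)

for k in (5, 10, 20, 50):
    y_pred = SpectralClustering(n_clusters=5, affinity='nearest_neighbors',
                                n_neighbors=k).fit_predict(X)
    print("n_neighbors =", k, "score:", metrics.calinski_harabasz_score(X, y_pred))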


3. SpectralClustering example


Here we walk through an example of clustering with SpectralClustering. We use the most common choice, the Gaussian kernel, to build the similarity matrix, and K-Means for the final clustering step.



First we generate a dataset of 500 samples with 6 features, divided into 5 clusters. Because the data is 6-dimensional, it is not visualized here. The code is as follows:




 
import numpy as np
from sklearn import datasets

# 500 samples with 6 features, grouped into 5 blobs with the given per-cluster standard deviations.
X, y = datasets.make_blobs(n_samples=500, n_features=6, centers=5, cluster_std=[0.4, 0.3, 0.4, 0.3, 0.4], random_state=11)


Next we look at the result of spectral clustering with all default parameters:




 
from sklearn.cluster import SpectralClustering
from sklearn import metrics

y_pred = SpectralClustering().fit_predict(X)
# Note: older scikit-learn versions spell this metric calinski_harabaz_score.
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, y_pred))


The output Calinski-Harabasz score is:




Calinski-harabasz score 14908.9325026


Since we are using the Gaussian kernel, we generally need to tune n_clusters and gamma together and select appropriate values. The code is as follows:




 
# Grid over the Gaussian kernel width gamma and the number of clusters.
for gamma in (0.01, 0.1, 1, 10):
    for k in (3, 4, 5, 6):
        y_pred = SpectralClustering(n_clusters=k, gamma=gamma).fit_predict(X)
        print("Calinski-Harabasz Score with gamma=", gamma, "n_clusters=", k,
              "score:", metrics.calinski_harabasz_score(X, y_pred))


The output is as follows:


Calinski-harabasz score with gamma= 0.01 n_clusters= 3 score:1979.77096092
Calinski-harabasz score with gamma= 0.01 n_clusters= 4 score:3154.01841219
Calinski-harabasz score with gamma= 0.01 n_clusters= 5 score:23410.63895
Calinski-harabasz score with gamma= 0.01 n_clusters= 6 score:19303.7340877
Calinski-harabasz score with gamma= 0.1 n_clusters= 3 score:1979.77096092
Calinski-harabasz score with gamma= 0.1 n_clusters= 4 score:3154.01841219
Calinski-harabasz score with gamma= 0.1 n_clusters= 5 score:23410.63895
Calinski-harabasz score with gamma= 0.1 n_clusters= 6 score:19427.9618944
Calinski-harabasz score with gamma= 1 n_clusters= 3 score:687.787319232
Calinski-harabasz score with gamma= 1 n_clusters= 4 score:196.926294549
Calinski-harabasz score with gamma= 1 n_clusters= 5 score:23410.63895
Calinski-harabasz score with gamma= 1 n_clusters= 6 score:19384.9657724
Calinski-harabasz score with gamma= 10 n_clusters= 3 score:43.8197355672
Calinski-harabasz score with gamma= 10 n_clusters= 4 score:35.2149370067
Calinski-harabasz score with gamma= 10 n_clusters= 5 score:29.1784898767
Calinski-harabasz score with gamma= 10 n_clusters= 6 score:47.3799111856


The best n_clusters is 5, and the best Gaussian kernel parameter is 1 or 0.1.



Now let us see the clustering effect when we do not specify the optional n_clusters and only use the best gamma of 0.1. The code is as follows:






y_pred = SpectralClustering(gamma=0.1).fit_predict(X)
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, y_pred))


The output is:




Calinski-harabasz score 14950.4939717


As you can see, the score drops from about 23410 back to about 14950 when n_clusters is left at its default, so it is generally better to tune and select n_clusters explicitly as well.
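For completeness, a final fit with both tuned parameters (n_clusters=5 and gamma=0.1) would look like this, and its score should be close to the best value (about 23410) in the grid above:

from sklearn.cluster import SpectralClustering
from sklearn import metrics

# X is the 500-sample, 6-feature blob dataset generated at the start of this section.
y_pred = SpectralClustering(n_clusters=5, gamma=0.1).fit_predict(X)
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, y_pred))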






(Reprints are welcome; please indicate the source. Comments and discussion are welcome: [email protected])


