K-means algorithm in detail (SSE, silhouette analysis)


So far we have covered a number of supervised learning algorithms for classification and regression. This article introduces unsupervised learning: clustering analysis of data that carries no class labels. Since we do not know the correct answer (the class label) for each sample, a clustering algorithm discovers the structure hidden in the data itself and groups the samples into clusters. The goal is for samples within a cluster to be highly similar and samples in different clusters to be dissimilar, somewhat like the LDA dimensionality-reduction criterion of minimizing within-class variance while maximizing between-class variance. This article covers:

1. The K-means algorithm
2. k-means++
3. Hard clustering and soft clustering
4. Performance evaluation of clustering algorithms

1. K-means algorithm

K-means is one of the most popular and widely used clustering algorithms, because it is easy to implement and computationally efficient. Clustering itself has a very wide range of applications, including grouping documents, music, and movies by type, segmenting users by purchase behavior, and building recommendation systems based on user interests.

The K-means algorithm consists of four main steps:

1. Randomly pick K sample points from the sample set as the initial cluster centers.

2. Assign each sample point to the cluster whose center is nearest to it.

3. Recompute each cluster's center as the mean of all sample points assigned to that cluster.

4. Repeat steps 2 and 3 until the cluster centers no longer change, the maximum number of iterations is reached, or the change falls within the tolerance.

The most common distance metric is the squared Euclidean distance:

$$d(x, y)^2 = \sum_{j=1}^{n} (x_j - y_j)^2$$

where x and y are two different samples and n is the number of dimensions (features) of a sample. Based on the squared Euclidean distance, the optimization problem that K-means solves is to minimize the within-cluster sum of squared errors (SSE), also called cluster inertia:

$$SSE = \sum_{i=1}^{m} \sum_{j=1}^{k} w^{(i,j)} \, \big\lVert x^{(i)} - \mu^{(j)} \big\rVert^2$$

where $\mu^{(j)}$ is the center of cluster j, and $w^{(i,j)} = 1$ if sample $x^{(i)}$ belongs to cluster j and 0 otherwise.
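To make the four steps and the SSE objective concrete, here is a minimal from-scratch sketch in NumPy (the helper name kmeans_numpy is hypothetical; this is not the sklearn implementation used below, and it does not handle empty clusters). It assumes a data matrix X with one sample per row.

import numpy as np

def kmeans_numpy(X, k, max_iter=300, tol=1e-4, random_state=0):
    rng = np.random.RandomState(random_state)
    # step 1: pick k samples at random as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each sample to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each center to the mean of the samples assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centers barely move
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    # within-cluster sum of squared errors (SSE)
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse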


The following example uses sklearn to apply K-means. The dataset contains 150 randomly generated points that fall into three distinct clusters.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt


if __name__ == "__main__":
    '''
    n_samples: number of sample points
    n_features: number of features per sample
    centers: number of cluster centers
    cluster_std: standard deviation of each cluster
    '''
    x, y = make_blobs(n_samples=150, n_features=2, centers=3,
                      cluster_std=0.5, shuffle=True, random_state=0)
    # plot the points
    plt.scatter(x[:, 0], x[:, 1], marker="o", color="blue")
    # show a grid
    plt.grid()
    plt.show()

The 150 sample points are distributed as shown above. Next, the sklearn built-in KMeans implementation is used to cluster them.

    from sklearn.cluster import KMeans
    '''
    n_clusters: number of clusters
    init: "random" means plain K-means initialization; the default is "k-means++"
    n_init: number of runs with different initial centers
    max_iter: maximum number of iterations
    tol: tolerance on the within-cluster SSE used to declare convergence
    '''
    km = KMeans(n_clusters=3, init="random", n_init=10, max_iter=300,
                tol=1e-04, random_state=0)
    y_km = km.fit_predict(x)
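
After fitting, the learned cluster centers and the within-cluster SSE can be inspected through the fitted estimator's attributes (a small usage sketch, continuing with the km and x defined above):

    # coordinates of the three cluster centers
    print(km.cluster_centers_)
    # within-cluster sum of squared errors (SSE), also called inertia
    print(km.inertia_)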
2. k-means++

The K-means algorithm starts from randomly chosen initial centers; if these centers are poorly placed, the clustering may be poor or convergence may be slow. One way to address this is to run K-means several times on the dataset and keep the model with the lowest within-cluster SSE. Another is the k-means++ algorithm, which places the initial centers as far apart from each other as possible and generally produces a better model than plain K-means.

k-means++ initialization consists of the following steps (a minimal sketch follows the list):

1. Initialize an empty set M to store the K selected centers.

2. Randomly select the first center μ from the input samples and add it to M.

3. For every sample x not in M, compute d(x, M), the distance from x to its nearest already-chosen center.

4. Randomly select the next center μ using a weighted probability distribution in which each sample is chosen with probability proportional to d(x, M)².

5. Repeat steps 3 and 4 until K centers have been selected.

6. Run the standard K-means algorithm from the selected centers.
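
Here is a minimal NumPy sketch of this seeding procedure (the helper name kmeans_pp_init is hypothetical; it is not the sklearn implementation). It assumes a data matrix X with one sample per row.

import numpy as np

def kmeans_pp_init(X, k, random_state=0):
    rng = np.random.RandomState(random_state)
    # step 2: pick the first center uniformly at random
    centers = [X[rng.randint(len(X))]]
    for _ in range(k - 1):
        # step 3: squared distance from each sample to its nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # step 4: choose the next center with probability proportional to d(x, M)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)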

To use k-means++ with sklearn, simply set the init parameter to "k-means++"; this is in fact the default. The code below uses k-means++ to cluster the three blobs generated above.

    km = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300,
                tol=1e-04, random_state=0)
    # y_km holds the cluster assignment of each sample
    y_km = km.fit_predict(x)
    # plot the points of each cluster
    plt.scatter(x[y_km == 0, 0], x[y_km == 0, 1], s=50, c="orange", marker="o", label="cluster 1")
    plt.scatter(x[y_km == 1, 0], x[y_km == 1, 1], s=50, c="green", marker="s", label="cluster 2")
    plt.scatter(x[y_km == 2, 0], x[y_km == 2, 1], s=50, c="blue", marker="^", label="cluster 3")
    # plot the cluster centers
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=250, marker="*", c="red",
                label="cluster centers")
    plt.legend()
    plt.grid()
    plt.show()


The figure above shows that k-means++ clusters the data well: the cluster centers lie roughly at the middle of each ball of points. In practice, when the dimensionality of the samples is too high to visualize, choosing the number of clusters can be difficult. K-means also assumes that clusters do not overlap and are not nested, and that every cluster contains at least one sample.

Note: Because K-means is based on Euclidean distance, it is sensitive to the scale of the features. Before running K-means, the data should therefore be standardized so that no feature dominates simply because of its units.
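
A minimal sketch of such standardization with sklearn's StandardScaler (x_raw is a hypothetical unscaled feature matrix, standing in for whatever raw data is being clustered):

    from sklearn.preprocessing import StandardScaler
    # scale each feature to zero mean and unit variance before clustering
    x_std = StandardScaler().fit_transform(x_raw)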

3. Hard clustering and soft clustering

Hard clustering means that each sample in the dataset is assigned to exactly one cluster, as in the K-means algorithm. Soft clustering (also called fuzzy clustering) allows a sample to belong to several clusters to different degrees, as in the fuzzy C-means (FCM) algorithm.

The computation of FCM is similar to that of K-means, except that FCM replaces the hard cluster assignment of K-means with the probability (membership degree) that a sample belongs to each cluster; for every sample, these probabilities sum to 1.

FCM proceeds as follows:

1. Specify the number of clusters K and randomly assign each sample point to a cluster.

2. Compute the center μ of each cluster.

3. Update the membership degrees (cluster probabilities) of every sample.

4. Repeat steps 2 and 3 until the membership degrees no longer change, or the tolerance or the maximum number of iterations is reached.

The membership degree is computed as follows:

$$w^{(i,j)} = \left[ \sum_{c=1}^{k} \left( \frac{\lVert x^{(i)} - \mu^{(j)} \rVert}{\lVert x^{(i)} - \mu^{(c)} \rVert} \right)^{\frac{2}{m-1}} \right]^{-1}$$

where $w^{(i,j)}$ is the probability (membership degree) that sample $x^{(i)}$ belongs to cluster j, and k is the number of clusters (3 in the running example). The fuzzifier m is greater than 1 and is usually set to 2; it controls how fuzzy the clustering is.
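
A minimal NumPy sketch of one membership update under this formula (the helper name fcm_memberships is hypothetical; X holds one sample per row, centers one center per row, and m is the fuzzifier):

import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # distance from every sample to every cluster center, shape (n_samples, k)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = np.fmax(dist, 1e-12)  # avoid division by zero for samples sitting on a center
    # w[i, j] = 1 / sum_c (dist[i, j] / dist[i, c]) ** (2 / (m - 1))
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    w = 1.0 / ratio.sum(axis=2)
    return w  # each row sums to 1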

A single iteration of FCM is more expensive than a single iteration of K-means, but FCM usually needs fewer iterations to converge.

4. Performance evaluation of clustering algorithms

1. Within-cluster sum of squared errors (SSE)

When partitioning the data into clusters, SSE is used as the objective function. After a KMeans model has been fitted, the within-cluster SSE is available through its inertia_ attribute, so it does not need to be computed again.

    # store the SSE obtained for each number of clusters
    distortions = []
    for i in range(1, 11):
        km = KMeans(n_clusters=i, init="k-means++", n_init=10, max_iter=300,
                    tol=1e-4, random_state=0)
        km.fit(x)
        # the SSE of the fitted K-means model
        distortions.append(km.inertia_)
    # plot SSE against the number of clusters
    plt.plot(range(1, 11), distortions, marker="o")
    plt.xlabel("Number of clusters")
    plt.ylabel("Within-cluster SSE")
    plt.show()

Plotting the within-cluster SSE against the number of clusters, the so-called elbow method, makes it easy to see how K affects the SSE.


The figure above shows an elbow at three clusters, which indicates that K = 3 is a good choice.

2. Quantitative analysis of cluster quality with silhouette plots

Silhouette analysis is a graphical tool for measuring how tightly the samples in each cluster are grouped; it applies not only to K-means but to other clustering algorithms as well. The silhouette coefficient of a sample is computed in three steps:

1. Compute the cohesion a as the average distance between the sample x and all other points in its own cluster.

2. Compute the separation b as the average distance between the sample x and the points in the nearest other cluster.

3. Divide the difference between separation and cohesion by the larger of the two to obtain the silhouette coefficient:

$$s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\{b^{(i)}, a^{(i)}\}}$$

The silhouette coefficient ranges from -1 to 1. It is 0 when the cohesion equals the separation (b = a), and it approaches 1 when b >> a, in which case the model performs best.
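
As a quick numeric check before drawing the full silhouette plot, the average silhouette coefficient can be obtained directly with sklearn (a small sketch, assuming the x and y_km produced by the k-means++ fit above):

    from sklearn.metrics import silhouette_score
    # average silhouette coefficient over all samples, based on Euclidean distance
    print(silhouette_score(x, y_km, metric="euclidean"))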

    km = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300,
                tol=1e-4, random_state=0)
    y_km = km.fit_predict(x)

    import numpy as np
    from matplotlib import cm
    from sklearn.metrics import silhouette_samples

    # labels of the clusters
    cluster_labels = np.unique(y_km)
    # number of clusters
    n_clusters = cluster_labels.shape[0]
    # silhouette coefficients based on Euclidean distance
    silhouette_vals = silhouette_samples(x, y_km, metric="euclidean")
    # starting position on the y axis
    y_ax_lower, y_ax_upper = 0, 0
    yticks = []
    for i, c in enumerate(cluster_labels):
        # silhouette coefficients of the samples in cluster c
        c_silhouette_vals = silhouette_vals[y_km == c]
        # sort them from small to large
        c_silhouette_vals.sort()
        y_ax_upper += len(c_silhouette_vals)
        # pick a different color for each cluster
        color = cm.jet(i / n_clusters)
        # draw a horizontal bar chart
        plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals,
                 height=1.0, edgecolor="none", color=color)
        # position of the y-axis tick for this cluster
        yticks.append((y_ax_lower + y_ax_upper) / 2)
        # starting position of the next cluster
        y_ax_lower += len(c_silhouette_vals)
    # average silhouette coefficient
    silhouette_avg = np.mean(silhouette_vals)
    # dashed vertical line at the average silhouette coefficient
    plt.axvline(silhouette_avg, color="red", linestyle="--")
    # label the y-axis ticks with the cluster numbers
    plt.yticks(yticks, cluster_labels + 1)
    plt.ylabel("Cluster")
    plt.xlabel("Silhouette coefficient")
    plt.show()


From the silhouette plot we can read off the size of each cluster and spot potential outliers. To evaluate the clustering model as a whole, we can use the average silhouette coefficient, shown as the red dashed line in the plot.





