======================================================================

This series of posts is based mainly on the algorithm documentation on the official scikit-learn website, with some translation. If there are errors, please correct me.

Please credit the source when reprinting.

======================================================================

For an analysis of the K-means algorithm and a Python implementation, see the previous two posts:

"Machine Learning in Action": the K-means algorithm (K-means clustering)

"Machine learning Combat" binary-kmeans algorithm (binary K mean clustering)

Next I'll show you how to call the K-means algorithm through scikit-learn.

Note: this example uses a fixed value of k and fixed initial center points. That is a serious limitation and makes it easy to fall into a local optimum. You can use another algorithm (such as the canopy algorithm) to do a coarse clustering first, producing n clusters, and use that n as the k for K-means; this is not elaborated here.
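As a quick illustration of estimating k instead of fixing it blindly (using the elbow heuristic rather than canopy, which the note mentions but does not detail), one can fit K-means for several values of k and watch where the inertia (within-cluster sum of squares) stops improving sharply. This is only a sketch; the dataset and parameters here are made up for the demonstration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-means for a range of k and record the inertia for each
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 8)}

# Inertia always decreases as k grows; look for the "elbow"
# where the improvement flattens out.
for k in sorted(inertias):
    print(k, round(inertias[k], 1))
```

The k at the elbow of this curve is then a reasonable input for the fixed-k runs shown below.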

One: K-means Clustering algorithm

1: Introduction to the K-means algorithm

K-means is a clustering algorithm and one of the top-ten data-mining algorithms. It takes a parameter k and k initial cluster centers, and partitions the dataset into k clusters around those centers. The result is that objects within the same cluster are highly similar, while objects in different clusters have low similarity.

2: Idea and description of the K-means algorithm

Idea: place k center points in space, assign each object to the nearest center, and update each cluster's center iteratively.

Description:

(1) Choose k initial cluster centers appropriately.

(2) In each iteration, compute the distance from every sample to each of the k centers and assign the sample to the class of the nearest center.

(3) Update each class's center using the mean (or a similar statistic) of its members.

(4) If, after updating via (2) and (3), none of the k centers changes, the iteration ends; otherwise continue iterating.
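Steps (1)-(4) can be sketched directly in NumPy. This is a minimal illustration, assuming Euclidean distance and that no cluster ever becomes empty (a real implementation, like scikit-learn's, handles empty clusters and smarter initialization):

```python
import numpy as np

def kmeans(X, init_centers, max_iter=100):
    """Plain K-means following steps (1)-(4) above."""
    centers = np.asarray(init_centers, dtype=float)      # (1) initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # (2) distance from every sample to every center; assign to nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) update each center to the mean of its members
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # (4) stop when no center moves
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For example, `kmeans(np.array([[1., 1.], [2., 1.], [4., 3.], [5., 4.]]), [[1., 1.], [2., 1.]])` converges to the centers `(3/2, 1)` and `(9/2, 7/2)`.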

3: A brief worked example. Suppose there are four medicines A, B, C, D, each with two attributes, giving the points A(1,1), B(2,1), C(4,3), D(5,4).

Plotted on the coordinate axes, they look like this:

**First Iteration**: two points are chosen at random as the initial centers, e.g. c1 = A(1,1) and c2 = B(2,1). Compute the distance matrix D0 from each of the four points to the two centers (x and y are the horizontal and vertical coordinates). Point A is nearest to the first center, so it is marked 1 in the Group-1 row; the remaining three points are nearest to the second center, so they are marked 1 in the Group-2 row.

The centers are then updated: c1 is unchanged (its cluster contains only A), while c2 becomes ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3). The updated centers are shown as red dots in the figure.

**Second Iteration**: the distances from the four points to the centers are recomputed, now with (11/3, 8/3) as the second center:

Points A and B are now closest to the first center, while C and D are closest to the second, so A and B are marked 1 in the Group-1 row and C and D are marked 1 in the Group-2 row.

The centers are updated again: c1 = ((1+2)/2, (1+1)/2) = (3/2, 1), c2 = ((4+5)/2, (3+4)/2) = (9/2, 7/2). The updated centers are shown as red dots.

**Third Iteration**: the calculation proceeds the same way (details omitted). The centers no longer change, so the iteration stops. The final clusters are as follows:

4: An example using sklearn.cluster.KMeans (adapted from the scikit-learn examples). It uses sklearn.datasets.make_blobs to generate 1500 two-dimensional points and clusters them under several different conditions; the code is as follows:

<span style= "FONT-SIZE:18PX;" > #coding: Utf-8 "Created on 2016/4/25@author:gamer Think" "Import NumPy as NP #科学计算包import Matplotlib.pyplot as P Lt #python画图包from sklearn.cluster import kmeans #导入K-means algorithm package from sklearn.datasets import make_blobsplt.figure ( Figsize= () The Make_blobs function is a dataset that produces a dataset and the corresponding label N_samples: the number of data sample points, the default value 100n_features: The dimension that represents the data, the default value is 2centers : The center point of the resulting data, the default value 3CLUSTER_STD: The standard deviation of the dataset, the floating-point number, or the sequence of floating-point numbers, the default value 1.0center_box: The data boundary after the center is determined, the default value ( -10.0, 10.0) Shuffle: Wash the mess, The default value is truerandom_state: The official website explains the seeds of the random generator more parameters even if please refer to: http://scikit-learn.org/dev/modules/generated/ Sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs "n_samples = 1500random_state = 170X, y = make_blobs (n_ Samples=n_samples, random_state=random_state) # incorrect number of clustersy_pred = Kmeans (n_clusters=2, random_state= random_state). Fit_predict (X) plt.subplot (221) #在2图里添加子图1plt. Scatter (x[:, 0], x[:, 1], c=y_pred) #scatter绘制散点plt. Title ("Incorrect number of Blobs") #加标题 # anisotropicly Distributed DatatRansformation = [[0.60834549, -0.63667341], [ -0.40887718, 0.85253229]]x_aniso = Np.dot (X, transformation) #返回的是乘积的形式y_ pred = Kmeans (n_clusters=3, random_state=random_state). Fit_predict (X_aniso) plt.subplot (222) #在2图里添加子图2plt. Scatter ( X_aniso[:, 0], x_aniso[:, 1], c=y_pred) plt.title ("anisotropicly distributed Blobs") # Different variancex_varied, Y_ varied = Make_blobs (n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state) y_pred = Kmeans (n_clusters=3, random_state=random_state). 
Fit_predict (X_varied) PLT.SUBP Lot (223) #在2图里添加子图3plt. Scatter (x_varied[:, 0], x_varied[:, 1], c=y_pred) plt.title ("Unequal Variance") # unevenly sized blobsx_filtered = Np.vstack ((x[y = = 0][:500], X[y = = 1][:100], X[y = = 2][:10])) y_pred = Kmeans (n_clusters=3, random_state= random_state). Fit_predict (x_filtered) plt.subplot (224) #在2图里添加子图4plt. Scatter (x_filtered[:, 0], x_filtered[:, 1], c=y _pred) Plt.title ("unevenly sized Blobs") plt.show () #显示Figure </span>

Result diagram:

Two: Mini Batch K-means algorithm

The official scikit-learn website describes the Mini Batch K-means algorithm as follows:

Mini Batch K-means is a variant of the K-means algorithm that uses small subsets of the data to reduce computation time while still attempting to optimize the same objective function. The so-called mini-batches are subsets randomly sampled at each training iteration; training on these randomly drawn subsets greatly reduces the computation. Compared with the standard algorithm, convergence is much faster, and the results of Mini Batch K-means are generally only slightly worse.

The algorithm iterates between two steps:

1: Randomly draw some samples from the dataset to form a mini-batch and assign them to the nearest centroids

2: Update the centroids

Unlike K-means, which updates on the full dataset, the update happens on each small sample set. For each mini-batch, the samples are assigned to centroids and the centroids are updated by computing a (running) average. As the number of iterations increases, the centroid movements shrink, until the centroids stabilize or the specified number of iterations is reached, at which point computation stops.
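The two steps above can be driven explicitly with `MiniBatchKMeans.partial_fit`, which performs exactly one centroid update on the mini-batch it is given. This sketch hand-rolls the sampling loop for illustration; in practice a single call to `fit` does the batching internally (the batch size of 45 matches the comparison code below):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Sample data: 3000 2-D points in 3 clusters
X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.7, random_state=0)

mbk = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3)
rng = np.random.RandomState(0)
for _ in range(100):
    # Step 1: randomly draw a mini-batch of 45 samples
    batch = X[rng.choice(len(X), size=45, replace=False)]
    # Step 2: assign the batch to centroids and update them
    mbk.partial_fit(batch)

print(mbk.cluster_centers_)   # 3 centroids in 2 dimensions
```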

Mini Batch K-means converges faster than K-means, at the cost of somewhat lower clustering quality, though in practice the difference often barely shows.

The figure below shows an actual comparison of the results of K-means and Mini Batch K-means:

The code that produces the comparison above is as follows:

<span style= "FONT-SIZE:18PX;" > #coding: UTF8 "Created on 2016/4/26@author:gamer Think" "Import timeimport NumPy as Npimport matplotlib.pyplot as Pl Tfrom sklearn.cluster import Minibatchkmeans, kmeansfrom sklearn.metrics.pairwise import Pairwise_distances_ Argminfrom sklearn.datasets.samples_generator Import make_blobs################################################## ############################# Generate Sample datanp.random.seed (0) batch_size = 45centers = [[1, 1], [-1,-1], [1,-1]] #初 Starting with three centers n_clusters = Len (centers) #聚类的数目为3 # produces 3000 sets of two-d data, centered on the top three points, with ( -10,10) as the boundary, the standard deviation of the dataset is 0.7X, Labels_true = Make_blobs ( n_samples=3000, Centers=centers, cluster_std=0.7) ############################################################### ################ Compute Clustering with Meansk_means = Kmeans (init= ' k-means++ ', n_clusters=3, n_init=10) t0 = Time.time ( ) #当前时间k_means. Fit (X) #使用K-means to 3000 data set training algorithm time consumption t_batch = Time.time ()-t0######################################### ###################################### Compute Clustering with MINIBATCHKMEANSMBK = Minibatchkmeans (init= ' k-means++ ', n_clusters=3, Batch_size=batch_ Size, n_init=10, max_no_improvement=10, verbose=0) t0 = Time.time () mbk.fit (X) #使用MiniBatchKMeans to 3000 Time consumption of the data set training algorithm T_mini_batch = Time.time ()-t0####################################################################### ######## plot result# creates a drawing object and sets the width and height of the object, and if you do not create a direct call to plot, Matplotlib creates a drawing object directly fig = Plt.figure (figsize= (8, 3)) Fig.subplots_adjust (left=0.02, right=0.98, bottom=0.05, top=0.9) colors = [' #4EACC5 ', ' #FF9C34 ', ' #4E9A06 ']# We want to ha ve the same colors for the same cluster from the# Minibatchkmeans and the Kmeans algorithm. 
Let ' s pair the cluster centers per# closest one.k_means_cluster_centers = Np.sort (K_means.cluster_centers_, axis=0) mbk_ Means_cluster_centers = Np.sort (Mbk.cluster_centers_, axis=0) k_means_labels = Pairwise_distances_argmin (X, K_means_ cluster_centers) Mbk_means_labels = Pairwise_distances_argmin (X, Mbk_means_cluster_centers) Order = Pairwise_distances_argmin (K_means_cluster_centers, mbk_means_cluster_centers) # k Meansax = Fig.add_subplot (1, 3, 1) #add_subplot image is given as a row of three columns, the first for K, the col in Zip (range (n_clusters), colors): my_members = K_means_labels = = k Cluster_center = k_means_cluster_centers[k] Ax.plot (x[my_members, 0], x[my_members, 1], ' W ', Markerfacecolor=col, marker= '. ') Ax.plot (Cluster_center[0], cluster_center[1], ' o ', Markerfacecolor=col, markeredgecolor= ' K ', markersize=6) ax.se T_title (' Kmeans ') Ax.set_xticks (()) Ax.set_yticks (()) Plt.text ( -3.5, 1.8, ' Train time:%.2fs\ninertia:%f '% (T_batch, K _means.inertia_) # minibatchkmeansax = Fig.add_subplot (1, 3, 2) #add_subplot image is divided into three columns, the second for K, the col in Zip (range n_clust ers), colors): my_members = Mbk_means_labels = = Order[k] Cluster_center = mbk_means_cluster_centers[order[k]] ax. Plot (x[my_members, 0], x[my_members, 1], ' W ', Markerfacecolor=col, marker= '. ') Ax.ploT (Cluster_center[0], cluster_center[1], ' o ', Markerfacecolor=col, markeredgecolor= ' K ', markersize=6) ax.set_titl E (' Minibatchkmeans ') Ax.set_xticks (()) Ax.set_yticks (()) Plt.text ( -3.5, 1.8, ' Train time:%.2fs\ninertia:%f '% (t_mi Ni_batch, Mbk.inertia_) # initialise The different array to all Falsedifferent = (Mbk_means_labels = = 4) Ax = FIG.ADD_SUBPL OT (1, 3, 3) #add_subplot image is divided into three columns, the third for K in range (N_clusters): Different + = ((K_means_labels = = k)! 
= (Mbk_means_la BELs = = Order[k]) identic = Np.logical_not (different) ax.plot (x[identic, 0], x[identic, 1], ' W ', markerfacecolor= ' #b BBBBB ', marker= '. ') Ax.plot (x[different, 0], x[different, 1], ' W ', markerfacecolor= ' m ', marker= '. ') Ax.set_title (' difference ') ax.set_xticks (()) Ax.set_yticks (()) plt.show () </span>

For more information, please refer to the official website: http://scikit-learn.org/dev/modules/clustering.html#clustering

Scikit-learn notes: the K-means clustering algorithm and the Mini Batch K-means algorithm