I'm a beginner and this is my first blog post. Experts may want to skip it, and if you don't like it, please don't flame me.
Recently my company asked for some machine learning content, so I have been reading related material, most recently the book Machine Learning in Action. It is a good book and well worth reading.
OK, enough chatter; let's get into today's topic.
The K-Means Algorithm
1. k-means is a clustering algorithm.
What is clustering? In clustering, you do not know in advance how many categories there are or what those categories look like. The computer groups the data into several categories based only on the features of the data; the categories are not defined beforehand. This kind of classification is also called unsupervised classification.
2. The idea of the algorithm
Randomly select K initial points from the dataset as centroids. Then assign each point in the dataset to a cluster: for each point, find the nearest centroid and assign the point to that centroid's cluster. After this step, update each cluster's centroid to the mean of all points in the cluster. Since the centroids have now moved, repeat these steps until the centroids no longer change.
3. Python implementation
# coding=utf-8
from numpy import *
import matplotlib
import matplotlib.pyplot as plt
import operator
from os import listdir
import time
# Initialize the dataset
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0],
                   [0, 0], [0, 0.1],
                   [2, 1.0], [2.1, 0.9],
                   [0.3, 0.0], [1.1, 0.9],
                   [2.2, 1.0], [2.1, 0.8],
                   [3.3, 3.5], [2.1, 0.9],
                   [2, 1.0], [2.1, 0.9],
                   [3.5, 3.4], [3.6, 3.5]
                   ])
    return group
# A plotting helper I wrote myself: each call draws the points of the initial dataset (in red) plus the points to be observed.
def show(data, color=None):
    if not color:
        color = 'green'
    group = createDataSet()
    fig = plt.figure(1)
    axes = fig.add_subplot(111)
    axes.scatter(group[:, 0], group[:, 1], s=40, c='red')
    axes.scatter(data[:, 0], data[:, 1], s=50, c=color)
    plt.show()
The code below is taken from Machine Learning in Action; I have added my own comments explaining my understanding of it.
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance between the two points
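As a quick sanity check, here is a standalone NumPy version of the same distance computation (same logic, just with a namespaced import and my own lowercase names):

```python
import numpy as np

def dist_eclud(vec_a, vec_b):
    # Euclidean distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum(np.power(vec_a - vec_b, 2)))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(dist_eclud(a, b))  # 3-4-5 right triangle, so the distance is 5.0
```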
Next, the initial centroids are chosen at random. To make sure every random centroid falls inside the range of the data, each coordinate is drawn uniformly between that dimension's minimum and maximum value in the dataset.
def randCent(dataSet, k):  # generate k random initial centroids within the data's bounds
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # create the centroid mat
    for j in range(n):  # create random cluster centers, within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
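To see that these random centroids really do stay inside the data's bounding box, here is a small standalone check (a rewrite of the same idea using plain NumPy arrays; the names are my own):

```python
import numpy as np

def rand_cent(data, k):
    # one uniform random draw per dimension, between that column's min and max
    n = data.shape[1]
    centroids = np.zeros((k, n))
    for j in range(n):
        min_j = data[:, j].min()
        range_j = float(data[:, j].max() - min_j)
        centroids[:, j] = min_j + range_j * np.random.rand(k)
    return centroids

data = np.array([[1.0, 1.1], [0.0, 0.0], [2.0, 1.0], [3.5, 3.4]])
cents = rand_cent(data, 3)
inside = (cents >= data.min(axis=0)).all() and (cents <= data.max(axis=0)).all()
print(inside)  # True: every centroid lies within the per-dimension bounds
```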
shape(dataSet) is a NumPy function that returns the dimensions of an n-dimensional array in order. For example, if dataSet = [[1, 2, 3], [2, 4, 5]], then shape(dataSet) is (2, 3).
The mat function converts an array into a NumPy matrix.
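A quick check of both behaviors (plain NumPy, nothing specific to the book's code):

```python
import numpy as np

data_set = np.array([[1, 2, 3], [2, 4, 5]])
print(np.shape(data_set))  # (2, 3): 2 rows, 3 columns

m = np.mat(np.zeros((2, 3)))  # mat wraps the array in NumPy's matrix type
print(type(m).__name__)       # matrix
print(m.shape)                # (2, 3)
```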
Next comes the kMeans function itself.
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):  # dataSet is the data matrix to cluster, k is the number of clusters
    m = dataSet.shape[0]  # number of points in the training set
    clusterAssment = zeros((m, 2))  # result array; one row per point in the training set
    centroids = createCent(dataSet, k)
    # print centroids
    # show(centroids)  # plot the training data distribution with matplotlib
    clusterChanged = True  # flag used to detect whether any assignment changed after each pass
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            point = dataSet[i, :]  # iterate over each point in the dataset
            minDist = inf  # inf is positive infinity
            minIndex = -1  # index of the nearest centroid found so far; start with an invalid negative value
            for n in range(k):
                heart = centroids[n, :]  # iterate over each centroid
                distance = distMeas(point, heart)  # distance from the point to this centroid
                if distance < minDist:  # if this centroid is closer than the current minimum
                    minDist = distance  # update the minimum distance
                    minIndex = n  # update the index of the nearest centroid
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True  # the point's nearest centroid changed, so keep iterating
            clusterAssment[i, :] = minIndex, minDist ** 2  # store (nearest centroid index, squared distance to that centroid)
        # print clusterAssment
        for cent in range(k):  # update the position of each centroid
            ptsInClust = dataSet[clusterAssment[:, 0] == cent]  # take the points assigned to this centroid
            # print ptsInClust
            if len(ptsInClust):
                centroids[cent, :] = mean(ptsInClust, axis=0)  # move the centroid to the mean of its points
            else:
                centroids[cent, :] = array([[0, 0]])
        show(centroids, color='green')  # plot how the centroids moved after this pass
        # print centroids
        # print "----------------"
        # show(centroids)
    return centroids, clusterAssment  # return the centroids and each point's assignment

centroids, clusterAssment = kMeans(createDataSet(), 4)  # cluster into 4 classes
show(centroids, color='yellow')  # show the clustering result as yellow points
(Figure: the initial dataset and the randomly selected initial centroids.)
(Figure: one of the better clustering results.)
(Figure: running it a few more times also produces unsatisfactory results like the following.)
That is the concrete implementation of k-means. The method is not entirely satisfactory: run it many times and you will find that the clustering is unstable, and it does not always produce the result we want. This is because the algorithm only reaches a local optimum, not a guaranteed global optimum; we cannot guarantee that the initial centroids we pick are good ones. There are improvements, of course, and in the next post I will cover the bisecting k-means algorithm.
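Before getting to bisecting k-means, one simple mitigation worth knowing (my own sketch, not from the book) is to run k-means several times with different random initial centroids and keep the run with the lowest total squared error (SSE):

```python
import numpy as np

def kmeans_once(data, k, rng, n_iter=100):
    # initial centroids: k distinct points sampled from the data itself
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # squared distance from every point to every centroid, shape (m, k)
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
        new_centroids = np.array([
            data[labels == c].mean(axis=0) if (labels == c).any() else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    sse = dists.min(axis=1).sum()  # total squared error of the final assignment
    return centroids, labels, sse

def kmeans_restarts(data, k, n_restarts=10, seed=0):
    # run several times and keep the run with the lowest SSE
    rng = np.random.default_rng(seed)
    return min((kmeans_once(data, k, rng) for _ in range(n_restarts)),
               key=lambda r: r[2])

data = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1],
                 [2, 1.0], [2.1, 0.9], [3.3, 3.5], [3.5, 3.4]])
cents, labels, sse = kmeans_restarts(data, 3)
print(sse)
```

This does not remove the local-optimum problem, it just makes a bad run less likely; bisecting k-means attacks the initialization problem more systematically.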
K-means Algorithm Analysis