I'm a beginner and this is my first blog post. Experts may want to skip it, and if you don't like it, please don't flame me.
Recently my company asked for some machine learning content, so I have been reading related material, most recently the book Machine Learning in Action. It is a good book and well worth reading.
OK, enough chatter; let's get into today's topic.
The K-Means Algorithm
1. k-means is a clustering algorithm.
What is clustering? In clustering, you do not know in advance how many categories there are or what those categories look like. The computer groups the data into several categories based only on the features of the data; the categories are not defined beforehand. This kind of classification is also called unsupervised classification.
2. The idea of the algorithm
Randomly select K initial points from the dataset as centroids. Then assign each point in the dataset to a cluster: for each point, find the nearest centroid and assign the point to that centroid's cluster. After this step, update each cluster's centroid to the mean of all points in the cluster. Since the centroids have now moved, repeat these steps until the centroids no longer change.
3. Python implementation
# coding=utf-8
from numpy import *
import matplotlib
import matplotlib.pyplot as plt
import operator
from os import listdir
import time
# Initialize the dataset
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0],
                   [0, 0], [0, 0.1],
                   [2, 1.0], [2.1, 0.9],
                   [0.3, 0.0], [1.1, 0.9],
                   [2.2, 1.0], [2.1, 0.8],
                   [3.3, 3.5], [2.1, 0.9],
                   [2, 1.0], [2.1, 0.9],
                   [3.5, 3.4], [3.6, 3.5]
                   ])
    return group
# A plotting helper I wrote myself: each call draws the points of the initial dataset (in red) plus the points to be observed.
def show(data, color=None):
    if not color:
        color = 'green'
    group = createDataSet()
    fig = plt.figure(1)
    axes = fig.add_subplot(111)
    axes.scatter(group[:, 0], group[:, 1], s=40, c='red')
    axes.scatter(data[:, 0], data[:, 1], s=50, c=color)
    plt.show()
The code below is taken from Machine Learning in Action; I have added my own comments explaining my understanding of it.
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance between the two points
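As a quick sanity check, here is a standalone NumPy version of the same distance computation (same logic, just with a namespaced import and my own lowercase names):

```python
import numpy as np

def dist_eclud(vec_a, vec_b):
    # Euclidean distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum(np.power(vec_a - vec_b, 2)))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(dist_eclud(a, b))  # 3-4-5 right triangle, so the distance is 5.0
```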
Next, the initial centroids are chosen at random. To make sure every random centroid falls inside the range of the data, each coordinate is drawn uniformly between that dimension's minimum and maximum value in the dataset.
def randCent(dataSet, k):  # generate k random initial centroids within the data's bounds
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # create the centroid mat
    for j in range(n):  # create random cluster centers, within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
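To see that these random centroids really do stay inside the data's bounding box, here is a small standalone check (a rewrite of the same idea using plain NumPy arrays; the names are my own):

```python
import numpy as np

def rand_cent(data, k):
    # one uniform random draw per dimension, between that column's min and max
    n = data.shape[1]
    centroids = np.zeros((k, n))
    for j in range(n):
        min_j = data[:, j].min()
        range_j = float(data[:, j].max() - min_j)
        centroids[:, j] = min_j + range_j * np.random.rand(k)
    return centroids

data = np.array([[1.0, 1.1], [0.0, 0.0], [2.0, 1.0], [3.5, 3.4]])
cents = rand_cent(data, 3)
inside = (cents >= data.min(axis=0)).all() and (cents <= data.max(axis=0)).all()
print(inside)  # True: every centroid lies within the per-dimension bounds
```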
shape(dataSet) is a NumPy function that returns the dimensions of an n-dimensional array in order. For example, if dataSet = [[1, 2, 3], [2, 4, 5]], then shape(dataSet) is (2, 3).
The mat function converts an array into a NumPy matrix.
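A quick check of both behaviors (plain NumPy, nothing specific to the book's code):

```python
import numpy as np

data_set = np.array([[1, 2, 3], [2, 4, 5]])
print(np.shape(data_set))  # (2, 3): 2 rows, 3 columns

m = np.mat(np.zeros((2, 3)))  # mat wraps the array in NumPy's matrix type
print(type(m).__name__)       # matrix
print(m.shape)                # (2, 3)
```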
Next comes the kMeans function itself.
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):  # dataSet is the data matrix to cluster, k is the number of clusters
    m = dataSet.shape[0]  # number of points in the training set
    clusterAssment = zeros((m, 2))  # result array; one row per point in the training set
    centroids = createCent(dataSet, k)
    # print centroids
    # show(centroids)  # plot the training data distribution with matplotlib
    clusterChanged = True  # flag used to detect whether any assignment changed after each pass
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            point = dataSet[i, :]  # iterate over each point in the dataset
            minDist = inf  # inf is positive infinity
            minIndex = -1  # index of the nearest centroid found so far; start with an invalid negative value
            for n in range(k):
                heart = centroids[n, :]  # iterate over each centroid
                distance = distMeas(point, heart)  # distance from the point to this centroid
                if distance < minDist:  # if this centroid is closer than the current minimum
                    minDist = distance  # update the minimum distance
                    minIndex = n  # update the index of the nearest centroid
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True  # the point's nearest centroid changed, so keep iterating
            clusterAssment[i, :] = minIndex, minDist ** 2  # store (nearest centroid index, squared distance to that centroid)
        # print clusterAssment
        for cent in range(k):  # update the position of each centroid
            ptsInClust = dataSet[clusterAssment[:, 0] == cent]  # take the points assigned to this centroid
            # print ptsInClust
            if len(ptsInClust):
                centroids[cent, :] = mean(ptsInClust, axis=0)  # move the centroid to the mean of its points
            else:
                centroids[cent, :] = array([[0, 0]])
        show(centroids, color='green')  # plot how the centroids moved after this pass
        # print centroids
        # print "----------------"
        # show(centroids)
    return centroids, clusterAssment  # return the centroids and each point's assignment

centroids, clusterAssment = kMeans(createDataSet(), 4)  # cluster into 4 classes
show(centroids, color='yellow')  # show the clustering result as yellow points
(Figure: the initial dataset and the randomly selected initial centroids.)
(Figure: one of the better clustering results.)
(Figure: running it a few more times also produces unsatisfactory results like the following.)
That is the concrete implementation of k-means. The method is not entirely satisfactory: run it many times and you will find that the clustering is unstable, and it does not always produce the result we want. This is because the algorithm only reaches a local optimum, not a guaranteed global optimum; we cannot guarantee that the initial centroids we pick are good ones. There are improvements, of course, and in the next post I will cover the bisecting k-means algorithm.
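Before getting to bisecting k-means, one simple mitigation worth knowing (my own sketch, not from the book) is to run k-means several times with different random initial centroids and keep the run with the lowest total squared error (SSE):

```python
import numpy as np

def kmeans_once(data, k, rng, n_iter=100):
    # initial centroids: k distinct points sampled from the data itself
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # squared distance from every point to every centroid, shape (m, k)
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
        new_centroids = np.array([
            data[labels == c].mean(axis=0) if (labels == c).any() else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    sse = dists.min(axis=1).sum()  # total squared error of the final assignment
    return centroids, labels, sse

def kmeans_restarts(data, k, n_restarts=10, seed=0):
    # run several times and keep the run with the lowest SSE
    rng = np.random.default_rng(seed)
    return min((kmeans_once(data, k, rng) for _ in range(n_restarts)),
               key=lambda r: r[2])

data = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1],
                 [2, 1.0], [2.1, 0.9], [3.3, 3.5], [3.5, 3.4]])
cents, labels, sse = kmeans_restarts(data, 3)
print(sse)
```

This does not remove the local-optimum problem, it just makes a bad run less likely; bisecting k-means attacks the initialization problem more systematically.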
K-means Algorithm Analysis