Distance-based clustering method--k-means


k-means partitions the data into K clusters so that the within-cluster squared error is minimized. It is suited to discovering convex-shaped clusters in which the separation between clusters is clear and the clusters are of similar size.

Advantages

The algorithm is simple, fast, efficient, and scalable to large data sets. Its time complexity is O(n*k*t), where n is the number of points, k the number of clusters, and t the number of iterations; this is close to linear in n, which makes it suitable for mining large data sets.

Disadvantages

The value of K is difficult to estimate in advance, and the choice of the initial cluster centers has a strong influence on the clustering result. The algorithm often stops at a local optimum, and it is sensitive to noise and outliers.

"Algorithmic Process"

Input: K, data

1) Select K points as the initial centroids;

2) Calculate the distance from every remaining point to each centroid and assign each point to the cluster of its nearest centroid;

3) Recalculate the centroid of each cluster;

4) Repeat steps 2) and 3) until the distance between each new centroid and the previous one falls below a specified threshold, or the iteration limit is reached.

"Optimization Goals"

The basic assumption of this kind of clustering is that, for each cluster, a center point can be chosen such that every point in the cluster is closer to this center than to the centers of the other clusters. Data obtained in practice cannot be guaranteed to always satisfy this constraint, but it is usually the best result we can aim for, and the remaining errors are usually unavoidable, inherent either to the data or to the problem itself.

Based on the above assumption, when N data points are to be divided into K clusters, k-means minimizes the objective function

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2,

where r_{nk} is 1 when data point x_n is assigned to cluster k and 0 otherwise, and \mu_k is the center of the k-th cluster.

Finding the minimum of J directly is not easy, but an iterative approach works. First fix the centers \mu_k and choose the optimal assignments r_{nk}: it is easy to see that J is minimized by assigning each data point to its nearest center. Then fix the assignments r_{nk} and find the optimal centers \mu_k: taking the derivative of J with respect to \mu_k and setting it to zero shows that J is minimized when

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}},

that is, \mu_k should be the mean of the data points currently assigned to cluster k. Since each of the two alternating steps minimizes J, the value of J can only decrease (or stay unchanged) and never increase, which guarantees that k-means eventually converges to a minimum. K-means is not guaranteed to find the global optimum, but for a problem of this kind, an algorithm as cheap as k-means producing such a result is already very good.
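As a quick numerical check of the objective, the short R sketch below evaluates J for a clustering produced by the built-in kmeans on the Iris data (used later in this post). The helper name J is ours; the value it returns should match the tot.withinss field that kmeans reports.

# J = sum over all points of the squared distance to their assigned center
J <- function(data, cluster, centers) {
  sum((as.matrix(data) - as.matrix(centers)[cluster, ])^2)
}
fit <- kmeans(iris[, -5], 3)
J(iris[, -5], fit$cluster, fit$centers)   # same value as fit$tot.withinss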

"Kmeans Method Application Example-Iris data"

# Basic inspection of the data
dim(iris)                                   # check the dimensions
names(iris)                                 # show the column names
str(iris)                                   # view the internal structure
attributes(iris)                            # show the attributes of the data set
head(iris, 10)                              # first ten rows
iris[1:10, ]
iris[1:5, "Petal.Width"]                    # Petal.Width of the first five rows
iris$Petal.Width[1:5]
summary(iris)                               # distribution of every variable
table(iris$Species)                         # frequency table of Species
pie(table(iris$Species))                    # pie chart of the Species column
hist(iris$Sepal.Length)                     # histogram of Sepal.Length
plot(density(iris$Sepal.Length))            # density plot of Sepal.Length
plot(iris$Sepal.Length, iris$Petal.Length)  # scatter plot of Sepal.Length vs Petal.Length
var(iris$Sepal.Length)                      # variance of Sepal.Length
cov(iris$Sepal.Length, iris$Petal.Length)   # covariance of Sepal.Length and Petal.Length
cor(iris$Sepal.Length, iris$Petal.Length)   # correlation of Sepal.Length and Petal.Length

# k-means clustering: drop the class label (the fifth column) before clustering
newiris <- iris[, -5]
# equivalent: newiris <- iris; newiris$Species <- NULL
kc <- kmeans(newiris, 3)                    # divide the data into three clusters
table(iris$Species, kc$cluster)             # how each species is distributed over the clusters
# scatter plot with Sepal.Length on the x axis and Sepal.Width on the y axis, colored by cluster
plot(newiris[c("Sepal.Length", "Sepal.Width")], col = kc$cluster)
# mark the center of each cluster in the scatter plot
points(kc$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)




"R Language Implementation K-means"

The original code comes from:

Http://weibo.com/p/23041875063b660102vi10

my_kmeans <- function(data, k, max.iter = 10) {
  # numbers of rows and columns
  rows <- nrow(data)
  cols <- ncol(data)
  # iteration counter
  iter <- 0
  # indexMatrix: column 1 holds the cluster of each point,
  # column 2 holds the distance from each point to its cluster center (initially Inf)
  indexMatrix <- matrix(0, nrow = rows, ncol = 2)
  indexMatrix[, 2] <- Inf
  # centers stores the cluster centers, one row per cluster
  centers <- matrix(0, nrow = k, ncol = cols)
  # pick k random rows of the sample as the initial cluster centers
  randIdx <- sample(1:rows, size = k)
  for (i in 1:k) {
    indexMatrix[randIdx[i], 1] <- i
    indexMatrix[randIdx[i], 2] <- 0
    centers[i, ] <- as.numeric(data[randIdx[i], ])
  }
  # changed marks whether any point switched cluster during the last pass
  changed <- TRUE
  while (changed) {
    if (iter >= max.iter) break
    changed <- FALSE
    # for every point, compute the distance to each cluster center
    # and assign the point to the nearest cluster
    for (i in 1:rows) {
      previousCluster <- indexMatrix[i, 1]   # cluster the point belonged to before
      indexMatrix[i, 2] <- Inf
      for (j in 1:k) {
        # distance from point i to the center of cluster j
        currentDistance <- sqrt(sum((as.numeric(data[i, ]) - centers[j, ])^2))
        # if cluster j is closer, reassign the point to cluster j
        if (currentDistance < indexMatrix[i, 2]) {
          indexMatrix[i, 1] <- j
          indexMatrix[i, 2] <- currentDistance
        }
      }
      # if the point changed cluster, the algorithm has to keep iterating
      if (previousCluster != indexMatrix[i, 1]) changed <- TRUE
    }
    # recompute the cluster centers
    for (m in 1:k) {
      # all data belonging to the m-th cluster
      clusterMatrix <- as.matrix(data[indexMatrix[, 1] == m, , drop = FALSE])
      # update the center of the m-th cluster if it is not empty
      if (nrow(clusterMatrix) > 0) {
        centers[m, ] <- colMeans(clusterMatrix)
      }
    }
    iter <- iter + 1
  }
  # cluster: the cluster label of every sample
  cluster <- indexMatrix[, 1]
  # centers: the cluster centers, with the original column names
  centers <- data.frame(centers)
  names(centers) <- names(data)
  # withinss: within-cluster sum of squares for each cluster
  ss <- function(x) sum(scale(x, scale = FALSE)^2)
  withinss <- sapply(split(as.data.frame(data), indexMatrix[, 1]), ss)
  # tot.withinss: total within-cluster sum of squares
  tot.withinss <- sum(withinss)
  # betweenss: between-cluster sum of squares
  betweenss <- ss(centers[indexMatrix[, 1], ])
  size <- as.numeric(table(cluster))
  # return the cluster labels, centers, sums of squares, cluster sizes, and iteration count
  result <- list(cluster = cluster, centers = centers, withinss = withinss,
                 tot.withinss = tot.withinss, betweenss = betweenss,
                 size = size, iter = iter)
  return(result)
}
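A minimal usage sketch, assuming the cleaned-up my_kmeans above has been sourced into the session; it reuses the Iris data from the earlier example.

newiris <- iris[, -5]                   # drop the Species label as before
fit <- my_kmeans(newiris, k = 3, max.iter = 20)
table(iris$Species, fit$cluster)        # compare the clusters with the true species
fit$centers                             # the three estimated cluster centers
fit$tot.withinss                        # total within-cluster sum of squares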




Improvements


1) Methods of selecting K

a) Combine with hierarchical clustering: first run an agglomerative hierarchical clustering algorithm to determine the number of result clusters.

b) Set splitting and merging conditions for the categories so that the number of clusters is automatically increased or decreased during clustering, yielding a more reasonable number of clusters K (ISODATA).

c) Based on analysis of variance (ANOVA) theory, determine the optimal number of clusters with a mixed F statistic, and verify the best partition with fuzzy partition entropy.

d) Rival Penalized Competitive Learning (RPCL): for each input, not only is the weight of the winning unit adjusted to move toward the input value, but the second-place (rival) unit is also penalized so that it moves away from the input value.


2) Selecting the initial cluster centers

a) Use a genetic algorithm to initialize the centers.

b) Run hierarchical clustering first, pick K clusters from its result, and use the centroids of these K clusters as the initial centroids (a short sketch follows this list).
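A small R sketch of approach b), using the built-in hclust and cutree to obtain K groups and their centroids as starting values for kmeans; the choice of Ward linkage and of the Iris data is an assumption made here for illustration.

# hierarchical clustering to seed k-means
k <- 3
hc <- hclust(dist(iris[, -5]), method = "ward.D2")   # agglomerative clustering
groups <- cutree(hc, k = k)                          # cut the tree into k groups
# centroid of each hierarchical group, used as the initial centers
init <- aggregate(iris[, -5], by = list(groups), FUN = mean)[, -1]
fit <- kmeans(iris[, -5], centers = as.matrix(init))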


3) k-means++ algorithm

The basic idea behind how k-means++ chooses the initial centroids is that the initial cluster centers should be as far away from each other as possible.

a) Randomly select one point from the input data set as the first cluster center.

b) For every point x in the data set, compute D(x), its distance to the nearest cluster center among those already chosen.

c) Select a new data point as the next cluster center, with the rule that points with a larger D(x) have a higher probability of being chosen.

d) Repeat steps b) and c) until K cluster centers have been selected.

e) Run the standard k-means algorithm with these K initial cluster centers.

The key to the algorithm is step c): how D(x) is turned into a probability of being selected. One way to do this is as follows (a sketch in R follows the list):

a) Randomly pick one point from the data set as the first "seed point".

b) For every point, compute the distance D(x) to its nearest "seed point", save these values in an array, and add them up to obtain Sum(D(x)).

c) Draw a random value random uniformly from [0, Sum(D(x))] and use it for weighted selection of the next "seed point": walk through the points, subtracting each D(x) from random (random -= D(x)) until random <= 0; the point at which this happens is the next "seed point".

d) Repeat steps b) and c) until K cluster centers have been selected.

e) Run the standard k-means algorithm with these K initial cluster centers.
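A minimal R sketch of this weighted selection, following the steps above. The function name kmeanspp_init is made up for illustration, and the final call hands the chosen seeds to the built-in kmeans as its initial centers.

# choose k initial centers with the k-means++ weighting scheme
kmeanspp_init <- function(data, k) {
  data <- as.matrix(data)
  n <- nrow(data)
  centers <- matrix(0, nrow = k, ncol = ncol(data))
  # a) pick the first seed point uniformly at random
  centers[1, ] <- data[sample(n, 1), ]
  for (j in 2:k) {
    # b) D(x): distance from every point to its nearest chosen seed
    d <- apply(data, 1, function(x)
      sqrt(min(colSums((t(centers[1:(j - 1), , drop = FALSE]) - x)^2))))
    # c) weighted ("roulette wheel") selection: larger D(x), higher probability
    r <- runif(1) * sum(d)
    centers[j, ] <- data[which(cumsum(d) >= r)[1], ]
  }
  centers
}

# e) run standard k-means starting from the k-means++ seeds
init <- kmeanspp_init(iris[, -5], 3)
fit <- kmeans(iris[, -5], centers = init)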

4) Empty cluster processing

If no points are assigned to a given cluster during the assignment step, an empty cluster results. When this happens, a strategy is needed to choose a replacement centroid; otherwise the squared error will be unnecessarily large.

a) One approach is to choose the point that is farthest from every current centroid (a sketch follows this list). This removes the point that currently contributes most to the total squared error.

b) Another approach is to choose a replacement centroid from the cluster with the largest SSE. This splits that cluster and lowers its total SSE. If there are several empty clusters, the procedure is repeated.
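A minimal sketch of strategy a) in the same style as the code above; the helper name replace_empty_center and the arguments it takes are assumptions made for illustration.

# strategy a): move an empty cluster's centroid to the point that is
# farthest from every current centroid
replace_empty_center <- function(data, centers, empty_k) {
  data <- as.matrix(data)
  # squared distance of each point to its nearest current centroid
  d <- apply(data, 1, function(x) min(colSums((t(centers) - x)^2)))
  centers[empty_k, ] <- data[which.max(d), ]   # farthest point becomes the new centroid
  centers
}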

Reference

1. http://blog.sina.com.cn/s/blog_70f632090101f212.html

2. http://blog.csdn.net/loadstar_kun/article/details/39450615

3. http://blog.csdn.net/heavendai/article/details/7029465
