Unsupervised learning--K-means clustering algorithm for grouping unlabeled data


Unsupervised learning

Unlike supervised learning, the data in unsupervised learning is not labeled (categorized). Unsupervised learning requires the algorithm to discover the inherent structure of the data and group it on its own. (For example, looking at an unlabeled data set it may be apparent that the points fall into three groups; discovering that grouping is an unsupervised learning task.)

Unsupervised learning does not have a training process.

Clustering algorithm

The algorithm places similar objects into the same cluster, a bit like automatic classification. The more similar the objects within a cluster are, the better the clustering result.

The concept may sound sophisticated, but once you look at the algorithm you will find that, like KNN, the idea is very simple.

The original data set is shown below (each sample has two features, used as the horizontal and vertical coordinates), and it carries no label or category information:

From the figure one can roughly judge that the data set falls into three classes (labeled 0, 1, 2). To decide which class each point belongs to, the K-means clustering algorithm computes three centroid points and assigns every point to the class of its nearest centroid (whichever of the three centroids 0, 1, 2 is closest). The three computed centroids are shown as the red crosses in the figure:

K-Means Clustering algorithm

The flow of the algorithm is as follows:

1. Load the data set
2. Initialize
     2.1 Create random centroid points
     2.2 Create the matrix/array that holds each point's assignment result
3. Iterate (until the assignment of no point changes any more)
     3.1 Assign each point to its nearest centroid
     3.2 Recalculate the centroid points from the assignments in 3.1 (the mean of the points belonging to each cluster becomes its new centroid)
4. Return the results
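Before the book-style implementation further below, here is a minimal sketch of this loop using plain NumPy arrays. It is only an illustration: the function name simple_kmeans is my own, and it initializes the centroids by sampling k data points rather than by drawing random coordinates as the randCent() function below does.

import numpy as np

def simple_kmeans(data, k, max_iter=100):
    # data: (m, n) NumPy array of unlabeled points
    rng = np.random.default_rng(0)
    # 2.1  pick k random data points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignments = np.full(len(data), -1)
    for _ in range(max_iter):
        # 3.1  assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                                  # 3: no assignment changed, stop
        assignments = new_assignments
        # 3.2  move each centroid to the mean of the points assigned to it
        for c in range(k):
            if np.any(assignments == c):
                centroids[c] = data[assignments == c].mean(axis=0)
    return centroids, assignments                  # 4: return the results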

The disadvantage of the algorithm:

The algorithm tends to converge to a local minimum rather than the global minimum. (A local minimum means the result is acceptable but not the best possible; the best possible result is the global minimum.)

Bisecting K-Means clustering algorithm

SSE: a metric for clustering quality (Sum of Squared Error)

The smaller the SSE, the closer the data points are to their centroids, and the better the clustering result.
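As a concrete illustration (not from the original text), SSE is simply the sum over all points of the squared distance to the centroid of the cluster each point is assigned to:

import numpy as np

def sse(data, centroids, assignments):
    # sum of squared distances from each point to its assigned centroid
    diffs = data - centroids[assignments]
    return float(np.sum(diffs ** 2))

In the kMeans() implementation below, this quantity is just sum(clusterAssment[:, 1]), since the second column of clusterAssment already stores each point's squared distance to its centroid.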

The flow of the algorithm is as follows:

1. Treat all points as one cluster
2. While the number of clusters is less than K:
     2.1 For each cluster:
         2.1.1 Calculate the total error (SSE)
         2.1.2 Run K-means clustering (k=2) on the cluster
         2.1.3 Calculate the total error after splitting the cluster in two
     2.2 Split the cluster whose split gives the smallest total error

Python implementation

Data loading

from numpy import *   # the functions below rely on NumPy (mat, zeros, shape, nonzero, mean, inf, ...)

def loadDataSet(fileName):       # general function to parse tab-delimited floats
    dataMat = []                 # each line becomes a list of floats
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))   # map all elements to float()
        dataMat.append(fltLine)
    return dataMat

The data has the following form; the most important difference from supervised-learning data is that it carries no labels. Each data point is a two-dimensional vector.

3.275154    2.957587
-3.344465   2.603513
0.355083    -3.376585
1.852435    3.547351
-2.078973   2.552013
-0.993756   -0.884433
2.682252    4.007573
-3.087776   2.878713
-1.565978   -1.256985
2.441611    0.444826
-0.659487   3.111284
-0.459601   -2.618005
2.177680    2.387793
-2.920969   2.917485
-0.028814   -4.168078
3.625746    2.119041
-3.912363   1.325108
-0.551694   -2.814223
2.855808    3.483301
............
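Assuming the data above is saved in a tab-delimited text file (the file name testSet.txt used here is only an assumption), loading it looks like this:

dataMat = mat(loadDataSet('testSet.txt'))   # any tab-delimited file of floats works
print(shape(dataMat))                       # (number of points, 2)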

Vector Euclidean distance calculation function

def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))   # Euclidean distance; la.norm(vecA - vecB) would also work
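For example (illustrative values only), the distance between the points (0, 0) and (3, 4) is 5:

print(distEclud(mat([0.0, 0.0]), mat([3.0, 4.0])))   # 5.0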

Random generation of K centroids

def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))   # create centroid mat
    for j in range(n):               # create random cluster centers, within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
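A quick sanity check (illustrative only, continuing the loading example above): asking for 3 random centroids should give a 3 x 2 matrix whose values stay within the range of each feature:

centroids = randCent(dataMat, 3)
print(shape(centroids))   # (3, 2)
print(centroids)          # each column lies between that feature's min and max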

K-Means Clustering algorithm

Disadvantage: the number of clusters K must be specified in advance.

The function returns the centroid coordinates centroids, and for each point the index of its nearest centroid (that is, the cluster the point is assigned to) together with the squared distance to it, stored in clusterAssment.

Note the termination condition of the iteration: clusterChanged marks whether the current iteration assigned any point to a different cluster than the previous iteration did; if the assignment of every point is identical to the previous result, the iteration stops.

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]                 # number of data points
    clusterAssment = mat(zeros((m, 2)))   # for each point: index of the nearest centroid and the squared distance (SE) to it
    centroids = createCent(dataSet, k)    # random initial centroids, refined over the iterations
    clusterChanged = True                 # did any assignment change? the loop's termination condition
    while clusterChanged:
        clusterChanged = False
        for i in range(m):                # assign each data point to the closest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)
        for cent in range(k):             # recalculate centroids
            # ptsInClust: the set of points whose nearest centroid is this one
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            # replace the centroid with the mean of those points, hence the name "means" clustering
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment
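Putting it together (the file name is again only an assumption), running K-means with k = 3 on the data above:

dataMat = mat(loadDataSet('testSet.txt'))
myCentroids, clustAssing = kMeans(dataMat, 3)
print(myCentroids)        # the 3 centroid coordinates
print(clustAssing[:5])    # first column: cluster index, second column: squared distance to the centroid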

Bisecting K-Means algorithm

The input and output of the algorithm are the same as those of kMeans, but its internal implementation is more complex.

def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]                # start with a single centroid: the mean of all points
    for j in range(m):                    # calculate the initial error
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :])**2
    while (len(centList) < k):
        lowestSSE = inf
        for i in range(len(centList)):
            # the data points currently in cluster i
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])   # SSE of cluster i after splitting it in two
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:   # compare the SSE to the current minimum
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)   # change 1 to 3, 4, or whatever
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]   # replace the split centroid with the first new centroid
        centList.append(bestNewCents[1, :].tolist()[0])              # append the second new centroid
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss   # reassign new clusters and SSE
    return mat(centList), clusterAssment
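A quick illustrative run of the bisecting version, again assuming the same tab-delimited file:

dataMat3 = mat(loadDataSet('testSet.txt'))
centList, myNewAssments = biKmeans(dataMat3, 3)
print(centList)           # the 3 centroids found by bisecting K-means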

Other machine learning algorithms:

Supervised learning--stochastic gradient descent (SGD) and batch gradient descent (BGD)

Supervised learning--decision tree theory and practice (part 1): classification decision trees

Supervised learning--decision tree theory and practice (part 2): regression decision trees (CART)

Supervised learning--the K-nearest-neighbor algorithm and digit recognition practice

Supervised learning--the theory and practice of naive Bayes classification

Supervised learning--logistic binary classification (Python)

Supervised learning--the AdaBoost meta-algorithm for improving classification performance

Reference:

"Machine Learning in Action"
