Clustering algorithm (K-means Clustering algorithm)

Last Update:2018-05-13 Source: Internet

Author: User

In data analysis and mining, the commonly used clustering algorithms are: 1. K-means clustering, 2. K-medoids (k-center point) clustering, 3. hierarchical (system) clustering.

1. K-means clustering divides the data into a predetermined number of classes K so that the error is minimized, using distance as the measure of similarity. Every iteration traverses all of the data, so it is slow on large data sets.

2. K-medoids (k-center point) clustering does not use the mean as the cluster center, as K-means does. Instead, it selects as the center the actual data point in the cluster that is closest to the mean.

3. Hierarchical (system) clustering, also called multi-level clustering, classifies from the high level down to the low level (you can picture the structure of a binary tree). The further down the tree, the fewer data points each class contains, but the more features those points have in common. Its disadvantage is that it is slow and not suitable for large data volumes.
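
The three methods above can be sketched in plain Python on one-dimensional data. This is only a minimal illustration under stated assumptions, not the implementation used by any library: the function names `kmeans_1d`, `nearest_to_mean` and `agglomerate` are hypothetical, the initial centers are drawn at random from the data, and the hierarchical part merges bottom-up using the smallest gap between clusters (single linkage), which is one of several possible choices.

```python
import random


def kmeans_1d(values, k, max_iter=100, seed=0):
    """K-means on a list of floats; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # k data points as initial centers
    labels = []
    for _ in range(max_iter):
        # Assignment step: every point goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: every center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                new_centers.append(sum(members) / len(members))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    return centers, labels


def nearest_to_mean(members):
    """K-medoids style center as described in item 2: the actual data
    point that lies closest to the mean of the cluster."""
    mean = sum(members) / len(members)
    return min(members, key=lambda v: abs(v - mean))


def agglomerate(values):
    """Hierarchical clustering, built bottom-up; returns a binary tree
    of nested tuples whose leaves are the data points."""
    clusters = [(v, [v]) for v in values]  # (tree, member list)
    while len(clusters) > 1:
        best = None
        # Find the two clusters with the smallest gap between them.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gap = min(abs(x - y)
                          for x in clusters[a][1] for y in clusters[b][1])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    sample = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]  # made-up values
    print(kmeans_1d(sample, 2))
    print(nearest_to_mean(sample[:3]))
    print(agglomerate([1.0, 1.2, 5.0]))
```

The pairwise search in `agglomerate` compares every pair of clusters at every merge, which is why the hierarchical approach becomes slow as the data volume grows.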

K-means clustering in practice (code):

# -*- coding: utf-8 -*-
"""
Cluster-based discretization. The final result has the form:

      1           2           3           4
A     0    0.178698    0.257724    0.351843
An  240  356.000000  281.000000   53.000000

i.e. 240 values fall in (0, 0.178698], 356 in (0.178698, 0.257724], and so on.
"""
import pandas as pd
from sklearn.cluster import KMeans  # K-means clustering algorithm

datafile = '../data/data.xls'  # data file to cluster
processedfile = '../tmp/data_processed.xls'  # file written after processing
# The keys must match the column headers in the data file.
typelabel = {
    'syndrome type coefficient of liver-qi stagnation': 'A',
    'coefficient of accumulation syndrome of heat toxicity': 'B',
    'coefficient of offset syndrome of flush-type': 'C',
    'The coefficient of Qi and blood deficiency syndrome': 'D',
    'syndrome type coefficient of spleen and stomach weakness': 'E',
    'syndrome type coefficient of liver and kidney yin deficiency': 'F',
}
k = 4  # number of clusters

# Read the data and run the cluster analysis.
data = pd.read_excel(datafile)  # data is a DataFrame
keys = list(typelabel.keys())
rows = []  # one small DataFrame per attribute, combined at the end

# This check is needed when the code is saved and run as a .py file;
# it is not needed when the code is pasted straight into a command window.
if __name__ == '__main__':
    for key in keys:
        label = typelabel[key]
        print('Clustering for "%s" in progress ...' % key)

        # K-means starts from k initial centers (usually k samples picked at
        # random), assigns every point to its nearest center, recomputes each
        # center as the mean of its cluster, and repeats until the centers
        # barely move or the iteration limit is reached.
        # Disadvantages: it may converge to a local minimum (the result
        # depends on the initial centers), and it is slow on large data sets
        # because every iteration visits every sample.
        # The original passed n_jobs=4 (the number of parallel jobs, ideally
        # the CPU count); recent scikit-learn releases no longer accept it.
        kmodel = KMeans(n_clusters=k)
        # Train on one column as a NumPy array (to_numpy() replaces the
        # removed as_matrix()).
        kmodel.fit(data[[key]].to_numpy())

        # cluster_centers_ holds the k (here four) cluster centers.
        r1 = pd.DataFrame(kmodel.cluster_centers_, columns=[label])
        # Series and DataFrame are the two pandas data structures: a Series
        # is an indexed array, a DataFrame is a two-dimensional table with a
        # row index and a column index. Count the points in each cluster.
        r2 = pd.Series(kmodel.labels_).value_counts().rename(label + 'n')
        # Match the centers with the counts, then sort by center.
        r = pd.concat([r1, r2], axis=1).sort_values(label)
        r.index = range(1, k + 1)

        # These two lines turn the cluster centers into boundary points: the
        # mean of each pair of adjacent centers (Series.rolling().mean()
        # replaces the discarded rolling_mean), with the first boundary 0.
        r[label] = r[label].rolling(2).mean()
        r.loc[1, label] = 0.0
        rows.append(r.T)

    # Combine the pieces and sort by index, i.e. in A, B, C, D, E, F order.
    result = pd.concat(rows).sort_index()
    # Note: newer pandas releases may need an .xlsx path to write the file.
    result.to_excel(processedfile)
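
The final step of the code above, turning sorted cluster centers into interval boundaries and then placing a value into one of the resulting intervals, can be shown without pandas. The following is a minimal sketch under stated assumptions: `boundaries_from_centers` and `interval_index` are hypothetical helper names, the last interval is treated as open-ended, and the centers in the usage lines are made-up values rather than results from the data file.

```python
from bisect import bisect_left


def boundaries_from_centers(centers):
    """Turn cluster centers into interval boundaries.

    The first boundary is 0.0; each later boundary is the mean of two
    adjacent centers (taken in sorted order).
    """
    ordered = sorted(centers)
    mids = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return [0.0] + mids


def interval_index(value, bounds):
    """Return the 1-based number of the interval that contains value.

    Interval i is (bounds[i-1], bounds[i]]; the last interval has no
    upper limit. A value equal to the first boundary is put in interval 1.
    """
    return max(1, bisect_left(bounds, value))


if __name__ == "__main__":
    bounds = boundaries_from_centers([0.5, 0.1, 0.9, 0.3])  # made-up centers
    print(bounds)
    print([interval_index(v, bounds) for v in (0.15, 0.25, 0.95)])
```

Using the midpoints between adjacent centers as boundaries means each value lands in the interval of the center it is closest to, which is why the counts in the `An` row line up with the intervals in the `A` row.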

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
