Clustering algorithm (K-means Clustering algorithm)

Source: Internet
Author: User

In data analysis and mining, three commonly used clustering algorithms are: 1. K-means clustering, 2. K-medoids (K-center point) clustering, 3. hierarchical (system) clustering.

1. K-means clustering divides the data into a predetermined number of classes K so that the error is minimized, using distance as the measure of similarity. Every iteration traverses all of the data, so it is slow on large datasets.
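The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the scikit-learn implementation; the function name `kmeans`, the parameters `n_iter` and `seed`, and the sample points are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center: shape (n_samples, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # index of the nearest center for each point
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers barely moved: stop
            break
        centers = new_centers
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

Because every pass computes the distance from every point to every center, the cost per iteration grows with the size of the dataset, which is the slowness mentioned above.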

2. K-medoids (K-center point) clustering does not use the mean as the cluster center, as K-means does. Instead, it selects the actual data point closest to the mean as the cluster center.
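The difference from K-means is only in how a cluster's center is chosen. The sketch below follows the description above and picks the member closest to the mean; note that the classic PAM variant of K-medoids instead picks the member with the smallest total distance to the others. The helper name `nearest_to_mean` and the sample cluster are hypothetical.

```python
import numpy as np

def nearest_to_mean(members):
    """Return the member point that lies closest to the mean of `members`."""
    mean = members.mean(axis=0)
    return members[np.linalg.norm(members - mean, axis=1).argmin()]

# Hypothetical cluster with one outlier at (9, 9).
cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 1.2], [9.0, 9.0]])
center_mean = cluster.mean(axis=0)        # K-means style center, pulled toward the outlier
center_point = nearest_to_mean(cluster)   # always one of the real data points
print(center_mean, center_point)
```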

3. Hierarchical clustering (also called system clustering or multi-level clustering) forms classes level by level from high to low, which you can picture as a binary tree. The lower the level, the fewer data points each class contains, but the more features those points have in common. Its disadvantage is that it is slow, so it is not suitable for large data volumes.
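To make the tree structure concrete, here is a naive bottom-up (agglomerative) sketch in NumPy. It uses average linkage, which is an assumption since the text does not name a linkage rule, and the function name `agglomerative` and the sample points are hypothetical. Every merge joins two clusters, which is what produces the binary tree.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Naive bottom-up clustering with average linkage (illustration only)."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster
    # Pairwise distances between all points: shape (n, n).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    merges = []  # history of merges, i.e. the internal nodes of the tree
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

# Hypothetical data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
```

The nested search over all cluster pairs at every merge is why this approach becomes slow as the number of points grows.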

K-means clustering in practice (code):

# -*- coding: utf-8 -*-
"""
Cluster-based discretization. The final result has the form:
        1         2         3         4
A       0  0.178698  0.257724  0.351843
An    240       356       281        53
i.e. (0, 0.178698] contains 240 records, (0.178698, 0.257724] contains 356, and so on.
"""
from __future__ import print_function
import pandas as pd
from sklearn.cluster import KMeans  # import the K-means clustering algorithm

datafile = '../data/data.xls'  # data file to cluster
processedfile = '../tmp/data_processed.xls'  # file written after processing
# the keys must match the column names in the data file
typelabel = {u'syndrome type coefficient of liver-qi stagnation': 'A',
             u'coefficient of accumulation syndrome of heat toxicity': 'B',
             u'coefficient of offset syndrome of flush-type': 'C',
             u'The coefficient of Qi and blood deficiency syndrome': 'D',
             u'syndrome type coefficient of spleen and stomach weakness': 'E',
             u'syndrome type coefficient of liver and kidney yin deficiency': 'F'}
k = 4  # number of clusters

# read the data and run the cluster analysis
data = pd.read_excel(datafile)  # data is a DataFrame
keys = list(typelabel.keys())
results = []  # one small DataFrame per column, concatenated at the end

if __name__ == '__main__':  # needed when the code is saved as a .py file and run; not needed when pasted directly into the command window
    for key in keys:
        # call the K-means algorithm to discretize one column
        print(u'Clustering for "%s" in progress ...' % key)
        # the original passed n_jobs=4 (number of parallel jobs, usually the CPU count);
        # recent scikit-learn versions no longer accept that parameter
        kmodel = KMeans(n_clusters=k)
        # train the model; to_numpy() replaces the removed as_matrix() and returns a NumPy array
        kmodel.fit(data[[key]].to_numpy())

        # cluster centers; kmodel.cluster_centers_ returns the four center points
        r1 = pd.DataFrame(kmodel.cluster_centers_, columns=[typelabel[key]])
        # How K-means works: starting from K initial cluster centers (usually K records
        # picked at random from the dataset), traverse all points in the dataset, compute
        # each point's distance to the K centers, and assign it to the cluster of the
        # nearest center. Once every point is assigned, recompute the K centers (the mean
        # of each cluster) and repeat the distance calculation and assignment, until the
        # centers change very little or the specified number of iterations is reached.
        # Disadvantages: may converge to a local minimum (affected by the initial centers);
        # slow on large datasets (every iteration visits every record in the dataset, and
        # the number of iterations has a default limit).

        # count how many data points fall into each of the K clusters
        r2 = pd.Series(kmodel.labels_).value_counts()
        # DataFrame and Series are the two pandas data structures: a Series can be understood
        # as an indexed array; a DataFrame is two-dimensional data made up of Series, with a
        # row index and a column index, shaped like a matrix
        r2 = r2.to_frame(name=typelabel[key] + 'n')  # convert to a DataFrame holding the count per cluster
        # align the cluster centers with the counts; sort_values sorts by a column
        r = pd.concat([r1, r2], axis=1).sort_values(typelabel[key])
        r.index = [1, 2, 3, 4]

        # mean of each pair of adjacent centers, used as the boundary point
        # (rolling_mean() is deprecated; Series.rolling().mean() replaces it)
        r[typelabel[key]] = r[typelabel[key]].rolling(2).mean()
        r.loc[1, typelabel[key]] = 0.0  # these two lines turn the original cluster centers into boundary points
        results.append(r.T)  # DataFrame.append was removed from pandas, so collect and concatenate

    result = pd.concat(results).sort_index()  # sort by index, i.e. in A, B, C, D, E, F order
    # writing .xls depends on the installed Excel engine; recent pandas may need an .xlsx path
    result.to_excel(processedfile)
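The final two steps of the code, which turn cluster centers into interval boundaries, can be tried in isolation. The sketch below uses made-up center values and sample data, assumes an upper bound of 1.0 for the last interval, and uses `pd.cut` to apply the boundaries, which the original code does not do.

```python
import pandas as pd

# Hypothetical cluster centers for one column, in arbitrary order.
centers = pd.Series([0.30, 0.10, 0.50, 0.20])

# Midpoints between neighbouring sorted centers become the boundary points.
edges = centers.sort_values().rolling(2).mean()
edges.iloc[0] = 0.0  # the first rolling value is NaN; the first interval starts at 0

# Assumed upper bound of 1.0 closes the last interval.
bins = edges.tolist() + [1.0]

# Assign hypothetical values to the right-closed intervals (0, 0.15], (0.15, 0.25], ...
values = pd.Series([0.05, 0.18, 0.26, 0.45, 0.90])
codes = pd.cut(values, bins=bins, labels=[1, 2, 3, 4])
print(edges.tolist())
print(codes.tolist())
```

Using midpoints as boundaries means each value lands in the interval of its nearest cluster center, so the discretization agrees with the K-means assignment for one-dimensional data.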
