In data analysis and mining, the commonly used clustering algorithms are: 1. K-means clustering, 2. K-medoids (k-center point) clustering, 3. hierarchical (system) clustering.
1. K-means clustering divides the data into a predetermined number of classes K by minimizing the error, using distance as the measure of similarity. Every iteration traverses all of the data, so it is slow on large datasets.
2. K-medoids (k-center point) clustering does not use the mean as the cluster center, as K-means does. Instead, it selects the actual data point closest to the mean as the cluster center.
3. Hierarchical (system) clustering is also called multi-level clustering. Classification proceeds from high level to low level (you can picture the structure of a binary tree): the lower the level, the fewer data points each cluster contains, but the more features those points have in common. Its disadvantage is that it is slow and not suitable for large data volumes.
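The difference between the first two methods can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names `assign`, `kmeans`, and `nearest_point_to_mean` and the toy `points` array are assumptions made for the example, and the medoid helper only follows the description above (the data point closest to the mean) rather than a complete k-medoids algorithm.

```python
import numpy as np

def assign(points, centers):
    """Return, for every point, the index of its nearest center (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means: start from k randomly chosen data points, then alternate
    between assigning points to the nearest center and recomputing the means."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centers)  # every point is visited in every iteration
        new_centers = np.array([
            # keep the old center if a cluster happens to be empty
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers barely moved: stop
            break
        centers = new_centers
    return centers, assign(points, centers)

def nearest_point_to_mean(cluster_points):
    """Center in the k-medoids sense described above: an actual data point,
    namely the one closest to the cluster mean."""
    mean = cluster_points.mean(axis=0)
    return cluster_points[np.linalg.norm(cluster_points - mean, axis=1).argmin()]

# Hypothetical toy data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(points, k=2)
medoid = nearest_point_to_mean(points[labels == labels[0]])
print(centers)
print(medoid)
```

The K-means centers are averages and usually do not coincide with any data point, while the medoid is always one of the original points, which makes it less sensitive to outliers.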
K-means clustering hands-on code:
# -*- coding: utf-8 -*-
"""
Cluster-based discretization. The final result has the form:
      1           2           3           4
A     0    0.178698    0.257724    0.351843
An  240  356.000000  281.000000   53.000000
i.e. (0, 0.178698] contains 240 points, (0.178698, 0.257724] contains 356, and so on.
"""
from __future__ import print_function

import pandas as pd
from sklearn.cluster import KMeans  # import the K-means clustering algorithm

datafile = '../data/data.xls'  # data file to be clustered
processedfile = '../tmp/data_processed.xls'  # file written after processing
typelabel = {
    u'syndrome type coefficient of liver-qi stagnation': 'A',
    u'coefficient of accumulation syndrome of heat toxicity': 'B',
    u'coefficient of offset syndrome of flush-type': 'C',
    u'The coefficient of Qi and blood deficiency syndrome': 'D',
    u'syndrome type coefficient of spleen and stomach weakness': 'E',
    u'syndrome type coefficient of liver and kidney yin deficiency': 'F',
}
k = 4  # number of clusters to form

# Read the data and run the cluster analysis
data = pd.read_excel(datafile)  # data is a DataFrame
keys = list(typelabel.keys())
result = pd.DataFrame()  # declare an empty DataFrame

# This check is needed when the code is saved as a .py file and run as a script;
# it is not needed when the code is pasted directly into a command window.
if __name__ == '__main__':
    for i in range(len(keys)):
        label = typelabel[keys[i]]
        # Call the K-means algorithm for cluster-based discretization
        print(u'Clustering for "%s" in progress ...' % keys[i])
        # The original passed n_jobs=4 (number of parallel jobs, ideally the CPU count);
        # recent scikit-learn releases no longer accept n_jobs here, so it is omitted.
        kmodel = KMeans(n_clusters=k)
        # print(data[[keys[i]]].values); exit()
        # Train the model. .values converts the selected column to a NumPy array
        # (the original used as_matrix(), which has been removed from pandas).
        kmodel.fit(data[[keys[i]]].values)
        # print(data[[keys[i]]]); exit()

        # Cluster centers: kmodel.cluster_centers_ returns the four center points.
        r1 = pd.DataFrame(kmodel.cluster_centers_, columns=[label])
        # How K-means works: given K initial cluster centers (usually K points chosen
        # at random from the dataset), traverse all points in the dataset, compute the
        # distance from each point to the K centers, and assign the point to the
        # cluster whose center is nearest. Once all points are assigned, recompute the
        # K centers (the mean of each cluster) and repeat the assignment, until the
        # centers change very little or the specified number of iterations is reached.
        # Disadvantages: it may converge to a local minimum (it depends on the initial
        # centers), and it converges slowly on large datasets (every iteration visits
        # every sample; the number of iterations is capped by a default limit).
        # print([label])

        # Classification statistics: how many data points each of the K clusters holds.
        # DataFrame and Series are the two pandas data structures: a Series can be
        # understood as an indexed array, and a DataFrame as two-dimensional data made
        # of Series, with a row index and a column index, shaped like a matrix.
        r2 = pd.Series(kmodel.labels_).value_counts()
        r2 = r2.to_frame(name=label + 'n')  # convert to a DataFrame holding the counts
        # print(r2); exit()

        # Match cluster centers with cluster sizes; sort_values sorts by one column.
        r = pd.concat([r1, r2], axis=1).sort_values(label)
        # print(r); exit()
        r.index = [1, 2, 3, 4]

        # Mean of each pair of adjacent centers, used as the boundary point
        # (the deprecated rolling_mean() is replaced by Series.rolling().mean()).
        r[label] = r[label].rolling(2).mean()
        r.loc[1, label] = 0.0  # these two lines turn the cluster centers into boundaries
        # DataFrame.append has been removed from recent pandas; concat is equivalent here.
        result = pd.concat([result, r.T])

    result = result.sort_index()  # sort by index, that is, in A, B, C, D, E, F order
    # Writing .xls depends on the installed pandas and Excel engine;
    # newer versions may require an .xlsx file name instead.
    result.to_excel(processedfile)
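The step that turns cluster centers into interval boundaries is easier to follow in isolation. The sketch below uses made-up center values (the `centers` and `values` series are hypothetical and not taken from the data file) and then applies the resulting boundaries with `pd.cut`, which is one possible way to use them and is not part of the original script.

```python
import pandas as pd

# Hypothetical, already sorted cluster centers for one column.
centers = pd.Series([0.1, 0.3, 0.5, 0.9], index=[1, 2, 3, 4])

# Mean of each adjacent pair of centers. The first entry has no predecessor,
# so rolling(2).mean() leaves it as NaN.
boundaries = centers.rolling(2).mean()
boundaries.loc[1] = 0.0  # use 0 as the lowest boundary, as in the script above

# The boundaries are the left edges of the intervals; add an open upper end
# so that they can serve as bin edges.
bins = boundaries.tolist() + [float("inf")]

# Hypothetical values to discretize into classes 1 to 4.
values = pd.Series([0.05, 0.25, 0.45, 0.95])
codes = pd.cut(values, bins=bins, labels=[1, 2, 3, 4])

print(boundaries)
print(codes)
```

Each value falls into the half-open interval `(left, right]` between two neighboring boundaries, which matches the `(0, 0.178698]` style of interval described in the docstring of the script.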