Today we are going to cluster 1000 records, each with n attributes, using bisecting K-means.
Algorithm idea:
I follow the introduction in Pang-Ning Tan's Introduction to Data Mining, p. 317.
Its advantage over plain K-means is that the result is not sensitive to the choice of initial centroids.
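Before the full implementation, the overall loop can be sketched compactly with scikit-learn. This is an illustrative sketch only: it uses within-cluster SSE as the split criterion and random data, whereas the code below tracks a cosine-based cohesion score and reads records from a file.

```python
import numpy as np
from sklearn.cluster import KMeans


def bisecting_kmeans(data, k, seed=0):
    """Repeatedly split the cluster with the largest SSE until k clusters remain."""
    clusters = [np.arange(len(data))]  # start with one cluster holding every row
    while len(clusters) < k:
        # pick the cluster with the largest within-cluster SSE to split next
        sse = [((data[idx] - data[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        worst = int(np.argmax(sse))
        idx = clusters.pop(worst)
        # split it in two with plain 2-means
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(data[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
parts = bisecting_kmeans(data, k=3)
print([len(p) for p in parts])  # three index arrays covering all 100 rows
```

Because each split only re-runs 2-means on one cluster, a bad initialization affects at most one split rather than the whole partition, which is where the robustness to initial centroids comes from.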
```python
# coding: utf-8
# python 3.4
# 2015-4-3
# Fitz Yin
# yinruyi.hm@gmail.com
from sklearn.cluster import KMeans
import numpy as np


def makedict(f):
    # build a dictionary mapping line numbers to each row of data
    a = [line.split() for line in f]
    data_dict = {}
    for i in range(len(a)):
        data_dict[i] = a[i]
    return data_dict


def kmeans(data):
    # one bisecting step: split the data into two clusters with K-means
    data = np.array(data)
    computer = KMeans(n_clusters=2)
    computer.fit(data)
    labels = computer.labels_
    one_class = []
    zero_class = []
    for i in range(len(labels)):
        if labels[i] == 1:
            one_class.append(i)   # line numbers of class 1
        else:
            zero_class.append(i)  # line numbers of class 0
    centers = computer.cluster_centers_  # find the cluster centers
    cohesion_0, cohesion_1 = -1, -1  # offset: a center's cos with itself is 1
    for i in zero_class:
        cohesion_0 += judge_cos(data[i], centers[0])  # class 0 cos evaluation
    for i in one_class:
        cohesion_1 += judge_cos(data[i], centers[1])  # class 1 cos evaluation
    return zero_class, one_class, cohesion_0, cohesion_1


def judge_cos(x, y):
    # cosine similarity between vectors x and y
    af, bf, ab = 0, 0, 0
    for i in range(len(x)):
        af += float(x[i]) * float(x[i])
        bf += float(y[i]) * float(y[i])
        ab += float(x[i]) * float(y[i])
    if af == 0 or bf == 0:
        print('error')
        return 0  # all-zero vectors do not occur in this example
    else:
        return ab / (np.sqrt(af) * np.sqrt(bf))


def gettransdict(split_set, split_number):
    # map line numbers of the sub-matrix fed to K-means back to the original matrix
    a = split_set[split_number][0]
    transdict = {}
    for i in range(len(a)):
        transdict[i] = a[i]
    return transdict


def getsplitset(split_set, split_number):
    # remove the cluster being split from the list of clusters
    new_split_set = []
    for i in range(len(split_set)):
        if i != split_number:
            new_split_set.append(split_set[i])
    return new_split_set


def getsplitnumber(split_set):
    # find the index of the cluster with the lowest cohesion, to split next
    split_number = 0
    temp = []
    for i in range(len(split_set)):
        temp.append(split_set[i][1])
    for i in range(len(temp)):
        if temp[split_number] > temp[i]:
            split_number = i
    return split_number


def main():
    f = open('train.txt', 'r', encoding='utf-8').readlines()
    data_dict = makedict(f)
    k = 3  # number of clusters
    # SSE = 0.001
    split_set = [[[i for i in range(1000)], 0]]  # 1000 is the number of records
    split_number = 0  # index of the cluster to split
    while len(split_set) != k:
        transdict = gettransdict(split_set, split_number)  # conversion dictionary
        array2kmeans = [data_dict[i] for i in split_set[split_number][0]]  # matrix for the bisecting step
        zero_class, one_class, cohesion_0, cohesion_1 = kmeans(array2kmeans)
        real_zero_class = [transdict[i] for i in zero_class]  # cluster 0 after splitting
        real_one_class = [transdict[i] for i in one_class]    # cluster 1 after splitting
        split_set = getsplitset(split_set, split_number)  # drop the cluster that was just split
        split_set.append([real_zero_class, cohesion_0])
        split_set.append([real_one_class, cohesion_1])  # add the two new clusters
        split_number = getsplitnumber(split_set)  # cluster to split in the next iteration
    print(split_set)  # [[[line numbers of class 1], sse1], [[...of class 2], sse2], [[...of class 3], sse3]]


if __name__ == '__main__':
    main()
```
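The element-by-element loop in judge_cos can also be written with NumPy dot products. A minimal sketch (cos_sim is a hypothetical name, and the all-zero-vector case is assumed not to occur, as in the code above):

```python
import numpy as np


def cos_sim(x, y):
    # cosine similarity via dot products; assumes neither vector is all zeros
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


print(cos_sim([1, 0], [1, 0]))  # 1.0 (identical directions)
print(cos_sim([1, 0], [0, 1]))  # 0.0 (orthogonal)
```

Vectorizing this way avoids the per-element float conversions and is noticeably faster once n grows.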