Bisecting k-means: a Python implementation


Today I needed to cluster a text file of 1000 records, each with n attributes, and I used the bisecting k-means method.

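Algorithm idea:

I followed p. 317 of Introduction to Data Mining by Pang-Ning Tan. Start with all points in one cluster, then repeatedly pick one cluster and split it in two with ordinary k-means (k = 2) until there are k clusters.

Its advantage over plain k-means is that it is less sensitive to the choice of initial centroids.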
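To make the idea concrete, here is a minimal pure-Python sketch of the bisecting loop. It is not the code from this post: the names `two_means`, `sse` and `bisecting_kmeans` are hypothetical, the seeds are picked deterministically so the result is reproducible, and the cluster to split is the one with the largest SSE. That last choice is an assumption; the post's own code picks the cluster by a cosine cohesion score instead.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, max_iter=100):
    """Split points into two groups with a basic 2-means loop.

    Assumption: seeds are the first point and the point farthest from it.
    """
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    centers = [c0, c1]
    groups = [[], []]
    for _ in range(max_iter):
        groups = [[], []]
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearest].append(p)
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return groups


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def bisecting_kmeans(points, k):
    """Bisect until k clusters exist; assumes at least k distinct points."""
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumption: split the cluster with the largest SSE.
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(worst)
        clusters.extend(two_means(target))
    return clusters


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 0.0), (6.0, 1.0), (20.0, 0.0), (21.0, 1.0)]
for cluster in bisecting_kmeans(points, 3):
    print(cluster)
```

On this toy data the first split separates the far pair near x = 20 from the rest, and the second split separates the remaining two pairs.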

# coding: utf-8
# python 3.4
# 2015-4-3
# Fitz Yin
# yinruyi.hm@gmail.com
from sklearn.cluster import KMeans
import numpy as np


def makedict(f):
    # Map each row number to that row's values, parsed as floats
    a = [[float(v) for v in line.split()] for line in f]
    data_dict = {}
    for i in range(len(a)):
        data_dict[i] = a[i]
    return data_dict


def kmeans(data):
    # Run 2-means on the given rows and score each resulting cluster
    data = np.array(data)
    computer = KMeans(n_clusters=2)
    computer.fit(data)
    labels = computer.labels_
    one_class = []
    zero_class = []
    for i in range(len(labels)):
        if labels[i] == 1:
            one_class.append(i)  # row numbers in cluster 1
        else:
            zero_class.append(i)  # row numbers in cluster 0
    centers = computer.cluster_centers_  # cluster centers
    cohesion_0, cohesion_1 = -1, -1  # initial value: the cos of a point with itself is 1
    for i in zero_class:
        cohesion_0 += judge_cos(data[i], centers[0])  # cos score for cluster 0
    for i in one_class:
        cohesion_1 += judge_cos(data[i], centers[1])  # cos score for cluster 1
    return zero_class, one_class, cohesion_0, cohesion_1


def judge_cos(x, y):
    # Cosine similarity between two vectors
    af, bf, ab = 0, 0, 0
    for i in range(len(x)):
        # Accumulate over all components (the original overwrote these with "=")
        af += float(x[i]) * float(x[i])
        bf += float(y[i]) * float(y[i])
        ab += float(x[i]) * float(y[i])
    if af == 0 or bf == 0:
        print('error')
        return 0  # all-zero vectors do not occur in this example
    else:
        cos_value = ab / (np.sqrt(af) * np.sqrt(bf))
        return cos_value


def gettransdict(split_set, split_number):
    # Map row numbers in the matrix passed to kmeans back to the original row numbers
    a = split_set[split_number][0]
    transdict = {}
    for i in range(len(a)):
        transdict[i] = a[i]
    return transdict


def getsplitset(split_set, split_number):
    # Remove the cluster that is about to be split
    new_split_set = []
    for i in range(len(split_set)):
        if i == split_number:
            pass
        else:
            new_split_set.append(split_set[i])
    return new_split_set


def getsplitnumber(split_set):
    # Find the index of the next cluster to split: the one with the lowest cohesion score
    split_number = 0
    temp = []
    for i in range(len(split_set)):
        temp.append(split_set[i][1])
    for i in range(len(temp)):
        if temp[split_number] > temp[i]:
            split_number = i
    return split_number


def main():
    with open('train.txt', 'r', encoding='utf-8') as fh:
        f = fh.readlines()
    data_dict = makedict(f)
    k = 3  # number of clusters
    # sse = 0.001
    split_set = [[list(range(len(data_dict))), 0]]  # all row numbers (1000 in this example)
    split_number = 0  # index of the cluster to split
    while len(split_set) != k:
        transdict = gettransdict(split_set, split_number)  # row-number mapping
        array2kmeans = [data_dict[i] for i in split_set[split_number][0]]  # matrix for this 2-means run
        zero_class, one_class, cohesion_0, cohesion_1 = kmeans(array2kmeans)
        real_zero_class = [transdict[i] for i in zero_class]  # cluster 0 after the split
        real_one_class = [transdict[i] for i in one_class]  # cluster 1 after the split
        split_set = getsplitset(split_set, split_number)  # drop the cluster that was split
        split_set.append([real_zero_class, cohesion_0])
        split_set.append([real_one_class, cohesion_1])  # add the two new clusters
        split_number = getsplitnumber(split_set)  # cluster to split in the next iteration
    print(split_set)
    # [[[row numbers of cluster 1], cohesion 1], [[row numbers of cluster 2], cohesion 2], [[row numbers of cluster 3], cohesion 3]]


if __name__ == '__main__':
    main()
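The cohesion score in this post is the sum of cosine similarities between each row of a cluster and that cluster's center. As a small worked check of that measure, the sketch below computes it in pure Python. The names `cosine_similarity` and `cohesion` are hypothetical helpers for illustration, not functions from the post, and raising an error on a zero vector is my assumption rather than the post's behavior.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat a zero vector as an error instead of returning 0.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cohesion(members, centroid):
    """Sum of cosine similarities between each member and the centroid."""
    return sum(cosine_similarity(m, centroid) for m in members)


# Parallel vectors give 1, orthogonal vectors give 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
# (3*4 + 4*3) / (5 * 5) = 24/25
print(cosine_similarity([3, 4], [4, 3]))
```

Because the score is a sum rather than a mean, it grows with cluster size, so a small cluster can score low even when its rows are tightly aligned with the center. Dividing by the number of members would be one way to remove that bias.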

 
