Clustering the students' monthly Internet usage data with the DBSCAN algorithm:
# coding=utf-8
import numpy as np
import sklearn.cluster as skc
from sklearn import metrics
import matplotlib.pyplot as plt

mac2id = dict()
onlinetimes = []
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r'F:\data\TestData.txt', encoding='utf-8') as f:
    for line in f:
        fields = line.split(',')
        mac = fields[2]                # MAC address, e.g. A417314EEA7B on the first line
        onlinetime = int(fields[6])    # online duration
        # Start time: keep only the hour (the part before the first ":")
        starttime = int(fields[4].split(' ')[1].split(':')[0])
        # Each entry in onlinetimes corresponds to exactly one index in mac2id
        if mac not in mac2id:
            mac2id[mac] = len(onlinetimes)
            onlinetimes.append((starttime, onlinetime))
        else:
            # Overwrite with the latest record; a tuple keeps every row the same shape
            onlinetimes[mac2id[mac]] = (starttime, onlinetime)
# print(onlinetimes)

# Build a 2-column matrix; -1 lets NumPy infer the number of rows
real_X = np.array(onlinetimes).reshape((-1, 2))
# print(real_X)
X = real_X[:, 0:1]  # use only the start hour
# print(X)

# Fit DBSCAN; labels_ holds the cluster label of each sample, noise is labelled -1
db = skc.DBSCAN(eps=0.01, min_samples=20).fit(X)

# To cluster by online duration instead, use these two lines in place of the fit above:
# X = np.log(1 + real_X[:, 1:])
# db = skc.DBSCAN(eps=0.04, min_samples=10).fit(X)

labels = db.labels_
print('Labels:\n', labels)

# Ratio of samples labelled -1 (noise)
ratio = len(labels[labels[:] == -1]) / len(labels)
print('Noise ratio:', format(ratio, '.2%'))

# Number of clusters, not counting the noise label
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
# Silhouette coefficient, a measure of clustering quality
print('Silhouette Coefficient: %0.3f' % metrics.silhouette_score(X, labels))

# Print each cluster index and the data in that cluster
for i in range(n_clusters_):
    print('Cluster ', i, ':')
    print(list(X[labels == i].flatten()))

# Draw the histogram
plt.hist(X)
plt.show()
Output:
Labels:
[0 -1 0 1 -1 1 0 1 2 -1 1 0 1 1 3 -1 -1 3 -1 1 1 -1 1 3 4
-1 1 1 2 0 2 2 -1 0 1 0 0 0 1 3 -1 0 1 1 0 0 2 -1 1 3
1 -1 3 -1 3 0 1 1 2 3 3 -1 -1 -1 0 1 2 1 -1 3 1 1 2 3 0
1 -1 2 0 0 3 2 0 1 -1 1 3 -1 4 2 -1 -1 0 -1 3 -1 0 2 1 -1
-1 2 1 1 2 0 2 1 1 3 3 0 1 2 0 1 0 -1 1 1 3 -1 2 1 3
1 1 1 2 -1 5 -1 1 3 -1 0 1 0 0 1 -1 -1 -1 2 2 0 1 1 3 0
0 0 1 4 4 -1 -1 -1 -1 4 -1 4 4 -1 4 -1 1 2 2 3 0 1 0 -1 1
0 0 1 -1 -1 0 2 1 0 2 -1 1 1 -1 -1 0 1 1 -1 3 1 1 -1 1 1
0 0 -1 0 -1 0 0 2 -1 1 -1 1 0 -1 2 1 3 1 1 -1 1 0 0 -1 0
0 3 2 0 0 5 -1 3 2 -1 5 4 4 4 -1 5 5 -1 4 0 4 4 4 5 4
4 5 5 0 5 4 -1 4 5 5 5 1 5 5 0 5 4 4 -1 4 4 5 4 0 5
4 -1 0 5 5 5 -1 4 5 5 5 5 4 4]
Noise ratio: 22.15%
Estimated number of clusters: 6
Silhouette Coefficient: 0.710
Cluster 0:
[22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 2 2, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22]
Cluster 1:
[23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 2 3, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23]
Cluster 2:
[20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
Cluster 3:
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]
Cluster 4:
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
Cluster 5:
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
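One way to read the clusters above: the start times are whole hours, and eps=0.01 is far smaller than the gap between two different hours, so only sessions with the same hour can be neighbours. Each cluster is then simply an hour with at least min_samples sessions, and every other session is noise. A minimal sketch of this on synthetic data (the hour counts below are made up for illustration and are not taken from TestData.txt):

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic start hours (assumption): hour 22 x30, 8 x25, 13 x20, 14 x19, 15 x5
hours = np.repeat([22, 8, 13, 14, 15], [30, 25, 20, 19, 5])
X = hours.reshape(-1, 1).astype(float)

# Same parameters as the start-time clustering in the script
labels = DBSCAN(eps=0.01, min_samples=20).fit(X).labels_

# Hours that occur at least min_samples times (min_samples counts the point itself)
dense_hours = sorted(h for h, c in Counter(hours.tolist()).items() if c >= 20)
# Hours that DBSCAN actually placed in a cluster
clustered_hours = sorted({int(h) for h in hours[labels != -1]})

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print('Hours with >= 20 sessions:', dense_hours)
print('Hours in a DBSCAN cluster:', clustered_hours)
print('Clusters:', n_clusters, 'Noise points:', n_noise)
```

In this sketch the hour with 19 sessions falls just short of min_samples and ends up as noise, which is the same reason less popular hours are labelled -1 in the output above.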
The histogram for clustering by online start time is as follows:
The histogram for clustering by online duration is as follows:
Clustering by online start time clearly works better than clustering by online duration.
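To put numbers behind a comparison like this, the same three figures the script prints (noise ratio, cluster count, silhouette coefficient) can be computed for both feature choices. The sketch below is only an illustration: `summarize` is a hypothetical helper, and the data is randomly generated stand-in data, not the article's data set, so its output says nothing about the real result.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN


def summarize(X, eps, min_samples):
    """Fit DBSCAN and return (noise ratio, cluster count, silhouette or None)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    noise_ratio = float(np.mean(labels == -1))
    n_labels = len(set(labels))
    n_clusters = n_labels - (1 if -1 in labels else 0)
    # silhouette_score needs at least 2 distinct labels and fewer labels than samples.
    # As in the script, noise points (-1) are scored as if they were one more cluster.
    score = None
    if 2 <= n_labels < len(labels):
        score = float(metrics.silhouette_score(X, labels))
    return noise_ratio, n_clusters, score


# Synthetic stand-in data (assumption): 300 sessions
rng = np.random.default_rng(0)
start_hour = rng.choice([7, 8, 20, 21, 22, 23], size=300).reshape(-1, 1).astype(float)
duration = rng.exponential(scale=1800, size=300).reshape(-1, 1)

# Same parameters and log transform as the two variants in the script
by_start = summarize(start_hour, eps=0.01, min_samples=20)
by_duration = summarize(np.log(1 + duration), eps=0.04, min_samples=10)

print('start hour (noise ratio, clusters, silhouette):', by_start)
print('duration   (noise ratio, clusters, silhouette):', by_duration)
```

With the real data, replace `start_hour` with `real_X[:, 0:1]` and `duration` with `real_X[:, 1:]` from the script.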
Unsupervised learning, clustering part 2: DBSCAN