I. Basic principle
Classification builds a classifier from a training set whose samples carry annotated category labels; after training, the classifier can assign categories to samples of unknown category. Classification is therefore supervised learning. When the training samples carry no category labels, clustering is needed instead: similar samples are grouped together, with similarity usually measured by distance. Clustering is unsupervised learning.
Clustering follows the principle of "birds of a feather flock together": it partitions data objects into groups, called clusters, without any predefined labels, and describes each such cluster. Its purpose is that samples belonging to the same cluster are similar to each other, while samples of different clusters are sufficiently dissimilar. Unlike classification, clustering does not know in advance how many groups there are or which rules define the groups.

K-means is a commonly used clustering algorithm, where K is the number of clusters. By the definition of clustering, each sample should be closer to the centroid of the cluster it belongs to than to the centroids of the other K-1 clusters. The simplest and most effective way to represent a cluster is the mean of all its sample points, i.e. the cluster centroid, which is where the "means" in the name comes from.

II. Algorithm flow

1) Randomly select K initial points as centroids.
2) Compute the distance from each point to the K centroids and assign it to the cluster of the nearest centroid.
3) Update the centroid of each cluster; repeat until the cluster assignments no longer change.

The pseudo-code is as follows:

Create K points as starting centroids (they can be chosen randomly)
While the cluster assignment of any point has changed:
    For each data point in the data set:
        For each centroid:
            Calculate the distance between the centroid and the data point
        Assign the data point to the cluster whose centroid is closest
    For each cluster, calculate the mean of its points and use the mean as the new centroid

III. Characteristics of the algorithm

Advantages: easy to implement.
Disadvantages: may converge to a local minimum; converges slowly on large data sets.
Applicable data range: numeric values.

IV. Python code implementation

1. Convert a text file to a matrix
from numpy import *

#############################
# Function: import a text file into a matrix
# Input: text file name
# Output: the matrix (as a list of rows) converted from the text file
#############################
def load_data_set(file_name):
    data_mat = []
    fr = open(file_name)
    for line in fr.readlines():
        # strip whitespace from both ends of the string, then split on the tab delimiter
        cur_line = line.strip().split('\t')
        # apply float to every field of cur_line, converting it to a list of floats
        float_line = list(map(float, cur_line))
        data_mat.append(float_line)
    fr.close()
    return data_mat
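As a quick check, the sketch below writes a tiny tab-delimited file and parses it the same way load_data_set does (the file name tiny_set.txt and the sample values are made up for the example):

```python
import os
import tempfile

# a made-up two-row, tab-delimited sample file
sample = "1.0\t2.0\n-1.5\t3.5\n"
path = os.path.join(tempfile.mkdtemp(), "tiny_set.txt")
with open(path, "w") as fw:
    fw.write(sample)

# parse it as load_data_set does: strip, split on tabs, convert to float
data_mat = []
with open(path) as fr:
    for line in fr:
        cur_line = line.strip().split('\t')
        data_mat.append(list(map(float, cur_line)))

print(data_mat)  # [[1.0, 2.0], [-1.5, 3.5]]
```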
2. Calculating the Euclidean distance

############################
# Function: calculate the Euclidean distance
# Input: two vectors
# Output: the Euclidean distance between the two vectors
############################
def dist_eclud(vec_a, vec_b):
    return sqrt(sum(power(vec_a - vec_b, 2)))
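A minimal sanity check of the distance function, using a 3-4-5 right triangle (rewritten here with explicit numpy imports so it runs stand-alone):

```python
import numpy as np

def dist_eclud(vec_a, vec_b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum(np.power(vec_a - vec_b, 2)))

# distance between (0, 0) and (3, 4) is 5 (a 3-4-5 right triangle)
d = dist_eclud(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
print(d)  # 5.0
```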
3. Constructing a set of K random centroids

##################################
# Function: construct a set of K random centroids
# Input: the original data set and the number of centroids K
# Output: the coordinates of the K centroids
##################################
def rand_cent(data_set, k):
    n = shape(data_set)[1]  # number of columns
    centroids = mat(zeros((k, n)))  # k x n centroid matrix, initialized to 0
    for j in range(n):  # create random cluster centers within the range of each dimension
        min_j = min(data_set[:, j])  # take the corresponding column and find its minimum
        range_j = float(max(data_set[:, j]) - min_j)
        # generate k random numbers in [0, 1.0) so the random points stay within the bounds of the data
        centroids[:, j] = min_j + range_j * random.rand(k, 1)
    return centroids
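The key property of rand_cent is that every centroid coordinate lies inside the per-column [min, max] range of the data. The sketch below checks that on a tiny made-up data set (rewritten with plain numpy arrays so it runs stand-alone):

```python
import numpy as np

def rand_cent(data_set, k):
    n = data_set.shape[1]
    centroids = np.zeros((k, n))
    for j in range(n):
        min_j = data_set[:, j].min()
        range_j = float(data_set[:, j].max() - min_j)
        # k random values scaled into this column's [min, max] range
        centroids[:, j] = min_j + range_j * np.random.rand(k)
    return centroids

data = np.array([[0.0, 10.0], [2.0, 14.0], [1.0, 12.0]])
cents = rand_cent(data, 3)
# every centroid coordinate stays inside the per-column data bounds
in_bounds = np.all((cents >= data.min(axis=0)) & (cents <= data.max(axis=0)))
print(in_bounds)  # True
```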
4. Implementing the K-means clustering algorithm

##################################
# Function: K-means clustering
# Input: the original data set, the number of centroids K,
#        a distance function and a function that constructs random centroids
# Output: centroids, the final centroids;
#         cluster_assment, the cluster index and squared error of every data point
##################################
def k_means(data_set, k, dist_meas=dist_eclud, create_cent=rand_cent):
    m = shape(data_set)[0]  # number of rows
    cluster_assment = mat(zeros((m, 2)))  # one column records the cluster index, one stores the squared error
    centroids = create_cent(data_set, k)  # generate random centroids
    cluster_changed = True
    while cluster_changed:
        cluster_changed = False
        for i in range(m):  # find the nearest centroid for each data point
            min_dist = inf
            min_index = -1
            # compute the minimum distance from this data point to the centroids
            for j in range(k):
                dist_ji = dist_meas(centroids[j, :], data_set[i, :])
                if dist_ji < min_dist:
                    min_dist = dist_ji
                    min_index = j
            # if the index of the data point changes, it no longer belongs to its original centroid
            if cluster_assment[i, 0] != min_index:
                cluster_changed = True
            # record the index of the nearest centroid and the squared distance
            cluster_assment[i, :] = min_index, min_dist**2
        print('centroids=', centroids)
        for cent in range(k):
            # select all rows whose first column equals cent;
            # nonzero() returns a tuple with one array per dimension: rows first, then columns;
            # .A returns the matrix data as a 2-dimensional array view
            pts_in_clust = data_set[nonzero(cluster_assment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(pts_in_clust, axis=0)
    return centroids, cluster_assment
def main():
    data_mat = mat(load_data_set('TestSet.txt'))
    # print('data_mat=', data_mat)
    centroids = rand_cent(data_mat, 2)
    print('centroids=', centroids)
    my_centroids, cluster_assing = k_means(data_mat, 4)
    print('my_centroids=', my_centroids)
    print('cluster_assing=', cluster_assing)

if __name__ == '__main__':
    main()
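TestSet.txt is not included here. As a stand-in, the sketch below runs the same algorithm, rewritten compactly with plain numpy arrays (the name k_means_np and the synthetic data are made up for the example), on two well-separated Gaussian clusters. At convergence every point is assigned to its nearest final centroid:

```python
import numpy as np

np.random.seed(0)

# two well-separated synthetic clusters as a stand-in for TestSet.txt
data = np.vstack([np.random.randn(50, 2) + [5.0, 5.0],
                  np.random.randn(50, 2) - [5.0, 5.0]])

def k_means_np(data_set, k):
    m = data_set.shape[0]
    # seed centroids with k distinct data points so no cluster starts empty
    centroids = data_set[np.random.choice(m, k, replace=False)].copy()
    assignment = np.zeros(m, dtype=int)
    changed = True
    while changed:
        changed = False
        for i in range(m):
            # Euclidean distance from point i to every centroid
            dists = np.sqrt(((centroids - data_set[i]) ** 2).sum(axis=1))
            nearest = int(dists.argmin())
            if assignment[i] != nearest:
                assignment[i] = nearest
                changed = True
        for cent in range(k):
            pts = data_set[assignment == cent]
            if len(pts) > 0:
                centroids[cent] = pts.mean(axis=0)  # move centroid to cluster mean
    return centroids, assignment

cents, assignment = k_means_np(data, 2)
print('final centroids:\n', cents)
```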
Machine Learning: K-means Clustering Algorithm