Machine learning: the K-means clustering algorithm

Last Update:2015-05-21 Source: Internet

Author: User


I. Basic principles

Classification trains a classifier on a training set whose samples carry category labels; once trained, the classifier can assign categories to samples whose category is unknown. Because it relies on labels, classification is called supervised learning. When the samples in the training set have no category labels, clustering is used instead. Clustering groups similar samples together, where similarity is usually measured by distance. Because it needs no labels, clustering is called unsupervised learning.

Clustering follows the principle that "birds of a feather flock together": it takes a collection of data objects that is not itself divided into groups, partitions it into groups called clusters, and describes each such cluster. The goal is for samples in the same cluster to be similar to each other, while samples in different clusters are sufficiently dissimilar. Unlike classification, clustering does not know in advance how many groups the data will be divided into or what those groups look like, nor which spatial rules define them.

K-means is one of the most commonly used clustering algorithms; the k is the number of clusters. By the definition of clustering, a sample should be closer to the centroid of the cluster it belongs to than to the centroids of the other k-1 clusters. The simplest and most effective way to represent a cluster is to take the mean of all its sample points, that is, the cluster centroid. This is the origin of "means" in the name.

II. Algorithm flow

1) Randomly select k initial points as centroids.
2) Calculate the distance from each point to the k centroids and assign the point to the cluster of the nearest centroid.
3) Update the centroid of each cluster. Repeat steps 2 and 3 until the cluster assignments no longer change.

The pseudo-code is as follows:

Create k points as the starting centroids (they can be chosen randomly)
While the cluster assignment of any point has changed:
    For each data point in the data set:
        For each centroid:
            Calculate the distance between the centroid and the data point
        Assign the data point to the cluster whose centroid is closest
    For each cluster, calculate the mean of the points in the cluster and use the mean as the centroid

III. Characteristics of the algorithm

Advantages: easy to implement.
Disadvantages: may converge to a local minimum, and converges slowly on large data sets.
Applicable data type: numeric.

IV. Python code implementation

1. Convert a text file to a matrix

#############################
# Function: import a text file into a matrix
# Input:    text file name
# Output:   the matrix (list of rows) converted from the text file
#############################
def load_data_set(file_name):
    data_mat = []

    with open(file_name) as fr:
        for line in fr:
            # Strip whitespace from both ends, then split the string on the tab delimiter
            cur_line = line.strip().split('\t')

            # Convert every field to float; list() is needed because map() is lazy in Python 3
            float_line = list(map(float, cur_line))

            data_mat.append(float_line)

    return data_mat
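
As a quick check of the parsing step, the sketch below applies the same strip, split, and float logic to an in-memory string instead of a file. The helper name parse_tab_delimited and the sample values are hypothetical and are not part of the original code.

```python
def parse_tab_delimited(text):
    """Parse tab-separated numeric text into a list of rows of floats."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # skip blank lines instead of failing on float('')
            continue
        rows.append([float(field) for field in line.split('\t')])
    return rows


# Hypothetical sample: two rows, two tab-separated columns each
sample = "1.5\t2.0\n-3.0\t4.25\n"
data = parse_tab_delimited(sample)
print(data)
```

Each row of the file becomes one data point, and each tab-separated field becomes one coordinate of that point.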

2. Calculate the Euclidean distance

############################
# Function: calculate the Euclidean distance
# Input:    two vectors
# Output:   Euclidean distance between the two vectors
############################
import numpy as np

def dist_eclud(vec_a, vec_b):
    return np.sqrt(np.sum(np.power(vec_a - vec_b, 2)))
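
To make the distance concrete, here is a minimal worked example in plain Python, without NumPy. The function name euclidean_distance is hypothetical; it computes the square root of the sum of squared coordinate differences, which is the same quantity as above.

```python
import math


def euclidean_distance(vec_a, vec_b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(d)  # 5.0
```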

3. Construct a set of K random centroids

##################################
# Function: construct a set of k random centroids
# Input:    the original data set, the number of centroids k
# Output:   coordinates of the centroids
##################################
import numpy as np

def rand_cent(data_set, k):
    n = np.shape(data_set)[1]     # number of columns (dimensions)
    centroids = np.zeros((k, n))  # k-by-n centroid array with an initial value of 0

    for j in range(n):  # create random cluster centers within the range of each dimension
        min_j = np.min(data_set[:, j])  # take column j and find its minimum value
        range_j = float(np.max(data_set[:, j]) - min_j)

        # Draw k random numbers in [0, 1.0), one per centroid, so that every
        # random point lies within the bounds of the data
        centroids[:, j] = min_j + range_j * np.random.rand(k)

    return centroids
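
The bounding-box idea can be checked on a tiny data set. The sketch below is a plain-Python version under stated assumptions: the name random_centroids, the rng parameter, and the sample points are hypothetical. Each centroid receives its own random draw per dimension; reusing a single draw for all of them would make the k centroids coincide.

```python
import random


def random_centroids(points, k, rng=None):
    """Return k random points inside the per-dimension bounds of the data."""
    rng = rng or random.Random()
    n = len(points[0])
    centroids = [[0.0] * n for _ in range(k)]
    for j in range(n):
        column = [p[j] for p in points]
        low, high = min(column), max(column)
        for centroid in centroids:
            # Independent draw in [0, 1) for every centroid and dimension
            centroid[j] = low + (high - low) * rng.random()
    return centroids


# Hypothetical data: x ranges over [1, 5], y over [10, 20]
points = [[1.0, 10.0], [2.0, 20.0], [5.0, 15.0]]
centroids = random_centroids(points, 3, random.Random(0))
print(centroids)
```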

4. Implement the K-means clustering algorithm

##################################
# Function: K-means clustering
# Input:    the original data set, the number of centroids k
# Output:   centroids        the final centroids
#           cluster_assment  cluster index and squared error of each data point
##################################
import numpy as np

def k_means(data_set, k):
    m = np.shape(data_set)[0]  # number of rows (data points)

    # Column 0 records the cluster index, column 1 stores the squared error
    cluster_assment = np.zeros((m, 2))
    centroids = rand_cent(data_set, k)  # generate random centroids
    cluster_changed = True

    while cluster_changed:
        cluster_changed = False

        for i in range(m):  # assign each data point to its nearest centroid
            min_dist = np.inf
            min_index = -1

            # Find the minimum distance from data point i to the centroids
            for j in range(k):
                dist_ji = dist_eclud(centroids[j, :], data_set[i, :])
                if dist_ji < min_dist:
                    min_dist = dist_ji
                    min_index = j

            # The cluster index changed, i.e. the point no longer belongs to its previous centroid
            if cluster_assment[i, 0] != min_index:
                cluster_changed = True

            # Store the index of the nearest centroid and the squared distance
            cluster_assment[i, :] = min_index, min_dist ** 2

        print('centroids=', centroids)
        for cent in range(k):
            # Select all rows whose cluster index (column 0) equals cent.
            # nonzero() returns a tuple with one index array per dimension; [0] holds the row indices.
            pts_in_clust = data_set[np.nonzero(cluster_assment[:, 0] == cent)[0]]
            if len(pts_in_clust) > 0:  # leave the centroid unchanged if its cluster is empty
                centroids[cent, :] = np.mean(pts_in_clust, axis=0)

    return centroids, cluster_assment

def main():
    data_mat = np.array(load_data_set('TestSet.txt'))
    # print("data_mat=", data_mat)
    centroids = rand_cent(data_mat, 2)
    print("centroids=", centroids)

    my_centroids, cluster_assing = k_means(data_mat, 4)
    print('my_centroids=', my_centroids)
    print('cluster_assing=', cluster_assing)

if __name__ == '__main__':
    main()
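
Because K-means can converge to a local minimum, a common mitigation is to run it several times from different starting points and keep the run with the smallest sum of squared errors (SSE). The following is a minimal plain-Python sketch of that idea, not the article's implementation. The names k_means_once and k_means_restarts, the restarts and seed parameters, and the sample points are all hypothetical. As a further assumption, it starts from k randomly chosen data points rather than from random points inside the bounding box.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means_once(points, k, rng):
    """One run of the basic loop: assign to the nearest centroid, then re-average."""
    # Assumption: initialize from k distinct data points
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    changed = True
    while changed:
        changed = False
        # Assignment step: find the nearest centroid for every point
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            if assignment[i] != nearest:
                assignment[i] = nearest
                changed = True
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if the cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, sse


def k_means_restarts(points, k, restarts=10, seed=0):
    """Repeat k_means_once and keep the run with the smallest SSE."""
    rng = random.Random(seed)
    runs = (k_means_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[2])


# Hypothetical data: two well-separated groups of three points each
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, assignment, sse = k_means_restarts(points, k=2)
print(centroids, assignment, sse)
```

Restarts do not guarantee the global minimum; they only make a poor local minimum less likely, at the cost of proportionally more computation.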


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
