Python Machine Learning: K-Means Clustering Implementation

Source: Internet
Author: User


This article shares a Python implementation of K-means clustering for your reference. The specific content is as follows:

1. K-Means Clustering Principle

The K-means algorithm is a typical distance-based clustering algorithm: distance serves as the similarity measure, i.e., the closer two objects are, the more similar they are considered. The basic idea is to place k center points in the space and assign every object to its closest center, then update each cluster center iteratively until the best clustering result is obtained: each cluster is internally as compact as possible, and the clusters are separated from one another as much as possible.
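Formally, k-means seeks assignments c(i) and centers μ_j that minimize the distortion J = Σ_i ‖x_i − μ_{c(i)}‖², the total squared distance from each point to its assigned center; the assignment and update steps described next never increase J, which is why the iteration converges (at least to a local optimum).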
The general process of the algorithm is as follows: (1) randomly select k points as the seed points (these k points do not necessarily belong to the dataset); (2) compute the distance between each data point and each of the k seed points, and assign every point to its nearest seed; (3) recompute the coordinates of the k seed points (a simple and common method is to take the mean of the coordinates of the points assigned to each seed as its new coordinates); (4) repeat steps 2 and 3 until the seed points no longer move or the iteration limit is reached.

2. The Data and Its Initial Cluster Centers

The data is in MATLAB format (.mat) and contains a variable X. X is a 300×2 matrix, so the clustering is essentially performed on points in the plane coordinate system.

First, we construct the centroid-assignment function: the initial centers are set by hand for now, and each row of X is preliminarily assigned to its nearest center by squared Euclidean distance. Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat

def find_closest_centroids(X, centroids):
    m = X.shape[0]
    k = centroids.shape[0]  # number of clusters
    idx = np.zeros(m)

    for i in range(m):
        min_dist = 1000000  # a large starting value, overwritten by the first real distance
        for j in range(k):
            dist = np.sum((X[i, :] - centroids[j, :]) ** 2)
            if dist < min_dist:  # record the current shortest distance and its centroid index
                min_dist = dist
                idx[i] = j

    return idx

data = loadmat(r'd:\python\Python ml\ex7data2.mat')
X = data['X']
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, initial_centroids)
idx[0:3]

Here, we first create idx, a zero vector of length m (300 here); in other words, every sample of X is initially assumed to belong to class 0. Then, for each sample, dist = np.sum((X[i, :] - centroids[j, :]) ** 2) is computed against each initial center to determine which class the sample actually belongs to, and the corresponding 0 in idx is replaced by that class index.
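As an aside, the same assignment step can be written without the double loop by letting NumPy broadcasting compute every point-to-centroid distance at once. This is only an equivalent sketch, not the article's code, and the name find_closest_centroids_vec is mine:

import numpy as np

def find_closest_centroids_vec(X, centroids):
    # diffs has shape (m, k, n): every sample paired with every centroid
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)  # (m, k) squared Euclidean distances
    return np.argmin(sq_dists, axis=1)     # nearest-centroid index per sample

On the 300×2 data this returns the same assignments as the looped version, except that argmin yields integers where the loop stores floats.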

3. Iteratively Searching for the Center Positions and Implementing the K-Means Algorithm

The 300-dimensional vector idx obtained above determines the classification of each sample in X. On this basis, the cluster center positions are adjusted repeatedly to find the optimal centers.

def compute_centroids(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))

    for i in range(k):
        indices = np.where(idx == i)
        # the mean of all points assigned to cluster i becomes the new center
        centroids[i, :] = (np.sum(X[indices, :], axis=1) / len(indices[0])).ravel()

    return centroids

compute_centroids(X, idx, 3)
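A side note: the axis=1 and .ravel() above are only needed because np.where returns a tuple used for fancy indexing. An equivalent and arguably clearer form, assuming every cluster receives at least one point (otherwise the mean of an empty slice is NaN), is the boolean-mask version below; the name compute_centroids_masked is mine:

import numpy as np

def compute_centroids_masked(X, idx, k):
    # mean of the rows of X assigned to each cluster i; assumes no empty clusters
    return np.array([X[idx == i].mean(axis=0) for i in range(k)])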

Based on the above functions, we construct the run_k_means function to implement the K-means clustering algorithm. The data is then visualized by plotting each sample at its coordinates, colored according to its cluster.

def run_k_means(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids

    for i in range(max_iters):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, idx, k)

    return idx, centroids

idx, centroids = run_k_means(X, initial_centroids, 10)

cluster1 = X[np.where(idx == 0)[0], :]  # the samples assigned to the first cluster
cluster2 = X[np.where(idx == 1)[0], :]
cluster3 = X[np.where(idx == 2)[0], :]

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(cluster1[:, 0], cluster1[:, 1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:, 0], cluster2[:, 1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:, 0], cluster3[:, 1], s=30, color='b', label='Cluster 3')
ax.legend()
plt.show()
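A fixed max_iters = 10 happens to be enough here, but run_k_means never checks convergence. One way to monitor it, sketched here as an addition rather than taken from the article, is the distortion, i.e. the total squared distance from each sample to its assigned center, which k-means never increases between iterations; the helper name distortion is mine:

import numpy as np

def distortion(X, idx, centroids):
    # total squared distance from each sample to its assigned centroid
    assigned = centroids[idx.astype(int)]  # (m, n): the centroid for each sample
    return np.sum((X - assigned) ** 2)

Calling distortion(X, idx, centroids) after each update and stopping once the value no longer decreases gives a data-driven alternative to a fixed iteration count.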

The figure is as follows:

[Figure: scatter plot of the 300 samples, with Cluster 1 in red, Cluster 2 in green, and Cluster 3 in blue]

4. Initializing the Centroids

The initial centers above ([3, 3], [6, 2], [8, 5]) were set in advance by hand; they generate idx (the vector of each sample's classification), which is the basis of the subsequent k-means iterations. In practice, data with more than two dimensions cannot be displayed on plane coordinate axes, so it is difficult to pick good initial centers by inspection at the start. In addition, the choice of initial centers may also affect the convergence of the algorithm. We therefore construct an initialization function to set the initial centroids in a better way.

def init_centroids(X, k):
    m, n = X.shape
    centroids = np.zeros((k, n))  # k x n zero matrix to hold the centers
    idx = np.random.randint(0, m, k)  # k random integers in [0, m)

    for i in range(k):
        centroids[i, :] = X[idx[i], :]

    return centroids

init_centroids(X, 3)

The initial centers generated here are simply three rows drawn at random from X (note that np.random.randint may in principle draw the same row twice). On this basis, set initial_centroids = init_centroids(X, 3) and rerun the earlier code.
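Since each random initialization can end in a different local optimum, a common remedy, not covered in the original article, is to run the algorithm several times and keep the result with the lowest distortion (using the distortion helper sketched in Section 3). A minimal sketch, assuming the functions above are already defined in the same session:

import numpy as np

best_cost, best_idx, best_centroids = np.inf, None, None
for trial in range(10):  # 10 random restarts; adjust as needed
    centroids0 = init_centroids(X, 3)
    idx_t, centroids_t = run_k_means(X, centroids0, 10)
    cost_t = distortion(X, idx_t, centroids_t)
    if cost_t < best_cost:  # keep the best clustering seen so far
        best_cost, best_idx, best_centroids = cost_t, idx_t, centroids_t

scikit-learn's sklearn.cluster.KMeans does the same thing through its n_init parameter, which makes it a convenient cross-check for this hand-rolled version.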

That is all the content of this article. I hope it is helpful for your study.
