Python machine learning: K-Means clustering implementation



This article shares a Python implementation of K-Means clustering for your reference. The details are as follows:

1. K-Means clustering principle

The K-means algorithm is a typical distance-based clustering algorithm: distance is used as the similarity measure, i.e., the closer two objects are, the more similar they are. The basic idea is to pick k center points in the space, assign each object to the closest center, and then iteratively update the cluster centers until the best clustering result is obtained: each cluster is as compact as possible internally, and the clusters are separated from each other as much as possible.
The general process of the algorithm is as follows: (1) randomly select k points as the seed points (these k points do not necessarily belong to the dataset); (2) compute the distance between each data point and each of the k seed points, and assign the point to the closest seed; (3) recompute the coordinates of the k seed points (a simple and common method is to take the mean of the coordinates of the points assigned to each seed as its new coordinates); (4) repeat steps 2 and 3 until the seed coordinates no longer change or the iteration limit is reached. A minimal sketch of one such iteration is shown below.
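To make steps (2) and (3) concrete, here is a minimal sketch of a single assignment-and-update iteration on a made-up toy dataset (the points and seeds here are purely illustrative, not part of the article's data):

    import numpy as np

    # Toy data: six points on the plane and k = 2 seed points.
    X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                      [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
    seeds = np.array([[0.0, 0.0], [10.0, 10.0]])

    # Step 2: assign each point to the nearest seed (squared Euclidean distance).
    dists = ((X_toy[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)  # shape (6, 2)
    assign = dists.argmin(axis=1)                                       # [0 0 0 1 1 1]

    # Step 3: move each seed to the mean of the points assigned to it.
    seeds = np.array([X_toy[assign == j].mean(axis=0) for j in range(len(seeds))])
    print(seeds)  # approximately [[1.17 1.17] [8.5 8.5]]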

2. Data and the initial cluster centers

The data is in MATLAB format (.mat) and contains the variable X. The data file is ex7data2.mat (you can download it here); X is a 300×2 matrix, so the clustering is essentially performed on points in the plane coordinate system.
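If you want to verify the file before clustering, a quick check like the following can be run first (a sketch; adjust the path to wherever you saved the file):

    from scipy.io import loadmat

    data = loadmat('ex7data2.mat')  # the path here is an assumption; use your own location
    print([key for key in data.keys() if not key.startswith('__')])  # ['X']
    print(data['X'].shape)  # (300, 2): 300 points on the plane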

First, we construct the find_closest_centroids function: after setting the initial centers, it preliminarily determines, via the Euclidean distance, which center each row of X belongs to. Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat

def find_closest_centroids(X, centroids):
    m = X.shape[0]
    k = centroids.shape[0]  # number of categories to be clustered
    idx = np.zeros(m)
    for i in range(m):
        min_dist = 1000000  # start from a very large distance
        for j in range(k):
            dist = np.sum((X[i, :] - centroids[j, :]) ** 2)
            if dist < min_dist:  # record the current shortest distance and its center's index
                min_dist = dist
                idx[i] = j
    return idx

data = loadmat('d:/python/Python ml/ex7data2.mat')
X = data['X']
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, initial_centroids)
idx[0:3]

Here idx starts as a zero vector of length m (300 here), i.e., every point of X is initially assumed to belong to class 0. Then dist = np.sum((X[i, :] - centroids[j, :]) ** 2) is computed against each initial center to determine which class each point actually belongs to, and the zeros in idx are replaced accordingly. An equivalent vectorized version is sketched below.
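As an aside, the same assignment can be written without explicit loops using NumPy broadcasting. This is not the article's code, just an equivalent vectorized sketch:

    def find_closest_centroids_vec(X, centroids):
        # (m, 1, n) - (1, k, n) broadcasts to (m, k, n); summing over the last
        # axis yields an (m, k) matrix of squared distances to each centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)  # index of the nearest centroid for each point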

3. Iteratively search for the center positions and implement the K-means algorithm

The 300-dimensional vector idx obtained above records the class of each point in X. On this basis, the cluster center positions are adjusted iteratively to find the optimal centers.

def compute_centroids(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        indices = np.where(idx == i)
        # the mean of all points assigned to class i becomes the new center of that class
        centroids[i, :] = (np.sum(X[indices, :], axis=1) / len(indices[0])).ravel()
    return centroids

compute_centroids(X, idx, 3)
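A note on the update step: dividing the column sums by the cluster size is simply the per-cluster mean, so an equivalent and arguably clearer formulation (a sketch, not the original code; it assumes every cluster is non-empty) is:

    def compute_centroids_mean(X, idx, k):
        # new center of class i = mean of the points currently assigned to i
        return np.array([X[idx == i].mean(axis=0) for i in range(k)])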

Based on the above functions, we construct the run_k_means function to implement the K-means clustering algorithm, and then visualize the data by plotting each point at its coordinates, colored by its cluster assignment.

def run_k_means(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids
    for i in range(max_iters):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, idx, k)
    return idx, centroids

idx, centroids = run_k_means(X, initial_centroids, 10)
cluster1 = X[np.where(idx == 0)[0], :]  # the points in X assigned to the first cluster
cluster2 = X[np.where(idx == 1)[0], :]
cluster3 = X[np.where(idx == 2)[0], :]
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(cluster1[:, 0], cluster1[:, 1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:, 0], cluster2[:, 1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:, 0], cluster3[:, 1], s=30, color='b', label='Cluster 3')
ax.legend()
plt.show()

The figure is as follows:

[Figure: scatter plot of the 300 points, colored by the three clusters]

4. Initial centroid settings

The initial centers used above, [3, 3], [6, 2], [8, 5], were set in advance by hand; they generate idx (the vector of each point's class assignment), which is the basis for the subsequent K-means clustering. In practice, data with more than two dimensions cannot be displayed on plane coordinate axes, so it is difficult to choose good initial centers by inspection at the outset. In addition, the choice of initial centers may affect the convergence of the algorithm. Therefore, we construct an initialization function to set the initial centroids in a better way.

def init_centroids(X, k):
    m, n = X.shape
    centroids = np.zeros((k, n))  # initialize a zero matrix
    idx = np.random.randint(0, m, k)  # k random integers in [0, m)
    for i in range(k):
        centroids[i, :] = X[idx[i], :]
    return centroids

init_centroids(X, 3)

The initial centers generated here are three points drawn at random from the data X. On this basis, set initial_centroids = init_centroids(X, 3) and run the earlier code again.
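Putting it together, the rerun amounts to the following (10 iterations, as before):

    initial_centroids = init_centroids(X, 3)  # three points sampled at random from X
    idx, centroids = run_k_means(X, initial_centroids, 10)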

That is all the content of this article. I hope it is helpful for your learning.
