Machine Learning (II) -- K-means Clustering Algorithm


Recently I have been reading the book Machine Learning in Action, because I really want to learn more about machine learning algorithms and also want to learn Python; on a friend's recommendation I chose this book. Before writing this article I already had some understanding of FCM (fuzzy C-means), so the K-means algorithm feels unaccountably familiar. Today I will work through the K-means clustering algorithm with you.

I. Overview of K-means clustering

1. Clustering

A "class" refers to a collection that has similarities. Clustering refers to dividing the data set into classes so that the data within the class is the most similar and the data similarity between the categories is as large as possible. Cluster analysis is based on the similarity, the data set of clustering classification, belongs to unsupervised learning.

2. Unsupervised learning and supervised learning

The previous article examined KNN. Unlike KNN, K-means clustering belongs to unsupervised learning. So what is the difference between supervised and unsupervised learning? In supervised learning we know in advance what is to be learned from the objects (the data carry target labels), while unsupervised learning needs no predefined target: the algorithm finds the common features of the data on its own. Compare classification and clustering, for example: in classification the categories to be obtained are known in advance, whereas clustering divides objects into different clusters based only on similarity.

3. K-means

The K-means algorithm is a simple iterative clustering algorithm that uses distance as its similarity measure and finds K classes in a given data set. The center of each class is the mean of all the points in that class, and each class is described by its cluster center. For a given data set of n d-dimensional data points x and a number of classes K to divide into, with Euclidean distance as the similarity measure, the clustering goal is to minimize the within-cluster sum of squared errors:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k denotes the k-th cluster and \mu_k its center.

Combining the least-squares and Lagrange principles, the cluster center that minimizes this objective is the average of the data points in the corresponding class; for the algorithm to converge, the cluster centers should change as little as possible over the final iterations.
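As a quick check of that claim, set the derivative of J with respect to a single center \mu_k to zero; the stationary point is exactly the class mean:

\frac{\partial J}{\partial \mu_k} = -2 \sum_{x_i \in C_k} (x_i - \mu_k) = 0
\quad \Longrightarrow \quad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i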

4. Algorithmic flow

K-means is an iterative process; the algorithm proceeds in four steps (a minimal code sketch follows the list):

1) Select K objects in the data space as the initial centers, each object representing one cluster center;

2) For each data object in the sample, compute its Euclidean distance to each cluster center and, by the nearest-distance criterion, assign it to the class of the nearest (most similar) cluster center;

3) Update the cluster centers: take the mean of all objects in each class as the new cluster center of that class, and compute the value of the objective function;

4) Check whether the cluster centers and the objective value have changed; if not, output the result; if they have changed, return to step 2).
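The four steps map almost line for line onto code. Here is a minimal NumPy sketch of my own (the function name kmeans and its parameters are my choices, not from any book or library); it uses the standard mean update:

import numpy as np

def kmeans(data, k, max_iter=300, tol=1e-4):
    # data: (n, d) array of n d-dimensional points; k: number of clusters
    n, d = data.shape
    # 1) pick k initial centers uniformly inside the data's bounding box
    lo, hi = data.min(axis=0), data.max(axis=0)
    centers = lo + (hi - lo) * np.random.rand(k, d)
    for _ in range(max_iter):
        # 2) assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) update each center to the mean of the points assigned to it
        new_centers = centers.copy()
        for i in range(k):
            members = data[labels == i]
            if len(members) > 0:
                new_centers[i] = members.mean(axis=0)
        # 4) stop once the centers have (almost) stopped moving
        if np.abs(new_centers - centers).max() < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

A call such as centers, labels = kmeans(data, 4) returns the final centers together with each point's cluster index.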

This is illustrated by the following example:

[Figures 1-4: four snapshots of one K-means run]

Figure 1: the given data set;

Figure 2: initialize the cluster centers with k = 5, making sure the centers lie within the data space;

Figure 3: partition the data by computing the similarity measure between each object and each cluster center;

Figure 4: update the cluster centers, taking the mean of the data within each class as its new center.

Finally, checking the termination condition determines when the algorithm ends and ensures that it has converged.

II. Python implementation

First of all, note that I am using Python 3.4.3, which still differs somewhat from 2.7. The NumPy and matplotlib libraries are used here.

Regrettably, after spending a long time installing matplotlib, problems kept appearing, so this part can only be completed later.
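In the meantime, for anyone whose matplotlib does work, here is an untested sketch of the same experiment as the MATLAB version in the next section (it reuses the kmeans sketch from the overview; the cluster parameters mirror the MATLAB code):

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection on older matplotlib

# four Gaussian clusters, mirroring the MATLAB experiment below
cov = np.diag([0.23, 0.87, 0.56])
means = [[0, 0, 0], [1.25, 1.25, 1.25], [-1.25, 1.25, -1.25], [1.5, 1.5, 1.5]]
data = np.vstack([np.random.multivariate_normal(m, cov, 100) for m in means])

centers, labels = kmeans(data, 4)  # kmeans() is the NumPy sketch from the overview

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for i, c in enumerate(['r', 'g', 'b', 'y']):
    pts = data[labels == i]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=c, marker='o')
ax.set_title('Clustering Results')
plt.show()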

III. MATLAB implementation

I have used MATLAB before to optimize some clustering algorithms, so naturally it feels handier than Python. Implementing the steps of the algorithm directly, here is the program:

%% k-means
clear all; clc;

%% Construct random data
% First class of data
mu1 = [0 0 0];
S1 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data1 = mvnrnd(mu1, S1, 100);   % generate Gaussian-distributed data
% Second class of data
mu2 = [1.25 1.25 1.25];
S2 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data2 = mvnrnd(mu2, S2, 100);
% Third class of data
mu3 = [-1.25 1.25 -1.25];
S3 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data3 = mvnrnd(mu3, S3, 100);
% Fourth class of data
mu4 = [1.5 1.5 1.5];
S4 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data4 = mvnrnd(mu4, S4, 100);

% Display the raw data
figure;
plot3(data1(:,1), data1(:,2), data1(:,3), '+');
title('Raw Data');
hold on;
plot3(data2(:,1), data2(:,2), data2(:,3), 'r+');
plot3(data3(:,1), data3(:,2), data3(:,3), 'g+');
plot3(data4(:,1), data4(:,2), data4(:,3), 'y+');
grid on;

data = [data1; data2; data3; data4];
[row, col] = size(data);
K = 4;              % number of clusters
max_iter = 300;     % maximum number of iterations
min_impro = 0.1;    % minimum improvement (convergence threshold)

%% Initialize the cluster centers randomly within the data range
center = zeros(K, col);
U = zeros(K, col);
mi = zeros(col, 1);
ma = zeros(col, 1);
for i = 1:col
    mi(i,1) = min(data(:,i));
    ma(i,1) = max(data(:,i));
    center(:,i) = ma(i,1) - (ma(i,1) - mi(i,1)) * rand(K,1);
end

%% Start iterating
for o = 1:max_iter
    % Compute Euclidean distances with the norm function
    for i = 1:K
        dist{i} = [];
        for j = 1:row
            dist{i} = [dist{i}; data(j,:) - center(i,:)];
        end
    end
    minDis = zeros(row, K);
    for i = 1:row
        tem = [];
        for j = 1:K
            tem = [tem norm(dist{j}(i,:))];
        end
        [~, index] = min(tem);
        minDis(i, index) = norm(dist{index}(i,:));
    end
    % Update the cluster centers (each member weighted by its distance)
    for i = 1:K
        for j = 1:col
            U(i,j) = sum(minDis(:,i) .* data(:,j)) / sum(minDis(:,i));
        end
    end
    % Judge convergence: stop when the centers barely move
    if max(max(abs(U - center))) < min_impro
        break;
    end
    center = U;
end

%% Assign each point to the cluster of its nearest final center
class = [];
for i = 1:row
    d = [];
    for j = 1:K
        d = [d norm(data(i,:) - U(j,:))];
    end
    [~, index] = min(d);
    class = [class; data(i,:) index];
end

%% Show the final result
figure;
hold on;
title('Clustering Results');
for i = 1:row
    if class(i,4) == 1
        plot3(class(i,1), class(i,2), class(i,3), 'ro');
    elseif class(i,4) == 2
        plot3(class(i,1), class(i,2), class(i,3), 'go');
    elseif class(i,4) == 3
        plot3(class(i,1), class(i,2), class(i,3), 'bo');
    else
        plot3(class(i,1), class(i,2), class(i,3), 'yo');
    end
end
grid on;

The final results are shown in Figure 5 and Figure 6:

Figure 5: Raw data. Figure 6: Clustering results.

Summary: quite a few problems actually came up during debugging. The similarity measure is still the Euclidean distance. I used to compute it directly from the formula every time, but the Euclidean distance is simply the 2-norm of a vector. The 2-norm is a unitarily invariant norm (which is why, for a matrix, the 2-norm equals its largest singular value), so in the program it can be computed directly with the norm function instead.
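The same shortcut exists in NumPy: np.linalg.norm of a vector is its 2-norm, i.e. exactly the Euclidean distance (for a matrix argument, np.linalg.norm(A, 2) returns the largest singular value, matching the remark above):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

d1 = np.sqrt(np.sum((a - b) ** 2))  # Euclidean distance written out from the formula
d2 = np.linalg.norm(a - b)          # the same value as a 2-norm
assert np.isclose(d1, d2)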

From the figures, the clustering effect looks quite good. To verify further, one could use the misclassification rate, or external criteria such as NMI and ARI, to score the clustering result; I did not compute these, since for this test the data simply looks good from the pictures. Since I did want more verification, I imported the frequently used wine data set from the UCI repository and ran the algorithm on it; the result is as follows:

This result can only be called very unsatisfactory. As for the reason, it may be that the data has relatively many dimensions, or it may be the performance of the algorithm itself... I dare not speculate; I will try again tomorrow.
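For when I do retry it: scikit-learn (an extra library, not used anywhere above) ships both of the external criteria mentioned earlier, and its load_wine bundles the same UCI wine data; a sketch:

from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

wine = load_wine()                      # the UCI wine data: 178 samples, 13 features
centers, labels = kmeans(wine.data, 3)  # kmeans() is the earlier NumPy sketch
# both scores are invariant to how the clusters happen to be numbered
print('ARI:', adjusted_rand_score(wine.target, labels))
print('NMI:', normalized_mutual_info_score(wine.target, labels))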

Of course, the algorithm's own performance is certainly part of the reason; otherwise FCM and a whole series of other algorithms would not exist. I will discuss this in detail later.

It is really a little late today; I suddenly felt like writing an article and it has taken until now. There are still many shortcomings to be improved...

