Recently in the "machine learning Combat" This book, because I really want to learn more about machine learning algorithms, and want to learn python, in the recommendation of a friend chose this book to learn, before writing this article to FCM have a certain understanding, so the K mean algorithm has a nameless intimacy, Today, I'm working with you to learn K-means clustering algorithm.
I. Overview of K-means clustering
1. Clustering
A "class" refers to a collection that has similarities. Clustering refers to dividing the data set into classes so that the data within the class is the most similar and the data similarity between the categories is as large as possible. Cluster analysis is based on the similarity, the data set of clustering classification, belongs to unsupervised learning.
2. Unsupervised learning and supervised learning
The previous article covered KNN; unlike KNN, K-means clustering is unsupervised learning. So what is the difference between supervised and unsupervised learning? In supervised learning we know in advance what the algorithm should learn from the data, while in unsupervised learning there is no predefined target: the algorithm looks for common structure in the data on its own. Compare classification with clustering: classification knows the categories in advance, whereas clustering groups objects into clusters purely by similarity.
3. K-means
The K-means algorithm is a simple iterative clustering algorithm that uses distance as its similarity measure, discovering K classes in a given data set. The center of each class is the mean of all points in that class, and each class is described by its cluster center. Given a data set of n d-dimensional points x_1, ..., x_n and a number of classes K to divide them into, with Euclidean distance as the similarity measure, the clustering goal is to minimize the within-cluster sum of squared distances:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2

where C_k is the k-th cluster and \mu_k its center.
Combining least squares with the Lagrange stationarity condition, setting \partial J / \partial \mu_k = 0 gives \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i, i.e. the cluster center is the mean of the data points in the corresponding class. For the algorithm to converge, the cluster centers should change as little as possible during the iterations, ultimately remaining fixed.
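As a concrete illustration of the objective (my own NumPy sketch, not from the book; kmeans_objective and the toy arrays are made-up names for the example):

import numpy as np

def kmeans_objective(data, centers, labels):
    # sum of squared Euclidean distances from each point to its assigned center
    diffs = data - centers[labels]   # (n, d) difference vectors
    return np.sum(diffs ** 2)

# toy check: two tight clusters give a small objective value
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers = np.array([[0.05, 0.1], [5.1, 4.95]])
labels = np.array([0, 0, 1, 1])
print(kmeans_objective(data, centers, labels))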
4. Algorithmic flow
K-means is an iterative process, and the algorithm has four steps:
1) Select K objects from the data space as initial centers, each representing one cluster center;
2) For each data object in the sample, compute its Euclidean distance to each cluster center and, by the nearest-distance criterion, assign it to the class of its nearest (most similar) cluster center;
3) Update the cluster centers: take the mean of all objects in each class as the new cluster center of that class, and compute the value of the objective function;
4) Check whether the cluster centers and the objective value have changed; if not, output the result; if they have changed, return to step 2).
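Those four steps translate almost line for line into NumPy. The following is a minimal sketch of my own (not the book's code; kmeans is a hypothetical helper name):

import numpy as np

def kmeans(data, k, max_iter=300):
    # 1) pick k data points at random as the initial cluster centers
    centers = data[np.random.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) update each center to the mean of the points assigned to it
        new_centers = np.array([data[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # 4) stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels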
The process is also illustrated by the following example:
[Figures 1-4: step-by-step illustration of K-means on an example data set]
Figure 1: a data set is given;
Figure 2: initialize the cluster centers with k = 5, making sure each center lies within the data space;
Figure 3: compute the similarity measure between each object and each cluster center, and partition the data accordingly;
Figure 4: update each cluster center to the mean of the data within its class.
Repeating this process until the cluster centers no longer change determines when the algorithm ends, which guarantees its convergence.
II. Python implementation
First of all, note that I am using Python 3.4.3, which still differs in places from 2.7. The NumPy and matplotlib libraries are used here.
Unfortunately, after struggling with the matplotlib installation for a long time, problems kept appearing, so this part will have to be completed later.
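In the meantime, here is a stopgap demo that needs only NumPy (no matplotlib): it runs the kmeans sketch from section I on two synthetic Gaussian blobs whose parameters I made up for illustration:

import numpy as np

np.random.seed(0)                                     # reproducible demo
blob1 = 0.4 * np.random.randn(100, 2) + [0.0, 0.0]    # cluster around the origin
blob2 = 0.4 * np.random.randn(100, 2) + [3.0, 3.0]    # cluster around (3, 3)
data = np.vstack([blob1, blob2])

centers, labels = kmeans(data, k=2)     # kmeans() as sketched in section I
print("centers:\n", centers)            # should land near (0, 0) and (3, 3)
print("cluster sizes:", np.bincount(labels))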
III. MATLAB implementation
I have previously used MATLAB for some clustering-algorithm optimization work, so it naturally feels more comfortable to me than Python. Following the steps of the algorithm, here is the program:
%% k-means
clear all; clc;

%% construct random data: four 3-D Gaussian clusters
mu1 = [0 0 0];
S1 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data1 = mvnrnd(mu1, S1, 100);    % generate Gaussian-distributed data

% second class of data
mu2 = [1.25 1.25 1.25];
S2 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data2 = mvnrnd(mu2, S2, 100);

% third class of data
mu3 = [-1.25 1.25 -1.25];
S3 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data3 = mvnrnd(mu3, S3, 100);

% fourth class of data
mu4 = [1.5 1.5 1.5];
S4 = [0.23 0 0; 0 0.87 0; 0 0 0.56];
data4 = mvnrnd(mu4, S4, 100);

%% display the raw data
figure;
plot3(data1(:,1), data1(:,2), data1(:,3), '+');
title('Raw Data');
hold on;
plot3(data2(:,1), data2(:,2), data2(:,3), 'r+');
plot3(data3(:,1), data3(:,2), data3(:,3), 'g+');
plot3(data4(:,1), data4(:,2), data4(:,3), 'y+');   % (fixed: third column was data3)
grid on;

data = [data1; data2; data3; data4];
[row, col] = size(data);
K = 4;
max_iter = 300;     % maximum number of iterations
min_impro = 0.1;    % minimum improvement (stopping threshold)
display = 1;        % display flag

center = zeros(K, col);
U = zeros(K, col);

%% initialize cluster centers randomly within the data range
mi = zeros(col, 1);
ma = zeros(col, 1);
for i = 1:col
    mi(i,1) = min(data(:,i));
    ma(i,1) = max(data(:,i));
    center(:,i) = ma(i,1) - (ma(i,1) - mi(i,1)) * rand(K,1);
end

%% start iterating
for o = 1:max_iter
    % compute the difference vectors to each center
    for i = 1:K
        dist{i} = [];
        for j = 1:row
            dist{i} = [dist{i}; data(j,:) - center(i,:)];
        end
    end
    % assign each point to its nearest center (Euclidean distance via norm)
    mindis = zeros(row, K);
    for i = 1:row
        tem = [];
        for j = 1:K
            tem = [tem norm(dist{j}(i,:))];
        end
        [nmin, index] = min(tem);
        mindis(i, index) = norm(dist{index}(i,:));
    end
    % update the cluster centers
    % (note: this is a distance-weighted mean, not the plain mean of step 3)
    for i = 1:K
        for j = 1:col
            U(i,j) = sum(mindis(:,i) .* data(:,j)) / sum(mindis(:,i));
        end
    end
    if display
        % display flag reserved for printing iteration info
    end
    % convergence check
    if o > 1
        if max(max(abs(U - center))) < min_impro
            break;
        else
            center = U;
        end
    end
end

%% assign the final class labels
class = [];
for i = 1:row
    dist = [];
    for j = 1:K
        dist = [dist norm(data(i,:) - U(j,:))];
    end
    [nmin, index] = min(dist);
    class = [class; data(i,:) index];
end

%% show the final results
[m, n] = size(class);
figure; title('Clustering Results'); hold on;
for i = 1:row
    if class(i,4) == 1
        plot3(class(i,1), class(i,2), class(i,3), 'ro');
    elseif class(i,4) == 2
        plot3(class(i,1), class(i,2), class(i,3), 'go');
    elseif class(i,4) == 3
        plot3(class(i,1), class(i,2), class(i,3), 'bo');
    else
        plot3(class(i,1), class(i,2), class(i,3), 'yo');
    end
end
grid on;
The final results are shown in Figures 5 and 6:
Figure 5: raw data; Figure 6: clustering results
Summary: Quite a few problems came up during debugging. The similarity measure is still the Euclidean distance. I used to compute it directly from the formula, but the Euclidean distance is simply the 2-norm; the 2-norm is a unitarily invariant norm, and for a matrix the 2-norm is its largest singular value, so in the solution process the norm function can be used directly to simplify the code.
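A quick NumPy check of that remark (my own illustration): for a vector the 2-norm is exactly the Euclidean length, and for a matrix the 2-norm equals the largest singular value:

import numpy as np

x = np.array([3.0, 4.0])
print(np.linalg.norm(x))                      # 5.0: the Euclidean (2-norm) length

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(A, 2))                   # matrix 2-norm
print(np.linalg.svd(A, compute_uv=False)[0])  # largest singular value: same number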
The results clearly show that the algorithm has some clustering effect. For further verification, common criteria such as MCR (misclassification rate), NMI, or ARI can be used to measure the quality of the clustering result. I chose MCR; the code is as follows:
%% misclassification rate (MCR)
a = reshape(class(:,4), 1, row);    % labels assigned by K-means
b = [ones(1,100), 2*ones(1,100), 3*ones(1,100), 4*ones(1,100)];   % true labels
sum_err = 0;
for i = 1:row
    if a(1,i) ~= b(1,i)
        sum_err = sum_err + 1;
    end
end
mcr = sum_err / row;
fprintf('mcr = %f\n', mcr);
% note: this direct comparison assumes the cluster numbering happens to
% match the true labels; in general the labels must first be aligned.
The average calculated MCR is about 0.53, which indicates that the error rate is still quite large and the clustering effect is far from ideal. The reason: although the algorithm converges, it converges only to a local minimum rather than the global minimum, so bisecting K-means can be introduced to improve the algorithm.
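For reference, a rough sketch of that idea (my own illustration, reusing the kmeans function sketched in section I): bisecting K-means starts with a single cluster and repeatedly splits the cluster with the largest sum of squared errors using plain 2-means until K clusters remain:

import numpy as np

def bisecting_kmeans(data, k):
    clusters = [data]    # start with everything in one cluster
    while len(clusters) < k:
        # pick the cluster with the largest within-cluster sum of squared errors
        sse = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        worst = int(np.argmax(sse))
        target = clusters.pop(worst)
        # split it in two with plain 2-means (kmeans() from section I)
        _, labels = kmeans(target, 2)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
        # (a production version would guard against empty splits)
    return clusters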
In addition, the FCM (fuzzy C-means) algorithm can also be regarded as an optimization of this algorithm to some extent.
I then imported the wine data set from the UCI repository for testing, and the results were very unsatisfactory. As for the reason, the performance of the algorithm itself accounts for part of it, and the relatively high dimensionality of the data may be another factor... I don't dare to speculate further; I will verify it slowly later...