Discussion on cluster analysis
Chapter 1. Introduction
Chapter 2. Preliminaries
Chapter 3. Direct Clustering Method
Chapter 4. K-means
Chapter 5. DBSCAN
Chapter 6. OPTICS
Chapter 7. Evaluating the Results of Cluster Analysis
Chapter 8. Data Scaling
A New Clustering Algorithm Published in Science
This series is a set of reading notes on the third chapter of Zhou Zhaotao's Master's thesis "Text Clustering Analysis effect evaluation and text expression study" (Institute of Computing Technology, CAS). I hope it is of some help to you.
The definitions of precision and recall used in this article can be found at http://blog.csdn.net/itplus/article/details/10862059
Some of the specific formulas for data scaling used in this article can be found at http://blog.csdn.net/itplus/article/details/10088101
In June 2014, Alex Rodriguez and Alessandro Laio published an article in Science entitled "Clustering by fast search and find of density peaks", which offers a new way of thinking about the design of clustering algorithms. Although the article drew questions from many readers after it appeared, the basic idea of the new algorithm is novel, simple, and clear, and it is well worth studying. The core of the algorithm lies in how it characterizes cluster centers: a center is a point with a high local density that also lies relatively far from any point of higher density. This article introduces the principle of the algorithm in detail and discusses some of its details.
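As a compact restatement of that characterization, the two quantities that the algorithm computes for every data point can be written as below. This is only a sketch in assumed notation: d_{ij} is the distance between points i and j, d_c is the cutoff distance, and the Gaussian form of the density is the variant used by the attached Matlab program (the cut-off form appears there as commented-out code).

```latex
% Local density of point i: cut-off kernel, or the Gaussian kernel used in the listing
\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad
\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}
\qquad \text{or} \qquad
\rho_i = \sum_{j \neq i} \exp\!\left[-\left(\frac{d_{ij}}{d_c}\right)^{2}\right]

% Distance from point i to the nearest point of higher density
\delta_i = \min_{j \,:\, \rho_j > \rho_i} d_{ij}

% Convention for the point of highest density, which has no denser neighbour
\delta_i = \max_{j} d_{ij}

% Quantity computed (but not used further) in the listing
\gamma_i = \rho_i \, \delta_i
```

Cluster centers are then the points for which both \rho_i and \delta_i are large, which is why they stand out in the upper right of a plot of \delta against \rho (the "decision graph"). Note that the listing assigns the highest-density point the largest \delta found among the other points, a slight variation on the convention above.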
Finally, the Matlab sample program provided by the authors in the supplementary material is attached below, with explanatory comments added.
clear all
close all
disp('The only input needed is a distance matrix file')
disp('The format of this file should be: ')
disp('Column 1: id of element i')
disp('Column 2: id of element j')
disp('Column 3: dist(i,j)')

%% Read the data from file
mdist=input('name of the distance matrix file (with single quotes)?\n');
disp('Reading input distance matrix')
xx=load(mdist);
ND=max(xx(:,2));
NL=max(xx(:,1));
if (NL>ND)
  ND=NL;   %% make sure ND is the larger of the maxima of columns 1 and 2; it is used as the total number of data points
end

N=size(xx,1);   %% length of the first dimension of xx, i.e. the number of lines in the file (the total number of distances)

%% Initialize dist to zero
for i=1:ND
  for j=1:ND
    dist(i,j)=0;
  end
end

%% Fill the dist array from xx. The input holds only 0.5*ND*(ND-1) values;
%% they are mirrored here into a full matrix. Diagonal elements are not considered.
for i=1:N
  ii=xx(i,1);
  jj=xx(i,2);
  dist(ii,jj)=xx(i,3);
  dist(jj,ii)=xx(i,3);
end

%% Determine dc
percent=2.0;
fprintf('average percentage of neighbours (hard coded): %5.6f\n', percent);

position=round(N*percent/100);   %% round is the rounding function
sda=sort(xx(:,3));               %% sort all distance values in ascending order
dc=sda(position);

%% Compute the local density rho (using the Gaussian kernel)
fprintf('Computing Rho with gaussian kernel of radius: %12.6f\n', dc);

%% Initialize the rho value of every data point to zero
for i=1:ND
  rho(i)=0.;
end

% Gaussian kernel
for i=1:ND-1
  for j=i+1:ND
    rho(i)=rho(i)+exp(-(dist(i,j)/dc)*(dist(i,j)/dc));
    rho(j)=rho(j)+exp(-(dist(i,j)/dc)*(dist(i,j)/dc));
  end
end

% "Cut off" kernel
%for i=1:ND-1
%  for j=i+1:ND
%    if (dist(i,j)<dc)
%       rho(i)=rho(i)+1.;
%       rho(j)=rho(j)+1.;
%    end
%  end
%end

%% Take the maximum of each column, then the maximum of those: the largest of all distance values
maxd=max(max(dist));

%% Sort rho in descending order; ordrho holds the ordering
[rho_sorted,ordrho]=sort(rho,'descend');

%% Handle the data point with the largest rho value
delta(ordrho(1))=-1.;
nneigh(ordrho(1))=0;

%% Generate the delta and nneigh arrays
for ii=2:ND
  delta(ordrho(ii))=maxd;
  for jj=1:ii-1
    if (dist(ordrho(ii),ordrho(jj))<delta(ordrho(ii)))
      delta(ordrho(ii))=dist(ordrho(ii),ordrho(jj));
      nneigh(ordrho(ii))=ordrho(jj);
      %% record the index ordrho(jj) of the point that, among those with a larger rho, is nearest to ordrho(ii)
    end
  end
end

%% Generate the delta value of the data point with the largest rho
delta(ordrho(1))=max(delta(:));

%% Decision graph
disp('Generated file:DECISION GRAPH')
disp('column 1:Density')
disp('column 2:Delta')

fid = fopen('DECISION_GRAPH', 'w');
for i=1:ND
  fprintf(fid, '%6.2f %6.2f\n', rho(i),delta(i));
end
fclose(fid);   % added: close the output file

%% Select a rectangle enclosing the cluster centers
disp('Select a rectangle enclosing cluster centers')

%% Each computer has only one root object, the screen, and its handle is always 0
%% >> scrsz = get(0,'ScreenSize')
%% scrsz =
%%        1     1  1280   800
%% 1280 and 800 are the screen resolution you have set: scrsz(4) is 800, scrsz(3) is 1280
scrsz = get(0,'ScreenSize');

%% The position is specified by hand, which does not feel very automatic :-)
figure('Position',[6 72 scrsz(3)/4. scrsz(4)/1.3]);

%% ind and gamma are not used later on
for i=1:ND
  ind(i)=i;
  gamma(i)=rho(i)*delta(i);
end

%% Use rho and delta to draw the so-called "decision graph"
subplot(2,1,1)
tt=plot(rho(:),delta(:),'o','MarkerSize',5,'MarkerFaceColor','k','MarkerEdgeColor','k');
title ('Decision Graph','FontSize',15.0)
xlabel ('\rho')
ylabel ('\delta')

subplot(2,1,1)
rect = getrect(1);   % getrect belongs to the Image Processing Toolbox
%% getrect selects a rectangular region with the mouse; rect holds the coordinates (x,y)
%% of the lower-left corner of the rectangle, followed by its width and height
rhomin=rect(1);
deltamin=rect(2);   %% the authors acknowledged that rect(4) was an error; it has been changed from 4 to 2!

%% Initialize the number of clusters
NCLUST=0;

%% cl is the array of membership flags: cl(i)=j means data point i belongs to cluster j
%% Initialize every entry of cl to -1
for i=1:ND
  cl(i)=-1;
end

%% Count the data points inside the rectangle (that is, the cluster centers)
for i=1:ND
  if ( (rho(i)>rhomin) && (delta(i)>deltamin))
    NCLUST=NCLUST+1;
    cl(i)=NCLUST;    %% data point i belongs to cluster NCLUST
    icl(NCLUST)=i;   %% inverse mapping: the center of cluster NCLUST is data point i
  end
end

fprintf('NUMBER OF CLUSTERS: %i \n', NCLUST);

disp('Performing assignation')
%% Assign the remaining data points to clusters (assignation)
for i=1:ND
  if (cl(ordrho(i))==-1)
    cl(ordrho(i))=cl(nneigh(ordrho(i)));
  end
end
%% The points are traversed in descending order of rho; after the loop every cl value should be positive

%% Handle the halo points. This initialization would be better placed inside the if (NCLUST>1) block
for i=1:ND
  halo(i)=cl(i);
end

if (NCLUST>1)

  % Initialize the array bord_rho to 0; one bord_rho value is defined per cluster
  for i=1:NCLUST
    bord_rho(i)=0.;
  end

  % Obtain the border density bord_rho of each cluster
  for i=1:ND-1
    for j=i+1:ND
      %% points i and j are close enough but do not belong to the same cluster
      if ((cl(i)~=cl(j))&& (dist(i,j)<=dc))
        rho_aver=(rho(i)+rho(j))/2.;   %% average local density of the two points i and j
        if (rho_aver>bord_rho(cl(i)))
          bord_rho(cl(i))=rho_aver;
        end
        if (rho_aver>bord_rho(cl(j)))
          bord_rho(cl(j))=rho_aver;
        end
      end
    end
  end

  %% A halo value of 0 marks the point as an outlier
  for i=1:ND
    if (rho(i)<bord_rho(cl(i)))
      halo(i)=0;
    end
  end

end

%% Process each cluster in turn
for i=1:NCLUST
  nc=0;   %% accumulates the number of data points in the current cluster
  nh=0;   %% accumulates the number of core data points in the current cluster
  for j=1:ND
    if (cl(j)==i)
      nc=nc+1;
    end
    if (halo(j)==i)
      nh=nh+1;
    end
  end
  fprintf('CLUSTER: %i CENTER: %i ELEMENTS: %i CORE: %i HALO: %i \n', i,icl(i),nc,nh,nc-nh);
end

cmap=colormap;
for i=1:NCLUST
  ic=int8((i*64.)/(NCLUST*1.));
  subplot(2,1,1)
  hold on
  plot(rho(icl(i)),delta(icl(i)),'o','MarkerSize',8,'MarkerFaceColor',cmap(ic,:),'MarkerEdgeColor',cmap(ic,:));
end

subplot(2,1,2)
disp('Performing 2D nonclassical multidimensional scaling')
Y1 = mdscale(dist, 2, 'criterion','metricstress');   % mdscale belongs to the Statistics and Machine Learning Toolbox
plot(Y1(:,1),Y1(:,2),'o','MarkerSize',2,'MarkerFaceColor','k','MarkerEdgeColor','k');
title ('2D Nonclassical multidimensional scaling','FontSize',15.0)
xlabel ('X')
ylabel ('Y')

for i=1:ND
  A(i,1)=0.;
  A(i,2)=0.;
end

for i=1:NCLUST
  nn=0;
  ic=int8((i*64.)/(NCLUST*1.));
  for j=1:ND
    if (halo(j)==i)
      nn=nn+1;
      A(nn,1)=Y1(j,1);
      A(nn,2)=Y1(j,2);
    end
  end
  hold on
  plot(A(1:nn,1),A(1:nn,2),'o','MarkerSize',2,'MarkerFaceColor',cmap(ic,:),'MarkerEdgeColor',cmap(ic,:));
end

%for i=1:ND
%  if (halo(i)>0)
%    ic=int8((halo(i)*64.)/(NCLUST*1.));
%    hold on
%    plot(Y1(i,1),Y1(i,2),'o','MarkerSize',2,'MarkerFaceColor',cmap(ic,:),'MarkerEdgeColor',cmap(ic,:));
%  end
%end

faa = fopen('CLUSTER_ASSIGNATION', 'w');
disp('Generated file:CLUSTER_ASSIGNATION')
disp('column 1:element id')
disp('column 2:cluster assignation without halo control')
disp('column 3:cluster assignation with halo control')
for i=1:ND
  fprintf(faa, '%i %i %i\n',i,cl(i),halo(i));
end
fclose(faa);   % added: close the output file
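Two rules in the program are easy to miss when reading the loops: how the cutoff distance is chosen, and how halo points are separated from cluster cores. The block below restates both as read from the code. The symbols are assumed notation for illustration: d_{(k)} is the k-th smallest of the N distances listed in the input file, p is the hard-coded percentage (2.0), c(i) is the cluster assigned to point i, and \rho_b is the per-cluster border density (bord_rho).

```latex
% Cutoff distance: the entry at position round(N p / 100) of the ascending list of input distances
d_c = d_{(k)}, \qquad k = \operatorname{round}\!\left(\frac{N\,p}{100}\right), \qquad p = 2

% Border density of cluster c: the largest average density over pairs of points
% that lie within d_c of each other but belong to different clusters
\rho_b(c) = \max_{\substack{i,\,j \,:\; c(i) = c,\; c(j) \neq c \\ d_{ij} \le d_c}} \frac{\rho_i + \rho_j}{2}
\qquad (\rho_b(c) = 0 \text{ if no such pair exists})

% Halo rule: point i is marked as halo (halo(i) = 0) when its density falls below the border density
\rho_i < \rho_b\big(c(i)\big)
```

Under this reading, a cluster with no close neighbours in other clusters keeps \rho_b = 0, so none of its points are marked as halo.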
Author: peghoty
Source: http://blog.csdn.net/itplus/article/details/10087581
You are welcome to repost or share this article, but please be sure to credit the source.