Discussion on cluster analysis

Chapter 1: Introduction

Chapter 2: Preliminaries

Chapter 3: Direct Clustering Method

Chapter 4: K-means

Chapter 5: DBSCAN

Chapter 6: OPTICS

Chapter 7: Evaluating the Results of Cluster Analysis

Chapter 8: Data Scaling

A New Clustering Algorithm Published in Science

This digest consists of reading notes on the third chapter of Zhou Zhaotao's master's thesis, "Text Clustering Analysis effect evaluation and text expression study" (Institute of Computing Technology, Chinese Academy of Sciences). I hope it is of some help.

The definitions of the accuracy and recall rates mentioned in this article can be found at

http://blog.csdn.net/itplus/article/details/10862059

Some of the specific scaling formulas used in this article can be found at http://blog.csdn.net/itplus/article/details/10088101

     
In June 2014, Alex Rodriguez and Alessandro Laio published an article in Science entitled "Clustering by fast search and find of density peaks", which offers a new way of thinking about the design of clustering algorithms. Although many readers raised questions after the article appeared, the basic idea of the new algorithm is novel, simple, and clear, and it is well worth studying. The core idea lies in how the algorithm characterizes cluster centers. This post introduces the principle of the algorithm in detail and discusses some of its finer points.
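The two quantities behind this characterization of cluster centers can be written out explicitly. The block below is a summary reconstructed from the sample program in this post (its variables rho, delta, gamma and dc), not a quotation from the paper; d_{ij} denotes the distance between points i and j, d_c the cutoff distance, and i^* the point of highest density.

```latex
\[
\begin{aligned}
% local density, Gaussian kernel (the variant that is active in the program)
\rho_i &= \sum_{j \neq i} \exp\!\left( -\left( \frac{d_{ij}}{d_c} \right)^{2} \right) \\
% local density, "cut off" kernel (the commented-out alternative)
\rho_i &= \sum_{j \neq i} \chi(d_{ij} - d_c),
\qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & \text{otherwise} \end{cases} \\
% distance to the nearest point of higher density
\delta_i &= \min_{j \,:\, \rho_j > \rho_i} d_{ij} \\
% special case for the densest point, as set in the program
\delta_{i^*} &= \max_{j \neq i^*} \delta_j \\
% product computed by the program but not used afterwards
\gamma_i &= \rho_i \, \delta_i
\end{aligned}
\]
```

A point is treated as a cluster center when both its rho and its delta are large, which is what the rectangle selected on the decision graph picks out. In the program, ties in rho are resolved by the order returned by sort.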

Finally, here is the MATLAB sample program that the authors provided in the supplementary material, with explanatory code comments added.

clear all
close all
disp('The only input needed is a distance matrix file')
disp('The format of this file should be: ')
disp('Column 1: id of element i')
disp('Column 2: id of element j')
disp('Column 3: dist(i,j)')

%% read data from file
mdist=input('name of the distance matrix file (with single quotes)?\n');
disp('Reading input distance matrix')
xx=load(mdist);
ND=max(xx(:,2));
NL=max(xx(:,1));
if (NL>ND)
  ND=NL;  %% ensure ND is the largest value found in columns 1 and 2; it is used as the total number of data points
end

N=size(xx,1);  %% length of the first dimension of xx, i.e. the number of lines in the file (the total number of distances)

%% initialize dist to 0
for i=1:ND
  for j=1:ND
    dist(i,j)=0;
  end
end

%% fill the dist array from xx; note that the input holds only 0.5*ND*(ND-1) values,
%% which are expanded here into the full symmetric matrix
%% diagonal elements are not considered here
for i=1:N
  ii=xx(i,1);
  jj=xx(i,2);
  dist(ii,jj)=xx(i,3);
  dist(jj,ii)=xx(i,3);
end

%% determine dc
percent=2.0;
fprintf('average percentage of neighbours (hard coded): %5.6f\n', percent);

position=round(N*percent/100);  %% round rounds to the nearest integer
sda=sort(xx(:,3));              %% sort all distance values in ascending order
dc=sda(position);

%% compute the local density rho (using the Gaussian kernel)
fprintf('Computing Rho with gaussian kernel of radius: %12.6f\n', dc);

%% initialize rho of every data point to 0
for i=1:ND
  rho(i)=0.;
end

% Gaussian kernel
for i=1:ND-1
  for j=i+1:ND
    rho(i)=rho(i)+exp(-(dist(i,j)/dc)*(dist(i,j)/dc));
    rho(j)=rho(j)+exp(-(dist(i,j)/dc)*(dist(i,j)/dc));
  end
end

% "Cut off" kernel
%for i=1:ND-1
%  for j=i+1:ND
%    if (dist(i,j)<dc)
%      rho(i)=rho(i)+1.;
%      rho(j)=rho(j)+1.;
%    end
%  end
%end

%% take the maximum of each column of the matrix, then the maximum of those,
%% which gives the maximum of all distance values
maxd=max(max(dist));

%% sort rho in descending order; ordrho holds the ordering
[rho_sorted,ordrho]=sort(rho,'descend');

%% handle the data point with the largest rho value
delta(ordrho(1))=-1.;
nneigh(ordrho(1))=0;

%% generate the delta and nneigh arrays
for ii=2:ND
  delta(ordrho(ii))=maxd;
  for jj=1:ii-1
    if(dist(ordrho(ii),ordrho(jj))<delta(ordrho(ii)))
      delta(ordrho(ii))=dist(ordrho(ii),ordrho(jj));
      nneigh(ordrho(ii))=ordrho(jj);
      %% record the index ordrho(jj) of the point nearest to ordrho(ii) among the points with larger rho
    end
  end
end

%% set the delta value of the data point with the largest rho
delta(ordrho(1))=max(delta(:));

%% decision graph
disp('Generated file:DECISION GRAPH')
disp('column 1:Density')
disp('column 2:Delta')

fid = fopen('DECISION_GRAPH', 'w');
for i=1:ND
  fprintf(fid, '%6.2f %6.2f\n', rho(i),delta(i));
end

%% select a rectangle enclosing the cluster centers
disp('Select a rectangle enclosing cluster centers')

%% On each computer there is only one root object, the screen, and its handle is always 0
%% >> scrsz = get(0,'ScreenSize')
%% scrsz =
%%        1     1  1280   800
%% 1280 and 800 are the screen resolution you have set; scrsz(4) is 800 and scrsz(3) is 1280
scrsz = get(0,'ScreenSize');

%% the position is specified by hand, which does not feel very automatic :-)
figure('Position',[6 72 scrsz(3)/4. scrsz(4)/1.3]);

%% ind and gamma are not used later
for i=1:ND
  ind(i)=i;
  gamma(i)=rho(i)*delta(i);
end

%% use rho and delta to draw the so-called "decision graph"
subplot(2,1,1)
tt=plot(rho(:),delta(:),'o','MarkerSize',5,'MarkerFaceColor','k','MarkerEdgeColor','k');
title ('Decision Graph','FontSize',15.0)
xlabel ('\rho')
ylabel ('\delta')

subplot(2,1,1)
rect = getrect(1);
%% getrect lets the user select a rectangular region with the mouse; rect holds the
%% coordinates (x,y) of the lower left corner of the rectangle, followed by its width and height
rhomin=rect(1);
deltamin=rect(2);  %% the author admits that this was an error; it has been changed from 4 to 2!

%% initialize the number of clusters
NCLUST=0;

%% cl is the array of cluster membership flags; cl(i)=j means that data point i belongs to cluster j
%% first initialize every cl entry to -1
for i=1:ND
  cl(i)=-1;
end

%% count the data points (that is, the cluster centers) inside the rectangular region
for i=1:ND
  if ( (rho(i)>rhomin) && (delta(i)>deltamin))
    NCLUST=NCLUST+1;
    cl(i)=NCLUST;   %% data point i belongs to cluster NCLUST
    icl(NCLUST)=i;  %% inverse mapping: the center of cluster NCLUST is data point i
  end
end

fprintf('NUMBER OF CLUSTERS: %i \n', NCLUST);

disp('Performing assignation')
%% assign the remaining data points to clusters (assignation)
for i=1:ND
  if (cl(ordrho(i))==-1)
    cl(ordrho(i))=cl(nneigh(ordrho(i)));
  end
end
%% the points are traversed in descending order of rho; after the loop every cl value should be positive

%% handle the halo points; it would be better to move this halo initialization inside the if (NCLUST>1) block
for i=1:ND
  halo(i)=cl(i);
end

if (NCLUST>1)

  % initialize the array bord_rho to 0; one bord_rho value is defined per cluster
  for i=1:NCLUST
    bord_rho(i)=0.;
  end

  % obtain, for each cluster, a bound bord_rho on the average density
  for i=1:ND-1
    for j=i+1:ND
      %% i and j are close enough but do not belong to the same cluster
      if ((cl(i)~=cl(j))&& (dist(i,j)<=dc))
        rho_aver=(rho(i)+rho(j))/2.;  %% average local density of the two points i and j
        if (rho_aver>bord_rho(cl(i)))
          bord_rho(cl(i))=rho_aver;
        end
        if (rho_aver>bord_rho(cl(j)))
          bord_rho(cl(j))=rho_aver;
        end
      end
    end
  end

  %% a halo value of 0 marks the point as an outlier
  for i=1:ND
    if (rho(i)<bord_rho(cl(i)))
      halo(i)=0;
    end
  end

end

%% process the clusters one by one
for i=1:NCLUST
  nc=0;  %% accumulates the number of data points in the current cluster
  nh=0;  %% accumulates the number of core data points in the current cluster
  for j=1:ND
    if (cl(j)==i)
      nc=nc+1;
    end
    if (halo(j)==i)
      nh=nh+1;
    end
  end
  fprintf('CLUSTER: %i CENTER: %i ELEMENTS: %i CORE: %i HALO: %i \n', i,icl(i),nc,nh,nc-nh);
end

cmap=colormap;
for i=1:NCLUST
  ic=int8((i*64.)/(NCLUST*1.));
  subplot(2,1,1)
  hold on
  plot(rho(icl(i)),delta(icl(i)),'o','MarkerSize',8,'MarkerFaceColor',cmap(ic,:),'MarkerEdgeColor',cmap(ic,:));
end

subplot(2,1,2)
disp('Performing 2D nonclassical multidimensional scaling')
Y1 = mdscale(dist, 2, 'criterion','metricstress');
plot(Y1(:,1),Y1(:,2),'o','MarkerSize',2,'MarkerFaceColor','k','MarkerEdgeColor','k');
title ('2D Nonclassical multidimensional scaling','FontSize',15.0)
xlabel ('X')
ylabel ('Y')

for i=1:ND
  A(i,1)=0.;
  A(i,2)=0.;
end

for i=1:NCLUST
  nn=0;
  ic=int8((i*64.)/(NCLUST*1.));
  for j=1:ND
    if (halo(j)==i)
      nn=nn+1;
      A(nn,1)=Y1(j,1);
      A(nn,2)=Y1(j,2);
    end
  end
  hold on
  plot(A(1:nn,1),A(1:nn,2),'o','MarkerSize',2,'MarkerFaceColor',cmap(ic,:),'MarkerEdgeColor',cmap(ic,:));
end

%for i=1:ND
%  if (halo(i)>0)
%    ic=int8((halo(i)*64.)/(NCLUST*1.));
%    hold on
%    plot(Y1(i,1),Y1(i,2),'o','MarkerSize',2,'MarkerFaceColor',cmap(ic,:),'MarkerEdgeColor',cmap(ic,:));
%  end
%end

faa = fopen('CLUSTER_ASSIGNATION', 'w');
disp('Generated file:CLUSTER_ASSIGNATION')
disp('column 1:element id')
disp('column 2:cluster assignation without halo control')
disp('column 3:cluster assignation with halo control')
for i=1:ND
  fprintf(faa, '%i %i %i\n',i,cl(i),halo(i));
end
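For readers who want to check the numerical core of the program without MATLAB, the first half of the script (building the distance matrix, choosing dc, and computing rho, delta and the nearest higher-density neighbour) can be sketched in plain Python. This is a re-expression for illustration, not the authors' code. The function name density_peaks, the toy data, and the choice percent=40 are assumptions made for this example. Indices are 0-based here, so the densest point gets nneigh = -1 instead of 0, and a guard keeps position at 1 or above, which the MATLAB script does not do.

```python
import math


def density_peaks(triples, percent=2.0):
    """Return (dc, rho, delta, nneigh) from (i, j, dist) triples with 1-based ids."""
    n = max(max(i, j) for i, j, _ in triples)

    # Expand the pair list into a full symmetric matrix (diagonal stays 0).
    dist = [[0.0] * n for _ in range(n)]
    for i, j, d in triples:
        dist[i - 1][j - 1] = d
        dist[j - 1][i - 1] = d

    # dc is the distance found `percent` percent of the way up the sorted list.
    # int(x + 0.5) rounds halves up, like MATLAB's round for positive x.
    sda = sorted(d for _, _, d in triples)
    position = max(1, int(len(sda) * percent / 100 + 0.5))  # guard: assumption
    dc = sda[position - 1]

    # Local density with the Gaussian kernel.
    rho = [0.0] * n
    for a in range(n - 1):
        for b in range(a + 1, n):
            w = math.exp(-((dist[a][b] / dc) ** 2))
            rho[a] += w
            rho[b] += w

    # Visit points in descending order of rho (stable sort keeps ties in index order).
    order = sorted(range(n), key=lambda k: -rho[k])
    maxd = max(max(row) for row in dist)
    delta = [0.0] * n
    nneigh = [-1] * n
    for pos in range(1, n):
        p = order[pos]
        delta[p] = maxd
        for q in order[:pos]:  # only points of higher (or tied, earlier) density
            if dist[p][q] < delta[p]:
                delta[p] = dist[p][q]
                nneigh[p] = q

    # The densest point receives the largest delta found among the others.
    delta[order[0]] = max(delta)
    return dc, rho, delta, nneigh


# Toy data: two groups of three points on a line (hypothetical example).
points = [0.0, 1.0, 2.0, 50.0, 51.0, 52.0]
triples = [
    (i + 1, j + 1, abs(points[i] - points[j]))
    for i in range(len(points))
    for j in range(i + 1, len(points))
]

dc, rho, delta, nneigh = density_peaks(triples, percent=40.0)
peaks = [k for k in range(len(points)) if delta[k] > 10.0]
print("dc =", dc)
print("points with large delta (0-based):", peaks)
```

The threshold of 10.0 on delta stands in for the rectangle that the MATLAB script asks the user to draw on the decision graph; it is only meaningful for this toy data. With real data, the three-column triples would be read from the same distance file that the script loads.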

  

Author: peghoty

Source: http://blog.csdn.net/itplus/article/details/10087581

You are welcome to reprint or share this article, but please credit the source.
