The idea of clustering: divide a dataset into several disjoint subsets (called clusters), each of which potentially corresponds to a concept. The practical meaning of each cluster is determined by the user; the clustering algorithm only performs the partitioning.
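The "disjoint subsets" requirement can be checked mechanically. The sketch below is only an illustration: the function name is_valid_clustering is hypothetical, samples are assumed to be hashable identifiers, and it assumes that "dividing" also means every cluster is non-empty and the clusters together cover the whole dataset.

```python
def is_valid_clustering(dataset, clusters):
    """Return True if clusters are non-empty, pairwise disjoint, and cover dataset.

    Assumption: samples are hashable identifiers (for example, row indices).
    """
    seen = set()
    for cluster in clusters:
        cluster = set(cluster)
        # An empty cluster or a sample already assigned elsewhere breaks the partition.
        if not cluster or seen & cluster:
            return False
        seen |= cluster
    # Every sample of the dataset must belong to some cluster.
    return seen == set(dataset)


samples = range(6)
print(is_valid_clustering(samples, [{0, 1}, {2, 3, 4}, {5}]))     # disjoint and complete
print(is_valid_clustering(samples, [{0, 1, 2}, {2, 3, 4}, {5}]))  # sample 2 appears twice
```

The check says nothing about what each cluster means; as noted above, that interpretation is left to the user.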
The role of clustering:
1) as a standalone process for discovering the distribution structure of the data
2) as a preprocessing step for classification: the data is clustered first, and the classification process is then performed on each cluster of the clustering result.
Performance Metrics for Clustering:
1) External indicator: obtained by comparing the clustering result with a reference model
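In the formulas below, C is the clustering result and C* is the partition given by the reference model. Over all pairs of samples, a is the number of pairs in the same cluster in both C and C*, b is the number of pairs in the same cluster in C but in different clusters in C*, and c is the number of pairs in different clusters in C but in the same cluster in C*.
Jaccard coefficient: among all sample pairs placed in the same cluster by C or by C*, the proportion placed in the same cluster by both: JC = a / (a + b + c)
FM index (Fowlkes-Mallows index): of the sample pairs in the same cluster in C, the proportion that are also in the same cluster in C* is p1 = a / (a + b); of the sample pairs in the same cluster in C*, the proportion that are also in the same cluster in C is p2 = a / (a + c). FMI is the geometric mean of p1 and p2: FMI = sqrt((a / (a + b)) * (a / (a + c)))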
2) Internal indicator: obtained directly by examining the clustering result itself, without using any reference model
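The two external indicators above, the Jaccard coefficient and the FM index, can be computed by counting sample pairs. The following is a minimal sketch rather than a reference implementation: the function names are hypothetical, each partition is assumed to be given as a list of cluster labels (one per sample), a counts pairs in the same cluster in both C and C*, b counts pairs together only in C, c counts pairs together only in C*, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels, reference):
    """Count sample pairs by how C (labels) and C* (reference) group them.

    a: same cluster in both C and C*
    b: same cluster in C, different clusters in C*
    c: different clusters in C, same cluster in C*
    """
    if len(labels) != len(reference):
        raise ValueError("both labelings must cover the same samples")
    a = b = c = 0
    for i, j in combinations(range(len(labels)), 2):
        same_in_c = labels[i] == labels[j]
        same_in_ref = reference[i] == reference[j]
        if same_in_c and same_in_ref:
            a += 1
        elif same_in_c:
            b += 1
        elif same_in_ref:
            c += 1
    return a, b, c


def jaccard_coefficient(labels, reference):
    a, b, c = pair_counts(labels, reference)
    total = a + b + c
    # Assumption: return 0.0 when no pair shares a cluster in either partition.
    return a / total if total else 0.0


def fm_index(labels, reference):
    a, b, c = pair_counts(labels, reference)
    if a == 0:
        # Assumption: no jointly grouped pairs means p1 or p2 is 0 (or undefined).
        return 0.0
    return sqrt((a / (a + b)) * (a / (a + c)))


clustering = [0, 0, 1, 1, 1]   # C: the clustering result
reference = [0, 0, 0, 1, 1]    # C*: the reference model
print(pair_counts(clustering, reference))          # (2, 2, 2)
print(jaccard_coefficient(clustering, reference))  # 2 / 6
print(fm_index(clustering, reference))             # sqrt(0.5 * 0.5) = 0.5
```

Both values lie in [0, 1], and larger values mean the clustering agrees more closely with the reference model. The pairwise loop is quadratic in the number of samples, which is fine for illustration but slow for large datasets.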
Python and machine learning: clustering and the EM algorithm