4.3.1 Concepts and Features
1. Meaning
Cluster analysis is a basic method for classifying things according to their characteristics. It is carried out for a specific purpose and is not classification in the strict sense, because the classes are not known in advance.
2. Principles
Individuals in the same class should be highly similar, and individuals in different classes should differ strongly.
3. Types
(1) By clustering object:
Sample (case) clustering: the observed objects are classified according to the values of the variables that describe their features. The purpose is to determine the class to which each study object belongs.
Variable clustering: variables that reflect certain characteristics of things are selected according to the research question in order to study one aspect of them. The purpose is to find representative variables that are independent of each other, so that little information is lost when a few representative variables replace many variables.
(2) By clustering process:
Divisive method: all individuals start in one class, which is then split layer by layer according to proximity or similarity until each individual forms its own class.
Agglomerative method: each individual starts as its own class, and classes are then merged step by step according to proximity or similarity until all individuals belong to one class.
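The bottom-up merging process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical one-dimensional values, the distance between two classes is taken as the distance between their closest members, and the function name agglomerate is invented for this example.

```python
from itertools import combinations

def class_distance(points, a, b):
    """Distance between the closest members of classes a and b (assumed rule)."""
    return min(abs(points[i] - points[j]) for i in a for j in b)

def agglomerate(points):
    """Merge classes step by step until a single class remains.

    Returns one (class_a, class_b, distance) tuple per merge step.
    """
    # Start with every individual in its own class.
    classes = [(i,) for i in range(len(points))]
    schedule = []
    while len(classes) > 1:
        # Pick the pair of classes that are currently closest.
        a, b = min(combinations(classes, 2),
                   key=lambda pair: class_distance(points, *pair))
        schedule.append((a, b, class_distance(points, a, b)))
        # Replace the two classes with their union.
        classes = [c for c in classes if c not in (a, b)]
        classes.append(tuple(sorted(a + b)))
    return schedule

values = [1.0, 1.5, 5.0, 5.2, 9.0]  # hypothetical data
for step, (a, b, d) in enumerate(agglomerate(values), start=1):
    print(step, a, b, round(d, 2))
```

With five individuals the loop performs four merges; the divisive method would run the same tree in the opposite direction.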
4.3.2 Content and Procedure
1. Data Preparation
Cluster analysis is used to comprehensively evaluate the economic development of several eastern and western regions of China.
2. Method Selection
Open dialog box 4.10 by choosing Analyze > Classify > Hierarchical Cluster. Move the variables to be analyzed from the source variable list on the left to the Variable(s) box on the right. Under Cluster, select the cluster type: Cases (clustering of observations) or Variables (clustering of variables). When clustering cases, specify an ID variable and move it to the Label Cases by box. Under Display, Statistics and Plots are both selected by default.
Click the Method button to open the method dialog box.
(1) Cluster Method: defines how the distance or similarity between two classes is computed.
Between-groups linkage: merges the two classes for which the average distance between all pairs of items, one from each class, is smallest.
Within-groups linkage: merges the two classes so that the average (squared) distance between all items in the merged class is smallest.
Nearest neighbor: the distance between two classes is the distance between their two closest points.
Furthest neighbor: the distance between two classes is the distance between their two farthest points.
Centroid clustering: the distance between two classes is the distance between their centroids, that is, between the means of all variables in each class.
Median clustering: similar to centroid clustering, except that the two classes are weighted equally when the center of the merged class is computed.
Ward's method (minimum variance): merges the two classes whose union gives the smallest increase in within-class variance.
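To make the difference between the cluster methods concrete, the sketch below computes the distance between two small classes under four of the rules listed above. The two classes are hypothetical 2-D points, Euclidean distance is assumed as the measure, and the function names are invented for this illustration; they are not SPSS identifiers.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

def nearest_neighbor(a, b):
    """Distance between the two closest points of classes a and b."""
    return min(dist(p, q) for p, q in product(a, b))

def furthest_neighbor(a, b):
    """Distance between the two farthest points of classes a and b."""
    return max(dist(p, q) for p, q in product(a, b))

def between_groups(a, b):
    """Average distance over all pairs with one point from each class."""
    pairs = list(product(a, b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def centroid(points):
    """Mean of each coordinate over the points of one class."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def centroid_distance(a, b):
    """Distance between the centroids of classes a and b."""
    return dist(centroid(a), centroid(b))

class_a = [(0.0, 0.0), (0.0, 2.0)]  # hypothetical class
class_b = [(3.0, 0.0), (3.0, 2.0)]  # hypothetical class

for rule in (nearest_neighbor, furthest_neighbor, between_groups, centroid_distance):
    print(rule.__name__, round(rule(class_a, class_b), 3))
```

The same pair of classes gets a different distance under each rule, which is why the choice of cluster method changes the order of merging.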
(2) Measure: the algorithm used to measure distance or similarity.
The measure should generally be consistent with the cluster method. Different cluster methods call for different measures, and the clustering results differ accordingly. If the measure does not suit the cluster method, a warning is displayed and the result cannot be trusted.
The measures cover distances for continuous variables, dissimilarities for discrete variables, and distances or dissimilarities for binary variables. The distance measures for continuous variables include:
Euclidean distance: $\sqrt{\sum_i (x_i - y_i)^2}$, that is, the square root of the sum of the squared differences between the values of each variable. It measures the overall distance, or dissimilarity, between two items.
Squared Euclidean distance: $\sum_i (x_i - y_i)^2$, used to reduce the error.
Cosine: $\sum_i x_i y_i / \sqrt{(\sum_i x_i^2)(\sum_i y_i^2)}$, that is, the similarity between two items is the cosine of the angle between their vectors. The range is -1 to 1, and the value 0 indicates that the vectors are orthogonal.
Pearson correlation: $\sum_i z_{x_i} z_{y_i} / (n - 1)$, where $z_{x_i}$ and $z_{y_i}$ are standardized values, that is, the similarity between two items is the linear correlation between their vectors. The range is -1 to 1, and the value 0 indicates no linear correlation.
Chebychev distance: $\max_i |x_i - y_i|$, that is, the distance between two items is the largest absolute difference between the values of any variable.
Block distance: $\sum_i |x_i - y_i|$, that is, the distance between two items is the sum of the absolute differences between the values of each variable.
Minkowski distance: $(\sum_i |x_i - y_i|^p)^{1/p}$.
Customized distance: $(\sum_i |x_i - y_i|^p)^{1/r}$. If $r = p$, it is the Minkowski distance.
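The measures for continuous variables listed above can be written out directly. The following is a minimal sketch using only the Python standard library; the function names are chosen for this illustration, and the Pearson version assumes neither vector is constant (otherwise the standard deviation is zero).

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cosine(x, y):
    # Similarity: cosine of the angle between the two vectors.
    num = sum(a * b for a, b in zip(x, y))
    return num / sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def pearson(x, y):
    # Similarity: sum of products of z scores divided by n - 1.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (n - 1)

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def customized(x, y, p, r):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)

def minkowski(x, y, p):
    # Special case of the customized distance with r = p.
    return customized(x, y, p, p)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)  # hypothetical items
print(euclidean(x, y), squared_euclidean(x, y), chebychev(x, y), block(x, y))
```

Note that cosine and Pearson correlation are similarities (larger means closer), while the others are distances (smaller means closer).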
(3) Transform Values: standardizes the data to eliminate the effect of different units of measurement.
If the variables in the analysis are measured in the same units, standardization is not required. However, different standardization methods lead to different clustering results, so the method chosen must suit the distribution of the variables.
Z scores: the variable has mean 0 and standard deviation 1; (each value - mean) / standard deviation.
Range -1 to 1: each value / range.
Maximum magnitude of 1: the maximum value is 1; each value / maximum value.
Range 0 to 1: (each value - minimum value) / range.
Mean of 1: the variable has mean 1; each value / mean.
Standard deviation of 1: the variable has unit standard deviation; each value / standard deviation.
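As a quick check on the formulas above, the standardization options can be sketched as plain Python functions. The function names are invented for this example, and the sample standard deviation (divisor n - 1) is an assumption about which standard deviation is meant.

```python
from statistics import mean, stdev  # stdev is the sample standard deviation

def z_scores(v):
    m, s = mean(v), stdev(v)
    return [(x - m) / s for x in v]

def range_minus1_to_1(v):
    r = max(v) - min(v)
    return [x / r for x in v]

def max_magnitude_1(v):
    top = max(v)
    return [x / top for x in v]

def range_0_to_1(v):
    low, r = min(v), max(v) - min(v)
    return [(x - low) / r for x in v]

def mean_1(v):
    m = mean(v)
    return [x / m for x in v]

def sd_1(v):
    s = stdev(v)
    return [x / s for x in v]

income = [2.0, 4.0, 6.0]  # hypothetical variable
print(z_scores(income))
print(range_0_to_1(income))
print(mean_1(income))
```

Running the functions on the same variable shows how much the scale of the values, and therefore any distance computed from them, depends on the option chosen.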
(4) Transform Measures: transforms the computed distance or similarity values. If the similarities or dissimilarities already calculated can be used as they are, no transformation is required.
Absolute values: takes the absolute value of each distance.
Change sign: reverses the order of the distances, so that similarity values and dissimilarity values are converted into each other.
Rescale to 0-1 range: (distance - minimum) / range.
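The three transformations of the computed measures are simple element-wise operations. A minimal sketch, with invented function names and hypothetical values:

```python
def absolute_values(d):
    """Keep only the magnitude of each value."""
    return [abs(x) for x in d]

def change_sign(d):
    """Reverse the ordering, turning similarities into dissimilarities."""
    return [-x for x in d]

def rescale_0_1(d):
    """(value - minimum) / range, so the result lies between 0 and 1."""
    low, r = min(d), max(d) - min(d)
    return [(x - low) / r for x in d]

correlations = [0.9, -0.4, 0.2]  # hypothetical similarity values
print(absolute_values(correlations))
print(change_sign(correlations))
print(rescale_0_1(correlations))
```

After change_sign, the pair that had the largest similarity has the smallest value, so it is treated as the closest pair by a method that expects dissimilarities.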
3. Output Selection
(1) Statistics
In the main dialog box, click Statistics.
By default the system outputs the agglomeration schedule, which shows the classes merged at each step of clustering, the distance between the two items merged, and the level at which the merged class is used again, so that the merging process and the closeness of the observations can be traced. Note, however, that choosing different cluster methods, measures, and standardization methods leads to a different clustering process and different results.
You can also output the classification results, either for a specified number of classes (Single solution) or for a range of numbers of classes (Range of solutions); both depend on the cluster type selected.
(2) Plots
Click Plots in the main dialog box.
The dendrogram shows the classes merged at each step of clustering and the corresponding coefficient values. It is consistent with the agglomeration schedule, emphasizes the clustering process, and gives an intuitive view of the result after clustering.
The icicle plot displays all the clustering information on a single graph and shows the clustering result. You can display the whole process (All clusters) or a Specified range of clusters, and set the Orientation to vertical or horizontal.
The two plots are important tools for deciding the classification, but the final classification still has to be determined by the investigator according to the study object and the purpose of the study.
(3) New Variables
Click Save in the main dialog box.
Once the classification of the study objects has been decided from the statistics and plots, save the class variable in the data file for further analysis.
You can save a single result (Single solution): after the number of classes is specified, the new variable gives the class to which each individual belongs after clustering. Alternatively, you can save a range of results (Range of solutions): after the range is specified, several new variables are created, each giving the class to which each individual belongs for one number of classes.
Variable clustering does not create new variables.
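The saved class variables can be understood as the result of replaying the merge steps and stopping when the requested number of classes remains. The sketch below assumes a hypothetical list of merges, each given as a pair of item indices standing for the classes that contain them; the function name memberships is invented for this example.

```python
def memberships(n_items, merges, k):
    """Class label (1..k) of each item when k classes remain."""
    label = list(range(n_items))        # every item starts in its own class
    for a, b in merges[: n_items - k]:  # apply only the first n - k merges
        old, new = label[b], label[a]
        label = [new if current == old else current for current in label]
    # Renumber the classes 1..k in order of first appearance.
    ids = {}
    return [ids.setdefault(current, len(ids) + 1) for current in label]

merges = [(2, 3), (0, 1), (0, 2), (0, 4)]  # hypothetical merge steps for 5 items

# Single solution: one new variable for a fixed number of classes.
print(memberships(5, merges, 3))

# Range of solutions: one new variable per number of classes.
for k in range(2, 5):
    print(k, memberships(5, merges, k))
```

Each printed list plays the role of one saved variable: position i holds the class of individual i.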
4. Analysis and Evaluation
(1) Clustering Process
From left to right, the columns give the clustering step, the numbers of the two items merged, the value of the distance measure, the steps at which the two merged classes first appeared, and the step at which the merged result is used next. The table thus details the order of the clustering process, the source of each merge, the destination of each merged result, and the basis for merging.
Choosing different cluster methods and measures leads to a different clustering process and different results, and the meaning of the distance values also differs. Because Pearson correlation is selected as the measure here, the two items with the highest correlation coefficient are merged first. If a dissimilarity measure is selected, the two items with the smallest value are merged first.
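The point about which pair is merged first can be illustrated with a small sketch. Both matrices are hypothetical, and first_merge is a name invented for this example: with a similarity measure the largest off-diagonal value wins, with a dissimilarity measure the smallest.

```python
from itertools import combinations

def first_merge(matrix, similarity):
    """Pair of items merged first, given a symmetric proximity matrix."""
    pairs = combinations(range(len(matrix)), 2)
    pick = max if similarity else min   # similarities: largest value is closest
    return pick(pairs, key=lambda p: matrix[p[0]][p[1]])

# Hypothetical Pearson correlations between three items (similarity).
corr = [[1.0, 0.9, 0.2],
        [0.9, 1.0, 0.4],
        [0.2, 0.4, 1.0]]

# Hypothetical distances between the same three items (dissimilarity).
dist = [[0.0, 5.0, 1.0],
        [5.0, 0.0, 4.0],
        [1.0, 4.0, 0.0]]

print(first_merge(corr, similarity=True))
print(first_merge(dist, similarity=False))
```

Reading the coefficient column of the schedule therefore requires knowing whether the chosen measure is a similarity or a dissimilarity.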
(2) Clustering Results
The classification results are displayed according to the classification option chosen. Which result to apply in practice must be decided through cluster selection.
(3) Cluster Selection
The icicle plot is read column by column, starting from the position where the column of "X" marks is shortest. It clearly shows the whole process by which all items end up in a single class.
The dendrogram reflects the whole clustering process. In general, a ruler is placed vertically on the plot and moved from left to right. Stop where the gap between successive merge lines (the vertical lines) is largest; this is the best classification scheme. Each horizontal line that crosses the ruler at this position is one class, and all the items to the left of that horizontal line are members of the class. In this way the features of each class stand out and are easy to define.
The two plots are important tools for deciding the classification. However, different cluster methods and measures lead to different classification processes and results, so the final classification still has to be determined by the investigator according to the study object and the purpose of the study.
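The ruler rule for the dendrogram amounts to cutting the tree inside the largest gap between successive merge distances. A minimal sketch, assuming the merge distances are listed in merge order and never decrease (which does not hold for every cluster method); the values and the function name are hypothetical.

```python
def best_number_of_classes(merge_distances):
    """Number of classes left when the tree is cut inside the largest gap."""
    n_items = len(merge_distances) + 1
    # Gap between each merge distance and the next one.
    gaps = [later - earlier
            for earlier, later in zip(merge_distances, merge_distances[1:])]
    step = gaps.index(max(gaps))  # 0-based index of the merge before the gap
    # After the first step + 1 merges, n_items - (step + 1) classes remain.
    return n_items - (step + 1)

distances = [0.2, 0.5, 3.5, 3.8]  # hypothetical merge distances for 5 items
print(best_number_of_classes(distances))
```

The number returned is only a starting point; as the text notes, the investigator still decides the final classification from the study object and purpose.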
(4) Application Analysis
4.3.3 Summary
The choice of methods (cluster method, measure, and standardization) requires repeated trials to find the best result. However, the results of different methods should not differ greatly; if they do, the clustering variables selected do not really reflect the classification features of the cases.
The classification of the cases has to be determined according to the research object and purpose. Therefore, the characteristics of the original data must be examined carefully with professional knowledge, conclusions must be drawn with care, and each resulting class should be given a name.
In variable clustering, how to merge variables that share common characteristics and how to select typical variables as representatives depends mainly on professional knowledge, the difficulty of measurement, and the correlation coefficients between variables.
Cluster analysis is often performed before other analyses to reduce the workload and save measurement time without affecting the analysis results. It is also a very practical way to select independent variables.