What is clustering?
My understanding: clustering takes a large number of unlabeled records and partitions them into clusters according to their features, so that records in the same cluster are as similar as possible while records in different clusters are as dissimilar as possible.
Clustering methods fall into several categories; before comparing them, we first need a way to measure the distance between samples.
First, how do we calculate the distance between samples?
Sample attributes can be of several types: numeric, nominal, Boolean, and so on. When computing the distance between two samples, attributes of different types must be handled separately, and the per-attribute results are then combined into a single distance. The calculation for each attribute type is described below.
For continuous numeric attributes, attributes whose values span very different ranges should first be normalized, transforming the data so that all attributes fall into a common, smaller interval.
Common normalization approaches:
1. Maximum-Minimum Normalization
$$v'_i = \frac{v_i - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
where $v_i$ is the value of attribute A in the i-th record, $\min_A$ and $\max_A$ are the minimum and maximum values of that attribute, and $new\_min_A$ and $new\_max_A$ are the left and right boundaries of the interval we want to map into.
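As a sketch, min-max normalization can be implemented like this (the function name and interface are my own, not from the notes):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map each value into the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]
```

For example, `min_max_normalize([10, 20, 30])` maps the values onto [0, 1], and passing `new_min=-1.0, new_max=1.0` maps them onto [-1, 1] instead.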
2. Z-score Normalization
$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$
where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A, respectively.
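A minimal sketch of z-score normalization (using the population standard deviation; the function name is illustrative):

```python
def z_score_normalize(values):
    """Center each value on the mean and scale by the standard deviation."""
    mean = sum(values) / len(values)
    # Population standard deviation of the attribute.
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```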
3. Decimal Scaling Normalization
Normalize by moving the decimal point of the values of attribute A:
$$v'_i = \frac{v_i}{10^j}$$
where j is the smallest integer such that $\max(|v'_i|) < 1$. The number of decimal places moved thus depends on the maximum absolute value of A.
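A sketch of decimal scaling (the function name is my own):

```python
def decimal_scale(values):
    """Divide all values by the smallest power of 10 that brings
    the maximum absolute value below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For instance, with values -986 and 127 the maximum absolute value is 986, so j = 3 and every value is divided by 1000.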
After normalization, the distance between two samples can be calculated; a common choice is the Euclidean distance:
$$d(x, y) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2}$$
If each attribute has a different weight $w_k$, the formula becomes the weighted Euclidean distance:
$$d(x, y) = \sqrt{\sum_{k=1}^{p} w_k (x_k - y_k)^2}$$
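A sketch of the (optionally weighted) Euclidean distance; treating a missing `weights` argument as equal weights is my own convention:

```python
def euclidean(x, y, weights=None):
    """Weighted Euclidean distance; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * (a - b) ** 2
               for w, a, b in zip(weights, x, y)) ** 0.5
```

For example, `euclidean([0.0, 0.0], [3.0, 4.0])` gives the familiar 3-4-5 result, and setting an attribute's weight to 0 removes it from the distance entirely.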
For samples whose attributes are all Boolean, the distance is computed from a 2x2 contingency table. For two samples i and j, let q be the number of attributes that are 1 in both samples, r the number that are 1 in i but 0 in j, s the number that are 0 in i but 1 in j, and t the number that are 0 in both. Their distance is then:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
The meaning of this formula is the ratio of the number of attributes on which the two samples differ to the total number of attributes. (For asymmetric Boolean attributes, where two 0s carry no information, t is often dropped from the denominator.)
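A sketch of this Boolean distance in Python; the function name and the symmetric/asymmetric switch are my own:

```python
def binary_distance(x, y, asymmetric=False):
    """Mismatch ratio over binary attribute vectors.
    q: both 1, r: x=1/y=0, s: x=0/y=1, t: both 0."""
    q = r = s = t = 0
    for a, b in zip(x, y):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    # Asymmetric attributes ignore the 0/0 matches in the denominator.
    denom = (q + r + s) if asymmetric else (q + r + s + t)
    return (r + s) / denom
```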
For nominal variables, a simple distance formula is:
$$d(i, j) = \frac{p - m}{p}$$
where p is the total number of attributes and m is the number of attributes on which the two samples take the same value.
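A minimal sketch of the nominal distance (illustrative name):

```python
def nominal_distance(x, y):
    """Fraction of nominal attributes on which the two samples disagree."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p
```

For example, two records that agree on 2 of 3 nominal attributes are at distance 1/3.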
If the sample set contains attributes of mixed types, the following (Gower-style) formula can be used:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where $d_{ij}^{(f)}$ is the per-attribute distance for attribute f (computed with the appropriate formula above), and the indicator $\delta_{ij}^{(f)}$ in the denominator acts as the weight of the attribute: it is 0 when the attribute cannot be compared (e.g. a value is missing) and 1 otherwise.
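A sketch of a mixed-type distance, under my own assumptions: numeric attributes contribute their absolute difference scaled by the attribute's known range, nominal attributes contribute 0/1, and a missing value (`None`) drops the attribute from both numerator and denominator:

```python
def mixed_distance(x, y, types, ranges):
    """Gower-style distance over mixed attributes.
    types[f] is 'num' or 'nom'; ranges maps numeric index f -> (min, max)."""
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue  # indicator delta = 0: attribute not comparable
        if types[f] == 'num':
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)  # range-scaled numeric difference
        else:
            d = 0.0 if a == b else 1.0  # nominal: simple mismatch
        num += d
        den += 1.0
    return num / den
```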
Partitional Clustering
Main idea: first decide to divide the data set into K clusters, then assign the samples to clusters so that similarity within each cluster is as large as possible and similarity between clusters is as small as possible.
1. K-means Clustering
Algorithm process: randomly choose a center point for each of the K clusters, then compute the distance between each sample in the data set and the K centers, assigning each sample to the cluster whose center is nearest. After one pass, recompute each cluster's center from the samples currently in it, then scan the sample set again and reassign each sample to its nearest (new) center, updating the K clusters. After several iterations, if the recomputed centers are identical to the previous centers, the iteration stops.
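The process above can be sketched as follows; the interface, the random initialization by sampling K data points, and the fallback for an empty cluster are my own choices:

```python
import random

def k_means(points, k, max_iters=100, seed=0):
    """Lloyd-style K-means over points given as equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda l: sum((a - b) ** 2
                                        for a, b in zip(p, centers[l])))
            clusters[idx].append(p)
        # Update step: recompute each center as its cluster's mean
        # (keeping the old center if a cluster went empty).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[l]
            for l, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers unchanged
            break
        centers = new_centers
    return centers, clusters
```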
Principle: the algorithm minimizes the objective function
$$P = \sum_{i=1}^{n} \sum_{l=1}^{K} w(i, l)\, d(x_i, c_l)$$
where l indexes the clusters, $w(i, l) \in \{0, 1\}$ indicates whether sample i belongs to cluster l, and $d(x_i, c_l)$ is the distance between sample i and the center of cluster l.
Our goal is to find the assignment $w(i, l)$ that minimizes the value of P. Obviously, the time complexity of exhaustively enumerating all assignments is far too high to be practical.
Instead, we use a gradient-descent-style iterative method: each step moves in a direction that decreases P, converging to a locally optimal solution.
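To make the objective concrete, here is a sketch that evaluates P for a given assignment (using squared Euclidean distance, a common choice for K-means; the names are illustrative):

```python
def objective(points, centers, assignment):
    """P = sum over samples of d(x_i, c_l) for the assigned cluster l,
    with d taken as squared Euclidean distance."""
    total = 0.0
    for i, p in enumerate(points):
        c = centers[assignment[i]]
        total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total
```

Each K-means iteration (reassigning samples, then recomputing centers) can only keep P the same or reduce it, which is why the iteration terminates at a local minimum.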
[Data Mining Course notes] unsupervised learning-clustering (clustering)