1. Definition: The data is divided into categories such that objects (entities) within the same class are highly similar to each other, while objects in different classes differ greatly.
Given a set of samples without category labels, we group them according to their mutual similarity: similar samples are assigned to the same class, dissimilar samples to different classes. This kind of grouping is called cluster analysis, also known as unsupervised classification.
2. The result depends on two factors: the first is the choice of features — the same samples described by different features will yield different clustering results; the second is the choice of similarity measure, which directly affects the quality of the clustering.
3. Classification:
By clustering criterion: statistical clustering methods, conceptual clustering methods;
By data type: numerical data clustering, discrete (categorical) data clustering, mixed-type data clustering;
By measurement criterion:
Distance-based clustering methods: measure the relationship between pairs of points using various distances or similarities, e.g. K-means.
Density-based clustering methods: group samples according to an appropriate density function.
Connectivity-based clustering methods: mainly graph-based methods; highly connected data tend to fall into the same cluster, e.g. spectral clustering.
By technical route:
Partitioning methods: divide the data according to certain rules, e.g. K-means.
Hierarchical methods: build a hierarchy over the given samples, e.g. hierarchical clustering.
Density methods: estimate the density of the data, e.g. the Gaussian mixture model.
Grid methods: divide the data space into a finite grid of cells, then cluster based on the grid structure.
Model methods: introduce a model for each cluster, then assign the data so as to best fit the models.
4. Distance and similarity measures
See also: http://www.cnblogs.com/simayuhe/p/5297560.html
Note: A function d(·,·) can be called a distance (metric) if it satisfies four conditions: non-negativity d(x, y) ≥ 0; identity d(x, y) = 0 if and only if x = y; symmetry d(x, y) = d(y, x); and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z).
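As a quick sanity check, the sketch below verifies the four metric conditions numerically for the Euclidean distance on a few sample points (the function name and the points are illustrative):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]

for x in points:
    for y in points:
        d = euclidean(x, y)
        assert d >= 0                        # non-negativity
        assert (d == 0) == (x == y)          # identity: d = 0 iff x = y
        assert d == euclidean(y, x)          # symmetry
        for z in points:                     # triangle inequality
            assert euclidean(x, z) <= d + euclidean(y, z) + 1e-12
```

Any function passing these checks on all point triples behaves as a metric on that set; other choices (e.g. Manhattan distance) can be verified the same way.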
5. Mixture density functions
Mixture density estimation provides methodological guidance for data clustering.
Note: The discussion here is a general formulation of clustering; the Gaussian mixture is only a common special case, not the only one.
Assumptions:
– The samples come from C different classes, and C is known.
– The prior probability P(ω_j) of each class, j = 1, 2, ..., C, is known.
– The form of each class-conditional probability density function p(x|ω_j, θ_j) is known.
– The C parameter vectors θ_j, j = 1, 2, ..., C, are unknown.
– The class labels of the samples are also unknown.
First, consider the data generation process: a class is selected from the C classes according to the prior probabilities, and then a sample is drawn from that class according to its class-conditional density.
We then reverse this generation process: given a set of unlabeled samples, we still assume they obey the mixture density distribution, but we do not know the proportion of each class or the parameters of each class-conditional density; these are estimated by maximum likelihood estimation. (C is still known.)
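As a sketch of the generation process just described — pick a class by its prior, then draw from that class's conditional density — here is a 1-D two-component Gaussian mixture sampler (the priors, means, and variances are illustrative assumptions):

```python
import random

random.seed(0)

priors = [0.3, 0.7]    # P(w_1), P(w_2): illustrative class priors
means  = [-2.0, 3.0]   # class-conditional means (illustrative)
sigmas = [1.0, 0.5]    # class-conditional standard deviations (illustrative)

def sample_mixture(n):
    """Two-step generation: choose a class j by its prior probability,
    then draw one sample from that class's Gaussian conditional density."""
    data = []
    for _ in range(n):
        j = random.choices(range(len(priors)), weights=priors)[0]
        data.append(random.gauss(means[j], sigmas[j]))
    return data

samples = sample_mixture(1000)
```

The clustering problem is the reverse direction: given only `samples`, recover the priors and the class parameters.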
See Pattern Recognition, Zhang Xuegong, 3rd edition, p. 187.
Log-likelihood of the mixture density over the n samples x_1, ..., x_n:

l(\theta) = \sum_{k=1}^{n} \ln p(x_k|\theta), \quad p(x_k|\theta) = \sum_{j=1}^{C} p(x_k|\omega_j, \theta_j) P(\omega_j)

Taking the gradient with respect to \theta_i and setting it to zero gives:

\sum_{k=1}^{n} P(\omega_i|x_k, \theta) \, \nabla_{\theta_i} \ln p(x_k|\omega_i, \theta_i) = 0

where the posterior probability is P(\omega_i|x_k, \theta) = p(x_k|\omega_i, \theta_i) P(\omega_i) \big/ \sum_{j=1}^{C} p(x_k|\omega_j, \theta_j) P(\omega_j).

If the priors P(\omega_i) are also treated as unknowns, they are subject to the equality constraint \sum_{i=1}^{C} P(\omega_i) = 1 (with P(\omega_i) \ge 0); such equality-constrained optimization problems are usually solved with the Lagrange multiplier method.

Finally we obtain:

\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})

In summary, the two conditions satisfied by the maximum likelihood estimates are:

1. \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})
2. \sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta}) \, \nabla_{\theta_i} \ln p(x_k|\omega_i, \hat{\theta}_i) = 0
The above is a general derivation; we now apply the results to the Gaussian mixture.
Each component of the Gaussian mixture is a multivariate normal density:

p(x|\omega_i, \theta_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_i)^{T} \Sigma_i^{-1} (x-\mu_i)\right)

Consider the case where the covariance \Sigma_i is known and only the mean \mu_i is unknown.
Substituting \nabla_{\mu_i} \ln p(x_k|\omega_i, \mu_i) = \Sigma_i^{-1}(x_k - \mu_i) into condition 2 (note that x carries the sample index k):

\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\mu}) \, \Sigma_i^{-1} (x_k - \hat{\mu}_i) = 0

Solving this equation for the mean:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\mu}) \, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\mu})}

Expanded and written in weight form:

\hat{\mu}_i = \sum_{k=1}^{n} w_{ik} x_k, \quad w_{ik} = \frac{\hat{P}(\omega_i|x_k, \hat{\mu})}{\sum_{k'=1}^{n} \hat{P}(\omega_i|x_{k'}, \hat{\mu})}
The formula above shows that the maximum likelihood estimate of a class mean is a weighted average of all samples, where the weight w_ik indicates how likely it is that sample x_k belongs to class i.
If we assume the weights for class i are nonzero only for the samples that actually belong to class i, the formula simplifies to the plain average of those samples.
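As a minimal numerical sketch of this weighted average, assume a 1-D mixture of two Gaussians with known unit variances and equal priors (all values below are illustrative); the code computes the posterior weights and the resulting weighted-mean estimates:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    """1-D normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

samples = [-2.1, -1.9, 2.0, 2.2, 1.8]   # illustrative unlabeled samples
priors  = [0.5, 0.5]                    # assumed equal class priors
mus     = [-1.0, 1.0]                   # current (rough) mean estimates

# Posterior probability P(w_i | x_k) of each class for each sample:
post = []
for x in samples:
    joint = [p * gauss_pdf(x, mu) for p, mu in zip(priors, mus)]
    total = sum(joint)
    post.append([j / total for j in joint])

# Weighted-average estimate of each class mean, as in the formula above:
new_mus = []
for i in range(len(mus)):
    w_sum = sum(post[k][i] for k in range(len(samples)))
    new_mus.append(sum(post[k][i] * samples[k] for k in range(len(samples))) / w_sum)
```

Even from the rough starting means, the weighted averages move toward the two sample groups around -2 and +2; iterating this update is one step of an EM-style procedure.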
This simplification leads to a more concrete method: k-means clustering, where k refers to the given number of classes C mentioned above, and each sample is assigned entirely to the class whose mean is nearest.
Here "nearest" requires a given distance measure, such as the Euclidean distance.
Algorithm description:
1. Choose initial estimates of the k class means.
2. Assign each sample to the class with the nearest mean.
3. Recompute each class mean as the average of the samples assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
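A minimal runnable sketch of the k-means procedure on 1-D toy data (the dataset, seed, and function name are illustrative assumptions):

```python
import random

def kmeans(data, k, iters=100, seed=0):
    """Plain k-means on 1-D data: hard-assign each sample to the nearest
    mean, recompute means as class averages, stop when assignments fix."""
    rng = random.Random(seed)
    means = rng.sample(data, k)              # initialize k means from the data
    assign = [None] * len(data)
    for _ in range(iters):
        # Assign each sample to the class with the nearest current mean.
        new_assign = [min(range(k), key=lambda i: abs(x - means[i])) for x in data]
        if new_assign == assign:             # stop when assignments are stable
            break
        assign = new_assign
        # Recompute each mean as the average of the samples assigned to it.
        for i in range(k):
            members = [x for x, a in zip(data, assign) if a == i]
            if members:
                means[i] = sum(members) / len(members)
    return means, assign

data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]    # two well-separated groups
means, assign = kmeans(data, 2)
```

On this toy data the two recovered means converge to the group averages -2.0 and 2.0; with less separated clusters the result depends on initialization, which is why k-means is often run several times with different seeds.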
Pattern Recognition class notes: Clustering (1)