Pattern Recognition Class Notes: Clustering (1)


1. Definition: the data are divided into classes such that objects (entities) within the same class are highly similar to one another, while objects in different classes differ greatly.

Given a set of samples without class labels, we group them according to their mutual similarity: similar samples are put into the same class, dissimilar samples into different classes. This kind of grouping is called cluster analysis, also known as unsupervised classification.

2. The result depends on two factors. The first is the choice made for the task (e.g., which features to use): the same samples under different tasks yield different clustering results. The second is the choice of similarity measure, which directly affects the quality of the clustering.
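As a quick illustration of the second factor, the minimal sketch below (my addition, assuming numpy and scikit-learn are installed; it is not part of the original notes) clusters the same samples with and without feature standardization, which effectively changes the distance measure, and compares the two partitions:

```python
# Clustering the same data under two different effective distance measures:
# raw features (the large-scale feature dominates) vs. standardized features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# Permutation-invariant agreement between the two partitions; typically well below 1.
print("adjusted Rand index:", adjusted_rand_score(labels_raw, labels_scaled))
```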

3. Classification:

By clustering criterion: statistical clustering methods, conceptual clustering methods.

By data type: clustering of numerical data, clustering of discrete (categorical) data, clustering of mixed-type data.

By similarity measure:

Distance-based clustering methods: measure the relationship between pairs of points using various distances or similarities, e.g., K-means.

Density-based clustering methods: cluster the samples according to a suitable density function.

Connectivity-based clustering methods: mainly graph-based methods; highly connected data tend to be grouped into the same cluster, e.g., spectral clustering.

By technical route:

Partitioning methods: divide the data according to certain rules, e.g., K-means.

Hierarchical methods: organize the given samples into a hierarchy, e.g., hierarchical clustering.

Density methods: estimate the density of the data, e.g., Gaussian mixture models.

Grid methods: divide the data space into a finite grid of cells, then cluster on top of this grid structure.

Model methods: assume a model for each cluster, then assign the data to whichever model it fits best (several of these families are illustrated in the sketch below).
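A minimal sketch (my addition, assuming scikit-learn and numpy are available; not part of the original notes) that runs one representative algorithm from several of the families above on the same toy data set:

```python
# One representative per family: partitioning, hierarchical, density (mixture),
# and connectivity-based clustering, all applied to the same synthetic blobs.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

partition = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # partitioning
hierarchy = AgglomerativeClustering(n_clusters=3).fit_predict(X)             # hierarchical
density = GaussianMixture(n_components=3, random_state=0).fit_predict(X)     # density / model
connectivity = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                                  random_state=0).fit_predict(X)             # connectivity

print(partition[:10], hierarchy[:10], density[:10], connectivity[:10], sep="\n")
```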

4. Distance and similarity measurement

See also: http://www.cnblogs.com/simayuhe/p/5297560.html

Note: any function d(·, ·) that satisfies the following four conditions can be called a distance (metric):

– Non-negativity: d(x, y) ≥ 0;

– Identity: d(x, y) = 0 if and only if x = y;

– Symmetry: d(x, y) = d(y, x);

– Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
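A small numerical spot-check of these four conditions for the Euclidean distance (my addition, numpy only; checking random points illustrates the axioms but of course does not prove them):

```python
# Spot-check the four metric axioms for the Euclidean distance on random vectors.
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 5))          # three random 5-dimensional points

assert euclidean(x, y) >= 0                                  # non-negativity
assert np.isclose(euclidean(x, x), 0.0)                      # d(x, x) = 0
assert np.isclose(euclidean(x, y), euclidean(y, x))          # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality
print("all four conditions hold for this random triple")
```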

5. Mixture density functions

Mixture density estimation provides methodological guidance for data clustering.

Note: the discussion here is a general formulation of this kind of clustering; the Gaussian mixture is only a common special case, not the only one.

Assume:

– The samples come from c different classes, and c is known.

– The prior probability P(ω_j) of each class, j = 1, 2, ..., c, is known.

– The form of the class-conditional probability density p(x | ω_j, θ_j) is known.

– The c parameter vectors θ_j, j = 1, 2, ..., c, are unknown.

– The class labels of the samples are also unknown.

First, consider the data-generation process: pick one of the c classes, ω_j, with probability P(ω_j), then draw a sample from that class according to its class-conditional density p(x | ω_j, θ_j). The resulting samples therefore follow the mixture density

p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)
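The generative process can be simulated directly. The sketch below (my addition, numpy only; the priors, means, and covariances are made-up toy values, not from the notes) first picks a class according to its prior and then draws from that class's Gaussian class-conditional density:

```python
# Simulate the two-step generative process of a Gaussian mixture:
# 1) pick a class omega_j with probability P(omega_j); 2) sample x ~ p(x | omega_j).
import numpy as np

rng = np.random.default_rng(2)
priors = np.array([0.3, 0.5, 0.2])                      # P(omega_j), must sum to 1
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 8.0]])  # class means
covs = np.array([np.eye(2)] * 3)                        # class covariances

def sample_mixture(n):
    labels = rng.choice(len(priors), size=n, p=priors)              # step 1
    samples = np.array([rng.multivariate_normal(means[j], covs[j])  # step 2
                        for j in labels])
    return samples, labels

X, labels = sample_mixture(500)
print(X.shape, np.bincount(labels) / len(labels))  # empirical proportions ~ priors
```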

We then reverse this generative process: given a set of unlabeled samples that we still assume follow the mixture density, but without knowing the mixing proportion of each class or the parameters of each class-conditional density, we estimate these quantities by the method of maximum likelihood (c is still assumed known).

See "Pattern Recognition" Zhang Xue third edition p187

Log-likelihood for n independent samples x_1, ..., x_n:

l(\theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta) = \sum_{k=1}^{n} \ln \sum_{j=1}^{c} p(x_k \mid \omega_j, \theta_j)\, P(\omega_j)

Taking the derivative with respect to θ_i and setting it to zero gives

\sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i) = 0

where the posterior probability is

P(\omega_i \mid x_k, \theta) = \frac{p(x_k \mid \omega_i, \theta_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \theta_j)\, P(\omega_j)}

For the priors, which are subject to the equality constraint \sum_{j=1}^{c} P(\omega_j) = 1 (with P(\omega_j) \ge 0), the constrained optimization problem is handled in the usual way with the Lagrange multiplier method:

Finally we get

\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

In summary, the two conditions are:

(1) \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

(2) \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0

with the posterior \hat{P}(\omega_i \mid x_k, \hat{\theta}) computed from the current estimates as above.

The above is the general derivation; its results are now applied to the Gaussian mixture:

Each component of the Gaussian mixture follows a d-dimensional normal distribution of the form

p(x \mid \omega_i, \theta_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^{\mathsf{T}} \Sigma_i^{-1} (x - \mu_i)\right)

Consider the case where the covariance Σ_i is known and the mean μ_i is unknown.

Substituting this density into condition (2) (note that x should carry the sample subscript k, i.e., x_k):

\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})\, \Sigma_i^{-1} (x_k - \hat{\mu}_i) = 0

Solving this equation for the mean:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})}

Expanding this and writing it in terms of weights:

\hat{\mu}_i = \sum_{k=1}^{n} w_{ik}\, x_k, \qquad w_{ik} = \frac{\hat{P}(\omega_i \mid x_k, \hat{\mu})}{\sum_{k'=1}^{n} \hat{P}(\omega_i \mid x_{k'}, \hat{\mu})}

The formula above shows that the maximum likelihood estimate of a class mean is a weighted average of the samples; the weight w_{ik} reflects how likely it is that sample x_k belongs to class ω_i.
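The weighted-average formula can be iterated as a fixed point. The sketch below (my addition; it uses numpy and scipy, and the helper name update_means is mine, not the textbook's) computes the posteriors P(ω_i | x_k) under known covariances and priors and then applies the mean update derived above:

```python
# One fixed-point update of the class means: posterior-weighted averages of the samples.
import numpy as np
from scipy.stats import multivariate_normal

def update_means(X, means, covs, priors):
    # responsibilities P(omega_i | x_k), shape (n_samples, n_classes)
    dens = np.column_stack([multivariate_normal.pdf(X, m, c)
                            for m, c in zip(means, covs)])
    post = dens * priors
    post /= post.sum(axis=1, keepdims=True)
    # mu_i = sum_k P(omega_i | x_k) x_k / sum_k P(omega_i | x_k)
    return (post.T @ X) / post.sum(axis=0)[:, None]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([5, 5], 1, (100, 2))])
means = np.array([[1.0, 1.0], [4.0, 4.0]])     # rough initial guesses
covs = [np.eye(2), np.eye(2)]                  # known covariances
priors = np.array([0.5, 0.5])                  # known priors
for _ in range(20):                            # iterate the fixed-point condition
    means = update_means(X, means, covs, priors)
print(means)                                   # should approach [0, 0] and [5, 5]
```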

If we note that the weights are essentially nonzero only for the samples that belong to class ω_i (i.e., each posterior is hardened to 0 or 1), the formula simplifies to the plain average of the samples assigned to class i:

\hat{\mu}_i = \frac{1}{n_i} \sum_{x_k \in \omega_i} x_k

This leads to a more concrete method, k-means clustering, where k is exactly the given number of classes c mentioned above; it implements the simplification just described by assigning each sample entirely to the class whose mean is nearest to it.

Here "nearest" presupposes a given distance measure, such as the Euclidean distance.

Algorithm description (the standard k-means iteration):

– Choose k initial cluster centers.

– Assign each sample to the nearest center.

– Recompute each center as the mean of the samples assigned to it.

– Repeat the last two steps until the assignments (or centers) no longer change.
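A minimal implementation of this procedure (my addition, numpy only; the random-subset initialization is my own choice since the notes do not specify one):

```python
# Plain k-means: nearest-center assignment followed by mean recomputation.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # 1. initial centers
    for _ in range(n_iter):
        # 2. assign each sample to its nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 3. recompute each center as the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # 4. stop when stable
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([6, 6], 1, (100, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```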
