Reading notes on "Data Mining: Concepts and Techniques": advanced cluster analysis

Last Update:2015-04-07 Source: Internet

Author: User

1 Probabilistic model-based clustering

Examples:

A. In product reviews, a single review may involve several products. For example, a review may discuss the compatibility of a camera with a computer. Such a review is related to both clusters and does not belong exclusively to either one.

B. When a user shops for products, a search session may include ordering a digital camera while comparing several laptops at the same time. Such a session should belong, to some degree, to both clusters.

1.1 Fuzzy clusters

The example in this section is good.

1.2 Probabilistic model-based clusters

An object participates in multiple clusters, each with some probability.

A mixture model assumes that the set of observed objects is a mixture of instances from multiple probabilistic clusters.

Take the univariate Gaussian mixture model as an example. Assume that the probability density function of each cluster follows a Gaussian distribution and that every cluster has the same probability. Probabilistic model-based clustering with a univariate Gaussian mixture model then amounts to inferring the parameters that maximize the likelihood of the observed data.
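The formula itself did not survive in these notes. Under the assumptions stated above (k Gaussian clusters with equal probability 1/k), the quantity to maximize is presumably the standard mixture likelihood. The symbols below are chosen here and are not taken from the notes: o_i for the n observed values, and \mu_j, \sigma_j for the mean and standard deviation of cluster j.

```latex
P(O \mid \Theta) \;=\; \prod_{i=1}^{n} \sum_{j=1}^{k} \frac{1}{k} \cdot
\frac{1}{\sqrt{2\pi}\,\sigma_j}
\exp\!\left( -\frac{(o_i - \mu_j)^2}{2\sigma_j^2} \right),
\qquad \Theta = \{(\mu_j, \sigma_j)\}_{j=1}^{k}
```

The constant factor 1/k does not change which parameters maximize the expression, so it is sometimes omitted.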

1.3 Expectation-maximization (EM) algorithm (combined with previous notes)

Relating it to the earlier k-means method:

How can the EM algorithm be applied to the Gaussian mixture model above?
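One possible answer, as a minimal sketch rather than the book's exact procedure: alternate an E-step that computes each object's membership probabilities with an M-step that re-estimates each cluster's mean and standard deviation. The function name `em_gmm_1d`, the quantile-based initialization, and the fixed iteration count are assumptions made for this illustration. Cluster weights are kept equal, as in the text.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100):
    """EM for a univariate Gaussian mixture with equal cluster weights (1/k)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialization: spread the means over the data quantiles
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std() + 1e-9)
    for _ in range(n_iter):
        # E-step: probability that each object belongs to each cluster
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mean and standard deviation of each cluster
        w = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / w
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / w) + 1e-9
    return mu, sigma, resp
```

Compared with k-means, the assignment is soft: each row of `resp` sums to 1 instead of naming a single cluster.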

2 Clustering high-dimensional data

2.1 Clustering high-dimensional data: problems, challenges, and major methodologies

Challenges:

A. The data space is often too large;

B. It is not enough to find the clusters; for each cluster, we also need to find the set of attributes that manifest it.

2.2 Subspace clustering methods

A. Subspace search methods

B. Correlation-based clustering method

E.g., PCA (principal component analysis)
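As a rough illustration of how PCA derives a lower-dimensional space from correlated attributes, the sketch below centers the data and takes the leading right singular vectors as principal directions. The function name `pca` and its return values are choices made for this example only.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each attribute
    # Rows of Vt are principal directions, ordered by decreasing singular value
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = s[:n_components] ** 2 / (len(X) - 1)
    return Xc @ components.T, components, explained_variance
```

Clustering can then be run on the projected data instead of the original attributes.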

C. Biclustering methods

2.3 Biclustering

1 Application examples

In biclustering, objects and attributes are treated symmetrically: data analysis involves searching the data matrix for submatrices that show unique patterns as clusters.

The recommender system example is the easier one to understand:

2 Types of biclusters

Bicluster: a submatrix in which the rows and columns follow a consistent pattern. Different types of biclusters can be defined based on this pattern:

A. Biclusters with constant values;

B. Biclusters with constant values on rows/columns;

C. Biclusters with coherent values;

D. Biclusters with coherent evolutions on rows/columns;
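The four types can be made concrete with small checks on a candidate submatrix. This is a sketch under assumed definitions. Type C is taken to mean an additive pattern e_ij = c + alpha_i + beta_j. Type D on rows is taken to mean that any two rows change in the same direction between any two columns. The function names and the tolerance `tol` are illustrative.

```python
import numpy as np

def is_constant(sub, tol=1e-9):
    """Type A: every entry equals the same constant."""
    sub = np.asarray(sub, dtype=float)
    return bool(sub.max() - sub.min() <= tol)

def is_constant_on_rows(sub, tol=1e-9):
    """Type B (row form): each row is constant, e_ij = c + alpha_i."""
    sub = np.asarray(sub, dtype=float)
    return bool(np.all(sub.max(axis=1) - sub.min(axis=1) <= tol))

def is_coherent(sub, tol=1e-9):
    """Type C: additive pattern e_ij = c + alpha_i + beta_j."""
    sub = np.asarray(sub, dtype=float)
    # For an additive pattern, removing row and column means leaves no residue
    residue = (sub - sub.mean(axis=1, keepdims=True)
               - sub.mean(axis=0, keepdims=True) + sub.mean())
    return bool(np.all(np.abs(residue) <= tol))

def has_coherent_evolution_on_rows(sub):
    """Type D (row form): all rows rise or fall together between any two columns."""
    sub = np.asarray(sub, dtype=float)
    diffs = sub[:, :, None] - sub[:, None, :]   # diffs[i, j1, j2] = e[i, j1] - e[i, j2]
    prod = diffs[:, None] * diffs[None, :]      # compare every pair of rows
    return bool(np.all(prod >= 0))
```

Each type relaxes the previous one: a constant bicluster passes all four checks, while a bicluster with coherent evolutions only constrains the direction of change, not the values.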

3 Biclustering methods

2.4 Dimensionality reduction methods and spectral clustering

The idea is to construct a new space.

General framework for spectral clustering:
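The framework figure is missing from these notes. As a hedged sketch of the usual steps (build an affinity matrix, normalize it, take leading eigenvectors as the new space, cluster in that space), the example below handles only the two-cluster case. It replaces the final clustering step with a sign test on the second eigenvector and assumes the affinity graph is connected. The function name `spectral_bipartition` and the Gaussian kernel width `sigma` are assumptions.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split the rows of X into two clusters using one spectral embedding axis."""
    X = np.asarray(X, dtype=float)
    # 1. Affinity matrix from a Gaussian kernel, with a zero diagonal
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # 2. Normalized affinity A = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    A = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # 3. Eigenvectors of A; eigh returns eigenvalues in ascending order
    _, vecs = np.linalg.eigh(A)
    second = vecs[:, -2] * d_inv_sqrt  # second leading eigenvector, rescaled
    # 4. Cluster in the new one-dimensional space by sign
    return (second > 0).astype(int)
```

For more than two clusters, the usual approach keeps the k leading eigenvectors and runs a method such as k-means on the rows of the resulting matrix.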

3 Clustering graph and network data

3.1 Applications and challenges

3.2 Similarity measures

A. Geodesic distance, from which eccentricity is calculated

B. SimRank: similarity based on random walk and structural context

Intuitive meaning: two vertices in a graph are similar if they are linked to similar vertices. To measure this similarity, we need to define the individual neighborhood of a vertex, which comprises the in-neighborhood and the out-neighborhood.

In-neighborhood:

Out-neighborhood:

Similarity:

SimRank:
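The formulas are missing here. A common way to compute SimRank is by fixed-point iteration: the similarity of two distinct vertices is a decay constant C times the average similarity over all pairs of their in-neighbors, a vertex has similarity 1 with itself, and the similarity is 0 when either in-neighborhood is empty. The sketch below follows that scheme. The function name, the default C = 0.8, and the iteration count are assumptions.

```python
def simrank(in_neighbors, C=0.8, n_iter=10):
    """in_neighbors maps each vertex to the list of its in-neighbors."""
    nodes = list(in_neighbors)
    # Start from s(u, u) = 1 and s(u, v) = 0 for u != v
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(n_iter):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not in_neighbors[u] or not in_neighbors[v]:
                    new[u][v] = 0.0  # no in-neighbors: no evidence of similarity
                else:
                    total = sum(sim[x][y]
                                for x in in_neighbors[u]
                                for y in in_neighbors[v])
                    new[u][v] = C * total / (len(in_neighbors[u]) * len(in_neighbors[v]))
        sim = new
    return sim
```

For a graph with edges a -> b and a -> c, vertices b and c share their only in-neighbor, so their similarity is C.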

3.3 Graph Clustering method

Minimum cut:

Minimum cuts do not necessarily lead to good clusterings. Definition of the sparsity of a cut:
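The definition is missing from these notes. Assuming the usual one, in which the sparsity of a cut (S, T) is the number of edges crossing the cut divided by the size of the smaller side, it can be computed as follows. The function name and the edge-list representation are illustrative.

```python
def cut_sparsity(edges, S, T):
    """Cut size divided by min(|S|, |T|) for an undirected edge list."""
    S, T = set(S), set(T)
    cut_size = sum(1 for u, v in edges
                   if (u in S and v in T) or (u in T and v in S))
    return cut_size / min(len(S), len(T))
```

Dividing by the smaller side penalizes cuts that merely split off one or two vertices, which is the typical failure of a plain minimum cut.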

Definition of the modularity of a clustering, used to evaluate clustering quality:
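The formula is missing here as well. Assuming the standard definition, modularity sums over clusters the fraction of edges inside the cluster minus the fraction expected if edges were placed at random with the same vertex degrees. In the sketch below, `l_i` is the number of edges inside cluster i and `d_i` is the sum of the degrees of its vertices. These names are chosen for this example.

```python
def modularity(edges, clusters):
    """Q = sum_i ( l_i / |E| - (d_i / (2|E|))^2 ) for an undirected edge list."""
    m = len(edges)
    q = 0.0
    for cluster in clusters:
        members = set(cluster)
        # Edges with both endpoints inside the cluster
        l_i = sum(1 for u, v in edges if u in members and v in members)
        # Sum of degrees of the cluster's vertices
        d_i = sum((u in members) + (v in members) for u, v in edges)
        q += l_i / m - (d_i / (2 * m)) ** 2
    return q
```

Putting every vertex in a single cluster gives Q = 0, so a positive value indicates more intra-cluster edges than expected by chance.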

The graph clustering problem can be viewed as finding the best cuts in a graph, such as the sparsest cuts. Challenges:

A. High computational cost;

B. Sophisticated graphs;

C. High dimensionality;

D. Sparsity;

Approaches:

A. Use methods for clustering high-dimensional data;

Extract a similarity matrix using a similarity measure, then find clusters by applying a generic clustering method to the similarity matrix. In many cases, once the similarity matrix is obtained, spectral clustering can be used. Spectral clustering can approximate the optimal graph cut solution.

B. Use methods designed specifically for graph clustering;

Search the graph for well-connected components and treat them as clusters. Take SCAN (Structural Clustering Algorithm for Networks) as an example:

SCAN uses the normalized common neighborhood size to measure the similarity between two vertices u and v:
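The formula is missing from these notes. Assuming the usual SCAN definition, where the neighborhood of a vertex includes the vertex itself and the similarity is the size of the common neighborhood divided by the geometric mean of the two neighborhood sizes, a sketch looks like this. The adjacency-dict representation and the parameter names `eps` and `mu` are assumptions.

```python
import math

def structural_similarity(adj, u, v):
    """|N(u) & N(v)| / sqrt(|N(u)| * |N(v)|), where N(x) includes x itself."""
    gu = set(adj[u]) | {u}
    gv = set(adj[v]) | {v}
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))

def is_core(adj, u, eps, mu):
    """u is a core vertex if at least mu members of N(u) have similarity >= eps."""
    gu = set(adj[u]) | {u}
    return sum(structural_similarity(adj, u, v) >= eps for v in gu) >= mu
```

Clusters are then grown from core vertices by following neighbors whose similarity reaches the threshold.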

SCAN algorithm:

One advantage of SCAN is that its time complexity is linear in the number of edges. On large sparse graphs, the number of edges is of the same order of magnitude as the number of vertices. SCAN is therefore expected to scale well on large graphs.

4 Clustering with constraints

4.1 Categories of constraints

4.2 Methods for clustering with constraints

A. Handling hard constraints:

Ⅰ. Use the must-link constraints to generate super-instances;

Ⅱ. Run a modified k-means that, during clustering, seeks the best assignment while guaranteeing that no constraint is violated.
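The two steps above can be sketched as follows, under assumptions made here. Super-instances are taken to be the groups formed by the transitive closure of the must-link pairs, computed with a union-find structure. The modified k-means is represented only by the check it would run before assigning an object to a cluster. Both function names are hypothetical.

```python
def super_instances(n, must_link):
    """Group objects 0..n-1 into super-instances via must-link transitive closure."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def violates_cannot_link(i, cluster, assignment, cannot_link):
    """True if placing object i in `cluster` joins it with a cannot-link partner."""
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False
```

In the assignment step, an object would go to the nearest center for which `violates_cannot_link` returns False.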

B. Handling soft constraints: (an optimization problem in which the overall objective function combines a clustering quality score and a penalty score)

Ⅰ. Penalty for violating a must-link constraint;

Ⅱ. Penalty for violating a cannot-link constraint.
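A combined objective of this kind might look like the sketch below. It uses the k-means sum of squared distances as the quality score and charges a fixed weight per violated constraint. The constant weights `w_must` and `w_cannot` are a simplification introduced here. The notes do not say how the penalties are actually defined, and published methods often scale them by distances between cluster centers.

```python
import numpy as np

def penalized_objective(X, labels, centers, must_link, cannot_link,
                        w_must=1.0, w_cannot=1.0):
    """Clustering quality score plus penalties for violated soft constraints."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # Quality: sum of squared distances to the assigned centers
    quality = ((X - centers[labels]) ** 2).sum()
    # A must-link pair is violated when its objects are in different clusters
    must_violations = sum(int(labels[a] != labels[b]) for a, b in must_link)
    # A cannot-link pair is violated when its objects share a cluster
    cannot_violations = sum(int(labels[a] == labels[b]) for a, b in cannot_link)
    return float(quality + w_must * must_violations + w_cannot * cannot_violations)
```

Minimizing this value trades off compact clusters against respecting the constraints, which is what makes the constraints soft.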

C. Speeding up constrained clustering

Ⅰ. First cluster the points into micro-clusters: triangulate the region ---> group nearby points in the same triangle into a micro-cluster ---> precompute two kinds of join indices based on shortest-path computations


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
