1 Probabilistic model-based clustering
Example:
A. In product reviews, a single review may involve several products. For example, a review that discusses the compatibility of a camera with a computer is related to both clusters and does not belong exclusively to either one.
B. When a user searches for products to buy, a single session may include ordering a digital camera while comparing several laptops. Such a session should belong, to some extent, to both clusters.
1.1 Fuzzy clusters
The example in this section is good.
1.2 Probabilistic model-based clusters
An object participates in multiple clusters probabilistically.
A mixture model assumes that the set of observed objects is a mixture of instances from multiple probabilistic clusters.
Take the univariate Gaussian mixture model as an example: the probability density function of each cluster is assumed to follow a Gaussian distribution, and every cluster is assumed to have the same probability. Probabilistic model-based clustering with this model then amounts to inferring the cluster parameters that maximize the likelihood of the observed data.
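The objective for this univariate case can be written out as below. This is a reconstruction under the assumptions stated in the notes (k clusters with equal probability 1/k, cluster j Gaussian with mean \mu_j and standard deviation \sigma_j); the symbols o_i, n, k and \Theta are notation chosen here, not taken from the notes.

```latex
P(o_i \mid \Theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j}
    \exp\!\left(-\frac{(o_i - \mu_j)^2}{2\sigma_j^2}\right)

P(O \mid \Theta) = \prod_{i=1}^{n} \frac{1}{k} \sum_{j=1}^{k} P(o_i \mid \Theta_j),
\qquad \Theta = \{(\mu_1, \sigma_1), \ldots, (\mu_k, \sigma_k)\}
```

The clustering task is then to find the \Theta that makes P(O \mid \Theta) as large as possible.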
1.3 Expectation-maximization (EM) algorithm (see also the previous notes)
Relation to the k-means method covered earlier:
How is the EM algorithm applied to the Gaussian mixture model above?
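One possible answer, as a minimal sketch rather than the book's exact procedure: the E-step computes how strongly each object belongs to each cluster, and the M-step re-estimates each cluster's mean and standard deviation from those weights. The function names, the fixed iteration count, and the `min_sigma` floor are assumptions made for this illustration; cluster probabilities are kept equal, as in the model above.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and std sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def em_gmm_1d(data, mus, sigmas, iterations=50, min_sigma=1e-3):
    """EM for a univariate Gaussian mixture with equal cluster probabilities.

    data: list of floats; mus, sigmas: initial parameters, one entry per cluster.
    Returns (mus, sigmas, resp) where resp[i][j] is the membership weight of
    object i in cluster j.
    """
    k = len(mus)
    mus, sigmas = list(mus), list(sigmas)
    resp = []
    for _ in range(iterations):
        # E-step: membership weights (the equal priors 1/k cancel out).
        resp = []
        for x in data:
            dens = [gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)
        # M-step: weighted mean and standard deviation per cluster.
        for j in range(k):
            weight = sum(r[j] for r in resp)
            if weight == 0:
                continue  # cluster j attracted no mass; keep its parameters
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / weight
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / weight
            sigmas[j] = max(math.sqrt(var), min_sigma)  # floor avoids sigma = 0
    return mus, sigmas, resp
```

Calling `em_gmm_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], [0.0, 6.0], [1.0, 1.0])` should move the two means close to the two visible groups. Unlike k-means, every object keeps a weight in every cluster instead of a hard assignment.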
2 Clustering high-dimensional data
2.1 Clustering high-dimensional data: problems, challenges, and key approaches
Challenges:
A. The data space is often too large;
B. We need not only to find the clusters but also, for each cluster, to find the set of attributes that manifest it.
2.2 Subspace clustering methods
A. Subspace search methods
B. Correlation-based clustering methods
e.g., PCA (principal component analysis)
C. Biclustering methods
2.3 Biclustering
1 Application examples
Biclustering treats objects and attributes symmetrically: the data analysis searches a data matrix for submatrices that show unique patterns, and these submatrices are the clusters.
The recommender system example is the easier one to understand:
2 Types of biclusters
Bicluster: a submatrix in which both dimensions follow a consistent pattern. Different types of biclusters can be defined based on this pattern:
A. Biclusters with constant values;
B. Biclusters with constant values on rows/columns;
C. Biclusters with coherent values;
D. Biclusters with coherent evolutions on rows/columns.
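The first three types in the list above can be told apart with a small check, sketched below. It assumes the additive model for coherent values (each entry equals a base value plus a row offset plus a column offset), for which the mean squared residue of the submatrix is zero. The function names and the tolerance `tol` are illustrative assumptions; type D (coherent evolutions) is not covered.

```python
def mean_squared_residue(sub):
    """Mean squared residue of a submatrix given as a list of equal-length rows."""
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue is zero when the entry is base + row offset + column offset.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


def bicluster_type(sub, tol=1e-9):
    """Classify a submatrix as one of the bicluster types A, B, or C."""
    values = [v for row in sub for v in row]
    if max(values) - min(values) <= tol:
        return "constant values"
    if all(max(row) - min(row) <= tol for row in sub):
        return "constant values on rows"
    if all(max(col) - min(col) <= tol for col in zip(*sub)):
        return "constant values on columns"
    if mean_squared_residue(sub) <= tol:
        return "coherent values"
    return "none of the above"
```

For example, `[[1, 2, 3], [11, 12, 13], [4, 5, 6]]` has no constant row or column, yet every row is the first row plus an offset, so it counts as a bicluster with coherent values.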
3 Biclustering methods
2.4 Dimensionality reduction methods and spectral clustering
The idea is to construct a new space rather than use subspaces of the original data.
General framework for spectral clustering:
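A common recipe for spectral clustering is: build an affinity matrix W, normalize it as A = D^{-1/2} W D^{-1/2} (D is the diagonal degree matrix), take the leading eigenvectors of A as the new space, and cluster the projected points there. The sketch below is a deliberately simplified two-cluster variant: instead of running k-means on k eigenvectors, it finds only the second eigenvector by power iteration and splits the vertices by sign. It assumes a connected graph with positive degrees; the function name and the iteration count are assumptions for illustration.

```python
import math


def spectral_bipartition(W, iterations=200):
    """Split a connected weighted graph into two clusters.

    W: symmetric affinity matrix as a list of lists. Returns a 0/1 label per vertex.
    """
    n = len(W)
    degree = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    # Normalized affinity matrix A = D^{-1/2} W D^{-1/2}.
    A = [[inv_sqrt[i] * W[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]
    # For a connected graph, the leading eigenvector of A (eigenvalue 1)
    # is proportional to sqrt(degree); it carries no cluster information.
    norm = math.sqrt(sum(degree))
    first = [math.sqrt(d) / norm for d in degree]

    def remove_first(vec):
        dot = sum(v * f for v, f in zip(vec, first))
        return [v - dot * f for v, f in zip(vec, first)]

    # Arbitrary deterministic start vector, made orthogonal to the first eigenvector.
    vec = remove_first([(-1.0) ** i * (i + 1) for i in range(n)])
    for _ in range(iterations):
        # Power iteration on A + I (the shift makes all eigenvalues nonnegative).
        vec = [vec[i] + sum(A[i][j] * vec[j] for j in range(n)) for i in range(n)]
        vec = remove_first(vec)
        length = math.sqrt(sum(v * v for v in vec))
        vec = [v / length for v in vec]
    # The sign of the second eigenvector gives the two clusters.
    return [0 if v >= 0 else 1 for v in vec]
```

On two tightly connected triangles joined by one weak edge, the labels separate the triangles. For more than two clusters, the usual approach is to keep k eigenvectors and run k-means on the rows of the resulting matrix.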
3 Clustering graph and network data
3.1 Applications and challenges
3.2 Similarity measures
A. Geodesic distance (used to compute eccentricity)
B. SimRank: similarity based on random walk and structural context
Intuition: two vertices in a graph are similar if they are linked to similar vertices. To measure this similarity, we need the notion of a vertex's individual neighborhood, which comprises its in-neighborhood and its out-neighborhood.
In-neighborhood:
Out-neighborhood:
Similarity:
SimRank:
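The intuition above can be made concrete with the standard SimRank recursion: s(u, u) = 1, and for u != v, s(u, v) is a decay constant C times the average similarity over all pairs of in-neighbors of u and v (0 if either vertex has no in-neighbors). The sketch below iterates that recursion a fixed number of times. The graph representation (a dict of in-neighbor lists), the default C = 0.8, and the iteration count are assumptions for illustration.

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Iterative SimRank.

    in_neighbors: dict mapping each vertex to the list of vertices that link to it
    (every vertex must appear as a key). Returns a dict (u, v) -> similarity.
    """
    nodes = list(in_neighbors)
    # Start from s(u, u) = 1 and s(u, v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new_sim = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_sim[(u, v)] = 1.0
                    continue
                in_u, in_v = in_neighbors[u], in_neighbors[v]
                if not in_u or not in_v:
                    new_sim[(u, v)] = 0.0  # no in-neighbors: nothing to compare
                    continue
                # Average similarity over all pairs of in-neighbors, decayed by C.
                total = sum(sim[(x, y)] for x in in_u for y in in_v)
                new_sim[(u, v)] = C * total / (len(in_u) * len(in_v))
        sim = new_sim
    return sim
```

With a graph where `a` links to both `b` and `c`, that is `simrank({"a": [], "b": ["a"], "c": ["a"]})`, the vertices `b` and `c` get similarity C because they share their only in-neighbor.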
3.3 Graph clustering methods
Minimum cut:
A minimum cut does not necessarily lead to a good clustering. Definition of the sparsity of a cut:
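A common way to define the sparsity of a cut (S, T) is the cut size divided by the size of the smaller side, min(|S|, |T|), so that cutting off a tiny piece is penalized. The sketch below assumes an undirected, unweighted graph given as an edge list; the function name is illustrative.

```python
def cut_sparsity(edges, S, T):
    """Sparsity of the cut (S, T): crossing edges divided by min(|S|, |T|)."""
    S, T = set(S), set(T)
    cut_size = sum(
        1 for u, v in edges
        if (u in S and v in T) or (u in T and v in S)
    )
    return cut_size / min(len(S), len(T))
```

Take two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3), plus a pendant vertex 6 attached to 5. Cutting off vertex 6 alone and cutting between the triangles both remove a single edge, but the first cut has sparsity 1 and the second has sparsity 1/3, so the sparser cut is the more balanced one.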
Modularity of a clustering, used to evaluate the quality of the clustering
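Modularity is commonly computed as the sum over clusters of l_i/|E| - (d_i/(2|E|))^2, where l_i is the number of edges inside cluster i and d_i is the sum of the degrees of its vertices. The sketch below uses that form for an undirected, unweighted edge list; the function name and input format are assumptions for illustration.

```python
def modularity(edges, clusters):
    """Modularity of a clustering of an undirected, unweighted graph.

    edges: list of (u, v) pairs; clusters: list of collections of vertices.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for cluster in clusters:
        cluster = set(cluster)
        # l_i: edges with both endpoints inside the cluster.
        internal = sum(1 for u, v in edges if u in cluster and v in cluster)
        # d_i: total degree of the cluster's vertices.
        total_degree = sum(degree.get(v, 0) for v in cluster)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q
```

For two triangles joined by a single edge, splitting into the two triangles gives a clearly positive modularity, while putting every vertex into one cluster gives 0.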
The graph clustering problem can be viewed as finding the best cuts in the graph, such as the sparsest cut.
Challenges:
A. High computational cost;
B. Sophisticated graphs;
C. High dimensionality;
D. Sparsity.
Approaches:
A. Use methods for clustering high-dimensional data;
Use a similarity measure to extract a similarity matrix, then apply a generic clustering method to that matrix to find the clusters. In many cases, once the similarity matrix is available, spectral clustering can be applied. Spectral clustering can approximate the optimal graph cut solution.
B. Use methods designed specifically for graph clustering;
Search the graph for well-connected components and treat them as clusters. Take SCAN (Structural Clustering Algorithm for Networks) as an example:
SCAN uses the normalized common neighborhood size to measure the similarity between two vertices u and v:
SCAN algorithm:
One advantage of SCAN is that its time complexity is linear in the number of edges. In large sparse graphs, the number of edges is of the same order of magnitude as the number of vertices, so SCAN is expected to scale well to large graphs.
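The similarity measure used by SCAN can be sketched as follows, assuming the usual definition: the neighborhood of u contains u itself plus its adjacent vertices, and the similarity of u and v is the size of their common neighborhood divided by the geometric mean of the two neighborhood sizes. The core-vertex test with thresholds `eps` and `mu` follows the usual SCAN formulation; the function names and the adjacency-dict input are illustrative assumptions, and the full cluster-growing step is not shown.

```python
import math


def neighborhood(adj, u):
    """Neighborhood of u: its adjacent vertices plus u itself."""
    return set(adj[u]) | {u}


def structural_similarity(adj, u, v):
    """Normalized common neighborhood size of u and v."""
    gu, gv = neighborhood(adj, u), neighborhood(adj, v)
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))


def is_core(adj, u, eps, mu):
    """u is a core vertex if at least mu vertices in its neighborhood
    (including u) have similarity >= eps with u."""
    similar = [v for v in neighborhood(adj, u) if structural_similarity(adj, u, v) >= eps]
    return len(similar) >= mu
```

For a triangle 0-1-2 with an extra vertex 3 attached to 2, vertices 0 and 1 have similarity 1.0 because their neighborhoods coincide, while 2 and 3 share only part of theirs. Each similarity needs only the two adjacency lists involved, which is consistent with the linear cost in the number of edges.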
4 Clustering with constraints
4.1 Categories of constraints
4.2 Methods for clustering with constraints
A. Handling hard constraints:
Ⅰ. Use the must-link constraints to generate super-instances;
Ⅱ. Run a modified k-means clustering that, during clustering, seeks the best assignment provided that no constraint is violated.
B. Handling soft constraints (an optimization problem: the overall objective function combines the clustering quality score and the penalty score):
Ⅰ. Penalty for violating a must-link constraint;
Ⅱ. Penalty for violating a cannot-link constraint.
C. Speeding up constrained clustering
Ⅰ. First cluster the points into micro-clusters: triangulate the region ---> group nearby points in the same triangle into a micro-cluster ---> perform precomputation to build two kinds of join indices based on shortest-path computation
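Step A.Ⅰ above (turning must-link constraints into super-instances) can be sketched with a union-find structure: objects connected through any chain of must-link constraints are merged into one group, and each group is replaced by its mean together with a weight equal to the number of objects it represents. The function names and the tuple-based point format are assumptions for illustration; the constrained k-means step A.Ⅱ is not shown.

```python
def must_link_closure(n_objects, must_links):
    """Group object indices by the transitive closure of must-link constraints."""
    parent = list(range(n_objects))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in must_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(n_objects):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def super_instances(points, must_links):
    """Replace each must-link group by (mean point, number of objects)."""
    dim = len(points[0])
    result = []
    for group in must_link_closure(len(points), must_links):
        mean = tuple(sum(points[i][d] for i in group) / len(group) for d in range(dim))
        result.append((mean, len(group)))
    return result
```

With points `[(0, 0), (2, 0), (4, 0), (10, 10)]` and must-links `[(0, 1), (1, 2)]`, the first three points collapse into one super-instance at (2.0, 0.0) with weight 3, and the last point stays on its own. The modified k-means then only has to respect the remaining cannot-link constraints between super-instances.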
Reading notes on "Data Mining: Concepts and Techniques": advanced cluster analysis