Data Mining: Concepts and Techniques, Chapter 10: Clustering Study Notes


Partitioning-based clustering methods

Given a collection of n objects, a partitioning method divides the objects into k clusters, where each cluster contains at least one object.

k-means pseudocode

Input: k: the number of clusters

D: a dataset containing n objects

Output: a set of k clusters

Method:

(1) Arbitrarily choose k objects from D as the initial cluster centers.

(2) Repeat:

(a) (Re)assign each object to the cluster it is most similar to, based on the mean of the objects in the cluster (i.e., the nearest cluster center).

(b) Update the cluster means: recompute the mean of the objects in each cluster.

Until the cluster assignments no longer change.
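
To make the pseudocode concrete, here is a minimal NumPy sketch; the function name, the random seeding, and the exact convergence check are illustrative choices, not the book's:

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-means following the pseudocode above."""
    rng = np.random.default_rng(seed)
    # (1) Arbitrarily choose k objects from D as the initial cluster centers.
    centers = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    labels = np.full(len(D), -1)
    for _ in range(max_iter):
        # (2a) Assign each object to the cluster with the nearest center.
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
        # (2b) Recompute each cluster center as the mean of its objects.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = D[labels == j].mean(axis=0)
    return labels, centers
```

Calling kmeans(X, k=3) on a 2-D NumPy array X returns the final cluster labels and centers.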

Hierarchical clustering methods

There are two kinds of hierarchical clustering methods: agglomerative and divisive.

Agglomerative clustering is bottom-up: initially each object forms its own cluster, and clusters are iteratively merged into larger and larger clusters until all objects belong to a single cluster or a termination condition is met.

Divisive hierarchical clustering uses a top-down strategy: it starts with all objects in a single cluster, splits this root cluster into smaller subclusters, and recursively splits those in turn until each object forms its own cluster or the lowest-level clusters are coherent enough.
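
For illustration, agglomerative clustering can be run with SciPy; a minimal sketch, assuming toy 2-D data and an average-linkage merge criterion (both illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Bottom-up (agglomerative): each point starts as its own cluster,
# and the two closest clusters are merged at every step.
Z = linkage(X, method="average")  # "average" is one of several merge criteria

# Cut the resulting tree (dendrogram) to obtain a flat 2-cluster result.
labels = fcluster(Z, t=2, criterion="maxclust")
```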

Density-based clustering methods

Unlike hierarchical and partitioning methods, which tend to find only spherical clusters, density-based methods can discover clusters of arbitrary shape.

DBSCAN

DBSCAN measures the density around an object by the number of objects in its ε-neighborhood: an object is a core object if its ε-neighborhood contains at least MinPts objects, and clusters are grown outward from core objects.
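
A minimal sketch of this density measurement (the function name and the core-object mask it returns are my own framing; eps and min_pts correspond to DBSCAN's ε and MinPts):

```python
import numpy as np

def core_objects(X, eps, min_pts):
    """Mark objects whose eps-neighborhood holds at least min_pts objects."""
    # Pairwise Euclidean distances between all objects.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood density = number of objects within distance eps
    # (the object itself is conventionally counted).
    density = (d <= eps).sum(axis=1)
    return density >= min_pts
```

scikit-learn's sklearn.cluster.DBSCAN implements the full algorithm on top of exactly this notion of neighborhood density.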

Grid-based clustering methods

These methods quantize the input space into a grid of cells that is independent of the distribution of the input objects.

  1. STING: a grid-based multi-resolution clustering technique.
  2. CLIQUE: a clustering method based on grids and density.
STING

STING is a grid-based multi-resolution clustering technique: the spatial area of the input objects is divided into rectangular cells at several levels of resolution. Statistical parameters of the attributes in each grid cell (such as the mean and maximum) are precomputed and stored.
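
A minimal sketch of this precomputation step, assuming toy 2-D data in the unit square and a single-resolution 4x4 grid (real STING maintains a hierarchy of levels, which is omitted here):

```python
import numpy as np

# Hypothetical 2-D spatial data; the grid resolution is an illustrative choice.
rng = np.random.default_rng(0)
pts = rng.random((1000, 2))
cells_per_axis = 4

# Assign each point to a rectangular grid cell.
ix = np.minimum((pts[:, 0] * cells_per_axis).astype(int), cells_per_axis - 1)
iy = np.minimum((pts[:, 1] * cells_per_axis).astype(int), cells_per_axis - 1)
cell = ix * cells_per_axis + iy

# Precompute and store per-cell statistical parameters (count, mean, max).
stats = {}
for c in range(cells_per_axis ** 2):
    members = pts[cell == c]
    if len(members):
        stats[c] = {"count": len(members),
                    "mean": members.mean(axis=0),
                    "max": members.max(axis=0)}
```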

Advantages:

  1. The grid-based computation is query-independent: the statistics stored in each cell summarize the data regardless of the query.
  2. The grid structure facilitates parallel processing and incremental updating.
  3. It is highly efficient.
CLIQUE

When the attribute values of data objects vary widely, it can be difficult to find clusters in the full data space. In such cases, it is often more meaningful to search for clusters in subspaces of the data.

CLIQUE is a simple grid-based clustering method for discovering density-based clusters in subspaces. It partitions each dimension into non-overlapping intervals, thereby partitioning the entire embedding space of the data objects into cells, and uses a density threshold to separate dense cells from sparse ones.

Monotonicity: a k-dimensional cell c can contain at least l points only if every (k-1)-dimensional projection of c contains at least l points. This property lets CLIQUE find dense cells level by level, in the style of the Apriori algorithm.

The algorithm proceeds as follows:

  1. CLIQUE partitions the d-dimensional data space into non-overlapping rectangular cells and identifies the dense cells among them: each dimension is split into intervals, and the intervals containing at least l points are kept, where l is the density threshold (see the sketch after this list).
  2. Dense cells are iteratively joined across subspaces: joining dense k-dimensional cells produces candidate (k+1)-dimensional cells, and each candidate cell c is checked against the density threshold.
  3. Maximal regions are used to cover the connected dense cells; a greedy algorithm is used.
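
A minimal sketch of step 1 for a single dimension, assuming equal-width intervals (the function name and the binning choice are illustrative; the join and covering steps are not shown):

```python
import numpy as np

def dense_intervals(values, n_intervals, l):
    """CLIQUE step 1 for one dimension: split the dimension into
    equal-width intervals and keep those holding at least l points."""
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    counts, _ = np.histogram(values, bins=edges)
    # Return (low, high) bounds of intervals meeting the density threshold.
    return [(edges[i], edges[i + 1])
            for i in range(n_intervals) if counts[i] >= l]
```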

Cluster evaluation
  1. Assessing clustering tendency.
  2. Determining the number of clusters in a dataset.
  3. Measuring clustering quality.
The k-means++ algorithm

The main difference between k-means++ and k-means is how the initial centers are chosen.

k-means++ first selects one center at random. It then chooses the remaining centers iteratively: a new center is drawn from the objects p with probability proportional to dist(p)^2, where dist(p) is the distance from p to the nearest center already selected. This seeding accelerates the convergence of k-means and helps guarantee the quality of the final clustering result.

The intuition is that sampling proportional to squared distance tends to spread the initial cluster centers as far apart from each other as possible.

  1. Randomly select one point from the input dataset as the first cluster center.
  2. For each point x in the dataset, compute D(x), the distance from x to the nearest cluster center already chosen.
  3. Select a new data point as the next cluster center, where points with larger D(x) are more likely to be chosen (with probability proportional to D(x)^2).
  4. Repeat steps 2 and 3 until k cluster centers have been selected.
  5. Run the standard k-means algorithm from these k initial centers (a sketch of the seeding follows).
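
A minimal sketch of this seeding procedure, assuming Euclidean distance (the function name is illustrative; the resulting centers would then seed a standard k-means run such as the one sketched earlier):

```python
import numpy as np

def kmeans_pp_centers(X, k, seed=0):
    """k-means++ seeding: pick centers with probability ~ D(x)^2."""
    rng = np.random.default_rng(seed)
    # Step 1: the first center is chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: D(x)^2 = squared distance to the nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Step 3: sample the next center proportional to D(x)^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)  # Step 5: feed these into standard k-means.
```

Note that a point already chosen as a center has D(x) = 0, so it can never be selected twice.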
