Data Mining: Concepts and Techniques, Chapter 10: Clustering Study Notes


Partitioning-based clustering methods

Given a collection of n objects, a partitioning method divides the objects into k clusters, where each cluster contains at least one object.

k-means pseudocode

Input: k: the number of clusters

D: a dataset containing n objects

Output: a set of k clusters

Method:

(1) Arbitrarily choose k objects from D as the initial cluster centers.

(2) Repeat:

(a) (Re)assign each object to the cluster it is most similar to, based on the mean of the objects in the cluster (i.e., the nearest cluster center).

(b) Update the cluster means: recompute the mean of the objects in each cluster.

Until the cluster assignments no longer change.
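
To make the pseudocode concrete, here is a minimal NumPy sketch; the function name, the random seeding, and the exact convergence check are illustrative choices, not the book's:

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-means following the pseudocode above."""
    rng = np.random.default_rng(seed)
    # (1) Arbitrarily choose k objects from D as the initial cluster centers.
    centers = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    labels = np.full(len(D), -1)
    for _ in range(max_iter):
        # (2a) Assign each object to the cluster with the nearest center.
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
        # (2b) Recompute each cluster center as the mean of its objects.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = D[labels == j].mean(axis=0)
    return labels, centers
```

Calling kmeans(X, k=3) on a 2-D NumPy array X returns the final cluster labels and centers.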

Hierarchical clustering methods

There are two kinds of hierarchical clustering methods: agglomerative and divisive.

Agglomerative clustering is bottom-up: initially each object forms its own cluster, and clusters are iteratively merged into larger and larger clusters until all objects belong to a single cluster or a termination condition is met.

Divisive hierarchical clustering uses a top-down strategy: it starts with all objects in a single cluster, splits this root cluster into smaller subclusters, and recursively splits those in turn until each object forms its own cluster or the lowest-level clusters are coherent enough.
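
For illustration, agglomerative clustering can be run with SciPy; a minimal sketch, assuming toy 2-D data and an average-linkage merge criterion (both illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Bottom-up (agglomerative): each point starts as its own cluster,
# and the two closest clusters are merged at every step.
Z = linkage(X, method="average")  # "average" is one of several merge criteria

# Cut the resulting tree (dendrogram) to obtain a flat 2-cluster result.
labels = fcluster(Z, t=2, criterion="maxclust")
```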

Density-based clustering methods

Unlike hierarchical and partitioning methods, which tend to find only spherical clusters, density-based methods can discover clusters of arbitrary shape.

DBSCAN

DBSCAN measures the density around an object by the number of objects in its ε-neighborhood: an object is a core object if its ε-neighborhood contains at least MinPts objects, and clusters are grown outward from core objects.
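
A minimal sketch of this density measurement (the function name and the core-object mask it returns are my own framing; eps and min_pts correspond to DBSCAN's ε and MinPts):

```python
import numpy as np

def core_objects(X, eps, min_pts):
    """Mark objects whose eps-neighborhood holds at least min_pts objects."""
    # Pairwise Euclidean distances between all objects.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood density = number of objects within distance eps
    # (the object itself is conventionally counted).
    density = (d <= eps).sum(axis=1)
    return density >= min_pts
```

scikit-learn's sklearn.cluster.DBSCAN implements the full algorithm on top of exactly this notion of neighborhood density.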

Grid-based clustering methods

These methods quantize the input space into a grid of cells that is independent of the distribution of the input objects.

  1. STING: a grid-based multi-resolution clustering technique.
  2. CLIQUE: a clustering method based on grids and density.
STING

STING is a grid-based multi-resolution clustering technique: the spatial area of the input objects is divided into rectangular cells at several levels of resolution. Statistical parameters of the attributes in each grid cell (such as the mean and maximum) are precomputed and stored.
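
A minimal sketch of this precomputation step, assuming toy 2-D data in the unit square and a single-resolution 4x4 grid (real STING maintains a hierarchy of levels, which is omitted here):

```python
import numpy as np

# Hypothetical 2-D spatial data; the grid resolution is an illustrative choice.
rng = np.random.default_rng(0)
pts = rng.random((1000, 2))
cells_per_axis = 4

# Assign each point to a rectangular grid cell.
ix = np.minimum((pts[:, 0] * cells_per_axis).astype(int), cells_per_axis - 1)
iy = np.minimum((pts[:, 1] * cells_per_axis).astype(int), cells_per_axis - 1)
cell = ix * cells_per_axis + iy

# Precompute and store per-cell statistical parameters (count, mean, max).
stats = {}
for c in range(cells_per_axis ** 2):
    members = pts[cell == c]
    if len(members):
        stats[c] = {"count": len(members),
                    "mean": members.mean(axis=0),
                    "max": members.max(axis=0)}
```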

Advantages:

  1. The grid-based computation is query-independent: the statistics stored in each cell summarize the data regardless of the query.
  2. The grid structure facilitates parallel processing and incremental updating.
  3. It is highly efficient.
CLIQUE

When the attribute values of data objects vary widely, it can be difficult to find clusters in the full data space. In such cases, it is often more meaningful to search for clusters in subspaces of the data.

CLIQUE is a simple grid-based clustering method for discovering density-based clusters in subspaces. It partitions each dimension into non-overlapping intervals, thereby partitioning the entire embedding space of the data objects into cells, and uses a density threshold to separate dense cells from sparse ones.

Monotonicity: a k-dimensional cell c can contain at least l points only if every (k-1)-dimensional projection of c contains at least l points. This property lets CLIQUE find dense cells level by level, in the style of the Apriori algorithm.

The algorithm proceeds as follows:

  1. CLIQUE partitions the d-dimensional data space into non-overlapping rectangular cells and identifies the dense cells among them: each dimension is split into intervals, and the intervals containing at least l points are kept, where l is the density threshold (see the sketch after this list).
  2. Dense cells are iteratively joined across subspaces: joining dense k-dimensional cells produces candidate (k+1)-dimensional cells, and each candidate cell c is checked against the density threshold.
  3. Maximal regions are used to cover the connected dense cells; a greedy algorithm is used.
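
A minimal sketch of step 1 for a single dimension, assuming equal-width intervals (the function name and the binning choice are illustrative; the join and covering steps are not shown):

```python
import numpy as np

def dense_intervals(values, n_intervals, l):
    """CLIQUE step 1 for one dimension: split the dimension into
    equal-width intervals and keep those holding at least l points."""
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    counts, _ = np.histogram(values, bins=edges)
    # Return (low, high) bounds of intervals meeting the density threshold.
    return [(edges[i], edges[i + 1])
            for i in range(n_intervals) if counts[i] >= l]
```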

Cluster evaluation
  1. Assessing clustering tendency.
  2. Determining the number of clusters in a dataset.
  3. Measuring clustering quality.
The k-means++ algorithm

The main difference between k-means++ and k-means is how the initial centers are chosen.

k-means++ first selects one center at random. It then chooses the remaining centers iteratively: a new center is drawn from the objects p with probability proportional to dist(p)^2, where dist(p) is the distance from p to the nearest center already selected. This seeding accelerates the convergence of k-means and helps guarantee the quality of the final clustering result.

The intuition is that sampling proportional to squared distance tends to spread the initial cluster centers as far apart from each other as possible.

  1. Randomly select one point from the input dataset as the first cluster center.
  2. For each point x in the dataset, compute D(x), the distance from x to the nearest cluster center already chosen.
  3. Select a new data point as the next cluster center, where points with larger D(x) are more likely to be chosen (with probability proportional to D(x)^2).
  4. Repeat steps 2 and 3 until k cluster centers have been selected.
  5. Run the standard k-means algorithm from these k initial centers (a sketch of the seeding follows).
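
A minimal sketch of this seeding procedure, assuming Euclidean distance (the function name is illustrative; the resulting centers would then seed a standard k-means run such as the one sketched earlier):

```python
import numpy as np

def kmeans_pp_centers(X, k, seed=0):
    """k-means++ seeding: pick centers with probability ~ D(x)^2."""
    rng = np.random.default_rng(seed)
    # Step 1: the first center is chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: D(x)^2 = squared distance to the nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Step 3: sample the next center proportional to D(x)^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)  # Step 5: feed these into standard k-means.
```

Note that a point already chosen as a center has D(x) = 0, so it can never be selected twice.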
