Clustering Concepts and Algorithms

Source: Internet
Author: User
As the saying goes, "birds of a feather flock together." The natural and social sciences are full of classification problems. A class, in everyday terms, is a collection of similar elements. Cluster analysis, also called group analysis, is a statistical method for studying classification problems (of samples or of indicators).

Cluster analysis originated in taxonomy. In early taxonomy, people classified mainly by experience and expert knowledge, rarely using mathematical tools for quantitative classification. As science and technology developed, the demands on classification grew so high that experience and expertise alone were often no longer sufficient. Mathematical tools were therefore gradually brought into taxonomy, forming numerical taxonomy; techniques from multivariate analysis were then introduced into numerical taxonomy, forming cluster analysis. Cluster analysis is a rich field, covering systematic (hierarchical) clustering, ordered-sample clustering, dynamic clustering, fuzzy clustering, graph-theoretic clustering, clustering prediction, and more.

The main families of clustering methods are as follows:

1. Partitioning methods: Given a dataset of n tuples or records, a partitioning method constructs k groupings, each representing a cluster, with k < n. These k groupings satisfy two conditions: (1) each grouping contains at least one record; (2) each record belongs to exactly one grouping (a requirement that some fuzzy clustering algorithms relax). For a given k, the algorithm first produces an initial grouping and then iteratively reassigns records so that each new grouping is better than the previous one. The criterion for "better" is that records in the same group should be as close together as possible, while records in different groups should be as far apart as possible. Representative algorithms: k-means, k-medoids, CLARANS.
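
The iterative "assign, then improve" idea behind partitioning methods can be sketched as a minimal k-means loop. This is a toy illustration in plain Python, not a production implementation; the `kmeans` helper, the fixed iteration count, and the 2-D points are assumptions for demonstration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means on 2-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster,
    repeating for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # initial grouping: k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each record joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: recompute each centroid, improving the grouping.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

Two well-separated groups of three points each end up in separate clusters regardless of which points seed the centroids, because each iteration only ever lowers the within-cluster distances.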

2. Hierarchical methods: These methods decompose the given dataset hierarchically until some condition is satisfied, following either a "bottom-up" (agglomerative) or "top-down" (divisive) scheme. In the bottom-up scheme, for example, each record initially forms its own group; each iteration then merges the closest neighbouring groups into one, until all records end up in a single group or some condition is satisfied. Representative algorithms: BIRCH, CURE, Chameleon.
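
The bottom-up scheme can be sketched as a single-linkage agglomerative loop: start with one cluster per point and repeatedly merge the two closest clusters. The `single_linkage` helper and the stopping condition (a target cluster count) are illustrative assumptions, and the O(n^3) search is far simpler than what BIRCH or CURE actually do:

```python
def single_linkage(points, target_k):
    """Agglomerative clustering sketch: every point starts as its own
    cluster; the two closest clusters (single linkage) are merged
    until target_k clusters remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # Single linkage: squared distance between the closest pair of members.
        return min((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
                   for x in a for y in b)

    while len(clusters) > target_k:
        # Find and merge the closest pair of clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

clusters = single_linkage([(0, 0), (0, 1), (5, 5), (5, 6), (0.5, 0.5)],
                          target_k=2)
```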

3. Density-based methods: The fundamental difference between density-based methods and the other approaches is that they are based on density rather than on distances of various kinds. This overcomes a shortcoming of distance-based algorithms, which can only discover roughly "spherical" clusters. The idea is that as long as the density of points in a region exceeds a certain threshold, those points are added to the nearby cluster. Representative algorithms: DBSCAN, OPTICS, DENCLUE.
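
The density threshold idea can be sketched as a minimal DBSCAN: a point with at least `min_pts` neighbours within radius `eps` is a core point, clusters grow by expanding from core points, and unreachable points are labelled noise. This is a didactic sketch (linear-scan neighbour search, 2-D points), not an efficient implementation:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: labels[i] is the cluster id of point i,
    or -1 for noise."""
    def neighbours(i):
        # All points (including i itself) within eps of point i.
        return [j for j in range(len(points))
                if (points[i][0] - points[j][0]) ** 2 +
                   (points[i][1] - points[j][1]) ** 2 <= eps * eps]

    labels = [None] * len(points)       # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # tentatively noise
            continue
        cluster += 1                    # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:      # j is also a core point: keep growing
                queue.extend(jn)
    return labels

labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)],
                eps=1.5, min_pts=3)
```

The four mutually close points form one cluster and the distant point is marked as noise, without any assumption about the cluster's shape.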

4. Grid-based methods: These methods first quantise the data space into a finite number of cells, forming a grid structure, and all processing is done on individual cells. An outstanding advantage of this approach is its speed: processing time is usually independent of the number of records in the target database and depends only on how many cells the data space is divided into. Representative algorithms: STING, CLIQUE, WaveCluster.
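
The cell-level processing can be sketched as follows: bin the points into square cells, keep cells holding at least `min_density` points, then join neighbouring dense cells into clusters with a flood fill over the grid. The `grid_clusters` helper and its two parameters are illustrative assumptions; real grid methods such as STING add hierarchical statistics per cell:

```python
from collections import defaultdict

def grid_clusters(points, cell_size, min_density):
    """Grid-based clustering sketch: once points are binned, all work is
    on cells, so cost depends on the number of cells, not of records."""
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell_size), int(p[1] // cell_size))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= min_density}
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        # Flood fill across the 8-neighbourhood of dense cells.
        group, stack = [], [cell]
        seen.add(cell)
        while stack:
            cx, cy = stack.pop()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(group)
    return clusters

pts = [(0, 0), (0.5, 0.5), (1, 1), (4, 4), (10, 10), (10.5, 10.5)]
clusters = grid_clusters(pts, cell_size=2, min_density=2)
```

The lone point at (4, 4) sits in a cell below the density threshold and is dropped; the two dense, non-adjacent cells yield two clusters.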

5. Model-based methods: A model-based method hypothesises a model for each cluster and then searches for the data that best fit that model. Such a model might be a density distribution function of data points in space, or something else; the underlying assumption is that the target dataset is generated by a mixture of probability distributions. Two main approaches are used in practice: statistical methods and neural-network methods.
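
The statistical approach can be sketched with expectation-maximisation for a two-component, one-dimensional Gaussian mixture: the E-step computes each point's responsibility under each component, and the M-step re-estimates the weights, means, and variances from those responsibilities. The `em_gmm_1d` helper, its crude initialisation, and the fixed iteration count are assumptions for demonstration only:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM sketch for a 2-component 1-D Gaussian mixture model.
    Returns the mixture weights, means, and variances."""
    # Crude initialisation (an assumption): split the sorted data in half.
    data = sorted(data)
    half = len(data) // 2
    mu = [sum(data[:half]) / half, sum(data[half:]) / (len(data) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        # Gaussian density with mean m and variance v.
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility r[i][k] of component k for point i.
        r = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            r.append([pk / s for pk in p])
        # M-step: re-estimate each component's parameters.
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            w[k] = nk / len(data)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
            var[k] = max(sum(ri[k] * (x - mu[k]) ** 2
                             for ri, x in zip(r, data)) / nk, 1e-6)
    return w, mu, var

w, mu, var = em_gmm_1d([-0.5, 0, 0.5, 9.5, 10, 10.5])
```

On data drawn from two well-separated groups, the fitted means settle near the group centres, and the responsibilities serve as a soft cluster assignment.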
