5 big clustering algorithms that data scientists need to know

Source: Internet
Author: User

Clustering is a machine learning technique that groups data points. Given a set of data points, a clustering algorithm assigns each point to a specific group. In theory, points in the same group should have similar properties or features, while points in different groups should have highly dissimilar ones. Clustering is a form of unsupervised learning and a common technique for statistical data analysis in many fields. This article introduces five common clustering algorithms.


K-means Algorithm


K-means is probably the best-known clustering algorithm. It is easy to understand and easy to implement in code.

K-means Clustering


1. First we choose the number of classes or groups to use and randomly initialize their respective center points. To decide how many classes to use, it helps to take a quick look at the data and try to identify any distinct groupings. Each center point is a vector of the same length as a data point vector; the centers are the points labeled "X" in the image above.


2. Each data point is classified by computing its distance to every group center and assigning it to the group whose center is closest.


3. Based on these assignments, each group center is recomputed as the mean of all the vectors in the group.


4. Repeat steps 2 and 3 for a set number of iterations, or until the group centers change very little between iterations. You can also randomly initialize the group centers several times and keep the run that gives the best results.
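
The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name kmeans, the parameters iterations, tol and seed, and the sample points are all assumptions made for this example. It also assumes the initial centers are drawn from the data points themselves, which is one common way to do the random initialization.

```python
import math
import random


def kmeans(points, k, iterations=100, tol=1e-9, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 2: assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 3: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # An empty group keeps its previous center.
                new_centers.append(centers[j])
        # Step 4: stop once the centers have effectively stopped moving.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, labels


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(points, k=2)
print(sorted(centers))
print(labels)
```

Changing seed changes the initial centers. On data that is less cleanly separated than this sample, different seeds can lead to different final groupings.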


Because all we do is compute distances between points and group centers, K-means needs very little computation. It is therefore fast, with complexity that is linear in the number of points, O(n).


The disadvantage of K-means is that you must choose the number of groups or classes yourself. This is not always trivial, and ideally a clustering algorithm would work it out for us, because the whole purpose is to gain insight from the data. In addition, K-means starts from a random choice of cluster centers, so different runs may produce different clustering results. The results may therefore not be repeatable, whereas other clustering methods are more consistent.


K-medians is another clustering algorithm related to K-means. Instead of recomputing each group center as the mean, it uses the median vector of the group. This makes it less sensitive to outliers, but it runs much slower on large datasets, because computing the median requires sorting on every iteration.
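
A small sketch can make the outlier argument concrete. It assumes that the "median vector" is taken coordinate by coordinate, which is one common reading; the helper name center and the sample group are hypothetical.

```python
from statistics import mean, median


def center(points, reducer):
    """Collapse a group of points to one center, coordinate by coordinate."""
    return tuple(reducer(coords) for coords in zip(*points))


# A hypothetical group with one extreme outlier at (100, 100).
group = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (2.0, 2.0), (100.0, 100.0)]

mean_center = center(group, mean)      # the kind of center K-means uses
median_center = center(group, median)  # the kind of center K-medians uses

print(mean_center)
print(median_center)
```

The mean is dragged far toward the outlier, while the median stays among the four points that sit close together.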


Mean-shift Clustering Algorithm


The Mean-shift clustering algorithm is based on sliding windows and attempts to find dense areas of data points. It is a centroid-based algorithm, which means that its goal is to locate the center point of each group (class). It works by updating candidate center points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding groups.


A single sliding window for Mean-shift clustering algorithm


1. As shown in the figure above, consider a set of points in two-dimensional space. We begin with a circular sliding window centered at a randomly selected point C, with radius r as the kernel. Mean-shift is a hill-climbing algorithm that iteratively shifts this kernel to a region of higher density at each step until it converges.


2. At each iteration, the sliding window is shifted toward a region of higher density by moving its center point to the mean of the points within the window. The density within the sliding window is proportional to the number of points inside it.


3. We continue to shift the sliding window according to the mean until there is no direction in which a shift can accommodate more points inside the window. As shown above, we keep moving the circle until the number of points in the window (the density) no longer increases.


4. Steps 1 to 3 are carried out with many sliding windows until all points lie within a window. When multiple sliding windows overlap, the window containing the most points is kept. The data points are then clustered according to the sliding window in which they lie.
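
The procedure above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a definitive implementation: it uses a flat circular window, it starts one window on every data point instead of choosing random start points, and it treats two windows as overlapping when their centers are within one radius of each other. The names mean_shift, radius, max_iter and tol are hypothetical.

```python
import math


def mean_shift(points, radius, max_iter=100, tol=1e-6):
    """Flat-window mean shift; returns (centers, labels)."""

    def neighbors(center):
        return [p for p in points if math.dist(p, center) <= radius]

    # Steps 1-3: start one window on every data point and climb uphill.
    modes = []
    for start in points:
        center = start
        for _ in range(max_iter):
            inside = neighbors(center)
            new_center = tuple(sum(c) / len(inside) for c in zip(*inside))
            moved = math.dist(center, new_center)
            center = new_center
            if moved < tol:
                break
        modes.append((len(neighbors(center)), center))

    # Step 4: among overlapping windows keep the one holding the most points.
    centers = []
    for _, center in sorted(modes, reverse=True):
        if all(math.dist(center, kept) > radius for kept in centers):
            centers.append(center)

    # Each point joins the surviving window whose center is nearest.
    labels = [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]
    return centers, labels


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = mean_shift(points, radius=2.0)
print(centers)
print(labels)
```

Note that the number of clusters is never passed in. It falls out of how many distinct windows survive, and it depends on the chosen radius.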


The following figure shows the movement of all the sliding windows from start to finish. Each black dot represents the centroid of a sliding window, and each gray dot represents a data point.


Mean-shift algorithm Process


Compared with K-means clustering, Mean-shift does not require us to choose the number of clusters, because the algorithm discovers it automatically. This is a big advantage. The fact that the cluster centers converge toward the points of maximum density is also desirable. The disadvantage is that choosing the window radius r can be non-trivial.


Density-Based Spatial Clustering of Applications with Noise (DBSCAN)


DBSCAN is a density-based clustering algorithm that is similar to Mean-shift but has some significant advantages.


1. DBSCAN starts from an arbitrary data point that has not been visited. The neighborhood of this point is extracted using a distance ε (all points within distance ε are neighborhood points).


2. If there are enough points in this neighborhood (at least minPoints), the clustering process begins and the current data point becomes the first point in the new cluster. Otherwise, the point is labeled as noise (later, this noise point may still become part of a cluster). In both cases, the point is marked as "visited".


3. For this first point in the new cluster, the points within its ε neighborhood also become part of the same cluster. This procedure of making all points in the ε neighborhood belong to the same cluster is then repeated for every new point that has just been added to the cluster.


4. Steps 2 and 3 are repeated until all points in the cluster have been determined, that is, until every point within the ε neighborhood of the cluster has been visited and labeled.


5. Once we have finished with the current cluster, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise. This process repeats until all points have been marked as visited. At the end, every point has been labeled as either belonging to a cluster or being noise.
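
The five steps can be sketched as follows. This is a minimal illustration under a few assumptions: the neighborhood count includes the point itself, and, following the usual formulation of DBSCAN, a newly added point only passes the cluster on to its own neighbors if it also has at least min_points neighbors. The names dbscan, eps, min_points and NOISE are chosen for this example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later be claimed by a cluster
            continue
        # Point i has a dense neighborhood: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise point joins the cluster
            if labels[j] is not None:
                continue  # already handled, do not expand again
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is dense too, keep growing
    return labels


# Hypothetical sample data: a square, a triangle and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)
```

The isolated point at (50, 50) has no neighbors within eps, so it ends up labeled as noise instead of being forced into a cluster.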


DBSCAN has several advantages over other clustering algorithms. First, it does not require a preset number of clusters. Second, it identifies outliers as noise, unlike Mean-shift, which simply puts them into a cluster even if they are very different from the other data points. In addition, it can find clusters of arbitrary size and arbitrary shape.


The main disadvantage of DBSCAN is that it performs relatively poorly when the clusters have varying density. This is because the distance threshold ε and minPoints used to identify neighboring points would need to vary from cluster to cluster when the density varies. The same drawback appears with high-dimensional data, where the distance threshold ε again becomes difficult to estimate.


Expectation-Maximization (EM) Clustering Using Gaussian Mixture Models (GMM)


One of the main drawbacks of K-means is its naive use of the mean as the cluster center. The figure below shows why this is not the best approach. On the left, the human eye can see quite clearly that there are two circular clusters with different radii centered at the same point. K-means cannot handle this case because the means of the two clusters are very close together. K-means also fails on the image on the right, again because it uses the mean as the cluster center.


Two cases that the K-means clustering algorithm cannot handle
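
The concentric case can be checked numerically with a short sketch. The helper names ring and mean_point and the radii 1 and 5 are assumptions chosen for illustration; the original figure is not reproduced here.

```python
import math


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius around (0, 0)."""
    return [
        (
            radius * math.cos(2 * math.pi * i / n),
            radius * math.sin(2 * math.pi * i / n),
        )
        for i in range(n)
    ]


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


inner = ring(1.0, 100)
outer = ring(5.0, 100)

# Both means sit at (almost exactly) the origin, so a center based on
# the mean cannot tell the two rings apart.
print(mean_point(inner))
print(mean_point(outer))
```

Because the two centers coincide, assigning points to the nearest center gives K-means no way to separate the inner ring from the outer one.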


Gaussian mixture models (GMMs) give us more flexibility than K-means. With GMMs we assume that the data points are Gaussian distributed, so we have two parameters to describe the shape of each cluster: the mean and the standard deviation. Taking two dimensions as an example, this means that clusters can take any kind of elliptical shape (because there is a standard deviation in both the x and y directions). Each Gaussian distribution is therefore assigned to a single cluster.


To find the Gaussian parameters (mean and standard deviation) for each cluster, we use an optimization algorithm called Expectation-Maximization (EM).



1. First we select the number of clusters (as with K-means) and randomly initialize the Gaussian distribution parameters for each cluster. A quick look at the data can also provide a good initial guess for these parameters.


2. Given these Gaussian distributions for each cluster, we compute the probability that each data point belongs to a particular cluster. The closer a point is to a Gaussian's center, the more likely it is to belong to that cluster. This is intuitive, because with a Gaussian distribution we are assuming that most of the data lies close to the center of the cluster.


3. Based on these probabilities, we compute a new set of Gaussian distribution parameters that maximizes the probability of the data points within the clusters. The new parameters are computed as a weighted sum of the data point positions, where the weights are the probabilities that each data point belongs to the particular cluster.


4. Steps 2 and 3 are repeated until convergence, when the distributions change very little from one iteration to the next.
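
To keep the arithmetic visible, the sketch below runs these steps on one-dimensional data, so each cluster is described by a single mean and standard deviation. Several things here are assumptions rather than part of the description above: the starting means and standard deviations are passed in by the caller instead of being drawn at random, each cluster also gets a mixing weight, a fixed number of iterations replaces a convergence check, and min_std is a small guard against a cluster collapsing onto one point. The function names are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def em_gmm_1d(data, means, stds, iterations=50, min_std=1e-3):
    """Fit a 1-D Gaussian mixture, starting from the given means and stds."""
    means, stds = list(means), list(stds)  # do not modify the caller's lists
    k = len(means)
    weights = [1.0 / k] * k  # mixing weights (an assumption of this sketch)
    resp = []
    for _ in range(iterations):
        # Step 2: probability that each point belongs to each cluster.
        resp = []
        for x in data:
            scores = [weights[j] * normal_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # Step 3: re-estimate the parameters as probability-weighted averages.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), min_std)
            weights[j] = nj / len(data)
    return means, stds, weights, resp


# Hypothetical sample data: one group around 1 and one group around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, stds, weights, resp = em_gmm_1d(data, means=[0.0, 6.0], stds=[1.0, 1.0])
print(means)
print(stds)
print(weights)
```

The returned resp holds the membership probabilities from the last pass through step 2. The sketch does not handle a cluster that receives essentially zero probability from every point, which a more careful implementation would need to guard against.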


GMMs have two major advantages. First, GMMs are much more flexible than K-means in terms of cluster covariance. Thanks to the standard deviation parameter, clusters can take any elliptical shape rather than being restricted to circles. K-means is actually a special case of GMM in which each cluster's covariance along all dimensions approaches 0. Second, because GMMs use probabilities, a data point can belong to more than one cluster. So if a data point lies in the middle of two overlapping clusters, we can describe its class membership by saying that it belongs X percent to class 1 and Y percent to class 2.
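
The mixed membership described above can be illustrated with a small one-dimensional sketch. The two clusters, their (weight, mean, standard deviation) values and the function name membership are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def membership(x, components):
    """Probability that x belongs to each (weight, mean, std) component."""
    scores = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(scores)
    return [s / total for s in scores]


# Two hypothetical overlapping clusters with equal weight and equal spread.
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

halfway = membership(2.0, components)  # midway between the two means
nearer = membership(1.0, components)   # closer to the first cluster

print(halfway)
print(nearer)
```

A point midway between the two means is shared equally between the clusters, while a point nearer the first mean belongs mostly, but not entirely, to the first cluster.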


Agglomerative Hierarchical Clustering (AHC)


Hierarchical clustering algorithms fall into two main categories: top-down and bottom-up. Bottom-up algorithms first treat each data point as a single cluster and then successively merge (agglomerate) pairs of clusters until everything has been merged into a single cluster containing all the data points. Bottom-up hierarchical clustering is therefore called agglomerative hierarchical clustering, or AHC. The hierarchy of clusters is represented as a tree (a dendrogram), where the root of the tree is the unique cluster that gathers all the samples and the leaves are clusters containing only one sample. The process is illustrated below:



1. Each data point is first treated as a single cluster, so if the dataset contains X data points, we start with X clusters. We then choose a distance metric that measures the distance between two clusters. In this example we use average linkage, which defines the distance between two clusters as the average distance between the data points in the first cluster and the data points in the second cluster.


2. At each iteration, we merge two clusters into one. The two clusters chosen are the pair with the smallest average linkage. According to the distance metric we selected, these two clusters have the smallest distance between them, are therefore the most similar, and should be combined.


3. Step 2 is repeated until we reach the root of the tree, that is, until a single cluster contains all the data points. In this way we can choose how many clusters we want at the end simply by choosing when to stop merging clusters, that is, when to stop building the tree.
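
The merging procedure can be sketched directly from these steps. This is a deliberately simple illustration that recomputes every pairwise cluster distance at each merge, so it is only practical for small datasets. The function names average_linkage and agglomerative, the n_clusters stopping parameter and the sample points are assumptions made for this example.

```python
import math
from itertools import combinations


def average_linkage(a, b):
    """Mean distance over every pair with one point in a and one in b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    # Step 1: every point starts as its own cluster.
    clusters = [[p] for p in points]
    merges = []  # (distance, merged cluster) records, useful for drawing the tree
    while len(clusters) > n_clusters:
        # Step 2: find the pair of clusters with the smallest average linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = average_linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        merges.append((distance, merged))
        # Step 3: replace the pair with the merged cluster and repeat.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
```

Calling the function with n_clusters=1 builds the whole tree up to the root, and the merge distances recorded in merges show where a large jump suggests a natural place to stop.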


Agglomerative hierarchical clustering does not require us to specify the number of clusters in advance, and we can even choose which number of clusters looks best because we are building a tree. In addition, the algorithm is not sensitive to the choice of distance metric, whereas for other clustering algorithms the choice of distance metric is critical.


The above is the translation.

This article is translated by Alibaba Cloud community organization.

Article original title "The 5 clustering algorithms Data scientists need to Know", translator: Mags, Revision: Roman.

