Cluster learning notes: k-means

k-means is one of the most common clustering methods in data mining; it originally came from the field of signal processing. Its goal is to divide the sample space into several sub-spaces such that the sample points in each sub-space have the smallest possible average distance to the center of that sub-space. In this sense, k-means is a center-based (prototype) clustering method.

The idea is easy to understand and intuitively convincing. Unfortunately, finding the optimal partition is an NP-hard problem.
First, a quick look at NP problems. NP stands for non-deterministic polynomial. Two concepts are hidden here: polynomial-time problems and non-deterministic problems. A problem that can be solved deterministically in time polynomial in the instance size N is called a P problem. Non-deterministic problems are harder to grasp: for some problems, such as addition, subtraction, multiplication, and division, we can derive the result directly through step-by-step calculation; for others, we can only arrive at a result by guessing and checking. For example, there is no formula that directly produces the next prime number, yet we can efficiently verify whether a given number is prime. A problem whose candidate solutions can be guessed and verified in polynomial time is a non-deterministic polynomial problem, i.e., an NP problem. Furthermore, since no direct algorithm is known for NP problems, researchers asked whether NP problems could be reduced to one another, and eventually identified problems to which every NP problem can be reduced: solve one of them and all NP problems are solved. These are the NP-complete (NPC) problems. Note that the known algorithms for NPC problems have exponential complexity, so the computation quickly becomes infeasible as the problem size grows. If a polynomial-time algorithm were found for an NPC problem, every NP problem could be solved in polynomial time and NP would equal P; the "NP = P?" conjecture is one of the Millennium Prize Problems, carrying a million-dollar reward. NP-hard problems likewise have no known polynomial-time algorithms, but they are not required to be NP problems themselves.

Knowing that the problem is NP-hard, we should not expect to compute the exact optimum for k-means. Fortunately, there are widely used heuristic algorithms that quickly and effectively find a locally optimal solution (a minimal sketch follows the steps below):
    1. Randomly pick K points as the seed points; each seed point represents a class and serves as the center of that class;
    2. Compute the distance from each sample point to every seed point and assign the point to the nearest class, dividing the data into K classes;
    3. Recompute the mean of each class and use it as the new position of that class's seed point;
    4. Repeat steps 2-3 until the seed points no longer move.
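Below is a minimal NumPy sketch of these four steps; the function name and toy data are illustrative only, not part of the original notes.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """A minimal sketch of the four steps above (illustrative, not a production implementation)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K sample points as the initial seed points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest seed point (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each class mean as the new seed point.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when the seed points no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
```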
The steps above show that the distance measure is the core factor affecting the algorithm, so we first discuss the choice of distance formula. The Minkowski distance is a common way to measure the distance between numerical points. For points $P = (x_1, x_2, \ldots, x_n)$ and $Q = (y_1, y_2, \ldots, y_n)$, the Minkowski distance is defined as $\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$. Different distances are obtained by adjusting the parameter $p$; the most common values are 1 and 2, giving the Manhattan distance and the Euclidean distance respectively. (The original figure, omitted here, shows a city grid where white blocks are tall buildings and gray lines are streets: a taxi ride between two corners must follow the streets, so its length is the Manhattan distance, while the green diagonal is the Euclidean distance and is not a drivable route.)
When $p$ tends to infinity, it becomes the Chebyshev distance: $\lim_{p \rightarrow \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max_{i=1}^{n} |x_i - y_i|$. Its physical meaning can be pictured with the king in chess: the Chebyshev distance is the minimum number of moves the king needs to travel from square A to square B. We all know that the set of points at Euclidean distance 1 from the origin in the plane is a circle; when $p$ takes other values, the unit "circle" takes on other shapes (the original figure illustrating this is omitted).
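A small sketch of the Minkowski family in plain NumPy (the helper function and example vectors are illustrative assumptions):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

print(minkowski(x, y, 1))       # p = 1: Manhattan distance, 7.0
print(minkowski(x, y, 2))       # p = 2: Euclidean distance, 5.0
print(np.max(np.abs(x - y)))    # limit p -> infinity: Chebyshev distance, 4.0
```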
For example, if you think of the origin as a seed point, different values of $p$ give different sets of nearest points, i.e., different shapes for the region covered by a class. If the x and y axes represent two features of the samples and the magnitude of the x feature is far greater than that of the y feature, the Minkowski distance will be dominated by feature x. This means we need to preprocess the data before clustering, or assign different weights to the features as needed; several common preprocessing methods are introduced later (a standardization sketch also follows the list below). The disadvantages of the Minkowski distance are also obvious:
    1. It assumes all features are on the same scale (same units/dimensions);
    2. It implicitly assumes all features follow the same distribution.
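A minimal standardization (z-score) sketch as one common preprocessing step; the optional per-feature weights are an illustrative extension, not something prescribed in the notes:

```python
import numpy as np

def standardize(X, weights=None):
    """Z-score each feature so their scales are comparable; optionally apply feature weights."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    if weights is not None:
        Z = Z * np.asarray(weights)   # emphasize or de-emphasize individual features
    return Z

# Feature 0 spans thousands, feature 1 spans single digits; raw distances would be
# dominated by feature 0, while after standardization both contribute comparably.
X = np.array([[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]])
print(standardize(X))
```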
The Mahalanobis distance measures the distance between a point P and a distribution D. The main idea is to measure how many standard deviations P lies from the mean of D. It has two advantages:
    1. It is independent of the scale (dimensions) of the features;
    2. It eliminates the interference caused by correlations between features.
Definition: given $m$ sample vectors $x_1, x_2, \ldots, x_m$ with covariance matrix $S$ and mean $\mu$, the Mahalanobis distance between a sample vector $x$ and $\mu$ is $D(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$, and the Mahalanobis distance between vectors $x_i$ and $x_j$ is defined as $D(x_i, x_j) = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}$. If the covariance matrix $S$ is the identity matrix (the dimensions are independently distributed with unit variance), the formula reduces to the Euclidean distance; if the covariance matrix is diagonal, it becomes the standardized Euclidean distance.

Cosine similarity measures the angle between two vectors in space. Compared with the preceding distance formulas, cosine similarity focuses on the difference in direction between vectors rather than their magnitude. For example, measuring $a = (3, 3)$ and $b = (5, 5)$ by cosine similarity, there is no difference between them.

The whole k-means method has now been introduced and we can run the algorithm, but whether it always converges, i.e., whether the stopping condition is always reached, needs further discussion. Define the distortion function $J(c, \mu) = \sum_{i=1}^{m} \| x_i - \mu_{c^{(i)}} \|^2$, the sum of squared distances from each sample point to its seed point. k-means tries to minimize $J$: first fix the seed points and adjust each sample's class assignment $c^{(i)}$ to make $J$ smaller, then fix the assignments and move the seed points to reduce $J$ further. This is a coordinate-descent optimization; since $J$ is bounded below the procedure converges, but because $J$ is non-convex, k-means in general only reaches a local optimum. In other words, k-means is sensitive to the initial choice of seed points. Usually a local optimum meets the requirements; if you are worried about it, run the algorithm with several different initializations and keep the result with the smallest $J$.

Note: the convergence analysis shows that k-means is sensitive to the initialization, which raises the question of how to choose an appropriate number and position of seed points. Consider the number K first. Below are several common methods:
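A small NumPy sketch of the Mahalanobis distance and cosine similarity (the correlated toy sample is an illustrative assumption):

```python
import numpy as np

def mahalanobis(x, y, S_inv):
    """Mahalanobis distance between x and y given the inverse covariance matrix S_inv."""
    d = x - y
    return np.sqrt(d @ S_inv @ d)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; sensitive to direction, not magnitude."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.random.randn(200, 2) @ np.array([[2.0, 0.5], [0.0, 1.0]])  # correlated toy sample
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis(X[0], X.mean(axis=0), S_inv))

a, b = np.array([3.0, 3.0]), np.array([5.0, 5.0])
print(cosine_similarity(a, b))   # 1.0: same direction, so indistinguishable by this measure
```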
    1. Combine with hierarchical clustering. A hierarchical clustering algorithm is first used to get a rough number of clusters and an initial clustering, and iterative relocation is then used to refine it;
    2. Stability method. Sample the dataset twice to generate two data subsets, run the same clustering algorithm on both, and compute the similarity between the two resulting cluster distributions. High similarity indicates that K clusters reflect a stable cluster structure; repeat the test until an appropriate K is found (a sketch of this idea follows the list);
    3. Canopy algorithm. Canopy first uses a simple, low-cost similarity measure to group similar objects into subsets called canopies. A series of such passes produces several canopies; canopies may overlap, but every object belongs to at least one canopy. This stage can be regarded as data preprocessing. A traditional clustering method (such as k-means) is then applied inside each canopy, and similarity is never computed between objects that do not share a canopy. This brings at least two advantages: first, if the canopies are not too large and do not overlap too much, the number of objects whose pairwise similarity must be computed drops sharply; second, a k-means-like method needs K specified manually, and the number of canopies obtained can serve as K, which reduces the blindness of choosing it to a certain extent.
    4. Bayesian information criterion.
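A rough sketch of the stability idea from item 2, using scikit-learn's KMeans and the adjusted Rand index as the similarity between the two clusterings; the library choice, the score, and the toy data are my assumptions, not named in the notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, k, frac=0.8, seed=0):
    """Cluster two random subsamples with the same k and compare their labels on the overlap."""
    rng = np.random.default_rng(seed)
    n = len(X)
    idx1 = rng.choice(n, size=int(frac * n), replace=False)
    idx2 = rng.choice(n, size=int(frac * n), replace=False)
    km1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx1])
    km2 = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X[idx2])
    common = np.intersect1d(idx1, idx2)
    # High agreement on the shared points suggests that k captures a stable structure.
    return adjusted_rand_score(km1.predict(X[common]), km2.predict(X[common]))

X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 0], [0, 6])])
for k in (2, 3, 4, 5):
    print(k, round(stability_score(X, k), 3))
```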
Seed point selection
    1. Select the initial positions randomly. Run the algorithm several times, compare the results, and keep the solution with the smallest J value;
    2. Select a random point, or the center of all points, as the first seed point. Then, for each subsequent seed point, choose the point farthest from all previously selected seed points. The initial points are then not only random but also well dispersed; however, this method can easily pick an outlier (a sketch follows the list).
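A sketch of the farthest-point seeding described in item 2; the function name and toy data are illustrative. (k-means++, mentioned below, softens this by sampling new seeds with probability proportional to squared distance instead of always taking the maximum.)

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick the first seed at random, then repeatedly add the point farthest from all chosen seeds."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen seed.
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2), axis=1)
        centers.append(X[np.argmax(d)])   # an outlier maximizes this, hence the caveat above
    return np.array(centers)

X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [8, 0], [0, 8])])
print(farthest_point_init(X, k=3))
```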
Applicability and defects. The k-means algorithm searches for clusters that minimize the mean-squared-error criterion function. When the underlying clusters are convex, clearly separated from one another, and similar in size, the clustering results are good. The algorithm is also efficient and scales well to large datasets. However, besides requiring the number of clusters K in advance and being sensitive to the initial cluster centers, the algorithm often ends at a local optimum, is sensitive to noise and isolated points, and is not suitable for finding non-convex clusters or clusters with very different sizes. Many researchers have proposed countermeasures for these defects.

k-means++ builds on the second seed-selection method above (it samples each new seed with probability proportional to its squared distance from the already chosen seeds, rather than always taking the farthest point). k-medoids (PAM, partitioning around medoids) addresses k-means' sensitivity to noise. k-means uses the mean of all samples in a class as the new seed point, so if the class contains an obvious outlier, the seed point deviates badly from where it should be: for example, if points A, B, and C lie close together but D is a distant outlier, D pulls the seed point away from the group, and in the next iteration many sample points that do not belong to the class are wrongly assigned to it. To solve this, k-medoids selects new seed points differently: 1) only actual sample points are candidates; 2) the candidate is chosen by a criterion that improves the clustering, e.g., the J function above, or another custom cost function. The trade-off is that k-medoids has higher computational complexity.

Gaussian mixture. The Gaussian mixture model (GMM) is in fact also a very popular clustering method; the details will be discussed later, but it is mentioned here because of its similarity to k-means. k-means assigns exactly one class to each sample point, which either belongs to the class or does not, so it is a hard clustering method. GMM can be seen as the soft-clustering version of k-means: each sample point belongs to each class with a certain probability. The basic idea is to maximize the likelihood function with the EM algorithm (analogous to minimizing J), but class memberships are kept as probability distributions instead of hard assignments, and Gaussian distributions replace the plain means (a small comparison sketch follows).
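A small hard-vs-soft comparison using scikit-learn; the library choice and toy data are mine, not from the notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])

# Hard clustering: each point gets exactly one label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability of belonging to each component,
# estimated with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)

print(hard_labels[:5])
print(soft_probs[:5].round(3))
```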
