K-means Clustering


4.1. Abstract

In the previous article, three common classification algorithms were introduced. Classification, as a supervised learning method, requires that the categories be clearly known in advance and that every item to be classified belong to one of them. However, these conditions are often not satisfied, especially when dealing with large amounts of data: preprocessing the data to meet the requirements of a classification algorithm can be very expensive, and in such cases a clustering algorithm is worth considering. Clustering is unsupervised learning; compared with classification, it does not rely on predefined classes or labeled training instances. This article first introduces the foundation of clustering, distance and dissimilarity, then introduces two common clustering algorithms, k-means and k-medoids, and finally gives an example: applying clustering to a controversial question in the sports world, namely which tier the Chinese men's soccer team has occupied in Asia in recent years.

4.2. Dissimilarity Calculation

Before formally discussing clustering, we need to settle one question: how can the difference between two comparable elements be quantified? In plain terms, dissimilarity measures how different two things are; for example, the difference between humans and octopuses is clearly greater than the difference between humans and chimpanzees, something we can feel intuitively. A computer, however, has no such intuition, so dissimilarity must be defined quantitatively in mathematical terms.

Let X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n} be two elements, each with n measurable feature attributes. The dissimilarity of X and Y is then defined as a mapping

$$d: (X, Y) \rightarrow \mathbb{R}$$

where $\mathbb{R}$ is the real field. That is, dissimilarity maps a pair of elements to a real number, and that real number quantitatively represents how different the two elements are.

The following describes how to calculate dissimilarity for different types of variables.

4.2.1. Scalar Variables

A scalar is a number with no directional meaning, also called a scale variable. First consider the case where all the feature attributes of an element are scalar. For example, compute the dissimilarity of x = {2, 1, 102} and y = {1, 3, 2}. A very natural idea is to use the Euclidean distance between the two as their dissimilarity. Euclidean distance is defined as:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Its meaning is the geometric distance between the two elements in Euclidean space; because it is easy to understand and interpret, it is widely used to measure the dissimilarity of two scalar elements. Substituting the two sample elements above into the formula, their Euclidean distance is:

$$d(x, y) = \sqrt{(2-1)^2 + (1-3)^2 + (102-2)^2} = \sqrt{10005} \approx 100.025$$

In addition to Euclidean distance, Manhattan distance and Minkowski distance are also commonly used to measure scalar dissimilarity; they are defined as follows:

Manhattan distance:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Minkowski distance:

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Euclidean distance and Manhattan distance can be seen as special cases of Minkowski distance with p = 2 and p = 1, respectively. All three distances also have weighted versions; this is easy to understand and will not be elaborated here.
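
As a minimal sketch (my own illustration, not code from the original article; the function names are assumptions of this example), the three distances can be computed as follows:

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length numeric sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean_distance(x, y):
    """Special case of the Minkowski distance with p = 2."""
    return minkowski_distance(x, y, 2)

def manhattan_distance(x, y):
    """Special case of the Minkowski distance with p = 1."""
    return minkowski_distance(x, y, 1)

x, y = [2, 1, 102], [1, 3, 2]
print(euclidean_distance(x, y))  # ~100.025, dominated by the third attribute
print(manhattan_distance(x, y))  # 103
```

Note how the third attribute dominates both distances, which is exactly the problem normalization addresses next.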

Now a word about scalar normalization. The dissimilarity computed above has a problem: an attribute with a large range of values has a greater effect on the distance than an attribute with a smaller range. In the example above, the third attribute spans a much wider range than the first two, which prevents the distance from reflecting the true overall difference. To solve this problem, the attribute values are generally normalized. Normalization maps every attribute value into the same range in order to balance the influence of each attribute on the distance. Each attribute is typically mapped to the [0, 1] interval with the formula:

$$a_i' = \frac{a_i - \min(a_i)}{\max(a_i) - \min(a_i)}$$

where max(a_i) and min(a_i) denote the maximum and minimum values of the i-th attribute over all elements. For example, after the elements in the example above are normalized to the [0, 1] interval, they become x' = {1, 0, 1} and y' = {0, 1, 0}, and the recomputed Euclidean distance is approximately 1.732.
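
A small sketch of this min-max normalization (again my own illustration, not code from the original):

```python
def min_max_normalize(rows):
    """Map each attribute (column) of a dataset to the [0, 1] interval.
    Assumes every column contains at least two distinct values."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

print(min_max_normalize([[2, 1, 102], [1, 3, 2]]))
# [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```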

4.2.2. Binary Variables

A binary variable is one that takes only the values 0 and 1, somewhat like a Boolean, and it is usually used to indicate whether an element has a certain property. For binary variables, the distances discussed in the previous section do not capture dissimilarity well, and a more suitable measure is needed. A common method is to measure dissimilarity by the proportion of attribute positions on which the two elements take different values.

Take x = {1, 0, 0, 0, 1, 0, 1, 1} and y = {0, 0, 0, 1, 1, 1, 1, 1}. The two elements have the same values in attributes 2, 3, 5, 7, and 8, and different values in attributes 1, 4, and 6, so the dissimilarity can be taken as 3/8 = 0.375. In general, for binary variables, dissimilarity can be measured as (number of attributes with different values) / (total number of attributes).

The dissimilarity above should be called symmetric binary dissimilarity. In reality there are situations where we only care about the attributes on which both elements take the value 1, and both elements taking 0 on an attribute does not make them more alike. For example, when clustering patients by their conditions, if two people both have lung cancer we consider this to strengthen their similarity, but if neither has lung cancer we do not feel it strengthens their similarity. In such cases, dissimilarity is instead measured as (number of attributes with different values) / (total number of attributes - number of attributes where both values are 0); this is called asymmetric binary dissimilarity. Subtracting the asymmetric binary dissimilarity from 1 gives the asymmetric binary similarity, better known as the Jaccard coefficient, a very important concept.
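
A minimal sketch of both measures (my own illustration), using the example vectors from above:

```python
def symmetric_binary_dissimilarity(x, y):
    """Fraction of attribute positions where the two binary vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def asymmetric_binary_dissimilarity(x, y):
    """Like the symmetric version, but positions where both values are 0
    are excluded from the denominator."""
    both_zero = sum(a == b == 0 for a, b in zip(x, y))
    return sum(a != b for a, b in zip(x, y)) / (len(x) - both_zero)

x = [1, 0, 0, 0, 1, 0, 1, 1]
y = [0, 0, 0, 1, 1, 1, 1, 1]
print(symmetric_binary_dissimilarity(x, y))       # 3/8 = 0.375
print(asymmetric_binary_dissimilarity(x, y))      # 3/6 = 0.5
print(1 - asymmetric_binary_dissimilarity(x, y))  # Jaccard coefficient, 0.5
```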

4.2.3. Categorical Variables

Categorical variables are a generalization of binary variables, similar to enumeration types in programs, except that each value carries no numeric or ordinal meaning; examples are color and ethnicity. For categorical variables, dissimilarity is measured as (number of attributes with different values) / (total number of attributes).

4.2.4. Ordinal Variables

Ordinal variables are categorical variables whose values do have an ordinal meaning and can usually be arranged in a meaningful order, such as champion, runner-up, and third place. For ordinal variables, the common approach is to assign a number to each value, called the rank of that value, and then compute dissimilarity on the ranks, treated as scalar attributes, instead of on the original values.
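
A minimal sketch of this rank conversion (the labels and mapping are assumptions of this illustration); a common convention is to map the rank r out of M ordered values to (r - 1)/(M - 1), so the result lands in [0, 1] and can be fed to the scalar distances above:

```python
# Hypothetical ordinal values, listed from best to worst.
ORDER = ["champion", "runner-up", "third place"]
RANK = {value: i for i, value in enumerate(ORDER)}  # 0-indexed rank, i.e. r - 1

def ordinal_to_scalar(value):
    """Map an ordinal value to [0, 1] so it can be treated as a scalar."""
    return RANK[value] / (len(ORDER) - 1)

print(ordinal_to_scalar("champion"))     # 0.0
print(ordinal_to_scalar("third place"))  # 1.0
```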

4.2.5. Vectors

Vectors have not only magnitude but also direction, so Minkowski distance is not a good way to measure their dissimilarity. A popular practice is to use the cosine measure of the two vectors, defined as:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

where ||x|| denotes the Euclidean norm of x. Note that the cosine measure is not a dissimilarity of the two vectors but a similarity!
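
A minimal sketch of the cosine measure (my own illustration):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: a similarity, not a dissimilarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0: same direction
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: orthogonal
```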

4.3. The Clustering Problem

After discussing the problem of dissimilarity calculation, we can formally define the clustering problem.

The clustering problem is this: given an element set D in which each element has n observable attributes, use an algorithm to partition D into k subsets such that the dissimilarity between elements within each subset is as low as possible, while elements in different subsets are as dissimilar as possible. Each of these subsets is called a cluster.

Clustering differs from classification. Classification is learning from examples: it requires the categories to be identified before classifying, and every training element to be mapped to a category. Clustering is learning by observation: before clustering, the categories, and even the number of categories, may be unknown, which makes clustering unsupervised learning. Clustering is now widely used in statistics, biology, database technology, marketing, and other fields, and there are correspondingly many algorithms. This article introduces only one of the simplest clustering algorithms, the k-means algorithm.

4.4. The K-means Algorithm and an Example

The computational process of the k-means algorithm is straightforward:

1. Randomly select k elements from D as the initial centers of the k clusters.

2. Compute the dissimilarity of each remaining element to each of the k cluster centers, and assign each element to the cluster with the lowest dissimilarity.

3. Based on the clustering result, recompute the center of each of the k clusters by taking the arithmetic mean of each dimension over all elements in the cluster.

4. Re-cluster all elements in D according to the new centers.

5. Repeat steps 3 and 4 until the clustering result no longer changes.

6. Output the result.
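
Here is a minimal, self-contained sketch of these steps (my own illustration, not code from the original article; the function name and signature are assumptions):

```python
import random

def k_means(points, k, max_iter=100, seed=None):
    """Basic k-means with Euclidean distance on equal-length numeric lists."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # step 1: random initial centers

    def nearest(p):
        # Squared Euclidean distance is monotonic in the true distance,
        # so it picks the same nearest center without the square root.
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p) for p in points]  # steps 2 and 4: assign
        if new_assignment == assignment:               # step 5: stop when stable
            break
        assignment = new_assignment
        for i in range(k):                             # step 3: recompute centers
            members = [p for p, c in zip(points, assignment) if c == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centers
```

Note that the example below does not pick the initial centers at random: it seeds them with the vectors of three specific teams.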

Since the algorithm is quite intuitive, there is little more to explain. Instead, let's look at an interesting application of the k-means algorithm: which tier has the Chinese men's soccer team really occupied in Asia in recent years?

This year Chinese soccer has been a complete tragedy, almost to the point where everyone is cursing it. As for the current standing of Chinese soccer in Asia, opinions are equally unyielding: some say the men's team is second-rate in Asia, some say third-rate, some say it doesn't even rate, and others say it is actually no worse than Japan and South Korea and is first-rate in Asia. Since arguing doesn't settle the question, let the data tell us the answer.

Below are the records I collected of 15 Asian teams in international tournaments from 2005 to 2010 (Australia joined the AFC only later, so it is not included).

These cover two World Cups and one Asian Cup. I preprocessed the data in advance: for a World Cup, a team that reached the finals takes its final ranking; a team that did not reach the finals but made the top-ten qualifying round is given 40; a team eliminated before the top ten is given 50. For the Asian Cup, the top four teams take their rankings; a team eliminated in the quarterfinals (top eight) is given 5; a team eliminated in the group stage (top sixteen) is given 9; a team that failed to qualify is given 17. This is done to make all the data scalar, convenient for the subsequent clustering.

The data are then normalized to the [0, 1] interval as described earlier; the normalized data are as follows:

The k-means algorithm is then applied. Set k = 3, which will divide the 15 teams into three groups.

Take the values of Japan, Bahrain, and Thailand as the seeds of the three clusters; that is, initialize the centers of the three clusters as A: {0.3, 0, 0.19}, B: {0.7, 0.76, 0.5}, and C: {1, 1, 0.5}. Next, compute the dissimilarity of every team to the three centers, measured by Euclidean distance. Here are the results:

Reading off each team's Euclidean distance to the current center points and assigning each team to its nearest cluster gives:

China C, Japan A, Korea A, Iran A, Saudi Arabia A, Iraq C, Qatar C, UAE C, Uzbekistan B, Thailand C, Vietnam C, Oman C, Bahrain B, North Korea B, Indonesia C.

First clustering result:

A: Japan, Korea, Iran, Saudi Arabia;

B: Uzbekistan, Bahrain, North Korea;

C: China, Iraq, Qatar, UAE, Thailand, Vietnam, Oman, Indonesia.

The center points of each cluster are adjusted according to the first clustering result.

The new center point for cluster A is: {(0.3 + 0 + 0.24 + 0.3)/4, (0 + 0.15 + 0.76 + 0.76)/4, (0.19 + 0.13 + 0.25 + 0.06)/4} = {0.21, 0.4175, 0.1575}.

In the same way, the new center points for clusters B and C are {0.7, 0.7333, 0.4167} and {1, 0.94, 0.40625}, respectively.
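
To double-check the center update above, here is a quick sketch (my own illustration) that recomputes cluster A's center from the four member vectors implied by the sums in the text; the team order is my assumption:

```python
# Normalized vectors of cluster A's members, read off from the sums above:
# assumed to be Japan, Korea, Iran, Saudi Arabia.
cluster_a = [
    [0.30, 0.00, 0.19],
    [0.00, 0.15, 0.13],
    [0.24, 0.76, 0.25],
    [0.30, 0.76, 0.06],
]

new_center = [sum(dim) / len(cluster_a) for dim in zip(*cluster_a)]
print(new_center)  # approximately [0.21, 0.4175, 0.1575]
```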

Clustering again with the adjusted center points, the result after the second iteration is:

China C, Japan A, Korea A, Iran A, Saudi Arabia A, Iraq C, Qatar C, UAE C, Uzbekistan B, Thailand C, Vietnam C, Oman C, Bahrain B, North Korea B, Indonesia C.

The assignment has not changed, so the result has converged. The final clustering result is:

Asian first-rate: Japan, Korea, Iran, Saudi Arabia

Asian second-rate: Uzbekistan, Bahrain, North Korea

Asian third-rate: China, Iraq, Qatar, UAE, Thailand, Vietnam, Oman, Indonesia

It seems the data tell us that placing the Chinese men's team at the third-rate level in Asia in recent years really does not wrong them, at least judging by their record in international tournaments.

In fact, the analysis above gives more than the clustering itself; it also provides other interesting information. For example, it quantifies the gaps between teams: among the Asian first-rate teams, Japan and Saudi Arabia are closest to each other, while Iran is relatively far from both, which matches Iran's decline in recent years. Uzbekistan and Bahrain did not reach the finals of either of the two recent World Cups, yet their solid qualifying campaigns and strong Asian Cup performances put them in cluster B, and North Korea also lands in cluster B thanks to reaching the 2010 World Cup finals. Iraq, on the other hand, miraculously won the 2007 Asian Cup but is still classed as third-rate; it seems an Asian Cup title does not carry as much weight as reaching the World Cup finals. Interested readers can dig out more interesting information for themselves.

Source download: http://download.csdn.net/detail/lixiaolun/8666987

4.5. Weaknesses of K-means Clustering

Like other algorithms, k-means clustering has a number of weaknesses:

1. When the amount of data is small, the initial grouping largely determines the clusters, which significantly affects the clustering result.
2. The number of clusters, k, must be specified beforehand.
3. We never know the real clusters; when the dataset is small, feeding the same data in a different order may produce different clusters.
4. It is sensitive to initial conditions: different initializations may produce different clusterings, and the algorithm can get trapped in a local optimum.
5. We never know which attribute contributes more to the grouping, since every attribute is assumed to carry the same weight.
6. The arithmetic mean is not robust to outliers: data very far from a centroid can drag the centroid away from where it should be.
7. Because assignment is based on distance, the resulting clusters are rounded in shape.

One way to mitigate these weaknesses is to use k-means clustering only when plenty of data is available. To overcome the outlier problem, we can use the median instead of the mean (this variant is commonly known as k-medians).
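
A minimal sketch of that median-based center update (my own illustration); swapping it in for the mean step of the k_means sketch above gives the more outlier-robust variant:

```python
import statistics

def median_center(members):
    """Per-dimension median of a cluster's members; more robust to
    outliers than the arithmetic mean used in standard k-means."""
    return [statistics.median(dim) for dim in zip(*members)]

# One far-away outlier barely moves the median center:
print(median_center([[1, 1], [2, 2], [3, 3], [100, 100]]))  # [2.5, 2.5]
```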

Some people have claimed that k-means clustering cannot be used for anything other than quantitative data. This isn't true! Multivariate data of up to n dimensions, even mixed data types, can be clustered; the key to using another type of dissimilarity lies in the distance matrix.

Transferred from: http://www.cnblogs.com/leoo2sk/archive/2010/09/20/k-means.html and http://www.cnblogs.com/emanlee/archive/2012/03/06/2381617.html

