Machine Learning (6): The K-means Clustering Algorithm

Source: Internet
Author: User
Tags: scalar, SVM


Common classification algorithms include decision trees, logistic regression, SVM, Bayesian methods, and so on. Classification, as a supervised learning method, requires that information about each category be known in advance and that every item to be classified belong to one of those categories. However, these conditions are often not met, especially when processing large amounts of data: preprocessing the data so that it satisfies the requirements of a classification algorithm can be very costly. Imagine being handed a text corpus a gigabyte in size, already segmented into words, and being asked to group it according to a few dozen given keywords. A supervised learning approach would be difficult and not cost-effective, because the up-front work is simply too large.

At this point we can consider using a clustering algorithm: we only need to know what those few dozen keywords are. Clustering is unsupervised learning; compared with classification, it does not rely on predefined classes or on labeled training examples. This article first introduces the foundation of clustering, distance and dissimilarity, and then introduces a common clustering algorithm, K-means clustering.

Before formally discussing clustering, we need to settle one question: how to quantify the difference between two comparable elements. With this background plus the definition of K-means, the K-means algorithm is basically understandable; it is not a particularly difficult algorithm. In plain terms, dissimilarity measures how different two things are. For example, the difference between a human and an octopus is clearly greater than the difference between a human and a chimpanzee, which we can feel intuitively. A computer, however, has no such intuition, so we must define the degree of dissimilarity quantitatively in mathematical terms.

Let x = {x1, x2, x3, ..., xn} and y = {y1, y2, y3, ..., yn}, where x and y are two elements, each with n measurable feature attributes. The dissimilarity of x and y is defined as d(x, y) = f(x, y) -> R, where R is the set of real numbers. That is, the dissimilarity is a mapping from a pair of elements to the real numbers, and the resulting real number quantitatively represents how different the two elements are.

The following sections describe how dissimilarity is computed for different types of variables.

Scalar

A scalar is a number with no directional meaning, also called a scale variable. For now, consider the case where all of an element's feature attributes are scalars. For example, compute the dissimilarity of x = {2, 1, 102} and y = {1, 3, 2}. A very natural idea is to use the Euclidean distance between the two as their dissimilarity. The Euclidean distance is defined as follows:
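Writing the two elements as x = {x_1, ..., x_n} and y = {y_1, ..., y_n} as above, the standard form is:

    d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }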

Its meaning is the geometric distance between the two elements in Euclidean space. Because it is easy to understand and has strong interpretability, it is widely used to measure the dissimilarity of two scalar-valued elements. Substituting the two sample points above into the formula, the Euclidean distance between them is:
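    d(x, y) = \sqrt{ (2-1)^2 + (1-3)^2 + (102-2)^2 } = \sqrt{10005} \approx 100.02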

In addition to the Euclidean distance, the Manhattan distance and the Minkowski distance are commonly used as measures of scalar dissimilarity. They are defined as follows:

Manhattan Distance:
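Using the same notation as above:

    d(x, y) = \sum_{i=1}^{n} |x_i - y_i|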

Minkowski Distance:
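Here p >= 1 is a parameter of the distance:

    d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}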

The Euclidean distance and the Manhattan distance can be seen as special cases of the Minkowski distance with p = 2 and p = 1, respectively.
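As a small illustrative sketch (plain Python, no external libraries; the function names are my own, not taken from this article), the three distances can be computed and checked against the example above:

    def euclidean(x, y):
        # Minkowski distance with p = 2
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def manhattan(x, y):
        # Minkowski distance with p = 1
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski(x, y, p):
        # general form; p = 1 gives Manhattan, p = 2 gives Euclidean
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    x = [2, 1, 102]
    y = [1, 3, 2]
    print(euclidean(x, y))     # ~100.02
    print(manhattan(x, y))     # 103
    print(minkowski(x, y, 2))  # same value as euclidean(x, y)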

0-1 Normalization

Now let's discuss scalar normalization. The dissimilarity calculation above has a problem: an attribute with a large range of values has a greater effect on the distance than an attribute with a smaller range. In the example above, for instance, the third attribute spans a much wider range than the first two, which prevents the distance from reflecting the true difference. To solve this problem, the attribute values are usually normalized. Normalization maps each attribute's values to the same value range in order to balance the influence of each attribute on the distance. Typically each attribute is mapped to the [0, 1] interval, using the following formula:
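For the value a_i of the i-th attribute of an element:

    a_i' = \frac{a_i - \min(A_i)}{\max(A_i) - \min(A_i)}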

where max(A_i) and min(A_i) denote the maximum and minimum values of the i-th attribute over all elements. For example, after normalizing the elements in the example to the [0, 1] interval, they become x' = {1, 0, 1} and y' = {0, 1, 0}, and the recomputed Euclidean distance is approximately 1.732.

Binary variables

A binary variable is a variable that takes only the values 0 and 1, somewhat like a Boolean value, and is usually used to indicate whether some attribute is present or not. For binary variables, the distances mentioned in the previous section do not capture dissimilarity well, and we need a more suitable measure. A common method is to measure the dissimilarity by the proportion of attributes on which the two elements take different values.

Let x = {1,0,0,0,1,0,1,1} and y = {0,0,0,1,1,1,1,1}. The two elements have the same value in the 2nd, 3rd, 5th, 7th, and 8th attributes and different values in the 1st, 4th, and 6th, so the dissimilarity can be taken as 3/8 = 0.375. In general, for binary variables, the dissimilarity can be computed as (number of attributes with different values) / (number of attributes per element).

The dissimilarity described above should be called the symmetric binary dissimilarity. In practice, there are situations where we only care about both elements taking the value 1, and both elements taking the value 0 on some attribute does not make them more alike. For example, when clustering patients by their conditions, if two people both have lung cancer we consider their similarity strengthened, but if neither has lung cancer we do not feel this makes them more similar. In this case, the dissimilarity is instead computed as (number of attributes with different values) / (number of attributes per element - number of attributes where both values are 0); this is called the asymmetric binary dissimilarity. Subtracting the asymmetric binary dissimilarity from 1 gives the asymmetric binary similarity, also known as the Jaccard coefficient, which is a very important concept.
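A small sketch in plain Python (the function names are my own) that computes both measures for the example vectors above:

    def symmetric_binary_dissimilarity(x, y):
        # proportion of attributes on which x and y differ
        mismatches = sum(a != b for a, b in zip(x, y))
        return mismatches / len(x)

    def asymmetric_binary_dissimilarity(x, y):
        # same idea, but attributes where both values are 0 are ignored
        mismatches = sum(a != b for a, b in zip(x, y))
        both_zero = sum(a == 0 and b == 0 for a, b in zip(x, y))
        return mismatches / (len(x) - both_zero)

    x = [1, 0, 0, 0, 1, 0, 1, 1]
    y = [0, 0, 0, 1, 1, 1, 1, 1]
    print(symmetric_binary_dissimilarity(x, y))       # 0.375
    print(asymmetric_binary_dissimilarity(x, y))      # 0.5
    print(1 - asymmetric_binary_dissimilarity(x, y))  # Jaccard coefficient: 0.5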

Categorical variables

Categorical variables are a generalization of binary variables, similar to enumeration types in programs, but each value has no numeric or ordinal meaning, for example color or ethnicity. For categorical variables, the dissimilarity is measured as (number of attributes with different values) / (number of attributes per element).
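In symbols, with p the total number of attributes and m the number of attributes on which the two elements take the same value (a standard formulation of the same ratio):

    d(x, y) = \frac{p - m}{p}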

Ordinal variables

Ordinal variables are categorical variables with ordinal meaning; their values can usually be arranged in a certain order, such as champion, runner-up, and third place. For ordinal variables, a common approach is to assign a number to each value, called the rank of the value, and then compute the dissimilarity on the ranks, treated as scalar attributes, rather than on the original values.
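For example, with ranks champion = 1, runner-up = 2, third place = 3, the dissimilarity between a champion and a third-place finisher can then be computed on the ranks as scalars, e.g. |1 - 3| = 2 under the Manhattan distance (the choice of distance here is only illustrative, not prescribed above).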

Vector

For vectors, which have both magnitude and direction, the Minkowski distance is not a good way to measure dissimilarity. A popular practice is to use the cosine of the angle between the two vectors; the formula is:
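For two n-dimensional vectors x and y:

    \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}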

where ||x|| denotes the Euclidean norm of x. It is important to note that the cosine measures the similarity of the two vectors, not their dissimilarity!
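A minimal sketch in plain Python (the function name and the example vectors are my own):

    def cosine_similarity(x, y):
        # dot product divided by the product of the Euclidean norms
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = sum(a * a for a in x) ** 0.5
        norm_y = sum(b * b for b in y) ** 0.5
        return dot / (norm_x * norm_y)

    print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0: same direction, maximum similarity
    # A common way to turn this into a dissimilarity is 1 - cosine_similarity(x, y).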

What is clustering?

The clustering problem is this: given a set of elements D, where each element has n observable attributes, use an algorithm to divide D into k subsets such that the dissimilarity between elements within each subset is as low as possible, while the dissimilarity between elements of different subsets is as high as possible. Each of these subsets is called a cluster.

Clustering differs from classification. Classification is learning by example: it requires that the categories be identified before classification and that each element be mapped to a category. Clustering is learning by observation: before clustering, the categories, and even the number of categories, may be unknown, which makes it unsupervised learning. At present, clustering is widely used in statistics, biology, database technology, marketing, and other fields, and there are many corresponding algorithms. This article introduces only one of the simplest clustering algorithms, the K-means algorithm.

The computational process of the K-means algorithm is straightforward (a minimal code sketch follows the list):

1. Randomly select k elements from D as the initial centers of the k clusters.

2. Compute the dissimilarity between each of the remaining elements and the k cluster centers, and assign each element to the cluster with the lowest dissimilarity.

3. According to the clustering result, recompute the center of each of the k clusters by taking the arithmetic mean of each dimension over all elements in the cluster.

4. Re-cluster all the elements in D according to the new centers.

5. Repeat steps 3 and 4 until the clustering result no longer changes.

6. Output the result.
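Below is a minimal sketch of this procedure in plain Python, following the six steps above. It uses the Euclidean distance as the dissimilarity measure and random initial centers; the function and variable names, and the max_iters safeguard, are my own choices rather than part of the article.

    import random

    def euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def k_means(D, k, max_iters=100):
        # Step 1: randomly take k elements from D as the initial cluster centers.
        centers = random.sample(D, k)
        for _ in range(max_iters):  # max_iters is an extra safeguard, not one of the steps above
            # Steps 2 and 4: assign every element to the cluster whose center
            # has the lowest dissimilarity (here: the smallest Euclidean distance).
            clusters = [[] for _ in range(k)]
            for point in D:
                distances = [euclidean(point, c) for c in centers]
                clusters[distances.index(min(distances))].append(point)
            # Step 3: recompute each center as the arithmetic mean of each dimension.
            new_centers = []
            for cluster, old_center in zip(clusters, centers):
                if cluster:
                    m = len(cluster[0])
                    new_centers.append([sum(p[i] for p in cluster) / len(cluster) for i in range(m)])
                else:
                    new_centers.append(old_center)  # keep the old center for an empty cluster
            # Step 5: stop when the clustering result no longer changes.
            if new_centers == centers:
                break
            centers = new_centers
        return clusters, centers  # Step 6: output the result

    # Example usage with two obvious groups of 2-D points:
    data = [[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]]
    clusters, centers = k_means(data, k=2)
    print(clusters)
    print(centers)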

Time complexity: O(t * n * k * m)

Space complexity: O(n * m)

where n is the number of elements, k is the number of clusters (the number of elements selected in step 1), m is the number of feature attributes per element, and t is the number of iterations in step 5.

References:

T2 phage (much of my understanding comes from this expert's writing; I also learned from reading their other blog posts)

K-means clustering, Baidu Encyclopedia

Summary

The next goals are logistic regression and SVM. I have read many blog posts about these two algorithms, but my understanding is still not deep enough; I will keep studying and hope to gain more.

