Clustering Algorithm Learning: K-means, K-medoids, GMM


The GMM part of this post references this article: Link

Simply put, K-means assigns each data point to exactly one cluster, while GMM gives the probability that each data point belongs to each cluster; this is also known as soft assignment.
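A minimal sketch of this difference, using scikit-learn (my choice for illustration; the article does not name a library):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    # Toy data: two blobs in 2-D.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])

    # K-means: each point gets exactly one cluster label (hard assignment).
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # GMM: each point gets a probability for every cluster (soft assignment).
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    probs = gmm.predict_proba(X)   # shape (200, 2), each row sums to 1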

Usually the probability of a single point is very small, and multiplying many small numbers together in a computer easily causes floating-point underflow, so we usually take the logarithm, turning the product into a sum, to obtain the log-likelihood function.
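A quick numerical illustration of why the logarithm helps; the log-sum-exp trick at the end is a standard numerical device, not something the article spells out:

    import numpy as np
    from scipy.special import logsumexp

    # 1000 independent probabilities around 1e-5: their product underflows.
    p = np.full(1000, 1e-5)
    print(np.prod(p))          # 0.0 -- underflow
    print(np.sum(np.log(p)))   # about -11512.9 -- the log-likelihood is fine

    # Inside a GMM, log sum_k pi_k N(x | mu_k, Sigma_k) can be computed
    # stably from per-component log densities via log-sum-exp:
    log_weighted = np.log(0.5) + np.array([-1000.0, -1001.0])  # log pi_k + log N_k
    print(logsumexp(log_weighted))  # about -1000.38, no underflow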

So GMM has the same problem as K-means: there is no guarantee of finding the global optimum. With bad luck and poor initial values, it can produce poor results.

With K-means we usually repeat the algorithm a certain number of times and keep the best result, but each GMM iteration costs much more to compute than a K-means iteration. A popular practice is therefore to run K-means first (repeated, keeping the best result) to get a rough solution, use it as the initial value (simply pass the centroids obtained by K-means into the gmm function), and then do the detailed iterations with GMM.
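In scikit-learn terms this initialization might look as follows (an assumption on my part; the article's "gmm function" suggests a MATLAB-style implementation):

    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    # Cheap pass: repeated K-means, keeping the best of n_init runs.
    # X is your (n_samples, n_features) data array.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Expensive pass: GMM refined starting from the K-means centroids.
    gmm = GaussianMixture(n_components=3, means_init=km.cluster_centers_,
                          random_state=0).fit(X)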

The K-medoids part references this article: Link

The difference between K-means and K-medoids is similar to the difference between the mean and the median of a data sample: the former can take any value in a continuous space, while the latter must be chosen from the given sample points.
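A one-line numeric illustration of the analogy (my own toy example):

    import numpy as np

    x = np.array([1, 2, 3, 4, 100])        # one outlier
    print(np.mean(x))    # 22.0 -- can be any value, dragged far by the outlier
    print(np.median(x))  # 3.0  -- one of the sample points, barely moved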

One of the most direct reasons is that K-means makes strong demands on the data: it uses Euclidean distance to describe the dissimilarity between data points, so that the cluster center can be computed directly by averaging the points. This requires the data points to live in a Euclidean space.

However, not all data satisfies this requirement. Numeric features such as height can be handled naturally this way, but categorical features cannot. To give a simple example, suppose I want to cluster dogs and want to do it directly in the space of all dogs. K-means is powerless here because Euclidean distance does not apply: a Samoyed minus a Rough Collie, then squared? God knows what that is! Add a German Shepherd and take the average? There is no way to run K-means here!

The most common way is to construct a dissimilarity matrix whose entry (i, j) represents the difference between the i-th dog and the j-th dog. For example, the difference between two Samoyeds can be set to 0, the difference between a German Shepherd and a Rough Collie to 0.7, the difference between a Samoyed and a Miniature Schnauzer to 1, and so on.
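A minimal K-medoids sketch over such a precomputed dissimilarity matrix (a naive alternating implementation written for illustration, not the article's code; the breed indices and matrix values below are made up):

    import numpy as np

    def k_medoids(D, k, n_iter=100, seed=0):
        """Naive K-medoids on a precomputed dissimilarity matrix D (n x n)."""
        rng = np.random.RandomState(seed)
        medoids = rng.choice(len(D), size=k, replace=False)
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest medoid.
            labels = np.argmin(D[:, medoids], axis=1)
            # Update step: within each cluster, pick the point whose total
            # distance to the rest of the cluster is smallest.
            new_medoids = medoids.copy()
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members) == 0:
                    continue
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        return medoids, labels

    # Dissimilarities among 4 dogs: two Samoyeds, a German Shepherd, a Rough Collie.
    D = np.array([[0.0, 0.0, 1.0, 0.9],
                  [0.0, 0.0, 1.0, 0.9],
                  [1.0, 1.0, 0.0, 0.7],
                  [0.9, 0.9, 0.7, 0.0]])
    medoids, labels = k_medoids(D, k=2)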

In addition, because the center point is chosen from the existing data points, K-medoids, compared with K-means, is less susceptible to outliers caused by errors and similar problems, and is therefore somewhat more robust.

You will find that going from K-means to K-medoids, the time complexity increases dramatically: K-means needs only one averaging step per cluster, which is O(n), while K-medoids must enumerate every point in a cluster and compute the sum of its distances to all other points, which is O(n^2).

K-medoids can also fall into a local optimum.

The author then ends the article with a text-categorization example, which also mentions n-grams:

The paper N-Gram-Based Text Categorization describes a method for measuring the similarity of documents written in different languages. The character n-grams of a string are its contiguous substrings of length n. For example, the 3-grams generated by "hello" are hel, ell, and llo; sometimes spaces (denoted by underscores) are added at the beginning and end before taking n-grams, giving _he, hel, ell, llo, lo_, and o__. According to Zipf's law:

The nth most common word in a human language text occurs with a frequency inversely proportional to n.

Here we use n-grams in place of words. In this way we can obtain an n-gram frequency distribution from a document; sorting by frequency and keeping only the k most frequent n-grams gives what we call a "profile". Normally, documents written in the same language (at least for alphabetic Western languages), regardless of subject or length, have roughly the same profile; that is, the ordering of n-grams by frequency of occurrence does not change much. This is a very nice property: usually we just pick one (fairly typical, not very long) document in each language to build a profile. When an unknown document arrives, we simply compare it against each profile, and the language whose profile differs the least can be identified as the language of the unknown document. The accuracy is very high, and what is even more valuable is that the required training data is tiny and easy to obtain, and training the model costs very little.
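A sketch of profile construction and the "out-of-place" comparison as I read the Cavnar & Trenkle paper; the helper names, the k = 300 cutoff, and the padding convention are assumptions for illustration:

    from collections import Counter

    def profile(text, n=3, k=300):
        """Top-k character n-grams of a text, most frequent first."""
        padded = "_" + text.lower().replace(" ", "_") + "_" * (n - 1)
        grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
        return [g for g, _ in Counter(grams).most_common(k)]

    def out_of_place(p1, p2):
        """Sum of rank differences; smaller means more similar profiles."""
        penalty = len(p2)  # rank penalty for n-grams missing from p2
        rank2 = {g: i for i, g in enumerate(p2)}
        return sum(abs(i - rank2.get(g, penalty)) for i, g in enumerate(p1))

    # To identify an unknown document, pick the language whose stored
    # profile has the smallest out-of-place distance to profile(doc).

Note that profile("hello") reproduces the six padded 3-grams listed above: _he, hel, ell, llo, lo_, o__.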
