A brief analysis of the SLAM bag-of-words model and the K-means clustering algorithm (2)


Clustering Concepts:

Clustering: put simply, it means grouping similar things together. It differs from classification: classification is supervised learning, whereas in clustering we do not care what each class actually is; the goal is only to bring similar items together. A clustering algorithm therefore usually only needs to know how to compute similarity before it can start, and since it does not learn from labeled training data, clustering belongs to unsupervised learning.

The cluster analysis we usually encounter is mostly numeric clustering. A common approach is to extract n features from each sample and put them together to form an n-dimensional vector, which gives a mapping from the original data set into an n-dimensional vector space. The data are then grouped according to some rule, such that items within the same group have the greatest similarity to each other.
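As a small illustration of this mapping (the three features and their values below are hypothetical, not taken from the original), each sample becomes an n-dimensional vector, and similarity can then be measured as a distance between vectors:

import numpy as np

# Hypothetical example: each sample is described by 3 numeric features
# (say width, height, weight), so each sample maps to a point in 3-D space.
samples = np.array([
    [1.2, 3.4, 0.5],
    [1.0, 3.6, 0.4],
    [8.1, 0.2, 7.5],
])

# Similarity is then just a distance between vectors, e.g. Euclidean distance:
print(np.linalg.norm(samples[0] - samples[1]))   # small: the first two samples are alike
print(np.linalg.norm(samples[0] - samples[2]))   # large: the third sample is different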

Issues that need to be addressed:

The problem the K-means algorithm mainly solves is shown in the figure below. On the left side of the figure there are some points, and with the naked eye we can see that they form four point groups. But how does a computer program find these groups?

Algorithm Overview:

A, B, C, D, and E are the five points in the figure. The gray points are our seed points, the points we use to find the point groups. There are two seed points, so k = 2.

The K-means algorithm is described as follows:

1. Randomly place K (here K = 2) seed points in the figure.

2. For every point in the figure, compute its distance to each of the K seed points. If point Pi is closest to seed point Si, then Pi belongs to the point group of Si. (In the figure, A and B belong to the upper seed point, and C, D, E belong to the lower seed point.)

3. Next, move each seed point to the center of its point group. (See the third step in the figure.)

4. Repeat steps 2 and 3 until the seed points no longer move. (In the fourth step of the figure, the upper seed point has gathered A, B, C, and the lower seed point has gathered D, E.)
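A minimal sketch of these four steps in Python with NumPy follows. The coordinates of A–E and of the two seed points are made up here, since the original figure is not reproduced, so the final grouping need not match the one in the figure:

import numpy as np

# Hypothetical coordinates for the five points A, B, C, D, E (the real figure is not reproduced)
points = np.array([[1.0, 5.0], [2.0, 5.5], [1.5, 1.0], [3.0, 0.5], [3.5, 1.5]])
names = ["A", "B", "C", "D", "E"]

# Two seed points, so k = 2 (positions also made up)
seeds = np.array([[0.0, 6.0], [0.0, 0.0]])

while True:
    # Step 2: distance from every point to every seed; each point joins its nearest seed's group
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    groups = dists.argmin(axis=1)
    # Step 3: move each seed to the center of its point group
    # (this tiny sketch does not handle the case of a seed losing all of its points)
    new_seeds = np.array([points[groups == k].mean(axis=0) for k in range(len(seeds))])
    # Step 4: stop once the seeds no longer move
    if np.allclose(new_seeds, seeds):
        break
    seeds = new_seeds

for name, k in zip(names, groups):
    print(name, "-> seed", k)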

The K-means Clustering algorithm consists of three steps:

(1) First, pick initial cluster centers for the points to be clustered.

(2) Second, compute the distance from each point to every cluster center and assign each point to the cluster whose center is closest to it.

(3) Third, compute the mean of the coordinates of the points in each cluster and use this mean as the new cluster center; repeat (2) and (3) until the cluster centers no longer move significantly or the number of iterations meets the requirement.
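In practice these three steps rarely need to be hand-rolled. As a sketch, assuming scikit-learn is available and using randomly generated stand-in data (the article's point set is not given), the same procedure can be run with sklearn.cluster.KMeans:

import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: 300 random 2-D points around three made-up centers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # cluster index assigned to the first few points
print(km.cluster_centers_)   # final cluster centers (the means found by step 3)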

The formula for finding the center of a point group:

Suppose the original data set is $(x_1, x_2, \ldots, x_n)$, where each $x_i$ is a $d$-dimensional vector. Given the number of groups $k$ ($k \le n$), the purpose of K-means clustering is to partition the original data into $k$ classes $S = \{S_1, S_2, \ldots, S_k\}$ so that, in numerical terms, the following expression is minimized:

$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2$

Here $\mu_i$ represents the mean of class $S_i$.

Note: $\arg\min$ denotes the value of the variable (here the partition $S$) at which the objective function attains its minimum.

We have N data points in total to be divided into K clusters. What K-means does is minimize the objective

$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left\| x_n - \mu_k \right\|^2$

where $r_{nk}$ is 1 when data point $n$ is assigned to cluster $k$ and 0 otherwise. Minimizing J directly over both $r_{nk}$ and $\mu_k$ is not easy, but we can take an iterative approach. First fix $\mu_k$ and choose the best $r_{nk}$: it is easy to see that J is minimized for this step as long as each data point is assigned to its nearest center. Next, fix $r_{nk}$ and find the best $\mu_k$: taking the derivative of J with respect to $\mu_k$ and setting it to zero, we easily find that J is minimized when $\mu_k$ satisfies

$\mu_k = \dfrac{\sum_{n} r_{nk}\, x_n}{\sum_{n} r_{nk}}$

That is, the value of $\mu_k$ should be the mean of all data points assigned to cluster $k$.
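Spelling out that derivative step (a standard derivation, supplied here because the original formulas are shown only as figures):

$\dfrac{\partial J}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} \left( x_n - \mu_k \right) = 0
\;\Longrightarrow\;
\mu_k \sum_{n=1}^{N} r_{nk} = \sum_{n=1}^{N} r_{nk}\, x_n
\;\Longrightarrow\;
\mu_k = \dfrac{\sum_{n} r_{nk}\, x_n}{\sum_{n} r_{nk}}$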

In the example below, 3 center points are first randomly initialized. No data point has been assigned to a cluster yet, so all of them are marked red by default, as shown in the figure:

Then comes the first iteration: each data point is colored according to its nearest initial center, and the 3 center points are then recomputed, as shown below:

As you can see, because the initial centers were chosen at random, the result is not very good yet; the result of the next iteration is:

You can see that the approximate shape has already emerged. After two more iterations, the result has basically converged, and the final result is as follows:

However, as mentioned above, K-means is not omnipotent: although it often converges to a fairly good result, with bad luck it can also converge to an unsatisfactory local optimum, for example when the following initial center points are used:

It will eventually converge to a result like this:
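A common mitigation, not spelled out in the original, is to run K-means several times from different random initializations and keep the run with the smallest objective J; scikit-learn's KMeans does this via its n_init parameter, and its default "k-means++" seeding also spreads the initial centers out. A small sketch on stand-in data:

import numpy as np
from sklearn.cluster import KMeans

# Stand-in data (the article's point set is not available)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

# 20 independent initializations; the fit with the lowest objective J (inertia_) is kept
km = KMeans(n_clusters=3, init="k-means++", n_init=20, random_state=0).fit(X)
print(km.inertia_)   # value of J for the best of the 20 runs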
