"Machine Learning Notes, Part One": Learning the K-means Algorithm in Plain Language

Last Update:2017-08-13 Source: Internet

Author: User

Tags random seed


Abstract: In data mining, K-means is a cluster analysis algorithm. It groups data by repeatedly moving each seed point to the mean of the points nearest to it.

Problem

The K-means algorithm mainly solves the problem shown in the figure. On the left side of the figure there are some points, and with the naked eye we can see that they form four point groups. But how does a computer program find these groups? That is where our K-means algorithm comes in (Wikipedia link).

The problem K-means solves

Algorithm overview

The algorithm is actually very simple, as the figure shows:

In the figure, A, B, C, D, and E are five points. The gray points are our seed points, the points we use to find the point groups. There are two seed points, so K=2.

The K-means algorithm then works as follows:

1) Randomly pick K seed points in the figure (here K=2).
2) For every point in the figure, compute its distance to each of the K seed points. If point Pi is nearest to seed point Si, then Pi belongs to the point group of Si. (In the figure, A and B belong to the upper seed point, and C, D, and E belong to the seed point in the lower middle.)
3) Next, move each seed point to the center of its point group. (See the third step in the figure.)
4) Repeat steps 2) and 3) until the seed points no longer move. (In the fourth step of the figure, the upper seed point has gathered A, B, and C, and the lower seed point has gathered D and E.)
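The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` cap, and the choice to leave a seed point in place when its group is empty are all assumptions made here, and `math.dist` requires Python 3.8 or later.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=None):
    """Cluster `points` (a list of equal-length tuples) into k point groups."""
    rng = random.Random(seed)
    # Step 1: randomly pick k of the input points as the initial seed points.
    seeds = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to the group of its nearest seed point.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, seeds[i]))
            groups[nearest].append(p)
        # Step 3: move each seed point to the mean of its point group.
        new_seeds = []
        for seed_point, group in zip(seeds, groups):
            if group:
                new_seeds.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                # Assumption: an empty group leaves its seed point where it is.
                new_seeds.append(seed_point)
        # Step 4: stop once no seed point has moved.
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return seeds, groups
```

Calling `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` should separate the three points near the origin from the three points near (10, 10).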

The algorithm is very simple, but there are some details I want to mention. I will not go over the distance formula; anyone with a junior high school education should know how to compute it. I would like to focus on the algorithm for finding the center of a point group.

Algorithm for finding the center of a point group

In general, the center of a point group is found by averaging the X and Y coordinates of the points in the group. However, I would also like to tell you about three other formulas related to finding the center point:

1) Minkowski distance formula: λ can be any value; it can be negative, positive, or infinite.

2) Euclidean distance formula: the first formula with λ=2.

3) Cityblock distance formula: the first formula with λ=1.
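For reference, the usual textbook forms of these three formulas can be written as follows. The notation is an assumption made here, not taken from the article: x_i and x_j are two points with n coordinates each, and d_ij is the distance between them.

```latex
% Minkowski distance with parameter \lambda
d_{ij} = \left( \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|^{\lambda} \right)^{1/\lambda}

% Euclidean distance: the case \lambda = 2
d_{ij} = \sqrt{ \sum_{k=1}^{n} \left( x_{ik} - x_{jk} \right)^{2} }

% Cityblock distance: the case \lambda = 1
d_{ij} = \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|
```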

These three formulas approach the center point in somewhat different ways, as the following figures show (for the first one, λ is between 0 and 1).

(1) Minkowski Distance (2) Euclidean Distance (3) Cityblock Distance

The figures mainly show how each formula approaches the center: the first in a star shape, the second in concentric circles, and the third in diamonds.
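The different shapes can be checked numerically with a small sketch. The function names below are hypothetical, and the code assumes a finite, positive λ (called `lam` here); it does not cover the negative or infinite cases mentioned above.

```python
def minkowski(p, q, lam):
    """Minkowski distance between two equal-length points, for finite lam > 0."""
    return sum(abs(a - b) ** lam for a, b in zip(p, q)) ** (1.0 / lam)


def euclidean(p, q):
    """Euclidean distance: the Minkowski formula with lam = 2."""
    return minkowski(p, q, 2)


def cityblock(p, q):
    """Cityblock distance: the Minkowski formula with lam = 1."""
    return minkowski(p, q, 1)
```

Measured from the origin, the diagonal point (1, 1) and the axis point (2, 0) compare differently under each formula. With `euclidean` the diagonal point is closer (about 1.41 against 2). With `cityblock` the two are equally far (both 2). With `minkowski(..., 0.5)` the axis point is closer (about 2 against 4), which is what pulls the contours into a star shape.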

K-means Demo

If you search Google for "K Means demo" you can find a lot of demos. Here is one I recommend: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

To use it, left-click to place the initial points, right-click to place the seed points, and then tick "Show History" to watch the iteration step by step.

Note: The site behind this demo link also has a nice K-means tutorial.

k-means++ algorithm

K-means has two major flaws, both related to its initial values:

K must be given beforehand, and a good value of K is very hard to estimate. Often we do not know in advance how many categories a given dataset should be split into. (The ISODATA algorithm obtains a more reasonable number of classes K by automatically merging and splitting classes.)

The K-means algorithm needs initial random seed points, and these matter a great deal: different random seed points can lead to completely different results. (The K-means++ algorithm can be used to solve this problem, since it selects the initial points effectively.)

Here I will focus on the steps of the K-means++ algorithm:

1) Randomly pick one point from our data set as the first seed point.
2) For each point x, compute the distance D(x) between x and its nearest seed point and save it in an array, then add these distances together to get Sum(D(x)).
3) Then take a random value and use it to choose the next seed point in a weighted way. The implementation is to first take a random value Random that falls within Sum(D(x)), and then walk through the points doing Random -= D(x) until Random <= 0; the point reached at that moment is the next seed point.
4) Repeat steps 2) and 3) until all K seed points have been selected.
5) Run the K-means algorithm with these K seed points.
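The seeding steps above can be sketched as follows. The function name `kmeanspp_seeds` and its parameters are hypothetical. The sketch follows the steps as written here and weights by D(x); note that the published K-means++ method weights by the squared distance D(x)^2 instead. It also tests `< 0` rather than `<= 0` on a draw from [0, Sum(D(x))), an assumption made here so that an already chosen seed point (whose D(x) is 0) is not picked again.

```python
import math
import random


def kmeanspp_seeds(points, k, rng=None):
    """Pick k seed points from `points` (a list of equal-length tuples)."""
    rng = rng or random.Random()
    # Step 1: the first seed point is picked uniformly at random.
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Step 2: D(x) is the distance from x to its nearest chosen seed point.
        d = [min(math.dist(p, s) for s in seeds) for p in points]
        # Step 3: draw a value within Sum(D(x)) and subtract D(x) point by
        # point; the point that pushes the value below zero is the next seed.
        r = rng.random() * sum(d)
        chosen = points[-1]  # fallback in case of floating-point round-off
        for p, dx in zip(points, d):
            r -= dx
            if r < 0:
                chosen = p
                break
        # Step 4: keep going until k seed points have been selected.
        seeds.append(chosen)
    return seeds
```

Points far from every existing seed point have a large D(x), so they cover a larger share of the range the random value is drawn from and are more likely to be chosen. The returned seed points would then replace the purely random initial seeds in step 1) of plain K-means.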

You can find related code in "Implement the K-means++ Algorithm" (the link may be blocked by the firewall). In addition, Apache's Commons Math library also implements this algorithm.

Applications of the K-means algorithm

Reading this far, you may say that the K-means algorithm seems very simple, that it just plays with coordinate points, and that it has no real use. Moreover, the algorithm has quite a few flaws and is no better than doing the job by hand. Yes, the earlier example just plays with two-dimensional coordinate points, which is indeed not very interesting. But consider the following points:

1) If the points are not two-dimensional but multidimensional, say 5-dimensional, then the result can only be computed with a computer.

2) The X and Y coordinates of a two-dimensional point are really a kind of vector, which is a mathematical abstraction. Many properties in the real world can be abstracted into vectors, for example our age, our preferences, our products, and so on. Once they are vectors, the computer can know the distance between two values of a property. For example, we consider an 18-year-old to be closer to a 24-year-old than to a 12-year-old, one kind of product to be closer to another product than to a computer, and so on.

As long as real-world objects can be abstracted into vectors, you can use the K-means algorithm to classify them.
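As a small illustration of the vector idea, the sketch below compares 5-dimensional vectors with the same Euclidean distance used for two-dimensional points. The vectors `a`, `b`, and `c` and the helper `nearest` are hypothetical; the numbers are invented for illustration and do not come from any real data set. `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical 5-dimensional feature vectors (values invented for illustration).
a = (1.0, 0.0, 2.0, 3.0, 1.0)
b = (1.0, 1.0, 2.0, 3.0, 0.0)
c = (5.0, 4.0, 0.0, 0.0, 6.0)


def nearest(point, candidates):
    """Return the candidate vector closest to `point` by Euclidean distance."""
    return min(candidates, key=lambda q: math.dist(point, q))
```

Here `nearest(a, [b, c])` returns `b`, because `a` and `b` differ in only two coordinates while `a` and `c` differ in all five. Nothing in the distance computation depends on the number of dimensions, which is why K-means carries over from coordinate points to any objects expressed as vectors.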

The article "K-means Clustering (K-means)" gives a very good application example: the author built a vector table from the 2005 to 2010 results of 15 Asian football teams, then used K-means to classify the teams, and got the following result, hehe.

Asia first-rate: Japan, Korea, Iran, Saudi Arabia
Asia second-rate: Uzbekistan, Bahrain, North Korea
Asia third-rate: China, Iraq, Qatar, UAE, Thailand, Vietnam, Oman, Indonesia

In fact, there are many business examples like this. A company can classify its customers so that it can apply different business strategies to different customer groups, and an e-commerce site can analyze the similarity of its goods and classify them so that it can use different sales strategies for each class.

Finally, here is a pretty good slide deck on the algorithm: http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/clustering.pdf

