K-means vs. k-means++ and KNN (K-Nearest Neighbor) Algorithms

K-means Introduction

The K-means algorithm is one of the most widely used algorithms in cluster analysis. It divides n objects into K clusters according to their attributes, so that objects in the same cluster are highly similar while objects in different clusters have low similarity. The clustering process can be illustrated by the following diagram:


As shown in the figure, each data sample is represented by a dot, and the center of each cluster by a cross. (a) At first we have the raw data: disorganized, unlabeled, and all looking the same (all green). (b) Suppose the data set can be divided into two categories, so k=2; two points are chosen at random in the coordinate space as the centers of the two classes. (c-f) show two iterations of clustering. First, each data sample is assigned to the cluster with the nearest center; after this assignment, the center of each cluster is updated by averaging the coordinates of all the points in that cluster. This "assign-update-assign-update" loop repeats until the cluster centers stop moving.
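To make this "assign-update" loop concrete, here is a minimal sketch of K-means in Python; the function and variable names are illustrative and not from the original article.

```python
import random

def kmeans(points, k, iterations=100):
    """Minimal K-means sketch: points is a list of (x, y) tuples."""
    # Randomly pick k of the points as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                        (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster's points.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centers.append(centers[i])  # keep the old center for an empty cluster
        # Stop when the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

For instance, `centers, clusters = kmeans(data, k=2)` would reproduce the two-class example in the figure.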

The algorithm is relatively simple, but there are a few things worth paying attention to. Here I want to talk about how the center of a group of points is computed.

In general, to find the center of a group of points you can use the average of the x/y coordinates of the points. You can also define the center using one of the following three distance formulas:

1) Minkowski Distance formula -- λ can take any value: negative, positive, or infinity.

2) Euclidean Distance formula -- the case of the first formula with λ = 2.

3) CityBlock Distance formula -- the case of the first formula with λ = 1.
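The formula images themselves are not reproduced in this copy, so for reference these are the standard forms of the three distances between points x and y in d dimensions:

```latex
D_{\mathrm{Minkowski}}(x, y) = \Big( \sum_{i=1}^{d} |x_i - y_i|^{\lambda} \Big)^{1/\lambda}

D_{\mathrm{Euclidean}}(x, y) = \sqrt{ \sum_{i=1}^{d} (x_i - y_i)^2 } \quad (\lambda = 2)

D_{\mathrm{CityBlock}}(x, y) = \sum_{i=1}^{d} |x_i - y_i| \quad (\lambda = 1)
```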

The center points given by these three formulas differ in some respects; let's look at the picture below (for the first formula, with λ between 0 and 1).

(1) Minkowski Distance  (2) Euclidean Distance  (3) CityBlock Distance

The main point of the figure above is how each metric approaches the center: the first figure approaches it in a star-shaped way, the second in concentric circles, and the third in a diamond shape.

Defects of the K-means algorithm

The number of cluster centers, K, needs to be given beforehand, but in practice this value is very difficult to estimate; often it is not known in advance how many categories a given data set should best be divided into.

K-means also requires the initial cluster centers to be chosen by hand, and different initial centers may lead to completely different clustering results. (This can be solved with the k-means++ algorithm.)

The k-means++ algorithm

For the second defect above, the k-means++ algorithm can be used. The basic idea of k-means++ when selecting the initial seeds is that the initial cluster centers should be as far apart from each other as possible:

1. Randomly select a point from the collection of input data points as the first cluster center.
2. For each point x in the data set X, compute D(x), the distance from x to the nearest already-selected cluster center.
3. Select a new data point as the next cluster center, where points with a larger D(x) have a greater probability of being selected.
4. Repeat steps 2 and 3 until K cluster centers have been selected.
5. Run the standard K-means algorithm starting from these K cluster centers.

As can be seen from the description above, the key to the algorithm is step 3: how to turn D(x) into the probability of a point being selected. One way to do this is as follows:
Randomly pick one point from the data set as the first "seed point". For each point, compute the distance D(x) from it to the nearest "seed point" and save these distances in an array; then add them up to get Sum(D(x)). Next, take a random value and use a weighted method to pick the next "seed point": take a random value Random that falls in [0, Sum(D(x))], then repeatedly compute Random -= D(x) over the points until Random <= 0; the point where this happens is the next "seed point". Repeat steps 2 and 3 until K cluster centers are selected, then run the standard K-means algorithm using these K initial cluster centers.

You can see from step 3 how a new center is chosen: a point with a large distance D(x) is more likely to be selected as a cluster center. The reason is relatively simple, as shown in the following figure:

Assuming the D(x) values of A, B, C, and D are as shown above, when the algorithm takes Sum(D(x)) * random, the resulting value is more likely to fall into the range belonging to a larger D(x), so the corresponding point is more likely to be selected as the new cluster center. k-means++ code: http://rosettacode.org/wiki/K-means%2B%2B_clustering
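As a rough sketch of the seed-selection procedure described above (the names are illustrative, and the Rosetta Code page linked above has fuller implementations):

```python
import random

def kmeans_pp_init(points, k):
    """Pick k initial centers using the k-means++ weighting described above."""
    # Step 1: choose the first center uniformly at random.
    centers = [random.choice(points)]
    while len(centers) < k:
        # Step 2: D(x) = squared distance from each point to its nearest chosen center
        # (the standard k-means++ formulation uses the squared distance).
        d = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
             for p in points]
        # Step 3: pick the next center with probability proportional to D(x),
        # using the Sum(D(x)) * random trick from the text.
        r = random.random() * sum(d)
        for i, dist in enumerate(d):
            r -= dist
            if r <= 0:
                centers.append(points[i])
                break
        else:
            centers.append(points[-1])  # guard against floating-point leftovers
    return centers
```

The centers returned here would then be fed to the standard K-means loop shown earlier.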
KNN (k-Nearest Neighbor) introduction. The idea of the algorithm: if, among the k samples most similar to a given sample in feature space (i.e., its nearest neighbors), the majority belong to a certain category, then the sample belongs to that category as well. In other words, the method decides the category of a sample based only on the categories of its nearest one or few neighbors. Look at the following picture:

The KNN algorithm process is this:

From the image above, we can see that the data set in the graph is already labeled: one class is blue squares, another is red triangles, and the green circle is the point we want to classify.

If k=3, the nearest neighbors of the green point are 2 red triangles and 1 blue square. These 3 points vote, so the green point to be classified belongs to the red triangles.

If k=5, the nearest neighbors of the green point are 2 red triangles and 3 blue squares. These 5 points vote, so the green point to be classified belongs to the blue squares.

We can see that KNN is essentially a method based on data statistics. In fact, many machine learning algorithms are also based on data statistics.

KNN is a kind of memory-based learning, also called instance-based learning, and belongs to lazy learning. That is, it has no obvious pre-training process; when the program starts running, the data set is loaded into memory and classification can begin without any training.

Each time an unknown sample point arrives, we find the K nearest points in its vicinity and let them vote.
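A minimal sketch of this voting procedure in Python (the data layout and names are illustrative):

```python
from collections import Counter

def knn_classify(query, data, k=3):
    """data is a list of ((x, y), label) pairs; classify query by majority vote."""
    # Sort the labeled points by distance to the query point.
    neighbors = sorted(data,
                       key=lambda item: (item[0][0] - query[0]) ** 2 +
                                        (item[0][1] - query[1]) ** 2)
    # Take the k nearest neighbors and let their labels vote.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# A toy example mirroring the figure above:
data = [((1.0, 1.0), "triangle"), ((1.2, 0.8), "triangle"),
        ((2.0, 2.0), "square"), ((2.2, 2.1), "square"), ((2.4, 1.9), "square")]
print(knn_classify((1.5, 1.4), data, k=3))  # -> "triangle"
print(knn_classify((1.5, 1.4), data, k=5))  # -> "square"
```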

For one more example, locally weighted regression (LWR) is also a memory-based method; consider the data set shown in the figure below.

It is not possible to fit this data set with a single line, because the data does not look like a straight line. But the data points within each local range can be approximated by a line. Each time a query sample x arrives, we take the data samples centered around x on the x-axis (a few points to its left and right), run linear regression on those sample points to compute a local line, and then plug x into that line to obtain the corresponding y, completing one prediction. That is, for every query point a local line must be trained: train once, use once. LWR and KNN are very similar in this respect; both are tailored to the query point and trained locally.
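A rough sketch of this "train once, use once" idea, using numpy for the local least-squares fit (the window size and names are assumptions for illustration):

```python
import numpy as np

def lwr_predict(x_query, xs, ys, window=5):
    """Predict y at x_query by fitting a line to the `window` nearest points on the x-axis."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    # Pick the points whose x values are closest to the query position.
    idx = np.argsort(np.abs(xs - x_query))[:window]
    # Fit a local line y = a*x + b by least squares on those points only.
    a, b = np.polyfit(xs[idx], ys[idx], deg=1)
    # Evaluate the local line at the query point; the line is then discarded.
    return a * x_query + b
```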
The difference between KNN and K-means

KNN

1. KNN is a classification algorithm.
2. Supervised learning.
3. The data set fed to it is labeled data, i.e., data whose categories are already known and correct.
4. No obvious pre-training process; it belongs to memory-based learning.
5. Meaning of K: given a sample x to classify (i.e., to find its y), find the K points in the data set nearest to x; among these K points, the category C with the largest count becomes the label assigned to x.

K-means

1. K-means is a clustering algorithm.
2. Unsupervised learning.
3. The data set fed to it is unlabeled, disorganized data; after clustering it becomes somewhat ordered: disorder first, order afterwards.
4. There is a clear pre-training process.
5. Meaning of K: K is a number fixed in advance by hand, assuming the data set can be divided into K clusters; since it is chosen manually, it requires some prior knowledge.


Similarity: both involve the same basic process: given a point, find the point(s) closest to it in the data set. That is, both use NN (Nearest Neighbor) search, generally implemented with a KD-tree.

Reference:

1) http://www.yanjiuyanjiu.com/blog/20130225/
2) http://www.cnblogs.com/shelocks/archive/2012/12/20/2826787.html