K-means algorithm principle and R language example

Last Update:2016-01-23 Source: Internet

Author: User

Clustering groups similar objects into the same cluster, somewhat like a fully automated classification. The more similar the objects within a cluster, the better the clustering. The classification problems discussed for support vector machines and neural networks are supervised learning methods, whereas clustering, which we introduce now, is unsupervised. K-means is the most basic and simplest clustering algorithm.

In the K-means algorithm, the centroid is the core concept that defines a cluster prototype (that is, the result obtained by machine learning). While describing the algorithm, we will show how centroids are computed. As you will see, except for the initial centroids, which are specified at the start, each centroid is the mean of the points in its cluster.
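To make the centroid calculation concrete, here is a small sketch in R. The six points and their cluster labels are made up for illustration and do not come from the article's data set.

```r
# Toy data: six 2-D points (illustrative values only)
points <- matrix(c(1, 1,
                   2, 1,
                   1, 2,
                   8, 8,
                   9, 8,
                   8, 9),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("x", "y")))

# Assumed cluster label for each point
cluster <- c(1, 1, 1, 2, 2, 2)

# The centroid of a cluster is the column-wise mean of its members:
# sum the rows per cluster, then divide by the cluster sizes
centroids <- rowsum(points, cluster) / as.vector(table(cluster))
print(centroids)
```

Each row of `centroids` is the mean position of one cluster, which is exactly the value the algorithm recomputes after every assignment step.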

First, select K initial centroids (they need not be points from the sample data set), where K is a user-specified parameter: the number of clusters expected. Each data point is assigned to its nearest centroid, and the set of points assigned to the same centroid forms a cluster. Then, based on this assignment, the centroid of each cluster is updated. The assignment and update steps are repeated until the data points in each cluster no longer change or, equivalently, until the centroids no longer change.

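The basic K-means algorithm is described as follows:

1. Select K points as the initial centroids.
2. Assign each data point to its nearest centroid, forming K clusters.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids no longer change.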
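The loop described above can be sketched in R roughly as follows. This is a minimal illustration, not the article's original listing: `simple_kmeans` is a hypothetical helper name, the initial centroids are drawn from the data rows (one possible choice among several), and the toy data are synthetic.

```r
# simple_kmeans is a hypothetical helper, not a base R function
# (base R provides stats::kmeans for real use).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Pick k distinct rows as the initial centroids (an assumption of this sketch)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # Assign each point to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Recompute each centroid as the mean of its members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centroids, iterations = iter)
}

# Synthetic toy data: two well-separated groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

The sketch stops as soon as an assignment step leaves every label unchanged, which is equivalent to the centroids no longer changing.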

The data in the data set are then reassigned according to the distance from each data point to the new centroids, as shown in Figure 13-2 (c). The algorithm then computes new centroids from this assignment and reassigns the data once more according to the distance from each point to the new centroids. This time the data points in each cluster no longer change, so the algorithm terminates, and the final clustering result is shown in Figure 13-2 (d).

For some combinations of distance function and centroid type, the algorithm always converges to a solution; that is, K-means reaches a state in which neither the clusters nor the centroids change any more. However, to avoid the time cost of excessive iteration, a weaker condition is often used in practice in place of "the centroids no longer change", for example "repeat until only 1% of the points change clusters".
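A weaker stopping rule of this kind could be checked as in the sketch below. The helper name `changed_fraction` and the label vectors are assumptions made for illustration; only the 1% threshold comes from the text.

```r
# changed_fraction is a hypothetical helper: the share of points whose
# cluster label differs between two consecutive iterations
changed_fraction <- function(old_cluster, new_cluster) {
  mean(old_cluster != new_cluster)
}

# Illustrative labels for 200 points before and after one iteration
old_cluster <- c(rep(1, 100), rep(2, 100))
new_cluster <- old_cluster
new_cluster[1] <- 2            # exactly one of the 200 points moves

frac <- changed_fraction(old_cluster, new_cluster)
stop_now <- frac <= 0.01       # stop once at most 1% of the points change
print(c(fraction = frac, stop = stop_now))
```

With the built-in `kmeans` function, a cap on the number of iterations is available through its `iter.max` argument, which serves a similar purpose of bounding the running time.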

Although K-means clustering is relatively simple, it is quite effective. Some of its variants are even more effective and are less affected by initialization problems. However, K-means is not suitable for all types of data. It has difficulty with non-spherical clusters and with clusters of differing sizes or densities, although it can usually find pure sub-clusters when a large enough number of clusters is specified. K-means also has problems when the data contain outliers; in that case, detecting and removing outliers helps. Another problem is that K-means is sensitive to the choice of initial centroids: different initial values can lead to very different numbers of iterations, and even to different final clusterings. The choice of K is a problem as well, since the algorithm cannot determine by itself how many clusters the data set should be divided into. Finally, K-means is limited to data for which a centroid (mean) is meaningful. A related technique, K-medoids clustering, has no such limitation. In K-medoids clustering, the representative chosen for each cluster is no longer the mean but a medoid, that is, an actual data point that is most centrally located in the cluster. The other details of that algorithm are much the same as those of K-means, so we do not repeat them.
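Two of these issues, sensitivity to initial values and the choice of K, are often handled in R as sketched below. The data are synthetic, and comparing the total within-cluster sum of squares across several values of K (the "elbow" heuristic) is one common approach among several, not something prescribed by the original article.

```r
set.seed(42)
# Synthetic data: three well-separated groups of 30 points each
x <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2))

# nstart = 25 tries 25 random sets of initial centroids and keeps the best
# result, which reduces the dependence on any single initialization
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Total within-cluster sum of squares for K = 1, ..., 6
print(round(wss, 1))
```

On data like these, the within-cluster sum of squares typically drops sharply up to the true number of groups and flattens afterwards, which gives a rough hint for choosing K.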

Finally, we give an example of a practical application. (The code is written in R, my favorite language for data mining.)

A data set from the World Bank records two indicators for 30 countries. The following code reads the file and displays the first few rows. As you can see, the data are organized in columns. The first column is the country name, which plays no part in the cluster analysis that follows; we care more about the other two columns. The second column gives the share of the country's tertiary industry in GDP, and the last column gives the share of the population aged 65 or older (that is, the elderly population).

To simplify later processing, the data that was read in needs some preprocessing: adjust the column labels, replace the row labels with the country names, and delete the column containing the country names.
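The preprocessing steps just described might look like the following sketch. The rows, the placeholder country names, and the column labels `country`, `services`, and `aged65` are all invented for illustration; they are not the World Bank figures or the labels used in the original code.

```r
# Placeholder rows: names and values are invented for illustration only
raw <- data.frame(V1 = c("Country A", "Country B", "Country C"),
                  V2 = c(70.1, 45.3, 62.8),
                  V3 = c(16.2, 5.1, 12.4),
                  stringsAsFactors = FALSE)

# Adjust the column labels (hypothetical names)
names(raw) <- c("country", "services", "aged65")

# Use the country names as row labels, then drop the name column
rownames(raw) <- raw$country
raw$country <- NULL

print(raw)
```

After these steps the data frame contains only the two numeric indicators, which is the form that `kmeans` expects.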

If you draw a scatter plot of these data, it is easy to see that they fall broadly into two groups. In fact, half of the countries in the data are OECD members, while the other half are developing countries (including some ASEAN, South Asian, and Latin American countries). So we can use the following code to perform a K-means cluster analysis.
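A K-means analysis with two clusters can be run with the built-in `kmeans` function, as in this sketch. Because the original data file is not available here, the example uses a synthetic stand-in with two groups of 15 rows; the column names `services` and `aged65` and all values are assumptions, not the World Bank data.

```r
set.seed(2016)
# Synthetic stand-in for the two indicators (NOT the World Bank figures):
# 15 rows around (70, 15) and 15 rows around (50, 5)
dat <- data.frame(
  services = c(rnorm(15, mean = 70, sd = 3),   rnorm(15, mean = 50, sd = 3)),
  aged65   = c(rnorm(15, mean = 15, sd = 1.5), rnorm(15, mean = 5,  sd = 1.5))
)
rownames(dat) <- paste("Country", 1:30)

# K-means with K = 2; nstart = 10 keeps the best of 10 random initializations
fit <- kmeans(dat, centers = 2, nstart = 10)

print(head(fit$cluster))   # cluster label of the first few rows
print(fit$centers)         # the two centroids
```

The `cluster` component holds one label per row, named by the row labels, and `centers` holds the final centroids.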

Again, we list only the first few clustering results. The results may be easier to take in when shown graphically. The following is the sample code.
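One way to show the clustering graphically is a scatter plot colored by cluster, sketched below with base R graphics. The data are the same kind of synthetic stand-in as before (not the World Bank figures), the axis labels are assumptions, and the output file name is hypothetical; the plot is written to a PDF in the temporary directory so that the sketch also runs without a display.

```r
set.seed(2016)
# Synthetic stand-in data (NOT the World Bank figures)
dat <- data.frame(
  services = c(rnorm(15, mean = 70, sd = 3),   rnorm(15, mean = 50, sd = 3)),
  aged65   = c(rnorm(15, mean = 15, sd = 1.5), rnorm(15, mean = 5,  sd = 1.5))
)
rownames(dat) <- paste("Country", 1:30)
fit <- kmeans(dat, centers = 2, nstart = 10)

out_file <- file.path(tempdir(), "kmeans_clusters.pdf")  # hypothetical file name
pdf(out_file)
# One color and plotting symbol per cluster
plot(dat$services, dat$aged65, col = fit$cluster, pch = fit$cluster,
     xlab = "Tertiary industry share of GDP",
     ylab = "Share of population aged 65 or older")
# Mark the two centroids with a larger star symbol
points(fit$centers, pch = 8, cex = 2)
# Label each point with its row name
text(dat$services, dat$aged65, labels = rownames(dat), pos = 3, cex = 0.6)
dev.off()
```

In an interactive session the `pdf()` and `dev.off()` calls can be dropped so that the plot appears on screen.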

The result of running the above code is shown in Figure 13-3.

Another algorithm very similar to K-means is the K-median algorithm. Its details need no elaboration: it is basically the same as K-means, except that every mean is replaced by a median. The idea looks modest, but the K-median algorithm really does exist and is an important complement to, and improvement on, K-means.
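The difference between the two update rules shows up most clearly when a cluster contains an outlier. The following sketch compares the mean-based and median-based centers for one made-up cluster; the values are illustrative only.

```r
# One cluster with an outlier in the last row (illustrative values)
members <- matrix(c(1, 2,
                    2, 1,
                    2, 2,
                    3, 3,
                    50, 60),
                  ncol = 2, byrow = TRUE)

mean_center   <- colMeans(members)            # K-means style update
median_center <- apply(members, 2, median)    # K-median style update

print(mean_center)     # 11.6 13.6, pulled toward the outlier
print(median_center)   # 2 2, stays with the bulk of the points
```

The median-based center stays near the bulk of the points, which is why K-median is less affected by outliers than K-means.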

This article is an English version of an article originally written in Chinese and published on aliyun.com.
