K-means algorithm principle and R language example

Source: Internet
Author: User

Clustering groups similar objects into the same cluster, somewhat like a fully automated classification. The more similar the objects within a cluster are, the better the clustering. The classification methods discussed earlier, such as support vector machines and neural networks, are supervised learning methods, whereas clustering, introduced here, is unsupervised. K-means is the most basic and simplest clustering algorithm.

In the K-means algorithm, the centroid is the core of the cluster prototype (that is, the result obtained by machine learning). While describing the steps of the algorithm, we will show how the centroid is calculated. As you will see, every centroid is computed as the mean of the points in its cluster, except for the initial centroids, which are specified at the start.


First, select K initial centroids (they need not be points from the sample data set), where K is a user-specified parameter giving the expected number of clusters. Each data point is assigned to its nearest centroid, and the set of points assigned to the same centroid forms a cluster. Then, based on this assignment, the centroid of each cluster is updated. The assignment and centroid-update steps are repeated until the data points in each cluster no longer change or, equivalently, until the centroids no longer change.


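The basic K-means algorithm is described as follows:

1. Select K points as the initial centroids.
2. Assign each data point to its nearest centroid, forming K clusters.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids no longer change.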
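The loop described above can be sketched in a few lines of R. This is a minimal illustration, not the article's original listing: the function name simple_kmeans, its arguments, and the synthetic test data are all assumptions made for the example, and squared Euclidean distance is assumed as the distance function.

```r
# Minimal K-means sketch; simple_kmeans and its arguments are illustrative names.
simple_kmeans <- function(x, k, init = NULL, max_iter = 100) {
  x <- as.matrix(x)
  # Initial centroids: either supplied by the caller or k randomly chosen rows.
  centers <- if (is.null(init)) {
    x[sample(nrow(x), k), , drop = FALSE]
  } else {
    as.matrix(init)
  }
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to its nearest centroid.
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when no point changes its cluster.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each centroid to the mean of the points assigned to it.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Synthetic data (an assumption for the demo): two well-separated groups.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
print(fit$centers)
print(table(fit$cluster))
```

Because the initial centroids are drawn at random when init is not given, repeated runs may take a different number of iterations. In practice the built-in kmeans function from the stats package would normally be used instead of a hand-written loop.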



The data in the data set are then reassigned according to the distance from each data point to the new centroids, as shown in Figure 13-2 (c). The algorithm then computes new centroids from this assignment and once again reassigns the data according to the distance from each point to the new centroids. This time the data points in each cluster no longer change, so the algorithm terminates, and the final clustering result is shown in Figure 13-2 (d).

For some combinations of distance function and centroid type, the algorithm always converges to a solution; that is, K-means reaches a state in which the cluster assignments and the centroids no longer change. However, to avoid spending too much time on iterations, in practice a weaker condition often replaces the "centroids no longer change" condition, for example "until only 1% of the points change clusters."
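The weaker stopping rule can be expressed as a small check on two consecutive assignment vectors. The sketch below is only an illustration; the helper names changed_fraction and should_stop and the tol parameter are hypothetical and do not come from the article.

```r
# Fraction of points whose cluster label differs between two iterations.
# changed_fraction and should_stop are hypothetical helper names.
changed_fraction <- function(old_cluster, new_cluster) {
  mean(old_cluster != new_cluster)
}

# Stop once at most a fraction `tol` of the points changed clusters.
should_stop <- function(old_cluster, new_cluster, tol = 0.01) {
  changed_fraction(old_cluster, new_cluster) <= tol
}

# Ten points; only the ninth one moves from cluster 2 to cluster 1.
old <- c(1, 1, 2, 2, 2, 1, 1, 2, 2, 1)
new <- c(1, 1, 2, 2, 2, 1, 1, 2, 1, 1)
print(changed_fraction(old, new))       # one of ten points moved
print(should_stop(old, new))            # FALSE with the 1% threshold
print(should_stop(old, new, tol = 0.2)) # TRUE with a looser 20% threshold
```

The built-in kmeans function takes a different approach to bounding the run time: its iter.max argument simply caps the number of iterations.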

Although K-means clustering is relatively simple, it is quite effective. Some of its variants are even more effective and are less affected by initialization problems. However, K-means is not suitable for all types of data. It cannot handle non-spherical clusters or clusters of different sizes and densities, although it can usually find pure subclusters when a large enough number of clusters is specified. K-means also has trouble clustering data that contains outliers; in that case, detecting and removing the outliers helps. Another problem is that K-means is sensitive to the choice of initial values, so the number of iterations can vary considerably from one choice of initial centroids to another. The choice of K is also a problem: the algorithm cannot determine by itself how many clusters the data set should be divided into. Finally, K-means is limited to data for which the notion of a centroid (mean) is meaningful. A related technique, K-medoids (K-center-point) clustering, has no such limitation. In K-medoids clustering, the representative chosen in each step is no longer the mean but a medoid, that is, an actual data point located most centrally in its cluster. The other details of that algorithm are much the same as those of K-means, so we do not repeat them.
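Two of these issues, sensitivity to the initial values and the choice of K, can be explored with the built-in kmeans function. The following is a sketch on synthetic data (the data and the range K = 1..6 are assumptions for illustration). It compares several single random starts, uses the nstart argument to keep the best of many starts, and plots the total within-cluster sum of squares against K, a common heuristic often called the elbow method.

```r
set.seed(1)
# Synthetic data: three well-separated groups in two dimensions.
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2),
           cbind(rnorm(30, mean = 0), rnorm(30, mean = 6)))

# Sensitivity to initial values: single random starts may differ in the
# number of iterations and in the total within-cluster sum of squares.
single_runs <- t(sapply(1:5, function(s) {
  set.seed(s)
  fit <- kmeans(x, centers = 3, nstart = 1)
  c(iterations = fit$iter, tot.withinss = fit$tot.withinss)
}))
print(single_runs)

# nstart = 25 tries 25 random starts and keeps the best result.
best <- kmeans(x, centers = 3, nstart = 25)
print(best$size)

# Choosing K: total within-cluster sum of squares for K = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
plot(1:6, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

The curve usually drops sharply until K reaches the number of natural groups and flattens afterwards. This is a visual aid only; it does not make the algorithm choose K by itself.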


Finally, we give a practical example. (The code is written in R, my favorite language for data mining.)


A data set from the World Bank records two indicators for 30 countries. We read the file with the following code and display the first few rows. As can be seen, the data are arranged in columns. The first column holds the country name, which is irrelevant to the cluster analysis that follows; we care more about the other two columns. The second column gives the share of the tertiary industry in the country's GDP, and the last column gives the proportion of the population aged 65 or older (that is, the elderly).




To make later processing easier, we apply some necessary preprocessing to the data that was read in: adjust the column labels and use the country names as row labels (then delete the column containing the country names).
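The article's data file and original listing are not available here, so the preprocessing can only be sketched on a stand-in data frame. In the sketch below the placeholder country names, the random values, and the column names services_pct_gdp and aged65_pct are all assumptions; they are not the World Bank figures.

```r
# Stand-in for the data as read from file: placeholder names and random
# values, NOT the World Bank figures used in the article.
set.seed(7)
raw <- data.frame(V1 = paste("Country", LETTERS[1:6]),
                  V2 = round(runif(6, 40, 80), 1),
                  V3 = round(runif(6, 3, 20), 1))

# Adjust the column labels (the names chosen here are assumptions).
names(raw) <- c("country", "services_pct_gdp", "aged65_pct")

# Use the country names as row labels, then drop that column.
rownames(raw) <- raw$country
countries <- raw[, c("services_pct_gdp", "aged65_pct")]
print(head(countries))
```

Keeping only numeric columns matters because kmeans expects a numeric matrix; the row labels survive and can be used later to label points in a plot.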


If you draw a scatter plot of these data, you can easily see that they fall broadly into two groups. In fact, half of the countries in the data set are OECD members, while the other half are developing countries (including some ASEAN, South Asian, and Latin American countries). So we can use the following code to perform a K-means cluster analysis.
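Since the original listing is missing, here is a sketch of what such a call might look like, using the built-in kmeans function with two centers. The data frame is synthetic: two groups of 15 placeholder countries with made-up values, standing in for the 30 real countries. The column names and the nstart value are assumptions.

```r
# Synthetic stand-in for the 30-country data set (values are made up).
set.seed(13)
countries <- data.frame(
  services_pct_gdp = c(rnorm(15, mean = 72, sd = 4), rnorm(15, mean = 50, sd = 5)),
  aged65_pct       = c(rnorm(15, mean = 17, sd = 2), rnorm(15, mean = 6, sd = 1.5)),
  row.names = paste("Country", 1:30)
)

# K-means with K = 2; nstart = 10 keeps the best of ten random starts.
fit <- kmeans(countries, centers = 2, nstart = 10)

print(fit$centers)        # one row per cluster centroid
print(fit$size)           # number of countries in each cluster
print(head(fit$cluster))  # cluster label of the first few countries
```

The two indicators are on different scales, so with real data it may be worth standardizing the columns (for example with scale) before clustering. Whether the article did so is not known.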



We again list only the first few clustering results. The results may be easier to grasp when shown graphically; the following is sample code.
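A plot of this kind can be sketched with base R graphics. The listing below is not the article's original code; it rebuilds a synthetic stand-in data set (made-up values and assumed column names), clusters it, and colors each point by its cluster, with the centroids marked by stars.

```r
# Synthetic stand-in data (values are made up, not the World Bank figures).
set.seed(13)
countries <- data.frame(
  services_pct_gdp = c(rnorm(15, mean = 72, sd = 4), rnorm(15, mean = 50, sd = 5)),
  aged65_pct       = c(rnorm(15, mean = 17, sd = 2), rnorm(15, mean = 6, sd = 1.5)),
  row.names = paste("Country", 1:30)
)
fit <- kmeans(countries, centers = 2, nstart = 10)

# Scatter plot with one color per cluster.
plot(countries$services_pct_gdp, countries$aged65_pct,
     col = fit$cluster, pch = 19,
     xlab = "Tertiary industry (% of GDP)",
     ylab = "Population aged 65 or older (%)")

# Mark the two centroids with larger star symbols.
points(fit$centers[, 1], fit$centers[, 2], pch = 8, cex = 2, col = 1:2)

# Label each point with its row name.
text(countries$services_pct_gdp, countries$aged65_pct,
     labels = rownames(countries), pos = 3, cex = 0.6)
```

When run non-interactively (for example with Rscript), base R writes the plot to a PDF file instead of opening a window.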



Running the above code produces the result shown in Figure 13-3.






Another algorithm very similar to K-means is the K-median algorithm. There is no need to describe it in detail: it is basically the same as K-means, except that every mean is replaced by a median. The idea may look almost too simple, but the K-median algorithm really does exist, and it is an important complement to and improvement on K-means.
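The difference between the two algorithms lies in the centroid update step, which can be sketched with one helper that takes the summary function as a parameter. The helper name update_centers and the tiny data set are hypothetical, chosen only to show how an outlier affects each kind of center.

```r
# update_centers is a hypothetical helper: it recomputes each cluster's
# center with the supplied summary function, column by column.
update_centers <- function(x, cluster, k, center_fun = mean) {
  t(sapply(seq_len(k), function(j) {
    apply(x[cluster == j, , drop = FALSE], 2, center_fun)
  }))
}

# Two small clusters; the last point of cluster 2 is an outlier.
x <- rbind(c(1, 1), c(2, 1), c(1, 2),
           c(8, 8), c(9, 8), c(8, 9), c(50, 50))
cluster <- c(1, 1, 1, 2, 2, 2, 2)

mean_centers   <- update_centers(x, cluster, k = 2, center_fun = mean)
median_centers <- update_centers(x, cluster, k = 2, center_fun = median)
print(mean_centers)    # cluster 2's mean is pulled toward the outlier
print(median_centers)  # cluster 2's median stays near the bulk of the points
```

In this example the mean of cluster 2 is (18.75, 18.75), while the coordinate-wise median is (8.5, 8.5), which is one reason K-median is considered more robust to outliers. K-median is usually paired with Manhattan rather than Euclidean distance in the assignment step; that part is not shown here.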



