A Distance-Based Partition Clustering Algorithm


Ye Ruofen, Li Chunping
(School of Software, Tsinghua University, Beijing 100084, China)

Abstract: The K-means algorithm is well known for its efficiency in clustering large data sets. However, because it works only on numeric values, it cannot be used to cluster real-world data containing categorical values, such as objects described by attributes like color, texture, and shape. To cluster categorical data, the K-means algorithm has been extended into two new algorithms, K-modes and K-prototypes. Both, however, require the user to predefine the number of clusters K, the threshold T, and the cluster centers Q, and it is not easy to determine these three parameters accurately without understanding the distribution of the data. The improved K-modes algorithm presented in this paper effectively solves this problem.
Key words: clustering, K-means, K-modes, K-prototypes, dissimilarity
1. Introduction
Data mining is one of the most active branches of database research, development, and application. Discovering useful knowledge and interesting data models from large amounts of data has become a natural demand [1]. With the rapid development of data mining research, many data mining methods have emerged. Clustering is among the most basic of these: it can be applied independently, or it can serve as a preliminary step for other data mining methods.
Within clustering, the K-means algorithm is one of the best-known and most commonly used partitioning methods, and many variants of it have been developed to meet practical needs. K-means can efficiently process large and high-dimensional data sets, but it can only cluster numeric data, because it relies on the Euclidean distance to measure the difference between data objects; it cannot process non-numeric, that is, categorical, data. Real-world data, however, comes in many types, and to give full play to data mining tools, designing tools that handle mixed data has become an inevitable trend. To cluster categorical data, the K-means algorithm was extended into two new algorithms: K-modes and K-prototypes. The K-modes algorithm uses a simple matching dissimilarity measure for categorical values, while its clustering process is the same as that of K-means. The K-prototypes algorithm combines the K-means and K-modes algorithms to measure the dissimilarity of mixed numeric and categorical data [2, 3]. Both extended algorithms remain very effective for clustering large and high-dimensional data sets, but both require the original data set to be partitioned into a predefined number of clusters K, and the value of K has a great impact on the clustering result. This paper provides an effective solution to that problem.
2. K-means algorithm
Clustering divides data into several groups such that, by the defined measure, data within the same group are highly similar to each other and dissimilar to data in other groups [4]. Clustering is one of the most basic operations in data mining, yet some existing traditional clustering methods cannot handle complex, high-dimensional data sets with arbitrarily shaped distributions.
The K-means algorithm is the most widely used traditional clustering method and is a partitioning method. It computes the distance between each data object and each cluster center and assigns the object to the closest cluster. Its workflow is as follows: first, K objects are selected at random, each initially representing the mean, or center, of a cluster. The remaining objects are assigned to the nearest cluster according to their distance from each cluster center. The mean of each cluster is then recalculated to find the new cluster centers, and the objects are reassigned. This process repeats until the criterion function converges. The time complexity of the algorithm is O(NKT), where N is the number of objects, K is the number of clusters, and T is the number of iterations, so it is highly efficient. Its disadvantages are that it can only process numeric data, not categorical data, that it is very sensitive to outliers, and that it cannot handle non-convex clusters [1].
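The workflow just described can be sketched in Python. This is a minimal illustration of the standard algorithm, not the implementation used in any cited work; the function name and data layout are chosen freely here.

```python
import random

def kmeans(points, k, max_iters=100):
    """Plain K-means: points is a list of equal-length numeric tuples."""
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each object goes to the nearest center,
        # measured by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # criterion function has converged
            break
        centers = new_centers
    return centers, clusters
```

Each of the T iterations touches all N points against K centers, which is where the O(NKT) complexity comes from.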
3. Introduction to the K-modes Algorithm
The K-modes algorithm changes the similarity measure of the K-means algorithm, clustering categorical data with a simple matching dissimilarity measure.
3.1 Simple Matching Dissimilarity Measure
Let X and Y be two objects in a categorical data set, each described by m attributes (x1, x2, ..., xm) and (y1, y2, ..., ym). The dissimilarity between the two objects is:

d(X, Y) = Σ (j = 1..m) δ(xj, yj),

where δ(xj, yj) = 0 when xj = yj, and δ(xj, yj) = 1 when xj ≠ yj.
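In code, the measure is simply a count of mismatched attributes. A minimal sketch (the function name is ours, not from the paper):

```python
def simple_matching_distance(x, y):
    """d(X, Y): the number of attributes on which the two objects differ."""
    assert len(x) == len(y), "objects must have the same number of attributes"
    return sum(1 for xj, yj in zip(x, y) if xj != yj)
```

For example, ("red", "round", "smooth") and ("red", "square", "smooth") differ only in shape, so their dissimilarity is 1.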
3.2 Workflow of the K-modes Algorithm
(1) Predefine K clusters and determine the mode Q (cluster center) of each cluster.
(2) Assign each object to the nearest cluster according to its mode Q, then update each cluster's mode Q and reassign the objects.
(3) Repeat (2) until no object changes cluster [2].
Here K is the number of clusters into which the data set is divided, the mode Q is a vector representing a cluster's center, and T is a threshold value. The key to this algorithm is determining K, Q, and T.
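The three steps above mirror the K-means loop, with matching counts in place of Euclidean distance and per-attribute most-frequent values in place of means. A hedged sketch, assuming the simple matching measure of Section 3.1; all names here are illustrative:

```python
from collections import Counter
import random

def matching_distance(x, y):
    # Simple matching dissimilarity: count of differing attributes.
    return sum(1 for a, b in zip(x, y) if a != b)

def cluster_mode(objects):
    # Mode vector Q: the most frequent value of each attribute in the cluster.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))

def kmodes(objects, k, max_iters=100):
    """Basic K-modes: objects is a list of equal-length categorical tuples."""
    modes = random.sample(objects, k)
    for _ in range(max_iters):
        # Step (2a): assign each object to the cluster with the nearest mode.
        clusters = [[] for _ in range(k)]
        for obj in objects:
            d = [matching_distance(obj, q) for q in modes]
            clusters[d.index(min(d))].append(obj)
        # Step (2b): update each cluster's mode (empty clusters keep theirs).
        new_modes = [cluster_mode(cl) if cl else modes[i]
                     for i, cl in enumerate(clusters)]
        if new_modes == modes:  # step (3): stop when nothing changes
            break
        modes = new_modes
    return modes, clusters
```

Note that this basic form still leaves K (and the initial modes) to the user, which is exactly the problem the improved algorithm below addresses.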
3.3 Improvements to the K-modes Algorithm
When clustering an original data set with the K-modes algorithm, the user must specify the number of clusters K, the mode vector Q representing each cluster center, and the threshold T. For large data sets, a user choosing these three parameters blindly generally obtains poor results. The following describes an algorithm for determining the three parameters.
1. Determining the number of clusters K and the threshold T
To determine the number of clusters K, we cluster a sample of the data using a distance-based similarity calculation. The specific process is as follows.
(1) Randomly take M points from the original data as sample data, put them into a set S, and define an initial threshold T = t0.
(2) Mark all sample data as unclustered and set k = 0.
(3) Select an initial point P from the unclustered points of the sample set:
{ mark P as belonging to cluster Ck;
recursively traverse from P in depth-first order, computing P' = near(P, T);
if P' is not empty, assign it to Ck; otherwise backtrack to the previous point and continue searching;
update D, the average distance between each pair of points in cluster Ck;
update the threshold T = t0 × D. }
(4) If unclustered points remain, set k = k + 1 and repeat (3).
The function near(P, T) is:
{ find the nearest unclustered neighbor P' of P such that dist(P', P) ≤ T;
if no such P' exists, return null; otherwise return P'. }
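The steps above can be sketched as follows. This is a hedged reading of the procedure, assuming the simple matching distance of Section 3.1 as dist; the function `estimate_k` and its details are illustrative names and choices, not from the paper.

```python
from itertools import combinations

def matching_distance(x, y):
    # Simple matching dissimilarity from Section 3.1.
    return sum(1 for a, b in zip(x, y) if a != b)

def estimate_k(sample, t0):
    """Grow clusters from unclustered seeds by depth-first neighbor expansion.

    Returns the clusters found in the sample; their count estimates K, and
    each cluster's final threshold T = t0 * D estimates the threshold, where
    D is the average pairwise distance inside the cluster.  t0 is the
    user-supplied conservative initial threshold.
    """
    unclustered = set(range(len(sample)))
    clusters = []
    while unclustered:  # step (4): start a new cluster while points remain
        seed = next(iter(unclustered))  # step (3): pick an unclustered point P
        unclustered.remove(seed)
        cluster, stack, t = [seed], [seed], t0
        while stack:
            p = stack.pop()  # depth-first traversal
            # near(P, T): unclustered points within the current threshold.
            neighbours = [q for q in unclustered
                          if matching_distance(sample[p], sample[q]) <= t]
            for q in neighbours:
                unclustered.remove(q)
                cluster.append(q)
                stack.append(q)
            if len(cluster) > 1:
                # Update D and the dynamic threshold T = t0 * D.
                pairs = list(combinations(cluster, 2))
                d_avg = sum(matching_distance(sample[a], sample[b])
                            for a, b in pairs) / len(pairs)
                t = t0 * d_avg
        clusters.append([sample[i] for i in cluster])
    return clusters
```

Because every point already in the cluster can recruit further neighbors, the clusters found this way can take arbitrary shapes rather than being spheres around a single center.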
The algorithm uses a classical distance-based clustering method, with the simple matching measure as the distance value. Samples are read randomly from the raw data into memory and marked as unclustered, and a starting point P is chosen. An initial threshold t0 is given, generally a small, conservative value. Points P' whose distance from P is smaller than the threshold are found and placed in the same cluster as P, and the dissimilarities among the objects already in the cluster are used to update the threshold T. The search then continues from the neighbors of P, so that the whole sample set is traversed by depth-first search until no further point can enter the cluster. A point that has not yet entered any cluster is then marked as the seed of the next cluster, and the process repeats until every point has been traversed, yielding K clusters.
Although this is a classical distance-based similarity method, during sample clustering it does not build each cluster around a single point: every point already in the cluster serves as a center from which the cluster continues to grow in depth-first fashion. Clusters of arbitrary shape can therefore be obtained, no longer only spherical clusters centered on one point, so the method performs better on arbitrarily distributed data. In addition, rather than the threshold T remaining the predefined static value t0, as the cluster keeps absorbing new points and its average internal distance D is updated, the threshold T changes dynamically with D.
2. Using the clustering result obtained from the sample data to compute the cluster mode Q
The mode vector Q is computed with a frequency-based method as follows. For each cluster, compute the proportion of objects taking each value of each attribute; this proportion fr is called the relative frequency. Let n(k, j) be the number of objects in cluster k taking a given value aj on attribute j; then fr(Aj = aj | Ck) = n(k, j) / |Ck|. For each attribute, the value with the highest relative frequency in the cluster is taken as that attribute's value qj in the vector Q. Doing this for every attribute yields the mode vector Q.
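The frequency-based computation of Q can be written compactly. A minimal sketch; `mode_vector` is our name for it:

```python
from collections import Counter

def mode_vector(cluster):
    """Frequency-based mode Q: for each attribute j, pick the value whose
    relative frequency fr = n(k, j) / |Ck| is highest within the cluster."""
    q = []
    for column in zip(*cluster):  # one column per attribute
        value, count = Counter(column).most_common(1)[0]
        q.append(value)  # fr = count / len(cluster) is maximal for this value
    return tuple(q)
```

For instance, in a cluster [("red", "round"), ("red", "square"), ("blue", "round")], "red" has relative frequency 2/3 on the first attribute and "round" 2/3 on the second, so Q = ("red", "round").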
With the three parameters in hand, namely the number of clusters K, the threshold T, and the modes Q, the complete K-modes algorithm proceeds as follows:
(1) Using the three parameters K, T, and Q, assign each object to the cluster with the nearest mode Q.
(2) Update the mode of each cluster based on relative frequencies.
(3) Reassign objects according to their dissimilarity to the new modes: whenever an object is found to be closer to another cluster's mode, it is moved.
(4) Repeat steps (2) and (3) until no object changes cluster.
With this improved K-modes algorithm, the computer can determine the number of clusters K, the cluster centers Q, and the threshold T, so that the subsequent clustering of the original data set performs better. The algorithm still requires the user to choose an initial threshold t0 subjectively when clustering the sample data, however. The samples are selected uniformly or at random, and the initial threshold should generally be a conservative value, because of interference from isolated points.
4. Introduction to the K-prototypes Algorithm
The K-modes algorithm can process categorical data, high-dimensional data, and large data sets; it uses a depth-first graph search to compute the initial number of clusters, updates the cluster-center vectors with the frequency-based mode, and obtains the threshold from the average dissimilarity. It is a very effective method for clustering categorical data. In practical applications, however, one often encounters data whose object attributes include both numeric and categorical descriptions. To solve this problem, the K-prototypes algorithm combines the K-means and K-modes algorithms. Its dissimilarity measure can be expressed as follows. Consider two mixed objects X and Y whose attributes are A1^r, A2^r, ..., Ap^r, A(p+1)^c, ..., Am^c: the first p attributes are numeric, and the last m − p attributes are categorical. The dissimilarity between the two data objects X and Y is:

d(X, Y) = Σ (j = 1..p) (xj − yj)² + γ Σ (j = p+1..m) δ(xj, yj),

where the first term is the squared Euclidean distance over the numeric attributes, the second term is the simple matching dissimilarity over the categorical attributes, and γ is a weight that balances the influence of numeric and categorical attributes in the cluster measure. For the mode Q of each cluster, the first p attributes are numeric, and the mean of attribute i within the cluster is used as the value qi; each of the remaining m − p attributes takes the value with the highest relative frequency in the cluster. By combining the K-means and K-modes algorithms in this way, the K-prototypes algorithm measures the similarity of mixed data objects.
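The mixed measure translates directly into code. A minimal sketch of the formula above, with names of our choosing; the first p positions of each tuple are assumed numeric and the rest categorical:

```python
def kprototypes_distance(x, y, p, gamma):
    """Mixed dissimilarity: squared Euclidean distance on the first p
    numeric attributes, plus gamma times the simple matching count on
    the remaining categorical attributes."""
    numeric = sum((x[j] - y[j]) ** 2 for j in range(p))
    categorical = sum(1 for j in range(p, len(x)) if x[j] != y[j])
    return numeric + gamma * categorical
```

For example, with p = 2 and γ = 0.5, the objects (1.0, 2.0, "red") and (2.0, 2.0, "blue") have dissimilarity 1 + 0.5 × 1 = 1.5. The choice of γ matters: a large γ lets the categorical attributes dominate, a small γ lets the numeric ones dominate.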
5. Conclusion and future work
The K-means algorithm owes its prominence in data mining to its ability to cluster large data sets effectively, but it can only be applied to numeric data. The K-modes and K-prototypes algorithms, with their new dissimilarity measures, make up for this defect while inheriting the efficiency of K-means. Unfortunately, both algorithms require the number of clusters and the cluster modes Q to be determined in advance, and if this problem is not solved it directly affects the final clustering result. The solution presented here, using a classical distance-based algorithm to determine the number of clusters K and a frequency-based method to determine the cluster modes Q, is a comparatively effective one.
A remaining problem is that whether the selected samples are biased depends largely on the user's familiarity with the field the data comes from, and in real life people know little about many fields. The K-modes and K-prototypes algorithms should therefore be studied further, so that users can set more accurate parameters and achieve better clustering results.
References
[1] Han Jiawei, Micheline Kamber. Data Mining: Concepts and Techniques. Beijing: Machinery Industry Press, 2001.
[2] Huang Zhexue. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 1998, 2: 283–304.
[3] Daniel Barbará. Using self-similarity to cluster large data sets. Data Mining and Knowledge Discovery, 2003, 7: 123–152.
[4] Dharmendra S. Modha, W. Scott Spangler. Feature weighting in k-means clustering. Machine Learning, 2003, 52: 217–237.
