Implementing basic multi-dimensional K-Means clustering in C# (code tutorial)

Preface

Recently, in our C# class, the instructor mentioned that our course grade is made up of several parts. The four components of the comprehensive evaluation are "final project presentation", "group chat record score", "anonymous peer evaluation by group members", and "report score". The instructor hoped that I could use these four items to cluster all the students and then assign each final grade based on the distance from the center of each cluster. Since I had never touched a clustering algorithm before, I chose a convenient and intuitive one, K-Means, and in this article I will share some of my experience. Because it is a C# course, the algorithm is introduced with C# examples.

Clustering & K-Means clustering

In the encyclopedia, clustering is interpreted as follows:

The process of dividing a set of physical or abstract objects into multiple classes composed of similar objects is called clustering. A cluster produced by clustering is a collection of data objects; these objects are similar to other objects in the same cluster and dissimilar to objects in other clusters.

In simple terms, clustering gathers similar data objects into the same cluster, so that similarity is high within a cluster and low between different clusters.

Differences between clustering and Classification

Classification: judges new data based on models built from existing data whose categories are already known; the sample data has already been labeled with its class. It is a supervised process.

Clustering: the purpose is also to divide the data into groups, but before clustering finishes we do not know how the data will be divided or what the characteristics of each group are. The algorithm simply puts highly similar data together automatically. The data carries no labels and there are no predefined categories; it is an unsupervised process.

K-Means algorithm

The K-Means algorithm is a distance-based clustering algorithm (in this article I use Euclidean distance). The distance between feature vectors is taken as the measure of similarity: the closer two data objects are, the greater their similarity is considered. In K-Means, clusters are composed of data objects that lie close together, so the algorithm yields clusters that are relatively compact and well separated.

Combination of the scoring system and K-Means

What the instructor asked me to do is this: using each student's data in the four dimensions "final project presentation", "group chat record score", "anonymous peer evaluation by group members", and "report score", cluster all students with K-Means into five categories, corresponding to 90-100 / 80-90 / 70-80 / 60-70 / fail. After the split, the five grade anchors for the cluster centers are 95 / 85 / 75 / 65 / 55 respectively, and the final score is determined by the Euclidean distance from the data to its cluster center. This is what the teacher asked me to implement.
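As a rough illustration of the grading idea (my own sketch, not the instructor's exact formula), once the five cluster centers are known we can rank them and map each cluster to its grade anchor. The method name MapClustersToGrades and the ranking-by-average rule below are assumptions for illustration:

using System;
using System.Linq;

// Grade anchors for the five clusters, best to worst.
static readonly double[] GradeAnchors = { 95, 85, 75, 65, 55 };

// Rank clusters by the mean of their center coordinates (higher mean =
// better cluster, an assumption) and return the anchor grade per cluster.
static double[] MapClustersToGrades(double[][] centers)
{
    int[] order = Enumerable.Range(0, centers.Length)
                            .OrderByDescending(i => centers[i].Average())
                            .ToArray();
    double[] grades = new double[centers.Length];
    for (int rank = 0; rank < order.Length; rank++)
        grades[order[rank]] = GradeAnchors[rank];
    return grades;
}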

Non-applicability of K-Means

The K-Means algorithm is suited to continuous data. Depending on how the data is distributed, it sometimes cannot recover accurate classes, as in the following two cases.

1. Non-Standard Normal Distribution

K-Means reflects the clustering tendency well only when the data roughly follows a normal distribution, or is at most mildly skewed. If the skew is severe, or there are abnormal extreme values, the mean is strongly affected, and at that point K-Means can no longer represent the clustering tendency.

Therefore, these assumptions may not hold for real samples, in which case the basic K-Means algorithm is not particularly appropriate. We should then use improved algorithms such as Kernel K-Means or Spectral Clustering; these two algorithms will be introduced in later articles.

2. Uneven distribution

When the distribution is uneven, the density of different regions affects the selection of centers; for example, the relatively dense region on the left of the figure pulls the centers toward it, as shown below:

Advantages and disadvantages of K-Means

Advantages:

1. The algorithm is simple, efficient, and iterative.

2. The time complexity of the K-Means clustering algorithm is O(nkt), where n is the number of data objects, k the number of clusters, and t the number of iterations, so the running time is approximately linear in the data size.

3. It refines the result by optimized iteration, repeatedly recomputing the centers to correct the clustering.

4. Good data scalability.

Disadvantages:

1. The value of k must be specified. Before each clustering run we must decide in advance how many classes the data is to be divided into, that is, how many centers there will be.

2. Sensitive to outliers, because the cluster center must be regenerated after each pass and the center is the "centroid" of all data objects in the cluster. When the dataset is not large, outlier points therefore have a large influence on the center. This can be handled by preprocessing the dataset to screen out the outliers; all outlier points can be placed in an extra (k+1)-th class, since they are special by nature and can be studied separately during analysis.

3. The choice of initial centers affects the clustering result. We usually select the initial centers at random, but randomly chosen centers may be unreasonably placed relative to one another, so different initializations can produce different final clusterings.

4. The final clusters are spherical. Because Euclidean distance is used, each cluster necessarily ends up as a sphere (a circle in 2-D) with the cluster center as its center.

5. The contribution of each dimension to the clustering result cannot be determined, so we cannot tell which dimension has the greater influence on the outcome.

Some details of K-Means

Euclidean distance

Euclidean distance is a commonly used distance measure. It generalizes the familiar planar distance formula to n-dimensional space. Taking 3-D space as an example, the distance between point A and point B given by the Euclidean formula is the length of the gray line segment between them in the figure: d(A, B) = √((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2).
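As a small C# sketch of this formula (a generic helper, not code taken from the grading program):

using System;

// Euclidean distance between two n-dimensional points:
// d(a, b) = sqrt((a1-b1)^2 + ... + (an-bn)^2).
static double EuclideanDistance(double[] a, double[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Points must have the same dimension.");
    double sum = 0;
    for (int i = 0; i < a.Length; i++)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.Sqrt(sum);
}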

Calculating the centroid

A centroid is needed because, after each pass of clustering, a new center must be generated from the data objects now in the cluster; the new center is the "centroid" of all those data objects. The formula also extends to n dimensions; here we take 3-D space as an example.

The "Center of center" of the three points A/B/C is the gray point in the figure. The coordinates are (X1 + X2 + X3)/3) and (Y1 + Y2 + Y3) /3, (Z1 + Z2 + Z3)/3) Extend to n-dimensional (X1 +... + Xn)/n ,..., (Z1 +... Zn)/n )).

K-Means algorithm implementation steps

Here are the steps of the basic K-Means algorithm, without any optimizations. There are five steps in total, as shown below:

1. First, as mentioned above, K-Means must be given the number of clusters k in advance, so that the clustering algorithm produces k groups.

2. Randomly select k points from the dataset we want to cluster as the initial center of each cluster.

3. Use some distance measure (here, Euclidean distance) to compute the distance between each data object and every cluster center, find the closest center, and assign the data object to that cluster.

4. Once all points have been assigned, compute a new center for each cluster (the "centroid" found with the formula above).

5. Repeat steps 3 and 4 until the distance between each new center from step 4 and the previous center is smaller than some preset threshold (the centers hardly change any more); the algorithm then stops, and the clustering result is final. A compact sketch of the whole loop follows.
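Putting the five steps together, a minimal generic K-Means loop in C# might look like the sketch below. It reuses the EuclideanDistance and Centroid helpers from earlier and assumes using System.Linq; it is only an outline of the steps, not the grading program's actual code (that comes later in the article):

// data: one double[] of features per object; k: number of clusters;
// threshold: stop when the centers move less than this in total.
static int[] KMeans(double[][] data, int k, double threshold, Random rng)
{
    // Step 2: choose k random data points as the initial centers.
    double[][] centers = data.OrderBy(_ => rng.Next())
                             .Take(k)
                             .Select(p => (double[])p.Clone())
                             .ToArray();
    int[] assignment = new int[data.Length];
    double shift;
    do
    {
        // Step 3: assign each point to its nearest center.
        for (int i = 0; i < data.Length; i++)
        {
            int best = 0;
            for (int j = 1; j < k; j++)
                if (EuclideanDistance(data[i], centers[j]) <
                    EuclideanDistance(data[i], centers[best]))
                    best = j;
            assignment[i] = best;
        }
        // Step 4: recompute each center as the centroid of its members;
        // Step 5: track how far the centers moved this iteration.
        shift = 0;
        for (int j = 0; j < k; j++)
        {
            var members = data.Where((p, i) => assignment[i] == j).ToList();
            if (members.Count == 0) continue; // keep the old center if a cluster is empty
            double[] newCenter = Centroid(members);
            shift += EuclideanDistance(centers[j], newCenter);
            centers[j] = newCenter;
        }
    } while (shift > threshold);
    return assignment;
}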

Next, let's work through an example to understand the K-Means algorithm better. We randomly generate seven points, whose positions are shown in the figure:

Obviously, if we grouped them by eye (with supervision), the result would be as shown in the figure:

Let's take a look at the clustering process based on the K-Means algorithm.

1. First, we randomly select two points as the cluster centers; say we select A1 and B3.

2. Calculate the distance between each data object and each cluster center.

The distance between A2 and A1 is ((1-2)^2 + (1-2)^2)^(1/2) = √2, and the distance between A2 and B3 is ((1-6)^2 + (1-3)^2)^(1/2) = √29. Since √2 < √29, A2 is assigned to the cluster centered on A1. Classifying the other points in the same way gives the result shown in the figure:
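Just to sanity-check those two numbers with the EuclideanDistance helper from earlier (coordinates read off the example: A2 = (1, 1), A1 = (2, 2), B3 = (6, 3)):

double dToA1 = EuclideanDistance(new double[] { 1, 1 }, new double[] { 2, 2 }); // √2  ≈ 1.414
double dToB3 = EuclideanDistance(new double[] { 1, 1 }, new double[] { 6, 3 }); // √29 ≈ 5.385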

3. Recalculate the center of each cluster.

We now have two clusters. Applying the centroid formula to the data objects in each cluster, the new center of the red group A is ((2 + 1 + 3)/3, (2 + 1 + 1)/3) = (2, 4/3), and the new center of the brown group B is ((5 + 6 + 6 + 7)/4, (4 + 4 + 3 + 2)/4) = (6, 3). We find that B's new center is the point B3 itself, so group B does not need to be clustered again.

For group A, computing the centroid of A1, A2, and A3 again yields the same point (2, 4/3), so the center no longer changes and the clustering stops.

4. The final clustering result is {A1, A2, A3} and {B1, B2, B3, B4}.

Implementation of the K-Means algorithm in C#

(Here I show the K-Means implementation without any optimizations.)

1. First, initialize the centers.

Based on the number of clusters we set, we randomly select k points from all the data objects as the initial center points. This only requires the Next method of the Random class, so I will not describe the step in detail; a possible sketch follows.
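For completeness, here is what such an InitCenter method might look like. The field names (ClassNum, RowCount, CenterArrayParams, XlsDataSh) mirror the code later in the article, but the body itself is my own reconstruction:

using System.Collections.Generic;

// Randomly pick ClassNum distinct row indices and copy their four
// dimension values (ListView columns 3-6) as the initial centers.
private void InitCenter()
{
    Random rng = new Random();
    HashSet<int> chosen = new HashSet<int>();
    for (int j = 0; j < ClassNum; j++)
    {
        int row;
        do { row = rng.Next(RowCount - 1); } while (!chosen.Add(row));
        for (int k = 3; k < 7; k++)
            CenterArrayParams[j, k - 3] =
                System.Convert.ToDouble(XlsDataSh.Items[row].SubItems[k].Text);
    }
}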

2. Implementation of clustering steps

Explanation of code features:

1. My program first loads the data from an Excel .xls file into a ListView control on the C# form. The control is named XlsDataSh; this name appears in the program code below.

2. Our .xls file does not contain only these four scores; the other columns hold other information. The row number, name, and student ID occupy columns 0/1/2 of the ListView, and columns 3-6 correspond to the four dimensions. That is why loops like for (int k = 3; k < 7; k++) appear repeatedly in the code.

The steps inside the clustering method run as follows:

Explanation:

1. The ArrayLists in ClusterAssem store, for each cluster, the indices of the data objects it owns. Every time we recompute the cluster centers we re-cluster all the data, so we must first clear the data stored in each ArrayList.

2. Checking whether a cluster has 0 members is an important step. This is also where I was a bit "lazy": because the initial centers are chosen at random, it is quite possible that a chosen center receives no data points after clustering, and that class is then invalid. We should handle this; the method I use is simply to select random centers again, until every class is allocated some data points.

3. ClassNum: the configured number of clusters, i.e. k.

4. RowCount: the number of data objects, i.e. n.

The code in the clustering method is as follows:

private void Cluster()
{
    int tmpClass = 0;
    double tmpClusDis = 0, tmpClusMinDis = 0;
    // Clear the membership recorded for each cluster in the previous pass
    for (int i = 0; i < ClassNum; i++)
    {
        ClusterAssem[i].Clear();
    }
    // Compute the squared Euclidean distance from each data object to every
    // center and assign the object to the nearest cluster
    for (int i = 0; i < RowCount - 1; i++)
    {
        tmpClusMinDis = double.MaxValue;
        for (int j = 0; j < ClassNum; j++)
        {
            tmpClusDis = 0;
            // Extract each of the four dimensions (ListView columns 3-6)
            // and accumulate the squared differences
            for (int k = 3; k < 7; k++)
            {
                tmpClusDis += Math.Pow(System.Convert.ToDouble(XlsDataSh.Items[i].SubItems[k].Text) - CenterArrayParams[j, k - 3], 2);
            }
            if (tmpClusDis < tmpClusMinDis)
            {
                tmpClass = j;
                tmpClusMinDis = tmpClusDis;
            }
        }
        ClusterAssem[tmpClass].Add(i);
    }
    // If any cluster received no data points, reinitialize the centers
    // and cluster again
    if (ClusterAssem[0].Count == 0 || ClusterAssem[1].Count == 0 || ClusterAssem[2].Count == 0 || ClusterAssem[3].Count == 0 || ClusterAssem[4].Count == 0)
    {
        InitCenter(); // Method that reinitializes the centers
        Cluster();
    }
}
3. Regenerate the center of each cluster

Explanation:

ClusterAssem: the array of ArrayLists storing the data objects contained in each cluster.

RenewCenterArrayParams: the 2-D array storing each dimension of each new cluster center.

private void RenewCenter()
{
    double tmpSameDis = 0;
    for (int i = 0; i < ClassNum; i++)
    {
        for (int k = 3; k < 7; k++)
        {
            tmpSameDis = 0;
            // Traverse the points of cluster i and sum this dimension's values
            foreach (object n in ClusterAssem[i])
            {
                tmpSameDis += System.Convert.ToDouble(XlsDataSh.Items[System.Convert.ToInt16(n)].SubItems[k].Text);
            }
            // The new center coordinate is the mean of this dimension over the cluster
            RenewCenterArrayParams[i, k - 3] = tmpSameDis * 1.0 / ClusterAssem[i].Count;
        }
    }
}
4. Computing the end flag

The purpose of the end flag is to stop the algorithm when the centers no longer change, or change by less than a threshold value.

Explanation:

1. BalEndFlag: stores the previous pass's total center displacement, for comparison with the current one.

private bool CalEndFlag()
{
    double tmpDifferDis = 0, tmpSameDis = 0;
    for (int i = 0; i < ClassNum; i++)
    {
        tmpSameDis = 0;
        for (int j = 0; j < 4; j++)
        {
            tmpSameDis += Math.Pow(RenewCenterArrayParams[i, j] - CenterArrayParams[i, j], 2);
        }
        // Euclidean distance between cluster i's new center and its old one
        tmpDifferDis += Math.Pow(tmpSameDis, 1.0 / 2);
    }
    // Compare the total displacement with the previous pass; if it has
    // (almost) stopped changing, the algorithm ends (Threshold is the
    // preset convergence tolerance)
    if (Math.Abs(BalEndFlag - tmpDifferDis) < Threshold)
    {
        return false;
    }
    else
    {
        // The algorithm is not finished, so copy the new centers into the
        // array used by the next round of calculation
        for (int i = 0; i < ClassNum; i++)
        {
            for (int j = 0; j < 4; j++)
            {
                CenterArrayParams[i, j] = RenewCenterArrayParams[i, j];
            }
        }
        BalEndFlag = tmpDifferDis;
        return true;
    }
}
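The article does not show how the three methods are driven; presumably the surrounding loop looks roughly like this (my reconstruction):

// Hypothetical driver: assign points, recompute centers, repeat until
// CalEndFlag reports that the centers have stopped moving.
InitCenter();
do
{
    Cluster();
    RenewCenter();
} while (CalEndFlag());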
Result Display

Below is the interface of my K-Means-based automatic clustering scoring system. First, let's see whether the supervised grouping meets expectations, as shown:

In the red box we can see that the clustering result matches our expectations very well. The next figure shows the running result of the unsupervised K-Means algorithm:

The above is the final result of this exercise.

K-Means convergence

The K-Means algorithm must converge; otherwise it would keep searching for a new "center" forever without producing a final result. The proof is somewhat involved. The articles I consulted relate K-Means to the E-M algorithm: by transforming the clustering problem into a maximum likelihood estimation problem, one can show that K-Means is a special case of the E-M algorithm, and the parameters obtained under E-M are guaranteed to converge. In K-Means the objective is to minimize a loss function: the E-step finds the assignment closest to the target (allocating each data object to a cluster), and the M-step fixes this assignment and updates the means (the cluster centers); therefore the procedure surely converges.
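Written out, the loss function being minimized is the within-cluster sum of squares (standard notation, not taken from the original article):

\[
J(C, \mu) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\]

where C_j is the set of points assigned to cluster j and μ_j is its center. The assignment step cannot increase J for fixed centers, and the centroid update cannot increase J for fixed assignments, so J is non-increasing; since there are only finitely many possible assignments, the algorithm must converge.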

Optimization of basic K-Means Algorithms

Since K-Means is a very mature algorithm, many predecessors have already improved it. Here I only point out the directions of improvement for reference.

* From the above we know that the basic K-Means algorithm is vulnerable to interference from outliers. We can therefore detect the outlier points first in a preprocessing step, for example with the LOF algorithm, and handle those points separately.

* It is difficult to choose k when it is not specified in advance. In this case one can use a k-value adaptive optimization method for the K-Means algorithm.

* Selection of the initial cluster centers. This article picks the initial k points at random, which is not ideal; the k points should be as far apart in the data as possible. There is an algorithm for this as well: the Canopy algorithm.

* There are many other improved K-Means algorithms, such as K-means++, ISODATA, and Kernel K-means; I will share them in later articles.
