Principle and implementation of the K-means clustering algorithm


1 Basic principles of the K-means clustering algorithm

The K-means algorithm is the most classical partition-based clustering method and one of the ten classical data-mining algorithms. Its basic idea is to take k points in the space as initial centers and group each object with the center closest to it; the value of each cluster center is then updated iteratively until the best clustering result is obtained.

Suppose the sample set is to be divided into k classes. The algorithm is described as follows:


(1) Select k initial class centers appropriately; initially they are generally chosen at random;

(2) In each iteration, compute the Euclidean distance from each sample to each of the k centers, and assign the sample to the class whose center is nearest;

(3) Update the value of each of the k class centers by taking the mean of the samples assigned to it;

(4) Repeat (2) and (3) for all k cluster centers; when the movement of the class centers satisfies a given convergence condition, the iteration ends and the classification is complete.

The principle of the K-means clustering algorithm is simple; its effect depends on the value of k and on the selection of the initial class centers.

2 Algorithm structure and implementation

The K-means algorithm itself is relatively simple. It is implemented here in C++, which, as an object-oriented language, ensures good encapsulation and code reuse. The software contains three parts: kmeans.h, kmeans.cpp and main.cpp.

kmeans.h contains two parts: first, the declarations of the data types (the st_point structure, the vector container that holds it, and the group ID); second, the declarations of the functions.

Figure 4.1 Basic program structure and corresponding functions

kmeans.cpp gives the detailed definitions of the public functions. The functions are fine-grained, which makes later extension convenient. The clustering function proper is Cluster(), which strictly follows the basic principle of K-means: the similarity measure is the simple Euclidean distance, and the termination criterion of the iteration is whether the deviation between two successive center values is greater than a given Dist_near_zero value. See the source code for details.

3 Data Description

The algorithm is tested on three-dimensional point-cloud data, similar to the data collected by the three-dimensional laser scanner in the laboratory but simpler in form: ordered and regular. Displayed in CloudCompare, it looks as follows:

Figure 4.2 Data source Map

The data consist of three point clouds in a three-dimensional coordinate system, namely a sphere, a polygon and a cube. The file Test.txt holds the union of these three point sets in scrambled order; the task of the clustering algorithm is to separate and classify them. Naturally, the value of k was chosen as 3.

When the software runs, a container of a structure type is created to hold the original data as it is read in:

typedef struct ST_POINT
{
    st_pointxyz pnt;    // st_pointxyz is a three-dimensional point structure type
    int groupID;

    ST_POINT() {}
    ST_POINT(st_pointxyz &p, int id)
    {
        pnt = p;
        groupID = id;
    }
} st_point;

This data structure holds a three-dimensional point together with its classification ID; the data container is vector<st_point>.

4 Algorithm description and source code analysis

This section focuses on the specific code of the project's clustering function Cluster(). Although C++ is well suited to large-scale programs and this algorithm is relatively simple, the complete program is somewhat lengthy; see the project source for the full listing. Only the implementation of steps (2) and (3) of the K-means principle is analysed below.

The program source code is as follows:

bool KMeans::Cluster()
{
    std::vector<st_pointxyz> v_center(mv_center.size());

    do
    {
        for (int i = 0, pntCount = mv_pntcloud.size(); i < pntCount; ++i)
        {
            double min_dist = DBL_MAX;
            int pnt_grp = 0;
            for (int j = 0; j < m_k; ++j)
            {
                double dist = DistBetweenPoints(mv_pntcloud[i].pnt, mv_center[j]);
                if (min_dist - dist > 0.000001)
                {
                    min_dist = dist;
                    pnt_grp = j;
                }
            }
            // Assign the sample to the group of its nearest center
            m_grp_pntcloud[pnt_grp].push_back(st_point(mv_pntcloud[i].pnt, pnt_grp));
        }

        // Save the center points of the last iteration
        for (size_t i = 0; i < mv_center.size(); ++i)
        {
            v_center[i] = mv_center[i];
        }

        if (!UpdateGroupCenter(m_grp_pntcloud, mv_center))
        {
            return false;
        }
        if (!ExistCenterShift(v_center, mv_center))
        {
            break;
        }
        for (int i = 0; i < m_k; ++i)
        {
            m_grp_pntcloud[i].clear();
        }
    } while (true);

    return true;
}

5 Algorithm result analysis

The data in the original file Test.txt are divided into three classes, stored in the files k_1, k_2 and k_3. Colour is added to the three clustered point clouds and they are displayed in CloudCompare, giving the following picture:

Figure 4.3 K-means clustering results

This is the result obtained when the initial three cluster centers are given as {0, 0, 0}, {2.5, 2.5, 2.5} and {3, 3, -3}. The result is quite good. Now look at another case:

Figure 4.4 results after changing the initial cluster center

The initial three centers corresponding to this result are {2, 2, 2}, {-2.5, 2.5, 2.5} and {3, -3, -3}. Clearly the clustering here is not ideal, which shows that K-means depends to some extent on the initial seed points: the choice of seed points matters, and different random seeds can produce completely different results.

The above varied the initial points. Below, clustering results for k = 4 are given, with two different sets of initial points:

Figure 4.5.1 k=4 Clustering Results 1

Figure 4.5.2 k=4 Clustering Results 2

The above clustering results show that when k is increased, satisfactory results can still be obtained if the initial cluster points are chosen appropriately, as in Figure 4.5.2. Compared with it, the result in Figure 4.5.1 is not ideal: as the colours show, only two classes are populated and the other two are empty.

From the analysis of the experimental results, the K-means clustering algorithm is a very fast clustering algorithm with clear results and good locality, and it is easy to parallelize, which makes it meaningful for large-scale data sets. However, it depends comparatively heavily on the choice of the value of k and of the initial cluster centers, so the algorithm is best suited to large-scale clustering situations with human participation.

Project Source: HTTP://PAN.BAIDU.COM/S/1NTN6PJB

Kmeans Clustering Algorithm-open source China community http://www.oschina.net/code/snippet_588162_50491

