The k-means algorithm is a well-known clustering algorithm: it is easy to implement, works well in practice, and its training process needs no manual intervention, which makes it a staple in pattern recognition and related fields. Today we take this algorithm out for some hands-on practice. It is a dynamic (iterative) clustering method in unsupervised learning.
Process:
1. Randomly select K points from the samples as the cluster centers.
2. Compute the distance from every sample to each cluster center, and assign each sample to the nearest cluster.
3. Compute the mean of all samples in each cluster, and replace the old center with this new one.
4. Check the distance between the old and new cluster centers; if it exceeds a specified threshold, repeat steps 2-4 until it falls below the threshold.
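Before turning to the library implementations later in this post, here is a minimal sketch of these four steps in plain C++ for 2-D points. It is illustrative only: the names (Point2D, kmeans2d), the rand()-based seeding, and the squared-distance convergence test are my own choices for this sketch, not part of any library API.

#include <algorithm>
#include <cstdlib>
#include <vector>

struct Point2D { double x, y; };

// Squared Euclidean distance between two 2-D points.
double dist2(const Point2D& a, const Point2D& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Returns a cluster label in [0, k) for every sample.
std::vector<int> kmeans2d(const std::vector<Point2D>& samples, int k,
                          double eps = 1e-4, int maxIter = 100) {
    // Step 1: pick k samples (possibly with repeats) as the initial centers.
    std::vector<Point2D> centers(k);
    for (int j = 0; j < k; ++j)
        centers[j] = samples[std::rand() % samples.size()];

    std::vector<int> labels(samples.size(), 0);
    for (int iter = 0; iter < maxIter; ++iter) {
        // Step 2: assign every sample to its nearest center.
        for (std::size_t i = 0; i < samples.size(); ++i) {
            int best = 0;
            for (int j = 1; j < k; ++j)
                if (dist2(samples[i], centers[j]) < dist2(samples[i], centers[best]))
                    best = j;
            labels[i] = best;
        }
        // Step 3: recompute each center as the mean of its cluster.
        std::vector<Point2D> next(k, Point2D{0.0, 0.0});
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < samples.size(); ++i) {
            next[labels[i]].x += samples[i].x;
            next[labels[i]].y += samples[i].y;
            ++count[labels[i]];
        }
        for (int j = 0; j < k; ++j) {
            if (count[j] > 0) {
                next[j].x /= count[j];
                next[j].y /= count[j];
            } else {
                next[j] = centers[j];  // keep the old center if a cluster went empty
            }
        }
        // Step 4: stop once no center moved farther than the threshold.
        double shift = 0.0;
        for (int j = 0; j < k; ++j)
            shift = std::max(shift, dist2(centers[j], next[j]));
        centers = next;
        if (shift < eps * eps) break;
    }
    return labels;
}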
Clustering belongs to unsupervised learning. The methods covered earlier, such as regression, naive Bayes, and SVM, all have a category label y; that is, the training samples come with their classes given. Clustering samples have no y, only features x. Suppose, for example, that the stars in the universe can be represented as points in three-dimensional space. The goal of clustering is to find each sample x's latent category y, and to group together the samples x that share the same category y. For the stars example, the result is a set of star clusters: the points within a cluster are close to one another, while the distances between clusters are comparatively large.
In the clustering problem, the training samples we are given are $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where each $x^{(i)} \in \mathbb{R}^n$; there is no y.
The k-means algorithm clusters the samples into k clusters; the specific algorithm is described as follows:
1. Randomly select k cluster centroids $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$.
2. Repeat the following process until convergence {
For each example i, compute the cluster it should belong to:
$$c^{(i)} := \arg\min_j \left\| x^{(i)} - \mu_j \right\|^2$$
For each cluster j, recompute the centroid of that cluster:
$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$
}
Here k is the number of clusters we specify in advance; $c^{(i)}$ denotes the one of the k clusters whose centroid is nearest to example i, taking a value from 1 to k; and $\mu_j$ represents our current guess at the center of the samples belonging to cluster j. To illustrate with the star model: we want to cluster all the stars into k clusters. First randomly pick k points in the universe (or k stars) as the centroids of the k clusters. In the first step, compute each star's distance to each of the k centroids and assign it to the nearest cluster, so that after this step every star has its own cluster. In the second step, recompute each cluster's centroid by averaging all the stars in it. Iterate the first and second steps until the centroids no longer change, or change very little.
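These two steps can be viewed as alternately minimizing the distortion, i.e. the squared-error function that the advantages listed below refer to. This is the standard textbook formulation rather than a formula spelled out in the original post:

$$J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

The assignment step minimizes J over the labels c with the centroids $\mu$ fixed, and the update step minimizes J over $\mu$ with c fixed, so J never increases from one iteration to the next; that is why the algorithm converges, though possibly only to a local minimum.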
(Figure omitted: the effect of k-means clustering on n sample points, with k = 2.)
Reference: http://blog.csdn.net/holybin/article/details/22969747
K-means has the following advantages:
1. It is a classic algorithm for solving clustering problems; it is simple and fast.
2. It is relatively scalable and efficient when processing large data sets, because its complexity is roughly linear, about O(nkT), where n is the number of samples, k is the number of clusters, and T is the number of iterations. Usually k << n.
3. It is guaranteed to converge (it does not iterate indefinitely).
4. It tries to find the partition into k clusters that minimizes the squared-error function. When the clusters are dense, roughly spherical, and clearly separated from one another, the clustering results are very good.
K-means also has the following disadvantages:
1. It can only be used when the mean of a cluster is defined, so it does not apply to some data, such as data involving categorical attributes. It also implicitly assumes that the covariance matrix of the sample data has been normalized.
2. Although it provably converges, convergence to the global optimum is not guaranteed; it may converge only locally, so the cluster centers it finds are not necessarily the best solution.
3. It requires the user to specify the number of clusters k in advance. For the same data set, different choices of k give different, and sometimes unreasonable, results.
4. It is sensitive to the initial centers; different initial values may lead to different clustering results.
5. It is sensitive to noise and outliers; a small amount of such data can strongly distort the means.
6. It is not suited to finding non-convex clusters, or clusters that differ greatly in size.
In OpenCV, the kmeans() function is declared in the core module's header (core.hpp), so include that header first. Its prototype is:

CV_EXPORTS_W double kmeans( InputArray data, int K, InputOutputArray bestLabels,
                            TermCriteria criteria, int attempts,
                            int flags, OutputArray centers = noArray() );

The parameters are as follows:
1. InputArray data: the input vectors to cluster; each row is one sample, so the number of rows equals the number of samples.
2. int K: the number of clusters to produce.
3. InputOutputArray bestLabels: has the same number of rows as data; each row holds one number, the index of the cluster that sample was assigned to. For example, if the number of clusters is 4, each entry is a number in 0-3.
4. TermCriteria criteria: controls the termination condition. TermCriteria is a class whose constructor is defined in OpenCV as:

TermCriteria::TermCriteria(int _type, int _maxCount, double _epsilon)
    : type(_type), maxCount(_maxCount), epsilon(_epsilon) {}

where type can take three values:
TermCriteria::COUNT: the end condition is a number of iterations.
TermCriteria::EPS: the iteration ends when it reaches an accuracy threshold.
TermCriteria::COUNT + TermCriteria::EPS: ends when either the iteration count or the threshold condition is met.
_maxCount is the number of iterations and _epsilon is the accuracy threshold.
5. int attempts: controls how many times the k-means algorithm is executed (with different initializations); the optimal result is selected as the final result.
6. int flags: can take the following three values.
KMEANS_RANDOM_CENTERS: selects the initial centers at random.
KMEANS_PP_CENTERS: determines the initial centers with the k-means++ algorithm.
KMEANS_USE_INITIAL_LABELS: uses the user-supplied labels in bestLabels for the first attempt.
7. OutputArray centers: receives the resulting cluster centers.
Let's look at an example:
#include "opencv2/highgui/highgui.hpp" #include "opencv2/core/core.hpp" #include <iostream>using namespace CV; Using namespace Std;int main (int/*argc*/, char**/*argv*/) {const int max_clusters = 5; Scalar colortab[] = {scalar (0, 0, 255), scalar (0,255,0), scalar (255,100,100), scalar ( 255,0,255), Scalar (0,255,255)}; Mat img (CV_8UC3); RNG rng (12345); for (;;) {int k, clustercount = Rng.uniform (2, max_clusters+1); int I, Samplecount = Rng.uniform (1, 1001); Mat points (Samplecount, 1, CV_32FC2), labels; Clustercount = MIN (Clustercount, Samplecount); Mat Centers (Clustercount, 1, Points.type ()); /* Generate random sample from Multigaussian distribution */for (k = 0; k < Clustercount; k++) { Point Center; center.x = Rng.uniform (0, Img.cols); CENTER.Y = Rng.uniform (0, img.rows); Mat pointchunk = points.RowRange (k*samplecount/clustercount, k = = clusterCount-1? Samplecount: (k+1) *samplecount/clustercount); Rng.fill (Pointchunk, Cv_rand_normal, scalar (center.x, center.y), scalar (img.cols*0.05, img.rows*0.05)); } randshuffle (points, 1, &rng); Kmeans (points, Clustercount, labels, termcriteria (cv_termcrit_eps+cv_termcrit_iter, 10, 1.0), 3, Kmeans_pp_centers, CENTERS); img = scalar::all (0); for (i = 0; i < Samplecount; i++) {int clusteridx = labels.at<int> (i); Point IPT = points.at<point2f> (i); Circle (IMG, IPT, 2, Colortab[clusteridx], cv_filled, CV_AA); } imshow ("Clusters", IMG); Char key = (char) waitkey (); if (key = = | | key = = ' Q ' | | key = = ' Q ')//' ESC ' break; } return 0;}
MATLAB
MATLAB's kmeans function partitions the n*p data matrix X into k classes so that, within each class, the distances between objects are as small as possible, while the distances between classes are as large as possible.
Usage:
idx = kmeans(X, k)
[idx, C] = kmeans(X, k)
[idx, C, sumd] = kmeans(X, k)
[idx, C, sumd, D] = kmeans(X, k)
[...] = kmeans(..., 'Param1', Val1, 'Param2', Val2, ...)
The input and output parameters are as follows:
X: the n*p data matrix
k: the number of classes into which X is divided (an integer)
idx: an n*1 vector storing each point's cluster label
C: a k*p matrix storing the locations of the k cluster centroids
sumd: a 1*k vector storing the sum of the point-to-centroid distances within each class
D: an n*k matrix storing the distance from each point to every centroid
In the form [...] = kmeans(..., 'Param1', Val1, 'Param2', Val2, ...), the optional parameters 'Param1', 'Param2', etc. can mainly be set as follows:
1. 'Distance' (the distance measure)
'sqEuclidean': squared Euclidean distance (the default)
'cityblock': sum of absolute differences, i.e. the L1 distance
'cosine': for vectors (treated as directions)
'correlation': for values with a time-series relationship
'Hamming': only for binary data
2. 'Start' (the method for choosing the initial centroid positions)
'sample': randomly select k points from X as centroids (the default)
'uniform': generate k centroids uniformly at random over the range of X
'cluster': run a preliminary clustering on a random 10% subsample of X (this preliminary phase is itself initialized with 'sample')
matrix: a user-supplied k*p matrix giving the initial centroid positions
3. 'Replicates' (the number of times to repeat the clustering), an integer
X = [randn(100,2)+ones(100,2); randn(100,2)-ones(100,2)];
opts = statset('Display','final');
[idx,ctrs] = kmeans(X,2,...
                    'Distance','city',...
                    'Replicates',5,...
                    'Options',opts);
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(ctrs(:,1),ctrs(:,2),'kx',...
     'MarkerSize',12,'LineWidth',2)
plot(ctrs(:,1),ctrs(:,2),'ko',...
     'MarkerSize',12,'LineWidth',2)
legend('Cluster 1','Cluster 2','Centroids',...
       'Location','NW')
Image recognition algorithm discussion QQ group: 145076161. Everyone interested in image recognition and image algorithms is welcome to join for mutual learning and exchange.