Understanding the k-means algorithm in OpenCV 2.4.9

The description of k-means in the Chinese edition of the OpenCV book is roughly: k-means is an unsupervised clustering method that represents the distribution of the data with K mean vectors, where K is user-defined. Unlike expectation maximization, the k-means centers are not Gaussian; and because the centers compete to "capture" the nearest points, the resulting clusters look more like soap bubbles. The method was invented by Steinhaus and popularized by Lloyd. k-means has the following three problems:
1. It is not guaranteed to find the best placement of the cluster centers, but it is guaranteed to converge to some solution.
2. k-means cannot tell you how many clusters should be used.
3. k-means assumes that the covariance of the space either does not affect the results or has already been normalized away.
Workarounds:
1. Run k-means several times, each with different initial cluster centers, and keep the result with the smallest total variance.
2. Start with one cluster and gradually increase the number of clusters; the total variance drops quickly until an inflection point (a knee, not an extremum). Choose the number of clusters at that inflection point.
3. Premultiply the data by the inverse covariance matrix (i.e., whiten it), so that Euclidean distances between the transformed points correspond to Mahalanobis distances between the original points.

The code in OpenCV 2.4.9 is as follows:

double cv::kmeans( InputArray _data, int K,
                   InputOutputArray _bestLabels,
                   TermCriteria criteria, int attempts,
                   int flags, OutputArray _centers )
{
    const int SPP_TRIALS = 3;  // number of candidate centers tried per k-means++ step

    Mat data = _data.getMat(); // data matrix
    bool isrow = data.rows == 1 && data.channels() > 1;
    int N = !isrow ? data.rows : data.cols;               // number of samples
    int dims = (!isrow ? data.cols : 1)*data.channels();  // dimensionality of each sample
    int type = data.depth();

    attempts = std::max(attempts, 1);  // number of times the whole algorithm is run
    CV_Assert( data.dims <= 2 && type == CV_32F && K > 0 );  // only 2-D (or lower) CV_32F data
    CV_Assert( N >= K );  // there must be at least as many samples as cluster centers

    _bestLabels.create(N, 1, CV_32S, -1, true);  // per-sample label output

    // best_labels holds the best labeling found so far; it may start out empty
    Mat _labels, best_labels = _bestLabels.getMat();
    if( flags & CV_KMEANS_USE_INITIAL_LABELS )
    {
        CV_Assert( (best_labels.cols == 1 || best_labels.rows == 1) &&
                   best_labels.cols*best_labels.rows == N &&
                   best_labels.type() == CV_32S &&
                   best_labels.isContinuous() );
        best_labels.copyTo(_labels);  // the caller supplied an initial labeling
    }
    else
    {
        if( !((best_labels.cols == 1 || best_labels.rows == 1) &&
              best_labels.cols*best_labels.rows == N &&
              best_labels.type() == CV_32S &&
              best_labels.isContinuous()) )
            best_labels.create(N, 1, CV_32S);
        _labels.create(best_labels.size(), best_labels.type());  // allocate label storage
    }
    int* labels = _labels.ptr<int>();  // scratch label array for the current attempt

    // centers holds the cluster centers after each iteration,
    // old_centers the centers before it; temp is per-dimension scratch space
    Mat centers(K, dims, type), old_centers(K, dims, type), temp(1, dims, type);
    vector<int> counters(K);   // number of samples in each cluster
    vector<Vec2f> _box(dims);  // per-dimension (min, max) bounding box of the data
    Vec2f* box = &_box[0];
    double best_compactness = DBL_MAX, compactness = 0;
    RNG& rng = theRNG();
    int a, iter, i, j, k;

    if( criteria.type & TermCriteria::EPS )
        criteria.epsilon = std::max(criteria.epsilon, 0.);
    else
        criteria.epsilon = FLT_EPSILON;
    criteria.epsilon *= criteria.epsilon;  // compared against a squared center shift

    if( criteria.type & TermCriteria::COUNT )
        criteria.maxCount = std::min(std::max(criteria.maxCount, 2), 100);
    else
        criteria.maxCount = 100;

    if( K == 1 )
    {
        attempts = 1;
        criteria.maxCount = 2;
    }

    const float* sample = data.ptr<float>(0);  // first sample
    for( j = 0; j < dims; j++ )
        box[j] = Vec2f(sample[j], sample[j]);  // box[j] = (min_j, max_j), seeded with sample 0

    for( i = 1; i < N; i++ )
    {
        sample = data.ptr<float>(i);
        for( j = 0; j < dims; j++ )
        {
            float v = sample[j];
            box[j][0] = std::min(box[j][0], v);  // minimum of dimension j
            box[j][1] = std::max(box[j][1], v);  // maximum of dimension j
        }
    }

    for( a = 0; a < attempts; a++ )
    {
        double max_center_shift = DBL_MAX;
        for( iter = 0;; )
        {
            swap(centers, old_centers);  // old_centers = centers from the previous iteration

            if( iter == 0 && (a > 0 || !(flags & KMEANS_USE_INITIAL_LABELS)) )
            {
                if( flags & KMEANS_PP_CENTERS )
                    generateCentersPP(data, centers, K, rng, SPP_TRIALS);  // k-means++ seeding
                else
                {
                    for( k = 0; k < K; k++ )
                        generateRandomCenter(_box, centers.ptr<float>(k), rng);  // random center inside the box
                }
            }
            else
            {
                if( iter == 0 && a == 0 && (flags & KMEANS_USE_INITIAL_LABELS) )
                {
                    // validate the user-supplied labels: each must be in [0, K)
                    for( i = 0; i < N; i++ )
                        CV_Assert( (unsigned)labels[i] < (unsigned)K );
                }

                // compute centers
                centers = Scalar(0);
                for( k = 0; k < K; k++ )
                    counters[k] = 0;

                for( i = 0; i < N; i++ )
                {
                    sample = data.ptr<float>(i);
                    k = labels[i];
                    float* center = centers.ptr<float>(k);
                    j = 0;
                    #if CV_ENABLE_UNROLLED
                    for( ; j <= dims - 4; j += 4 )
                    {
                        float t0 = center[j] + sample[j];
                        float t1 = center[j+1] + sample[j+1];
                        center[j] = t0;
                        center[j+1] = t1;
                        t0 = center[j+2] + sample[j+2];
                        t1 = center[j+3] + sample[j+3];
                        center[j+2] = t0;
                        center[j+3] = t1;
                    }
                    #endif
                    for( ; j < dims; j++ )
                        center[j] += sample[j];  // accumulate per-cluster sums
                    counters[k]++;               // count samples per cluster
                }

                if( iter > 0 )
                    max_center_shift = 0;

                for( k = 0; k < K; k++ )  // make sure every cluster has at least one sample
                {
                    if( counters[k] != 0 )
                        continue;

                    // if some cluster appeared to be empty then:
                    //   1. find the biggest cluster
                    //   2. find the farthest from the center point in the biggest cluster
                    //   3. exclude the farthest point from the biggest cluster and form a new 1-point cluster.

                    int max_k = 0;
                    for( int k1 = 1; k1 < K; k1++ )
                    {
                        if( counters[max_k] < counters[k1] )
                            max_k = k1;  // index of the cluster with the most samples
                    }

                    double max_dist = 0;
                    int farthest_i = -1;
                    float* new_center = centers.ptr<float>(k);
                    float* old_center = centers.ptr<float>(max_k);  // (unnormalized) center of the biggest cluster
                    float* _old_center = temp.ptr<float>();         // normalized
                    float scale = 1.f/counters[max_k];
                    for( j = 0; j < dims; j++ )
                        _old_center[j] = old_center[j]*scale;  // divide the sum by the count

                    for( i = 0; i < N; i++ )
                    {
                        if( labels[i] != max_k )  // sample i is not in the biggest cluster
                            continue;
                        sample = data.ptr<float>(i);
                        double dist = normL2Sqr_(sample, _old_center, dims);  // squared distance to that center

                        if( max_dist <= dist )
                        {
                            max_dist = dist;
                            farthest_i = i;  // farthest point in the biggest cluster
                        }
                    }

                    // transfer the farthest sample from the biggest cluster to the empty one
                    counters[max_k]--;
                    counters[k]++;
                    labels[farthest_i] = k;

                    sample = data.ptr<float>(farthest_i);
                    for( j = 0; j < dims; j++ )
                    {
                        old_center[j] -= sample[j];
                        new_center[j] += sample[j];
                    }
                }

                for( k = 0; k < K; k++ )
                {
                    float* center = centers.ptr<float>(k);
                    CV_Assert( counters[k] != 0 );

                    float scale = 1.f/counters[k];
                    for( j = 0; j < dims; j++ )
                        center[j] *= scale;  // divide the sums by the counts to get the means

                    if( iter > 0 )
                    {
                        // squared Euclidean distance between the old and new center of cluster k
                        double dist = 0;
                        const float* old_center = old_centers.ptr<float>(k);
                        for( j = 0; j < dims; j++ )
                        {
                            double t = center[j] - old_center[j];
                            dist += t*t;
                        }
                        // largest center movement; used below to test convergence
                        max_center_shift = std::max(max_center_shift, dist);
                    }
                }
            }

            if( ++iter == MAX(criteria.maxCount, 2) || max_center_shift <= criteria.epsilon )
                break;

            // assign labels
            Mat dists(1, N, CV_64F);
            double* dist = dists.ptr<double>(0);
            // find the nearest center (and its distance) for every sample, in parallel
            parallel_for_(Range(0, N),
                          KMeansDistanceComputer(dist, labels, data, centers));
            compactness = 0;
            for( i = 0; i < N; i++ )
            {
                compactness += dist[i];  // total within-cluster squared distance
            }
        }

        if( compactness < best_compactness )
        {
            best_compactness = compactness;
            if( _centers.needed() )
                centers.copyTo(_centers);  // save the best centers to the output parameter
            _labels.copyTo(best_labels);   // save the best labeling
        }
    }

    return best_compactness;  // the clustering cost: the sum of squared distances
}

The flags argument of the function can take the following three values:

enum
{
    KMEANS_RANDOM_CENTERS = 0,     // choose random initial centers in each attempt
    KMEANS_PP_CENTERS = 2,         // use the k-means++ algorithm for initialization
    KMEANS_USE_INITIAL_LABELS = 1  // use the user-provided labels as the first classification
};

The k-means++ algorithm picks the first center uniformly at random and then chooses each subsequent center among the data points with probability proportional to the squared distance from each point to its nearest already-chosen center. (The original post illustrated the algorithm with an image, which is not reproduced here.) A more detailed description of k-means can be found at http://blog.csdn.net/chlele0105/article/details/12997391.
The code for k-means++ in OpenCV 2.4.9 is as follows:

static void generateCentersPP( const Mat& _data, Mat& _out_centers,
                               int K, RNG& rng, int trials )
{
    int i, j, k, dims = _data.cols, N = _data.rows;
    const float* data = _data.ptr<float>(0);
    size_t step = _data.step/sizeof(data[0]);
    vector<int> _centers(K);
    int* centers = &_centers[0];
    // dist holds the distance from each point to its nearest chosen center;
    // tdist2 holds the candidate distances for the current trial;
    // tdist holds the best trial seen so far for the current center
    vector<float> _dist(N*3);
    float* dist = &_dist[0], *tdist = dist + N, *tdist2 = tdist + N;
    double sum0 = 0;

    centers[0] = (unsigned)rng % N;  // the first center is a uniformly random sample

    for( i = 0; i < N; i++ )
    {
        // squared Euclidean distance from sample i to the first center
        dist[i] = normL2Sqr_(data + step*i, data + step*centers[0], dims);
        sum0 += dist[i];
    }

    for( k = 1; k < K; k++ )  // choose the remaining K-1 centers
    {
        double bestSum = DBL_MAX;
        int bestCenter = -1;

        for( j = 0; j < trials; j++ )  // try `trials` candidates, keep the best one
        {
            // sample a candidate index with probability proportional to dist[i]
            double p = (double)rng*sum0, s = 0;
            for( i = 0; i < N-1; i++ )
                if( (p -= dist[i]) <= 0 )
                    break;
            int ci = i;

            parallel_for_(Range(0, N),
                          KMeansPPDistanceComputer(tdist2, data, dist, dims, step, step*ci));
            for( i = 0; i < N; i++ )
                s += tdist2[i];

            if( s < bestSum )  // the candidate minimizing the total distance wins
            {
                bestSum = s;
                bestCenter = ci;
                std::swap(tdist, tdist2);
            }
        }
        centers[k] = bestCenter;
        sum0 = bestSum;
        std::swap(dist, tdist);
    }

    for( k = 0; k < K; k++ )  // copy the chosen samples to the output centers
    {
        const float* src = data + step*centers[k];
        float* dst = _out_centers.ptr<float>(k);
        for( j = 0; j < dims; j++ )
            dst[j] = src[j];
    }
}

OpenCV ships a k-means example in samples/cpp/kmeans.cpp, but it only demonstrates clustering of 2-dimensional data.

(Reprint: please credit the author and source, http://blog.csdn.net/CHIERYU. Do not use for commercial purposes without permission.)
