I. Theoretical preparation
A clustering algorithm is not a classification algorithm. A classification algorithm is given a data point and decides which of the already-defined classes it belongs to. A clustering algorithm is given a large pile of raw data and groups the data with similar characteristics into classes.
The basic idea of the K-means algorithm: randomly set the centers of K clusters, and assign each sample point to a cluster according to the nearest-neighbor principle. Then recompute the centroid of each cluster by averaging its members, which determines the new cluster centers. Iterate until the cluster centers move less than a given threshold.
General idea of the algorithm:
1. Select a few points from the given samples as the initial centers (say, K = 2).
2. Compute the distance from every remaining point to each center (using the Euclidean distance formula) and assign each point to the class of its nearest center, until every point has joined its "faction".
3. On the basis of this partition, recompute each cluster center by averaging its members, then repeat step 2, and so on.
4. Stop when the algorithm converges (which can be understood as the center points no longer changing).
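The four steps above can be sketched in a few lines. This is a minimal sketch in NumPy rather than the post's MATLAB; the function name `kmeans` and the toy points are made up for illustration:

```python
import numpy as np

def kmeans(x, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # step 1: pick k sample points as the initial centers
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: squared Euclidean distance from every point to every
        # center, then assign each point to its nearest center
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        cid = d.argmin(axis=1)
        # step 3: recompute each center as the mean of its members
        new = np.array([x[cid == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centers essentially stop moving
        if np.abs(new - centers).max() < tol:
            return cid, new
        centers = new
    return cid, centers

# two well-separated blobs, clustered with k = 2
pts = np.array([[0.0, 0.0], [0.1, 0.2], [-0.1, 0.1],
                [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
cid, centers = kmeans(pts, 2)
```

On data this well separated, any choice of two distinct starting points ends with one center per blob after a couple of iterations.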
Small knowledge points:
(1) s = size(A): when there is only one output argument, a row vector is returned whose first element is the number of rows of the matrix and whose second element is the number of columns.
(2) [r,c] = size(A): when there are two output arguments, size returns the number of rows of the matrix in the first output variable r and the number of columns in the second output variable c.
(3) size(A,n): if an extra argument n equal to 1 or 2 is passed, size returns the number of rows or columns respectively. So r = size(A,1) returns the number of rows of matrix A, and c = size(A,2) returns the number of columns of matrix A.
In addition, length(A) = max(size(A)).
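For comparison, the same queries look like this in NumPy (an analogue for readers without MATLAB; `A` here is a made-up 3-by-5 matrix):

```python
import numpy as np

A = np.zeros((3, 5))        # a 3-by-5 matrix

s = A.shape                 # like s = size(A): (rows, columns)
r, c = A.shape              # like [r, c] = size(A)
rows = A.shape[0]           # like size(A, 1): number of rows
cols = A.shape[1]           # like size(A, 2): number of columns
length_A = max(A.shape)     # like length(A) = max(size(A))

print(s, rows, cols, length_A)  # → (3, 5) 3 5 5
```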
II. Implementation of the algorithm
x = [ 1.2126  2.1338  0.5115
     -0.9316  0.7634  0.0125
     -2.9593  0.1813 -0.8833
      3.1104 -2.5393 -0.0588
     -3.1141 -0.1244 -0.6811
     -3.2008  0.0024 -1.2901
     -1.0777  1.1438  0.1996
     -2.7213 -0.1909  0.1184
     -1.1467  1.3820  0.1427
      1.1497  1.9414 -0.3035
      2.6993 -2.2556  0.1637
     -3.0311  0.1417  0.0888
     -2.8403 -0.1809 -0.0965
      1.0118  2.0372  0.1638
     -0.8968  1.0260 -0.1013
      1.1112  1.8802 -0.0291
      1.1907  2.2041 -0.1060
     -1.0114  0.8029 -0.1317
     -3.1715  0.1041 -0.3338
      0.9718  1.9634  0.0305
     -1.0377  0.8889 -0.2834
     -0.8989  1.0185 -0.0289
     -2.9815 -0.4798  0.2245
     -0.8576  0.9231 -0.2752
     -3.1356  0.0026 -1.2138
      3.4470 -2.2418  0.2014
      2.9143 -1.7951  0.1992
      3.4961 -2.4969 -0.0121
     -2.9341 -0.1071 -0.7712
     -2.8105 -0.0884 -0.0287
      3.1006 -2.0677 -0.2002
      0.8209  2.1724  0.1548
     -2.8500  0.3196  0.1359
     -2.8679  0.1365 -0.5702
     -2.8245 -0.1312  0.0881
     -0.8322  1.3014 -0.3837
     -2.6063  0.1431  0.1880
     -3.1341 -0.0854 -0.0359
      0.6893  2.0854 -0.3250
      1.0894  1.7271 -0.0176
     -2.9851 -0.0113  0.0666
      1.0371  2.2724  0.1044
     -2.8032 -0.2737 -0.7391
     -2.6856  0.0619 -1.1066
     -2.9445 -0.1602 -0.0019
      1.2004  2.1302 -0.1650
      3.2505 -1.9279  0.4462
     -1.2080  0.8222  0.1671
     -2.8274  0.1515 -0.9636
      2.8190 -1.8626  0.2702
      1.0507  1.7776 -0.1421
     -2.8946  0.1446 -0.1645
     -1.0105  1.0973  0.0241
     -2.9138 -0.3404  0.0627
     -3.0646 -0.0008  0.3819
      1.2531  1.9830 -0.0774
      1.1486  2.0440 -0.0582
     -3.1401 -0.1447 -0.6580
     -2.9591  0.1598 -0.6581
     -2.9219 -0.3637 -0.1538
      2.8948 -2.2745  0.2332
     -3.2972 -0.0219 -0.0288
     -1.2737  0.7648  0.0643
     -1.0690  0.8108 -0.2723
     -0.5908  0.7508 -0.5456
      0.5808  2.0573 -0.1658
      2.8227 -2.2461  0.2255
      0.6174  1.7654 -0.3999
      3.2587 -1.9310  0.2021
      1.0999  1.8852 -0.0475
     -2.7395  0.2585 -0.8441
     -1.2223  1.0542 -0.2480
     -2.9212 -0.0605 -0.0259
      3.1598 -2.2631  0.1746
      0.8476  1.8760 -0.2894
      2.9205 -2.2418  0.4137
      2.7656 -2.1768  0.0719
     -0.8698  1.0249 -0.2084
     -1.1444  0.7787 -0.4958
     -1.0711  1.0450 -0.0477
      0.5350  1.8110 -0.0377
      0.9076  1.8845 -0.1121
     -2.7887 -0.2119  0.0566
     -1.2567  0.9274  0.1104
     -2.9946 -0.2086 -0.8169
      1.0536  1.9818 -0.0631
     -2.8465 -0.2222  0.2745
     -2.8516  0.1649 -0.7566
     -3.2470  0.0770  0.1173
     -2.9322 -0.0631 -0.0062
     -2.7919  0.0438 -0.1935
      0.9894  1.9475 -0.0146
     -2.9659 -0.1300  0.1144
     -2.7322 -0.0427 -1.0758
     -1.4852  0.8592 -0.0503
      2.8845 -2.1465 -0.0533
     -3.1470  0.0536  0.1073
      2.9423 -2.1572  0.0505
     -3.0683  0.3434 -0.6563
      1.3215  2.0951 -0.1557
     -0.7681  1.2075 -0.2781
     -0.6964  1.2360 -0.3342
     -0.6382  0.8204 -0.2587
     -3.0233 -0.1496 -0.2607
     -0.8952  0.9872  0.0019
     -0.8172  0.6814 -0.0691
     -3.3032  0.0571 -0.0243
      0.7810  1.9013 -0.3996
     -0.9030  0.8646 -0.1498
     -0.8461  0.9261 -0.1295
      2.8182 -2.0818 -0.1430
      2.9295 -2.3846 -0.0244
      1.0587  2.2227 -0.1250
      3.0755 -1.7365 -0.0511
     -1.3076  0.8791 -0.3720
     -2.8252 -0.0366 -0.6790
     -2.6551 -0.1875  0.3222
     -2.9659 -0.1585  0.4013
     -3.2859 -0.1546  0.0104
     -0.6679  1.1999  0.1396
     -1.0205  1.2226  0.1850
     -3.0091 -0.0186 -0.9111
     -3.0339  0.1377 -0.9662
      0.8952  1.9594 -0.3221
     -2.8481  0.1963 -0.1428
      1.0796  2.1353 -0.0792
     -0.8732  0.8985 -0.0049
      1.0620  2.1478 -0.1275
      3.4509 -1.9975  0.1285
     -3.2280 -0.0640 -1.1513
     -0.6654  0.9402  0.0577
     -3.2100  0.2762 -0.1053
      3.0793 -2.0043  0.2948
      1.3596  1.9481 -0.0167
     -3.1267  0.1801  0.2228
     -0.7979  0.9892 -0.2673
      2.5580 -1.7623 -0.1049
     -0.9172  1.0621 -0.0826
     -0.7817  1.1658  0.1922
      3.1747 -2.1442  0.1472
      2.8476 -1.8056 -0.0680
     -0.6175  1.4349 -0.1970
      0.7308  1.9656  0.2602
     -1.0310  1.0553 -0.2928
     -2.9251 -0.2095  0.0582
     -0.9827  1.2720 -0.2225
     -1.0830  1.1158 -0.0405
     -2.8744  0.0195 -0.3811
      3.1663 -1.9241  0.0455
     -1.0734  0.7681 -0.4725 ];
% the semicolon after the matrix suppresses echoing it to the console
% x(bn,:) selects row bn of x (all columns) as an initial cluster center
% arguments: x is the data source, k the number of clusters, nc the k initial cluster centers
% returns: cid tells which class each point belongs to, nr the number in each class, centers the cluster centers
k = 4;
[cid, nr, centers] = cskmeans(x, k)
% 150 should not be hard-coded here; size(x,1), the number of rows of x, is better
for i = 1:150
    if cid(i) == 1
        plot(x(i,1), x(i,2), 'r*')    % show the first class
    elseif cid(i) == 2
        plot(x(i,1), x(i,2), 'b*')    % show the second class
    elseif cid(i) == 3
        plot(x(i,1), x(i,2), 'g*')    % show the third class
    else
        plot(x(i,1), x(i,2), 'k*')    % show the fourth class
    end
    hold on
end
function [cid, nr, centers] = cskmeans(x, k, nc)
% CSKMEANS   K-means clustering.
% x: the data source; k: the number of clusters; nc: k initial cluster centers
% cid: which class each point belongs to; nr: the number in each class;
% centers: the cluster centers
[n, d] = size(x);
if nargin < 3
    % pick k observations at random as the initial cluster centers
    ind = ceil(n*rand(1, k));
    nc = x(ind, :) + randn(k, d);
end
cid = zeros(1, n);     % cluster membership of every point
oldcid = ones(1, n);   % made different from cid so the loop starts
nr = zeros(1, k);      % the number in each class
maxiter = 100;
iter = 1;
% iterate until the cluster centers (hence the memberships) almost no longer change
while ~isequal(cid, oldcid) && iter < maxiter
    oldcid = cid;
    % compute the distance from each point to every cluster center and put
    % the position of the smallest value into cid
    for i = 1:n
        % repmat (replicate matrix): B = repmat(A,m,n) tiles A as an m-by-n
        % block, so B has dimension [size(A,1)*m, size(A,2)*n]
        % A.^B raises each element of A to the corresponding element of B
        % sum(x,2) adds each row of x, giving a column vector; plain sum(x)
        % adds each column, giving a row vector
        dist = sum((repmat(x(i, :), k, 1) - nc).^2, 2);
        [m, ind] = min(dist);
        cid(i) = ind;
    end
    % recompute the cluster centers by averaging
    for i = 1:k
        % find(cid == i) returns the positions of the points in class i
        ind = find(cid == i);
        % mean(a,1) = mean(a) averages down the columns; mean(a,2) across the rows
        nc(i, :) = mean(x(ind, :));
        nr(i) = length(ind);
    end
    iter = iter + 1;
end
% now check each observation to see whether the error can be reduced further
maxiter = 2;
iter = 1;
move = 1;
while iter < maxiter && move ~= 0
    move = 0;
    for i = 1:n
        dist = sum((repmat(x(i, :), k, 1) - nc).^2, 2);
        r = cid(i);    % the current class of x(i,:)
        % a./b divides every element of a by the corresponding element of b
        % (a.\b is the same with the roles reversed); a.*b is elementwise
        % multiplication, so both operands must have the same dimension (or
        % one must be a scalar); nr is the number in each class
        % the adjustment: moving x(i,:) into class j raises the total squared
        % error by nr(j)/(nr(j)+1)*dist(j), so the smallest adjusted distance
        % marks the cheapest move
        dadj = nr./(nr + 1).*dist';
        [m, ind] = min(dadj);
        % j ~= k is a logical expression: 1 if j is not equal to k, else 0
        if ind ~= r
            % move the point and update both affected classes
            cid(i) = ind;
            ic = find(cid == ind);
            ir = find(cid == r);
            nc(ind, :) = mean(x(ic, :));
            nc(r, :) = mean(x(ir, :));
            nr(ind) = length(ic);
            nr(r) = length(ir);
            move = 1;
        end
    end
    iter = iter + 1;
end
centers = nc;
if move == 0
    disp('No points were moved after the initial clustering procedure.')
end
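The distance line in the listing leans on repmat, elementwise .^, and sum(...,2). The same computation in NumPy (a sketch with made-up numbers, for comparison) can tile explicitly like repmat or let broadcasting do the tiling implicitly:

```python
import numpy as np

xi = np.array([1.0, 2.0, 0.5])       # one data point, like x(i,:)
nc = np.array([[0.0, 0.0, 0.0],      # k = 2 cluster centers
               [1.0, 2.0, 1.0]])

# MATLAB: dist = sum((repmat(x(i,:),k,1) - nc).^2, 2)
tiled = np.tile(xi, (2, 1))              # repmat(x(i,:), k, 1)
dist = ((tiled - nc) ** 2).sum(axis=1)   # .^ then sum(..., 2)

# broadcasting makes the tiling implicit
dist2 = ((xi - nc) ** 2).sum(axis=1)

cid = dist.argmin()                      # like [m, ind] = min(dist)
print(dist, cid)  # → [5.25 0.25] 1
```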
III. Results of the algorithm
The console printed the following output automatically; I was surprised, since I never wrote the print statements myself.
>> main
No points were moved after the initial clustering procedure.
cid =
  Columns 1 through 22
     2 3 1 4 1 1 3 1 3 2 4 1 1 2 3 2 2 3 1 2 3 3
  Columns 23 through 44
     1 3 1 4 4 4 1 1 4 2 1 1 1 3 1 1 2 2 1 2 1 1
  Columns 45 through 66
     1 2 4 3 1 4 2 1 3 1 1 2 2 1 1 1 4 1 3 3 3 2
  Columns 67 through 88
     4 2 4 2 1 3 1 4 2 4 4 3 3 3 2 2 1 3 1 2 1 1
  Columns 89 through 110
     1 1 1 2 1 1 3 4 1 4 1 2 3 3 3 1 3 3 1 2 3 3
  Columns 111 through 132
     4 4 2 4 3 1 1 1 1 3 3 1 1 2 1 2 3 2 4 1 3 1
  Columns 133 through 150
     4 2 1 3 4 3 3 4 4 3 2 3 1 3 3 1 4 3
nr =
    55    30    40    25
centers =
   -2.962918181818183  -0.023009090909091  -0.297021818181818   0.341136363636364
    0.995233333333333   1.997873333333334  -0.078486666666667   0.229650000000000
   -0.956882500000000   0.997800000000000  -0.123667500000000   0.049320000000000
    3.023444000000000  -2.098592000000001   0.102096000000000  -0.050580000000000
Each run produces a different picture, and the strangest thing is that in the second picture the categories overlap.
IV. Analysis of results
K-means is not well suited to discrete attributes, but it clusters continuous data well.
Different initial values can lead to different results. One remedy is to try several different initial values, but that costs more time and wastes resources.
The number of classes K is indeterminate: by automatically merging and splitting classes, a more reasonable number of clusters K can be obtained, as in the ISODATA algorithm. What it shares with K-means: the cluster centers are determined by iterating on the sample means. Where it differs: during clustering a class may be split in two, or two classes may be merged into one, i.e. "self-organization", which gives the algorithm a heuristic character. Because the algorithm can adjust itself, several control parameters must be set: the expected number of clusters K, the minimum number of samples in a class, the between-class center-distance parameter, the maximum number of cluster pairs allowed to merge per iteration, the allowed number of iterations I, and so on.
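The sensitivity to initial values is easy to reproduce. Below is a NumPy sketch with made-up one-dimensional data (`lloyd` is a hypothetical helper, not code from this post): three tight groups clustered with k = 2 settle into two different partitions depending on where the centers start.

```python
import numpy as np

# three tight groups on a line; with k = 2, the grouping depends on the start
x = np.array([0.0, 0.1, 5.0, 5.1, 10.0, 10.1]).reshape(-1, 1)

def lloyd(x, centers, iters=20):
    """Plain K-means iterations from the given starting centers."""
    for _ in range(iters):
        cid = np.abs(x - centers.T).argmin(axis=1)   # nearest-center labels
        centers = np.array([[x[cid == j].mean()] for j in range(len(centers))])
    return cid

cid_a = lloyd(x, np.array([[0.0], [5.0]]))    # centers start on the left
cid_b = lloyd(x, np.array([[5.0], [10.0]]))   # centers start on the right
print(cid_a, cid_b)  # two different local optima
```

The first start isolates the leftmost group; the second isolates the rightmost one. Both are converged answers, which is exactly why each run of the post's script can produce a different picture.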
Reference: Lao Wang's courseware.
Realization of the K-means clustering algorithm in MATLAB