Clustering is not classification. A classification algorithm is given a data point and must determine which of a set of known classes that point belongs to. A clustering algorithm is given a large amount of raw, unlabeled data and groups points with similar features into the same cluster.
In K-means clustering, the number of clusters contained in the raw data is given in advance, and the algorithm then aggregates data points with similar features into those clusters.
This presentation of the algorithm follows Andrew Ng's lectures.
First, the raw data {x(1), x(2), ..., x(m)} is provided; the data is unlabeled.
Initialize K centers u1, u2, ..., uK at random. Both the x(i) and the uj are vectors.
By iterating the following two formulas, the final u can be obtained; these uj are the center positions of the final clusters.
Formula 1:

c(i) := argmin_j || x(i) - u_j ||^2
That is, compute the distance from every data point to each current center, and assign each point to the center nearest to it.
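The assignment step can be sketched in Python with NumPy (the array names `X`, `u`, `c` and the sample values are illustrative, not taken from the code below):

```python
import numpy as np

# Formula 1 (assignment step): for each point, find the index of the
# nearest center. X is (m, d) data, u is (K, d) centers.
X = np.array([[0.0, 0.0], [1.0, 1.1], [5.0, 5.0]])
u = np.array([[0.5, 0.5], [5.0, 5.0]])
dists = np.linalg.norm(X[:, None, :] - u[None, :, :], axis=2)  # (m, K)
c = dists.argmin(axis=1)  # nearest-center index per point
print(c)  # → [0 0 1]
```

Broadcasting computes all m×K distances at once, replacing the double loop a direct translation would use.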
Formula 2:

u_j := ( Σ_{i=1}^{m} 1{c(i) = j} · x(i) ) / ( Σ_{i=1}^{m} 1{c(i) = j} )
That is, move each center u_j to the mean of all the data points currently assigned to it.
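The update step, sketched the same way (again with illustrative names and values):

```python
import numpy as np

# Formula 2 (update step): move each center to the mean of the points
# assigned to it. c holds the assignments from Formula 1.
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
c = np.array([0, 0, 1])
K = 2
u = np.array([X[c == j].mean(axis=0) for j in range(K)])
print(u)  # centers: [0.5, 0.5] and [5.0, 5.0]
```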
The two formulas are then iterated in turn until the centers u no longer change significantly.
Let's take a look at some results:
A plot of data drawn from three two-dimensional Gaussian distributions:
After K-means clustering is run on the unlabeled raw data, the crosses mark the final positions of the cluster centers:
The MATLAB code follows. Here the test data is changed to three dimensions, but the function can handle data of any dimension.
main.m
clear all; close all; clc;

% first class of data
mu1 = [0 0 0];                          % mean
S1 = [0.3 0 0; 0 0.35 0; 0 0 0.3];      % covariance
data1 = mvnrnd(mu1, S1, 100);           % generate Gaussian-distributed data

% second class of data
mu2 = [1.25 1.25 1.25];
S2 = [0.3 0 0; 0 0.35 0; 0 0 0.3];
data2 = mvnrnd(mu2, S2, 100);

% third class of data
mu3 = [-1.25 1.25 -1.25];
S3 = [0.3 0 0; 0 0.35 0; 0 0 0.3];
data3 = mvnrnd(mu3, S3, 100);

% display the data
plot3(data1(:,1), data1(:,2), data1(:,3), '+');
hold on;
plot3(data2(:,1), data2(:,2), data2(:,3), 'r+');
plot3(data3(:,1), data3(:,2), data3(:,3), 'g+');
grid on;

% merge the three classes into one unlabeled data set
data = [data1; data2; data3];           % the data here carries no labels

% K-means clustering
[u re] = KMeans(data, 3);               % re is the labeled data; the label is
                                        % appended as an extra (4th) dimension
[m n] = size(re);

% display the clustered data
figure;
hold on;
for i = 1:m
    if re(i,4) == 1
        plot3(re(i,1), re(i,2), re(i,3), 'ro');
    elseif re(i,4) == 2
        plot3(re(i,1), re(i,2), re(i,3), 'go');
    else
        plot3(re(i,1), re(i,2), re(i,3), 'bo');
    end
end
grid on;
KMeans.m
% N    is the number of clusters
% data is the unlabeled input data
% u    holds the center of each cluster
% re   is the returned data with a cluster label appended
function [u re] = KMeans(data, N)
    [m n] = size(data);   % m is the number of data points, n the dimension
    ma = zeros(n);        % maximum of each dimension
    mi = zeros(n);        % minimum of each dimension
    u = zeros(N, n);      % random initialization; iterates to the cluster centers
    for i = 1:n
        ma(i) = max(data(:,i));    % maximum of each dimension
        mi(i) = min(data(:,i));    % minimum of each dimension
        for j = 1:N
            u(j,i) = ma(i) + (mi(i) - ma(i))*rand();  % random initialization, but it is
                                                      % still better to initialize within
                                                      % [min max] of each dimension
        end
    end

    while 1
        pre_u = u;        % center positions from the previous iteration
        for i = 1:N
            tmp{i} = [];  % the x(i) - uj term in Formula 1, in preparation for it
            for j = 1:m
                tmp{i} = [tmp{i}; data(j,:) - u(i,:)];
            end
        end

        quan = zeros(m, N);
        for i = 1:m       % implementation of Formula 1
            c = [];
            for j = 1:N
                c = [c norm(tmp{j}(i,:))];
            end
            [junk index] = min(c);
            quan(i, index) = norm(tmp{index}(i,:));
        end

        for i = 1:N       % implementation of Formula 2
            for j = 1:n
                u(i,j) = sum(quan(:,i).*data(:,j)) / sum(quan(:,i));
            end
        end

        if norm(pre_u - u) < 0.1   % keep iterating until the centers stop moving
            break;
        end
    end

    re = [];
    for i = 1:m
        tmp = [];
        for j = 1:N
            tmp = [tmp norm(data(i,:) - u(j,:))];
        end
        [junk index] = min(tmp);
        re = [re; data(i,:) index];
    end