K-means Algorithm for Visual Machine Learning


K-means is an unsupervised clustering algorithm based on data partitioning.

1. Basic Principles

A clustering algorithm can be understood as an unsupervised classification method: the classes or labels of the samples are unknown, so the samples must be grouped automatically according to the distances or similarities between them. Clustering algorithms can be divided into partition-based methods, connectivity-based methods, and probabilistic-model-based methods; K-means belongs to the partition-based clustering methods.

Partition-based methods divide the vector space of the sample set into multiple regions $\{S_i\}_{i=1}^{k}$, each of which has a representative $\{c_i\}_{i=1}^{k}$, usually called the region center. For each sample $x$, a mapping $Q(x)$ from the sample to a region center can be established:

$$Q(x) = \sum_{i=1}^{k} c_i \, \mathbf{1}(x \in S_i)$$

where $\mathbf{1}(\cdot)$ is the indicator function.

According to the established mapping $Q(x)$, every sample can be assigned to its corresponding center in $\{c_i\}_{i=1}^{k}$, which yields the final partitioning result.

The main difference between the various partition-based clustering algorithms is how the mapping $Q(x)$ is established. In the classical K-means algorithm, the mapping is obtained by minimizing the squared error between each sample and its center.

Suppose there is a sample set $D = \{x_j\}_{j=1}^{n}$ with $x_j \in \mathbb{R}^d$. The goal of K-means clustering is to divide the dataset into $k$ ($k < n$) classes $S = \{S_1, S_2, \ldots, S_k\}$ such that the partition minimizes the within-class sum of squared errors:

$$\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$

where

$$c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$$

is the center (mean) of class $S_i$.

Solving this objective function exactly is a classical NP-hard problem, so a stable global optimum cannot be guaranteed. The classical K-means algorithm proposed by Stuart Lloyd adopts an iterative optimization strategy that efficiently finds a local optimum of the objective function. The algorithm consists of four steps, centered on sample assignment and cluster-center updates:

1. Initialize the cluster centers $c_1^{(0)}, c_2^{(0)}, \ldots, c_k^{(0)}$, e.g., by taking the first $k$ samples of the dataset or by selecting $k$ samples at random;

2. Assign each sample $x_j$ to the cluster with the most similar (nearest) center:

$$S_i^{(t)} = \{\, x_j : \lVert x_j - c_i^{(t)} \rVert^2 \le \lVert x_j - c_p^{(t)} \rVert^2 \,\}$$

for $p = 1, 2, \ldots, k$, $p \neq i$;

3. Update each cluster center according to the assignment results of Step 2:

$$c_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$

4. If the maximum number of iterations has been reached, or the change between two consecutive iterations is smaller than a preset threshold $\varepsilon$, the algorithm terminates; otherwise, return to Step 2.

In the K-means clustering algorithm, Steps 2 and 3 repeatedly reassign the samples and recompute the cluster centers; this iterative process optimizes the objective function and minimizes the sum of squared errors.
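The four steps above translate directly into code. The following is a minimal MATLAB sketch, assuming a d-by-n sample matrix X; the function name simple_kmeans and the convergence test on the objective are illustrative choices, not the book's implementation:

function [centroids, labels] = simple_kmeans(X, k, max_iter, epsilon)
% Minimal K-means sketch. X: d-by-n sample matrix; k: number of clusters.
n = size(X, 2);
centroids = X(:, randperm(n, k));         % Step 1: k random samples as initial centers
prev_obj = inf;
for t = 1:max_iter
    % Step 2: assign each sample to its nearest center (squared Euclidean distance)
    D = bsxfun(@plus, sum(X.^2, 1), sum(centroids.^2, 1)') - 2 * (centroids' * X);
    [dmin, labels] = min(D, [], 1);
    % Step 3: recompute each center as the mean of its assigned samples
    for i = 1:k
        l = (labels == i);
        if any(l), centroids(:, i) = mean(X(:, l), 2); end
    end
    % Step 4: stop when the objective changes by less than epsilon
    obj = sum(dmin);
    if abs(prev_obj - obj) < epsilon, break; end
    prev_obj = obj;
end
end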

2. Algorithm Improvements

2.1 Computational complexity analysis

First, in the sample-assignment phase, $kn$ squared errors are computed, so the computational complexity is $O(knd)$. Second, in the cluster-center update phase, the computational complexity is $O(nd)$. If the number of iterations is $t$, the overall computational complexity of the algorithm is $O(kndt)$. K-means is therefore a very efficient clustering algorithm for large datasets, with complexity linear in the number of samples $n$.

2.2 Improvement of cluster-center initialization

K-means is sensitive to the initialization of the cluster centers: different initializations produce different clustering results. Because K-means only finds an approximate local optimum of the objective function and cannot guarantee the global optimum, the clustering result can deviate greatly under certain data distributions depending on the initialization.

In standard K-means, the initial cluster centers are chosen by random sampling, so a desirable clustering result is not guaranteed. To obtain a better result, the cluster centers can be randomly initialized several times and the best of the resulting clusterings selected, but doing so greatly increases the computation time.

A simple and effective improvement is the k-means++ algorithm proposed by David Arthur, which produces good initial cluster centers and guarantees that, after this initialization, K-means obtains an $O(\log k)$-approximate solution; the theoretical proof can be found in the literature. The procedure is as follows. First, one cluster center $c_1$ is chosen at random from the samples; then, iteratively, the selection probability of each remaining sample is computed as

$$P(x) = \frac{D(x)^2}{\sum_{x' \in D} D(x')^2}$$

where $D(x)$ is the distance from $x$ to the nearest already-chosen center, and the sample drawn with this probability joins as the next cluster center, until $k$ centers have been selected.

The computational complexity of k-means++ initialization is $O(knd)$, so it adds no significant computational burden while helping the algorithm approach the optimal solution more effectively.
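A compact MATLAB sketch of this seeding step, assuming a d-by-n sample matrix X; the function name kmeanspp_init is hypothetical:

function centroids = kmeanspp_init(X, k)
% k-means++ seeding sketch: each new center is drawn with probability
% proportional to the squared distance to the nearest existing center.
n = size(X, 2);
centroids = zeros(size(X, 1), k);
centroids(:, 1) = X(:, randi(n));               % first center chosen uniformly
d2 = inf(1, n);                                 % squared distance to nearest center
for i = 2:k
    d2 = min(d2, sum(bsxfun(@minus, X, centroids(:, i-1)).^2, 1));
    prob = d2 / sum(d2);                        % P(x) = D(x)^2 / sum_x' D(x')^2
    idx = find(rand <= cumsum(prob), 1);        % draw the next center with prob. P(x)
    centroids(:, i) = X(:, idx);
end
end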

2.3 Adaptive determination of the number of categories

In the classical K-means algorithm, the number of clusters $k$ must be set in advance; the algorithm has no ability to select it adaptively. Each iteration monotonically decreases the objective function while re-partitioning the sample space into $\{S_i\}_{i=1}^{k}$ with the number of categories held fixed.

The number of categories determines the clustering effect to a large extent. In practice this parameter is set from prior knowledge or heuristics, for example when the general distribution of the samples is already known or the number of attribute values is known (such as the ten digits or the two genders). How, then, can the number of categories be determined adaptively during the algorithm?

The classical method is the ISODATA algorithm, which follows the same basic principle as K-means and clusters by minimizing the within-class sum of squared errors, but introduces a merging and splitting mechanism for classes during the iterations.

In each iteration, the ISODATA algorithm first clusters with a fixed number of categories, then merges clusters whose centers are closer than a preset distance threshold, and decides whether to split each class $S_i$ according to the covariance matrix of the samples within it.

The ISODATA algorithm thus adds extra heuristic re-initialization to the K-means iterations, so its computational efficiency is considerably lower than that of classical K-means.
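A much-simplified MATLAB sketch of one merge/split pass, assuming scalar thresholds theta_merge and theta_split and a per-dimension standard-deviation split test; real ISODATA uses several more parameters (minimum cluster size, desired number of classes, etc.), so this is only illustrative:

function centroids = isodata_pass(X, labels, centroids, theta_merge, theta_split)
% One heuristic merge/split pass between K-means runs (sketch only).
k = size(centroids, 2);
% Merge: combine the first pair of centers closer than theta_merge
for i = 1:k-1
    for j = i+1:k
        if norm(centroids(:, i) - centroids(:, j)) < theta_merge
            ni = sum(labels == i); nj = sum(labels == j);
            centroids(:, i) = (ni*centroids(:, i) + nj*centroids(:, j)) / max(ni + nj, 1);
            centroids(:, j) = [];
            return;                     % re-run K-means before further changes
        end
    end
end
% Split: divide a cluster whose largest per-dimension std exceeds theta_split
for i = 1:k
    S = X(:, labels == i);
    if size(S, 2) < 2, continue; end
    [smax, dim] = max(std(S, 0, 2));
    if smax > theta_split
        c1 = centroids(:, i); c2 = c1;
        c1(dim) = c1(dim) + smax;       % push the two new centers apart
        c2(dim) = c2(dim) - smax;
        centroids = [centroids(:, [1:i-1, i+1:end]), c1, c2];
        return;
    end
end
end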

2.4 Algorithm improvements for non-normally distributed or non-uniform sample sets

The classical K-means uses the squared Euclidean distance as its similarity measure, which implicitly assumes that the error within each class follows a standard normal distribution. K-means therefore clusters poorly on sample sets that are not normally distributed or are non-uniform: the squared Euclidean distance is not necessarily an appropriate similarity measure for the actual samples.

To overcome this restrictive assumption, K-means needs to be generalized to other metric spaces. Two classical improvement frameworks are kernel K-means and spectral clustering.

Kernel K-means maps each sample point $x_i$ into a new high-dimensional space $\Phi$ via $x_i \to \varphi(x_i)$; the inner product between sample points in this space can then be computed by the corresponding kernel function:

$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$

Given such a kernel function, K-means clustering can be carried out in the new space, and the similarity measure between samples depends on the choice of kernel.
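A MATLAB sketch of kernel K-means, assuming the n-by-n kernel (Gram) matrix K has already been computed, for example with a Gaussian kernel K(i,j) = exp(-||xi - xj||^2 / (2*sigma^2)); the function name kernel_kmeans is hypothetical:

function labels = kernel_kmeans(K, k, max_iter)
% Kernel K-means sketch: all distances are computed through the Gram matrix,
% using ||phi(x_j) - m_i||^2 = K(j,j) - (2/n_i) sum K(j,l) + (1/n_i^2) sum K(l,m);
% the constant K(j,j) is dropped since it does not affect the argmin.
n = size(K, 1);
labels = randi(k, 1, n);                        % random initial assignment
for t = 1:max_iter
    D = zeros(k, n);
    for i = 1:k
        idx = (labels == i); ni = max(sum(idx), 1);
        D(i, :) = -2/ni * sum(K(idx, :), 1) + sum(sum(K(idx, idx))) / ni^2;
    end
    [~, new_labels] = min(D, [], 1);
    if all(new_labels == labels), break; end    % converged
    labels = new_labels;
end
end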

The spectral clustering algorithm instead transforms the metric space of the samples: first the affinity matrix of the sample set is computed, then the eigenvectors of the affinity matrix (or its graph Laplacian) are extracted, and finally K-means is applied to those eigenvectors. The eigenvectors of the affinity matrix implicitly redefine the similarity between samples.
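A MATLAB sketch of this pipeline in the normalized (Ng-Jordan-Weiss) form, assuming a d-by-n sample matrix X and a Gaussian affinity with bandwidth sigma; kmeans here is the Statistics Toolbox function, and spectral_cluster is a hypothetical name:

function labels = spectral_cluster(X, k, sigma)
% Spectral clustering sketch: Gaussian affinity -> normalized affinity ->
% top-k eigenvectors -> K-means on the spectral embedding.
n = size(X, 2);
sq = sum(X.^2, 1);
d2 = bsxfun(@plus, sq', sq) - 2 * (X' * X);     % n-by-n squared distances
A = exp(-d2 / (2 * sigma^2));                   % affinity matrix
A(1:n+1:end) = 0;                               % zero the diagonal (no self-affinity)
Dh = diag(1 ./ sqrt(sum(A, 2)));                % D^(-1/2)
N = Dh * A * Dh;                                % normalized affinity matrix
[V, ~] = eigs(N, k);                            % k leading eigenvectors
V = bsxfun(@rdivide, V, sqrt(sum(V.^2, 2)) + eps);  % normalize the rows
labels = kmeans(V, k);                          % cluster the embedded samples
end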

2.5 Bisecting K-means Clustering

Under the K-means clustering rule it is easy to fall into a local minimum, leaving the algorithm with a poor value of the objective function. To address this problem, the bisecting K-means clustering algorithm was proposed. First, all samples are treated as a single cluster; that cluster is split in two, and then one of the existing clusters is selected for the next bisection. The cluster to bisect is chosen so that the total sum of squared errors (SSE) becomes as small as possible; the SSE is simply the sum of the squared distances from the samples to their cluster centers. A minimal sketch follows the figure references below.

Figure 1-3 shows the clustering result of the K-means algorithm under a poor random initialization; Figure 1-4 shows the result of bisecting K-means clustering on the same data.
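A MATLAB sketch of bisecting K-means, using the common heuristic of always splitting the cluster with the largest SSE; kmeans is the Statistics Toolbox function, and bisecting_kmeans is a hypothetical name:

function labels = bisecting_kmeans(X, k)
% Bisecting K-means sketch. X: d-by-n sample matrix; k: final cluster count.
n = size(X, 2);
labels = ones(1, n);                        % start with all samples in one cluster
for c = 2:k
    % find the cluster with the largest within-cluster SSE
    sse = zeros(1, c-1);
    for i = 1:c-1
        S = X(:, labels == i);
        sse(i) = sum(sum(bsxfun(@minus, S, mean(S, 2)).^2));
    end
    [~, worst] = max(sse);
    idx = find(labels == worst);
    if numel(idx) < 2, break; end           % nothing left to split
    sub = kmeans(X(:, idx)', 2)';           % bisect it with a 2-means run
    labels(idx(sub == 2)) = c;              % one half keeps its label, one gets a new one
end
end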

3. Simulation Experiments

3.1 Image segmentation based on K-means

First, the color space of the original color image is converted from the RGB channels to the L*a*b* color space.

Then each image pixel is represented as a point in the two-dimensional a*b* channel space, as in Figure 1-5. The test image is 300x400 pixels, which yields a 2x120000 sample set; the sample distribution is shown in Figure 1-5(c).

The sample set is then clustered using K-means, initialized with k-means++. Setting the number of categories to k=5 gives the segmentation result shown in Figure 1-5(b), where each color represents a different cluster category.

According to the color information, K-means effectively labels the distinct regions of the image, such as the sky, the ground, and the buildings.

% Demonstrates how to use K-means clustering for image segmentation.
function kmeans_demo1()
clear; close all; clc;

%% Read the test image
im = imread('city.jpg');
figure, imshow(im), title('Input Image');

%% Convert the image color space to obtain the sample set
cform = makecform('srgb2lab');           % structure for RGB -> L*a*b* conversion
lab = applycform(im, cform);             % convert RGB to L*a*b* space
ab = double(lab(:, :, 2:3));             % take the a*, b* channels as double data
nrows = size(lab, 1);                    % image dimensions
ncols = size(lab, 2);
X = reshape(ab, nrows*ncols, 2)';        % reshape into a 2-by-N sample matrix
% show the spatial distribution of the 2-D samples after color conversion;
% scatter(x, y) draws a scatter plot, and 'filled' draws solid points
figure, scatter(X(1,:)', X(2,:)', 3, 'filled'); box on;
% print -dpdf 2d1.pdf

%% K-means clustering of the sample space
k = 5;                                   % number of clusters
max_iter = 100;                          % maximum iteration count
[centroids, labels] = run_kmeans(X, k, max_iter);

%% Show the clustering and segmentation results
figure, scatter(X(1,:)', X(2,:)', 3, labels, 'filled');  % clustered 2-D samples
hold on;
scatter(centroids(1,:), centroids(2,:), 60, 'r', 'filled');
scatter(centroids(1,:), centroids(2,:), 30, 'g', 'filled');
box on; hold off;
% print -dpdf 2d2.pdf
pixel_labels = reshape(labels, nrows, ncols);
rgb_labels = label2rgb(pixel_labels);
figure, imshow(rgb_labels), title('Segmented Image');
% print -dpdf seg.pdf
end

function [centroids, labels] = run_kmeans(X, k, max_iter)
% K-means clustering.
% Input:  X        - d-by-N sample matrix
%         k        - number of cluster centers
%         max_iter - maximum number of K-means iterations
% Output: centroids - d-by-k cluster centers
%         labels    - 1-by-N category label of each sample

%% Initialize the cluster centers with the k-means++ algorithm
centroids = X(:, 1 + round(rand * (size(X, 2) - 1)));  % first center: a random sample
labels = ones(1, size(X, 2));            % initially assign all samples to class 1
for i = 2:k
    D = X - centroids(:, labels);        % offsets to the currently assigned centers
    D = cumsum(sqrt(dot(D, D, 1)));      % cumulative distances for weighted sampling
    if D(end) == 0, centroids(:, i:k) = X(:, ones(1, k-i+1)); return; end
    centroids(:, i) = X(:, find(rand < D/D(end), 1));  % distance-weighted draw
    [~, labels] = max(bsxfun(@minus, 2*real(centroids'*X), dot(centroids, centroids, 1).'));
end

%% Standard K-means iterations
for iter = 1:max_iter
    for i = 1:k
        l = (labels == i);
        if any(l), centroids(:, i) = sum(X(:, l), 2) / sum(l); end  % class mean
    end
    [~, labels] = max(bsxfun(@minus, 2*real(centroids'*X), dot(centroids, centroids, 1).'), [], 1);
end
end

3.2 Dictionary learning based on K-means

In visual feature learning algorithms, dictionary learning is a key step, and the K-means algorithm can be regarded as a basic dictionary learning method.

First, 10000 image patches of size 6x6 are collected from the test grayscale images; then, following the recommendation of the literature, all patches are whitened; finally, the whitened patches are used as the sample set and clustered with K-means, so that the cluster centers can serve as the dictionary elements.

As shown in Figure 1-6, the learned dictionary elements resemble Gabor wavelets and can effectively capture the edge information of an image, so K-means is an effective dictionary learning method.
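A MATLAB sketch of this pipeline, reusing run_kmeans from the segmentation demo above; the image file name, the dictionary size of 256, and the ZCA-whitening regularizer 0.1 are assumed values for illustration:

% Sketch of K-means dictionary learning from image patches.
im = double(imread('lena.png')) / 255;            % a grayscale test image (assumed file)
p = 6; num_patches = 10000;
P = zeros(p*p, num_patches);
for t = 1:num_patches                             % sample random 6x6 patches
    r = randi(size(im, 1) - p + 1); c = randi(size(im, 2) - p + 1);
    patch = im(r:r+p-1, c:c+p-1);
    P(:, t) = patch(:);
end
P = bsxfun(@minus, P, mean(P, 1));                % remove the DC component per patch
% ZCA whitening of the patch set
C = P * P' / num_patches;
[V, D] = eig(C);
P = V * diag(1 ./ sqrt(diag(D) + 0.1)) * V' * P;
% cluster the whitened patches; the centers form the dictionary
[dict, ~] = run_kmeans(P, 256, 100);
% display the 256 dictionary elements as 6x6 tiles
figure;
for i = 1:256
    subplot(16, 16, i), imshow(reshape(dict(:, i), p, p), []);
end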

4. Characteristics of the Algorithm

The K-means clustering algorithm is one of the most classical machine learning methods. Its aim is to divide the input sample vectors into k groups and to find the cluster center of each group, so that an objective function measuring the dissimilarity (distance) between samples and their group centers is minimized.

K-means clustering is simple and fast. It assumes that the mean squared error is the best measure of group dispersion, so it clusters normally distributed data very well. It is applied in machine learning, data mining, pattern recognition, image analysis, and bioinformatics.

The performance of K-means depends on the initial positions of the cluster centers: convergence to the optimal solution cannot be guaranteed, and the algorithm is sensitive to outliers. One can use a front-end method to compute good initial cluster centers, or run the algorithm several times from different initializations and keep the best result.

Although the bisecting K-means clustering algorithm improves on K-means, both share the drawback that the value of k must be known in advance, and an unsuitable k may return poor results. For massive data, how to determine k is a problem that academia has long studied; common approaches include hierarchical clustering or cluster analysis with LDA.

Reference: the book Visual Machine Learning: 20 Lectures and its companion code.

