Kmeans Preliminary Study Summary

Last Update:2015-09-09 Source: Internet

Author: User


I have been using the k-means algorithm for a long time without really understanding how it works. Here are a few good links:

http://coolshell.cn/articles/7779.html

http://blog.csdn.net/zouxy09/article/details/9982495

http://www.360doc.com/content/13/1122/14/10724725_331295214.shtml

A basic example of calling the MATLAB kmeans function:

yangben = load('F:\iris.txt');        % yangben ("sample"): the iris data
s = size(yangben);
hang = s(1);                          % hang: number of rows
lie = s(2);                           % lie: number of columns
X = yangben(:, 1:4);                  % the first four columns are the features
opts = statset('Display', 'final');
k = 3;
[idx, ctrs] = kmeans(X, k, 'Distance', 'cityblock', 'Replicates', 5, 'Options', opts);
% plot the first two features, one color per cluster, centroids as black crosses
plot(X(idx==1,1), X(idx==1,2), 'r.', ...
     X(idx==2,1), X(idx==2,2), 'b.', ...
     X(idx==3,1), X(idx==3,2), 'g.', ...
     ctrs(:,1), ctrs(:,2), 'kx');

Overall the result is acceptable, likely because iris is a standard benchmark dataset.

Below is the help text for kmeans, posted here for later study. For now, being able to use the function is enough. With this out of the way, I can finally take a proper look at sparse coding for feature extraction.

help kmeans
KMEANS K-means clustering.
IDX = KMEANS(X, K) partitions the points in the N-by-P data matrix X
into K clusters. This partition minimizes the sum, over all clusters, of
the within-cluster sums of point-to-cluster-centroid distances. Rows of X
correspond to points, columns correspond to variables. Note: when X is a
vector, KMEANS treats it as an N-by-1 data matrix, regardless of its
orientation. KMEANS returns an N-by-1 vector IDX containing the cluster
indices of each point. By default, KMEANS uses squared Euclidean
distances.

KMEANS treats NaNs as missing data, and ignores any rows of X that
contain NaNs.

[IDX, C] = KMEANS(X, K) returns the K cluster centroid locations in
the K-by-P matrix C.

[IDX, C, SUMD] = KMEANS(X, K) returns the within-cluster sums of
point-to-centroid distances in the 1-by-K vector SUMD.

[IDX, C, SUMD, D] = KMEANS(X, K) returns distances from each point
to every centroid in the N-by-K matrix D.
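
As a rough illustration of how the four outputs relate, the sketch below clusters synthetic data (my own assumption, not part of the help text) and checks that SUMD can be rebuilt from D and IDX. It assumes the Statistics Toolbox is installed and that no row of X contains NaN.

```matlab
rng(0);                                  % fix the seed so the run is repeatable
X = [randn(20,2)+2; randn(20,2)-2];      % synthetic data: two well-separated groups
[idx, C, sumd, D] = kmeans(X, 2);

n = size(X, 1);
% distance from each point to its own centroid: row i, column idx(i) of D
own = D(sub2ind(size(D), (1:n)', idx));
% add these up per cluster; the result should match sumd up to rounding
check = accumarray(idx, own, [2 1]);
disp([sumd(:), check]);
```

With the default distance, the entries of D are squared Euclidean distances, so the two printed columns should agree.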

[ ... ] = KMEANS(..., 'PARAM1', val1, 'PARAM2', val2, ...) specifies
optional parameter name/value pairs to control the iterative algorithm
used by KMEANS. Parameters are:

'Distance' - Distance measure, in P-dimensional space, that KMEANS
should minimize with respect to. Choices are:
    'sqEuclidean' - Squared Euclidean distance (the default)
    'cityblock'   - Sum of absolute differences, a.k.a. L1 distance
    'cosine'      - One minus the cosine of the included angle
                    between points (treated as vectors)
    'correlation' - One minus the sample correlation between points
                    (treated as sequences of values)
    'Hamming'     - Percentage of bits that differ (only suitable
                    for binary data)
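
To make the first three choices concrete, here is a small hand computation for one point and one centroid. The vectors x and c are made-up values for illustration, and the formulas simply follow the descriptions above rather than the toolbox source.

```matlab
x = [1 2 3];                                  % a hypothetical data point
c = [2 0 3];                                  % a hypothetical centroid

d_sq   = sum((x - c).^2);                     % 'sqEuclidean': 1 + 4 + 0
d_city = sum(abs(x - c));                     % 'cityblock':   1 + 2 + 0
d_cos  = 1 - (x*c') / (norm(x)*norm(c));      % 'cosine': one minus cos of the angle

fprintf('sqEuclidean = %g, cityblock = %g, cosine = %g\n', d_sq, d_city, d_cos);
```

The choice of distance also changes what a centroid is: with 'sqEuclidean' it is the mean of the cluster, while with 'cityblock' it is the component-wise median.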

'Start' - Method used to choose initial cluster centroid positions,
sometimes known as "seeds". Choices are:
    'sample'  - Select K observations from X at random (the default)
    'uniform' - Select K points uniformly at random from the range
                of X. Not valid for Hamming distance.
    'cluster' - Perform preliminary clustering phase on random 10%
                subsample of X. This preliminary phase is itself
                initialized using 'sample'.
    matrix    - A K-by-P matrix of starting locations. In this case,
                you can pass in [] for K, and KMEANS infers K from
                the first dimension of the matrix. You can also
                supply a 3D array, implying a value for 'Replicates'
                from the array's third dimension.
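
The matrix form of 'Start' is handy when rough cluster locations are already known. A minimal sketch, assuming the Statistics Toolbox and using synthetic data and hand-picked seeds (both are my own assumptions):

```matlab
rng(0);                                      % repeatable synthetic data
X = [randn(20,2)+2; randn(20,2)-2];          % two groups around (2,2) and (-2,-2)
seeds = [2 2; -2 -2];                        % hypothetical starting centroids, one per row

% K is passed as [] and inferred from the number of rows in seeds
[idx, C] = kmeans(X, [], 'Start', seeds);
disp(C);                                     % final centroid locations
```

Because the starting centroids are fixed, repeated runs on the same X should give the same partition.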

'Replicates' - Number of times to repeat the clustering, each with a
new set of initial centroids. A positive integer, default is 1.

'EmptyAction' - Action to take if a cluster loses all of its member
observations. Choices are:
    'error'     - Treat an empty cluster as an error (the default)
    'drop'      - Remove any clusters that become empty, and set
                  the corresponding values in C and D to NaN.
    'singleton' - Create a new cluster consisting of the one
                  observation furthest from its centroid.

'Options' - Options for the iterative algorithm used to minimize the
fitting criterion, as created by STATSET. Choices of STATSET
parameters are:

    'Display' - Level of display output. Choices are 'off' (the
                default), 'iter', and 'final'.
    'MaxIter' - Maximum number of iterations allowed. Default is 100.

'OnlinePhase' - Flag indicating whether KMEANS should perform an "on-line
update" phase in addition to a "batch update" phase. The on-line phase
can be time consuming for large data sets, but guarantees a solution
that is a local minimum of the distance criterion, i.e., a partition of
the data where moving any single point to a different cluster increases
the total sum of distances. 'on' (the default) or 'off'.

Example:

X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
opts = statset('Display', 'final');
[cidx, ctrs] = kmeans(X, 2, 'Distance', 'city', ...
                      'Replicates', 5, 'Options', opts);
plot(X(cidx==1,1), X(cidx==1,2), 'r.', ...
     X(cidx==2,1), X(cidx==2,2), 'b.', ctrs(:,1), ctrs(:,2), 'kx');
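
To get a feel for what the "batch update" phase mentioned in the help text is doing, it can be sketched by hand without the toolbox. This is only a simplified illustration under my own assumptions (synthetic data, squared Euclidean distance, a 'sample'-style start, no on-line phase, no replicates), not the actual kmeans implementation.

```matlab
rng(0);                                     % repeatable run
X = [randn(20,2)+2; randn(20,2)-2];         % synthetic data, two groups
K = 2;
n = size(X, 1);

perm = randperm(n);
C = X(perm(1:K), :);                        % start from K randomly chosen observations
idx = zeros(n, 1);

for iter = 1:100                            % cap on iterations, mirroring 'MaxIter'
    % assignment step: squared Euclidean distance from every point to every centroid
    D = zeros(n, K);
    for j = 1:K
        delta = X - repmat(C(j,:), n, 1);
        D(:, j) = sum(delta.^2, 2);
    end
    [~, newIdx] = min(D, [], 2);            % nearest centroid for each point

    if isequal(newIdx, idx)                 % no assignment changed: converged
        break;
    end
    idx = newIdx;

    % update step: move each centroid to the mean of its members
    for j = 1:K
        if any(idx == j)
            C(j, :) = mean(X(idx == j, :), 1);
        end
    end
end
disp(C);
```

Each pass alternates between assigning points to the nearest centroid and recomputing the centroids, and it stops once the assignments no longer change.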
